首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Indexing quality determines whether the information content of an indexed document is accurately represented. Indexing effectiveness measures whether an indexed document is correctly retrieved every time it is relevant to a query. Measurement of these criteria is cumbersome and costly; data base producers therefore prefer inter-indexer consistency as a measure of indexing quality or effectiveness. The present article assesses the validity of this substitution in various environments.  相似文献   

2.
Ontologies are frequently used in information retrieval being their main applications the expansion of queries, semantic indexing of documents and the organization of search results. Ontologies provide lexical items, allow conceptual normalization and provide different types of relations. However, the optimization of an ontology to perform information retrieval tasks is still unclear. In this paper, we use an ontology query model to analyze the usefulness of ontologies in effectively performing document searches. Moreover, we propose an algorithm to refine ontologies for information retrieval tasks with preliminary positive results.  相似文献   

3.
一种基于本体的语义标引方法   总被引:4,自引:0,他引:4  
传统的采用主题词和关键词对文档进行标引的方法,由于不能提供语义推理而越来越不适合目前的网络环境。由于本体具有良好的概念层次结构和对逻辑推理的支持,在信息检索领域将有很大的应用价值。本文首先介绍本体的基本概念和领域本体的组成部分,然后提出了一种基于领域本体的语义标引方法,采用本体中的概念对文档进行语义层面的标引,为检索的智能推理提供基础。  相似文献   

4.
5.
Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.  相似文献   

6.
7.
李江华  时鹏 《情报杂志》2012,31(4):112-116
Internet已成为全球最丰富的数据源,数据类型繁杂且动态变化,如何从中快速准确地检索出用户所需要的信息是一个亟待解决的问题.传统的搜索引擎基于语法的方式进行搜索,缺乏语义信息,难以准确地表达用户的查询需求和被检索对象的文档语义,致使查准率和查全率较低且搜索范围有限.本文对现有的语义检索方法进行了研究,分析了其中存在的问题,在此基础上提出了一种基于领域的语义搜索引擎模型,结合语义Web技术,使用领域本体元数据模型对用户的查询进行语义化规范,依据领域本体模式抽取文档中的知识并RDF化,准确地表达了用户的查询语义和作为被查询对象的文档语义,可以大大提高检索的准确性和检索效率,详细地给出了模型的体系结构、基本功能和工作原理.  相似文献   

8.
Pseudo-relevance feedback (PRF) is a classical technique to improve search engine retrieval effectiveness, by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, in the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set using the external corpus. However, such external expansion approaches have only been studied for sparse (BoW) retrieval methods, and its effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search based on encoded contextualised query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In particular, in this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e. evaluation on corpora without further training. Zero-shot retrieval experiments with six datasets, including two TREC datasets and four BEIR datasets, when applying the MSMARCO passage collection as external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for the sparse retrieval (by upto 28%) and the dense retrieval (by upto 12%). In addition, using ANCE on the external corpus brings upto 30% NDCG@10 improvements for the sparse retrieval and upto 29% for the dense retrieval.  相似文献   

9.
This paper proposes a method to improve retrieval performance of the vector space model (VSM) in part by utilizing user-supplied information of those documents that are relevant to the query in question. In addition to the user's relevance feedback information, information such as original document similarities is incorporated into the retrieval model, which is built by using a sequence of linear transformations. High-dimensional and sparse vectors are then reduced by singular value decomposition (SVD) and transformed into a low-dimensional vector space, namely the space representing the latent semantic meanings of words. The method has been tested with two test collections, the Medline collection and the Cranfield collection. In order to train the model, multiple partitions are created for each collection. Improvement of average precision of the averages over all partitions, compared with the latent semantic indexing (LSI) model, are 20.57% (Medline) and 22.23% (Cranfield) for the two training data sets, and 0.47% (Medline) and 4.78% (Cranfield) for the test data, respectively. The proposed method provides an approach that makes it possible to preserve user-supplied relevance information for the long term in the system in order to use it later.  相似文献   

10.
Latent semantic indexing (LSI) has been demonstrated to outperform lexical matching in information retrieval. However, the enormous cost associated with the singular value decomposition (SVD) of the large term-by-document matrix becomes a barrier for its application to scalable information retrieval. This work shows that information filtering using level search techniques can reduce the SVD computation cost for LSI. For each query, level search extracts a much smaller subset of the original term-by-document matrix, containing on average 27% of the original non-zero entries. When LSI is applied to such subsets, the average precision can degrade by as much as 23% due to level search filtering. However, for some document collections an increase in precision has also been observed. Further enhancement of level search can be based on a pruning scheme which deletes terms connected to only one document from the query-specific submatrix. Such pruning has achieved a 65% reduction (on average) in the number of non-zeros with a precision loss of 5% for most collections.  相似文献   

11.
The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in the inverted file compression by replacing document identifiers with usually much smaller gap values. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smoothen and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to the similarity to reassign closer identifiers to the documents having closer relationships. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted file with prefix-free codes can be improved by 15%.  相似文献   

12.
Searching for relevant material that satisfies the information need of a user, within a large document collection is a critical activity for web search engines. Query Expansion techniques are widely used by search engines for the disambiguation of user’s information need and for improving the information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback, are the main QE techniques, that employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to bring more relevant documents and for filtering documents that contain search terms but with a different meaning (also known as word polysemy problem) than the user intended. This work, surveys existing query expansion techniques, highlights their strengths and limitations and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that the application of knowledge or corpus-based query expansion techniques on the results of the relevance feedback step improves the information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.  相似文献   

13.
14.
An integrated information retrieval system generally contains multiple databases that are inconsistent in terms of their content and indexing. This paper proposes a rough set-based transfer (RST) model for integration of the concepts of document databases using various indexing languages, so that users can search through the multiple databases using any of the current indexing languages. The RST model aims to effectively create meaningful transfer relations between the terms of two indexing languages, provided a number of documents are indexed with them in parallel. In our experiment, the indexing concepts of two databases respectively using the Thesaurus of Social Science (IZ) and the Schlagwortnormdatei (SWD) are integrated by means of the RST model. Finally, this paper compares the results achieved with a cross-concordance method, a conditional probability based method and the RST model.  相似文献   

15.
With the advent of various services and applications of Semantic Web, semantic annotation has emerged as an important research topic. The application of semantically annotated ontology had been evident in numerous information processing and retrieval tasks. One of such tasks is utilizing the semantically annotated ontology in product design which is able to suggest many important applications that are critical to aid various design related tasks. However, ontology development in design engineering remains a time consuming and tedious task that demands considerable human efforts. In the context of product family design, management of different product information that features efficient indexing, update, navigation, search and retrieval across product families is both desirable and challenging. For instance, an efficient way of retrieving timely information on product family can be useful for tasks such as product family redesign and new product variant derivation when requirements change. However, the current research and application of information search and navigation in product family is mostly limited to its structural aspect which is insufficient to handle advanced information search especially when the query targets at multiple aspects of a product. This paper attempts to address this problem by proposing an information search and retrieval framework based on the semantically annotated multi-facet product family ontology. Particularly, we propose a document profile (DP) model to suggest semantic tags for annotation purpose. Using a case study of digital camera families, we illustrate how the faceted search and retrieval of product information can be accomplished. We also exemplify how we can derive new product variants based on the designer’s query of requirements via the faceted search and retrieval of product family information. Lastly, in order to highlight the value of our current work, we briefly discuss some further research and applications in design decision support, e.g. commonality analysis and variety comparison, based on the semantically annotated multi-facet product family ontology.  相似文献   

16.
Pseudo-relevance feedback (PRF) is a well-known method for addressing the mismatch between query intention and query representation. Most current PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents, thus possibly leading to a semantic gap between query representation and document representation. In this work, a PRF framework that combines relevance matching and semantic matching is proposed to improve the quality of feedback documents. Specifically, in the first round of retrieval, we propose a reranking mechanism in which the information of the exact terms and the semantic similarity between the query and document representations are calculated by bidirectional encoder representations from transformers (BERT); this mechanism reduces the text semantic gap by using the semantic information and improves the quality of feedback documents. Then, our proposed PRF framework is constructed to process the results of the first round of retrieval by using probability-based PRF methods and language-model-based PRF methods. Finally, we conduct extensive experiments on four Text Retrieval Conference (TREC) datasets. The results show that the proposed models outperform the robust baseline models in terms of the mean average precision (MAP) and precision P at position 10 (P@10), and the results also highlight that using the combined relevance matching and semantic matching method is more effective than using relevance matching or semantic matching alone in terms of improving the quality of feedback documents.  相似文献   

17.
A theory of indexing helps explain the nature of indexing, the structure of the vocabulary, and the quality of the index. Indexing theories formulated by Jonker, Heilprin, Landry and Salton are described. Each formulation has a different focus. Jonker, by means of the Terminological and Connective Continua, provided a basis for understanding the relationships between the size of the vocabulary, the hierarchical organization, and the specificity by which concepts can be described. Heilprin introduced the idea of a search path which leads from query to document. He also added a third dimension to Jonker's model; the three variables are diffuseness, permutivity and hierarchical connectedness. Landry made an ambitious and well conceived attempt to build a comprehensive theory of indexing predicated upon sets of documents, sets of attributes, and sets of relationships between the two. It is expressed in theorems and by formal notation. Salton provided both a notational definition of indexing and procedures for improving the ability of index terms to discriminate between relevant and nonrelevant documents. These separate theories need to be tested experimentally and eventually combined into a unified comprehensive theory of indexing.  相似文献   

18.
19.
Although there has been a great deal of research into Collaborative Information Retrieval (CIR) and Collaborative Information Seeking (CIS), the majority has assumed that team members have the same level of unrestricted access to underlying information. However, observations from different domains (e.g. healthcare, business, etc.) have suggested that collaboration sometimes involves people with differing levels of access to underlying information. This type of scenario has been referred to as Multi-Level Collaborative Information Retrieval (MLCIR). To the best of our knowledge, no studies have been conducted to investigate the effect of awareness, an existing CIR/CIS concept, on MLCIR. To address this gap in current knowledge, we conducted two separate user studies using a total of 5 different collaborative search interfaces and 3 information access scenarios. A number of Information Retrieval (IR), CIS and CIR evaluation metrics, as well as questionnaires were used to compare the interfaces. Design interviews were also conducted after evaluations to obtain qualitative feedback from participants. Results suggested that query properties such as time spent on query, query popularity and query effectiveness could allow users to obtain information about team's search performance and implicitly suggest better queries without disclosing sensitive data. Besides, having access to a history of intersecting viewed, relevant and bookmarked documents could provide similar positive effect as query properties. Also, it was found that being able to easily identify different team members and their actions is important for users in MLCIR. Based on our findings, we provide important design recommendations to help develop new CIR and MLCIR interfaces.  相似文献   

20.
To obtain high performances, previous works on FAQ retrieval used high-level knowledge bases or handcrafted rules. However, it is a time and effort consuming job to construct these knowledge bases and rules whenever application domains are changed. To overcome this problem, we propose a high-performance FAQ retrieval system only using users’ query logs as knowledge sources. During indexing time, the proposed system efficiently clusters users’ query logs using classification techniques based on latent semantic analysis. During retrieval time, the proposed system smoothes FAQs using the query log clusters. In the experiment, the proposed system outperformed the conventional information retrieval systems in FAQ retrieval. Based on various experiments, we found that the proposed system could alleviate critical lexical disagreement problems in short document retrieval. In addition, we believe that the proposed system is more practical and reliable than the previous FAQ retrieval systems because it uses only data-driven methods without high-level knowledge sources.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号