Similar articles
 20 similar articles found (search time: 15 ms)
1.
Searching a large document collection for material that satisfies a user's information need is a critical activity for web search engines. Query expansion (QE) techniques are widely used by search engines to disambiguate the user's information need and to improve information retrieval (IR) performance. Knowledge-based, corpus-based, and relevance feedback techniques are the main QE approaches. They expand the user query with synonyms of the search terms (word synonymy) to retrieve more relevant documents, and they filter out documents that contain the search terms but with a meaning different from the one the user intended (the word polysemy problem). This work surveys existing query expansion techniques, highlights their strengths and limitations, and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that applying knowledge-based or corpus-based query expansion techniques to the results of the relevance feedback step improves information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives on all datasets.

2.
A comparative study of two types of patent retrieval tasks, technology survey and invalidity search, using the NTCIR-3 and -4 test collections is described, with a focus on pseudo-feedback effectiveness and different retrieval models. Invalidity searches are peculiar to patent retrieval tasks and feature small numbers of relevant documents and long queries. Effectiveness behaves differently depending on the retrieval model applied and on whether pseudo-feedback is used. These differences are analyzed in terms of the “weak cluster hypothesis”, i.e., terminological cohesiveness through relevant documents.

3.
In this paper, we compare collocation-based similarity measures (Jaccard, Dice, and Cosine) for selecting additional search terms in query expansion. We also consider two further similarity measures: average conditional probability (ACP) and normalized mutual information (NMI). ACP is the mean of the two conditional probabilities between a query term and an additional search term. NMI is a normalized value of the two terms' mutual information. All of these similarity measures are functions of the two terms' frequencies and their collocation frequency, but they differ in how they measure similarity. The selected measure changes the order of the additional search terms and their weights, and hence has a strong influence on retrieval performance. In our query expansion experiments with these five similarity measures, the additional search terms selected by the Jaccard, Dice, and Cosine measures include more frequent terms with lower similarity values than those selected by ACP or NMI. In the overall assessment of query expansion, the Jaccard, Dice, and Cosine measures are better than ACP and NMI in terms of retrieval effectiveness, whereas NMI and ACP are better in terms of execution efficiency.
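
The abstract above does not give formulas, so the following is only a minimal sketch using the textbook count-based definitions of Jaccard, Dice, and Cosine, with ACP taken literally as the mean of the two conditional probabilities. The function name, the argument names, and the reading of "frequency" as a plain count are assumptions. NMI is left out because the abstract does not state its normalization.

```python
import math

def collocation_similarities(freq_x, freq_y, freq_xy):
    """Similarity of two terms from their frequencies and collocation frequency.

    freq_x, freq_y: how often each term occurs (assumed to be plain counts > 0).
    freq_xy: how often the two terms occur together (the collocation frequency).
    """
    jaccard = freq_xy / (freq_x + freq_y - freq_xy)
    dice = 2 * freq_xy / (freq_x + freq_y)
    cosine = freq_xy / math.sqrt(freq_x * freq_y)
    # ACP: mean of P(y | x) and P(x | y), each estimated from the counts.
    acp = 0.5 * (freq_xy / freq_x + freq_xy / freq_y)
    return {"jaccard": jaccard, "dice": dice, "cosine": cosine, "acp": acp}

# Hypothetical counts: a frequent query term and a rarer candidate expansion term.
scores = collocation_similarities(freq_x=40, freq_y=10, freq_xy=8)
print(scores)
```

With these hypothetical counts the rarer candidate term scores higher under ACP than under Jaccard or Dice, because P(x | y) is large when almost every occurrence of the rare term is a collocation.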

4.
The principle of polyrepresentation offers a theoretical framework for handling multiple contexts in information retrieval (IR). This paper presents an empirical laboratory study of polyrepresentation in restricted mode of the information space, with a focus on inter- and intra-document features. The Cystic Fibrosis test collection indexed in the best match system InQuery constitutes the experimental setting. Overlaps between five functionally and/or cognitively different document representations are identified. Supporting the principle of polyrepresentation, the results show that, in general, overlaps generated by three or four representations of different nature have higher precision than those generated from two representations or from single fields. This result holds for both structured and unstructured query modes in best match retrieval, although the unstructured query mode demonstrates higher performance. The retrieval overlaps containing search keys from the bibliographic references provide the best retrieval performance, and minor MeSH terms the worst. It is concluded that a highly structured query language is necessary when implementing the principle of polyrepresentation in a best match IR system because the principle is inherently Boolean. Finally, a re-ranking test shows promising results when search results are re-ranked according to the precision obtained in the overlaps, whilst re-ranking by citations seems less useful when integrated into polyrepresentative applications.

5.
Modern information retrieval systems are designed to supply relevant information in response to requests received from the user population. In most retrieval environments the search requests consist of keywords, or index terms, interrelated by appropriate Boolean operators. Since it is difficult for untrained users to generate effective Boolean search requests, trained search intermediaries are normally used to translate original statements of user need into useful Boolean search formulations. This study introduces methods that reduce the role of the search intermediaries by making it possible to generate Boolean search formulations completely automatically from natural language statements provided by the system patrons. Frequency considerations are used to generate, automatically, appropriate term combinations as well as the Boolean connectives relating the terms. Methods are covered for producing automatic query formulations both in a standard Boolean logic system and in an extended Boolean system in which the strict interpretation of the connectives is relaxed. Experimental results are supplied to evaluate the effectiveness of the automatic query formulation process, and methods are described for applying the process in practice.

6.
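This paper discusses techniques for sci-tech novelty search in university libraries, with worked examples, covering the extraction of novelty points, the analysis of search terms, the formulation of search strategies, and the writing of conclusions. It is intended as a reference for novelty search staff and information retrieval practitioners.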

7.
The National Library of Medicine has offered TOXLINE, an online interactive bibliographic database of biomedical (toxicology) information, since 1972. Files from 11 secondary sources comprise the TOXLINE database. The sources supplied bibliographic records in different formats and data structures, so data from each supplier's format had to be converted into a format suitable for TOXLINE. Three different, successive retrieval systems were used for the TOXLINE database, each of which required reformatting of the data. Algorithms for generating terms for inverted file search methods were tested. Special characters peculiar to the scientific literature were evaluated during search term generation. Developing search term algorithms for chemical names in the scientific literature required techniques different from those used for nonscientific literature. Problems with replication of bibliographic records from multiple secondary sources are described. Some observations about online interactive databases since TOXLINE was first offered are noted.

8.
We present PubSearch, a hybrid heuristic scheme for re-ranking academic papers retrieved from standard digital libraries such as the ACM Portal. The scheme is based on the hierarchical combination of a custom implementation of the term frequency heuristic, a time-depreciated citation score, and a graph-theoretic computed score that relates the paper’s index terms with each other. We designed and developed a meta-search engine that submits user queries to standard digital repositories of academic publications and re-ranks the repository results using the hierarchical heuristic scheme. We evaluate our proposed re-ranking scheme via user feedback against the results of ACM Portal on a total of 58 different user queries submitted by 15 different users. The results show that our proposed scheme significantly outperforms ACM Portal in terms of retrieval precision as measured by the most common metrics in Information Retrieval, including Normalized Discounted Cumulative Gain (NDCG) and Expected Reciprocal Rank (ERR), as well as by a newly introduced lexicographic rule (LEX) for ranking search results. In particular, PubSearch outperforms ACM Portal by more than 77% in terms of ERR, by more than 11% in terms of NDCG, and by more than 907.5% in terms of LEX. We also re-rank the top-10 results produced by Google Scholar, Microsoft Academic Search, and ArnetMiner for a subset of the original 58 user queries; the results show that PubSearch compares very well against these search engines as well. The proposed scheme can be easily plugged into any existing search engine for retrieval of academic publications.
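
The abstract above names NDCG and ERR without saying which variants were used, so the sketch below assumes common formulations: exponential gain with a log2 rank discount for NDCG, and the cascade-model definition for ERR. The function names are illustrative, the ideal ranking is taken over the retrieved list only, and LEX is not shown because the abstract does not define it.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the same gains sorted into the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def err(gains, max_grade):
    """Expected reciprocal rank under the cascade user model.

    A result with grade g satisfies the user with probability
    (2**g - 1) / 2**max_grade; the user stops at the first satisfying result.
    """
    p_reach = 1.0  # probability that the user reaches the current rank
    total = 0.0
    for rank, g in enumerate(gains, start=1):
        p_satisfied = (2 ** g - 1) / 2 ** max_grade
        total += p_reach * p_satisfied / rank
        p_reach *= 1 - p_satisfied
    return total

# Hypothetical graded judgments (0 = not relevant, 3 = highly relevant).
ranking = [1, 3, 0, 2]
print(ndcg(ranking), err(ranking, max_grade=3))
```

Both metrics reward placing highly relevant results early, but ERR additionally discounts a result when the results above it were already likely to satisfy the user.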

9.
An experimental computer intermediary system, CONIT, that assists users in accessing and searching heterogeneous retrieval systems has been enhanced with various search aids. Controlled experiments have been conducted to compare the effectiveness of the enhanced CONIT intermediary with that of human expert intermediary search specialists. Some 16 end users, none of whom had previously operated either CONIT or any of the four connected retrieval systems, performed searches on 20 different topics using CONIT with no assistance other than that provided by CONIT itself (except to recover from computer/software bugs). These same users also performed searches on the same topics with the help of human expert intermediaries who searched using the retrieval systems directly. Sometimes CONIT and sometimes the human expert was clearly superior in terms of such parameters as recall and search time. In general, however, users searching alone with CONIT achieved somewhat higher online recall at the expense of longer session times. We conclude that advanced experimental intermediary techniques are now capable of providing search assistance whose effectiveness at least approximates that of human intermediaries in some contexts. The cost effectiveness of current intermediary systems is also analyzed. Finally, consideration is given to the prospects for much more advanced systems that would perform such functions as automatic database selection and the simulation of human experts, and thereby make information retrieval more effective for all classes of users.

10.
Search patterns of documents and information requests represent them only more or less accurately, so it is important to investigate the possibility of designing self-learning information retrieval systems. Another important question is how to organize the set of document search patterns so that the information system responds to a given information request within an acceptable time. The self-learning process of the proposed information system consists in determining, on a set of document and information request search patterns, the similarity relation according to L. A. Zadeh. The organization of the set of document search patterns proposed in the paper limits the search, when retrieving a response to a given information request, to one (or several) of the previously determined subsets. This makes the information system response time acceptable. The proposed information retrieval strategy is discussed in terms of fuzzy sets.

11.
A growing body of research is beginning to explore the information-seeking behavior of Web users. The vast majority of these studies have concentrated on textual information retrieval (IR). Little research has examined how people search for non-textual information on the Internet, and few large-scale studies have investigated visual information-seeking behavior with general-purpose Web search engines. This study examined visual information needs as expressed in users’ Web image queries. The data set examined consisted of 1,025,908 sequential queries from 211,058 users of Excite, a major Internet search service. Twenty-eight terms were used to identify queries for both still and moving images, resulting in a subset of 33,149 image queries by 9,855 users. We provide data on: (1) image queries – the number of queries and the number of search terms per user, (2) image search sessions – the number of queries per user and the modifications made to subsequent queries in a session, and (3) image terms – their rank/frequency distribution and the most highly used search terms. On average, there were 3.36 image queries per user, containing an average of 3.74 terms per query. Image queries contained a large number of unique terms. The most frequently occurring image-related terms appeared less than 10% of the time, with most terms occurring only once. We contrast this with earlier work by P.G.B. Enser, Journal of Documentation 51 (2) (1995) 126–170, who examined written queries for pictorial information in a non-digital environment. Implications for the development of models for visual information retrieval and for the design of Web search engines are discussed.

12.
This paper examines a real-time measure of bias in Web search engines. The measure captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query. This ideal is approximated by the distribution produced by a collection of search engines. Differences between bias and classical retrieval measures are highlighted by examining the possibilities for bias in four extreme cases of recall and precision. The results of experiments examining the influence on bias measurement of subject domains, search engines, and search terms are presented. Three general conclusions are drawn: (1) the performance of search engines can be distinguished with the aid of the bias measure; (2) bias values depend on the subject matter under consideration; (3) choice of search terms does not account for much of the variance in bias values. These conclusions underscore the need to develop “bias profiles” for search engines.

13.
Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because they use different terminology to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that, while considering semantic information alone might not improve retrieval performance over keyword-based search, it enables the retrieval of relevant documents that cannot be retrieved by keyword-based methods. Building indices that store and provide access to semantic information during the retrieval process is therefore important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed building either one unified index encompassing both textual and semantic information or separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types, and documents within the same embedding space to facilitate the development of a unified search index consisting of these four information types. We perform experiments on standard and widely used document collections, including Clueweb09-B and Robust04, to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, with reduced index size and query processing time, and (b) while retrieval efficiency, the main objective of an efficient indexing mechanism, improves with our proposed method, retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

14.
Considerable evidence exists to show that the use of term relevance weights is beneficial in interactive information retrieval. Various term weighting systems are reviewed. An experiment is then described in which information retrieval users are asked to rank query terms in decreasing order of presumed importance prior to actual search and retrieval. The experimental design is examined, and various relevance ranking systems are evaluated, including fully automatic systems based on inverse document frequency parameters, human rankings performed by the user population, and combinations of the two.
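
As an illustration of the fully automatic ranking mentioned above, the sketch below orders query terms by inverse document frequency, assuming the common form idf(t) = log(N / df(t)). The abstract does not say which IDF variant the study used, and the function name, the example terms, and the counts are hypothetical.

```python
import math

def rank_terms_by_idf(query_terms, doc_freq, n_docs):
    """Order query terms from most to least discriminating.

    doc_freq maps each term to the number of documents containing it
    (assumed to be at least 1); n_docs is the collection size.
    """
    def idf(term):
        return math.log(n_docs / doc_freq[term])
    return sorted(query_terms, key=idf, reverse=True)

# Hypothetical collection of 1000 documents.
doc_freq = {"the": 1000, "retrieval": 50, "toxline": 2}
print(rank_terms_by_idf(["the", "retrieval", "toxline"], doc_freq, n_docs=1000))
```

A term that occurs in every document gets an IDF of zero and therefore lands at the bottom of the ranking, which matches the intuition that it carries no power to discriminate between documents.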

15.
16.
One of the major problems in information retrieval is the formulation of queries by the user. This entails specifying a set of words or terms that express the user's information need. However, it is well known that two people can use different terms to refer to the same concepts. The techniques that attempt to reduce this problem generally start from a first search and then study how the initial query can be modified to obtain better results. In general, the construction of the new query involves expanding the terms of the initial query and recalculating the importance of each term in the expanded query. Depending on the technique used to formulate the new query, several strategies are distinguished. These strategies are based on the idea that if two terms are similar (with respect to any criterion), the documents in which both terms appear frequently will also be related. The technique we used in this study is known as query expansion using similarity thesauri.

17.
The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces the retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These non-valid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters-stemmer is used in the text analysis pipeline. This difference emphasizes the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results compared with normalizing tokens to remove diacritics.
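
The ASCIIfication step described above can be sketched as a plain character mapping. This is a minimal illustration, not the authors' pipeline: the function name and the exact character table are assumptions, and the deASCIIfication direction is not shown because restoring diacritics needs a context-aware model rather than a lookup table.

```python
# Turkish letters with diacritics (plus dotless i) and their ASCII counterparts.
_TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def asciify(token):
    """Replace Turkish-specific letters with their closest ASCII letters."""
    return token.translate(_TURKISH_TO_ASCII)

# "öldü" (died) collapses onto "oldu" (happened): a valid word, different meaning.
print(asciify("öldü"))
print(asciify("İstanbul"))
```

The first example shows the homograph problem the abstract describes: after normalization a query for one word also matches documents about the other.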

18.
Phase retrieval recovers signals from linear phaseless measurements by minimizing a quadratic or amplitude-based loss function, which is generally either non-convex or non-smooth. Existing methods address the non-smoothness by adding a truncation procedure or reweighting to the gradient during gradient descent. However, these methods often cause inconsistency in the search direction and increase the sampling complexity. This paper proposes a smoothed amplitude flow-based phase retrieval (SAFPR) algorithm to solve these problems. Introducing a smoothing function into the phase retrieval problem smooths the loss function and significantly reduces the sampling complexity. We also develop a stochastic smooth amplitude flow-based phase retrieval (SSAF) algorithm that is practical, scalable, and fast in large-scale applications. Experimental results show that, for both SAFPR and SSAF, the number of measurements required to fully reconstruct the signal is lower than that required by the most advanced existing phase retrieval algorithms. The proposed methods also perform well in terms of time cost and convergence rate.
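
For orientation, the two loss families mentioned above are commonly written as follows. The notation is assumed rather than taken from the paper: $x$ is the unknown signal, $a_i$ are the sensing vectors, $y_i = |a_i^{\top} x|$ are the $m$ phaseless measurements, and constant factors vary between papers. The smoothing function used by SAFPR is not specified in the abstract and is not reproduced here.

```latex
% Intensity-based ("quadratic") loss: smooth but non-convex.
\[
  f_{\mathrm{int}}(z) = \frac{1}{2m} \sum_{i=1}^{m}
    \bigl( |a_i^{\top} z|^{2} - y_i^{2} \bigr)^{2}
\]
% Amplitude-based loss: non-smooth wherever a_i^T z = 0.
\[
  f_{\mathrm{amp}}(z) = \frac{1}{2m} \sum_{i=1}^{m}
    \bigl( |a_i^{\top} z| - y_i \bigr)^{2}
\]
```

The absolute value inside the amplitude-based loss is what makes its gradient undefined at $a_i^{\top} z = 0$, which is the non-smoothness that truncation, reweighting, or smoothing is meant to handle.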

19.
Both general and domain-specific search engines have adopted query suggestion techniques to help users formulate effective queries. In the specific domain of literature search (e.g., finding academic papers), the initial queries are usually based on a draft paper or abstract, rather than short lists of keywords. In this paper, we investigate phrasal-concept query suggestions for literature search. These suggestions explicitly specify important phrasal concepts related to an initial detailed query. The merits of phrasal-concept query suggestions for this domain are their readability and retrieval effectiveness: (1) phrasal concepts are natural for academic authors because of their frequent use of terminology and subject-specific phrases and (2) academic papers describe their key ideas via these subject-specific phrases, and thus phrasal concepts can be used effectively to find those papers. We propose a novel phrasal-concept query suggestion technique that generates queries by identifying key phrasal-concepts from pseudo-labeled documents and combines them with related phrases. Our proposed technique is evaluated in terms of both user preference and retrieval effectiveness. We conduct user experiments to verify a preference for our approach, in comparison to baseline query suggestion methods, and demonstrate the effectiveness of the technique with retrieval experiments.

20.
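范哲 (Fan Zhe). 《现代情报》 (Journal of Modern Information), 2007, 27(11): 221-225
This paper introduces six Chinese and international commercial patent search tools: Aureka, Delphimt, Derwent Innovations Index (德温特创新索引), the 东方灵盾 Chinese and foreign patent search and analysis platform, the 北京彼速 patent search engine, and the 保定大为 PatentEX patent information innovation platform. It compares and analyzes the six tools in detail in terms of coverage, search functions, and search results.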
