Similar literature
20 similar records found
1.
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper aims to mine the subtopics of a query, either indirectly from the results returned by retrieval systems or directly from the query itself, in order to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of the clusters is investigated. In addition, labeling each returned document with topic categories and concept tags is explored. For the direct subtopic mining approach, several external resources are consulted, such as Wikipedia, the Open Directory Project, search query logs, and the related search services of search engines. Furthermore, we propose a diversified retrieval model that ranks documents with respect to the mined subtopics to balance relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in both tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on TREC09, and α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on TREC10. The results indicate that subtopic mining with up-to-date user search query logs is the most effective way to generate the subtopics of a query, and that the proposed subtopic-based diversification algorithm can select documents covering various subtopics.

2.
The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D♯ evaluation framework used at NTCIR provides more “intuitive” and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D♯-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.

3.
Search result diversification aims to diversify search results to cover different query subtopics, i.e., pieces of relevant information. State-of-the-art diversification methods often explicitly model diversity based on query subtopics, and their performance is closely related to the quality of the subtopics. Most existing studies extract query subtopics only from unstructured data such as document collections. However, a huge amount of information is available in structured data, which complements the information from unstructured data. Structured data can provide valuable domain knowledge, but is currently under-utilized. In this article, we study how to leverage the integrated information from both structured and unstructured data to extract high-quality subtopics for search result diversification. We first discuss how to extract subtopics from structured data. We then propose three methods to integrate structured and unstructured data. Specifically, the first method uses the structured data to guide the subtopic extraction from unstructured data, the second uses the unstructured data to guide the extraction, and the last first extracts the subtopics separately from the two data sources and then combines them. Experimental results in both Enterprise and Web search domains show that the proposed methods are effective in extracting high-quality subtopics from the integrated information, which can lead to better diversification performance.

4.
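[Purpose/Significance] To reveal the characteristics of query formulation behavior among mobile library users and to offer suggestions for improving the search function of mobile libraries. [Method/Process] Using system log mining on one month of user logs from a university mobile library, we apply statistical analysis with indicators such as mutual information, query diversity, query richness, subject distribution, and duration to examine the relatedness of users' queries, their query reformulation patterns, and their query topics. [Results/Conclusions] The mutual information of mobile library users' queries is generally low, i.e., the queries are weakly related in content. The repetition pattern and the linear pattern are the most common reformulation patterns, i.e., mobile library users repeatedly search for the same query. Users' search interests concentrate on the humanities and social sciences, and their searching on queries of the same topic persists over time. We recommend adding query recommendation, automatic error correction, and advanced search functions to improve the recall and precision of the mobile library retrieval service.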

5.
A useful ability for search engines is to be able to rank objects with novelty and diversity: the top k documents retrieved should cover possible intents of a query with some distribution, or should contain a diverse set of subtopics related to the user’s information need, or contain nuggets of information with little redundancy. Evaluation measures have been introduced to measure the effectiveness of systems at this task, but these measures have worst-case NP-hard computation time. The primary consequence of this is that there is no ranking principle akin to the Probability Ranking Principle for document relevance that provides uniform instruction on how to rank documents for novelty and diversity. We use simulation to investigate the practical implications of this for optimization and evaluation of retrieval systems.
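
As an illustration of the kind of measure this abstract refers to, the following is a minimal sketch of α-DCG, one widely used novelty-and-diversity measure. It is not taken from the paper; the function name, the `judgments` data, and the default `alpha` are assumptions made for the example. The sketch computes only the unnormalized score. Turning it into α-nDCG requires dividing by the score of an ideal ranking, and finding that ideal ranking is the step that is hard in the worst case.

```python
import math

def alpha_dcg(ranking, subtopics_of, alpha=0.5, k=10):
    """Unnormalized alpha-DCG@k.

    Each subtopic a document covers contributes (1 - alpha) ** c, where c is
    the number of higher-ranked documents that already covered that subtopic.
    """
    seen = {}  # subtopic -> number of earlier documents covering it
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = 0.0
        for topic in subtopics_of.get(doc, ()):
            gain += (1.0 - alpha) ** seen.get(topic, 0)
            seen[topic] = seen.get(topic, 0) + 1
        score += gain / math.log2(rank + 1)  # standard DCG rank discount
    return score

# Hypothetical judgments: the subtopics each document covers.
judgments = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
redundant = alpha_dcg(["d1", "d2", "d3"], judgments)  # repeats subtopic "a" early
diverse = alpha_dcg(["d1", "d3", "d2"], judgments)    # covers "a" then "b"
```

With these toy judgments the ranking that reaches the second subtopic earlier scores higher, which is the behavior the measure is designed to reward.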

6.
Transaction logs of NAVER, a major Korean Web search engine, were analyzed to track the information-seeking behavior of Korean Web users. These transaction logs include more than 40 million queries collected over 1 week. This study examines current transaction log analysis methodologies and proposes a method for log cleaning, session definition, and query classification. A term definition method, which is necessary for Korean transaction log analysis, is also discussed. The results of this study show that users behave in a simple way: they type in short queries with a few query terms, seldom use advanced features, and view few result pages. Users also behave in a passive way: they seldom change search environments set by the system. It is of interest that users tend to change their queries completely rather than adding or deleting terms to modify the previous queries. The results of this study might contribute to the development of more efficient and effective Web search engines and services.

7.
Query recommendation has long been considered a key feature of search engines, which can improve users’ search experience by providing useful query suggestions for their search tasks. Most existing approaches to query recommendation aim to recommend relevant queries, i.e., alternative queries similar to a user’s initial query. However, the ultimate goal of query recommendation is to help users reformulate queries so that they can accomplish their search task successfully and quickly. Considering only relevance in query recommendation does not directly serve this goal. In this paper, we argue that it is more important to directly recommend queries with high utility, i.e., queries that can better satisfy users’ information needs. For this purpose, we attempt to infer query utility from users’ sequential search behaviors recorded in their search sessions. Specifically, we propose a dynamic Bayesian network, referred to as the Query Utility Model (QUM), to capture query utility by simultaneously modeling users’ reformulation and click behaviors. We then recommend queries with high utility to help users better accomplish their search tasks. We empirically evaluated the performance of our approach on a publicly released query log by comparing it with state-of-the-art methods. The experimental results show that, by recommending high-utility queries, our approach is far more effective in helping users find relevant search results and thus in satisfying their information needs.

8.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have tried to solve this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, such information is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct term-term correlations that can be used effectively to improve retrieval performance. In this study, we try to construct query concepts that denote users’ information needs from a document space, rather than reformulating initial queries using term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document and then cluster the features into primitive concepts, which are then used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.

9.
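Sample queries were selected from the Sogou query log and manually annotated. By analyzing the annotated news queries, we propose new features for identifying news intent: query expression features, features of the query's distribution over time, and clicked-result features. Based on these three features, a decision tree classifier is used to identify news intent in queries automatically. The findings are: (1) the goals of news queries concentrate on information about specific topics and on entertainment, and their topics are mostly entertainment, politics, sports, and the economy; (2) compared with non-news queries, news queries are more likely to contain entities, fluctuate more in their distribution over time, and have more similar clicked results; (3) the method identifies news intent in queries well, with macro-averaged precision, recall, and F-measure of 0.76, 0.73, and 0.74, respectively.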

10.
Users often issue all kinds of queries to look for the same target due to the intrinsic ambiguity and flexibility of natural languages. Some previous work clusters queries based on co-clicks; however, the intents of queries in one cluster are often only roughly related rather than truly similar. It is desirable to automatically mine queries with equivalent intents from large-scale search logs. In this paper, we take into account similarities between query strings. There are two issues associated with such similarities: it is too costly to compare every pair of queries in large-scale search logs, and two queries with a similar formulation, such as “SVN” (Apache Subversion) and “SVM” (support vector machine), are not necessarily similar in their intents. To address these issues, we propose applying the similarities of query strings on top of the co-click based clustering results. Our method improves precision over the co-click based clustering method (lifting precision from 0.37 to 0.62), and outperforms a commercial search engine’s query alteration (lifting the \(F_1\) measure from 0.42 to 0.56). As an application, we consider web document retrieval. We aggregate similar queries’ click-throughs with the query’s own click-throughs and evaluate them on a large-scale dataset. Experimental results indicate that our proposed method significantly outperforms the baseline method of using only a query’s own click-throughs in all metrics.

11.
12.
Relevance feedback methods generally suffer from topic drift caused by word ambiguities and synonymous uses of words. Topic drift is an important issue in patent information retrieval because people tend to use different expressions to describe similar concepts, causing low precision and low recall at the same time. Furthermore, failing to retrieve patents relevant to an application during the examination process may lead to legal problems from granting a patent for an existing invention. A possible cause of topic drift is the use of a relevance feedback-based search method. As a way to alleviate this inherent problem, we propose a novel query phrase expansion approach that utilizes semantic annotations in Wikipedia pages, trying to enrich queries with phrases that disambiguate the original query words. The idea was implemented for patent search, where patents are classified into a hierarchy of categories, and the analyses of the experimental results showed not only the positive roles of phrases and words in retrieving additional relevant documents through query expansion but also their contributions to alleviating the query drift problem. More specifically, our query expansion method was compared against a relevance-based language model, a state-of-the-art query expansion method, to show its superiority in terms of MAP on all levels of the classification hierarchy.

13.
Coverage-based search result diversification (cited 1 time: 0 self-citations, 1 by others)
Traditional retrieval models may provide users with a less satisfactory search experience because documents are scored independently and the top-ranked documents often contain excessively redundant information. Intuitively, it is more desirable to diversify search results so that the top-ranked documents cover different query subtopics, i.e., different pieces of relevant information. In this paper, we study the problem of search result diversification in an optimization framework whose objective is to maximize a coverage-based diversity function. We first define the diversity score of a set of search results by measuring the coverage of query subtopics in the result set, and then discuss how to use these scores to derive diversification methods. The key challenge here is how to define an appropriate coverage function given a query and a set of search results. To address this challenge, we propose and systematically study three different strategies to define coverage functions. They are based on summations, loss functions, and evaluation measures, respectively. Each of these coverage functions leads to a result diversification method. We show that the proposed coverage-based diversification methods not only cover several state-of-the-art methods but also allow us to derive new ones. We compare these methods both analytically and empirically. Experimental results on two standard TREC collections show that all the methods are effective for diversification and that the new methods can outperform existing ones.
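
The general idea of selecting results to maximize subtopic coverage can be sketched with a simple greedy loop. This is an illustrative sketch only: the scoring rule below (a weighted mix of relevance and the number of newly covered subtopics) is an assumption chosen for brevity and is not one of the three coverage functions studied in the paper. The names `greedy_diversify`, `lam`, and the toy data are likewise hypothetical.

```python
def greedy_diversify(relevance, subtopics_of, k, lam=0.5):
    """Greedily pick k documents.

    At each step, take the document with the best mix of its own relevance
    and the number of subtopics it adds to those already covered.
    """
    selected, covered = [], set()
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(doc):
            new_topics = len(subtopics_of.get(doc, set()) - covered)
            return (1 - lam) * relevance[doc] + lam * new_topics
        # sorted() fixes the iteration order so that ties break deterministically
        best = max(sorted(candidates), key=score)
        selected.append(best)
        covered |= subtopics_of.get(best, set())
        candidates.remove(best)
    return selected

# Hypothetical relevance scores and subtopic coverage.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
subtopics_of = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
top2 = greedy_diversify(relevance, subtopics_of, k=2)
```

In this toy example the second pick is `d3` rather than the more relevant `d2`, because `d2` only repeats a subtopic that `d1` already covers.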

14.
In the patent domain, significant efforts are invested to assist researchers in formulating better queries, preferably via automated query expansion. Currently, automatic query expansion in patent search is mostly limited to computing co-occurring terms for the searchable features of the invention. Additional query terms are extracted automatically from patent documents based on entropy measures. Learning synonyms in the patent domain for automatic query expansion has been a difficult task. No dedicated sources providing synonyms for the patent domain, such as patent domain-specific lexica or thesauri, are available. In this paper we focus on the highly professional search setting of patent examiners. In particular, we use query logs to learn synonyms for the patent domain. For automatic query expansion, we create term networks based on the query logs specifically for several USPTO patent classes. Experiments show good performance in automatic query expansion using these automatically generated term networks. Specifically, the performance of the learned term networks increases as more query logs become available for a specific US patent class.

15.
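By mining the semantic relations among query terms in web logs, semantic knowledge from HowNet (《知网》) is incorporated into a clustering algorithm to optimize a search engine. The method uses machine learning algorithms to mine query logs in depth, computing concept similarity and semantic clusters for the query strings, so that the returned pages are more reasonable, more accurate results are presented to users, and user needs are better met.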

16.
Social tagging systems have gained increasing popularity as a method of annotating and categorizing a wide range of different web resources. Web search that utilizes social tagging data suffers from an extreme example of the vocabulary mismatch problem encountered in traditional information retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each resource. Previous research has proposed the utilization of query expansion to deal with search in this rather complicated space. However, non-personalized approaches based on relevance feedback and personalized approaches based on co-occurrence statistics have shown only limited improvements. This paper proposes a novel query expansion framework based on individual user profiles mined from the annotations and resources the user has marked. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by, terms extracted from the documents ranked highly for the initial query. The framework also simultaneously incorporates annotations and web documents through a Tag-Topic model in a latent graph. The experimental results suggest that the proposed personalized query expansion method can produce better results than both the classical non-personalized search approach and other personalized query expansion methods. Hence, the proposed approach significantly benefits personalized web search by leveraging users’ social media data.

17.
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using structured target queries. The effectiveness of the translated queries was compared to that of the monolingual queries. Thirdly, pseudo-relevance feedback was used to expand the original target queries. CLIR performance was evaluated using three relevance thresholds: stringent, regular, and liberal. When the regular or liberal threshold was used, reasonable performance was achieved. With the stringent threshold, equally high performance could not be achieved. At all the relevance thresholds, the performance of the translated queries was successfully raised by pseudo-relevance feedback based query expansion. However, the performance at the stringent threshold relative to the other thresholds could not be raised by this method.

18.
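Research on Web information retrieval applies the results of text retrieval research and combines them with ideas from Web graph theory to study information retrieval on the Web; it is an effective route to Web knowledge discovery. The traditional HITS method yields information of rather low precision, while PageRank, as a general-purpose search method, cannot be applied to acquiring information on a specific topic. Based on a thorough analysis of existing algorithms such as PageRank and HITS and of methods for computing the similarity of Web documents, the RG-HITS algorithm is proposed for discovering information related to a specific query topic on the Web. It combines Web hyperlinks, the information relevance of the knowledge representation of web pages, and the HITS method to search for knowledge related to a specific topic on the Web.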
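
For orientation, the plain HITS iteration that this abstract takes as its starting point can be sketched as follows. This is the classic hub/authority update only, not the RG-HITS algorithm the abstract proposes, whose details are not given here. The function name, the iteration count, and the toy link graph are assumptions made for the example.

```python
import math

def hits(links, iterations=20):
    """Plain HITS on a link graph.

    links maps each page to the list of pages it points to.
    Returns (hub, authority) score dictionaries, each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: sum the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: a -> c, b -> c, c -> d.
links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(links)
```

In this toy graph page `c` receives the most inbound links from good hubs and ends up with the highest authority score, while `a` and `b` become the strongest hubs.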

19.
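Research on a domain-ontology-based method for dynamically organizing digital library retrieval results (cited 1 time: 1 self-citation, 0 by others)
Based on an analysis of existing methods for organizing digital library retrieval results, and from the standpoint of staying faithful to the user's query, a domain-ontology-based method for dynamically organizing retrieval results is proposed. The basic idea is to connect the index labels of documents effectively with the user's query: a query model is built from the user's query, a label model is built from the retrieval results, and the two models are mapped onto each other at the semantic level through a domain ontology. Document labels and user queries thus become interoperable at the semantic level, and the retrieval results are ultimately presented according to the semantics of the user's query.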

20.
In retrieving medical free text, users are often interested in answers pertinent to certain scenarios that correspond to common tasks performed in medical practice, e.g., treatment or diagnosis of a disease. A major challenge in handling such queries is that scenario terms in the query (e.g., treatment) are often too general to match specialized terms in relevant documents (e.g., chemotherapy). In this paper, we propose a knowledge-based query expansion method that exploits the UMLS knowledge source to append to the original query additional terms that are specifically relevant to the query's scenario(s). We compared the proposed method with traditional statistical expansion, which adds terms that are statistically correlated but not necessarily scenario specific. Our study on two standard testbeds shows that the knowledge-based method, by providing scenario-specific expansion, yields notable improvements over the statistical method in terms of average precision-recall. On the OHSUMED testbed, for example, the improvement is more than 5% averaged over all scenario-specific queries studied and about 10% for queries that mention certain scenarios, such as treatment of a disease and differential diagnosis of a symptom/disease.
Wesley W. Chu
