首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Web search queries are often ambiguous or faceted, and the task of identifying the major underlying senses and facets of queries has received much attention in recent years. We refer to this task as query subtopic mining. In this paper, we propose to use surrounding text of query terms in top retrieved documents to mine subtopics and rank them. We first extract text fragments containing query terms from different parts of documents. Then we group similar text fragments into clusters and generate a readable subtopic for each cluster. Based on the cluster and the language model trained from a query log, we calculate three features and combine them into a relevance score for each subtopic. Subtopics are finally ranked by balancing relevance and novelty. Our evaluation experiments with the NTCIR-9 INTENT Chinese Subtopic Mining test collection show that our method significantly outperforms a query log based method proposed by Radlinski et al. (2010) and a search result clustering based method proposed by Zeng et al. (2004) in terms of precision, I-rec, D-nDCG and D#-nDCG, the official evaluation metrics used at the NTCIR-9 INTENT task. Moreover, our generated subtopics are significantly more readable than those generated by the search result clustering method.  相似文献   

2.
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper is aimed at mining the subtopics of a query either indirectly from the returned results of retrieval systems or directly from the query itself to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of clusters is investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on the TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on the TREC10. The results conclude that the subtopic mining technique with the up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and the proposed subtopic-based diversification algorithm can select the documents covering various subtopics.  相似文献   

3.
Query recommendation has long been considered a key feature of search engines, which can improve users’ search experience by providing useful query suggestions for their search tasks. Most existing approaches on query recommendation aim to recommend relevant queries, i.e., alternative queries similar to a user’s initial query. However, the ultimate goal of query recommendation is to assist users to reformulate queries so that they can accomplish their search task successfully and quickly. Only considering relevance in query recommendation is apparently not directly toward this goal. In this paper, we argue that it is more important to directly recommend queries with high utility, i.e., queries that can better satisfy users’ information needs. For this purpose, we attempt to infer query utility from users’ sequential search behaviors recorded in their search sessions. Specifically, we propose a dynamic Bayesian network, referred as Query Utility Model (QUM), to capture query utility by simultaneously modeling users’ reformulation and click behaviors. We then recommend queries with high utility to help users better accomplish their search tasks. We empirically evaluated the performance of our approach on a publicly released query log by comparing with the state-of-the-art methods. The experimental results show that, by recommending high utility queries, our approach is far more effective in helping users find relevant search results and thus satisfying their information needs.  相似文献   

4.
Transaction logs of NAVER, a major Korean Web search engine, were analyzed to track the information-seeking behavior of Korean Web users. These transaction logs include more than 40 million queries collected over 1 week. This study examines current transaction log analysis methodologies and proposes a method for log cleaning, session definition, and query classification. A term definition method which is necessary for Korean transaction log analysis is also discussed. The results of this study show that users behave in a simple way: they type in short queries with a few query terms, seldom use advanced features, and view few results' pages. Users also behave in a passive way: they seldom change search environments set by the system. It is of interest that users tend to change their queries totally rather than adding or deleting terms to modify the previous queries. The results of this study might contribute to the development of more efficient and effective Web search engines and services.  相似文献   

5.
The critical task of predicting clicks on search advertisements is typically addressed by learning from historical click data. When enough history is observed for a given query-ad pair, future clicks can be accurately modeled. However, based on the empirical distribution of queries, sufficient historical information is unavailable for many query-ad pairs. The sparsity of data for new and rare queries makes it difficult to accurately estimate clicks for a significant portion of typical search engine traffic. In this paper we provide analysis to motivate modeling approaches that can reduce the sparsity of the large space of user search queries. We then propose methods to improve click and relevance models for sponsored search by mining click behavior for partial user queries. We aggregate click history for individual query words, as well as for phrases extracted with a CRF model. The new models show significant improvement in clicks and revenue compared to state-of-the-art baselines trained on several months of query logs. Results are reported on live traffic of a commercial search engine, in addition to results from offline evaluation.  相似文献   

6.
We propose a method for search privacy on the Internet, focusing on enhancing plausible deniability against search engine query-logs. The method approximates the target search results, without submitting the intended query and avoiding other exposing queries, by employing sets of queries representing more general concepts. We model the problem theoretically, and investigate the practical feasibility and effectiveness of the proposed solution with a set of real queries with privacy issues on a large web collection. The findings may have implications for other IR research areas, such as query expansion and fusion in meta-search. Finally, we discuss ideas for privacy, such as k-anonymity, and how these may be applied to search tasks.  相似文献   

7.
In the patent domain significant efforts are invested to assist researchers in formulating better queries, preferably via automated query expansion. Currently, automatic query expansion in patent search is mostly limited to computing co-occurring terms for the searchable features of the invention. Additional query terms are extracted automatically from patent documents based on entropy measures. Learning synonyms in the patent domain for automatic query expansion has been a difficult task. No dedicated sources providing synonyms for the patent domain, such as patent domain specific lexica or thesauri, are available. In this paper we focus on the highly professional search setting of patent examiners. In particular, we use query logs to learn synonyms for the patent domain. For automatic query expansion, we create term networks based on the query logs specifically for several USPTO patent classes. Experiments show good performance in automatic query expansion using these automatically generated term networks. Specifically, with a larger number of query logs for a specific patent US class available the performance of the learned term networks increases.  相似文献   

8.
This paper reports findings from an analysis of medical or health queries to different web search engines. We report results: (i). comparing samples of 10000 web queries taken randomly from 1.2 million query logs from the AlltheWeb.com and Excite.com commercial web search engines in 2001 for medical or health queries, (ii). comparing the 2001 findings from Excite and AlltheWeb.com users with results from a previous analysis of medical and health related queries from the Excite Web search engine for 1997 and 1999, and (iii). medical or health advice-seeking queries beginning with the word 'should'. Findings suggest: (i). a small percentage of web queries are medical or health related, (ii). the top five categories of medical or health queries were: general health, weight issues, reproductive health and puberty, pregnancy/obstetrics, and human relationships, and (iii). over time, the medical and health queries may have declined as a proportion of all web queries, as the use of specialized medical/health websites and e-commerce-related queries has increased. Findings provide insights into medical and health-related web querying and suggests some implications for the use of the general web search engines when seeking medical/health information.  相似文献   

9.
10.
信息检索扩展技术研究   总被引:1,自引:0,他引:1  
本文针对信息检索在查询扩展方面的不足,提出了一种结合本体理论和用户相关反馈技术的查询扩展方法。以FirteX作为检索平台, 选取WordNet作为本体扩展资源来验证本文所提出的查询扩展算法,实现结果表明该方法比基于余弦相似性的查询扩展方法在平均查全率、平均查准率方面有更大的优点。  相似文献   

11.
基于Sogou实验室提供的查询日志数据和新闻数据,探讨潜在时间意图查询的判断及其相关时间属性识别,构建潜在时间意图查询的检索排序模型。实验结果表明,时间属性识别的准确率为85%,且构建的检索模型能有效提高排序效果。  相似文献   

12.
This study investigates the information seeking behavior of general Korean Web users. The data from transaction logs of selected dates from August 2006 to August 2007 were used to examine characteristics of Web queries and to analyze click logs that consist of a collection of documents that users clicked and viewed for each query. Changes in search topics are explored for NAVER users from 2003/2004 to 2006/2007. Patterns involving spelling errors and queries in foreign languages are also investigated. Search behaviors of Korean Web users are compared to those of the United States and other countries. The results show that entertainment is the topranked category, followed by shopping, education, games, and computer/Internet. Search topics changed from computer/Internet to entertainment and shopping from 2003/2004 to 2006/2007 in Korea. The ratios of both spelling errors and queries in foreign languages are low. This study reveals differences for search topics among different regions of the world. The results suggest that the analysis of click logs allows for the reduction of unknown or unidentifiable queries by providing actual data on user behaviors and their probable underlying information needs. The implications for system designers and Web content providers are discussed.  相似文献   

13.
从Sogou查询日志中选取样本查询且进行人工标注,通过对标注后新闻查询的分析,提出能用于识别新闻意图的新特征,即查询表达式特征、查询随时间分布特征以及点击结果特征。根据这3个特征,利用决策树分类器实现查询中新闻意图的自动识别,结果发现:①新闻类查询的查询目标主要集中在特定主题信息以及娱乐类信息方面,其查询主题大多为娱乐、政治、体育与经济类信息;②相对非新闻查询,新闻查询具有更可能包含实体、随时间分布波动较大、点击结果之间相似度更高的特点;③本方法对查询中新闻意图的识别效果较好,其宏平均准确率、召回率、F值分别为 0.76、0.73、0、74。  相似文献   

14.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISP), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users’ in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks like discovering semantically relevant query suggestions and Web page categorization by topic.  相似文献   

15.
基于领域本体的数字图书馆检索结果动态组织方法研究   总被引:1,自引:1,他引:0  
在对现有数字图书馆检索结果的组织方法进行分析的基础上,从忠实于用户提问的角度,提出基于领域本体的检索结果动态组织方法。基本解决思路是将文献的标识与用户的提问进行有效地对接,即以用户提问为基础构造提问模型,并基于检索结果构造标识模型,将提问模型与标识模型在语义层面通过领域本体进行映射,从而实现文献标识与用户提问在语义层面的互通,最终以用户提问的语义方式来展现检索结果。  相似文献   

16.
随着信息的爆炸,人们对检索系统的功能、智能化程度以及检索效果的要求更高,希望它们能提供更准确、更精炼和更符合个人需要的检索结果。本文提出一种基于用户兴趣的个性化检索方法,结合分类法的思想,用“分类”代替“关键词”表示用户兴趣,改进了信息过滤的方法,优化了检索结果,使其更加符合用户的需要,实现了基于用户兴趣的个性化信息检索。此外,开发了基于用户兴趣的个性化检索系统,并进行了相关实验,验证了该方法确可明显改善检索效果。  相似文献   

17.
Query suggestion, which enables the user to revise a query with a single click, has become one of the most fundamental features of Web search engines. However, it has not been clear what circumstances cause the user to turn to query suggestion. In order to investigate when and how the user uses query suggestion, we analyzed three kinds of data sets obtained from a major commercial Web search engine, comprising approximately 126 million unique queries, 876 million query suggestions and 306 million action patterns of users. Our analysis shows that query suggestions are often used (1) when the original query is a rare query, (2) when the original query is a single-term query, (3) when query suggestions are unambiguous, (4) when query suggestions are generalizations or error corrections of the original query, and (5) after the user has clicked on several URLs in the first search result page. Our results suggest that search engines should provide better assistance especially when rare or single-term queries are input, and that they should dynamically provide query suggestions according to the searcher’s current state.  相似文献   

18.
Relevance feedback methods generally suffer from topic drift caused by word ambiguities and synonymous uses of words. Topic drift is an important issue in patent information retrieval as people tend to use different expressions describing similar concepts causing low precision and recall at the same time. Furthermore, failing to retrieve relevant patents to an application during the examination process may cause legal problems caused by granting an existing invention. A possible cause of topic drift is utilizing a relevance feedback-based search method. As a way to alleviate the inherent problem, we propose a novel query phrase expansion approach utilizing semantic annotations in Wikipedia pages, trying to enrich queries with phrases disambiguating the original query words. The idea was implemented for patent search where patents are classified into a hierarchy of categories, and the analyses of the experimental results showed not only the positive roles of phrases and words in retrieving additional relevant documents through query expansion but also their contributions to alleviating the query drift problem. More specifically, our query expansion method was compared against relevance-based language model, a state-of-the-art query expansion method, to show its superiority in terms of MAP on all levels of the classification hierarchy.  相似文献   

19.
为减少元搜索引擎中无效成员搜索引擎返回的大量重复冗余信息、减轻后期结果处理的负担、提高系统的查准率,文章提出一种基于奖励机制的成员搜索引擎调度策略。该策略引入Agent技术,将每个成员搜索引擎Agent对查询的重要程度进行量化管理,选择检索性能最佳的若干成员搜索引擎进行调度。实验结果证明,这种基于奖励机制的成员搜索引擎调度策略在提高查准率、缩短查询时间、减轻元搜索引擎后期的结果处理负担方面,都优于传统的成员搜索引擎调度策略。  相似文献   

20.
We study the problem of web search result diversification in the case where intent based relevance scores are available. A diversified search result will hopefully satisfy the information need of user-L.s who may have different intents. In this context, we first analyze the properties of an intent-based metric, ERR-IA, to measure relevance and diversity altogether. We argue that this is a better metric than some previously proposed intent aware metrics and show that it has a better correlation with abandonment rate. We then propose an algorithm to rerank web search results based on optimizing an objective function corresponding to this metric and evaluate it on shopping related queries.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号