首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 250 毫秒
1.
User generated content forms an important domain for mining knowledge. In this paper, we address the task of blog feed search: to find blogs that are principally devoted to a given topic, as opposed to blogs that merely happen to mention the topic in passing. The large number of blogs makes the blogosphere a challenging domain, both in terms of effectiveness and of storage and retrieval efficiency. We examine the effectiveness of an approach to blog feed search that is based on individual posts as indexing units (instead of full blogs). Working in the setting of a probabilistic language modeling approach to information retrieval, we model the blog feed search task by aggregating over a blogger’s posts to collect evidence of relevance to the topic and persistence of interest in the topic. This approach achieves state-of-the-art performance in terms of effectiveness. We then introduce a two-stage model where a pre-selection of candidate blogs is followed by a ranking step. The model integrates aggressive pruning techniques as well as very lean representations of the contents of blog posts, resulting in substantial gains in efficiency while maintaining effectiveness at a very competitive level.  相似文献   

2.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical where only a portion of the documents may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version.  相似文献   

3.
The retrieval of sentences that are relevant to a given information need is a challenging passage retrieval task. In this context, the well-known vocabulary mismatch problem arises severely because of the fine granularity of the task. Short queries, which are usually the rule rather than the exception, aggravate the problem. Consequently, effective sentence retrieval methods tend to apply some form of query expansion, usually based on pseudo-relevance feedback. Nevertheless, there are no extensive studies comparing different statistical expansion strategies for sentence retrieval. In this work we study thoroughly the effect of distinct statistical expansion methods on sentence retrieval. We start from a set of retrieved documents in which relevant sentences have to be found. In our experiments different term selection strategies are evaluated and we provide empirical evidence to show that expansion before sentence retrieval yields competitive performance. This is particularly novel because expansion for sentence retrieval is often done after sentence retrieval (i.e. expansion terms are mined from a ranked set of sentences) and there are no comparative results available between both types of expansion. Furthermore, this comparison is particularly valuable because there are important implications in time efficiency. We also carefully analyze expansion on weak and strong queries and demonstrate clearly that expanding queries before sentence retrieval is not only more convenient for efficiency purposes, but also more effective when handling poor queries.  相似文献   

4.
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using structured target queries. The effectiveness of the translated queries was compared to that of the monolingual queries. Thirdly, pseudo-relevance feedback was used to expand the original target queries. CLIR performance was evaluated using three relevance thresholds: stringent, regular, and liberal. When regular or liberal threshold was used, a reasonable performance was achieved. Using stringent threshold, equally high performance could not be achieved. On all the relevance thresholds the performance of the translated queries was successfully raised by pseudo-relevance feedback based query expansion. However, the performance of the stringent threshold in relation to the other thresholds could not be raised by this method.  相似文献   

5.
博客搜索引擎与传统搜索引擎的比较研究   总被引:8,自引:0,他引:8  
简要介绍了博客与国内外著名博客搜索引擎,针对博客搜索引擎与传统搜索引擎的不同,从工作原理、检索内容与检索方式三个方面对两种搜索引擎进行了系统的分析与比较,并选取了四个不同方面的具有代表性的主题,对两种搜索引擎的代表进行了检索功能和检索性能方面的测评,最后指出了两种搜索引擎在资源价值、检索方式、个性化服务等方面的各自的优势与不足之处,以期对两种搜索引擎的改进提供借鉴。  相似文献   

6.
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue for the adoption of score regularization as general design principle or post-processing step for information retrieval systems.
Fernando DiazEmail:
  相似文献   

7.
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper is aimed at mining the subtopics of a query either indirectly from the returned results of retrieval systems or directly from the query itself to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of clusters is investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on the TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on the TREC10. The results conclude that the subtopic mining technique with the up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and the proposed subtopic-based diversification algorithm can select the documents covering various subtopics.  相似文献   

8.
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.  相似文献   

9.
Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information for each collection such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state of the art collection selection algorithms.  相似文献   

10.
While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.  相似文献   

11.
Credibility-inspired ranking for blog post retrieval   总被引:1,自引:0,他引:1  
Credibility of information refers to its believability or the believability of its sources. We explore the impact of credibility-inspired indicators on the task of blog post retrieval, following the intuition that more credible blog posts are preferred by searchers. Based on a previously introduced credibility framework for blogs, we define several credibility indicators, and divide them into post-level (e.g., spelling, timeliness, document length) and blog-level (e.g., regularity, expertise, comments) indicators. The retrieval task at hand is precision-oriented, and we hypothesize that the use of credibility-inspired indicators will positively impact precision. We propose to use ideas from the credibility framework in a reranking approach to the blog post retrieval problem: We introduce two simple ways of reranking the top n of an initial run. The first approach, Credibility-inspired reranking, simply reranks the top n of a baseline based on the credibility-inspired score. The second approach, Combined reranking, multiplies the credibility-inspired score of the top n results by their retrieval score, and reranks based on this score. Results show that Credibility-inspired reranking leads to larger improvements over the baseline than Combined reranking, but both approaches are capable of improving over an already strong baseline. For Credibility-inspired reranking the best performance is achieved using a combination of all post-level indicators. Combined reranking works best using the post-level indicators combined with comments and pronouns. The blog-level indicators expertise, regularity, and coherence do not contribute positively to the performance, although analysis shows that they can be useful for certain topics. Additional analysis shows that a relative small value of n (15–25) leads to the best results, and that posts that move up the ranking due to the integration of reranking based on credibility-inspired indicators do indeed appear to be more credible than the ones that go down.  相似文献   

12.
The Reliable Information Access (RIA) Workshop was held in the summer of 2003, with a goal of improved understanding of information retrieval systems, in particular with regard to the variability of retrieval performance across topics. The workshop ran massive cross-system failure analysis on 45 of the TREC topics and also performed cross-system experiments on pseudo-relevance feedback. This paper presents an overview of that workshop, along with some preliminary conclusions from these experiments. Even if this workshop was held 6 years ago, the issues of improving system performance across all topics is still critical to the field and this paper, along with the others in this issue, are the first widely published full papers for the workshop.  相似文献   

13.
随着社交网络的兴起和发展,互联网上出现了大量与商品有关的社会信息。如何利用这些社会信息结合商品元数据进行检索和推荐是信息检索领域中一个热门的研究问题。本文以社会图书检索为例,提出了一种通用的信息检索方法来解决这一问题。首先,通过分析原始图书数据集和图书的用户标签、用户评分和流行度等社会信息,从图书中提取不同的社会特征构建特征矩阵;然后分别计算图书在各种社会特征上的相似度,并使用不同的策略对搜索引擎返回的排序结果进行重排序;最后使用学习排序的方法进行重排结果融合,得到最终的图书检索结果。在实验中,使用该检索方法在INEX Social Book Search 2015和2016数据集上分别进行了训练和测试。结果表明,相比现有的技术,该检索方法能够有效提升图书检索的效果。  相似文献   

14.
段落检索及其相关算法研究   总被引:2,自引:0,他引:2  
总结段落检索及其涉及的段落划分和相关算法,讨论文本分割和段落抽取的差别,介绍并比较几种常用的段落划分方法以及几类段落检索算法,在此基础上对段落检索的研究方向进行展望。  相似文献   

15.
The paper presents several techniques for selecting noun phrases for interactive query expansion following pseudo-relevance feedback and a new phrase-based document ranking method. A combined syntactico-statistical method was used for the selection of phrases for query expansion. Several statistical measures of phrase selection were evaluated. Experiments were also conducted studying the effectiveness of noun phrases in document ranking. One of the major problems in phrase-based document retrieval is weighting of overlapping and non-contiguous word sequences in documents. The paper presents a new method of phrase weighting, which addressed this problem, and its evaluation on the TREC dataset.  相似文献   

16.
交互式跨语言信息检索是信息检索的一个重要分支。在分析交互式跨语言信息检索过程、评价指标、用户行为进展等理论研究基础上,设计一个让用户参与跨语言信息检索全过程的用户检索实验。实验结果表明:用户检索词主要来自检索主题的标题;用户判断文档相关性的准确率较高;目标语言文档全文、译文摘要、译文全文都是用户认可的判断依据;翻译优化方法以及翻译优化与查询扩展的结合方法在用户交互环境下非常有效;用户对于反馈后的翻译仍然愿意做进一步选择;用户对于与跨语言信息检索系统进行交互是有需求并认可的。用户行为分析有助于指导交互式跨语言信息检索系统的设计与实践。  相似文献   

17.
Patent prior art search is a type of search in the patent domain where documents are searched for that describe the work previously carried out related to a patent application. The goal of this search is to check whether the idea in the patent application is novel. Vocabulary mismatch is one of the main problems of patent retrieval which results in low retrievability of similar documents for a given patent application. In this paper we show how the term distribution of the cited documents in an initially retrieved ranked list can be used to address the vocabulary mismatch. We propose a method for query modeling estimation which utilizes the citation links in a pseudo relevance feedback set. We first build a topic dependent citation graph, starting from the initially retrieved set of feedback documents and utilizing citation links of feedback documents to expand the set. We identify the important documents in the topic dependent citation graph using a citation analysis measure. We then use the term distribution of the documents in the citation graph to estimate a query model by identifying the distinguishing terms and their respective weights. We then use these terms to expand our original query. We use CLEF-IP 2011 collection to evaluate the effectiveness of our query modeling approach for prior art search. We also study the influence of different parameters on the performance of the proposed method. The experimental results demonstrate that the proposed approach significantly improves the recall over a state-of-the-art baseline which uses the link-based structure of the citation graph but not the term distribution of the cited documents.  相似文献   

18.
Library 2.0 literature has described many of the possibilities Web 2.0 technologies offer to libraries. Case studies have assessed local use, but no studies have measured the Library 2.0 phenomenon by searching public social networking sites. This study used library-specific terms to search public social networking sites, blog search engines, and social bookmarking sites for activity associated with librarians and library users. Blog search data about the recentness of activity or the popularity of a blog post indicate that Library 2.0 technology has many early adopters but provides less evidence of sustained use. The results follow a curve resembling the 80/20 rule and also resemble Chris Anderson's “long tail” effect, in which very few authors create the vast amount of content. These exploratory results can be used as a starting point for future studies. Librarians who use tags to describe Web-based content might use these findings to select more effective tags. Librarians implementing a blog or a social networking presence might use this study to balance the benefits with the amount of work required to maintain an up-to-date presence.  相似文献   

19.
Experiments were conducted to explore the impact of combining various components of eight leading information retrieval systems. Each system demonstrated improved effectiveness through the use of blind feedback, also known as pseudo-relevance feedback, a form of query expansion. Blind feedback uses the results of a preliminary retrieval step to augment the efficacy of a secondary retrieval step. The hybrid combination of primary and secondary retrieval steps from different systems in a number of cases yielded better effectiveness than either of the constituent systems alone. This positive combining effect was observed when entire documents were passed between the two retrieval steps, but not when only the expansion terms were passed. Several combinations of primary and secondary retrieval steps were fused using the CombMNZ algorithm; all yielded significant effectiveness improvement over the individual systems, with the best yielding an improvement of 13% (p = 10−6) over the best individual system and an improvement of 4% (p = 10−5) over a simple fusion of the eight systems.  相似文献   

20.
blog信息源的信息组织与利用   总被引:6,自引:0,他引:6  
颜丽君 《图书情报工作》2004,48(11):107-110
分析blog信息源的类型及特征和blog的信息自组织方式、分类方式、搜索方式,重点论述如何利用blog信息组织的特点改进blog信息搜索。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号