Similar literature
 20 similar documents found (search time: 234 ms)
1.
In the web environment, most of the queries issued by users are implicit by nature. Inferring the different temporal intents of this type of query enhances the overall temporal part of the web search results. Previous works tackling this problem usually focused on news queries, where the retrieval of the most recent results related to the query is usually sufficient to meet the user's information needs. However, few works have studied the importance of time in queries such as “Philip Seymour Hoffman”, where the results may require no recency at all. In this work, we focus on this type of query, named “time-sensitive queries”, where the results are preferably from a diversified time span, not necessarily the most recent one. Unlike related work, we follow a content-based approach to identify the most important time periods of the query and integrate time into a re-ranking model to boost the retrieval of documents whose contents match the query time period. For that purpose, we define a linear combination of topical and temporal scores, which reflects the relevance of any web document in both the topical and temporal dimensions, thus helping to improve the effectiveness of the ranked results across different types of queries. Our approach relies on a novel temporal similarity measure that is capable of determining the most important dates for a query, while filtering out the non-relevant ones. Through extensive experimental evaluation over web corpora, we show that our model offers promising results compared to baseline approaches. As a result of our investigation, we publicly provide a set of web services and a web search interface so that the system can be graphically explored by the research community.
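
The linear combination of topical and temporal scores mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the weights or the score definitions, so the mixing weight `alpha`, the `ScoredDoc` type and the assumption that both scores lie in [0, 1] are hypothetical choices made only for illustration.

```python
from typing import List, NamedTuple


class ScoredDoc(NamedTuple):
    """A document with precomputed scores (hypothetical container)."""
    doc_id: str
    topical: float   # topical relevance, assumed to lie in [0, 1]
    temporal: float  # temporal relevance, assumed to lie in [0, 1]


def combined_score(doc: ScoredDoc, alpha: float = 0.5) -> float:
    """Linear combination: alpha weights the topical part, (1 - alpha) the temporal part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * doc.topical + (1.0 - alpha) * doc.temporal


def rerank(docs: List[ScoredDoc], alpha: float = 0.5) -> List[ScoredDoc]:
    """Return documents sorted by descending combined score."""
    return sorted(docs, key=lambda d: combined_score(d, alpha), reverse=True)
```

With `alpha = 1.0` the ranking reduces to the purely topical order, and lowering `alpha` lets documents that match the query's time period move up.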

2.
With the increase of information on the Web, it is difficult to quickly find the desired information among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and is also a criterion for classifying documents. In this paper, we suggest multiple sets of features to classify genres of web documents. The basic set of features, proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences or the number of occurrences of a certain word. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents.
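
As a rough illustration of features extracted from URLs and HTML tags, the sketch below counts a few properties using only the Python standard library. The abstract does not list the actual features, so the feature names (`url_depth`, `n_links` and so on) and the function `web_features` are hypothetical examples rather than the authors' feature sets.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def web_features(url: str, html: str) -> dict:
    """Return a small, illustrative feature dictionary for one web page."""
    parsed = urlparse(url)
    counter = _TagCounter()
    counter.feed(html)
    counter.close()
    # Non-empty path segments, e.g. /blog/2020/post.html has depth 3.
    path_parts = [p for p in parsed.path.split("/") if p]
    return {
        "url_depth": len(path_parts),
        "url_has_query": int(bool(parsed.query)),
        "n_links": counter.tags["a"],
        "n_images": counter.tags["img"],
        "n_tables": counter.tags["table"],
    }
```

A feature dictionary of this kind could be concatenated with textual features before being passed to any classifier.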

3.
Searching a large document collection for relevant material that satisfies a user's information need is a critical activity for web search engines. Query expansion (QE) techniques are widely used by search engines to disambiguate the user’s information need and to improve information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback techniques are the main QE techniques. They employ different approaches to expand the user query with synonyms of the search terms (word synonymy) in order to bring in more relevant documents, and to filter out documents that contain the search terms but with a different meaning than the user intended (the word polysemy problem). This work surveys existing query expansion techniques, highlights their strengths and limitations, and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that applying knowledge-based or corpus-based query expansion techniques to the results of the relevance feedback step improves information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.

4.
李江华  时鹏 《情报杂志》2012,31(4):112-116
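The Internet has become the world's richest data source, with diverse and dynamically changing data types, and retrieving the information users need quickly and accurately is a problem that urgently needs to be solved. Traditional search engines search on a syntactic basis and lack semantic information, so they struggle to express accurately either the user's query needs or the document semantics of the retrieved objects, which results in low precision and recall and a limited search scope. This paper studies existing semantic retrieval methods and analyzes their problems. On this basis, it proposes a domain-based semantic search engine model that, combined with Semantic Web technology, uses a domain ontology metadata model to normalize user queries semantically, and extracts knowledge from documents according to the domain ontology schema and converts it to RDF. The model accurately expresses both the user's query semantics and the semantics of the documents being queried, and can greatly improve retrieval accuracy and efficiency. The paper describes the model's architecture, basic functions and working principles in detail.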

5.
Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.

6.
邓兵  蔡志勇  江洪 《现代情报》2011,31(10):81-84
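This paper describes the current work and service model of the NSTL service station in the Wuhan East Lake High-Tech Zone, and uses basic methods of literature information analysis to compile and analyze statistics on user behavior and on the service results of the network service platform. The results reveal the station's shortcomings and the areas that need improvement. The authors recommend using innovative thinking to improve literature information service capability and to expand information support channels, and they offer their own views on improving in-depth literature information services and the utilization of information resources.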

7.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information more exactly from web pages. After automatic segmentation of the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are finally assigned to the test document by using the k-NN text categorization algorithm. The effects of the characteristics of the chemistry dictionary and the test collection on the categorization efficiency are discussed in this paper, and a new voting method is also introduced to further improve the categorization performance based on the collection characteristics. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.

8.
This paper describes an applied document filtering system embedded in an operational watch center that monitors disease outbreaks worldwide. At the initial time of this writing, the system effectively supported monitoring of 23 geographic regions by filtering documents in several thousand daily news sources in 11 different languages. This paper describes the filtering algorithm and the statistical procedures for estimating precision and recall in an operational environment, summarizes operational performance data, and suggests lessons learned for other applications of document filtering technology. Overall, these results are interpreted as supporting the general utility of document filtering and information retrieval technology, and the paper offers recommendations for future applications of this technology.

9.
Content-based filtering can be deployed for personalised information dissemination on the web, but this is a possibility that has been largely ignored. Nowadays, there are no successful content-based filtering applications available online. Nootropia is an immune-inspired user profiling model for content-based filtering. It has the advantageous property of being able to represent a user’s multiple interests and adapt to a variety of changes in them. In this paper we describe our early efforts to develop real-world personalisation services based on Nootropia. We present the architecture, implementation, usage and evaluation of the personalised news and paper aggregator, which aggregates news and papers that are relevant to an individual’s interests. Our user study shows that Nootropia can effectively learn a user’s interests and identify relevant information. It also indicates that information filtering is a complicated task with many factors affecting its successful application in a real situation.

10.
The rapid development of the web has led to a considerable increase in information dissemination. Recently, personalized web service recommendation has become a popular research area in service computing. Research on web service recommendation systems mainly addresses two problems: prediction and completion of sparse QoS data, and personalized recommendation for the user. To address the issue of high data sparsity and low recommendation accuracy in traditional service recommendation models under mobile cloud, this study presents a hybrid collaborative filtering model for consumer service recommendation based on mobile cloud by introducing user preferences. An example verifies that service recommendation based on the model can effectively reduce data sparsity and increase prediction accuracy.

11.
This article explores how to develop complex data-driven user models that go beyond the bag-of-words model and topical relevance. We propose to learn from rich user-specific information and to satisfy complex user criteria under the graphical modelling framework. We carried out a user study with a web-based personal news filtering system, and collected extensive user information, including explicit user feedback, implicit user feedback and some contextual information. Experimental results on the collected data set demonstrate that the graphical modelling approach helps us to better understand the complex domain. The results also show that the complex data-driven user modelling approach can improve adaptive information filtering performance. We also discuss some practical issues in learning complex user models, including how to handle data noise and the missing data problem.

12.
Online information intermediaries such as Facebook and Google are slowly replacing traditional media channels, thereby partly becoming the gatekeepers of our society. To deal with the growing amount of information on the social web and the burden it places on the average user, these gatekeepers recently started to introduce personalization features: algorithms that filter information per individual. In this paper we show that these online services that filter information are not merely algorithms. Humans not only affect the design of the algorithms, but they can also manually influence the filtering process even when the algorithm is operational. We further analyze filtering processes in detail, show how personalization connects to other filtering techniques, and show that both human and technical biases are present in today’s emergent gatekeepers. We use the existing literature on gatekeeping and search engine bias and provide a model of algorithmic gatekeeping.

13.
With the ever-increasing size of the web, extracting relevant information from the Internet with a query formed by a few keywords has become a big challenge. Query Expansion (QE) plays a crucial role in improving searches on the Internet. Here, the user’s initial query is reformulated by adding meaningful terms with similar significance. QE, as part of information retrieval (IR), has long attracted researchers’ attention. It has become very influential in the fields of personalized social document, question answering, cross-language IR, information filtering and multimedia IR. Research in QE has gained further prominence because of IR-dedicated conferences such as TREC (Text REtrieval Conference) and CLEF (Conference and Labs of the Evaluation Forum). This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications, bringing out similarities and differences.
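
The basic reformulation step described here, adding related terms to the user's initial query, can be sketched as follows. This is a toy illustration, not a technique taken from the survey: the `related_terms` mapping stands in for whatever knowledge source (thesaurus, corpus statistics, feedback documents) a real system would use, and the cap `max_new_per_term` is an assumed parameter.

```python
from typing import Dict, List


def expand_query(
    query_terms: List[str],
    related_terms: Dict[str, List[str]],
    max_new_per_term: int = 2,
) -> List[str]:
    """Append up to max_new_per_term related terms for each original query term."""
    expanded = list(query_terms)
    seen = set(query_terms)
    for term in query_terms:
        added = 0
        for candidate in related_terms.get(term, []):
            if added >= max_new_per_term:
                break
            if candidate not in seen:  # avoid duplicating terms already in the query
                expanded.append(candidate)
                seen.add(candidate)
                added += 1
    return expanded
```

Keeping the original terms first and capping the number of additions per term is one simple way to limit query drift; weighting the added terms lower than the originals is a common refinement that this sketch omits.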

14.
Semi-supervised document retrieval
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to use the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently. The advantages include the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a neural network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage unlabeled data in learning.
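
For readers unfamiliar with BM25, the traditional IR model that SSRank uses during data labeling, here is a minimal sketch of the standard Okapi BM25 scoring function. It is not the SSRank method itself. The parameter defaults `k1 = 1.2` and `b = 0.75` and the non-negative IDF variant are common conventions assumed here, not values taken from the paper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

In a semi-supervised setting such as the one the abstract describes, scores like these could serve as one signal when assigning labels to unlabeled documents; how SSRank actually combines them is not specified in the abstract.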

15.
Latent semantic indexing (LSI) has been demonstrated to outperform lexical matching in information retrieval. However, the enormous cost associated with the singular value decomposition (SVD) of the large term-by-document matrix becomes a barrier for its application to scalable information retrieval. This work shows that information filtering using level search techniques can reduce the SVD computation cost for LSI. For each query, level search extracts a much smaller subset of the original term-by-document matrix, containing on average 27% of the original non-zero entries. When LSI is applied to such subsets, the average precision can degrade by as much as 23% due to level search filtering. However, for some document collections an increase in precision has also been observed. Further enhancement of level search can be based on a pruning scheme which deletes terms connected to only one document from the query-specific submatrix. Such pruning has achieved a 65% reduction (on average) in the number of non-zeros with a precision loss of 5% for most collections.
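
The pruning scheme mentioned at the end of this abstract, deleting terms connected to only one document from the query-specific submatrix, can be sketched on a sparse term-by-document structure. The nested-dictionary representation and the function names are assumptions made for illustration; the level search step that produces the submatrix is not shown because the abstract does not describe it in detail.

```python
from typing import Dict

# Sparse term-by-document submatrix: term -> {document id -> non-zero weight}.
SparseMatrix = Dict[str, Dict[str, float]]


def prune_singleton_terms(submatrix: SparseMatrix) -> SparseMatrix:
    """Drop every term (row) that has a non-zero entry in only one document."""
    return {term: dict(row) for term, row in submatrix.items() if len(row) > 1}


def count_nonzeros(matrix: SparseMatrix) -> int:
    """Number of stored (non-zero) entries in the sparse matrix."""
    return sum(len(row) for row in matrix.values())
```

Comparing `count_nonzeros` before and after pruning gives the kind of reduction figure the abstract reports, although the actual percentage depends entirely on the collection.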

16.
朱学芳  冯曦曦 《情报科学》2012,(7):1012-1015
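By studying the HTML structure and features of agricultural web pages, this paper describes an experimental study of content-based information extraction and classification for agricultural web pages. In the experiments, the DOM structure is used to extract and preprocess information from agricultural web pages, text category attributes are computed automatically from the text content to obtain feature words, and new documents are classified automatically by summarizing the features of sample documents. The experimental results show that the proposed information extraction has low time complexity and high precision, and that it improves classification accuracy.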

17.
Users of search engines on the World Wide Web frequently search for information about people using personal names. Current search engines only return sets of documents containing the name queried, but, as several people usually share a personal name, the resulting sets often contain documents relevant to several people. It is necessary to disambiguate people in these result sets in order to help users find the person of interest more readily. In the task of name disambiguation, effective measurement of similarities in the documents is a crucial step towards the final disambiguation. We propose a new method that uses web directories as a knowledge base to find common contexts in documents and uses the common contexts measure to determine document similarities. Experiments, conducted on documents mentioning real people on the web, together with several famous web directory structures, suggest that there are significant advantages in using web directories to disambiguate people compared with other conventional methods.

18.
陈京莲 《情报探索》2014,(11):98-101
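This paper reviews the state of research on literature obsolescence and points out the necessity and feasibility, in the network environment, of extending that research to the obsolescence of network information, which is an objective requirement for strengthening and improving network management. In this process, researchers should carry out empirical studies and theoretical exploration in order to build a complete theoretical system and methodology for research on network information obsolescence.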

19.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise due to the difficulties of relevance judgments and several other reasons. To tackle the problem, in this paper we propose a novel approach to learning to rank, based on a probabilistic graphical model. Considering that the observed label might be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true labels and observed labels given features of documents. The graphical model distinguishes the true labels from observed labels, and is specially designed for ranking in information retrieval. Therefore, it helps to learn a more accurate model from noisy training data. Experiments on a real dataset for web search show that the proposed approach can significantly outperform previous approaches.

20.
We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with a fixed learning rate allows 3LM to adapt to a possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on the Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM classifiers whose accuracy had been reported on the same three corpora.
