Similar Documents
20 similar documents found (search time: 491 ms)
1.
Query recommendation has long been considered a key feature of search engines, which can improve users’ search experience by providing useful query suggestions for their search tasks. Most existing approaches to query recommendation aim to recommend relevant queries, i.e., alternative queries similar to a user’s initial query. However, the ultimate goal of query recommendation is to help users reformulate queries so that they can accomplish their search task successfully and quickly. Considering only relevance is therefore not directly aimed at this goal. In this paper, we argue that it is more important to directly recommend queries with high utility, i.e., queries that can better satisfy users’ information needs. For this purpose, we attempt to infer query utility from users’ sequential search behaviors recorded in their search sessions. Specifically, we propose a dynamic Bayesian network, referred to as the Query Utility Model (QUM), to capture query utility by simultaneously modeling users’ reformulation and click behaviors. We then recommend queries with high utility to help users better accomplish their search tasks. We empirically evaluated the performance of our approach on a publicly released query log by comparing it with state-of-the-art methods. The experimental results show that, by recommending high-utility queries, our approach is far more effective in helping users find relevant search results and thus satisfy their information needs.
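The core idea above can be sketched in a few lines. The following is a toy illustration only, not the paper's QUM: it assumes each session is a list of (query, click-count) pairs, and uses a Beta-smoothed click rate as a crude stand-in for the Bayesian-network posterior over query utility; all names and the smoothing parameters are invented for this sketch.

```python
from collections import defaultdict

def estimate_query_utility(sessions, alpha=1.0, beta=2.0):
    """Toy utility estimate: queries that attract clicks when issued
    are assumed to better satisfy the information need.
    `sessions` is a list of sessions; each session is a list of
    (query, num_clicks) pairs in the order the user issued them."""
    clicks = defaultdict(float)
    issued = defaultdict(float)
    for session in sessions:
        for query, num_clicks in session:
            issued[query] += 1.0
            clicks[query] += num_clicks
    # Beta-smoothed click rate as a stand-in for a posterior utility
    return {q: (clicks[q] + alpha) / (issued[q] + alpha + beta)
            for q in issued}

sessions = [
    [("jaguar", 0), ("jaguar car", 2)],
    [("jaguar speed", 1), ("jaguar car", 1)],
]
utility = estimate_query_utility(sessions)
recommended = max(utility, key=utility.get)
```

Recommending by estimated utility rather than by similarity to the initial query is exactly the shift the abstract argues for: here "jaguar car" wins because it consistently led to clicks, not because it is the closest string to "jaguar".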

2.
In this paper, we present Waves, a novel document-at-a-time algorithm for fast computation of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results, which we call waves. Each wave traverses the index, starting from a specific tier level i, and may insert into the answer only those documents that occur in that tier level. After processing a wave, the algorithm checks whether the answer might still be changed by successive waves. A new wave is started only if it has a chance of changing the top-k scores. We show through experiments that such a lazy query-processing strategy yields smaller query processing times than previous approaches proposed in the literature. We present experiments comparing Waves’ performance to state-of-the-art document-at-a-time query processing methods that preserve top-k results, and show scenarios where the method is a good alternative for computing top-k results.
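The wave-by-wave evaluation with early termination can be sketched as follows. This is a generic illustration of the tiered top-k idea under simplifying assumptions (additive partial scores per tier, a global upper bound on remaining gain), not the authors' implementation:

```python
import heapq

def tiered_topk(tiers, k):
    """tiers: list of {doc_id: partial_score}, highest-impact tier first.
    Processes one tier per 'wave' and stops once later waves can no
    longer change the *set* of top-k documents."""
    scores, top = {}, []
    for i, tier in enumerate(tiers):
        # wave i: merge tier i's postings into the accumulated scores
        for doc, s in tier.items():
            scores[doc] = scores.get(doc, 0.0) + s
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) < k:
            continue
        kth = top[-1][1]
        top_ids = {d for d, _ in top}
        # upper bound on what any document outside the top-k could still gain
        remaining = sum(max(t.values(), default=0.0) for t in tiers[i + 1:])
        contender = max((s for d, s in scores.items() if d not in top_ids),
                        default=0.0)
        if kth >= contender + remaining:
            break  # no later wave can change the top-k set
    return [d for d, _ in top]

result = tiered_topk([{1: 5.0, 2: 4.0}, {3: 0.5, 2: 0.5}], k=2)
```

In the example, the first wave already separates documents 1 and 2 from everything the second tier could contribute, so the second wave is skipped entirely; that skipped work is where the speed-up comes from.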

3.
This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.

4.
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
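For reference, the tf-idf relevance model mentioned at the end, applied over per-term document lists, can be written down in a few lines. This is only the plain scoring model (with a conventional sublinear tf weighting), not the paper's compressed index structures:

```python
import math

def tfidf_rank(doc_lists, n_docs, topk=10, conjunctive=False):
    """doc_lists: term -> {doc_id: term frequency in that document}.
    Returns (doc_id, score) pairs, best first."""
    scores, seen = {}, {}
    for term, postings in doc_lists.items():
        idf = math.log(n_docs / len(postings))      # rarer term, higher weight
        for doc, tf in postings.items():
            scores[doc] = scores.get(doc, 0.0) + (1 + math.log(tf)) * idf
            seen[doc] = seen.get(doc, 0) + 1
    if conjunctive:                                 # require every query term
        scores = {d: s for d, s in scores.items() if seen[d] == len(doc_lists)}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topk]

lists = {"genome": {1: 2, 2: 1}, "variant": {1: 1}}
conj = tfidf_rank(lists, n_docs=4, conjunctive=True)
```

The conjunctive variant keeps only documents matched by every term; the disjunctive variant ranks any document matched by at least one term, which is the distinction the abstract's multi-term queries draw.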

5.
6.
Search engines are increasingly going beyond the pure relevance of search results to entertain users with information items that are interesting and even surprising, albeit sometimes not fully related to their search intent. In this paper, we study this serendipitous search space in the context of entity search, which has recently emerged as a powerful paradigm for building semantically rich answers. Specifically, our work proposes to enhance an explorative search system that represents a large sample of Yahoo Answers as an entity network, with a result structuring that goes beyond ranked lists, using composite entity retrieval, which requires a bundling of the results. We propose and compare six bundling methods, which exploit topical categories, entity specializations, and sentiment, and go beyond simple entity clustering. Two large-scale crowd-sourced studies show that users find a bundled organization—especially based on the topical categories of the query entity—to be better at revealing the most useful results, as well as at organizing the results, helping to discover novel and interesting information, and promoting exploration. Finally, a third study of 30 simulated search tasks reveals the bundled search experience to be less frustrating and more rewarding, with more users willing to recommend it to others.

7.
This paper reviews the current status of the Anglophone (Anglo-American) publishing business and draws some comparisons with publishing in other languages. It then critically reviews the impact of the Harry Potter phenomenon and the questionable progress of e-books in the trade sector, using the example of Stephen King’s Riding the Bullet. It also comments on Amazon’s introduction of the Kindle e-book reader.

8.
Traditional pooling-based information retrieval (IR) test collections typically have n = 50–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.
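To give a flavour of this kind of sample-size design, here is the textbook normal-approximation formula for the paired t-test, not Nagata's exact procedure: given an estimate of the within-system variance of an evaluation measure, the minimum between-system difference worth detecting, and the desired significance level and power, it yields a topic set size. The 2σ² factor assumes the two systems' per-topic scores have equal variance and the variance of the score differences is their sum.

```python
import math
from statistics import NormalDist

def topic_set_size(within_var, min_diff, alpha=0.05, power=0.80):
    """n ≈ (z_{1-α/2} + z_{power})² · 2σ² / d², the normal approximation
    to the paired t-test sample-size formula."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    zb = z.inv_cdf(power)           # power quantile
    return math.ceil((za + zb) ** 2 * 2 * within_var / min_diff ** 2)

# e.g. a measure with within-system variance 0.04, detecting a 0.10 difference
n = topic_set_size(within_var=0.04, min_diff=0.10)
```

The formula makes the abstract's main point concrete: n scales linearly with the within-system variance, so a noisy measure can require several times more topics than a stable one under identical statistical requirements.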

9.
Hägglund’s “radical atheism”—innovative thinking within the philosophical current of “speculative materialism”—revitalizes deconstruction and provides an important basis to define parameters for the archivist’s role as activist for social justice. This paper argues postmodern archival theory gets deconstruction wrong by misreading Derrida’s “Archive fever” as a theory of “archontic power”; this misleads archivists on the call for justice. Properly understanding that justice is undecidable, radical atheism explodes the tension between postmodernists’ appreciation of all views and perspectives and their commitment to right unjust relations of power. This paper first advances the negative argument that “Archive fever” is not about power and injustice. It then advances the positive argument that “Archive fever” is Derrida’s effort to look at actual archives to resolve Freud’s problematic theorizing of a “death drive.” In a close and comprehensive reading of “Archive fever,” this paper explores the notion of “archive fever” as a death drive and suggests Derrida’s efforts are inconclusive. Viewed through the lens of radical atheism, the archive’s “traces”—the material of actual archives writ large in the manner of Derrida’s thinking about a universal archive—serve to mark the flow of time. Understanding the structure of the trace reveals the source of internal contradictions, discontinuities, and instabilities in the meaning of all things. It explains why justice is undecidable. In face of the unconditional condition of this undecidability, we as archivists and humans are compelled to make decisions and to act. Deconstruction politicizes our actions and evokes a responsibility that cannot be absolved.

10.
Analyzing archives and finding facts: use and users of digital data records
This article focuses on the use and users of data from the U.S. National Archives and Records Administration (NARA). Who is using archival electronic records, and why? It describes the changes in use, and consequently in user groups, over the last 30 years. The changes in use are related to the evolution of reference services for electronic records at NARA, as well as to growth in the types of electronic records accessioned by NARA. The first user group consisted mainly of researchers with a social science background, who usually expected to handle the data themselves. The user community expanded when electronic records with personal value, like casualty records, were transferred to NARA, and broadened yet again when a selection of NARA’s electronic records became available online. Archivists trying to develop user services for electronic records will find that the needs and expectations of fact- or information-seeking data users are different from those of researchers using and analyzing data files.

11.
In 2004, the Scottish Parliament commissioned an independent review of abuse in children’s residential establishments between 1950 and 1995. In 2007, the review’s findings were published in a report entitled Historical Abuse Systemic Review: Residential Schools and Children’s Homes in Scotland 1950 to 1995, also known as the Shaw Report. In this article, the Shaw Report provides the jumping-off point for a case study of the social justice impact of records. Drawing on secondary literature, interviews, and care-related records, the study identifies narratives that speak to the social justice impact of care records on care-leavers seeking access to them; it also assesses the potential of the surviving administrative records to serve as a foundation on which to construct historical narratives that speak more generally to the experience of children in residential care.

12.
This study detects the lexical environment and maps the thematic context of the words MEMORY and MEMORIES in the social sciences, based on the Social Science Citation Index (SSCI) bibliographic database of the Institute for Scientific Information (USA). The studied material comprises over 3000 documents in English. Corpora and subcorpora of abstract texts are formed; general frequency dictionaries and frequency dictionaries of binary combinations (word pairs) are constructed for each corpus and subcorpus; words and combinations specific to each subcorpus are identified, and corresponding factors (lexical markers) are calculated for them. The paper gives general statistics on the usage of the studied words, tabulates the results of the lexical analysis, and discusses the corresponding semantic maps.
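The frequency-dictionary construction described above can be sketched generically (this is an illustration of the technique, not the authors' tooling): count unigrams and adjacent word pairs ("binary combinations") over a tokenized corpus.

```python
from collections import Counter

def frequency_dictionaries(corpus):
    """corpus: list of tokenized texts (lists of lowercase tokens).
    Returns (unigram_counts, bigram_counts)."""
    uni, bi = Counter(), Counter()
    for tokens in corpus:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return uni, bi

corpus = [["collective", "memory", "studies"], ["memory", "studies"]]
uni, bi = frequency_dictionaries(corpus)
```

Comparing such counts between a subcorpus and the whole corpus (e.g., via frequency ratios) is one standard way to surface the subcorpus-specific "lexical markers" the abstract refers to.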

13.
We address the feature extraction problem for document ranking in information retrieval. We propose LifeRank, a Linear feature extraction algorithm for Ranking. In LifeRank, we regard each document collection for ranking as a matrix, referred to as the original matrix, and optimize a transformation matrix so that a new matrix (dataset) can be generated as the product of the original matrix and the transformation matrix. The transformation matrix projects high-dimensional document vectors into lower dimensions. In principle, there is a vast space of possible transformation matrices, each leading to a different generated matrix. In LifeRank, we produce a transformation matrix so that the generated matrix suits the learning-to-rank problem. Extensive experiments on benchmark datasets show the performance gains of LifeRank in comparison with state-of-the-art feature selection algorithms.
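The matrix mechanics of such a linear projection are simple to demonstrate. Note the stand-in below chooses the transformation matrix W from an SVD (a PCA-style, purely unsupervised choice), whereas LifeRank learns W against a ranking objective; only the X @ W structure is common to both.

```python
import numpy as np

def project_documents(X, k):
    """Reduce d-dimensional document vectors (rows of X) to k dimensions
    via a linear transformation X @ W.  W here = top-k right singular
    vectors of the centered data; LifeRank would instead optimize W."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                 # d x k transformation matrix
    return Xc @ W, W

X = np.random.default_rng(0).normal(size=(100, 20))  # 100 docs, 20 features
Z, W = project_documents(X, 5)                        # project to 5 features
```

Any downstream learning-to-rank model then trains on the 5-column matrix Z instead of the original 20 features, which is the setup the abstract describes.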

14.
Although youth are increasingly going online to fulfill their needs for information, many youth struggle with information and digital literacy skills, such as the abilities to conduct a search and assess the credibility of online information. Ideally, these skills encompass an accurate and comprehensive understanding of the ways in which a system, such as a Web search engine, functions. In order to investigate youths’ conceptions of the Google search engine, a drawing activity was conducted with 26 HackHealth after-school program participants to elicit their mental models of Google. The findings revealed that many participants personified Google and emphasized anthropomorphic elements, computing equipment, and/or connections (such as cables, satellites and antennas) in their drawings. Far fewer participants focused their drawings on the actual Google interface or on computer code. Overall, their drawings suggest a limited understanding of Google and the ways in which it actually works. However, an understanding of youths’ conceptions of Google can enable educators to better tailor their digital literacy instruction efforts and can inform search engine developers and search engine interface designers in making the inner workings of the engine more transparent and their output more trustworthy to young users. With a better understanding of how Google works, young users will be better able to construct effective queries, assess search results, and ultimately find relevant and trustworthy information that will be of use to them.

15.
This study investigates the information seeking behavior of general Korean Web users. The data from transaction logs of selected dates from August 2006 to August 2007 were used to examine characteristics of Web queries and to analyze click logs that consist of a collection of documents that users clicked and viewed for each query. Changes in search topics are explored for NAVER users from 2003/2004 to 2006/2007. Patterns involving spelling errors and queries in foreign languages are also investigated. Search behaviors of Korean Web users are compared to those of users in the United States and other countries. The results show that entertainment is the top-ranked category, followed by shopping, education, games, and computer/Internet. Search topics changed from computer/Internet to entertainment and shopping from 2003/2004 to 2006/2007 in Korea. The ratios of both spelling errors and queries in foreign languages are low. This study reveals differences in search topics among different regions of the world. The results suggest that the analysis of click logs allows for the reduction of unknown or unidentifiable queries by providing actual data on user behaviors and their probable underlying information needs. The implications for system designers and Web content providers are discussed.

16.
This paper presents a Graph Inference retrieval model that integrates structured knowledge resources, statistical information retrieval methods and inference in a unified framework. Key components of the model are a graph-based representation of the corpus and retrieval driven by an inference mechanism achieved as a traversal over the graph. The model is proposed to tackle the semantic gap problem—the mismatch between the raw data and the way a human being interprets it. We break down the semantic gap problem into five core issues, each requiring a specific type of inference in order to be overcome. Our model and evaluation are applied to the medical domain, because search within this domain is particularly challenging and, as we show, often requires inference. In addition, this domain features both structured knowledge resources and unstructured text. Our evaluation shows that inference can be effective, retrieving many new relevant documents that are not retrieved by state-of-the-art information retrieval models. We show that many retrieved documents were not pooled by keyword-based search methods, prompting us to perform additional relevance assessment on these new documents. A third of the newly retrieved documents judged were found to be relevant. Our analysis provides a thorough understanding of when and how to apply inference for retrieval, including a categorisation of queries according to the effect of inference. The inference mechanism promoted recall by retrieving new relevant documents not found by previous keyword-based approaches. In addition, it promoted precision by an effective reranking of documents. When inference is used, performance gains can generally be expected on hard queries. However, inference should not be applied universally: for easy, unambiguous queries and queries with few relevant documents, inference adversely affected effectiveness.
These conclusions reflect the fact that for retrieval as inference to be effective, a careful balancing act is involved. Finally, although the Graph Inference model is developed and applied to medical search, it is a general retrieval model applicable to other areas such as web search, where an emerging research trend is to utilise structured knowledge resources for more effective semantic search.
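The "retrieval as traversal" idea can be illustrated with a generic decayed-activation walk over a concept graph. This sketch is not the authors' inference mechanism; the graph, seed terms, and decay factor are all invented for illustration:

```python
def graph_score(graph, seeds, decay=0.5, max_depth=2):
    """Spread activation from query concept nodes (`seeds`) over a
    knowledge graph; each hop multiplies the score by `decay`, so
    concepts reached by inference score lower than exact matches.
    graph: node -> list of neighbour nodes."""
    scores = {n: 1.0 for n in seeds}
    frontier = dict(scores)
    for _ in range(max_depth):
        nxt = {}
        for node, s in frontier.items():
            for nb in graph.get(node, ()):
                cand = s * decay
                if cand > scores.get(nb, 0.0) and cand > nxt.get(nb, 0.0):
                    nxt[nb] = cand
        scores.update(nxt)
        frontier = nxt
    return scores

# hypothetical medical concept graph
scores = graph_score({"fever": ["infection"], "infection": ["antibiotics"]},
                     seeds=["fever"])
```

Documents attached to "antibiotics" would be retrieved here even though the query never mentions them, which is exactly the recall-promoting behaviour (and, on easy queries, the precision risk) the abstract discusses.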

17.
Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
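The advantage of an AST representation over bag-of-words is easy to demonstrate. The paper works on Java; the sketch below uses Python's standard `ast` module purely for illustration, comparing node-type trigram sets so that renaming every identifier leaves the fingerprint unchanged:

```python
import ast

def ast_fingerprint(source, n=3):
    """Set of AST node-type n-grams; identifier names are ignored, so
    variable renaming (a common plagiarism disguise) has no effect."""
    nodes = [type(node).__name__ for node in ast.walk(ast.parse(source))]
    return {tuple(nodes[i:i + n]) for i in range(len(nodes) - n + 1)}

def similarity(a, b):
    """Jaccard similarity of the two AST fingerprints."""
    fa, fb = ast_fingerprint(a), ast_fingerprint(b)
    return len(fa & fb) / len(fa | fb)

original = "def f(x):\n    return x * x + 1\n"
renamed  = "def g(value):\n    return value * value + 1\n"
sim = similarity(original, renamed)
```

A bag-of-words model would see `f`/`x` versus `g`/`value` as different documents; the AST view scores the renamed copy as identical, which is the false-negative/false-positive trade-off the abstract targets.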

18.
The rapid growth of the Web has increased the difficulty of finding the information that can address the users’ information needs. A number of recommendation approaches have been developed to tackle this problem. The increase in the number of data providers has necessitated the development of multi-publisher recommender systems; systems that include more than one item/data provider. In such environments, preserving the privacy of both publishers and subscribers is a key challenge. In this paper, we propose a multi-publisher framework for recommender systems based on a client–server architecture, which preserves the privacy of both data providers and subscribers. We develop our framework as a content-based filtering system using the statistical language modeling framework. We also introduce AUTO, a simple yet effective threshold optimization algorithm, to find a dissemination threshold for making acceptance and rejection decisions for new published documents. We further propose a language model sketching technique to reduce the network traffic between servers and clients in the proposed framework. Extensive experiments using the TREC-9 Filtering Track and the CLEF 2008-09 INFILE Track collections indicate the effectiveness of the proposed models in both single- and multi-publisher settings.
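To make the notion of a dissemination threshold concrete, here is a deliberately simple stand-in for a threshold optimizer (not the AUTO algorithm itself): on held-out scored documents with relevance labels, pick the score cut-off that maximizes a linear filtering utility in the style of TREC's T9U (+2 per relevant document delivered, −1 per non-relevant).

```python
def optimize_threshold(scores, labels):
    """Exhaustively try each observed score as the dissemination threshold
    and keep the one with the highest utility.  Delivering nothing has
    utility 0, represented by the initial infinite threshold."""
    best_t, best_u = float("inf"), 0.0
    for t in sorted(set(scores)):
        u = sum((2 if rel else -1)
                for s, rel in zip(scores, labels) if s >= t)
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u

# hypothetical validation data: (matching score, is_relevant)
best_t, best_u = optimize_threshold([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0])
```

New documents scoring at or above `best_t` would then be disseminated to the subscriber; anything below is rejected, which is the acceptance/rejection decision the abstract refers to.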

19.
A number of online marketplaces enable customers to buy or sell used products, which raises the need for ranking tools to help them find desirable items among a huge pool of choices. To the best of our knowledge, no prior work in the literature has investigated the task of used product ranking, which has unique characteristics compared with regular product ranking. While there exist a few ranking metrics (e.g., price, conversion probability) that measure the “goodness” of a product, they do not consider the time factor, which is crucial in used product trading due to the fact that each used product is often unique while new products are usually abundant in supply or quantity. In this paper, we introduce a novel time-aware metric—“sellability”, defined as the time duration for a used item to be traded—to quantify its value. In order to estimate the “sellability” values for newly listed used products and to present users with a ranked list of the most relevant results, we propose a combined Poisson regression and listwise ranking model. The model has a good property in fitting the distribution of “sellability”. In addition, the model is designed to optimize loss functions for regression and ranking simultaneously, which is different from previous approaches that are conventionally learned with a single cost function, i.e., regression or ranking. We evaluate our approach in the domain of used vehicles. Experimental results show that the proposed model can improve both regression and ranking performance compared with non-machine learning and machine learning baselines.
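As background for the regression half of the combined model (the listwise ranking loss and the joint optimization are the paper's contribution and are omitted here), a plain Poisson regression of a count-like duration on item features, fit by Newton/IRLS on synthetic data:

```python
import numpy as np

def fit_poisson(X, y, iters=25):
    """Newton/IRLS for Poisson regression: E[y] = exp(X @ w)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ w)
        H = X.T @ (mu[:, None] * X)          # Hessian of negative log-likelihood
        w += np.linalg.solve(H, X.T @ (y - mu))
    return w

# synthetic items: intercept + two features, hypothetical true weights
rng = np.random.default_rng(7)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.5])))
w = fit_poisson(X, y)
order = np.argsort(X @ w)   # shortest predicted time-to-trade first
```

Ranking by the predicted linear score (`order` above) puts the most "sellable" items first; the paper's model additionally shapes that score with a listwise ranking loss rather than relying on the regression fit alone.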

20.
In this paper we consider the problem of indexing heterogeneous structured documents and retrieving semi-structured documents. We propose a flexible paradigm both for indexing such documents and for formulating user queries that specify soft constraints on document structure and content. At the indexing level, we propose a model that achieves flexibility by constructing personalised document representations based on users’ views of the documents. This is obtained by allowing users to specify their preferences on the document sections that they estimate to bear the most interesting information, as well as to linguistically quantify the number of sections that determine a document’s overall potential interest. At the query-language level, we propose a flexible query language for expressing soft selection conditions on both document structure and content.
