Similar literature
Found 20 similar documents (search time: 15 ms)
1.
This paper explores incorporating prior knowledge into support vector machines to compensate for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents: it creates virtual documents by combining pairs of relevant documents for a topic in the training set. A virtual document created this way is expected not only to preserve the topic but also to improve the topical representation by exploiting relevant terms that are not given high importance in the individual real documents. The artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines based on linear, polynomial, and radial-basis function kernels showed the method's effectiveness on the Reuters-21578 set for topics with a small number of relevant documents. The proposed method achieved improvements of 131%, 34%, and 12% in micro-averaged F1 for the 25, 46, and 58 topics with fewer than 10, 30, and 50 relevant training documents, respectively. Analysis of the results indicates that incorporating virtual documents contributes to a steady improvement in performance.
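The pair-combination step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `make_virtual_documents` is hypothetical, and merging a pair by summing term counts is an assumption, since the abstract does not say how two documents are combined.

```python
from collections import Counter
from itertools import combinations


def make_virtual_documents(relevant_docs):
    """Combine every pair of relevant documents into one virtual document.

    Each input document is a list of tokens. A virtual document is the
    bag of words obtained by summing the term counts of its two source
    documents (an assumed merging rule, not confirmed by the abstract).
    """
    bags = [Counter(doc) for doc in relevant_docs]
    return [a + b for a, b in combinations(bags, 2)]


# Three relevant training documents for one (made-up) topic.
docs = [
    ["oil", "price", "opec"],
    ["oil", "barrel", "crude"],
    ["opec", "crude", "output"],
]
virtual = make_virtual_documents(docs)
```

With n relevant documents this yields n(n-1)/2 virtual documents, which is why the approach adds the most training material, relative to what is available, for topics with very few relevant documents.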

2.
Structured document retrieval makes use of document components as the basis of the retrieval process, rather than complete documents. The inherent relationships between these components make it vital to support users’ natural browsing behaviour in order to offer effective and efficient access to structured documents. This paper examines the concept of best entry points, which are document components from which the user can browse to obtain optimal access to relevant document components. In particular this paper investigates the basic characteristics of best entry points.

3.
Document clustering is an important tool for organizing and browsing document collections. In real applications, limited knowledge about the cluster membership of a small number of documents is often available, such as pairs of documents known to belong to the same cluster. This kind of prior knowledge can serve as constraints for the clustering process. We integrate the constraints into the trace formulation of the sum-of-squared-Euclidean-distances function of K-means. The combined criterion function is then transformed into a trace maximization problem, which is optimized by eigen-decomposition. Our experimental evaluation shows that the proposed semi-supervised clustering method achieves better performance than three existing methods.
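For readers unfamiliar with the trace formulation, the standard K-means identity is sketched below. The symbols are assumptions introduced here: $X = [x_1, \dots, x_n]$ is the data matrix, $C_k$ is cluster $k$ with centroid $\mu_k$, and $H$ is the normalized cluster indicator matrix. The constraint matrix $W$ and weight $\lambda$ in the last line are a hypothetical way of adding pairwise constraints; the abstract does not give the paper's exact combined criterion.

```latex
% Sum of squared Euclidean distances written as a trace.
% H_{ik} = 1/\sqrt{|C_k|} if x_i \in C_k, and 0 otherwise, so H^\top H = I.
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
  = \operatorname{tr}(X^\top X) - \operatorname{tr}(H^\top X^\top X H)

% tr(X^T X) is constant, so minimizing J is equivalent to
\max_{H^\top H = I} \; \operatorname{tr}(H^\top X^\top X H)

% Assumed form of a combined criterion with pairwise constraints W:
\max_{H^\top H = I} \; \operatorname{tr}\!\left(H^\top (X^\top X + \lambda W) H\right)
```

Once $H$ is relaxed to any matrix with orthonormal columns, a trace maximization of this kind is solved by the eigenvectors belonging to the $K$ largest eigenvalues of the matrix inside the trace, which is the role eigen-decomposition plays in the abstract.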

4.
Structured document retrieval makes use of document components as the basis of the retrieval process, rather than complete documents. The inherent relationships between these components make it vital to support users’ natural browsing behaviour in order to offer effective and efficient access to structured documents. This paper examines the concept of best entry points, which are document components from which the user can browse to obtain optimal access to relevant document components. It investigates the types of best entry points in structured document retrieval, and their usage and effectiveness in real information search tasks.

5.
This paper presents a relevance model to rank the facts of a data warehouse that are described in a set of documents retrieved with an information retrieval (IR) query. The model is based on language modeling and relevance modeling techniques. We estimate the relevance of the facts by the probability of finding their dimension values and the query keywords in the documents that are relevant to the query. The model is the core of the so-called contextualized warehouse, a new kind of decision support system that combines structured data sources and document collections. The paper evaluates the relevance model with the Wall Street Journal (WSJ) TREC test subcollection and a self-constructed fact database.

6.
Document length normalization is one of the fundamental components in a retrieval model because term frequencies can readily be increased in long documents. The key hypotheses in the literature regarding document length normalization are the verbosity and scope hypotheses, which imply that document length normalization should consider the distinguishing effects of verbosity and scope on term frequencies. In this article, we extend these hypotheses to a pseudo-relevance feedback setting by assuming the verbosity hypothesis on the feedback query model, which states that the verbosity of an expanded query should not be high. Furthermore, we postulate the following two effects of document verbosity on a feedback query model, which easily and typically hold in modern pseudo-relevance feedback methods: 1) the verbosity-preserving effect: the query verbosity of a feedback query model is determined by feedback document verbosities; 2) the verbosity-sensitive effect: highly verbose documents more significantly and unfairly affect the resulting query model than normal documents do. By considering these effects, we propose verbosity normalized pseudo-relevance feedback, which is straightforwardly obtained by replacing original term frequencies with their verbosity-normalized term frequencies in the pseudo-relevance feedback method. The results of the experiments performed on three standard TREC collections show that the proposed verbosity normalized pseudo-relevance feedback consistently provides statistically significant improvements over conventional methods, under the settings of the relevance model and latent concept expansion.
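The replacement of raw term frequencies by verbosity-normalized ones can be illustrated with a small sketch. Two things here are assumptions rather than the article's definitions: verbosity is taken to be document length divided by the number of distinct terms, and the feedback query model is a simple pooled, normalized term distribution instead of the relevance model or latent concept expansion.

```python
from collections import Counter


def verbosity(tokens):
    """Assumed definition: document length / number of distinct terms."""
    return len(tokens) / len(set(tokens))


def feedback_model(feedback_docs, normalize=True):
    """Estimate a query model from pseudo-relevant documents.

    With normalize=True, each raw term frequency is divided by the
    verbosity of its document before the documents are pooled.
    Returns a dict mapping each term to a probability.
    """
    pooled = Counter()
    for tokens in feedback_docs:
        v = verbosity(tokens) if normalize else 1.0
        for term, tf in Counter(tokens).items():
            pooled[term] += tf / v
    total = sum(pooled.values())
    return {term: weight / total for term, weight in pooled.items()}


# One highly verbose document and one normal document (made-up data).
verbose_doc = ["jaguar"] * 8 + ["car"] * 2           # verbosity 10 / 2
normal_doc = ["jaguar", "cat", "habitat", "prey"]    # verbosity 4 / 4
raw = feedback_model([verbose_doc, normal_doc], normalize=False)
norm = feedback_model([verbose_doc, normal_doc], normalize=True)
```

In this toy example the verbose document dominates the unnormalized model, which is the verbosity-sensitive effect named in the abstract; after normalization, terms from the normal document receive a larger share of the probability mass.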

7.
A new model for aggregating multiple criteria evaluations for relevance assessment is proposed. An Information Retrieval context is considered, where relevance is modeled as a multidimensional property of documents. The usefulness and effectiveness of such a model are demonstrated by means of a case study on personalized Information Retrieval with multi-criteria relevance. The following criteria are considered to estimate document relevance: aboutness, coverage, appropriateness, and reliability.

8.
This paper describes an automatic approach designed to improve the retrieval effectiveness of very short queries such as those used in web searching. The method is based on the observation that stemming, which is designed to maximize recall, often results in depressed precision. Our approach is based on pseudo-feedback and attempts to increase the number of relevant documents in the pseudo-relevant set by reranking those documents based on the presence of unstemmed query terms in the document text. The original experiments underlying this work were carried out using Smart 11.0 and the lnc.ltc weighting scheme on three sets of documents from the TREC collection with corresponding TREC (title only) topics as queries. (The average length of these queries after stoplisting ranges from 2.4 to 4.5 terms.) Results, evaluated in terms of P@20 and non-interpolated average precision, showed clearly that pseudo-feedback (PF) based on this approach was effective in increasing the number of relevant documents in the top ranks. Subsequent experiments, performed on the same data sets using Smart 13.0 and the improved Lnu.ltu weighting scheme, indicate that these results hold up even over the much higher baseline provided by the new weights. Query drift analysis presents a more detailed picture of the improvements produced by this process.
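A rough sketch of the reranking idea follows. It is not the Smart-based procedure used in the paper: the function name `rerank_by_unstemmed_terms` is hypothetical, and scoring each document by the number of distinct query terms that appear in it in exact (unstemmed) form is an assumption, since the abstract only says that reranking depends on the presence of unstemmed query terms.

```python
def rerank_by_unstemmed_terms(ranked_docs, query_terms, top_n=20):
    """Reorder the top of a ranked list by exact query-term matches.

    ranked_docs: list of (doc_id, text) pairs in initial retrieval order.
    Only the first top_n documents are reordered. The sort is stable, so
    documents with equal match counts keep their initial relative order.
    """
    terms = {t.lower() for t in query_terms}

    def exact_matches(item):
        words = set(item[1].lower().split())
        return len(terms & words)

    head = sorted(ranked_docs[:top_n], key=exact_matches, reverse=True)
    return head + ranked_docs[top_n:]


# Made-up initial ranking, as a stemmed retrieval run might return it.
initial = [
    ("d1", "organic farms and farming"),
    ("d2", "the organization of labour"),
    ("d3", "organic food"),
]
reranked = rerank_by_unstemmed_terms(initial, ["organic", "food"])
```

Documents that matched only through a shared stem (here `d2`) sink below documents that contain the query words as typed, so the pseudo-relevant set used for feedback is more likely to contain relevant documents.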

9.
The Web has become a worldwide source of information and a mainstream business tool. It is changing the way people conduct the daily business of their lives. As these changes are occurring, we need to understand what Web searching trends are emerging within the various global regions. What are the regional differences and trends in Web searching, if any? What is the effectiveness of Web search engines as providers of information? As part of a body of research studying these questions, we have analyzed two data sets collected from queries by mainly European users submitted to AlltheWeb.com on 6 February 2001 and 28 May 2002. AlltheWeb.com is a major and highly rated European search engine. Each data set contains approximately a million queries submitted by over 200,000 users and spans a 24-h period. This longitudinal benchmark study shows that European Web searching is evolving in certain directions. There was some decline in query length, with extremely simple queries. European search topics are broadening, with a notable percentage decline in sexual and pornographic searching. The majority of Web searchers view fewer than five Web documents, spending only seconds on a Web document. Approximately 50% of the Web documents viewed by these European users were topically relevant. We discuss the implications for Web information systems and information content providers.

10.
This study examined the success and information seeking behaviors of seventh-grade science students and graduate students in information science in using the Yahooligans! Web search engine/directory. It investigated these users' cognitive, affective, and physical behaviors as they sought the answer for a fact-finding task. It analyzed and compared the overall patterns of children's and graduate students' Web activities, including searching moves, browsing moves, backtracking moves, looping moves, screen scrolling, target location and deviation moves, and the time they took to complete the task. The authors applied Bilal's Web Traversal Measure to quantify these users' effectiveness, efficiency, and quality of moves they made. Results were based on 14 children's Web sessions and nine graduate students' sessions. Both groups' Web activities were captured online using Lotus ScreenCam, a software package that records and replays online activities in Web browsers. Children's affective states were captured via exit interviews. Graduate students' affective states were extracted from the journal writings they kept during the traversal process. The study findings reveal that 89% of the graduate students found the correct answer to the search task as opposed to 50% of the children. Based on the Measure, graduate students' weighted effectiveness, efficiency, and quality of the Web moves they made were much higher than those of the children. Regardless of success and weighted scores, however, similarities and differences in information seeking were found between the two groups. The poor structure of keyword searching in Yahooligans! was a major factor that contributed to the “breakdowns” children and graduate students experienced. Unlike children, graduate students were able to recover from “breakdowns” quickly and effectively. Three main factors influenced these users' performance: ability to recover from “breakdowns”, navigational style, and focus on task. Children and graduate students made recommendations for improving the Yahooligans! interface design. Implications for Web user training and system design improvements are made.

11.
Personalization can be addressed by adaptability and adaptivity, which have different advantages and disadvantages. This study investigates how digital library (DL) users react to these two techniques. More specifically, we develop a personalized DL to suit the needs of different cognitive styles based on the findings of our previous work [Frias-Martinez, E., Chen, S. Y., & Liu, X. (2008) Investigation of behavior and perception of digital library users: A cognitive style perspective. International Journal of Information Management]. The personalized DL includes two versions: an adaptive version and an adaptable version. The results showed that users not only performed better in the adaptive version but also perceived it more positively. In addition, cognitive styles have great effects on users’ responses to adaptability and adaptivity. These results provide guidance for designers to select suitable techniques to develop personalized DLs.

12.
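Zhang Juan (张娟), Wang Xianghui (王向辉), Fu Ran (付然), Sun Xiaolin (孙晓琳). 《现代情报》 (Journal of Modern Information), 2017, 37(10): 49-52
[Purpose] To organize the information in massive data as knowledge, and to promote knowledge association and knowledge discovery across unit information and document content, a knowledge organization system for unit information is constructed. [Method] Drawing on knowledge resources in the health-preservation (养生) domain, such as the domain ontology and document information, a health-preservation unit information knowledge service system is built. [Results/Conclusions] The "Health-Preservation Unit Information Knowledge Service System" (hereafter the "Health-Preservation Knowledge Service Platform") is an important application demonstration of building a unit information knowledge organization system. It provides semantic retrieval, knowledge browsing, knowledge reasoning, and knowledge discovery services, enabling effective use of the "unit information knowledge organization system" in a big-data environment. [Limitations] Extracting and analyzing unit information from document resources involves artificial intelligence, computer processing, and related technologies, and is technically difficult to implement.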

13.
Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because they use different terminology to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while considering semantic information alone might not improve retrieval performance over keyword-based search, it enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed building either one unified index encompassing both textual and semantic information or separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types, and documents within the same embedding space to facilitate the development of a unified search index consisting of these four information types. We perform experiments on standard and widely used document collections, including Clueweb09-B and Robust04, to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices, which relaxes the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves with our proposed method, retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

14.
Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on a hypothesis that a term’s role in accumulated retrieval sessions in the past affects its general importance regardless. It utilizes availability of past retrieval results consisting of the queries that contain a particular term, retrieved documents, and their relevance judgments. A term’s evidential weight, as we propose in this paper, depends on the degree to which the mean frequency values for the relevant and non-relevant document distributions in the past are different. More precisely, it takes into account the rankings and similarity values of the relevant and non-relevant documents. Our experimental results using standard test collections show that the proposed term weighting scheme improves on conventional TF*IDF and language model based schemes. This indicates that evidential term weights bring in a new aspect of term importance and complement the collection statistics based on TF*IDF. We also show how the proposed term weighting scheme based on the notion of evidential weights is related to the well-known weighting schemes based on language modeling and probabilistic models.

15.
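Research on XML-based digital library retrieval technology   Total citations: 1, self-citations: 0, citations by others: 1
Shen Feiju (申飞驹). 《现代情报》 (Journal of Modern Information), 2010, 30(7): 97-98, 102
With the rapid development of XML digital libraries, how to query and process XML documents quickly and effectively is receiving increasing attention. This paper classifies and compares XML digital library retrieval systems, and describes research directions in XML digital library retrieval from three aspects: retrieval models, document clustering, and indexing techniques.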

16.
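Zhang Jidong (张继东), Cai Xue (蔡雪). 《现代情报》 (Journal of Modern Information), 2019, 39(1): 70-77
[Purpose/Significance] From the perspective of users' behavioral perception, this paper studies the factors that affect continued use by dominant users and browsing users of mobile social networks, providing a theoretical basis for mobile social network information services and offering reference and practical guidance for mobile social network providers. [Method/Process] The paper analyzes the factors influencing the continuance intention of dominant and browsing users, introduces relevant variables, constructs a model of continuance intention for mobile social network information services based on users' behavioral perception, proposes hypotheses, and finally conducts an empirical analysis using structural equation modeling. [Results/Conclusions] Perceived usefulness, perceived ease of use, perceived enjoyment, and perceived quality all significantly affect both dominant and browsing users; service quality, perceived risk, knowledge acquisition, personal innovativeness, social recognition, perceived trust, and perceived switching cost affect the two types of users to different degrees.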

17.
This study aims to explore the relationships between user interaction and digital libraries (DLs) evaluation. User interaction is a multi-dimensional construct and recognized as three dimensions in this study, as user interaction with: information resource; interface; and, tasks. DL evaluation is considered from the user's perspective and defined as users’ perception of DL performance from different perspectives, including the support of DL's interaction design to user interaction (labeled as interaction-design-based (IDB) evaluation), the support of task completion (labeled as task-based evaluation), and a DL's overall performance (labeled as overall evaluation). An experiment with 48 participants was conducted using the China National Knowledge Infrastructure (CNKI (http://cnki.net/), the most widely used digital library in China). Participants searched for four simulated work tasks and one real work task during the experiment, subsequently evaluating their interaction with information resource, interface, and tasks, and DL performance from different perspectives before or after the search. Correlation analysis and stepwise regression analysis were conducted to examine the relationships. The results indicate that a list of factors related to different dimensions of user interaction can significantly predict or be correlated to users’ evaluation of DL performance from different perspectives, including appropriateness, rich and valid links, reasonable page layout, salience of topics, search task difficulty, well-organized web site, easy to learn, accessibility, usefulness, familiarity with task procedure, etc. These factors surface as the most critical criteria for DL evaluation. Based on the results, an integrated DL evaluation framework is developed. The study adds new knowledge about how tasks affect DL evaluation. It has implications for improving the efficiency of DL evaluation and helping DL developers design DLs to better support users’ interaction, task completion, and their overall experience with DLs.

18.
Effective knowledge management in a knowledge-intensive environment can place heavy demands on the information filtering (IF) strategies used to model workers’ long-term task-needs. Because of the growing complexity of knowledge-intensive work tasks, a profiling technique is needed to deliver task-relevant documents to workers. In this study, we propose an IF technique with task-stage identification that provides effective codification-based support throughout the execution of a task. Task-needs pattern similarity analysis based on a correlation value is used to identify a worker’s task-stage (the pre-focus, focus formulation, or post-focus task-stage). The identified task-stage is then incorporated into a profile adaptation process to generate the worker’s current task profile. The results of a pilot study conducted in a research institute confirm that there is a low or negative correlation between search sessions and transactions in the pre-focus task-stage, whereas there is at least a moderate correlation between search sessions/transactions in the post-focus stage. Compared with the traditional IF technique, the proposed IF technique with task-stage identification achieves, on average, a 19.49% improvement in task-relevant document support. The results confirm the effectiveness of the proposed method for knowledge-intensive work tasks.

19.
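In citation retrieval and indexing verification services (查收查引), some documents cannot be found. Failure to retrieve a document affects users and also indicates that the library needs to improve its service quality. The article proposes targeted strategies and suggestions from three perspectives: database vendors, users, and search staff.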

20.
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of using a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a collection structure. We confirm these features and thus show the algorithm's feasibility through clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge of each topic by browsing the Japanese articles and their English translations. We also discuss techniques of presenting a large tree-formed hierarchy on a computer screen.
