Similar Literature
 20 similar documents found (search time: 664 ms)
1.
The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text   (Total citations: 1; self-citations: 1; citations by others: 0)
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

2.
Arabic documents that are available only in print continue to be ubiquitous, and they can be scanned and subsequently OCR’ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to have a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
Kareem Darwish
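The abstract above notes that short character n-grams can serve as index terms for OCR-degraded text. The following is a minimal sketch of that general idea, not the paper's method: the function names `char_ngrams` and `ngram_overlap`, the whitespace tokenization, and the Jaccard overlap measure are all assumptions made for illustration, and an English word stands in for Arabic text.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of each whitespace-separated token.

    Tokens no longer than n characters are kept whole (an illustrative choice).
    """
    grams: list[str] = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A single OCR-style substitution ("l" read as "1") breaks exact word matching,
# but most of the word's trigrams still match.
clean, noisy = "retrieval", "retrieva1"
print(clean == noisy)               # False
print(ngram_overlap(clean, noisy))  # 0.75
```

In this toy case one character error destroys the whole-word match but only one of the seven trigrams, which is the intuition behind indexing on n-grams when no error correction is applied.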

3.
Intelligent Indexing and Semantic Retrieval of Multimodal Documents   (Total citations: 2; self-citations: 0; citations by others: 2)
Finding useful information from large multimodal document collections such as the WWW without encountering numerous false positives poses a challenge to multimedia information retrieval (MMIR) systems. This research addresses the problem of finding pictures. The fact that images do not appear in isolation, but rather with accompanying, collateral text is exploited. Taken independently, existing techniques for picture retrieval using (i) text-based and (ii) image-based methods have several limitations. This research presents a general model for multimodal information retrieval that addresses the following issues: (i) users' information need, (ii) expressing information need through composite, multimodal queries, and (iii) determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need. A machine learning approach is proposed for the latter. The focus is on improving precision and recall in an MMIR system by optimally combining text and image similarity. Experiments are presented which demonstrate the utility of individual indexing systems in improving overall average precision.

4.
This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.
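The comparison step described above, ranking pages by the Manhattan distance between fixed-length layout vectors, can be sketched as follows. The interval encoding itself is not specified in the abstract, so the vectors below are hypothetical placeholders, and the names `manhattan` and `rank_by_layout` are illustrative rather than taken from the paper.

```python
def manhattan(u: list[float], v: list[float]) -> float:
    """Manhattan (L1) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def rank_by_layout(query: list[float], pages: dict[str, list[float]]) -> list[str]:
    """Return page names ordered from most to least similar to the query layout."""
    return sorted(pages, key=lambda name: manhattan(query, pages[name]))


# Hypothetical layout vectors; real ones would come from the feature extractor.
pages = {
    "letter": [1, 2, 3],
    "memo": [1, 2, 4],
    "form": [5, 5, 5],
}
print(rank_by_layout([1, 2, 3], pages))  # ['letter', 'memo', 'form']
```

Because the vectors have a fixed length, each comparison costs a single pass over the vector, which is what makes this kind of distance attractive for fast page-to-page comparison.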

5.
Locating and Recognizing Text in WWW Images   (Total citations: 4; self-citations: 0; citations by others: 4)
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and fuzzy n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.

6.
The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.

7.
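Content-based image retrieval is a retrieval technique that operates on the physical content of images; its main approaches are based on color, texture, shape, spatial position, and semantics. Color-based image retrieval is the most mature of these, whereas semantics-based retrieval is still at the exploratory research stage. In digital libraries, content-based and text-based retrieval can be integrated to provide retrieval services jointly. Google offers an effective example of this attempt at the post-control stage, but true integration of the two takes place at the preprocessing stage. Combining the two in digital libraries is feasible and can be expected to provide better image retrieval services.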

8.
As the volume and variety of information sources continue to grow, it becomes increasingly difficult to obtain information that accurately matches user information needs. A number of factors affect information retrieval effectiveness (the accuracy of matching user information needs against the retrieved information). First, users often do not present search queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often highly subjective and varies between users. Third, information sources might contain heterogeneous documents in multiple formats, and the representation of documents is not unified. This paper discusses an approach for improving information retrieval effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative (subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An information retrieval system is developed whose retrieval effectiveness is evaluated using traditional precision and recall.

9.
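When searching massive amounts of information, information relevant to the user's query is often missed, while information irrelevant to the query (information garbage) appears in large quantities in the search results. Improving the quality of text information retrieval systems and raising retrieval effectiveness has become an urgent problem. This paper examines an easily overlooked factor that can affect retrieval effectiveness, the modifier, and studies its role in text information retrieval. To this end, a Modified Vector Space Model (MVSM) is constructed and tested on English texts to demonstrate the role of modifiers.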

10.
Information retrieval systems have achieved only small improvements in retrieval performance in recent years. We are working on new approaches, so-called collaborative information retrieval (CIR), which have the potential to achieve large improvements. CIR is the method by which information from earlier queries is exploited to improve retrieval performance for the current query. We consider a restricted scenario in which only old queries and their relevant answer documents are available. Under these conditions, new approaches to query expansion lead to improvements in retrieval performance. The accuracy of ad-hoc document retrieval systems has reached a stable plateau in the last few years. We are working on so-called collaborative information retrieval (CIR) systems which have the potential to overcome the current limits. We define CIR as a task where an information retrieval (IR) system uses information gathered from previous search processes from one or several users to improve retrieval performance for the current user searching for information. We focus on a restricted setting in CIR in which only old queries and correct answer documents to these queries are available for improving a new query. For this restricted setting we propose new approaches for query expansion procedures. We show how CIR methods can improve overall IR performance.
CR Subject Classification H.3.3

11.
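Liang Zhu (梁柱), Shen Si (沈思), Ye Wenhao (叶文豪), Wang Dongbo (王东波). 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2022, 41(2): 167-175
Existing judgment-document retrieval systems are limiting for users outside the legal profession. To date, intelligent retrieval in the legal domain has studied only the recommendation and classification of legal provisions based on judgment documents; research on automatically recommending the judgment documents themselves is lacking. This paper therefore proposes a method for intelligently recommending judgment documents from news-like factual text. Building on current research, it summarizes the structural and content features of judgment documents, uses news-like factual text to simulate the search queries of non-legal users, constructs a judgment-document corpus that carries these structural and content features, and automatically recommends relevant judgment documents. The results show that, when the structural and content features of the court-opinion section are used and the news corpus is represented by feature words, the LambdaMART model performs well on text matching and outperforms traditional full-text retrieval.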

12.
In distributed information retrieval systems, document overlaps occur frequently among different component databases. This paper presents an experimental investigation and evaluation of a group of result merging methods, including the shadow document method and the multi-evidence method, in the environment of overlapping databases. We assume that, apart from the resulting document lists (either with rankings or scores), no extra information about retrieval servers and text databases is available, which is the usual case for many applications on the Internet and the Web. The experimental results show that the shadow document method and the multi-evidence method are the two best methods when overlap is high, while round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization is a better option than linear regression normalization for result merging in a heterogeneous environment.
Sally McClean
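Two of the building blocks named in the abstract above, [0,1] linear normalization of scores and round-robin merging, can be sketched as follows. This is an illustrative sketch only: the function names, the handling of equal scores, and the skipping of duplicate documents are assumptions, and the shadow document and multi-evidence methods are not shown because the abstract does not define them.

```python
def normalize_01(scores: dict[str, float]) -> dict[str, float]:
    """Linearly rescale one server's scores so the lowest maps to 0 and the highest to 1."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: assumption made here is to map them all to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def round_robin(ranked_lists: list[list[str]]) -> list[str]:
    """Interleave ranked lists one rank at a time, keeping only the first copy of a document."""
    merged: list[str] = []
    seen: set[str] = set()
    depth = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(depth):
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged


print(normalize_01({"d1": 2.0, "d2": 4.0, "d3": 3.0}))
# {'d1': 0.0, 'd2': 1.0, 'd3': 0.5}
print(round_robin([["a", "b", "c"], ["b", "d"]]))
# ['a', 'b', 'd', 'c']
```

Normalizing each server's scores to a common range makes scores from different servers comparable before merging, whereas round-robin ignores scores entirely and relies only on rank positions.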

13.
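Drawing on recent developments in computer science, artificial intelligence, and patent document processing, this paper introduces the latest advances in computer-based patent document retrieval from five perspectives: multilingual mixed retrieval, classification-based retrieval, semantic retrieval, image retrieval, and supporting technologies. Improvements in machine translation and in multilateral common classification systems help raise the efficiency of computer retrieval and remove language barriers, while advances in semantic retrieval, image retrieval, and automatic document processing are expected to make possible intelligent computer retrieval systems that serve users at different levels.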

14.
New Mexico State University's Computing Research Lab has participated in research in all three phases of the US Government's Tipster program. Our work on information retrieval has focused on research and development of multilingual and cross-language approaches to automatic retrieval. The work on automatic systems has been supplemented by additional research into the role of the IR system user in interactive retrieval scenarios: monolingual, multilingual and cross-language. The combined efforts suggest that universal text retrieval, in which a user can find, access and use documents in the face of language differences and information overload, may be possible.

15.
Streaming data poses a variety of new and interesting challenges for information retrieval and text analysis. Unlike static document collections, which are typically analyzed and indexed off-line to support ad-hoc queries, streaming data often must be analyzed on the fly and acted on as the data passes through the analysis system. Speech is one example of streaming data that is a challenge to exploit, yet has significant potential to provide value in a knowledge management system. We are specifically interested in techniques that analyze streaming data and automatically find collateral information, or information that clarifies, expands, and generally enhances the value of the streaming data. We present a system that analyzes a data stream and automatically finds documents related to the current topic of discussion in the data stream. Experimental results show that the system generates result lists with an average precision at 10 hits of better than 60%. We also present a hit-list re-ranking technique based on named entity analysis and automatic text categorization that can improve the search results by 6%–12%.
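The "precision at 10 hits" figure quoted above is the fraction of the top ten results that are relevant. A minimal sketch of that standard measure is given below; the function name `precision_at_k` and the document identifiers are hypothetical, and the data is invented purely to show the arithmetic, not taken from the paper's experiments.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    If fewer than k documents are returned, the missing positions count as
    non-relevant, which is the usual convention for this measure.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


# Hypothetical result list d1..d10 and a hypothetical set of relevant documents.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d3", "d5", "d7", "d9", "d10", "d99"}
print(precision_at_k(ranked, relevant))  # 0.6
```

Averaging this value over a set of queries gives the kind of "average precision at 10 hits" that the abstract reports.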

16.
Content-only queries in hierarchically structured documents should retrieve the most specific document nodes which are exhaustive to the information need. For this problem, we investigate two methods of augmentation, which both yield high retrieval quality. As retrieval effectiveness, we consider the ratio of retrieval quality and response time; thus, fast approximations to the 'correct' retrieval result may yield higher effectiveness. We present a classification scheme for algorithms addressing this issue, and adopt known algorithms from standard document retrieval for XML retrieval. As a new strategy, we propose incremental-interruptible retrieval, which allows for instant presentation of the top ranking documents. We develop a new algorithm implementing this strategy and evaluate the different methods with the INEX collection.

17.
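Research on Web information retrieval applies the results of text retrieval research and, combined with ideas from Web graph theory, studies the retrieval of information on the Web; it is an effective route to Web knowledge discovery. The traditional HITS method yields information of rather low precision, while PageRank, as a general-purpose search method, cannot be applied to acquiring information on a specific topic. Based on a thorough analysis of existing algorithms such as PageRank and HITS and of methods for computing the similarity of Web documents, the RG-HITS algorithm is proposed for discovering information relevant to a specific query topic on the Web. It combines Web hyperlinks, the information relevance of page knowledge representations, and the HITS method to search for topic-specific knowledge on the Web.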

18.
19.
This study introduces a novel framework for evaluating passage and XML retrieval. The framework focuses on a user’s effort to localize relevant content in a result document. Measuring the effort is based on a system guided reading order of documents. The effort is calculated as the quantity of text the user is expected to browse through. More specifically, this study seeks evaluation metrics for retrieval methods following a specific fetch and browse approach, where in the fetch phase documents are ranked in decreasing order according to their document score, like in document retrieval. In the browse phase, for each retrieved document, a set of non-overlapping passages representing the relevant text within the document is retrieved. In other words, the passages of the document are re-organized, so that the best matching passages are read first in sequential order. We introduce an application scenario motivating the framework, and propose sample metrics based on the framework. These metrics give a basis for the comparison of effectiveness between traditional document retrieval and passage/XML retrieval and illuminate the benefit of passage/XML retrieval.

20.
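The biggest changes in moving from document retrieval to information retrieval are twofold: first, the basis of organization has shifted from the document unit to the information unit; second, indexing has evolved from manual classification, subject indexing, and author indexing, through machine extraction and indexing of subject headings and free terms, to full-text indexing and even hypertext retrieval. Network, hypermedia, and intelligent technologies are the key drivers of these changes. Teaching the subject as a discipline requires creating practice-oriented teaching methods led by CAI courseware and establishing a basic framework for the information retrieval curriculum. 4 references.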
