Found 20 similar documents; search took 664 ms.
1.
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
2.
Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR’ed
to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving
Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different
correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error
rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available,
then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction
with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to
be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction
can minimize the need for morphologically sensitive error correction.
Kareem Darwish
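The abstract above reports that short character n-grams are robust index terms for OCR-degraded text. A minimal sketch of why follows: one misrecognized character damages only the n-grams that span it, so a corrupted word still shares most of its index terms with the clean word. The function names `char_ngrams` and `overlap` and the use of Jaccard overlap are illustrative assumptions, not the paper's method.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams for each whitespace-separated token.

    Tokens no longer than n are kept whole (an assumption made for this sketch).
    """
    grams = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # A single OCR substitution ("a" read as "e") breaks exact word matching,
    # but most character 3-grams still match.
    print(char_ngrams("retrieval"))
    print(overlap("retrieval", "retrievel"))
```

With word-based indexing the two strings would not match at all; with 3-grams they still share five of nine distinct index terms.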
3.
Finding useful information from large multimodal document collections such as the WWW without encountering numerous false positives poses a challenge to multimedia information retrieval (MMIR) systems. This research addresses the problem of finding pictures. The fact that images do not appear in isolation, but rather with accompanying, collateral text is exploited. Taken independently, existing techniques for picture retrieval using (i) text-based and (ii) image-based methods have several limitations. This research presents a general model for multimodal information retrieval that addresses the following issues: (i) users' information need, (ii) expressing information need through composite, multimodal queries, and (iii) determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need. A machine learning approach is proposed for the latter. The focus is on improving precision and recall in an MMIR system by optimally combining text and image similarity. Experiments are presented which demonstrate the utility of individual indexing systems in improving overall average precision.
4.
This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as for fast initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval encoding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.
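The comparison step described above, ranking pages by the Manhattan distance between fixed-length layout vectors, can be sketched as follows. The vectors here are made-up numbers; how interval encoding actually produces them is not given in the abstract and is not reproduced. The names `manhattan` and `rank_by_layout` are hypothetical.

```python
def manhattan(u: list[float], v: list[float]) -> float:
    """Manhattan (L1) distance between two fixed-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def rank_by_layout(query: list[float], pages: dict[str, list[float]]) -> list[str]:
    """Return page names ordered from most to least similar to the query layout."""
    return sorted(pages, key=lambda name: manhattan(query, pages[name]))


if __name__ == "__main__":
    # Placeholder layout vectors, one per page (illustrative values only).
    pages = {
        "letter": [1, 2, 3],
        "memo": [1, 2, 4],
        "form": [5, 5, 5],
    }
    print(rank_by_layout([1, 2, 3], pages))
```

Because the distance is a sum of absolute differences over a short vector, ranking a large page set needs only one linear pass plus a sort.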
5.
Locating and Recognizing Text in WWW Images (cited 4 times: 0 self-citations, 4 by others)
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and fuzzy n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.
6.
The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.
7.
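Content-based image retrieval is a retrieval technique that takes the physical content of images as its object of processing. Its main implementations are based on colour, texture, shape, spatial position and semantics. Colour-based image retrieval is the most mature of these, whereas semantics-based retrieval is still at the stage of exploration and research. Content-based and text-based retrieval can be combined in a digital library to provide retrieval services jointly. Google offers an effective example of such an attempt at the post-control stage, but a true integration of the two takes place at the preprocessing stage. Applying the combination in digital libraries is feasible and is expected to provide better image retrieval services.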
8.
As the volume and variety of information sources continue to grow, there is increasing difficulty with respect to obtaining
information that accurately matches user information needs. A number of factors affect information retrieval effectiveness
(the accuracy of matching user information needs against the retrieved information). First, users often do not present search
queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often
highly subjective between different users. Third, information sources might contain heterogeneous documents, in multiple formats
and the representation of documents is not unified. This paper discusses an approach for improvement of information retrieval
effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational
intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative
(subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An
information retrieval system is developed whose retrieval effectiveness is evaluated using traditional precision and recall.
9.
10.
Armin Hust, Informatik - Forschung und Entwicklung, 2005, 19(4): 224-238
Information retrieval systems have achieved only small improvements in retrieval performance in recent years.
We are working on new approaches, known as collaborative information retrieval (CIR), that have the potential to achieve
strong improvements. CIR is the method of improving retrieval performance for the current query by exploiting information
from earlier queries. We consider a restricted scenario in which only old queries and the answer documents relevant to them
are available. Under these conditions, new approaches to query expansion methods lead
to improvements in retrieval performance.
The accuracy of ad-hoc document retrieval systems has reached a stable plateau in the last few years. We are working on so-called
collaborative information retrieval (CIR) systems which have the potential to overcome the current limits. We define CIR as
a task in which an information retrieval (IR) system uses information gathered from previous search processes of one or several
users to improve retrieval performance for the current user searching for information. We focus on a restricted setting in
CIR in which only old queries and correct answer documents to these queries are available for improving a new query. For this
restricted setting we propose new approaches for query expansion procedures. We show how CIR methods can improve overall IR
performance.
CR Subject Classification H.3.3
11.
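Existing judgment document retrieval systems are of limited use to users outside the legal profession. At present, research on intelligent retrieval in the legal domain has addressed only the recommendation and classification of legal provisions based on judgment documents, and work on automatically recommending judgment documents themselves is lacking. This paper therefore proposes a method that uses news-like factual text to recommend judgment documents intelligently. Building on existing research, it summarizes the structural and content features of judgment documents, uses news-like factual text to simulate the search queries of users without legal training, builds a judgment document corpus that carries structural and content features, and automatically recommends relevant judgment documents. The results show that, when the structural and content features of the court opinion section are used and the news corpus is represented by feature words, the LambdaMART model performs well on text matching and outperforms traditional full-text retrieval.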
12.
Result merging methods in distributed information retrieval with overlapping databases (cited 5 times: 0 self-citations, 5 by others)
In distributed information retrieval systems, document overlaps occur frequently among different component databases. This
paper presents an experimental investigation and evaluation of a group of result merging methods including the shadow document
method and the multi-evidence method in the environment of overlapping databases. We assume, with the exception of resultant
document lists (either with rankings or scores), no extra information about retrieval servers and text databases is available,
which is the usual case for many applications on the Internet and the Web.
The experimental results show that the shadow document method and the multi-evidence method are the two best methods when
overlap is high, while Round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization
is a better option than linear regression normalization for result merging in a heterogeneous environment.
Sally McClean
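Two of the building blocks named in the abstract above, [0,1] linear normalization of scores and Round-robin merging, can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the rule of keeping only the first occurrence of a duplicated document is a simplification chosen here, and the shadow document and multi-evidence methods are not reproduced.

```python
def normalize_01(scores: list[float]) -> list[float]:
    """Linearly rescale scores so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: map everything to 1.0 (a convention assumed here).
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def round_robin(result_lists: list[list[str]]) -> list[str]:
    """Interleave ranked lists rank by rank, keeping the first copy of each document."""
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


if __name__ == "__main__":
    print(normalize_01([2.0, 4.0, 6.0]))
    # "d2" is returned by both servers, as happens with overlapping databases.
    print(round_robin([["d1", "d2", "d3"], ["d2", "d4"]]))
```

Round-robin needs only the rankings, which matches the paper's setting where nothing beyond the result lists is known about the servers. Normalization matters only for merging methods that compare scores across servers.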
13.
14.
New Mexico State University's Computing Research Lab has participated in research in all three phases of the US Government's Tipster program. Our work on information retrieval has focused on research and development of multilingual and cross-language approaches to automatic retrieval. The work on automatic systems has been supplemented by additional research into the role of the IR system user in interactive retrieval scenarios: monolingual, multilingual and cross-language. The combined efforts suggest that universal text retrieval, in which a user can find, access and use documents in the face of language differences and information overload, may be possible.
15.
Streaming data poses a variety of new and interesting challenges for information retrieval and text analysis. Unlike static
document collections, which are typically analyzed and indexed off-line to support ad-hoc queries, streaming data often must
be analyzed on the fly and acted on as the data passes through the analysis system. Speech is one example of streaming data
that is a challenge to exploit, yet has significant potential to provide value in a knowledge management system. We are specifically
interested in techniques that analyze streaming data and automatically find collateral information, or information that clarifies, expands, and generally enhances the value of the streaming data. We present a system that
analyzes a data stream and automatically finds documents related to the current topic of discussion in the data stream. Experimental
results show that the system generates result lists with an average precision at 10 hits of better than 60%. We also present
a hit-list re-ranking technique based on named entity analysis and automatic text categorization that can improve the search
results by 6%–12%.
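The "precision at 10 hits" figure quoted above is the fraction of the top ten results that are relevant. A minimal sketch with made-up document identifiers and relevance judgments follows; the name `precision_at_k` is an assumption, and dividing by k even when fewer than k results are returned is one common convention, not necessarily the one used in the paper.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    # Divide by k, so a list shorter than k is penalized for the missing results.
    return sum(1 for doc in top if doc in relevant) / k


if __name__ == "__main__":
    ranked = [f"d{i}" for i in range(1, 11)]                    # hypothetical hit list
    relevant = {"d1", "d2", "d3", "d5", "d8", "d9", "d42"}      # hypothetical judgments
    print(precision_at_k(ranked, relevant))
```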
16.
Content-only queries in hierarchically structured documents should retrieve the most specific document nodes which are exhaustive
to the information need. For this problem, we investigate two methods of augmentation, which both yield high retrieval quality.
As retrieval effectiveness, we consider the ratio of retrieval quality and response time; thus, fast approximations to the
'correct' retrieval result may yield higher effectiveness. We present a classification scheme for algorithms addressing this
issue, and adopt known algorithms from standard document retrieval for XML retrieval. As a new strategy, we propose incremental-interruptible retrieval, which allows for instant presentation of the top ranking documents. We develop a new algorithm implementing this strategy
and evaluate the different methods with the INEX collection.
17.
Ding Yi, 《现代图书情报技术》 2005, 21(6): 26-29
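Research on Web information retrieval applies the results of text retrieval research and combines them with ideas from Web graph theory to study retrieval on the Web; it is an effective route to Web knowledge discovery. The traditional HITS method yields information of rather low precision, while PageRank, as a general-purpose search method, cannot be applied to retrieving information on a specific topic. Based on a thorough analysis of existing algorithms such as PageRank and HITS and of similarity computation methods for Web documents, the paper proposes the RG-HITS algorithm for discovering information relevant to a specific query topic on the Web. It combines Web hyperlinks, the information relevance of the knowledge represented in web pages, and the HITS method to search the Web for knowledge related to a specific topic.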
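For orientation, the standard HITS iteration that the abstract above builds on can be sketched as follows. This is the textbook hub/authority update on a small link graph, not RG-HITS: the abstract does not specify how RG-HITS weights links by content relevance, so that part is left out. The function name `hits`, the dictionary graph format and the fixed iteration count are assumptions.

```python
def hits(links: dict[str, list[str]], iterations: int = 50):
    """Textbook HITS: return (hub, authority) scores for a directed link graph.

    links maps each page to the list of pages it points to.
    """
    nodes = set(links) | {target for targets in links.values() for target in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {n: sum(hub[src] for src, targets in links.items() if n in targets)
                for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a page: sum of the authority scores of pages it links to.
        hub = {n: sum(auth[target] for target in links.get(n, ())) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


if __name__ == "__main__":
    # Two pages both point to "c", so "c" becomes the top authority.
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
    print(max(auth, key=auth.get), hub)
```

A topic-sensitive variant would presumably scale each link's contribution by a relevance score between the page content and the query topic, which is the kind of combination the abstract describes.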
18.
19.
This study introduces a novel framework for evaluating passage and XML retrieval. The framework focuses on a user’s effort
to localize relevant content in a result document. Measuring the effort is based on a system guided reading order of documents.
The effort is calculated as the quantity of text the user is expected to browse through. More specifically, this study seeks
evaluation metrics for retrieval methods following a specific fetch and browse approach, where in the fetch phase documents
are ranked in decreasing order according to their document score, as in document retrieval. In the browse phase, for each
retrieved document, a set of non-overlapping passages representing the relevant text within the document is retrieved. In
other words, the passages of the document are re-organized, so that the best matching passages are read first in sequential
order. We introduce an application scenario motivating the framework, and propose sample metrics based on the framework. These
metrics give a basis for the comparison of effectiveness between traditional document retrieval and passage/XML retrieval
and illuminate the benefit of passage/XML retrieval.
20.
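From document retrieval to information retrieval: adjusting the content and methods of retrieval course teaching in the network environment (cited 21 times: 0 self-citations, 21 by others)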
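The biggest changes in moving from document retrieval to information retrieval are, first, the shift from organization based on document units to organization based on information units, and second, the development from manual classification, subject indexing and author indexing, through machine extraction and indexing of subject terms and free terms, to full-text indexing and even hypertext retrieval. Network, hypermedia and intelligent technologies are the key drivers of this change. Teaching the subject as a discipline requires creating a practice-oriented teaching method led by CAI courseware and establishing a basic framework for the information retrieval course. 4 references.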