Similar literature
 Found 20 similar documents; search took 31 ms
1.
A comparative evaluation was carried out on the Philips “DIRECT” and the British “INSPEC” retrieval systems. DIRECT is based on automatic indexing, whereas INSPEC uses manual subject indexing. Two queries were submitted to both systems, using the same database. The results are expressed in terms of recall and precision. Both recall and precision of INSPEC were found to be higher than those of DIRECT by 20%. It is concluded that this is mainly a result of the query formulation. The effectiveness obtained with automatic indexing of documents is equivalent to that of the manual procedure.

2.
3.
4.
The Defense Documentation Center (DDC), a field activity of the Defense Supply Agency, implemented an automated indexing procedure in October 1973. This Machine-Aided Indexing (MAI) System [1] had been under development since 1969. The following is a report of several comparisons designed to measure the retrieval effectiveness of MAI and manual indexing procedures under normal operational conditions. Several definitions are required in order to clarify the MAI process as it pertains to these investigations. The MAI routines scan unedited text in the form of titles and abstracts. The output of these routines is called Candidate Index Terms. These word strings are matched by computer against an internal file of manually screened and cross-referenced terms called a Natural Language Data Base (NLDB). The NLDB differs from a standard thesaurus in that there is no related term category. Word strings which match the NLDB are accepted as valid MAI output. The mismatches are manually screened for suitability. Those accepted are added to the NLDB. If the original set of Candidate Index Terms is now matched against the updated NLDB, the matched output is unedited MAI. If both the unedited matches and mismatches are further structured in accession order and sent to technical analysts for review, the output of that process is called edited MAI. The tests were designed to (a) compare unedited MAI with manual indexing, holding the indexing language and the retrieval technique constant; (b) compare edited MAI with unedited MAI, holding both the indexing and the retrieval technique constant; and (c) compare two different retrieval techniques, called simple and complex, while holding the indexing constant.
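
The matching loop described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the DDC implementation: the function name `match_candidates`, the sample terms, and the stand-in for the manual screening decision are all hypothetical.

```python
# Hypothetical sketch of the MAI matching loop; all names and data are illustrative.

def match_candidates(candidates, nldb):
    """Split candidate index terms into those found in the NLDB and those not."""
    matched = [term for term in candidates if term in nldb]
    mismatched = [term for term in candidates if term not in nldb]
    return matched, mismatched

# Assumed toy NLDB and Candidate Index Terms scanned from a title/abstract.
nldb = {"automatic indexing", "information retrieval"}
candidates = ["automatic indexing", "retrieval effectiveness", "field activity"]

# First pass: matches are accepted, mismatches go to manual screening.
matched, mismatched = match_candidates(candidates, nldb)

# Manual screening is a human step; this set is a stand-in for that decision.
screened_ok = {"retrieval effectiveness"}
nldb.update(term for term in mismatched if term in screened_ok)

# Second pass against the updated NLDB: the matched output is "unedited MAI".
unedited_mai, still_unmatched = match_candidates(candidates, nldb)
print(unedited_mai)
```

Edited MAI, which the abstract describes as the result of analyst review of both matches and mismatches, is a human process and is not modelled here.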

5.

6.
Determining requirements when searching for and retrieving relevant information suited to a user’s needs has become increasingly important and difficult, partly due to the explosive growth of electronic documents. The vector space model (VSM) is a popular method in retrieval procedures. However, the weakness of the traditional VSM is that the indexing vocabulary changes whenever the document set, the indexing vocabulary selection algorithms, or the parameters of those algorithms change, or when wording evolves. The major objective of this research is to design a method that solves these problems for patent retrieval. The proposed method uses a special characteristic of patent documents, the International Patent Classification (IPC) codes, to generate the indexing vocabulary for representing all the patent documents. The advantage of the generated indexing vocabulary is that it remains unchanged even if the document sets, selection algorithms, and parameters are changed, or if wording evolution occurs. The proposed method is compared with two traditional methods (entropy and chi-square) in manual and automatic evaluations to verify its feasibility and validity. The results also indicate that the IPC-based indexing vocabulary selection method achieves a higher accuracy and is more satisfactory.

7.
This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.

8.
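Selecting an optimal automatic indexing scheme for web pages and evaluating its indexing performance   Total citations: 2, self-citations: 0, citations by others: 2
仲云云  侯汉清  薛鹏军 《情报科学》 (Information Science) 2002,20(10):1108-1110
This paper introduces three automatic indexing schemes for web pages. By comparing the manual and automatic indexing results for 50 web pages from “中国经济网” (China Economic Net), one scheme is selected as the best: weighting the different parts of a page's full text and applying weighted term-frequency statistics. Finally, the scheme's automatic subject indexing and automatic classification indexing are each evaluated in terms of the human-machine agreement rate.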

9.
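吕美香 《情报科学》 (Information Science) 2012,(8):1160-1166
The thesaurus is the most important knowledge organization tool in libraries and information retrieval. 《中国分类主题词表》 (the Chinese Classified Thesaurus) is a traditional thesaurus whose updating and maintenance have always been done by hand, which limits its use in digital libraries and networked information environments. This paper presents a statistics-based method that extracts keywords from the titles in metadata records and locates them in the thesaurus. The method has roughly three steps: extract keywords from titles; determine the specificity of the extracted keywords; and locate the highly specific technical terms in the thesaurus. Experiments on the Chinese Classified Thesaurus and on computer science and technology metadata provided by the Shanghai Library show that the method is feasible. The method can be applied to automatic indexing or cataloguing, and it has practical value and broad application prospects.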

10.
Whereas in language words of high frequency are generally associated with low content [Bookstein, A., & Swanson, D. (1974). Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 25(5), 312–318; Damerau, F. J. (1965). An experiment in automatic indexing. American Documentation, 16, 283–289; Harter, S. P. (1974). A probabilistic approach to automatic keyword indexing. PhD thesis, University of Chicago; Sparck-Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21; Yu, C., & Salton, G. (1976). Precision weighting – an effective automatic indexing method. Journal of the Association for Computing Machinery (ACM), 23(1), 76–88], shallow syntactic fragments of high frequency generally correspond to lexical fragments of high content [Lioma, C., & Ounis, I. (2006). Examining the content load of part of speech blocks for information retrieval. In Proceedings of the international committee on computational linguistics and the association for computational linguistics (COLING/ACL 2006), Sydney, Australia]. We apply this finding to Information Retrieval as follows. We present a novel automatic query reformulation technique, which is based on shallow syntactic evidence induced from various language samples and is used to enhance the performance of an Information Retrieval system. Firstly, we draw shallow syntactic evidence from language samples of varying size, and compare the effect of language sample size upon retrieval performance when using our syntactically-based query reformulation (SQR) technique. Secondly, we compare SQR to a state-of-the-art probabilistic pseudo-relevance feedback technique. Additionally, we combine both techniques and evaluate their compatibility. We evaluate our proposed technique across two standard Text REtrieval Conference (TREC) English test collections, and three statistically different weighting models. Experimental results suggest that SQR markedly enhances retrieval performance, and is at least comparable to pseudo-relevance feedback. Notably, the combination of SQR and pseudo-relevance feedback further enhances retrieval performance considerably. These collective experimental results confirm the tenet that high frequency shallow syntactic fragments correspond to content-bearing lexical fragments.

11.
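An ontology-based scheme for automatic full-text indexing   Total citations: 5, self-citations: 1, citations by others: 5
王泰森 《情报科学》 (Information Science) 2003,21(9):950-952
To help improve the precision of full-text retrieval in digital libraries, this paper proposes an ontology-based scheme for automatic full-text indexing. Using ontological methods, the scheme emphasizes the intrinsic conceptual relations between terms. It targets several problems: traditional manual indexing cannot fully summarize the full text; index terms lack conceptual links to one another and so poorly reflect the full subject content of a document; and polysemy and synonymy cause either missed documents or result sets too large to be useful, so that retrieval falls short of the desired effect. The scheme also enables digital libraries to automate subject indexing.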

12.
In image retrieval, most systems lack user-centred evaluation since they are assessed by some chosen ground truth dataset. The results reported through precision and recall assessed against the ground truth are thought of as being an acceptable surrogate for the judgment of real users. Much current research focuses on automatically assigning keywords to images for enhancing retrieval effectiveness. However, evaluation methods are usually based on system-level assessment, e.g. classification accuracy based on some chosen ground truth dataset. In this paper, we present a qualitative evaluation methodology for automatic image indexing systems. The automatic indexing task is formulated as one of image annotation, or automatic metadata generation for images. The evaluation is composed of two individual methods. First, the automatic indexing annotation results are assessed by human subjects. Second, the subjects are asked to annotate some chosen images as the test set, and their annotations are used as ground truth. The system is then run on the test set, and its annotation results are judged against the ground truth. Only one of these methods is reported for most systems on which user-centred evaluation is conducted. We believe that both methods need to be considered for full evaluation. We also provide an example evaluation of our system based on this methodology. According to this study, our proposed evaluation methodology is able to provide deeper understanding of the system’s performance.

13.
Search engines play an essential role in the usability of Internet-based information systems; without them the web would certainly break down or, at the very least, would develop at a much slower rate. Our main objective is to analyze and evaluate the retrieval effectiveness of various indexing and searching strategies on a new web text collection, using a rigorous evaluation methodology. Our second aim is to describe and evaluate different preprocessing techniques that might be implemented in order to improve retrieval effectiveness. As a third objective, this paper will evaluate whether or not hyperlinks may serve as useful sources of evidence in improving retrieval algorithms.

14.
15.
Many information retrieval systems use the inverted file as their indexing structure. The inverted file, however, requires inefficient reorganization when new documents are added to an existing collection. Most studies suggest dealing with this problem by reserving free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocating the spare space. This approach estimates the space requirements in an inverted file using only a small amount of recent statistical data on space usage and the document update request rate. For the best indexing speed and space efficiency, the amount of spare space to be allocated is determined by adaptively balancing the trade-off between reorganization reduction and space utilization. Experimental results show that the proposed space-sparing approach substantially reduces reorganization when updating an inverted file, while unused free space is kept under control so that file access speed is not affected.

16.
This paper describes a technique for automatic book indexing. The technique requires a dictionary of terms that are to appear in the index, along with all text strings that count as instances of the term. It also requires that the text be in a form suitable for processing by a text formatter. A program searches the text for each occurrence of a term or its associated strings and creates an entry to the index when either is found. The results of the experimental application to a portion of a book text are presented, including measures of precision and recall, with precision giving the ratio of terms correctly assigned in the automatic process to the total assigned, and recall giving the ratio of correct terms automatically assigned to the total number of term assignments according to a human standard. Results indicate that the technique can be applied successfully, especially for texts that employ a technical vocabulary and where there is a premium on indexing exhaustivity.
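
The precision and recall definitions given in this abstract can be worked through with a small example. The sketch below is illustrative only: the function name `precision_recall` and the (term, page) assignments are made up, and treating an assignment as a (term, page) pair is an assumption, not something the abstract specifies.

```python
# Illustrative only: the assignments below are invented, not data from the paper.

def precision_recall(automatic, human):
    """Precision = correct automatic assignments / all automatic assignments.
    Recall = correct automatic assignments / all assignments in the human standard."""
    automatic, human = set(automatic), set(human)
    correct = automatic & human
    precision = len(correct) / len(automatic) if automatic else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall

# Each assignment is assumed to be a (term, page) pair.
automatic = {("indexing", 3), ("recall", 7), ("thesaurus", 9), ("formatter", 2)}
human = {("indexing", 3), ("recall", 7), ("thesaurus", 9),
         ("precision", 7), ("dictionary", 1)}

p, r = precision_recall(automatic, human)
print(p, r)  # 3 of 4 automatic assignments are correct; 3 of 5 human ones are found
```

Here three of the four automatic assignments agree with the human standard, so precision is 3/4, and three of the five human assignments were found automatically, so recall is 3/5.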

17.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information more exactly from web pages. After automatic segmentation of the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are finally assigned to the test document by using the k-NN text categorization algorithm. The effects of the characteristics of the chemistry dictionary and the test collection on categorization efficiency are discussed in this paper, and a new voting method based on the collection characteristics is also introduced to further improve categorization performance. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.

18.
It is well known that relevance feedback is a significant method for improving the effectiveness of information retrieval systems. Improving effectiveness is important since these information retrieval systems must gain access to large document collections distributed over different distant sites. As a consequence, the effort needed to retrieve relevant documents has become significantly greater. Relevance feedback can be viewed as an aid to the information retrieval task. In this paper, a relevance feedback strategy is presented. The strategy is based on back-propagation of the relevance of retrieved documents using an algorithm developed in a neural approach. This paper describes a neural information retrieval model and emphasizes the results obtained with the associated relevance back-propagation algorithm in three different environments: manual ad hoc, automatic ad hoc and mixed ad hoc strategy (automatic plus manual ad hoc).

19.
In this paper, we lay out a relational approach for indexing and retrieving photographs from a collection. The increase of digital image acquisition devices, combined with the growth of the World Wide Web, requires the development of information retrieval (IR) models and systems that provide fast access to images searched by users in databases. The aim of our work is to develop an IR model suited to images, integrating rich semantics for representing this visual data and user queries, which can also be applied to large corpora.

20.
A variety of abstract automatic indexing models have been developed in recent times in an effort to produce indexing methods that are both effective and usable in practice. Among these are the term discrimination model and the term precision system. These two indexing systems are briefly described and experimental evidence is cited showing that a combination of both theories produces better retrieval performance than either one alone. Appropriate conclusions are reached concerning viable automatic indexing procedures usable in practice.
