Similar documents
 Found 20 similar documents; search took 46 ms
1.
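[Purpose/Significance] Prior-art reference documents are key documents for deciding whether a patent can be granted or should be invalidated. Because traditional information retrieval methods fall short and few studies have applied machine learning to reference-document retrieval, this study builds a patent relevance determination model that incorporates reference-document information. [Method/Process] Experiments use target patents and their reference documents, drawn from patent invalidation decisions, as the dataset. Text similarity, co-occurring vocabulary, and co-word count features are extracted, and a GBDT model recasts reference-document retrieval as a classification problem of judging whether a document is relevant. [Results/Conclusions] The results show that different data fields contribute differently to classification performance; the description field achieves precision, recall, and F1 of 79%, 48%, and 59%, respectively. Combining multiple features classifies significantly better than text similarity alone. Finally, the misclassified cases are analyzed and directions for further research are identified.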
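The three feature types named in this abstract (text similarity, co-occurring vocabulary, co-word count) can be sketched for a single patent/reference pair as below. This is only an illustration under stated assumptions: the function name `pair_features`, whitespace tokenization, and cosine similarity over raw term counts are all hypothetical choices, not details given in the abstract, and the GBDT classifier that would consume these features is omitted.

```python
import math
from collections import Counter


def pair_features(target_text, reference_text):
    """Return (cosine similarity, shared vocabulary, shared-word count) for two texts."""
    # Assumption: whitespace tokenization; real patent text needs a proper segmenter.
    a = Counter(target_text.lower().split())
    b = Counter(reference_text.lower().split())

    # Words that occur in both documents, sorted for a stable result.
    shared = sorted(set(a) & set(b))

    # Cosine similarity over raw term-frequency vectors.
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    similarity = dot / norm if norm else 0.0

    return similarity, shared, len(shared)
```

Each pair's feature tuple, together with a relevant/not-relevant label, would form one training row for whatever classifier is used.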

2.
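Summarizing recent developments in computer science, artificial intelligence, and patent document processing, this paper introduces the latest advances in computer-based patent document retrieval from five angles: multilingual mixed retrieval, classification-based retrieval, semantic retrieval, image retrieval, and supporting technologies. Improvements in machine translation and in multilateral common classification systems help raise retrieval efficiency and remove language barriers, while advances in semantic retrieval, image retrieval, and automatic document processing are expected to make intelligent retrieval systems for users at different levels achievable.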

3.
The images found within biomedical articles are sources of essential information useful for a variety of tasks. Due to the rapid growth of biomedical knowledge, image retrieval systems are increasingly becoming necessary tools for quickly accessing the most relevant images from the literature for a given information need. Unfortunately, article text can be a poor substitute for image content, limiting the effectiveness of existing text-based retrieval methods. Additionally, the use of visual similarity by content-based retrieval methods as the sole indicator of image relevance is problematic since the importance of an image can depend on its context rather than its appearance. For biomedical image retrieval, multimodal approaches are often desirable. We describe in this work a practical multimodal solution for indexing and retrieving the images contained in biomedical articles. Recognizing the importance of text in determining image relevance, our method combines a predominately text-based image representation with a limited amount of visual information, in the form of quantized content-based visual features, through a process called global feature mapping. The resulting multimodal image surrogates are easily indexed and searched using existing text-based retrieval systems. Our experimental results demonstrate that our multimodal strategy significantly improves upon the retrieval accuracy of existing approaches. In addition, unlike many retrieval methods that utilize content-based visual features, the response time of our approach is negligible, making it suitable for use with large collections.

4.
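A New Image Retrieval Algorithm for Digital Libraries   Total citations: 1, self-citations: 0, citations by others: 1
An image retrieval algorithm suited to libraries is proposed that combines visual features with high-level semantics, and a dynamic similarity measure is constructed through relevance feedback. Experimental results show that retrieval combining visual and semantic features achieves a higher retrieval rate than retrieval using visual features alone.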

5.
This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.

6.
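Concept Compatibility Conversion of Indexing Languages Based on Parallel Document Databases   Total citations: 3, self-citations: 0, citations by others: 3
张雪英 《情报学报》2005,24(2):161-168
This paper proposes the RST model, a concept semantic similarity measure based on parallel document databases that is suited to automatic compatibility conversion between concepts of different indexing languages. Built on basic theory from rough sets and indexing languages, the RST model can explicitly define the semantic relations and the degree of similarity between concepts. Experiments show that the RST model clearly outperforms two existing methods and can be widely applied in integrated retrieval systems spanning electronic document databases and search engines, enabling effective cross-database retrieval with a single indexing language.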

7.
Information Retrieval from Documents: A Survey   Total citations: 4, self-citations: 0, citations by others: 4
Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods. Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.

8.
In this paper a distance-angle-based visual retrieval tool DARE is introduced. The distance-based similarity distribution, angle-based similarity distribution, and the differences of their distributions in the visual space are analyzed. The document cluster analysis in the visual space and the document cluster comparison between the document space and the visual space are addressed. A new concept, Distance to Reference Axis, is introduced to better understand the visual space. The impact of other operations in DARE on the document distribution is discussed. Future research directions including significance of the index term distribution in the visual space and a user study are addressed.

9.
The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text   Total citations: 1, self-citations: 1, citations by others: 0
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

10.
Locating and Recognizing Text in WWW Images   Total citations: 4, self-citations: 0, citations by others: 4
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and fuzzy n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.

11.
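Teaching the Document Information Retrieval Course Based on Graduation Thesis Proposals   Total citations: 1, self-citations: 0, citations by others: 1
The article examines in depth the theoretical basis for teaching the document information retrieval course around the graduation thesis proposal, the organization and delivery of the course content, and its effectiveness, offering practical and theoretical reference for improving teaching outcomes in the course.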

12.
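There is not, and never will be, a once-and-for-all answer to "What is library science?". Library science is certainly a field of learning and a discipline, but the label "science" should be used with caution; it is a humanistic "discipline" rather than a humanistic "science". Library science is a "study of people" that aims to express the values and ideals of the library "person"; from the height and stance of humanistic philosophy, and by way of synthesis, reflection, and critique, it addresses the meaning and value of the library "person's" existence and the question of their realization, organically combining the perceptual and the rational...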

13.
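A Web Information Retrieval Method Based on Page Segmentation   Total citations: 2, self-citations: 0, citations by others: 2
A Web information retrieval algorithm based on page content segmentation is proposed. Exploiting the semi-structured nature of web pages, the algorithm partitions a page into regions according to its HTML tags and content. After an HTML tag tree is built, nodes are merged using content similarity and visual similarity. During retrieval and ranking, region information is fully used to rank the relevant results for the user's query.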

14.
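This paper proposes a cluster-topic-oriented text feature representation: a text's feature vector is described in terms of the topic concepts of clusters, raising the text description to the semantic level. First, clustering yields a set of latent topic concepts expressed as vectors; text feature vectors in the term space are then projected onto these topic concepts, so that each text is described by latent topic concepts. Experimental analysis shows that a text vector built on the concept space is in essence the degree of association between the text vector and the topic concepts; it highlights the topical features of the text and better reflects its semantic content, effectively improving the model's performance in applications such as text retrieval and classification. Moreover, because the dimensionality of the cluster-derived concept space can be adjusted at will, the concept space can be reduced effectively, improving the model's practical efficiency.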
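The projection step described in this abstract can be sketched as below. It is a minimal illustration, assuming that each topic concept is the centroid of an already computed cluster and that "degree of association" is measured by cosine similarity; neither choice is confirmed by the abstract, and the helper names `cosine`, `centroid`, and `to_concept_space` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors (one topic concept)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def to_concept_space(doc_vector, topic_vectors):
    """Re-express a term-space vector as its association with each topic concept."""
    # Assumption: association is cosine similarity; the output has one
    # dimension per topic, so the number of clusters sets the dimensionality.
    return [cosine(doc_vector, topic) for topic in topic_vectors]
```

With this layout, choosing fewer clusters directly yields a lower-dimensional document representation, which is the dimensionality adjustment the abstract refers to.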

15.
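Text classification is one of the important applications in information retrieval. Because all documents are represented with a uniform feature-vector format, each document's feature vector is high-dimensional and sparse, which hurts classification performance and accuracy. To improve the accuracy of text feature selection, this paper first proposes an improved feature selection function based on information gain. KNN (K-Nearest Neighbor) is widely used in text classification; to address the heavy computation of classic KNN and the low precision of its class labeling function, the paper proposes a weighted KNN algorithm based on training-set pruning. The algorithm improves computational efficiency by pruning the training set and improves accuracy through a fuzzy-set membership function. Experimental results and analysis on public data demonstrate the algorithm's effectiveness.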
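To make the idea of a weighted KNN vote concrete, here is a minimal sketch. It uses plain inverse-distance weighting as a stand-in: the abstract's fuzzy-set membership function and its training-set pruning step are not specified there and are not reproduced here. The function name `weighted_knn` and the Euclidean distance are assumptions.

```python
import math
from collections import defaultdict


def weighted_knn(query, training, k=3):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest neighbours.

    `training` is a list of (vector, label) pairs.
    """
    # Keep the k training items closest to the query (Euclidean distance).
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]

    votes = defaultdict(float)
    for vector, label in neighbours:
        # Closer neighbours count more; the small constant avoids division by zero.
        votes[label] += 1.0 / (math.dist(query, vector) + 1e-9)

    # Return the label with the largest total weight.
    return max(votes, key=votes.get)
```

Unlike a simple majority vote, a single very close neighbour can outweigh several distant ones, which is the behavior a weighted class labeling function is meant to provide.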

16.
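Research on the Representation of Personalized Profiles in Intelligent Information Retrieval   Total citations: 3, self-citations: 2, citations by others: 3
In intelligent information retrieval, how the personalized profile is described and updated determines the efficiency of document filtering. Drawing on the properties of the Huffman tree, this paper proposes organizing the user's personalized profile as a Huffman tree and gives the corresponding document filtering algorithm. Compared with other personalized-profile filtering algorithms, it uses less space and filters faster.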
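As a rough illustration of why a Huffman tree suits a weighted profile, the sketch below builds a Huffman tree over profile terms and reports each term's depth, so that heavily weighted terms end up nearest the root. The function name `huffman_depths` and the use of interest weights as Huffman frequencies are assumptions; the abstract does not describe how the tree is actually used during filtering, so that part is left out.

```python
import heapq
from itertools import count


def huffman_depths(term_weights):
    """Return {term: depth} for a Huffman tree built from {term: weight}."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(weight, next(tie), {term: 0}) for term, weight in term_weights.items()]
    if not heap:
        return {}
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees; every term in them moves one level down.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {term: depth + 1 for term, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))

    return heap[0][2]
```

Terms with larger weights receive smaller depths, so a lookup that walks from the root reaches the user's strongest interests first.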

17.
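With the rapid development of computer networks and multimedia information, content-based image retrieval has become a research hotspot. This paper introduces image retrieval based on color, shape, texture, and spatial features, explains the shortcomings of retrieval using a single feature, and proposes and analyzes several image retrieval techniques that combine multiple features.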

18.
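A LAN-based multimedia-assisted teaching system is an integrated whole made up of a hardware environment, a software system, and a teaching resource repository. Drawing on research into building and deploying an assisted teaching system for the literature retrieval course, this paper analyzes models for integrating teaching resources with course teaching, discusses the models and characteristics of teaching the literature retrieval course in distance education, and points out directions for the development of assisted teaching systems in a networked environment.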

19.
20.
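A Database Approach to Image Indexing and Retrieval   Total citations: 3, self-citations: 0, citations by others: 3
The rapid growth of image resources poses new challenges and compels in-depth study of indexing and retrieval techniques. This paper discusses a database approach to image indexing, covering the extraction of basic image features (color, texture, shape) and the description of external and content features such as classification, subject, title, and creator, with indexes built to support fast retrieval.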
