Similar literature
1.
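An automatic Chinese word segmentation method for Chinese information retrieval   Total citations: 3, self-citations: 1, citations by others: 3
The paper sets out the requirements that information retrieval places on Chinese word segmentation, analyzes the key problems still to be solved in combining Chinese information retrieval with word segmentation, and proposes an automatic Chinese word segmentation method for Chinese information retrieval that targets these requirements and problems.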

2.
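A survey of Chinese word segmentation technology   Total citations: 2, self-citations: 1, citations by others: 1
The paper first introduces automatic Chinese word segmentation and word-index-based Chinese full-text retrieval. It then describes in detail how automatic Chinese word segmentation is applied in Chinese full-text retrieval, covering automatic document indexing, automatic abstract generation, automatic text classification, text information filtering, natural language retrieval interfaces, and intelligent retrieval. It analyzes the current limitations of automatic Chinese word segmentation, proposes directions for development, and finally forecasts the prospects for applying the technology in Chinese full-text retrieval.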

3.
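A database retrieval function design that combines the strengths of multiple retrieval techniques   Total citations: 3, self-citations: 0, citations by others: 3
The paper first shows that three problems in information retrieval can be solved by using faceted classification, post-controlled retrieval, and hyperlink retrieval: the advantages of classification-based and subject-based retrieval are hard to obtain together, using natural language hurts recall, and broadening and narrowing a search are hard to make convenient at the same time. It then shows that faceted classification can be used to compile an online vocabulary that integrates classification and subject terms, and that a post-controlled vocabulary with various inter-term relationships can be generated from this online vocabulary. Because hyperlink retrieval can in turn be introduced on top of the networked term relationships of the post-controlled vocabulary, the three techniques can be used together. The result is a retrieval function design that has the advantages of both classification and subject retrieval, uses natural language while maintaining retrieval quality, and supports convenient broadening and narrowing of searches.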

4.
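Using a fuzzy-mathematics comprehensive evaluation model, the title of a science and technology novelty search is split into natural-language words. The degree of implication between search terms is calculated to give a fuzzy quantification of the search strategy process, and a weight matrix is used to evaluate the suitability of each search term comprehensively, extracting the most suitable search terms and the best search expression. A simulated calculation with the model on a real novelty search case shows that the extracted search terms and the search expression built from them match the actual situation fairly well and are reasonably objective and accurate.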
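As a rough illustration of the weighted evaluation step described in this abstract, the sketch below applies the common weighted-average form of fuzzy comprehensive evaluation, B = W · R. The abstract does not give the paper's actual operators, criteria, or weights, so the function name, the three criteria, and all numbers here are hypothetical.

```python
def fuzzy_evaluate(weights, membership):
    """Weighted-average fuzzy comprehensive evaluation: B = W . R.

    weights    -- one weight per criterion, summing to 1
    membership -- membership[i][j] is the degree (0..1) to which candidate j
                  satisfies criterion i
    Returns one composite score per candidate.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_candidates = len(membership[0])
    return [
        sum(w * row[j] for w, row in zip(weights, membership))
        for j in range(n_candidates)
    ]


# Hypothetical candidate search terms and criterion weights (not from the paper).
terms = ["term_a", "term_b", "term_c"]
weights = [0.5, 0.3, 0.2]
membership = [
    [0.8, 0.4, 0.6],  # criterion 1 scores for term_a, term_b, term_c
    [0.6, 0.9, 0.3],  # criterion 2
    [0.5, 0.5, 1.0],  # criterion 3
]

scores = fuzzy_evaluate(weights, membership)
best_term = terms[max(range(len(terms)), key=scores.__getitem__)]
```

The candidate with the highest composite score is taken as the most suitable term; other fuzzy operators (for example max-min composition) would give a different ranking.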

5.
Applying Machine Learning to Text Segmentation for Information Retrieval   Total citations: 2, self-citations: 0, citations by others: 2
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary-based, character-based and mutual-information-based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.

6.
Although always present in text, word sense ambiguity only recently became regarded as a potentially solvable problem for information retrieval. The growth of interest in word senses resulted from new directions taken in disambiguation research. This paper first outlines this research and surveys the resulting efforts in information retrieval. Although the majority of attempts to improve retrieval effectiveness were unsuccessful, much was learnt from the research, most notably a notion of the circumstances under which disambiguation may prove of use to retrieval.

7.
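Semantic retrieval of academic resources based on word-vector expansion   Total citations: 1, self-citations: 0, citations by others: 1
[Purpose/Significance] Guided by a statistical approach, this paper explores a semantic retrieval technique based on word-vector expansion to improve semantic retrieval of academic resources. [Method/Process] Using natural language processing and text mining techniques, the metadata of collected academic resources (mainly academic papers) is preprocessed, and a semantic retrieval system is built by combining the word2vec word-vector generation tool with the elasticsearch full-text search engine, as an exploratory study of semantic retrieval over academic resources. [Result/Conclusion] The proposed method effectively improves the retrieval of academic information, achieves semantic retrieval of academic resources to some extent, and provides a reference for further research on semantic retrieval.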
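To make the idea of word-vector query expansion concrete, here is a minimal sketch that expands a query term with its nearest neighbours by cosine similarity. It is not the system described in this abstract: the toy three-dimensional vectors stand in for trained word2vec embeddings, and the names `expand_query`, `TOY_VECTORS`, `top_k`, and `min_sim` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(term, vectors, top_k=2, min_sim=0.8):
    """Return the term plus up to top_k neighbours with similarity >= min_sim."""
    if term not in vectors:
        return [term]
    target = vectors[term]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [term] + [other for sim, other in scored[:top_k] if sim >= min_sim]


# Toy vectors standing in for trained embeddings (hypothetical values).
TOY_VECTORS = {
    "retrieval": [0.9, 0.1, 0.0],
    "search": [0.8, 0.2, 0.1],
    "segmentation": [0.1, 0.9, 0.2],
    "tokenization": [0.2, 0.8, 0.3],
}

expanded = expand_query("retrieval", TOY_VECTORS)
```

In a full system the expanded term list would then be sent to the search engine as additional, possibly down-weighted, query clauses.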

8.
The application of word sense disambiguation (WSD) techniques to information retrieval (IR) has yet to provide convincing retrieval results. Major obstacles to effective WSD in IR include coverage and granularity problems of word sense inventories, sparsity of document context, and limited information provided by short queries. In this paper, to alleviate these issues, we propose the construction of latent context models for terms using latent Dirichlet allocation. We propose building one latent context per word, using a well-principled representation of local context based on word features. In particular, context words are weighted using a decaying function according to their distance to the target word, which is learnt from data in an unsupervised manner. The resulting latent features are used to discriminate word contexts, so as to constrain the query's semantic scope. Consistent and substantial improvements, including on difficult queries, are observed on TREC test collections, and the technique combines well with blind relevance feedback. Compared to traditional topic modeling, WSD and positional indexing techniques, the proposed retrieval model is more effective and scales well on large-scale collections.
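The distance-based weighting of context words mentioned in this abstract can be sketched as follows. The abstract says the decay function is learnt from data; this sketch instead assumes a fixed exponential decay, and the function name, window size, and decay rate are all illustrative choices rather than the paper's.

```python
import math
from collections import defaultdict


def weighted_context(tokens, target_index, window=3, decay=0.5):
    """Weight each word within `window` positions of the target word.

    Adjacent words get weight 1.0; the weight falls off as
    exp(-decay * (distance - 1)). The exponential form is an assumption.
    """
    weights = defaultdict(float)
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        distance = abs(i - target_index)
        weights[tokens[i]] += math.exp(-decay * (distance - 1))
    return dict(weights)


tokens = "the bank raised interest rates near the river".split()
ctx = weighted_context(tokens, tokens.index("bank"))
```

Words outside the window ("river" in this sentence) contribute nothing, and nearer words dominate the context representation of the target word.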

9.
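The effect of Chinese word segmentation on the retrieval performance of Chinese search engines   Total citations: 3, self-citations: 0, citations by others: 3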
金澎  刘毅  王树梅 《情报学报》2006,25(1):21-24
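Taking the characteristics of Chinese web pages into account, this paper studies the effect of Chinese word segmentation on the retrieval performance of Chinese search engines. It first introduces the role of Chinese word segmentation in search engines and then the commonly used segmentation algorithms. Using web page features, the authors propose a simple "bidirectional matching segmentation strategy with heuristic rules". Finally, on a 10G corpus, they compare experimentally how various segmentation algorithms affect recall and precision. The results show that segmentation performance and retrieval performance are not directly proportional.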
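For readers unfamiliar with bidirectional matching, the sketch below shows the generic textbook version: run forward and backward maximum matching against a dictionary, then pick one result with simple heuristics. The heuristic rules actually used in the paper are not given in the abstract, so the tie-breaking rules here (fewer words, then fewer single-character words, then the backward result) and the tiny vocabulary are assumptions.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_match(text, vocab, max_len=4):
    """Choose between the two results with simple (assumed) heuristics."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd  # prefer fewer words

    def singles(words):
        return sum(1 for w in words if len(w) == 1)

    # Prefer fewer single-character words; fall back to the backward result.
    return fwd if singles(fwd) < singles(bwd) else bwd


# Hypothetical mini-dictionary for illustration only.
VOCAB = {"研究", "研究生", "生命", "起源"}
segmented = bidirectional_match("研究生命起源", VOCAB)
```

On this input the forward pass greedily takes "研究生" and strands "命" as a single character, while the backward pass yields three dictionary words, so the heuristics select the backward result.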

10.
宋明亮 《图书情报工作》1994,38(5):16-18,63
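Improving retrieval efficiency through control is the fundamental aim of research on information retrieval languages. In computerized "natural language retrieval systems", the means, methods, and techniques of control have changed, and these changes have opened up new areas of research in the field: subject-term dictionaries, class-subject dictionaries, post-controlled vocabularies, terminology, and so on.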

11.
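A semantics-oriented information retrieval method   Total citations: 1, self-citations: 0, citations by others: 1
Traditional information retrieval techniques ignore the influence of semantics on the retrieval process, which is an important cause of low precision. This paper proposes a semantics-oriented information retrieval method. The method uses HowNet-based semantic processing to annotate user queries and target documents semantically, and uses HowNet-based lexical chains to filter document feature words. This makes retrieval matching at the semantic level possible and also reduces the interference of large numbers of irrelevant words with the retrieval results. The paper describes SOIRS, an information retrieval system that implements the method, and reports comparative experiments between it and a traditional retrieval system. The experimental results show that the semantics-oriented method is clearly better than traditional information retrieval methods in precision.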

12.
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experimental results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.

13.
Efficient information searching and retrieval methods are needed to navigate the ever-increasing volumes of digital information. Traditional lexical information retrieval methods can be inefficient and often return inaccurate results. To overcome problems such as polysemy and synonymy, concept-based retrieval methods have been developed. One such method is Latent Semantic Indexing (LSI), a vector-space model, which uses the singular value decomposition (SVD) of a term-by-document matrix to represent terms and documents in k-dimensional space. As with other vector-space models, LSI is an attempt to exploit the underlying semantic structure of word usage in documents. During the query matching phase of LSI, a user's query is first projected into the term-document space, and then compared to all terms and documents represented in the vector space. Using some similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query matching method requires that the similarity measure be computed between the query and every term and document in the vector space. In this paper, the kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching. The kd-tree data structure stores the term and document vectors in such a way that only those terms and documents that are most likely to qualify as nearest neighbors to the query will be examined and retrieved.
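The pruning idea behind kd-tree query matching can be sketched with a small pure-Python tree. This is a generic nearest-neighbour kd-tree, not the implementation evaluated in the paper: it uses squared Euclidean distance (which ranks items the same way as cosine similarity only if all vectors are unit-normalized), and the two-dimensional document vectors and labels are made up for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree from (vector, label) pairs, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, label) of the stored vector nearest to query."""
    if node is None:
        return best
    vec, label = node["point"]
    dist = sq_dist(vec, query)
    if best is None or dist < best[0]:
        best = (dist, label)
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


# Hypothetical reduced-dimension document vectors.
DOCS = [
    ((0.9, 0.1), "doc_a"),
    ((0.2, 0.8), "doc_b"),
    ((0.5, 0.5), "doc_c"),
    ((0.1, 0.2), "doc_d"),
]
tree = build_kdtree(DOCS)
hit = nearest(tree, (0.8, 0.2))
```

The saving comes from the final check in `nearest`: whole subtrees are skipped when they cannot contain a closer vector, so the query is not compared against every stored item.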

14.
Chinese sentence-mode subject indexing, abbreviated as the "sentence-mode method", is a new method for subject indexing and retrieval of Chinese scientific and technical documents. The Chinese sentence mode, being a form of retrieval language, is compatible with some characteristics of subject indexing and classification. The article also explores a new way of standardizing the indexing unit. In each particular subject there objectively exists a kind of "concept unit" that one can follow and use. The "concept unit" is not like the unit word of the basic word method, nor is it like a subject word derived from an artificially standardized thesaurus. It is an objective, intrinsic concept unit separated out from a particular subject, i.e. a kind of standard subject word of a special form that needs no thesaurus. The method has already been tested for retrieval on a computer. 1 table.

15.
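On the fourth type of information retrieval language system   Total citations: 7, self-citations: 0, citations by others: 7
The fourth type of information retrieval language is an integrated language that combines natural language with artificial language. A system built on it is a network-based information retrieval system that is more advanced and more novel than an integrated classification-subject retrieval language system, and it is the direction for research on information retrieval language systems in China in the 21st century. The key to accelerating this research in China is solving the problem of Chinese word segmentation. 14 references.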

16.
彭哲 《图书情报工作》2008,52(6):110-110
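A full-text retrieval system consists of three main functional modules: an indexing module, a retrieval module, and a storage module. This paper focuses on the system composition and on technical difficulties such as designing the XML database, building the inverted index file, and Chinese word segmentation. On this basis, it builds a full-text retrieval system for journal literature based on Lucene/XML.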

17.
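To address the problems of traditional retrieval with the like wildcard, this paper presents the idea and workflow of an efficient retrieval algorithm based on Chinese bigram segmentation and gives the core algorithm code. The algorithm is compared with traditional retrieval in terms of duplicate-term elimination, recall, precision, and multi-character word retrieval, and it outperforms traditional retrieval in every evaluation. The algorithm is simple, efficient, and easy to implement, and is intended for use in the retrieval modules of information systems to improve retrieval efficiency and reduce the cost of information search.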
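The general idea of bigram-based retrieval, as opposed to scanning every record with a like wildcard, can be sketched as an inverted index over overlapping two-character tokens. The paper's core algorithm code is not reproduced in this abstract, so the function names, the final substring check, and the sample documents below are all assumptions.

```python
from collections import defaultdict


def bigrams(text):
    """Split text into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Return sorted ids of documents that contain the query string."""
    if len(query) < 2:
        # Too short to form a bigram: fall back to a plain scan.
        return sorted(d for d, text in docs.items() if query and query in text)
    postings = [index.get(gram, set()) for gram in bigrams(query)]
    candidates = set.intersection(*postings)
    # Confirm the bigrams occur contiguously, which removes false positives.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical sample records.
DOCS = {
    1: "中文信息检索",
    2: "汉语自动分词",
    3: "信息检索系统",
}
INDEX = build_index(DOCS)
hits = search(INDEX, DOCS, "信息检索")
```

Only the documents in the intersected posting lists are checked against the query, which is where the saving over a full wildcard scan would come from.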

18.
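A new attempt in teaching practice for the information retrieval and utilization course   Total citations: 7, self-citations: 0, citations by others: 7
This paper describes a new attempt at teaching the information retrieval and utilization course using new technology and equipment, and proposes a new teaching model for the literature retrieval course.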

19.
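Research on applying ontological methods in document-oriented information retrieval systems   Total citations: 1, self-citations: 0, citations by others: 1
Building on a study of how ontological methods can be applied to document information retrieval, this paper examines the construction of a preliminary domain ontology based on a thesaurus. Retrieval requests that rely on similarity matching of concept terms are expanded semantically, and retrieved documents are filtered through interaction with the ontology to select those that better match the request.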

20.
蔡盈芳 《图书情报工作》2012,56(23):108-112,134
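Building on existing representation and retrieval of case knowledge models, and addressing the shortcomings of existing case knowledge retrieval, this paper studies retrieval techniques for case knowledge with multi-level attribute relationships. Multi-level attribute case knowledge is represented as a tree-structured multi-level attribute model. The retrieval algorithm combines similarity algorithms for numeric attributes, fuzzy-valued attributes, and word-valued attributes, and the order of the algorithm steps is adjusted and optimized to improve retrieval efficiency.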
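A tree-structured attribute model of this kind can be compared recursively, as in the minimal sketch below. The similarity formulas are common textbook choices rather than the paper's: numeric attributes use a range-normalized difference, word attributes use character-set Jaccard overlap, fuzzy-valued attributes are left out, and the schema layout, weights, and sample cases are hypothetical.

```python
def numeric_sim(a, b, value_range):
    """Similarity of two numbers, normalized by the attribute's value range."""
    return max(0.0, 1.0 - abs(a - b) / value_range)


def text_sim(a, b):
    """Jaccard overlap of the character sets of two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def case_sim(query, case, schema):
    """Weighted similarity over a tree of attributes.

    Each schema entry has a "weight" and either "children" (a nested schema),
    "range" (a numeric attribute), or neither (a word attribute).
    """
    total = weight_sum = 0.0
    for name, spec in schema.items():
        if "children" in spec:
            sim = case_sim(query[name], case[name], spec["children"])
        elif "range" in spec:
            sim = numeric_sim(query[name], case[name], spec["range"])
        else:
            sim = text_sim(query[name], case[name])
        total += spec["weight"] * sim
        weight_sum += spec["weight"]
    return total / weight_sum if weight_sum else 0.0


# Hypothetical two-level schema and cases.
SCHEMA = {
    "topic": {"weight": 0.4},
    "size": {
        "weight": 0.6,
        "children": {
            "pages": {"weight": 0.5, "range": 100},
            "year": {"weight": 0.5, "range": 20},
        },
    },
}
QUERY = {"topic": "abc", "size": {"pages": 10, "year": 2010}}
CASE = {"topic": "abd", "size": {"pages": 60, "year": 2010}}

similarity = case_sim(QUERY, CASE, SCHEMA)
```

Each internal node of the tree reduces its children to a single score before its parent combines it with sibling attributes, so attributes at any depth contribute according to the product of the weights along their path.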

