Similar Literature
20 similar articles were retrieved.
1.
[Objective] To overcome problems of the traditional Bag-of-Visual-Words (BoVW) approach, which ignores the spatial relationships and semantic information among visual words. [Method] This paper proposes an LDA topic model based representation combined with a visual language model, and uses a query likelihood model to perform retrieval. [Results] Experimental data show that the proposed LDA-based representation solves the keyword retrieval problem for Mongolian historical documents efficiently and accurately. [Conclusion] The performance of the method is also significantly better than that of the BoVW approach.
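A rough sketch of the kind of pipeline this abstract describes, assuming a gensim LDA model combined with query-likelihood scoring; the "visual word" tokens, toy corpus, and smoothing floor below are hypothetical stand-ins, not the paper's Mongolian document data or its exact model.

```python
# Sketch: LDA topic representation + query-likelihood scoring (assumed setup, not the paper's pipeline).
import math
from gensim import corpora, models

docs = [["vw12", "vw7", "vw7", "vw3"],   # each document is a bag of hypothetical visual-word IDs
        ["vw3", "vw9", "vw12", "vw12"],
        ["vw5", "vw7", "vw9", "vw5"]]

dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=20, random_state=0)

def p_word_given_doc(word, bow):
    """LDA document model: p(w|d) = sum_z p(w|z) * p(z|d)."""
    wid = dictionary.token2id.get(word)
    if wid is None:
        return 1e-9                                   # unseen visual word: small floor probability
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(sum(p_z * dict(lda.get_topic_terms(z, topn=len(dictionary))).get(wid, 0.0)
                   for z, p_z in topics), 1e-9)

def query_likelihood(query, bow):
    """Rank score: log p(q|d) under the query-likelihood model."""
    return sum(math.log(p_word_given_doc(w, bow)) for w in query)

query = ["vw7", "vw12"]
print(sorted(range(len(docs)), key=lambda i: query_likelihood(query, bows[i]), reverse=True))
```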

2.
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation – often used in information retrieval and text mining systems – conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
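As a rough illustration of lexicon-based inflectional normalisation, the sketch below conflates surface forms to a representative form by lexicon lookup; the tiny Croatian lexicon is invented for illustration and is not the automatically acquired lexicon described in the paper.

```python
# Minimal sketch: normalisation by lookup in an inflectional lexicon (illustrative entries only).
INFLECTIONAL_LEXICON = {
    # surface form -> representative form (lemma)
    "knjige": "knjiga", "knjigu": "knjiga", "knjigama": "knjiga",
    "gradovi": "grad", "gradova": "grad", "gradu": "grad",
}

def normalise(token: str) -> str:
    """Conflate a morphological variant to its representative form; unknown tokens pass through unchanged."""
    return INFLECTIONAL_LEXICON.get(token.lower(), token.lower())

def normalise_text(tokens):
    return [normalise(t) for t in tokens]

print(normalise_text(["Knjige", "gradova", "retrieval"]))
# ['knjiga', 'grad', 'retrieval']
```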

3.
Many operational IR indexes are non-normalized, i.e. no lemmatization, stemming, or similar techniques have been employed in indexing. This poses a challenge for dictionary-based cross-language retrieval (CLIR), because translations are mostly lemmas. In this study, we face the challenge of dictionary-based CLIR in a non-normalized index. We test two alternative approaches: FCG (Frequent Case Generation) and s-gramming. The idea of FCG is to automatically generate the most frequent inflected forms for a given lemma. FCG has been tested in monolingual retrieval and has been shown to be a good method for inflected retrieval, especially for highly inflected languages. S-gramming is an approximate string matching technique (an extension of n-gramming). The language pairs in our tests were English–Finnish, English–Swedish, Swedish–Finnish and Finnish–Swedish. Both approaches performed quite well, but the results varied depending on the language pair. S-gramming and FCG performed roughly equally for all language pairs except Finnish–Swedish, where s-gramming outperformed FCG.
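The s-gram matching idea can be sketched as follows: character pairs are formed at several skip distances and two strings are compared by the overlap of their s-gram sets. The skip set and Jaccard scoring below are assumptions for illustration, not necessarily the study's exact configuration.

```python
# Illustrative s-gram matching: character pairs at several skip distances, compared by Jaccard overlap.
def s_grams(word, skips=(0, 1, 2)):
    """Set of (skip, character-pair) s-grams of a word."""
    grams = set()
    for k in skips:
        step = k + 1
        for i in range(len(word) - step):
            grams.add((k, word[i] + word[i + step]))
    return grams

def s_gram_similarity(a, b):
    ga, gb = s_grams(a.lower()), s_grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# An inflected Finnish form keeps a fair amount of s-gram overlap with its lemma:
print(round(s_gram_similarity("taloissa", "talo"), 2))
```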

4.
On Natural Language Retrieval
With its advantages in information retrieval, such as simplicity in indexing and convenience in searching, natural language is now widely used in information retrieval and network retrieval. However, due to certain characteristics of natural language itself, its retrieval results are often unsatisfactory. This paper summarizes several main factors that restrict natural language retrieval and introduces a concept space method to improve its retrieval efficiency.

5.
The estimation of query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective in terms of high mean retrieval performance over all queries, but also stable in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, i.e., the bias–variance tradeoff, which is a fundamental theory in statistics. We formulate the notion of bias–variance regarding retrieval performance and estimation quality of query models. We then investigate several estimated query models, by analyzing when and why the bias–variance tradeoff will occur, and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections have been conducted to systematically evaluate our bias–variance analysis. Our approach and results will potentially form an analysis framework and a novel evaluation strategy for query language modeling.
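For reference, the bias–variance tradeoff the authors build on is the standard statistical decomposition of mean squared error, shown below; the paper's own formulation is stated over retrieval performance and query-model estimation quality rather than a generic estimator.

```latex
% Standard bias-variance decomposition of mean squared error for an estimator \hat{\theta} of \theta:
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\bigl[(\hat{\theta} - \theta)^{2}\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{\theta}] - \theta\bigr)^{2}}_{\text{bias}^{2}}
  \; + \;
  \underbrace{\mathbb{E}\bigl[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^{2}\bigr]}_{\text{variance}}
```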

6.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the containing document. The weight is based on how widely the word is spread within the document, not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples. For these examples, we obtained an average recall of 46% and an average precision of 64%.
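A hypothetical version of such spread-aware weighting is sketched below: a term's weight combines its relative frequency with the fraction of document segments (e.g., paragraphs) in which it occurs. The exact formula used in the paper is not reproduced here.

```python
# Hypothetical spread-aware term weighting:
# weight(w) = relative frequency of w * fraction of document segments containing w.
from collections import Counter

def spread_weights(segments):
    """segments: list of token lists, one per paragraph/section of a single document."""
    all_tokens = [t for seg in segments for t in seg]
    tf = Counter(all_tokens)
    total = len(all_tokens)
    n_segs = len(segments)
    weights = {}
    for word, freq in tf.items():
        spread = sum(1 for seg in segments if word in seg) / n_segs
        weights[word] = (freq / total) * spread
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))

doc = [["retrieval", "index", "arabic"],
       ["index", "stem", "weight"],
       ["index", "arabic", "weight", "weight"]]
print(spread_weights(doc))   # "index" ranks high: frequent *and* spread across all segments
```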

7.
Research on the Integration of Artificial Language and Natural Language in the Network Environment
This paper first points out that research on the integration of artificial languages and natural language in the network environment is a central topic in the study of information retrieval languages. Through a comparative analysis of the development and application of artificial and natural languages in information retrieval, it clarifies that each has its own advantages and disadvantages in indexing and retrieval, and argues for the feasibility, inevitability, and necessity of integrating the two so that their strengths complement each other in the network environment. Finally, it discusses how such integration can be achieved.

8.
陈立华 《现代情报》2010,30(3):26-28,31
Latent semantic analysis is the theoretical foundation for applying natural language to information retrieval systems, and the vector space model built on this theory is a tool for judging the performance of a retrieval system. This paper describes the basic content of latent semantic indexing (LSI), the factors that affect the precision of natural language retrieval under LSI, and the operating mechanism of retrieval software based on the vector space model. The review serves as a reference for the development of networked information retrieval technology.
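A minimal LSI sketch, assuming a toy term-document matrix and a rank-2 truncated SVD with NumPy; it only illustrates folding a query into the latent space and ranking documents by cosine similarity, not the specific retrieval software discussed in the article.

```python
# Minimal LSI sketch: truncated SVD of a term-document matrix, cosine ranking in the latent space.
import numpy as np

terms = ["retrieval", "index", "semantic", "latent", "vector"]
# Term-document count matrix A (rows = terms, columns = documents); values are illustrative.
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 1, 1],
              [0, 1, 2, 0],
              [1, 0, 0, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
doc_vecs = (np.diag(sk) @ Vtk).T              # documents in the k-dimensional latent space

def query_vec(query_terms):
    """Fold a query into the latent space: q_k = q^T U_k S_k^{-1}."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return q @ Uk @ np.linalg.inv(np.diag(sk))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

qv = query_vec({"latent", "semantic"})
print(sorted(range(A.shape[1]), key=lambda j: cosine(qv, doc_vecs[j]), reverse=True))
```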

9.
Whereas in language, words of high frequency are generally associated with low content [Bookstein, A., & Swanson, D. (1974). Probabilistic models for automatic indexing. Journal of the American Society of Information Science, 25(5), 312–318; Damerau, F. J. (1965). An experiment in automatic indexing. American Documentation, 16, 283–289; Harter, S. P. (1974). A probabilistic approach to automatic keyword indexing. PhD thesis, University of Chicago; Sparck-Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11–21; Yu, C., & Salton, G. (1976). Precision weighting – an effective automatic indexing method. Journal of the Association for Computer Machinery (ACM), 23(1), 76–88], shallow syntactic fragments of high frequency generally correspond to lexical fragments of high content [Lioma, C., & Ounis, I. (2006). Examining the content load of part of speech blocks for information retrieval. In Proceedings of the international committee on computational linguistics and the association for computational linguistics (COLING/ACL 2006), Sydney, Australia]. We apply this finding to Information Retrieval, as follows. We present a novel automatic query reformulation technique, which is based on shallow syntactic evidence induced from various language samples, and used to enhance the performance of an Information Retrieval system. Firstly, we draw shallow syntactic evidence from language samples of varying size, and compare the effect of language sample size upon retrieval performance, when using our syntactically-based query reformulation (SQR) technique. Secondly, we compare SQR to a state-of-the-art probabilistic pseudo-relevance feedback technique. Additionally, we combine both techniques and evaluate their compatibility. We evaluate our proposed technique across two standard Text REtrieval Conference (TREC) English test collections, and three statistically different weighting models. Experimental results suggest that SQR markedly enhances retrieval performance, and is at least comparable to pseudo-relevance feedback. Notably, the combination of SQR and pseudo-relevance feedback further enhances retrieval performance considerably. These collective experimental results confirm the tenet that high frequency shallow syntactic fragments correspond to content-bearing lexical fragments.

10.
This paper proposes a novel query expansion method to improve the accuracy of text retrieval systems. Our method makes use of minimal relevance feedback to expand the initial query with a structured representation composed of weighted pairs of words. Such a structure is obtained from the relevance feedback through a method for selecting pairs of words based on the Probabilistic Topic Model. We compared our method with other baseline query expansion schemes and methods. Evaluations performed on TREC-8 demonstrated the effectiveness of the proposed method with respect to the baselines.
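One hedged way to realize "weighted pairs of words from feedback via a topic model" is sketched below with gensim: an LDA model is fitted to the feedback documents and candidate pairs are weighted by the product of their per-topic probabilities. This selection criterion and the toy feedback set are assumptions for illustration, not necessarily the paper's exact procedure.

```python
# Sketch: weighted word-pair expansion terms drawn from a topic model over feedback documents.
from gensim import corpora, models
from itertools import combinations

feedback_docs = [["jaguar", "speed", "engine", "car"],
                 ["jaguar", "car", "engine", "dealer"],
                 ["engine", "speed", "dealer", "car"]]

dictionary = corpora.Dictionary(feedback_docs)
bows = [dictionary.doc2bow(d) for d in feedback_docs]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=1, passes=30, random_state=0)

top_terms = lda.show_topic(0, topn=4)            # [(word, prob), ...] for the feedback topic
weighted_pairs = sorted(
    (((w1, w2), p1 * p2) for (w1, p1), (w2, p2) in combinations(top_terms, 2)),
    key=lambda item: item[1], reverse=True)

query = ["jaguar"]
expansion = weighted_pairs[:3]                   # keep the few strongest pairs
print(query, expansion)
```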

11.
Lexical Analysis in Chinese Natural Language Retrieval
耿骞  毛瑞 《情报科学》2004,22(4):466-469
This paper discusses lexical analysis in natural language retrieval. It first reviews the types of natural language retrieval processing based on lexical analysis, such as weighted statistical methods, N-gram methods, and statistical learning methods. It then discusses the methods and process of lexical analysis, focusing on word segmentation and part-of-speech tagging and their related procedures, with particular attention to probabilistic and statistical methods. Finally, it examines the remaining problems in lexical analysis.

12.
On the Control of Information Retrieval Languages
林茵 《情报科学》2001,19(3):258-259
Starting from retrieval practice, this paper briefly compares controlled vocabulary and natural language retrieval, and argues that "simplified indexing + natural language retrieval + post-controlled vocabulary techniques" is a relatively ideal retrieval model.

13.
This paper presents a probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models, user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they relax the traditional assumption of independent relevance of documents.
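The core ranking rule in this style of framework is usually written as an expected-loss (risk) computation over the query and document language models, with documents returned in ascending order of risk; the specific loss functions that yield particular retrieval models are developed in the paper itself.

```latex
% Risk-minimization ranking: return documents in ascending order of expected loss,
% where \theta_Q and \theta_D are the language models behind the query q and document d.
R(d; q) \;=\; \int\!\!\int L(\theta_Q, \theta_D)\, p(\theta_Q \mid q)\, p(\theta_D \mid d)\; d\theta_Q\, d\theta_D
```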

14.
On the Development Trend of Information Retrieval Languages in the Network Environment
This paper argues that adopting natural language in the network environment is an inevitable trend; however, natural language will not become the only retrieval language in the network environment, and the combination of natural language and artificial languages is the direction of future development.

15.
The Chinese Thesaurus (《汉语主题词表》) is a milestone in the history of information retrieval language development in China, and in the network era it will see new development and new applications. In view of its current state, this article reviews its compilation and revision history and the important role it has played, as an information retrieval language tool, in information organization. The authors then propose new ideas, strategies, and methods for how the thesaurus can serve knowledge organization, how to build a new type of vocabulary adapted to the computing environment under network conditions, and how to advance it toward a term system for the network environment.

16.
Knowledge acquisition and bilingual terminology extraction from multilingual corpora are challenging tasks for cross-language information retrieval. In this study, we propose a novel method for mining high-quality translation knowledge from our constructed Persian–English comparable corpus, the University of Tehran Persian–English Comparable Corpus (UTPECC). We extract translation knowledge based on a Term Association Network (TAN) constructed from term co-occurrences in the same language as well as term associations across languages. We further propose a post-processing step that checks term translation validity by detecting mistranslated terms as outliers. Evaluation results on two different data sets show that translating queries using UTPECC and the proposed methods significantly outperforms simple dictionary-based methods. Moreover, the experimental results show that our methods are especially effective in translating Out-Of-Vocabulary terms and in expanding query words based on their associated terms.
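A simplified sketch of a term association network built from within-document co-occurrence is given below, with association strength scored by pointwise mutual information; the scoring function and toy corpus are assumptions, and the paper's cross-language associations and outlier-based validity check are not shown.

```python
# Sketch: term association network from document-level co-occurrence, edges weighted by PMI.
import math
from collections import Counter
from itertools import combinations

docs = [["nuclear", "energy", "reactor"],
        ["energy", "policy", "reactor"],
        ["nuclear", "reactor", "safety"]]

term_freq = Counter(t for d in docs for t in set(d))
pair_freq = Counter(frozenset(p) for d in docs for p in combinations(sorted(set(d)), 2))
n_docs = len(docs)

def pmi(a, b):
    """PMI of two terms based on document-level co-occurrence counts."""
    p_ab = pair_freq[frozenset((a, b))] / n_docs
    return math.log(p_ab / ((term_freq[a] / n_docs) * (term_freq[b] / n_docs)))

# Edges of the association network, strongest first:
edges = sorted(((a, b, pmi(a, b)) for a, b in (tuple(p) for p in pair_freq)),
               key=lambda e: e[2], reverse=True)
for a, b, w in edges[:5]:
    print(f"{a} -- {b}: {w:.2f}")
```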

17.
林竹梅 《未来与发展》2011,34(8):100-102
Good information literacy is a basic ability for college students to learn English well, and the networking of information and the internationalization of English have made information gathering more convenient and more extensive. Based on a random survey of the information literacy of English learners at the authors' university, this paper analyzes the current state of foreign-language information retrieval among college students and proposes several approaches for cultivating their foreign-language information retrieval skills.

18.
The Application of Natural Language in the PubMed Retrieval System
黄碧云  方平 《情报科学》2001,19(11):1191-1192,1204
Natural language retrieval is the development trend of information retrieval. Through automatic mapping of natural language terms and truncation searching, automatic phrase recognition and the use of special symbols, limits on the search range, and features such as Preview/Index, the PubMed retrieval system effectively overcomes the limitations of natural language retrieval and improves retrieval efficiency.

19.
In Korean text, foreign words, which are mostly transliterations of English words, are frequently used. Foreign words are usually very important index terms in Korean information retrieval since most of them are technical terms or names. Accurate foreign word extraction is therefore crucial for high retrieval performance. However, accurate foreign word extraction is not easy because it inevitably involves word segmentation and most of the foreign words are unknown. In this paper, we present an effective foreign word recognition and extraction method. In order to accurately extract foreign words, we developed an effective word segmentation method that handles unknown foreign words. Our word segmentation method effectively utilizes both unknown-word information acquired through automatic dictionary compilation and foreign word recognition information. Unlike previously proposed methods, our HMM-based foreign word recognition method does not require a large set of labeled examples for model training.
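The decoding step of such an HMM tagger can be sketched with a toy Viterbi decoder that labels each syllable as Foreign (F) or Native (N); the states, probabilities, and romanized syllables below are invented for illustration and are not the paper's trained model.

```python
# Toy Viterbi decoder for an HMM that tags each syllable as Foreign (F) or Native (N).
import math

states = ("F", "N")
start_p = {"F": 0.3, "N": 0.7}
trans_p = {"F": {"F": 0.8, "N": 0.2}, "N": {"F": 0.1, "N": 0.9}}
emit_p = {"F": {"keom": 0.3, "pyu": 0.3, "teo": 0.3, "ha": 0.05, "da": 0.05},
          "N": {"keom": 0.01, "pyu": 0.01, "teo": 0.01, "ha": 0.50, "da": 0.47}}

def viterbi(obs):
    """Most probable F/N tag sequence for the observed syllables (log-space Viterbi)."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-6)), [s]) for s in states}]
    for o in obs[1:]:
        V.append({s: max((V[-1][prev][0] + math.log(trans_p[prev][s]) + math.log(emit_p[s].get(o, 1e-6)),
                          V[-1][prev][1] + [s])
                         for prev in states)
                  for s in states})
    return max(V[-1].values())[1]

print(viterbi(["keom", "pyu", "teo"]))   # -> ['F', 'F', 'F'] : transliteration-like syllables
print(viterbi(["ha", "da"]))             # -> ['N', 'N']     : native-like syllables
```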

20.
曲琳琳 《情报科学》2021,39(8):132-138
[Purpose/Significance] The goal of cross-language information retrieval research is to remove the difficulty that language differences cause for information queries, to improve the efficiency of locating specific information in large and heterogeneous collections, and to offer users a more convenient way to retrieve documents written in another language using the language they are familiar with. [Method/Process] Based on an analysis of the current state of cross-language information retrieval research at home and abroad, this paper introduces several query translation approaches, namely direct query translation, document translation, pivot-language translation, and combined query-document translation, and compares their effectiveness. It then describes the key technologies of cross-language retrieval and proposes solutions to, and evaluations of, the ambiguity introduced by bilingual dictionaries, corpora, and machine translation. [Results/Conclusions] Natural language processing, co-occurrence analysis, relevance feedback, query expansion, bidirectional translation, and ontology-based retrieval techniques can ensure the coverage of the knowledge dictionary and handle ambiguity; experiments on cross-language retrieval show that combining a knowledge dictionary, a corpus, and a search engine improves query efficiency. [Innovation/Limitations] To address the shortage of words in the dictionary and of sentence pairs in the corpus for cross-language retrieval, this paper proposes acquiring information resources from web pages through search engines to enrich the corpus. The paper focuses on Chinese-English retrieval; the proposed solutions still require further study, as problems such as the difficulty of Chinese word segmentation and low dictionary coverage seriously affect retrieval efficiency.
