首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries––one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus.We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.  相似文献   

2.
曲琳琳 《情报科学》2021,39(8):132-138
【目的/意义】跨语言信息检索研究的目的即在消除因语言的差异而导致信息查询的困难,提高从大量纷繁 复杂的查找特定信息的效率。同时提供一种更加方便的途径使得用户能够使用自己熟悉的语言检索另外一种语 言文档。【方法/过程】本文通过对国内外跨语言信息检索的研究现状分析,介绍了目前几种查询翻译的方法,包括: 直接查询翻译、文献翻译、中间语言翻译以及查询—文献翻译方法,对其效果进行比较,然后阐述了跨语言检索关 键技术,对使用基于双语词典、语料库、机器翻译技术等产生的歧义性提出了解决方法及评价。【结果/结论】使用自 然语言处理技术、共现技术、相关反馈技术、扩展技术、双向翻译技术以及基于本体信息检索技术确保知识词典的 覆盖度和歧义性处理,通过对跨语言检索实验分析证明采用知识词典、语料库和搜索引擎组合能够提高查询效 率。【创新/局限】本文为了解决跨语言信息检索使用词典、语料库中词语缺乏的现象,提出通过搜索引擎从网页获 取信息资源来充实语料库中语句对不足的问题。文章主要针对中英文信息检索问题进行了探讨,解决方法还需要 进一步研究,如中文切词困难以及字典覆盖率低等严重影响检索的效率。  相似文献   

3.
目前大多数机器翻译和跨语言检索系统都是基于通用语料,对外文科技资料的翻译效果不理想,本文结合科技文献的加工方法,研究面向科技文献的跨语言信息检索系统的模型。首先对跨语言信息检索的概念和特点进行简单的概述,从3个角度介绍跨语言信息检索的研究方法,然后讨论构建跨语言信息检索系统的必要性,在此基础上设计出一个面向科技文献的跨语言信息检索系统模型以及主要功能结构。  相似文献   

4.
This paper presents a laboratory based evaluation study of cross-language information retrieval technologies, utilizing partially parallel test collections, NTCIR-2 (used together with NTCIR-1), where Japanese–English parallel document collections, parallel topic sets and their relevance judgments are available. These enable us to observe and compare monolingual retrieval processes in two languages as well as retrieval across languages. Our experiments focused on (1) the Rosetta stone question (whether a partially parallel collection helps in cross-language information access or not?) and (2) two aspects of retrieval difficulties namely “collection discrepancy” and “query discrepancy”. Japanese and English monolingual retrieval systems are combined by dictionary based query translation modules so that a symmetrical bilingual evaluation environment is implemented.  相似文献   

5.
双语机读词典是基于查询翻译的跨语言信息检索中的常用资源,但是传统的手工构建词典的方法费时费力,本文利用统计方法从英汉句对齐平行语料库中自动获取翻译词典,以用于查询翻译过程中。  相似文献   

6.
基于本体的跨语言信息检索在数字图书馆中的应用   总被引:2,自引:0,他引:2  
鲍丽倩  张自然 《现代情报》2011,31(7):169-172
首先对跨语言信息检索和相关技术进行了介绍,了解当前跨语言信息检索技术的不足,然后阐述了传统跨语言信息检索技术在数字图书馆应用中的局限性,并由此引出了基于本体的跨语言技术。最后提出了一种基于本体的数字图书馆跨语言信息检索系统,并详细阐述了系统的流程,着重讲述了数字图书馆跨语言领域本体的构建。由于本体具有良好的概念层次和对逻辑推理的支持,对源语言和目标语言进行语义扩展,提高了数字图书馆跨语言系统的检索效率。  相似文献   

7.
数字资源整合及检索初探   总被引:8,自引:0,他引:8  
本文介绍了目前整合数字资源的方式以及相关标准、协议和接口技术,讨论了数字资源整合系统的信息检索原理及跨语言翻译、学科专业分类表、检索语言转换等问题,分析了数字资源整合研究将面临的挑战。  相似文献   

8.
网上信息跨语言检索方法   总被引:5,自引:1,他引:5  
刘卫中 《情报科学》2004,22(12):1503-1504
本文介绍了目前全世界网络语言的现状和实现网上信息跨语言检索的五种方法,阐述了跨语言检索对实现网络信息资源共享的重要性。  相似文献   

9.
This paper reviews state-of-the-art techniques and methods for enhancing effectiveness of cross-language information retrieval (CLIR). The following research issues are covered: (1) matching strategies and translation techniques, (2) methods for solving the problem of translation ambiguity, (3) formal models for CLIR such as application of the language model, (4) the pivot language approach, (5) methods for searching multilingual document collection, (6) techniques for combining multiple language resources, etc.  相似文献   

10.
This paper discusses the role of user-centred evaluations as an essential method for researching interactive information retrieval. It draws mainly on the work carried out during the Clarity Project where different user-centred evaluations were run during the lifecycle of a cross-language information retrieval system. The iterative testing was not only instrumental to the development of a usable system, but it enhanced our knowledge of the potential, impact, and actual use of cross-language information retrieval technology. Indeed the role of the user evaluation was dual: by testing a specific prototype it was possible to gain a micro-view and assess the effectiveness of each component of the complex system; by cumulating the result of all the evaluations (in total 43 people were involved) it was possible to build a macro-view of how cross-language retrieval would impact on users and their tasks. By showing the richness of results that can be acquired, this paper aims at stimulating researchers into considering user-centred evaluations as a flexible, adaptable and comprehensive technique for investigating non-traditional information access systems.  相似文献   

11.
The rapid growth of documents in different languages, the increased accessibility of electronic documents, and the availability of translation tools have caused cross-lingual plagiarism detection research area to receive increasing attention in recent years. The task of cross-language plagiarism detection entails two main steps: candidate retrieval and assessing pairwise document similarity. In this paper we examine candidate retrieval, where the goal is to find potential source documents of a suspicious text. Our proposed method for cross-language plagiarism detection is a keyword-focused approach. Since plagiarism usually happens in parts of the text, there is a requirement to segment the texts into fragments to detect local similarity. Therefore we propose a topic-based segmentation algorithm to convert the suspicious document to a set of related passages. After that, we use a proximity-based model to retrieve documents with the best matching passages. Experiments show promising results for this important phase of cross-language plagiarism detection.  相似文献   

12.
In the KL divergence framework, the extended language modeling approach has a critical problem of estimating a query model, which is the probabilistic model that encodes the user’s information need. For query expansion in initial retrieval, the translation model had been proposed to involve term co-occurrence statistics. However, the translation model was difficult to apply, because the term co-occurrence statistics must be constructed in the offline time. Especially in a large collection, constructing such a large matrix of term co-occurrences statistics prohibitively increases time and space complexity. In addition, reliable retrieval performance cannot be guaranteed because the translation model may comprise noisy non-topical terms in documents. To resolve these problems, this paper investigates an effective method to construct co-occurrence statistics and eliminate noisy terms by employing a parsimonious translation model. The parsimonious translation model is a compact version of a translation model that can reduce the number of terms containing non-zero probabilities by eliminating non-topical terms in documents. Through experimentation on seven different test collections, we show that the query model estimated from the parsimonious translation model significantly outperforms not only the baseline language modeling, but also the non-parsimonious models.  相似文献   

13.
We study several machine learning algorithms for cross-language patent retrieval and classification. In comparison with most of other studies involving machine learning for cross-language information retrieval, which basically used learning techniques for monolingual sub-tasks, our learning algorithms exploit the bilingual training documents and learn a semantic representation from them. We study Japanese–English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear relationships between two variables in kernel defined feature spaces. The results are quite encouraging and are significantly better than those obtained by other state of the art methods. We also investigate learning algorithms for cross-language document classification. The learning algorithm are based on KCCA and Support Vector Machines (SVM). In particular, we study two ways of combining the KCCA and SVM and found that one particular combination called SVM_2k achieved better results than other learning algorithms for either bilingual or monolingual test documents.  相似文献   

14.
We will explore various ways to apply query structuring in cross-language information retrieval. In the first test, English queries were translated into Finnish using an electronic dictionary, and were run in a Finnish newspaper database of 55,000 articles. Queries were structured by combining the Finnish translation equivalents of the same English query key using the syn-operator of the InQuery retrieval system. Structured queries performed markedly better than unstructured queries. Second, the effects of compound-based structuring using a proximity operator for the translation equivalents of query language compound components were tested. The method was not useful in syn-based queries but resulted in decrease in retrieval effectiveness. Proper names are often non-identical spelling variants in different languages. This allows n-gram based translation of names not included in a dictionary. In the third test, a query structuring method where the Boolean and-operator was used to assign more weight to keys translated through n-gram matching gave good results.  相似文献   

15.
Knowledge acquisition and bilingual terminology extraction from multilingual corpora are challenging tasks for cross-language information retrieval. In this study, we propose a novel method for mining high quality translation knowledge from our constructed Persian–English comparable corpus, University of Tehran Persian–English Comparable Corpus (UTPECC). We extract translation knowledge based on Term Association Network (TAN) constructed from term co-occurrences in same language as well as term associations in different languages. We further propose a post-processing step to do term translation validity check by detecting the mistranslated terms as outliers. Evaluation results on two different data sets show that translating queries using UTPECC and using the proposed methods significantly outperform simple dictionary-based methods. Moreover, the experimental results show that our methods are especially effective in translating Out-Of-Vocabulary terms and also expanding query words based on their associated terms.  相似文献   

16.
张雪梅  过仕明 《现代情报》2013,33(7):112-117
以CNKI数字出版平台所收录的文献为依据,对近年来我国跨语言信息检索(CLIR:Cross Language Information Retrieval)的文献进行文献计量学统计,选取2001-2012年间研究成果作为数据样本,并从文献年代分布、文献被引情况、文献情报源分布、研究人员及机构分布、获得资助情况、关键词及论文主题分布进行统计分析,对2001-2012年关于跨语言信息检索的研究现状进行了梳理和总结,从而为进一步的研究和发展提供参考。  相似文献   

17.
This paper presents a study of relevance feedback in a cross-language information retrieval environment. We have performed an experiment in which Portuguese speakers are asked to judge the relevance of English documents; documents hand-translated to Portuguese and documents automatically translated to Portuguese. The goals of the experiment were to answer two questions (i) how well can native Portuguese searchers recognise relevant documents written in English, compared to documents that are hand translated and automatically translated to Portuguese; and (ii) what is the impact of misjudged documents on the performance improvement that can be achieved by relevance feedback. Surprisingly, the results show that machine translation is as effective as hand translation in aiding users to assess relevance in the experiment. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly for different query topics.  相似文献   

18.
As an effective technique for improving retrieval effectiveness, relevance feedback (RF) has been widely studied in both monolingual and translingual information retrieval (TLIR). The studies of RF in TLIR have been focused on query expansion (QE), in which queries are reformulated before and/or after they are translated. However, RF in TLIR actually not only can help select better query terms, but also can enhance query translation by adjusting translation probabilities and even resolving some out-of-vocabulary terms. In this paper, we propose a novel relevance feedback method called translation enhancement (TE), which uses the extracted translation relationships from relevant documents to revise the translation probabilities of query terms and to identify extra available translation alternatives so that the translated queries are more tuned to the current search. We studied TE using pseudo-relevance feedback (PRF) and interactive relevance feedback (IRF). Our results show that TE can significantly improve TLIR with both types of relevance feedback methods, and that the improvement is comparable to that of query expansion. More importantly, the effects of translation enhancement and query expansion are complementary. Their integration can produce further improvement, and makes TLIR more robust for a variety of queries.  相似文献   

19.
文章提出的基于三元组可比语料库的自动语言剖析技术扩大了该研究领域的内涵,使其包括面向自然语言处理的应用研究。从工程可实现性考虑,创新性地提出建造三元组可比语料库,利用n-元词串、关键词簇和语义多词表达等自动抽取技术,通过对比中式英语表达,发掘英语本族语言模型,实现改进和发展机器翻译、跨语言信息检索等自然语言处理应用的目标。  相似文献   

20.
This paper presents a Foreign-Language Search Assistant that uses noun phrases as fundamental units for document translation and query formulation, translation and refinement. The system (a) supports the foreign-language document selection task providing a cross-language indicative summary based on noun phrase translations, and (b) supports query formulation and refinement using the information displayed in the cross-language document summaries. Our results challenge two implicit assumptions in most of cross-language Information Retrieval research: first, that once documents in the target language are found, Machine Translation is the optimal way of informing the user about their contents; and second, that in an interactive setting the optimal way of formulating and refining the query is helping the user to choose appropriate translations for the query terms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号