
1.

An effective and efficient results merging strategy for multilingual information retrieval in federated search environments

Luo Si Jamie Callan Suleyman Cetintas Hao Yuan 《Information Retrieval》2008,11(1):1-24

Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target languages in response to a user query in a single source language. In a multilingual federated search environment, different information sources contain documents in different languages. A general search strategy in such environments is to translate the user query into the language of each information source and run a monolingual search in each source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information sources, which are in different languages. This is known as the results merging problem for multilingual information retrieval. Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. On the other hand, a more effective merging method was proposed that downloads and translates all retrieved documents into the source language and generates the final ranked list by running a monolingual search in the search client. The latter method is more effective but incurs large online communication and computation costs. This paper proposes an effective and efficient approach for the results merging task of multilingual ranked lists. In particular, it downloads only a small number of documents from the individual ranked lists of each user query and calculates comparable document scores for them by utilizing both the query-based translation method and the document-based translation method. Then, query-specific and source-specific transformation models can be trained for the individual ranked lists by using the information from these downloaded documents. These transformation models are used to estimate comparable document scores for all retrieved documents, so that the documents can be sorted into a final ranked list. This merging approach is efficient because only a subset of the retrieved documents is downloaded and translated online. Furthermore, an extensive set of experiments on Cross-Language Evaluation Forum (CLEF) data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results merging algorithm with different transformation models. This paper also provides thorough experimental results as well as detailed analysis. All of this work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results of the cross-language evaluation forum-CLEF 2005, 2005).
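The merging idea in the abstract above can be illustrated with a minimal sketch. It assumes a simple linear transformation from source-specific scores to comparable scores, fitted per query and per source by least squares; the abstract mentions several transformation models without detailing them, so this is one plausible variant and not necessarily the authors' model. The function names `fit_linear` and `merge_ranked_lists` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y ~ a * x + b; returns (a, b)."""
    if len(xs) < 2:
        raise ValueError("need at least two sampled documents per source")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All sampled source scores are identical: fall back to a constant.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def merge_ranked_lists(
    source_lists: Dict[str, List[Tuple[str, float]]],
    comparable_scores: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Merge per-source ranked lists for ONE query.

    source_lists: source -> [(doc_id, source_specific_score), ...]
    comparable_scores: source -> {doc_id: comparable_score}, available only
        for the few documents that were downloaded and scored centrally.
    """
    merged: List[Tuple[str, float]] = []
    for source, ranked in source_lists.items():
        sampled = comparable_scores[source]
        # Training pairs come only from the downloaded documents.
        xs = [score for doc, score in ranked if doc in sampled]
        ys = [sampled[doc] for doc, _ in ranked if doc in sampled]
        a, b = fit_linear(xs, ys)
        # Estimate a comparable score for every retrieved document.
        merged.extend((doc, a * score + b) for doc, score in ranked)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

Because a separate model is fitted for each ranked list of each query, only the sampled documents need to be downloaded; every other document is placed in the final list through its estimated score.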


Similar articles

2.

Using Corpus-Based Approaches in a System for Multilingual Information Retrieval

Martin Braschler Peter Schäuble 《Information Retrieval》2000,3(3):273-284

We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison.

3.

A merging strategy proposal: The 2-step retrieval status value method

Fernando Martínez-Santiago L. Alfonso Ureña-López Maite Martín-Valdivia 《Information Retrieval》2006,9(1):71-93

A usual strategy to implement CLIR (Cross-Language Information Retrieval) systems is the so-called query translation approach. The user query is translated into each language present in the multilingual collection so that an independent monolingual information retrieval process can be run per language. Thus, this approach divides documents according to language, yielding as many collections as there are languages. After searching these corpora and obtaining a result list per language, we must merge the lists in order to provide a single list of retrieved articles. In this paper, we propose an approach to obtain a single list of relevant documents for CLIR systems driven by query translation. This approach, which we call 2-step RSV (RSV: Retrieval Status Value), is based on re-indexing the retrieved documents according to the query vocabulary, and it performs noticeably better than traditional methods. The proposed method requires query vocabulary alignment: given a word in a query, we must know its translation or translations in the other languages. Because this is not always possible, we have investigated a mixed model, which is applied to deal with queries with only partial word-level alignment. The results show that even in this scenario, 2-step RSV performs better than traditional merging methods.

4.

Statistical Models for Monolingual and Bilingual Information Retrieval

Nicola Bertoldi Marcello Federico 《Information Retrieval》2004,7(1-2):53-72

This work reviews information retrieval systems developed at ITC-irst which were evaluated through several tracks of CLEF during the last three years. The presentation follows the progress made over time in developing new statistical models, first for monolingual information retrieval and then for cross-language information retrieval. Besides describing the underlying theory, the performance of the monolingual and bilingual information retrieval models is reported on the Italian monolingual tracks and the Italian-English bilingual tracks of CLEF, respectively. Monolingual systems by ITC-irst performed consistently well in all the official evaluations, while the bilingual system ranked in CLEF 2002 just behind competitors using commercial machine translation engines. However, by experimentally comparing our statistical topic translation model against a state-of-the-art commercial system, no statistically significant difference in retrieval performance could be measured on a larger set of queries.

5.

Combining Multiple Strategies for Effective Monolingual and Cross-Language Retrieval

Jacques Savoy 《Information Retrieval》2004,7(1-2):121-148

This paper describes and evaluates different retrieval strategies that are useful for search operations on document collections written in various European languages, namely French, Italian, Spanish and German. We also suggest and evaluate different query translation schemes based on freely available translation resources. In order to cross language barriers, we propose a combined query translation approach that has resulted in interesting retrieval effectiveness. Finally, we suggest a collection merging strategy based on logistic regression that tends to perform better than other merging approaches.

6.

Mixed monolingual homepage finding in 34 languages: the role of language script and search domain

Roi Blanco Christina Lioma 《Information Retrieval》2009,12(3):324-351


7.

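Research on the application mechanism of ontology in cross-language information retrieval (total citations: 3; self-citations: 1; citations by others: 2)

Wu Dan Wang Huilin 《图书情报工作 (Library and Information Service)》2006,50(9):10-13

This paper explains the meaning of multilingual ontologies and identifies the domain knowledge they correspond to in different languages. It analyzes how multilingual ontologies improve cross-language information retrieval in three respects: query expansion, semantic annotation, and concept-based indexing. By introducing the methods that EuroWordNet and the Cindor system use to map multilingual ontology concepts, it discusses the mapping of multilingual ontology bases, the most critical issue in applying ontologies to cross-language information retrieval. It concludes that using an interlingua for concept representation, and linking it to the vocabulary of different languages through dictionary-based translation, is a good method for multilingual ontology mapping.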

8.

Combination Approaches for Multilingual Text Retrieval

Martin Braschler 《Information Retrieval》2004,7(1-2):183-204

We describe the Eurospider component for Cross-Language Information Retrieval (CLIR) that has been employed for experiments at all three CLEF campaigns to date. The central aspect of our efforts is the use of combination approaches, effectively combining multiple language pairs, translation resources and translation methods into one multilingual retrieval system. We discuss the implications of building a system that allows flexible combination, give details of the various translation resources and methods, and investigate the impact of merging intermediate results generated by the individual steps. An analysis of the resulting combination system is given which also takes into account additional requirements when deploying the system as a component in an operational, commercial setting.

9.

Prediction of performance of cross-language information retrieval using automatic evaluation of translation

Kazuaki Kishida 《Library & information science research》2008

This study develops regression models for predicting the performance of cross-language information retrieval (CLIR). The model assumes that CLIR performance can be explained by two factors: (1) the ease of search inherent in each query and (2) the translation quality in the process of CLIR systems. As operational variables, monolingual information retrieval (IR) performance is used for measuring the ease of search, and the well-known evaluation metric BLEU is used to measure the translation quality. This study also proposes an alternative metric, weighted average for matched unigrams (WAMU), which is tailored to gauging translation quality for special IR purposes. The data for regression analysis are obtained from a retrieval experiment of English-to-Italian bilingual searches using the CLEF 2003 test collection. The CLIR and monolingual IR performances are measured by average precision score. The result shows that the proposed regression model can explain about 60% of the variation in CLIR performance, and WAMU has more predictive power than BLEU. A back translation method for applying the regression model to operational CLIR systems in real situations is discussed.
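The two-factor model described in this abstract can be written schematically as below. This is a sketch that assumes a linear form, which the abstract does not state; the coefficients, the error term, and the symbol T(q) are notation introduced here for illustration only.

```latex
\mathrm{AP}_{\mathrm{CLIR}}(q) = \beta_0 + \beta_1\,\mathrm{AP}_{\mathrm{mono}}(q) + \beta_2\,T(q) + \varepsilon_q
```

Here AP_mono(q) is the monolingual average precision for query q (the ease-of-search factor) and T(q) is the translation quality score, measured by either BLEU or WAMU. Under this reading, "about 60% of the variation" corresponds to a coefficient of determination of roughly R^2 = 0.6.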

10.

Architecture and evaluation of BRUJA, a multilingual question answering system

M. Á. García-Cumbreras F. Martínez-Santiago L. A. Ureña-López 《Information Retrieval》2012,15(5):413-432

Given a user question, the goal of a Question Answering (QA) system is to retrieve answers rather than full documents or even best-matching passages, as most Information Retrieval systems currently do. In this paper, we present BRUJA, a QA system for the management of multilingual collections. BRUJA works with three languages (English, Spanish and French). The BRUJA architecture is not formed from three monolingual QA systems but instead uses English as an interlingua to perform the usual QA tasks such as question classification and answer extraction. In addition, BRUJA uses Cross Language Information Retrieval (CLIR) techniques to retrieve relevant documents from a multilingual collection. On the one hand, we have more documents to find answers from, but on the other hand, we are introducing noise into the system because of translations to the interlingua (English) and the CLIR module. The question is whether the difficulty of managing three languages is worth it or whether a monolingual QA system delivers better results. We report on in-depth experimentation and demonstrate that our multilingual QA system gets better results than its monolingual counterpart whenever it uses good translation resources and, especially, state-of-the-art CLIR techniques.

11.

Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study

Walid Magdy Gareth J. F. Jones 《Information Retrieval》2014,17(5-6):492-519

Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words regardless of their surface form or the linguistic structure of the output. Thus much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the MT computational and training resource requirements. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains. The new MT technique also enables patent document translation in a practical amount of time, with a resulting significant improvement in retrieval effectiveness.

12.

“Sentences like these:” Multicultural information dynamics and international diversity of thought

Paul L. Hover Jun Lu 《International Information and Library Review》2013,45(3):196-218

Multicultural information dynamics is exploratory cross-cultural research of the information-seeking behavior of a group of eighty-four Egyptian and American reference librarians asked to choose from websites in different languages. This paper, the fourth in a series, focuses on national, monolingual, and multilingual subgroups, and provides multi-tiered analyses of websites clicked, reasons given for clicking, preferences for machine translations vs. original foreign language websites, decision making when choosing non-native language hits, and foreign language anxiety. Findings of the research show that information seekers of both nationalities are reluctant to cross cultural lines at the basic level of retrieved Internet information hits. Further results delineate differences and similarities in motivations, circumstantial preferences for original languages or machine translations, and comparative information-seeking behavior of subgroups. The research has implications for improving search performance in the fields of global knowledge dissemination via website and search engine design, library science, and international scholarship.

13.

Word normalization and decompounding in mono- and bilingual IR

Eija Airio 《Information Retrieval》2006,9(3):249-271

The present research studies the impact of decompounding and two different word normalization methods, stemming and lemmatization, on monolingual and bilingual retrieval. The languages in the monolingual runs are English, Finnish, German and Swedish. The source language of the bilingual runs is English, and the target languages are Finnish, German and Swedish. In the monolingual runs, retrieval in a lemmatized compound index gives almost as good results as retrieval in a decompounded index, but in the bilingual runs differences are found: retrieval in a lemmatized decompounded index performs better than retrieval in a lemmatized compound index. The reason for the poorer performance of indexes without decompounding in bilingual retrieval is the difference between the source language and target languages: phrases are used in English, while compounds are used instead of phrases in Finnish, German and Swedish. No remarkable performance differences could be found between stemming and lemmatization.

14.

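Design and implementation of a practical query translation interface for cross-language retrieval

Gao Yingfan Xu Hongjiao 《图书情报工作 (Library and Information Service)》2013,57(20):123-126

With multilingual information resources expanding rapidly, cross-language information retrieval has become a key technology for global knowledge access and sharing. This paper builds a practical query translation interface for cross-language retrieval that can be easily embedded in any information retrieval platform to extend the platform's multilingual information processing capability. The interface uses several translation disambiguation strategies, including longest-phrase matching, query classification, and a probabilistic dictionary. It is evaluated in terms of both query translation accuracy and runtime efficiency, and the experimental results confirm that the adopted methods are feasible.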

15.

“Sentences like these:” Multicultural information dynamics and international diversity of thought

Paul L. Jun 《International Information and Library Review》2009,41(3):196-218

Multicultural information dynamics is exploratory cross-cultural research of the information-seeking behavior of a group of eighty-four Egyptian and American reference librarians asked to choose from websites in different languages. This paper, the fourth in a series, focuses on national, monolingual, and multilingual subgroups, and provides multi-tiered analyses of websites clicked, reasons given for clicking, preferences for machine translations vs. original foreign language websites, decision making when choosing non-native language hits, and foreign language anxiety. Findings of the research show that information seekers of both nationalities are reluctant to cross cultural lines at the basic level of retrieved Internet information hits. Further results delineate differences and similarities in motivations, circumstantial preferences for original languages or machine translations, and comparative information-seeking behavior of subgroups. The research has implications for improving search performance in the fields of global knowledge dissemination via website and search engine design, library science, and international scholarship.

16.

Swedish full text retrieval: Effectiveness of different combinations of indexing strategies with query terms

Per Ahlgren Jaana Kekäläinen 《Information Retrieval》2006,9(6):681-697

In this paper, which treats Swedish full text retrieval, the problem of morphological variation of query terms in the document database is studied. The Swedish CLEF 2003 test collection was used, and the effects of combining indexing strategies with query terms on retrieval effectiveness were studied. Four of the seven tested combinations involved indexing strategies that used normalization, a form of conflation. All four of these combinations employed compound splitting, both during indexing and at the query phase. SWETWOL, a morphological analyzer for the Swedish language, was used for normalization and compound splitting. A fifth combination used stemming, while a sixth attempted to group related terms by right-hand truncation of query terms. The truncation was performed by a search expert. These six combinations were compared to each other and to a baseline combination, where no attempt was made to counteract the problem of morphological variation of query terms in the document database. The truncation combination, the four combinations based on normalization, and the stemming combination all outperformed the baseline. Truncation had the best performance. The main conclusion of the paper is that truncation, normalization and stemming enhanced retrieval effectiveness in comparison to the baseline. Further, normalization and stemming were not far below truncation.

17.

Dictionary-Based Cross-Language Information Retrieval: Learning Experiences from CLEF 2000–2002

Turid Hedlund Eija Airio Heikki Keskustalo Raija Lehtokangas Ari Pirkola Kalervo Järvelin 《Information Retrieval》2004,7(1-2):99-119

In this study the basic framework and performance analysis results are presented for the three-year-long development process of the dictionary-based UTACLIR system. The tests expand from bilingual CLIR for three language pairs (Swedish, Finnish and German to English) to six language pairs (English to French, German, Spanish, Italian, Dutch and Finnish), and from bilingual to multilingual. In addition, transitive translation tests are reported. The development process of the UTACLIR query translation system is regarded from the point of view of a learning process. The contribution of the individual components, the effectiveness of compound handling, proper name matching and the structuring of queries are analyzed. The results and the fault analysis have been valuable in the development process. Overall the results indicate that the process is robust and can be extended to other languages. The individual effects of the different components are in general positive. However, performance also depends on the topic set and the number of compounds and proper names in the topic, and to some extent on the source and target language. The dictionaries used affect the performance significantly.

18.

An analysis of evaluation campaigns in ad-hoc medical information retrieval: CLEF eHealth 2013 and 2014

Lorraine Goeuriot Gareth J. F. Jones Liadh Kelly Johannes Leveling Mihai Lupu Joao Palotti Guido Zuccon 《Information Retrieval》2018,21(6):507-540

Since its inception in 2013, one of the key contributions of the CLEF eHealth evaluation campaign has been the organization of an ad-hoc information retrieval (IR) benchmarking task. This IR task evaluates systems intended to support laypeople searching for and understanding health information. Each year the task provides registered participants with standard IR test collections consisting of a document collection and topic set. Participants then return retrieval results obtained by their IR systems for each query, which are assessed using a pooling procedure. In this article we focus on the CLEF eHealth 2013 and 2014 retrieval tasks, which saw topics created based on patients’ information needs associated with their medical discharge summaries. We overview the task and datasets created, and the results obtained by participating teams over these two years. We then provide a detailed comparative analysis of the results, and conduct an evaluation of the datasets in the light of these results. This twofold study of the evaluation campaign teaches us about technical aspects of medical IR, such as the effectiveness of query expansion; the quality and characteristics of CLEF eHealth IR datasets, such as their reliability; and how to run an IR evaluation campaign in the medical domain.

19.

Monolingual Document Retrieval for European Languages

Vera Hollink Jaap Kamps Christof Monz Maarten de Rijke 《Information Retrieval》2004,7(1-2):33-52

Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and we analyze them with respect to their impact on retrieval effectiveness. The techniques considered range from linguistically motivated techniques, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out against data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language independent techniques.

20.

Web searching across languages: Preference and behavior of bilingual academic users in Korea

《Library & information science research》2005,27(2):249-263

The problem of language in Web searching has been discussed primarily in the area of cross-language information retrieval (CLIR). However, much CLIR research centers on investigation of the effectiveness of automatic translation techniques. The case study reported here explored bilingual user behaviors, perceptions, and preferences with respect to the capability of the Web as a multilingual information resource. Twenty-eight bilingual academic users from Myongji University in Korea were recruited for the study. Findings show that the subjects did not use Web search engines as multilingual tools. For search queries, they selected a language that represents their information need most accurately depending on the types of information task rather than choosing their first language. Subjects expressed concerns about the accuracy of machine translation of scholarly terminologies and preferred to have user control over multilingual Web searches.