Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework
Institution:1. School of Electrical and Computer Engineering, College of Engineering, University of Tehran, P.O. Box 14395-515 Tehran, Iran;2. School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746 Tehran, Iran;3. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong
Abstract:A main challenge in Cross-Language Information Retrieval (CLIR) is estimating a proper translation model from the available translation resources, since translation quality directly affects retrieval performance. Among the different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model for the CLIR task from comparable corpora, without requiring any additional linguistic resources. In the first step, translations are extracted by deriving correlations between source–target word pairs. In the second step, these correlations are used to estimate word translation probabilities. We propose a language modeling approach for the first step, where modeling based on probability distributions provides two key advantages. First, our approach can be tuned more easily than previous work, which relies on heuristic adjustment. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations extracted from comparable corpora. As an illustration, we integrate monolingual word co-occurrence relations into the translation extraction process, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms previous approaches in terms of both translation quality and CLIR performance. Moreover, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of the word translation probabilities, estimated in the second step of our approach, on CLIR performance.
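The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give the language modeling estimator, so step 1 here substitutes a plain Dice coefficient over aligned document pairs, and step 2 simply normalizes the top-scoring candidates into probabilities. The function names, the `top_k` parameter, and the transliterated toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def correlation_scores(aligned_pairs):
    """Step 1 (stand-in): score source-target word pairs.

    Assumption: the Dice coefficient over aligned document pairs replaces
    the paper's language modeling estimator, which the abstract does not specify.
    """
    src_df, tgt_df, pair_df = Counter(), Counter(), Counter()
    for src_doc, tgt_doc in aligned_pairs:
        src_words, tgt_words = set(src_doc), set(tgt_doc)
        src_df.update(src_words)   # document frequency on the source side
        tgt_df.update(tgt_words)   # document frequency on the target side
        for s in src_words:
            for t in tgt_words:
                pair_df[(s, t)] += 1  # joint frequency across aligned pairs
    scores = defaultdict(dict)
    for (s, t), n in pair_df.items():
        scores[s][t] = 2.0 * n / (src_df[s] + tgt_df[t])
    return scores


def translation_probabilities(scores, top_k=3):
    """Step 2 (stand-in): normalize the top_k scores of each source word."""
    probs = {}
    for s, candidates in scores.items():
        # Sort by descending score, breaking ties alphabetically.
        best = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        total = sum(score for _, score in best)
        probs[s] = {t: score / total for t, score in best}
    return probs


# Hypothetical toy corpus: each entry is one aligned (source, target) document pair.
aligned_pairs = [
    (["book", "library"], ["ketab", "ketabkhaneh"]),
    (["book", "author"], ["ketab", "nevisandeh"]),
    (["library", "city"], ["ketabkhaneh", "shahr"]),
]

scores = correlation_scores(aligned_pairs)
probs = translation_probabilities(scores)
best_for_book = max(probs["book"], key=probs["book"].get)
```

In a retrieval setting, a table like `probs` would typically be used to weight the candidate translations of each query term rather than committing to a single one, which is where the second step's probability estimates affect CLIR performance.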
Keywords:
This article is indexed in ScienceDirect and other databases.