

DeASCIIfication approach to handle diacritics in Turkish information retrieval
Institution: 1. Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco. C/Iván Pavlov, s/n., 28049 Madrid, Spain; 2. Universidad Nacional de Educación a Distancia, Juan del Rosal, nº 10. 28023, Spain; 3. Semantia Lab, Bravo Murillo, 38. 28015, Madrid, Spain; 1. College of Education Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China; 2. College of Business and Administration, Zhejiang University of Technology, Hangzhou, 310023, China; 3. College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, 410082, China
Abstract: The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These invalid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer, or a simple first-n-characters stemmer, is used in the text analysis pipeline. This difference highlights the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results than normalizing tokens to remove diacritics.
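The two normalization directions described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' system: the function names (`asciify`, `build_index`, `deasciify`) and the five-word toy lexicon are assumptions made for illustration, and the restoration step only enumerates lexicon candidates, whereas a practical deASCIIfier would also need context to pick among them.

```python
# Turkish-specific letters mapped to their ASCII counterparts.
ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(token: str) -> str:
    """ASCIIfication: replace Turkish diacritic letters with ASCII ones."""
    return token.translate(ASCII_MAP)


def build_index(lexicon):
    """Group valid words by their ASCIIfied form (hypothetical helper)."""
    index = {}
    for word in lexicon:
        index.setdefault(asciify(word), set()).add(word)
    return index


def deasciify(token: str, index) -> list:
    """Toy deASCIIfication: list every lexicon word that collapses to the
    same ASCII form as the token; return the token itself if none match."""
    return sorted(index.get(asciify(token), {token}))


# Toy lexicon (an assumption for this sketch, not data from the study).
LEXICON = ["oldu", "öldü", "acı", "açı", "ılık"]
INDEX = build_index(LEXICON)

# "öldü" (died) collapses onto another legitimate word, "oldu" (became).
print(asciify("öldü"))
# "açı" (angle) collapses to "aci", which is not a Turkish word.
print(asciify("açı"))
# Restoration recovers the valid candidates for an ASCII-only query term.
print(deasciify("aci", INDEX))
```

The first case shows the homographic ambiguity the abstract mentions, and the second shows a synthetic non-word that a dictionary-based stemmer or lemmatizer would be unable to analyze.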
This article is indexed in ScienceDirect and other databases.