一种快速中文分词词典机制 |
| |
Authors (Chinese names): | 吴晶晶 荆继武 聂晓峰 王平建 |
| |
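Affiliations: | 1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027;
2. State Key Laboratory of Information Security, Graduate University of the Chinese Academy of Sciences, Beijing 100049 |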
| |
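Funding: | Supported by the National High Technology Research and Development Program of China (863 Program) (2006AA01Z454), the National Information Security 242 Program (2005B23), and the National Natural Science Foundation of China (60573015) |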
| |
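Abstract (translated from the Chinese): | A survey of current Chinese word segmentation mechanisms shows that the key to fast segmentation is the recognition of single-character and two-character words. Based on this observation, we propose a fast Chinese word segmentation mechanism, the double-character-word and long-word hash mechanism, which improves segmentation by making lookups of single-character and two-character words more efficient. Experiments show that the mechanism improves the efficiency of segmenting Chinese text.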
|
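Keywords (translated from the Chinese): | real-time text processing; Chinese word segmentation; dictionary-based word segmentation; double-character-word and long-word hash mechanism |
Received: | 2008-10-16 |
Revised: | 2009-04-21 |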
Fast dictionary mechanism for Chinese word segmentation |
| |
Authors: | WU Jing-Jing JING Ji-Wu NIE Xiao-Feng WANG Ping-Jian |
| |
Institution: | 1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China;
2. State Key Laboratory of Information Security, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China |
| |
Abstract: | With the growth of global networking through the Internet, the number of articles in Chinese and other native languages is increasing rapidly. Because these character-based languages lack explicit word separators, word segmentation is a precondition for processing them, and it therefore affects the performance of the whole system. In this paper, we propose a new lexicon-based solution to the Chinese word segmentation problem, named double-character-and-long-word-hash-indexing (DCLWHI). Compared with traditional lexicon mechanisms, DCLWHI improves the speed and efficiency of word segmentation without additional memory overhead while achieving the same accuracy. |
| |
Keywords: | real-time text processing; Chinese word segmentation; lexicon mechanism; double-character-and-long-word-hash-indexing (DCLWHI) |
|
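The abstract names a hash over two-character words combined with a long-word table, but gives no implementation details. The following is a minimal Python sketch of one way such a lexicon could be organized, not the authors' actual DCLWHI data structure. The class `TwoCharHashLexicon`, its fields, the `segment` function, and the sample word list are all hypothetical, and the forward longest-match strategy is an assumption made for illustration.

```python
class TwoCharHashLexicon:
    """Hypothetical lexicon keyed on the first two characters of each word."""

    def __init__(self, words):
        self.two_char_words = set()  # two-character words: one hash lookup
        self.long_words = {}         # 2-char prefix -> set of longer words
        self.max_len = {}            # 2-char prefix -> longest word length
        for w in words:
            if len(w) == 2:
                self.two_char_words.add(w)
            elif len(w) > 2:
                key = w[:2]
                self.long_words.setdefault(key, set()).add(w)
                self.max_len[key] = max(self.max_len.get(key, 0), len(w))

    def longest_match(self, text, i):
        """Return the longest lexicon word starting at text[i], else one character."""
        key = text[i:i + 2]
        if len(key) == 2:
            candidates = self.long_words.get(key)
            if candidates:
                # Try long words first, from the longest possible length down to 3.
                for n in range(self.max_len[key], 2, -1):
                    if text[i:i + n] in candidates:
                        return text[i:i + n]
            if key in self.two_char_words:
                return key
        # Fall back to a single character (assumed to always be a valid token).
        return text[i]


def segment(text, lexicon):
    """Forward longest-match segmentation (an assumed strategy)."""
    tokens = []
    i = 0
    while i < len(text):
        word = lexicon.longest_match(text, i)
        tokens.append(word)
        i += len(word)
    return tokens


# Hypothetical sample lexicon and input.
sample_lexicon = TwoCharHashLexicon(["中国", "中国科学院", "科学", "中文", "分词"])
sample_tokens = segment("中国科学院研究中文分词", sample_lexicon)
```

In this sketch a two-character word is resolved with a single set lookup, and the long-word table is consulted only when the two-character prefix actually begins some longer entry, which is one plausible reading of how faster lookups of short words could speed up segmentation.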