Exploiting Hierarchy in Text Categorization
Authors:Andreas S Weigend  Erik D Wiener  Jan O Pedersen
Institution:(1) T.J. Watson Research Center, IBM Corporation, Kitchawan Road 1101, Yorktown Heights, 10598, NY, USA;(2) School of Computer Science & Engg., University of New South Wales, Sydney, 2052, NSW, Australia;(3) Dept. Statistics, Hill Center, Rutgers University, Piscataway, 08854-8019, NJ, USA
Abstract:With the recent dramatic increase in electronic access to documents, text categorization (the task of assigning topics to a given document) has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into "meta-topics", e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
Keywords:information retrieval  text mining  topic spotting  text categorization  knowledge management  problem decomposition  machine learning  neural networks  probabilistic models  hierarchical models  performance evaluation
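
The two-level idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the two levels are combined, so the product rule P(topic | doc) = P(meta-topic | doc) × P(topic | meta-topic, doc) used here is an assumption, as are the `grains` group, the function name `combine_levels`, and all probability values. Only the metals grouping (gold, silver, copper) comes from the abstract.

```python
from typing import Dict, List

# Hypothetical topic hierarchy: meta-topic -> member topics.
# "metals" follows the abstract's example; "grains" is invented for illustration.
HIERARCHY: Dict[str, List[str]] = {
    "metals": ["gold", "silver", "copper"],
    "grains": ["wheat", "corn"],
}


def combine_levels(
    meta_probs: Dict[str, float],
    within_probs: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Combine first-level and second-level outputs with a product rule.

    Assumed rule: P(topic | doc) = P(meta | doc) * P(topic | meta, doc).
    Each topic gets its own probability, so a document may receive
    several topics (the within-group values need not sum to 1).
    """
    topic_probs: Dict[str, float] = {}
    for meta, topics in HIERARCHY.items():
        for topic in topics:
            topic_probs[topic] = meta_probs[meta] * within_probs[meta][topic]
    return topic_probs


# Made-up model outputs for a single document.
meta_probs = {"metals": 0.9, "grains": 0.1}
within_probs = {
    "metals": {"gold": 0.8, "silver": 0.3, "copper": 0.05},
    "grains": {"wheat": 0.5, "corn": 0.5},
}

topic_probs = combine_levels(meta_probs, within_probs)

# Assign every topic whose combined probability clears a threshold
# (the threshold value is also an assumption).
assigned = sorted(t for t, p in topic_probs.items() if p >= 0.25)
print(assigned)
```

In this sketch the second-level models only need to separate topics inside one group, which is the "finer discriminations" role the abstract describes; a low first-level probability suppresses every topic in that group regardless of the second-level scores.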
This article is indexed in SpringerLink and other databases.