Exploiting Hierarchy in Text Categorization

Authors:	Andreas S. Weigend, Erik D. Wiener, Jan O. Pedersen

Institution:	(1) T.J. Watson Research Center, IBM Corporation, Kitchawan Road 1101, Yorktown Heights, 10598, NY, USA; (2) School of Computer Science & Engg., University of New South Wales, Sydney, 2052, NSW, Australia; (3) Dept. Statistics, Hill Center, Rutgers University, Piscataway, 08854-8019, NJ, USA

Abstract:	With the recent dramatic increase in electronic access to documents, text categorization—the task of assigning topics to a given document—has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
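The two-level idea in the abstract can be illustrated with a minimal Python sketch. It assumes the levels are combined by the product rule, P(topic | doc) = P(meta-topic | doc) × P(topic | meta-topic, doc); the abstract says only that the model is probabilistic, so this combination rule is an assumption. The "grains" group, the function names, the example probabilities, and the 0.5 threshold are hypothetical; only the metals grouping (gold, silver, copper) comes from the abstract. The probabilities are passed in as plain dictionaries, standing in for whatever models produce them.

```python
from typing import Dict, List

# Hypothetical topic hierarchy. Only "metals" is taken from the abstract;
# "grains" is an invented second group for illustration.
HIERARCHY: Dict[str, List[str]] = {
    "metals": ["gold", "silver", "copper"],
    "grains": ["wheat", "corn"],
}


def combine_levels(
    meta_probs: Dict[str, float],
    topic_probs_within_group: Dict[str, float],
    hierarchy: Dict[str, List[str]] = HIERARCHY,
) -> Dict[str, float]:
    """Combine first-level and second-level outputs for one document.

    meta_probs[m]               ~ P(meta-topic m | doc)
    topic_probs_within_group[t] ~ P(topic t | its meta-topic, doc)

    Assumed combination rule (not stated in the abstract):
    P(topic | doc) = P(meta | doc) * P(topic | meta, doc).
    """
    combined: Dict[str, float] = {}
    for meta, topics in hierarchy.items():
        for topic in topics:
            combined[topic] = meta_probs[meta] * topic_probs_within_group[topic]
    return combined


def assign_topics(topic_probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every topic whose probability reaches the threshold.

    Thresholding each topic independently allows zero, one, or several
    topics per document.
    """
    return sorted(t for t, p in topic_probs.items() if p >= threshold)


if __name__ == "__main__":
    # Made-up model outputs for a single document.
    meta = {"metals": 0.9, "grains": 0.1}
    within = {"gold": 0.8, "silver": 0.7, "copper": 0.1, "wheat": 0.5, "corn": 0.5}
    probs = combine_levels(meta, within)
    print(assign_topics(probs))
```

With these made-up inputs, the document is assigned both gold and silver. Wheat and corn are suppressed by the low first-level probability of their group, even though their within-group scores are moderate.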

Keywords:	information retrieval; text mining; topic spotting; text categorization; knowledge management; problem decomposition; machine learning; neural networks; probabilistic models; hierarchical models; performance evaluation
This article is indexed in SpringerLink and other databases.
