Exploiting Hierarchy in Text Categorization |
| |
Authors: | Andreas S. Weigend, Erik D. Wiener, Jan O. Pedersen |
| |
Institution: | (1) T.J. Watson Research Center, IBM Corporation, Kitchawan Road 1101, Yorktown Heights, 10598, NY, USA; (2) School of Computer Science & Engineering, University of New South Wales, Sydney, 2052, NSW, Australia; (3) Dept. Statistics, Hill Center, Rutgers University, Piscataway, 08854-8019, NJ, USA |
| |
Abstract: | With the recent dramatic increase in electronic access to documents, text categorization—the task of assigning topics to a given document—has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes. |
| |
Keywords: | information retrieval text mining topic spotting text categorization knowledge management problem decomposition machine learning neural networks probabilistic models hierarchical models performance evaluation |
This article is indexed in SpringerLink and other databases. |
|
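The two-level architecture described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the code below assumes the simplest probabilistic reading: the score for a topic is the first-level probability of its meta-topic multiplied by the second-level probability of the topic within that group. The function names, the threshold, the "grains" group, and all numbers are hypothetical and are not taken from the article.

```python
from typing import Dict, List


def combine_hierarchical(
    meta_probs: Dict[str, float],
    topic_given_meta: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Assumed rule: P(topic | doc) = P(meta | doc) * P(topic | meta, doc)."""
    topic_probs: Dict[str, float] = {}
    for meta, p_meta in meta_probs.items():
        # Second-level models only discriminate within their own group.
        for topic, p_cond in topic_given_meta.get(meta, {}).items():
            topic_probs[topic] = p_meta * p_cond
    return topic_probs


def assign_topics(topic_probs: Dict[str, float], threshold: float) -> List[str]:
    """Return every topic whose score reaches the threshold (zero, one, or many)."""
    return sorted(t for t, p in topic_probs.items() if p >= threshold)


# Hypothetical outputs of the first-level (meta-topic) model for one document.
meta_probs = {"metals": 0.8, "grains": 0.1}

# Hypothetical outputs of the second-level (per-topic) models.
topic_given_meta = {
    "metals": {"gold": 0.5, "silver": 0.25, "copper": 0.25},
    "grains": {"wheat": 0.5, "corn": 0.5},
}

scores = combine_hierarchical(meta_probs, topic_given_meta)
assigned = assign_topics(scores, threshold=0.3)
```

Thresholding each score independently allows a document to receive one topic, several, or none, which is one way to accommodate the single and multiple topic assignments the abstract mentions.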