Extending WHIRL with background knowledge for improved text classification |
| |
Authors: | Sarah Zelikovitz, William W. Cohen, Haym Hirsh |
| |
Institution: | (1) Computer Science Department, College of Staten Island of CUNY, 2800 Victory Blvd, Staten Island, NY 10314, USA; (2) Center for Automated Learning and Discovery, Carnegie Mellon University, 500 Forbes Avenue, Pittsburgh, PA 15213, USA; (3) Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019, USA |
| |
Abstract: | Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating
heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual
data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is
designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for
inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification
process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases
error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially
useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular
problem on the World Wide Web.
|
| |
Keywords: | Text categorization; Background knowledge; Semi-supervised learning; Information retrieval; Query processing |
This article is indexed in SpringerLink and other databases. |
|