Improvement of Text Feature Weighting Method Based on TF-IDF Algorithm
Cite this article: Lu Yonghe, Li Yanfeng. Improvement of Text Feature Weighting Method Based on TF-IDF Algorithm[J]. Library and Information Service (图书情报工作), 2013, 57(3): 90-95.
Authors: Lu Yonghe, Li Yanfeng
Institution: School of Information Management, Sun Yat-sen University, Guangzhou 510006
Funding: National High Technology Research and Development Program of China (863 Program) project "Multi-source information sensing technology and product development for the whole agricultural product supply chain"; Guangdong Province Philosophy and Social Sciences Twelfth Five-Year Plan project "Empirical study on the characteristics of Chinese farmers' information needs and their acquisition channels"
Abstract: First, from the perspective of feature importance and category-distinguishing ability, this paper analyzes the traditional weighting function TF-IDF (term frequency-inverse document frequency) and its related improved algorithms, studies how feature weights are calculated when texts are vectorized for text categorization, and constructs a weight-correction function, TW. Second, comparative experiments on the chi-square statistic (CHI) and the TW value of feature terms verify that TW increases the weight of terms specific to a category and decreases the weight of features that are common but unimportant for classification. Finally, TW is combined with TF-IDF to form a new feature weighting algorithm, which is compared with other weighting algorithms in classification experiments on a Chinese classification corpus to verify its effectiveness.

Keywords: text categorization; term frequency-inverse document frequency (TF-IDF); feature weighting; category distinguishing
Received: 2012-10-12
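
The two quantities the abstract compares against, the traditional TF-IDF weight and the chi-square (CHI) statistic of a term, can be sketched as follows. This is a minimal illustration that assumes the common textbook forms, tf × log(N/df) and the 2×2 contingency-table chi-square; the paper's exact variants may differ. The abstract does not define the TW correction function, so it is not reproduced here. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def chi_square(docs, labels, term, category):
    """Chi-square statistic of `term` for `category` from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy corpus: two categories, one term ("market") in every document.
docs = [
    ["rice", "price", "market"],
    ["rice", "harvest", "market"],
    ["football", "match", "market"],
    ["football", "league", "market"],
]
labels = ["agri", "agri", "sport", "sport"]

weights = tf_idf(docs)
rice_chi = chi_square(docs, labels, "rice", "agri")
market_chi = chi_square(docs, labels, "market", "agri")
```

In this sketch TF-IDF never looks at `labels`, while the chi-square statistic does. That gap between term weighting and category information is the one the abstract says the TW correction is meant to address.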
This article is indexed in Wanfang Data and other databases.