Similar literature
 20 similar documents found (search time: 95 ms)
1.
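Text classification is one of the important applications of information retrieval. Because all documents are represented in a single, uniform feature-vector form, each document's feature vector is high-dimensional and sparse, which degrades classification performance and accuracy. To improve the accuracy of text feature selection, this paper first proposes an improved feature selection function based on information gain. KNN (K-Nearest Neighbor) is a widely used algorithm in text classification; to address the heavy computation of classical KNN and the limited accuracy of its class-labeling function, the paper proposes a weighted KNN algorithm based on training-set pruning. The algorithm improves computational efficiency by pruning the training set and improves accuracy through a fuzzy-set membership function. Experimental results and analysis on public data demonstrate the effectiveness of the algorithm.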
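The abstract does not give the improved selection function, so the following is only a minimal sketch of the standard information gain score that such a method would start from. It assumes each document is a set of terms and scores a term by its presence or absence; the names `entropy` and `information_gain` are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, term):
    """Standard IG of the presence/absence of `term` (not the paper's improved variant).

    docs   -- list of sets of terms, one set per document
    labels -- list of class labels aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Expected entropy of the labels once we know whether the term occurs.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional

# Toy corpus (made-up data for illustration only).
docs = [{"ball", "goal"}, {"ball"}, {"stock"}, {"stock", "bank"}]
labels = ["sport", "sport", "finance", "finance"]
scores = {t: information_gain(docs, labels, t) for t in ("ball", "bank")}
```

Terms would then be ranked by score and only the top-scoring ones kept as features.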

2.
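KNN is a widely used algorithm in text classification. Because it is an instance-based algorithm, the number and distribution of training samples affect the performance of a KNN classifier. Reasonable sample pruning and sample weighting can improve the classifier's efficiency. This paper proposes an improved KNN model based on the distribution of samples. First, the training set is pruned according to sample position to save computation; then the classifier's weighting scheme is optimized for class skew, mitigating the dominance of large, high-density classes when the k nearest neighbors are selected. Experimental results show that the proposed improved KNN text classification algorithm raises the classification efficiency of KNN.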
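The abstract does not specify the weighting scheme. As a rough illustration of the idea of correcting for class skew, the sketch below divides each neighbor's vote by the size of its class; both the closeness measure and the division by class size are assumptions made here, not the paper's formulas.

```python
import math
from collections import Counter, defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def skew_aware_knn(train, query, k):
    """Classify `query` with votes scaled down for over-represented classes.

    train -- list of (vector, label) pairs
    """
    class_sizes = Counter(label for _, label in train)
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    scores = defaultdict(float)
    for vector, label in neighbours:
        closeness = 1.0 / (1.0 + euclidean(vector, query))  # assumed similarity measure
        scores[label] += closeness / class_sizes[label]      # assumed skew correction
    return max(scores, key=scores.get)

# Made-up data: class "A" is large and dense, class "B" is small.
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.0, 0.1), "A"),
         ((0.2, 0.2), "A"), ((0.3, 0.3), "A"), ((0.4, 0.4), "A"),
         ((1.0, 1.0), "B"), ((1.1, 1.1), "B")]
label = skew_aware_knn(train, (0.8, 0.8), k=5)
```

In this toy example three of the five nearest neighbors belong to the large class, so a plain majority vote would pick "A", while the scaled vote favors the closer, smaller class.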

3.
A density-based improved KNN method for text classification   Total citations: 2 (self-citations: 1, citations by others: 1)
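Feature dimensionality reduction and classifier performance are the two main problems in automatic text classification. KNN is often used for text classification because it is simple, effective, and non-parametric, but an uneven distribution of training texts harms its classification results, and uneven distributions are common in practice. For this setting, the paper first proposes an improved tf-idf weighting method for feature dimensionality reduction, and on that basis proposes a density-based improved KNN method for text classification, which increases the distances between sample points lying in densely populated regions. Subsequent text classification experiments show that the proposed density-based KNN method classifies text well.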
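For orientation, the baseline that an improved tf-idf weighting would modify can be sketched as follows. This is the textbook form (relative term frequency times the natural log of inverse document frequency), not the paper's variant, and the function name `tf_idf` is chosen here for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook tf-idf. docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weighted

# Made-up two-document corpus.
weights = tf_idf([["knn", "text", "text"], ["knn", "svm"]])
```

A term that occurs in every document (here "knn") receives weight 0, which is why low-weight terms are natural candidates for removal during dimensionality reduction.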

4.
A fast KNN algorithm for Web text classification   Total citations: 12 (self-citations: 0, citations by others: 12)
王煜  白石  王正欧 《情报学报》2007,26(1):60-64
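KNN is a simple, effective, non-parametric method for Web text classification. The obvious drawback of traditional KNN is the heavy cost of computing sample similarities, which makes it impractical for Web text classification with large numbers of high-dimensional samples. This paper proposes FKNN (Fast-k-Nearest-Neighbor), an algorithm that quickly finds the exact k nearest neighbors. FKNN first selects one sample as a reference point, sorts all samples by their distance to the reference sample, and builds an index table; it then uses the index table and an ordered queue to find the k nearest neighbors, which narrows the search range and greatly reduces the number of similarity computations.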
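The abstract gives only the outline of FKNN, so the sketch below is one plausible reconstruction rather than the authors' algorithm. It assumes Euclidean distance and uses the triangle inequality, |d(x, r) - d(q, r)| <= d(x, q), to stop scanning the sorted index once no remaining sample can beat the current k-th neighbor. The function names and the use of a heap are choices made here.

```python
import bisect
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(samples, reference):
    """Sort samples by distance to the reference point; return (distances, sample indices)."""
    keyed = sorted((euclidean(s, reference), i) for i, s in enumerate(samples))
    return [d for d, _ in keyed], [i for _, i in keyed]

def fast_knn(samples, reference, ref_dists, order, query, k):
    """Return (indices of the exact k nearest samples, number of distances computed)."""
    dq = euclidean(query, reference)
    right = bisect.bisect_left(ref_dists, dq)
    left = right - 1
    heap = []        # max-heap via negated distances: (-distance, sample index)
    evaluated = 0
    while left >= 0 or right < len(order):
        # Lower bound on d(x, query) for the next candidate on each side.
        lb_left = dq - ref_dists[left] if left >= 0 else math.inf
        lb_right = ref_dists[right] - dq if right < len(order) else math.inf
        if lb_left <= lb_right:
            bound, pos = lb_left, left
            left -= 1
        else:
            bound, pos = lb_right, right
            right += 1
        if len(heap) == k and bound >= -heap[0][0]:
            break    # no remaining candidate can be closer than the current k-th neighbour
        idx = order[pos]
        d = euclidean(samples[idx], query)
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    return sorted(idx for _, idx in heap), evaluated

# Made-up 2-D data; real text vectors would be high-dimensional.
samples = [(float(i), float((i * 7) % 11)) for i in range(30)]
reference = (0.0, 0.0)
ref_dists, order = build_index(samples, reference)
query = (12.3, 4.6)
found, evaluated = fast_knn(samples, reference, ref_dists, order, query, 3)
```

The index is built once per training set, and the `evaluated` counter shows how many full distance computations a query actually needed compared with the size of the training set.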

5.
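Automatic extraction of proper nouns is a key technology in text mining, information retrieval, machine translation, and related fields. This paper studies a method that combines two classifiers, SVM and KNN, to extract Chinese proper nouns automatically. Different classification methods are used for different sample distributions in the feature space: when the distance between a test sample and the SVM optimal hyperplane exceeds a given threshold, SVM is used; otherwise KNN is used. In real training corpora, negative samples often far outnumber positive samples, and traditional KNN is sensitive to imbalanced training sets, so a normalization-based modification of traditional KNN is proposed. Experiments show that the combination of SVM and the modified KNN outperforms both SVM alone and the original SVM-KNN method for Chinese proper noun extraction, and that the method can be extended to other classification problems with imbalanced sample distributions.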
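The switching rule described above can be sketched as follows. The sketch assumes a linear SVM that has already been trained, so its hyperplane is given directly as a weight vector `w` and bias `b`, and it uses plain majority-vote KNN; the normalization-based correction for imbalanced data is not specified in the abstract and is not reproduced here.

```python
import math
from collections import Counter

def hyperplane_distance(w, b, x):
    """Signed distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def knn_label(train, x, k):
    """Majority label among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda item: sum((a - c) ** 2 for a, c in zip(item[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def svm_knn_classify(w, b, train, x, threshold, k=3):
    """Trust the SVM far from the hyperplane; fall back to KNN close to it."""
    d = hyperplane_distance(w, b, x)
    if abs(d) > threshold:
        return 1 if d > 0 else -1
    return knn_label(train, x, k)

# Made-up hyperplane and training points (labels are +1 / -1).
w, b = (1.0, 0.0), 0.0
train = [((-0.1, 0.0), 1), ((-0.2, 0.1), 1), ((-0.15, -0.1), 1),
         ((0.5, 0.0), 1), ((-0.5, 0.0), -1)]
near = svm_knn_classify(w, b, train, (-0.05, 0.0), threshold=0.2)
far = svm_knn_classify(w, b, train, (-3.0, 0.0), threshold=0.2)
```

In this toy data the point near the hyperplane lies on the negative side, but its nearest neighbors are positive, so the two classifiers disagree and the KNN decision is used.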

6.
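The rapid growth of the Internet has made text data grow exponentially, and classifying text content efficiently has become an urgent problem for information resource management. Using VIP (维普) academic journal resources and Baidu News web pages as the base corpus, this paper extracts document topics and segments text content with an LDA model, integrates the ensemble learning algorithm CatBoost to obtain each document's probability distribution over topics, then trains a classifier on the latent topic-text matrix extracted from the training set, and finally builds a text classification system. The results show that the system can effectively perform automatic classification of mixed texts, with a low classification error rate and performance clearly better than traditional text classification methods.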

7.
王煜  白石  王正欧 《情报学报》2007,26(5):643-647
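This paper proposes a weight-optimized distance formula for measuring sample similarity, improving the KNN text classification algorithm. KNN usually adopts the traditional VSM (vector space model), in which every feature has the same weight, which makes it ill-suited to text processing. The paper first applies a sensitivity method from neural network theory to adjust the weight of each feature in the text feature vector, and uses a neural network feature selection method with reduced computation for a second round of dimensionality reduction. Then, because the same feature contributes differently to the classification of different text classes, the feature weights in the distance formula are further refined, which further improves the accuracy of the KNN text classification algorithm.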
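The paper's actual distance formula is not given in the abstract. A minimal sketch of the general idea, assuming a weighted Euclidean distance and a hypothetical per-class weight table (`class_weights`), might look like this:

```python
import math

def weighted_distance(x, y, weights):
    """Euclidean distance in which feature i contributes with weight weights[i]."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def class_weighted_distance(query, sample, label, class_weights):
    """Distance to a training sample using the weights of that sample's class.

    class_weights -- hypothetical dict mapping each class label to a weight vector
    """
    return weighted_distance(query, sample, class_weights[label])

# Made-up weights: the second feature matters only for class "sport".
class_weights = {"sport": (1.0, 1.0), "finance": (1.0, 0.0)}
d_sport = class_weighted_distance((0.0, 0.0), (3.0, 4.0), "sport", class_weights)
d_finance = class_weighted_distance((0.0, 0.0), (3.0, 4.0), "finance", class_weights)
```

With equal weights this reduces to the ordinary VSM distance; how the weights are learned (the sensitivity method) is outside the scope of the sketch.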

8.
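First, an automatic text classification method based on the fuzzy vector space model and a radial basis function network is proposed. The network consists of an input layer, a hidden layer, and an output layer: the input layer takes in the samples to be classified, the hidden layer extracts the pattern features implicit in the input samples, and the output layer presents the classification result. Second, a more detailed algorithm derivation and implementation scheme is constructed. Finally, the method is validated on a subset of documents from the China Journal Net (中国期刊网) full-text database; the results show that it classifies well.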

9.
王效岳  白如江 《情报学报》2006,25(4):475-480
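Combining attribute reduction from rough sets with the classification mechanism of neural networks, a hybrid algorithm is proposed. Rough-set attribute reduction is first applied as a preprocessor to delete redundant attributes from the decision table, and a neural network then performs the classification. This greatly reduces the vector dimensionality and overcomes the sensitivity of rough sets to noise in the decision table. Experimental results show that, compared with the traditional classification methods naive Bayes, SVM, and KNN, the method markedly improves classification speed while maintaining classification accuracy, and shows good stability and fault tolerance; it is especially suitable for texts with many features that are hard to classify.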

10.
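From 《中国植物志》 (Flora of China), 1,000 documents were randomly sampled as the data set, and an algorithm combining self-learned rules with leading words was used for semantic annotation of Chinese species description texts. The experimental data show that the rule-based algorithm designed in this study reaches an overall annotation performance (F-score) of 0.930, with F-scores for most elements between 0.724 and 0.964, and that it outperforms the naive Bayes classification algorithm. The results also show that leading words help optimize the algorithm.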

11.

Rhetorical criticism (i.e., textual analysis of speeches) is severely handicapped because speech cannot be adequately represented in writing; even if it could be so represented, it is illogical and presumptuous to study critically oral communication received from an inappropriate medium (printed page) via an inappropriate sensory channel (vision). The author proposes a hierarchy of research priorities, in which relative worth is judged by the degree to which methodologies extend our knowledge of rhetorical theory.

12.
《Journalism Practice》2013,7(2):187-200
This article begins with the assertion that creativity in journalism has moved from being a matter of guile and ingenuity to being about expressiveness, and that this reflects a broader cultural shift from professional expertise to the authenticity of personal expression as dominant modes of valorization. It then seeks to unpack the normative baggage that underpins the case for creativity in the cultural industries. First, there is a prioritization of agency, which does not stand up against the phenomenological argument that we do not own our own practices. Second, creative expression is not necessarily more free, simply alternately structured. As with Judith Butler's performativity model, contemporary discourses of creativity assume it to have a unique quality by which it eludes determination (relying on tropes of fluidity), whereas it can be countered that it is in spontaneous, intuitive practice that we are at our least agentive. Third, the article argues against the idea that by authorizing journalists (and audiences) to express themselves, creativity is democratizing, since the always-already nature of recognition means that subjects can only voice their position within an established terrain rather than engage active positioning.

13.
The dilemma of implementing macroappraisal is to transform theory and methodology into selection and preservation of archival records through disposition procedures. Having shifted the focus from the record to the function from which it derives, how does a program or an appraisal project committed to the macroappraisal approach get back to the record to ensure compliance and accountability? This paper uses the experience of Library and Archives Canada (LAC) as a form of case study (a model for success) which examines how applied theory and program practice come to terms with each other. It analyses the tensions, the challenges, and the creativity that inevitably arise when turning macroappraisal from an appraisal methodology into a fully articulated archival disposition program whose final “deliverable” is the archival record. Making things simple, it turns out, is complicated.

14.
Knowledge flow between scientific disciplines has commonly been measured based on citation data. Previous studies using citing relationships have mostly considered direct citations but have paid little attention to indirect citations (IDC), which indicate how knowledge diffuses from one discipline to another via one or more intermediaries. In this study, we measured knowledge flow between disciplines from two perspectives: direct citations (DC) and discipline potential energy (DPE), which is proposed to combine both direct and indirect citations. Data were collected from the Web of Science (WoS) database. Findings include: (1) DPE improves on previous measures by considering not only direct citations but also the indirect citations between disciplines that those measures usually ignored, and it reveals that the knowledge contribution of some disciplines, such as Physics and Engineering, had been underestimated. (2) The proportion of IDC contribution is close to that of direct knowledge contribution when the discipline scale is removed, which suggests that it is essential to consider IDC to distinguish the knowledge relationship (net outflow/inflow) between disciplines. (3) Both measurements show that Biology & Biochemistry has always been the discipline with the highest net outflow of knowledge, which is inconsistent with the history of science, according to which Mathematics, Physics and Chemistry would be the highest net outflow disciplines. The results show that even considering IDC does not fully reveal the knowledge contribution and academic influence of disciplines. This paper also analyzes the potential reasons for citation bias in revealing the contribution of disciplinary knowledge from a citation perspective. Therefore, caution should be taken in the use of citations as a primary measure of knowledge flow.

15.
The long-term storage and retrieval requirements of research organizations have raised questions of what can go wrong with magnetic media and how long they will last. This paper addresses such concerns. This paper has been updated by the author since it appeared in Magnetic Tape Storage and Handling: A Guide for Libraries and Archives (to which references to an Appendix refer), which is available from the Commission on Preservation and Access, 1400 16th St. NW, Washington, DC 20036-2217. Published with permission.

16.
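Starting from the requirements for preventing and monitoring food and drug safety incidents, this paper explains the important role of enterprise laboratory platforms, and discusses the practicality and feasibility of using them as the basis for developing and building, in a cloud environment, a large platform for quality monitoring across the whole process from source to consumer. Drawing on years of experience in promoting laboratory information management systems, it also argues that computer application systems can only be truly effective when people-centered e-health is at their core.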

17.
The monograph by A.V. Sokolov, “The Philosophy of Information” (St. Petersburg, 2010), which is dedicated to the conceptualization of information as one of the philosophical categories, is reviewed. Doubts are expressed about the legitimacy of the author’s treatment of the phenomenon of information from the perspective of dualistic monism, which recognizes the existence of a substance with two opposite “hypostases” that cannot be reduced to one another. Specifically, the doubts concern the author’s statement that the nature of information is neither ideal nor material but ambivalent, i.e., that it is an indissoluble unity of the material (in this case, the carrier) and ideal (meaning) principles. This doubt is based on the premise that the ideal is not an independent substance alongside the material, but is just one form of existence of matter. Therefore, our review proposes to treat information not as an ambivalent element that belongs to two substantival principles, which is characteristic of dualism, but as an element of a monosubstantival principle, as a type of matter.

18.
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanism. For each query, the distributed system determines whether a partial replica is a good match and then searches it; otherwise it searches the original collection. We demonstrate the scenarios where partial replication performs better than systems that use caches which only store previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.

19.
There is evidence that national scientific journals are important for local communities despite their limited audience due to national languages and topics, as in pedagogy. However, it is not easy to assess the level of scientific rigour of local journals, as most do not have available scientometric data and are often published in minority languages. We hypothesize that a possible manifestation of a latent trait of inner authenticity of the scientific journal (meaning the journal is accepted by a community interested in developing the field which conducts internationally accepted research) could be the H-index of the editorial board members. To test this approach, we evaluated the H-index and gender of editorial board members (n = 490) from 17 Czech and Slovak national science-oriented scientific pedagogical journals which were not indexed or were indexed in Erih+ or Scopus, and compared this with the five lowest-rated journals from the same field indexed in the Web of Science (WoS) database. The H-index of editorial board members was somewhat higher in indexed journals, with those from WoS showing higher scores, and the number of board members with no discernible H-index was far greater in non-indexed journals. Editorial boards of journals indexed in WoS were mostly male, compared to a dominance of women on boards of non-indexed journals. Acknowledging the limited sample, it appears that the H-index of editorial board members may be a way to value national scientific journals.
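For readers unfamiliar with the measure, the H-index of a researcher is the largest h such that h of their papers have at least h citations each. A minimal sketch of the computation, on made-up citation counts, follows; the aggregation across a board is an illustrative assumption, not the study's procedure.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical editorial board: one list of per-paper citation counts per member.
board = [[10, 8, 5, 4, 3], [25, 8, 5, 3, 3], []]
board_h = [h_index(member) for member in board]
```

A member with no publications or no cited papers gets an H-index of 0, which loosely corresponds to the "no discernible H-index" group mentioned in the abstract.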

20.
The European Union recently launched an innovative participatory mechanism allowing its citizens across Europe to get together and set the agenda for policy-making in Brussels. The tool – the European Citizens’ Initiative – was labelled the “most direct and digital” ever in the history of European democratic experimentation, as it made it possible to collect signatures in favour of an initiative (at least 1 million are required) via the internet (e-collection). Launched on 1 April 2012, the ECI was met with major enthusiasm in Brussels, but soon stumbled over serious difficulties as the organisers on the ground were unable to set up their online collection systems. The present paper looks into this ICT-related crisis from the point of reference of e-democracy theory, based on the findings of a qualitative case study. As a deliverable, it offers an understanding of the factors and stakeholder rationales which shaped the design and implementation of the digital dimension of the ECI (iECI).
