Similar documents
A total of 20 similar documents were retrieved (search time: 368 ms).
1.
In text classification, feature selection is necessary to alleviate the curse of dimensionality caused by high-dimensional text data. In this paper, we use class term frequency (CTF) and class document frequency (CDF) to characterize the relevance between terms and categories at the level of term frequency (TF) and document frequency (DF). On the basis of this relevance measurement, three feature selection methods (ADF based on CTF (ADF-CTF), ADF based on CDF (ADF-CDF), and ADF based on both CTF and CDF (ADF-CTDF)) are proposed to identify relevant and discriminative terms by introducing absolute deviation factors (ADFs). Absolute deviation, a statistical concept, is adopted to measure the relevance divergence characterized by CTF and CDF. In addition, ADF-CTF and ADF-CDF can be combined with existing DF-based and TF-based methods, respectively, yielding new ADF-based methods. Experimental results on six high-dimensional textual datasets with three classifiers indicate that the ADF-based methods outperform the original DF-based and TF-based ones in 89% of cases in terms of Micro-F1 and Macro-F1, demonstrating that integrating ADF into existing methods boosts classification performance. The findings also show that ADF-CTDF ranks first on average across the datasets and significantly outperforms the other methods in 99% of cases.
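The abstract does not give the exact ADF formula, so the following is only a minimal illustrative sketch in Python: it assumes that a term's "relevance divergence" is scored by the mean absolute deviation of its class-conditional frequencies (CTF here) and that higher-scoring terms are kept. The function names and the scoring formula are assumptions, not the authors' definitions.

```python
import numpy as np

def adf_scores(class_term_freq: np.ndarray) -> np.ndarray:
    """Score each term by the absolute deviation of its class-wise frequencies.

    class_term_freq: (n_classes, n_terms) matrix of class term frequencies (CTF).
    Returns one score per term; a larger score means the term's frequency diverges
    more strongly across classes, i.e. it is more discriminative.
    (Illustrative formula only; the paper's ADF definition may differ.)
    """
    # Normalise counts within each class so classes of different sizes are comparable.
    ctf = class_term_freq / np.maximum(class_term_freq.sum(axis=1, keepdims=True), 1)
    # Mean absolute deviation of each term's class-conditional frequency.
    return np.abs(ctf - ctf.mean(axis=0, keepdims=True)).mean(axis=0)

def select_top_terms(class_term_freq: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring terms."""
    return np.argsort(adf_scores(class_term_freq))[::-1][:k]
```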

2.
3.
张嶷  汪雪锋  朱东华  周潇 《科学学研究》2013,31(11):1615-1622
How to extract useful information from scientific and technical literature and improve the capability for knowledge discovery is a topic of great current interest in the science of science. Most related analytical techniques and methods are built around the subject terms obtained through natural language processing. In general, however, the number of subject terms extracted from scientific literature is so large that manual cleaning is practically impossible, while purely automatic cleaning lacks credibility. Building on bibliometric methods, this paper constructs a semi-automatic "term cluster" methodology that combines stop-word lists, fuzzy semantic processing, association rules, conversion between term frequency and document frequency, and cluster analysis, providing a term cleaning, merging, and clustering scheme that is primarily quantitative and supplemented by qualitative judgement, with the aim of supplying more accurate term lists for competitive technical intelligence analysis. An empirical study on literature in the Chinese photovoltaic cell field drawn from the Derwent patent database verifies the soundness and effectiveness of the method.

4.
One of the important problems in text classification is the high dimensionality of the feature space. Feature selection methods reduce this dimensionality by selecting the features most valuable for classification. Beyond reducing dimensionality, feature selection can improve text classifiers' performance in terms of both accuracy and time, and it helps to build simpler and therefore more comprehensible models. In this study we propose new methods for feature selection from textual data, called Meaning Based Feature Selection (MBFS), which are based on the Helmholtz principle from the Gestalt theory of human perception, a principle also used in image processing. The proposed approaches are extensively evaluated by their effect on the classification performance of two well-known classifiers on several datasets and compared with several feature selection algorithms commonly used in text mining. Our results demonstrate the value of the MBFS methods in terms of both classification accuracy and execution time.
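The abstract does not spell out how "meaning" is computed, so the sketch below only illustrates the general idea behind Helmholtz-principle scoring as it is commonly formulated: a term is meaningful for a document or class if its observed count there is very unlikely under a simple chance model derived from its corpus-wide frequency. The binomial chance model and all names below are assumptions, not the MBFS definition.

```python
import numpy as np
from scipy.stats import binom

def meaningfulness(count_in_class: int, class_size: int,
                   count_in_corpus: int, corpus_size: int) -> float:
    """Negative log-probability of seeing at least `count_in_class` occurrences
    of a term in a class of `class_size` tokens, if occurrences were spread at
    random with the term's corpus-wide rate. Larger = more 'meaningful'.
    (Illustrative chance model only; MBFS may define its score differently.)
    """
    p = count_in_corpus / corpus_size                     # background occurrence rate
    tail = binom.sf(count_in_class - 1, class_size, p)    # P(X >= count_in_class)
    return -np.log(max(tail, 1e-300))

# A term seen 40 times in a 10,000-token class but only 100 times in a
# 1,000,000-token corpus gets a high score:
print(round(meaningfulness(40, 10_000, 100, 1_000_000), 1))
```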

5.
Contemporary ICTs such as speaking machines and computer games tend to create illusions. Is this ethically problematic? Is it deception? And what kind of “reality” do we presuppose when we talk about illusion in this context? Inspired by work on similarities between ICT design and the art of magic and illusion, responding to literature on deception in robot ethics and related fields, and briefly considering the issue in the context of the history of machines, this paper discusses these questions through the lens of stage magic and illusionism, with the aim of reframing the very question of deception. It investigates whether we can take a more positive or at least morally neutral view of magic, illusion, and performance, while still being able to understand and criticize the relevant phenomena, and whether we can describe and evaluate these phenomena without recourse to the term “deception” at all. This leads the paper into a discussion about metaphysics and into taking a relational and narrative turn. Replying to Tognazzini, the paper identifies and analyses two metaphysical positions: a narrative and performative non-dualist position is articulated in response to what is taken to be a dualist, in particular Platonic, approach to “deception” phenomena. The latter is critically discussed and replaced by a performative and relational approach which avoids a distant “view from nowhere” metaphysics and brings us back to the phenomena and experience in the performance relation. The paper also reflects on the ethical and political implications of the two positions: for the responsibility of ICT designers and users, who are seen as co-responsible magicians or co-performers, and for the responsibility of those who influence the social structures that shape who has (more) power to deceive or to let others perform.

6.
7.
According to the amoralist, computer games cannot be subject to moral evaluation because morality applies to reality only, and games are not real but “just games”. This challenges our everyday moralist intuition that some games are to be met with moral criticism. I discuss and reject the two most common answers to the amoralist challenge and argue that the amoralist is right in claiming that there is nothing intrinsically wrong in simply playing a game. I go on to argue for the so-called “endorsement view”, according to which there is nevertheless a sense in which games themselves can be morally problematic, viz. when they do not only represent immoral actions but endorse a morally problematic worldview. Based on the endorsement view, I argue against full-blown amoralism by claiming that gamers do have a moral obligation when playing certain games, even if their moral obligation is not categorically different from that of readers and moviegoers.

8.
Let us show how property is grasped as an institutional fact. If Jones steals a computer, he does not own it in the sense of property, but only exercises control over it. If he buys the computer, he controls it too, and moreover owns it in the sense of property. In other words, simply exercising control over something is a brute fact. This control counts as property only in a certain context: the computer counts as Jones’s property only if he got it through a licit transfer. This is why property is not a brute fact, and is therefore an institutional fact. The same kind of reasoning applies to privacy. When a piece of personal information P about Jones is openly diffused, it seems that P becomes public. From this point of view, a violation of privacy equates to a publication. The problem with this account is the following: who would call the hacking of a book on its author’s computer a “publication” of that book? No one, because the word “publication” is an institutional word that only refers to a licit diffusion. Considering this answer, we may conclude as follows: if the diffusion of P is illicit, P still counts as private, even if everyone knows about it. If that conclusion is true, privacy is an institutional fact.

9.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and making use of the gigantic amounts of information that exist in unstructured textual form, and it is a widely studied research area in language processing and text mining. In traditional text classification, a document is represented as a bag of words, in which terms are cut off from their finer context, i.e. their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information, in the vector space. Consequently, the semantics of a word that can be inferred from the finer context of its location in a sentence and from its relations with neighboring words are usually ignored. However, the meanings of words and the semantic connections between words, documents and even classes are clearly important, since methods that capture semantics generally reach better classification performance. Several surveys have analyzed diverse approaches to traditional text classification, and most of them cover the application of semantic term relatedness methods in text classification to some degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic versus traditional text classification. This survey explores past and recent advances in semantic text classification and organizes existing approaches into five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, the survey highlights the advantages of semantic text classification algorithms over traditional ones.

10.
张小艳  宋丽平 《现代情报》2009,29(3):131-133
Text classification has important applications in information filtering and information retrieval. Text representation is the first task in text classification, and feature selection is the core technique of text representation, playing a crucial role in classification performance. This paper reviews the development of text representation and feature selection techniques, analyzes the characteristics of current text representation and feature selection methods in detail, compares their applicability, strengths and weaknesses, and finally summarizes the directions and goals of research on text representation and feature selection.

11.
Chunking is the task of dividing a sentence into non-recursive structures; its primary aim is to specify chunk boundaries and classes. Although chunking generally refers to simple chunks, the concept can be customized. A simple chunk is a small structure such as a noun phrase, whereas a constituent chunk is a structure that functions as a single unit in a sentence, such as a subject. For an agglutinative language with rich morphology, constituent chunking is a much harder problem than simple chunking. Most Turkish studies on this issue use the IOB tagging schema to mark chunk boundaries.

In this study, we propose a new, simpler tagging schema for constituent chunking in Turkish, namely OE: “E” marks the rightmost token of a chunk, while “O” stands for all other tokens. For comparison with OE, we also use a schema called OB, where “B” marks the leftmost token of a chunk. We aim to identify both chunk boundaries and chunk classes using conditional random fields (CRF). The initial motivation was to exploit the fact that Turkish phrases are head-final, so we assumed that marking the end of a chunk (OE) would be more advantageous than marking its beginning (OB). Supporting this assumption, the test results show that OB has the worst performance and that OE is significantly more successful in many cases, especially for long sentences. Indeed, using OE amounts to marking the head of the phrase (chunk); since the head and the distinctive label “E” are aligned, CRF identifies the chunk class more easily from the information contained in the head. OE also produced better results than the schemas available in the literature.

In addition to comparing tagging schemas, we performed four analyses. An examination of the window size, a CRF parameter, showed that a value of 3 is adequate. A comparison of evaluation measures for chunking revealed that the F-score is more balanced than token accuracy and sentence accuracy. The feature analysis showed that syntactic features improve chunking performance significantly under all conditions, and that when these features are withdrawn a pronounced difference between OB and OE emerges. Finally, a flexibility analysis shows that OE remains more successful on different data.
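As a concrete illustration of the OE schema described above, the sketch below decodes a tagged token sequence into chunks: tokens are accumulated until a token labeled "E-&lt;class&gt;" closes the chunk and assigns its class. The "E-&lt;class&gt;" tag format, the function name, and the example labels are assumptions for illustration; the paper's exact label encoding may differ.

```python
from typing import List, Tuple

def decode_oe(tokens: List[str], tags: List[str]) -> List[Tuple[List[str], str]]:
    """Turn an OE-tagged sentence into (chunk tokens, chunk class) pairs.

    Tags are assumed to be "O" for non-final tokens and "E-<class>" for the
    rightmost token of a chunk (illustrative encoding, not the paper's exact one).
    """
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        current.append(token)
        if tag.startswith("E-"):              # rightmost token closes the chunk
            chunks.append((current, tag[2:]))
            current = []
    return chunks

# Hypothetical constituent labels for a short Turkish sentence:
print(decode_oe(["Küçük", "çocuk", "okula", "gitti"],
                ["O", "E-SUBJECT", "E-ADJUNCT", "E-PREDICATE"]))
```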

12.
Deep multi-view clustering (MVC) mines and exploits the complex relationships among views to learn compact data clusters with deep neural networks in an unsupervised manner. Recent deep contrastive learning (CL) methods have shown promising performance in MVC by learning cluster-oriented deep feature representations, which is achieved by contrasting positive and negative sample pairs. However, most existing deep contrastive MVC methods focus only on one side of contrastive learning, such as feature-level or cluster-level contrast, failing to integrate the two or to bring in further important aspects of contrast. In addition, most of them work in a separate two-stage manner, first feature learning and then data clustering, so the two stages cannot benefit each other. To address these challenges, in this paper we propose a novel joint contrastive triple-learning framework that learns multi-view discriminative feature representations for deep clustering. The framework is threefold: feature-level alignment-oriented CL, feature-level commonality-oriented CL, and cluster-level consistency-oriented CL. The former two submodules contrast the encoded feature representations of data samples at different feature levels, while the last contrasts the data samples in the cluster-level representations. Benefiting from the triple contrast, more discriminative representations of the views can be obtained. Meanwhile, a view weight learning module is designed to learn and exploit the quantitative complementary information across the learned discriminative features of each view. The contrastive triple-learning module, the view weight learning module and the data clustering module operating on these fused features are performed jointly, so that the modules are mutually beneficial. Extensive experiments on several challenging multi-view datasets show the superiority of the proposed method over many state-of-the-art methods, with large accuracy improvements of 15.5% and 8.1% on Caltech-4V and CCV. Given its promising performance on visual datasets, the method can be applied to many practical visual applications such as visual recognition and analysis. The source code is provided at https://github.com/ShizheHu/Joint-Contrastive-Triple-learning.
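To make the idea of feature-level contrast concrete, the sketch below shows a generic InfoNCE-style loss between two views of the same batch: embeddings of the same sample in the two views form positive pairs, and all other cross-view pairs serve as negatives. This is a standard formulation assumed for illustration, not the specific losses of the triple-learning framework.

```python
import numpy as np

def info_nce(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """Generic feature-level contrastive (InfoNCE) loss between two views.

    z1, z2: (batch, dim) embeddings of the same samples under two views.
    Row i of z1 and row i of z2 form the positive pair; other rows are negatives.
    (Illustrative only; the paper's alignment/commonality losses may differ.)
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce(z + 0.01 * rng.normal(size=z.shape), z))  # small loss for aligned views
```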

13.
In text categorization it is common for the numbers of documents in different categories to differ, i.e., for the class distribution to be imbalanced. We propose an approach to improve text categorization under class imbalance by exploiting the semantic context of text documents. Specifically, we generate new samples for rare classes (categories with relatively small amounts of training data) using the global semantic information of classes represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced and categorization performance can be improved on the transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents across classes by re-sampling the documents of rare classes and can therefore cause overfitting. Another benefit of our approach is the effective handling of noisy samples: since all new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.
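A minimal sketch of the general idea, assuming an LDA topic model fitted on the rare class and new bag-of-words documents sampled from its topic-word distributions; the sampling scheme, the document length, and all names are illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def synthesize_rare_docs(rare_counts, n_new=100, doc_len=80, n_topics=10, seed=0):
    """Generate synthetic bag-of-words rows for a rare class from an LDA model.

    rare_counts: (n_docs, n_terms) term-count matrix of the rare class.
    Returns an (n_new, n_terms) matrix of synthetic documents.
    """
    rng = np.random.default_rng(seed)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(rare_counts)
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    new_docs = np.zeros((n_new, rare_counts.shape[1]), dtype=int)
    for i in range(n_new):
        # Draw a topic mixture for the new document, then draw words topic by topic.
        theta = rng.dirichlet(np.full(n_topics, 0.1))
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        for t in topics:
            w = rng.choice(topic_word.shape[1], p=topic_word[t])
            new_docs[i, w] += 1
    return new_docs
```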

14.
Digital technologies, characterized by openness and affordances, have enabled traditional enterprises to form a new organizational configuration: the open innovation ecosystem of the digitally transforming enterprise. Research on this configuration, however, is still at the early stage of conceptual discussion and feature summarization, and a systematic theoretical framework for studying its construction is urgently needed. This paper therefore first clarifies, on the basis of a review of related concepts, what the open innovation ecosystem of a digitally transforming enterprise is, i.e., defines its connotation. Second, drawing on the two existing research streams of open innovation ecosystems and digital transformation of enterprises, it explains why research on constructing such ecosystems is urgently needed. Finally, it proposes a preliminary research framework of "configuration identification → construction motivation → construction logic → construction process" and points out how future research on constructing the open innovation ecosystems of digitally transforming enterprises should proceed, with the aim of consolidating the theoretical foundation of this research and indicating directions for future work.

15.
In this paper, we argue that, under a specific set of circumstances, designing and employing certain kinds of virtual reality (VR) experiences can be unethical. After a general discussion of simulations and their ethical context, we begin our argument by distinguishing between the experiences generated by different media (text, film, computer game simulation, and VR simulation), and argue that VR experiences offer an unprecedented degree of what we call “perspectival fidelity” that prior modes of simulation lack. Additionally, we argue that when VR experiences couple this perspectival fidelity with what we call “context realism,” VR experiences have the ability to produce “virtually real experiences.” We claim that virtually real experiences generate ethical issues for VR technologies that are unique to the medium. Because subjects of these experiences treat them as if they were real, a higher degree of ethical scrutiny should be applied to any VR scenario with the potential to generate virtually real experiences. To mitigate this unique moral hazard, we propose and defend what we call “The Equivalence Principle.” This principle states that “if it would be wrong to allow subjects to have a certain experience in reality, then it would be wrong to allow subjects to have that experience in a virtually real setting.” We argue that such a principle, although limited in scope, should be part of the risk analysis conducted by any Institutional Review Boards, psychologists, empirically oriented philosophers, or game designers who are using VR technology in their work.

16.
An optimal procedure is established for reconstructing the angular object distribution in a given field of view (FOV). The object is coherently illuminated and located in the far zone of the receiving aperture. The procedure is “uniformly” optimal in the sense of minimizing, for each direction belonging to the FOV, the statistical r.m.s. difference between the object distribution, modeled as a random function of the angular coordinates, and its reconstructed image. In the general case, the observable complex amplitude distribution of the field on the aperture is due not only to the incident field scattered by the object but also to a background disturbance, or “angular noise”, randomly distributed inside and outside the FOV, and it is affected by “measurement noise”, that is, random errors introduced in measuring the aperture field. The reconstruction algorithm consists of summing a truncated series of special functions (prolate spheroidal functions in the linear case and their generalizations for two-dimensional apertures) weighted by appropriate coefficients. These coefficients depend on the observed aperture field and on the relative power densities associated with the object field and the various types of noise. The series is truncated to a number of terms (the “effective degrees of freedom” of the image) determined through an information-theoretic method: each term of the series, suitably ordered, provides an information gain smaller than the preceding one, and the information gain goes rapidly to zero. A relationship between information transfer and mean squared error is established for each term in the image series. Numerical examples are discussed.
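In purely illustrative notation (the symbols below are not taken from the paper), the kind of truncated-series reconstruction described above can be sketched as follows.

```latex
% Illustrative notation only: \psi_n are the prolate spheroidal (or generalized)
% basis functions, g_n the expansion coefficients of the observed aperture field,
% \lambda_n the associated eigenvalues, and S_O, S_N the object and noise power
% densities; the weights a_n are written here in the spirit of a generic
% minimum-mean-square-error filter, not as the paper's exact coefficients.
\[
  \hat{O}(\theta) \;=\; \sum_{n=0}^{N_{\mathrm{eff}}-1} a_n\,\psi_n(\theta),
  \qquad
  a_n \;=\; \frac{\lambda_n S_O}{\lambda_n S_O + S_N}\; g_n ,
\]
% with N_eff (the "effective degrees of freedom") chosen where the per-term
% information gain becomes negligible.
```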

17.
Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, CMFS, which comprehensively measures the significance of a term both within a category (intra-category) and across categories (inter-category). We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that CMFS is significantly superior to Information Gain (IG), the Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
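The exact CMFS formula is not given in the abstract. One simple way to combine intra- and inter-category significance, used here purely for illustration, is to score a term for each category by the product of P(term | category) and P(category | term) and take the maximum over categories; the names and the formula are assumptions, not necessarily the paper's CMFS.

```python
import numpy as np

def cmfs_like_scores(class_term_df: np.ndarray) -> np.ndarray:
    """Illustrative intra-/inter-category significance score per term.

    class_term_df: (n_classes, n_terms) document-frequency counts per term and class.
    For each (term, class) pair the score multiplies P(term|class), an intra-category
    measure, by P(class|term), an inter-category measure; the per-term score is the
    maximum over classes. (An assumed combination, not necessarily the paper's CMFS.)
    """
    p_t_given_c = class_term_df / np.maximum(class_term_df.sum(axis=1, keepdims=True), 1)
    p_c_given_t = class_term_df / np.maximum(class_term_df.sum(axis=0, keepdims=True), 1)
    return (p_t_given_c * p_c_given_t).max(axis=0)
```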

18.
This paper examines the feasibility of discovering “title-like” terms in a document using a decision tree classifier. The premise is that title terms and title-like terms should behave similarly within the document, and this behavior is characterized by a set of distributional and linguistic features. By training the classifier on the behavior of title terms in a balanced manner using 25,000 titles from Reuters articles, other terms with similar behavior can also be discovered. On 5,000 unseen titles, the recall of title terms was 83%, similar to that of manual identification of title terms. The precision of finding title terms is low (32%), because some non-title but title-like terms are identified as well. Seven subjects were asked to rate, on a scale of 1 to 5, whether each identified term is a topical/thematic/title term. If a rating of 2.5 is used as the threshold for judging a term to be “title-like”, the mean precision increases to 58%, which corresponds to expanding the headline/title with about twice the average number of terms. Since this precision (58%) is similar to the mean precision of manually identified title terms averaged across subjects, we conclude that discovering title-like terms with classifiers is a promising approach.
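A minimal sketch of the training setup the abstract describes: each candidate term is represented by a few distributional features and labeled by whether it appears in the document's title, and a decision tree is fitted to that data. The particular features, values, and names are assumptions, not the paper's feature set.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-term features: [term frequency in document, relative position of
# first occurrence (0 = start of document), capitalized (0/1), term length].
X_train = [
    [12, 0.02, 1, 8],   # behaves like a title term
    [ 9, 0.05, 1, 6],
    [ 1, 0.90, 0, 4],   # behaves like an ordinary term
    [ 2, 0.75, 0, 5],
]
y_train = [1, 1, 0, 0]  # 1 = appears in the title, 0 = does not

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Terms predicted 1 despite not being in the title are the "title-like" candidates.
print(clf.predict([[10, 0.03, 1, 7]]))
```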

19.
A text classification method based on an improved KNN algorithm (cited 8 times: 0 self-citations, 8 citations by others)
钱晓东  王正欧 《情报科学》2005,23(4):550-554
To address the shortcomings of KNN (the k-nearest neighbors algorithm) under the VSM (vector space model) in text processing, this paper proposes an improved KNN text classification method based on SOM (self-organizing map) theory, feature selection and pattern aggregation. Feature selection and pattern aggregation are applied to reduce the dimensionality of the feature space. Because the equal weighting of all dimensions in the traditional VSM is not well suited to text processing, the paper proposes using an SOM neural network to compute the weight of each dimension of the VSM. Combining the two improvements effectively reduces the dimensionality of the vector space and improves both the accuracy and the speed of text classification.
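A minimal sketch of the final classification step described above: a k-nearest-neighbor classifier that uses a per-dimension weight vector in the distance computation. Here the weights are simply passed in; in the paper they would come from the SOM. The weighting scheme and all names are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x, dim_weights, k=5):
    """Classify x by majority vote among its k nearest training points,
    using a weighted Euclidean distance. `dim_weights` plays the role of the
    per-dimension weights the paper derives from an SOM (assumed given here).
    """
    diffs = X_train - x
    dists = np.sqrt((dim_weights * diffs ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = (X[:, 0] > 0).astype(int)
print(weighted_knn_predict(X, y, np.array([0.8, 0.0, 0.0, 0.0]),
                           np.array([1.0, 0.2, 0.2, 0.2])))
```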

20.
The Gene Ontology (GO) consists of three structured controlled vocabularies, the GO domains, developed for describing attributes of gene products, and GO annotation is crucial for providing a common gateway to different model organism databases. This paper explores an effective application of text categorization methods to this highly practical problem in biology. As a first step, we tackle the automatic GO annotation task posed in the Text Retrieval Conference (TREC) 2004 Genomics Track. Given a pair of genes and an article reference in which the genes appear, the task simulates assigning GO domain codes. We approach the problem with careful consideration of the specialized terminology and pay special attention to the various forms of gene synonyms, so as to exhaustively locate the occurrences of the target gene. We extract the words around the spotted gene occurrences and use them to represent the gene for GO domain code annotation. We treat the task as a text categorization problem and adopt a variant of kNN with supervised term weighting schemes, which made our method one of the top-performing systems in the TREC official evaluation. Furthermore, we investigate different feature selection policies in conjunction with the treatment of terms associated with negative instances. Our experiments reveal that round-robin feature space allocation combined with the elimination of negative terms substantially improves performance as GO terms become more specific.
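A minimal sketch of round-robin feature space allocation as the abstract's last sentence describes it at a high level: each class contributes its next best-ranked term in turn until the feature budget is filled. The ranking, the example terms, and the function name are assumptions for illustration.

```python
def round_robin_select(ranked_terms_per_class, budget):
    """Pick `budget` features by cycling over classes and taking each class's
    next best term in turn, skipping terms already chosen.

    ranked_terms_per_class: dict mapping class -> list of terms, best first.
    """
    selected, iters = [], {c: iter(ts) for c, ts in ranked_terms_per_class.items()}
    while len(selected) < budget and iters:
        for c in list(iters):
            try:
                term = next(t for t in iters[c] if t not in selected)
            except StopIteration:
                del iters[c]          # this class has no unused terms left
                continue
            selected.append(term)
            if len(selected) == budget:
                break
    return selected

ranked = {"nucleus": ["dna", "chromatin", "binding"],
          "membrane": ["lipid", "binding", "transport"]}
print(round_robin_select(ranked, 4))   # ['dna', 'lipid', 'chromatin', 'binding']
```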
