Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
In this paper, we propose a new algorithm that incorporates the relationships of concept-based thesauri into document categorization using the k-NN classifier (k-NN). k-NN is one of the most popular document categorization methods because it shows relatively good performance in spite of its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there is more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing a thesaurus entails structuring the categories into hierarchies, since their structure must conform to that of the thesaurus in order to capture relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be markedly improved and the ambiguity removed. In this paper, we first perform document categorization using k-NN and then employ the thesaurus relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.
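A minimal sketch of the two-stage idea under stated assumptions: TF-IDF-style document vectors, and a hypothetical `related` map encoding thesaurus relationships between categories (all names are illustrative, not the authors' implementation):

```python
import numpy as np
from collections import Counter

def knn_categorize(doc_vec, train_vecs, train_labels, k=5, related=None):
    """Two-stage k-NN: vote among neighbors, then use thesaurus
    relationships between categories to break ties (ambiguity)."""
    # Cosine similarity between the document and all training documents.
    sims = train_vecs @ doc_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
    top = np.argsort(-sims)[:k]
    votes = Counter(train_labels[i] for i in top)
    ranked = votes.most_common()
    # Ambiguity: more than one category shares the top vote count.
    candidates = [c for c, v in ranked if v == ranked[0][1]]
    if len(candidates) == 1 or related is None:
        return candidates[0]
    # Tie-break: prefer the candidate with the most thesaurus links
    # (broader/narrower/related categories) to the other candidates.
    def link_score(c):
        return sum(o in related.get(c, set()) for o in candidates if o != c)
    return max(candidates, key=link_score)
```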

2.
Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents, on the basis of their contents, and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., all written in the same language) during both the learning of the model and the category assignment (or prediction) for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate and organize documents in one language into categories and subsequently archive documents in different languages into the existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, CLTC deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by the significance of this demand, this study designs a CLTC technique with two different category assignment methods, namely individual-based and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation demonstrates the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.
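A rough sketch of the two category assignment methods, assuming documents from both languages have already been mapped into a shared feature space (e.g., via a bilingual dictionary); the function names are illustrative:

```python
import numpy as np

def assign_individual(doc, train_vecs, train_labels, k=5):
    """Individual-based: the nearest training documents vote directly."""
    d = np.linalg.norm(train_vecs - doc, axis=1)
    top = np.argsort(d)[:k]
    labels = [train_labels[i] for i in top]
    return max(set(labels), key=labels.count)

def assign_cluster(doc, train_vecs, train_labels):
    """Cluster-based: compare against one centroid per category."""
    cats = sorted(set(train_labels))
    centroids = {c: train_vecs[[i for i, l in enumerate(train_labels)
                                if l == c]].mean(axis=0) for c in cats}
    return min(cats, key=lambda c: np.linalg.norm(centroids[c] - doc))
```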

3.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information from web pages more exactly. After automatically segmenting the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are assigned to the test document using the k-NN text categorization algorithm. The effects of the characteristics of the chemistry dictionary and of the test collection on categorization efficiency are discussed, and a new voting method based on the collection characteristics is introduced to further improve categorization performance. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.
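A minimal sketch of the LSI step, assuming a plain term-document count matrix; the dictionary-based expansion and the voting method are not reproduced here:

```python
import numpy as np

def lsi_project(term_doc, rank=100):
    """Latent semantic indexing: truncated SVD of the term-document
    matrix; documents are represented in the reduced concept space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(rank, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document
    fold_in = U[:, :k] / s[:k]               # projects new documents
    return doc_vecs, fold_in

# A new (dictionary-expanded) document d, as a term-count row vector,
# is folded in as d @ fold_in and then classified with k-NN against doc_vecs.
```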

4.
A particle swarm classifier is integrated with the concept of intelligently controlling the search process of PSO to develop an efficient swarm intelligence based classifier, called the intelligent particle swarm classifier (IPS-classifier). The classifier finds the decision hyperplanes that separate patterns of different classes in the feature space. An intelligent fuzzy controller is designed to improve the performance and efficiency of the proposed classifier by adapting three important parameters of PSO (the inertia weight, the cognitive parameter and the social parameter). Three pattern recognition problems with different feature vector dimensions are used to demonstrate the effectiveness of the introduced classifier: Iris data classification, Wine data classification and radar target classification from backscattered signals. The experimental results show that the performance of the IPS-classifier is comparable to or better than that of the k-nearest neighbor (k-NN) and multi-layer perceptron (MLP) classifiers, two conventional classifiers.
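A bare-bones PSO sketch in which the inertia weight and the cognitive/social parameters are adapted during the run; the paper uses a fuzzy controller for this adaptation, whereas the linear schedule below is only a stand-in:

```python
import numpy as np

def pso_minimize(f, dim, n=30, iters=200, seed=0):
    """Minimal PSO loop; w, c1, c2 are adapted over time (the paper
    adapts them with a fuzzy controller instead of this schedule)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n, dim)); v = np.zeros((n, dim))
    pbest = x.copy(); pval = np.apply_along_axis(f, 1, x)
    g = pbest[pval.argmin()].copy()
    for t in range(iters):
        frac = t / iters
        # Adapted parameters: inertia, cognitive, social.
        w, c1, c2 = 0.9 - 0.5 * frac, 2.5 - frac, 1.5 + frac
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        val = np.apply_along_axis(f, 1, x)
        improved = val < pval
        pbest[improved], pval[improved] = x[improved], val[improved]
        g = pbest[pval.argmin()].copy()
    return g, pval.min()
```

For classification, f would score a candidate hyperplane (e.g., training error), so the swarm searches the space of hyperplane coefficients.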

5.
Using data generated by a progressive nucleation mechanism for the cumulative fraction of citations of individual papers published successively by a hypothetical author, an expression for the time dependence of the cumulative number Lsum(t) of citations of progressively published papers is proposed. It is found that, for all nonzero values of a constant publication rate ΔN, the cumulative citations Lsum(t) of the N papers published by an author over an entire publication career spanning T years fall into two distinct regions: (1) in the region 0 < t < Θ0 (where Θ0 ≈ T/3), Lsum(t) increases slowly, proportionally to the square of the citation time t, and (2) in the region t > Θ0, Lsum(t) approaches a constant Lsum(max) at T. In the former region, the time dependence of Lsum(t) is governed by three parameters: the citability parameter λ0, the publication rate ΔN and the citation time t. Based on the predicted dependence of Lsum(t) on t, a useful age-independent scientometric measure, the citation acceleration a = Lsum(t)/t², is suggested for analyzing and comparing the scientific activities of different authors. Confronting the time dependence of the cumulative number Lsum(t) of citations with the theoretical equation reveals one or more citation periods during the publication careers of different authors.
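The proposed measure is simple to apply; a minimal sketch:

```python
def citation_acceleration(cumulative_citations, t):
    """Age-independent measure a = Lsum(t) / t**2 from the abstract;
    t is the citation time in years (t > 0)."""
    return cumulative_citations / t ** 2

# Example: authors with 900 and 400 cumulative citations after 15 and
# 10 years both have a = 4.0, despite different career lengths.
```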

6.
7.
Webpages are mainly distinguished by their topic (e.g., politics, sports, etc.) and genre (e.g., blogs, homepages, e-shops, etc.). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to focus on the requirements of the user's information need. In this paper, we present an approach to webpage genre detection based on fully automated extraction of a feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of the still-evolving web genres and the noisy environment of the web. Experiments based on two publicly available corpora show that the performance of the proposed approach is superior to previously reported results. It is also shown that character n-grams are better features than words when the dimensionality increases, and that the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different genre palette, as well as using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
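Binary character n-grams of variable length are directly expressible with scikit-learn's CountVectorizer; a sketch (the HTML-tag features would be extracted separately from the markup):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary character n-grams of length 3 to 5; the abstract reports that
# the binary representation beats term frequencies for these features.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 5), binary=True)
X = vectorizer.fit_transform(["<html><body>My homepage ...",
                              "<div>Buy now! e-shop ..."])
# X is a sparse 0/1 matrix ready for any standard classifier.
```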

8.
This paper is concerned with large-scale similarity search, i.e., efficiently and effectively finding the data points similar to a query point. An efficient way to accelerate similarity search is to learn hash functions. Existing approaches for learning hash functions aim to obtain low Hamming distances for similar pairs. However, these methods ignore the ranking order of the Hamming distances, which leads to poor accuracy in finding similar items for a query point. In this paper, an algorithm referred to as Top-k RHS (Rank Hash Similarity) is proposed, in which a ranking loss function is designed for learning a hash function. The hash function is hypothesized to be made up of l binary classifiers, so the problem of learning a hash function can be formulated as the task of learning l binary classifiers. The algorithm runs l rounds and learns one binary classifier per round. Compared with existing approaches, the proposed method has the same order of computational complexity. Nevertheless, experimental results on three text datasets show that the proposed method obtains higher accuracy than the baselines.
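A sketch of the hypothesis that the hash function consists of l binary classifiers whose signs form the code, plus Hamming-distance ranking; the ranking-loss training itself is omitted and the weights are assumed already learned:

```python
import numpy as np

class HashFunction:
    """Hash function made of l linear binary classifiers: the sign of
    each classifier contributes one bit of the code."""
    def __init__(self, weights):          # weights: (l, dim)
        self.W = weights

    def code(self, X):                    # X: (n, dim) -> (n, l) bits
        return (X @ self.W.T > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes, k=10):
    """Rank database items by Hamming distance to the query code."""
    dist = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dist)[:k]
```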

9.
Artificial intelligence (AI) is rapidly becoming the pivotal solution for supporting critical judgments in many life-changing decisions. A biased AI tool can be particularly harmful, since such systems can promote or diminish people's well-being. Consequently, government regulations are introducing specific rules that prohibit the use of sensitive features (e.g., gender, race, religion) in an algorithm's decision-making process in order to avoid unfair outcomes. Unfortunately, such restrictions may not be sufficient to protect people from unfair decisions, as algorithms can still behave in a discriminatory manner. Indeed, even when sensitive features are omitted (fairness through unawareness), they may be related to other features, called proxy features. This study shows how to unveil whether a black-box model that complies with the regulations is still biased. We propose an end-to-end bias detection approach exploiting a counterfactual reasoning module and an external classifier for sensitive features. In detail, the counterfactual analysis finds the minimum-cost variations that grant a positive outcome, while the classifier detects non-linear patterns of non-sensitive features that proxy sensitive characteristics. The experimental evaluation reveals the proposed method's efficacy in detecting classifiers that learn from proxy features. We also scrutinize the impact of state-of-the-art debiasing algorithms in alleviating the proxy feature problem.
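A sketch of the external-classifier half of the idea, under the assumption that proxy features exist whenever the sensitive attribute is predictable from non-sensitive features well above chance (the threshold is an illustrative choice, not the paper's):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def proxy_check(X_nonsensitive, sensitive, threshold=0.65):
    """If a classifier trained only on non-sensitive features predicts
    the sensitive attribute well, those features act as proxies."""
    clf = GradientBoostingClassifier()
    acc = cross_val_score(clf, X_nonsensitive, sensitive, cv=5).mean()
    return acc > threshold, acc
```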

10.
Most document clustering algorithms operate in a high-dimensional bag-of-words space. The noise inherent in such a representation obviously degrades the performance of most of these approaches. In this paper we investigate an unsupervised dimensionality reduction technique for document clustering, based on the assumption that terms co-occurring in the same context with the same frequencies are semantically related. On the basis of this assumption, we first find term clusters using a classification version of the EM algorithm. Documents are then represented in the space of these term clusters, and a multinomial mixture model (MM) is used to build document clusters. We show empirically, on the Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB collections, that this new text representation noticeably increases the performance of the MM model. By relating the proposed approach to the Probabilistic Latent Semantic Analysis (PLSA) model, we further propose an extension of the latter in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on these four datasets that the proposed extended version of PLSA produces statistically significant improvements, with respect to two clustering measures, over all variants of the original PLSA and MM models.
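A compact EM sketch for the multinomial mixture (MM) document-clustering step, assuming a document-by-term (or document-by-term-cluster) count matrix:

```python
import numpy as np

def multinomial_mixture_em(X, k, iters=50, seed=0):
    """EM for a multinomial mixture over documents.
    X: (n_docs, n_terms) count matrix; returns hard cluster labels."""
    rng = np.random.default_rng(seed)
    n, v = X.shape
    pi = np.full(k, 1.0 / k)                     # mixture weights
    theta = rng.dirichlet(np.ones(v), size=k)    # per-cluster term dists
    for _ in range(iters):
        # E-step: posterior responsibility of each component per doc.
        log_r = np.log(pi) + X @ np.log(theta).T          # (n, k)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and term distributions.
        pi = r.mean(axis=0)
        theta = (r.T @ X) + 1e-9                          # smoothing
        theta /= theta.sum(axis=1, keepdims=True)
    return r.argmax(axis=1)
```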

11.
12.
An h-type index is proposed which depends on the citations received by the articles belonging to the h-core. This weighted h-index, denoted hw, is presented in both a continuous and a discrete setting. It is shown that in the continuous setting the new index enjoys many good properties; in the discrete setting some small deviations from the ideal may occur.
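A sketch of a discrete computation, assuming the Egghe–Rousseau definition of hw (this formula is our assumption; the abstract does not spell it out):

```python
import math

def weighted_h_index(citations):
    """Discrete weighted h-index hw (assumed Egghe-Rousseau form): with
    citation counts c_1 >= c_2 >= ... and h the ordinary h-index, let
    r_w(i) = (c_1 + ... + c_i) / h and r0 be the largest rank with
    r_w(r0) <= c_r0; then hw = sqrt(c_1 + ... + c_r0)."""
    c = sorted(citations, reverse=True)
    h = sum(1 for i, x in enumerate(c, 1) if x >= i)
    if h == 0:
        return 0.0
    running, r0_sum = 0.0, 0.0
    for x in c:
        running += x
        if running / h <= x:   # weighted rank still within this paper
            r0_sum = running
    return math.sqrt(r0_sum)

# Example: weighted_h_index([10, 5, 3, 1]) -> sqrt(15), with h = 3.
```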

13.
Modern web search engines are expected to return the top-k results efficiently. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them ignore some especially important factors in ranking functions, such as term proximity (the distance relationship between query terms in a document). In our recent work [Zhu, M., Shi, S., Li, M., & Wen, J. (2007). Effective top-k computation in retrieving structured documents with term-proximity support. In Proceedings of the 16th CIKM conference (pp. 771–780)], we demonstrated that, when term proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. To solve this problem, we built an inverted index based on web page structure and proposed query processing strategies accordingly. The experimental results indicate that the proposed index structures and query processing strategies significantly improve top-k efficiency. In this paper, we study the possibility of adopting additional techniques to further improve top-k computation efficiency. We propose a Proximity-Probe Heuristic to make our top-k algorithms more efficient. We also test the efficiency of our approaches under various settings (linear or non-linear ranking functions, exact or approximate top-k processing, etc.).
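Term proximity is commonly quantified as the smallest window covering all query terms; a sketch of that computation over per-term position lists (not necessarily the paper's exact proximity measure):

```python
import heapq

def min_span(positions):
    """Smallest window covering at least one occurrence of every query
    term. `positions` is a list of sorted, non-empty position lists,
    one per query term; smaller spans mean higher proximity scores."""
    heads = [(p[0], i, 0) for i, p in enumerate(positions)]
    heapq.heapify(heads)
    cur_max = max(p[0] for p in positions)
    best = float("inf")
    while True:
        lo, i, j = heapq.heappop(heads)       # leftmost term in window
        best = min(best, cur_max - lo)
        if j + 1 == len(positions[i]):        # this term list exhausted
            return best
        nxt = positions[i][j + 1]
        cur_max = max(cur_max, nxt)
        heapq.heappush(heads, (nxt, i, j + 1))
```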

14.
A linear matrix inequality based mixed H2-dissipative state observer design approach is presented for smooth discrete-time nonlinear systems with finite-energy disturbances. The observer is designed to maintain H2-type estimation error performance together with either an H∞ or a passivity-type disturbance reduction performance in the presence of randomly varying perturbations in its gain. A linear matrix inequality is solved at each time instant to find the time-varying gain of the observer. Simulation studies explore the performance in comparison to the extended Kalman filter and a previously proposed constant-gain observer counterpart.
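A minimal sketch of the LMI route to an observer gain, using only a plain discrete-time Lyapunov LMI rather than the paper's mixed H2-dissipative conditions; cvxpy and an SDP solver are assumed available:

```python
import numpy as np
import cvxpy as cp

def lmi_observer_gain(A, C, eps=1e-6):
    """Find P > 0 and Y such that the Schur-complement LMI
    [[P, (P A - Y C)^T], [P A - Y C, P]] > 0 holds, which makes
    A - L C Schur stable for the observer gain L = P^{-1} Y."""
    n, p = A.shape[0], C.shape[0]
    P = cp.Variable((n, n), symmetric=True)
    Y = cp.Variable((n, p))
    M = P @ A - Y @ C
    lmi = cp.bmat([[P, M.T], [M, P]])
    prob = cp.Problem(cp.Minimize(0),
                      [P >> eps * np.eye(n), lmi >> eps * np.eye(2 * n)])
    prob.solve()
    return np.linalg.solve(P.value, Y.value)   # observer gain L
```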

15.
The mathematical modeling of most physical systems, such as aerospace systems, heat processes, telecommunication systems, transmission lines and chemical reactors, results in complex high-order models. The complexity of these models imposes many difficulties in analysis, simulation and control design. Several analytical model reduction techniques have been proposed in the literature over the past few decades to reduce these difficulties. However, most of the optimal techniques follow computationally demanding, time-consuming, iterative procedures that usually result in non-robustly stable models with poor frequency-response resemblance to the original high-order model in some frequency ranges. The Genetic Algorithm (GA) has proved to be an excellent optimization tool in the past few years. The aim of this paper is therefore to use the GA to solve H2 and H∞ norm model reduction problems and to obtain globally optimized nominal models.
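A toy sketch of GA-based model reduction, with a sampled frequency-response error standing in for the H2 norm of the error system; a real design would also enforce stability of the reduced denominator:

```python
import numpy as np

def h2_like_cost(num_r, den_r, num, den, w=np.logspace(-2, 2, 400)):
    """Squared frequency-response error between the full model num/den
    and a reduced candidate num_r/den_r, integrated over a grid."""
    jw = 1j * w
    G = np.polyval(num, jw) / np.polyval(den, jw)
    Gr = np.polyval(num_r, jw) / np.polyval(den_r, jw)
    return np.trapz(np.abs(G - Gr) ** 2, w)

def ga_reduce(num, den, r=2, pop=60, gens=150, seed=0):
    """Toy GA over reduced-model coefficients: elitist selection plus
    Gaussian mutation; chromosome = [num_r (r), den_r (r+1)]."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 1, (pop, 2 * r + 1))
    for _ in range(gens):
        cost = np.array([h2_like_cost(p[:r], p[r:], num, den) for p in P])
        elite = P[np.argsort(cost)[: pop // 4]]
        P = np.repeat(elite, 4, axis=0) + rng.normal(0, 0.05, (pop, 2 * r + 1))
        P[0] = elite[0]                       # keep the best unchanged
    return P[0][:r], P[0][r:]
```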

16.
Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. In cases where many features are highly redundant with each other, however, we must utilize other means, for example more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning based text categorization that does not rely on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes allows conventional machine learning algorithms to improve over support vector machines, which are known to give the best classification accuracy.
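A sketch of the information-gain half of the method for binary term features; the divergence-based redundancy penalty is only indicated in a comment:

```python
import numpy as np

def information_gain(X, y):
    """IG of each binary term feature w.r.t. the class labels: H(Y)
    minus the expected class entropy given the term's presence.
    X: (n_docs, n_terms) 0/1 matrix, y: (n_docs,) labels."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()
    base = entropy(y)
    ig = np.empty(X.shape[1])
    for t in range(X.shape[1]):
        present = X[:, t] > 0
        hp = entropy(y[present]) if present.any() else 0.0
        ha = entropy(y[~present]) if (~present).any() else 0.0
        ig[t] = base - (present.mean() * hp + (1 - present.mean()) * ha)
    return ig

# The paper's method would additionally penalize terms whose
# distribution diverges little from already-selected terms.
```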

17.
In this paper, an analytic solution for a nonlinear H∞ robust controller is first proposed and applied to the complete six-degree-of-freedom nonlinear equations of motion of a flight vehicle system with mass and moment-of-inertia uncertainties. A special Lyapunov function incorporating the mass and moment-of-inertia uncertainties is considered to solve the associated Hamilton-Jacobi partial differential inequality (HJPDI). The HJPDI is solved analytically, resulting in a nonlinear H∞ robust controller with a simple proportional feedback structure. Next, the control surface inverse algorithm (CSIA) is introduced to determine the angles of control surface deflection from the nonlinear H∞ control command. The ranges of prefilter and loss ratio that guarantee the stability and robustness of the nonlinear H∞ flight control system implemented by the CSIA are derived. Real aerodynamic data, engine data and the actuator system of the F-16 aircraft are used in numerical simulations to verify the proposed scheme. The results show that the responses retain good convergence for large initial perturbations and that robust stability is maintained under mass and moment-of-inertia uncertainties within the permissible ranges of the prefilter and loss ratio for which the design guarantees stability.

18.
Cross-Company Churn Prediction (CCCP) is a research domain in which one company (the target) lacks enough data and can use data from another company (the source) to predict customer churn successfully. To support CCCP, the cross-company data is usually transformed to approximate the distribution of the target company's data prior to building a CCCP model. However, it is still unclear which data transformation method is most effective in CCCP. Moreover, the impact of data transformation methods on CCCP model performance with different classifiers has not been comprehensively explored in the telecommunication sector. In this study, we devise a model for CCCP using data transformation methods (log, z-score, rank and box-cox) and present an extensive comparison to validate the impact of these transformations on CCCP; we also evaluate the performance of the underlying baseline classifiers (Naive Bayes (NB), K-Nearest Neighbour (KNN), Gradient Boosted Tree (GBT), Single Rule Induction (SRI) and Deep Learner Neural Net (DP)) for customer churn prediction in the telecommunication sector using the above-mentioned transformations. We performed experiments on publicly available datasets from the telecommunication sector. The results demonstrate that most of the data transformation methods (log, rank and box-cox) improve the performance of CCCP significantly, whereas the z-score transformation could not achieve better results than the other transformation methods in this study. Moreover, the NB-based CCCP model outperforms the others on transformed data, DP, KNN and GBT perform about average, and the SRI classifier does not show significant results in terms of the commonly used evaluation measures (probability of detection, probability of false alarm, area under the curve and g-mean).
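The four transformations compared in the study are all standard; a sketch applied to a single positive-valued feature column (log1p is used in place of a bare log to tolerate zeros, which is our choice):

```python
import numpy as np
from scipy import stats

def transform(x, method):
    """Apply one of the four data transformations from the study to a
    single feature column x (box-cox requires x > 0)."""
    if method == "log":
        return np.log1p(x)                  # log transform
    if method == "z-score":
        return (x - x.mean()) / x.std()     # standardization
    if method == "rank":
        return stats.rankdata(x)            # rank transform
    if method == "box-cox":
        return stats.boxcox(x)[0]           # fitted lambda discarded
    raise ValueError(method)
```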

19.
With the blooming of Internet information delivery, document classification has become indispensable and is expected to be handled by automatic text categorization. This paper presents a text categorization system that solves the multi-class categorization problem. The system consists of two modules: a processing module and a classifying module. In the first module, ICF and Uni are used as indicators to extract the relevant terms. In the classifying module, fuzzy set theory is incorporated into the one-against-all SVM (OAA-SVM); we specifically propose an OAA-FSVM classifier to implement a multi-class classification system. The performances of OAA-SVM and OAA-FSVM are evaluated by a macro-averaged performance index.
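A common way to realize a fuzzy SVM on top of a standard solver is to pass per-document fuzzy memberships as sample weights; a one-against-all sketch under that assumption (not necessarily the authors' formulation):

```python
import numpy as np
from sklearn.svm import LinearSVC

class OAAFuzzySVM:
    """One-against-all SVM where each training document carries a fuzzy
    membership in [0, 1], applied as a per-sample weight."""
    def fit(self, X, y, membership):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            m = LinearSVC()
            # Binary problem "c vs. rest", weighted by fuzzy membership.
            m.fit(X, (y == c).astype(int), sample_weight=membership)
            self.models_.append(m)
        return self

    def predict(self, X):
        scores = np.column_stack([m.decision_function(X)
                                  for m in self.models_])
        return self.classes_[scores.argmax(axis=1)]
```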

20.
In recent years, sparse subspace clustering (SSC) has demonstrated its advantages in the subspace clustering field. Generally, SSC first learns a representation matrix of the data by self-expression and then constructs an affinity matrix based on the obtained sparse representation. Finally, the clustering result is obtained by applying spectral clustering to the affinity matrix. As described above, existing SSC algorithms often learn the sparse representation and the affinity matrix separately; as a result, they may not reach the optimal clustering result because the two steps are independent. To this end, we propose a novel clustering algorithm that learns the representation and the affinity matrix jointly. With the proposed method, the sparse representation and the affinity matrix are learned in a unified framework, where the procedure is guided by a graph regularizer derived from the affinity matrix. Experimental results show that the proposed method achieves better clustering results than other subspace clustering approaches.
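A sketch of the classic separate-step SSC pipeline that the abstract describes (and whose two independent stages the paper's joint formulation replaces):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc(X, n_clusters, alpha=0.01):
    """Separate-step SSC: each sample is sparsely represented by the
    others (self-expression), |C| + |C|^T gives the affinity matrix,
    and spectral clustering produces the labels. X: (n_samples, dim)."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(X[others].T, X[i])        # x_i ~ combination of others
        C[i, others] = lasso.coef_
    W = np.abs(C) + np.abs(C).T             # symmetric affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```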
