Similar literature
 20 similar documents found (search time: 46 ms)
1.
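This paper introduces the clustering process and the classification of cluster validity indices, and reviews several clustering algorithms found in software commonly used in scientometrics. It analyzes the characteristics of these algorithms and examines the validity of their clustering results in terms of intra-cluster compactness and inter-cluster separation. It then summarizes how well each algorithm performs and presents a case study based on the analysis results of the corresponding software.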
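The two validity notions named in this abstract, intra-cluster compactness and inter-cluster separation, can be illustrated with a small sketch. The definitions below (mean distance to the own centroid, and minimum distance between centroids) are one common choice and are an assumption here; the abstract does not say which formulas the authors use, and the function names are illustrative.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def compactness(clusters):
    """Mean distance from each point to its own cluster centroid (lower is tighter)."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(dist(p, c) for p in pts)
        count += len(pts)
    return total / count


def separation(clusters):
    """Smallest distance between any two cluster centroids (higher is better)."""
    cents = [centroid(pts) for pts in clusters]
    return min(dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:])


# Toy data (hypothetical): two well-separated clusters of two points each.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
comp = compactness(clusters)   # every point is 1 unit from its centroid
sep = separation(clusters)     # centroids are (0, 1) and (10, 1)
print(comp, sep)
```

A good partition under these definitions has a small compactness value relative to its separation value.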

2.
How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together have a similar relevance to a given query. However, while this hypothesis has been demonstrated to hold in classical information retrieval environments, it has never been fully tested in heterogeneous distributed information retrieval environments. Heterogeneous document representations, the presence of document duplicates, and disparate qualities of retrieval results are major features of a heterogeneous distributed information retrieval environment that might disrupt the effectiveness of the cluster hypothesis. In this paper we report on an experimental investigation into the validity and effectiveness of the cluster hypothesis in highly heterogeneous distributed information retrieval environments. The results show that although clustering is affected by different retrieval results representations and quality, the cluster hypothesis still holds, and that generating hierarchical clusters in highly heterogeneous distributed information retrieval environments is still a very effective way of presenting retrieval results to users.

3.
Typically, graph-clustering approaches assume that a cluster is a vertex subset such that for all of its vertices, the number of links connecting a vertex to its cluster is higher than the number of links connecting the vertex to the remaining graph. We consider a cluster such that for all of its vertices, the number of links connecting a vertex to its cluster is higher than the number of links connecting the vertex to any other cluster. Based on this fundamental view, we propose a graph-clustering algorithm that identifies clusters even if they contain vertices more strongly connected outside than inside their cluster; hence, the proposed algorithm proves exceptionally efficient in clustering densely interconnected graphs. Extensive experimentation with artificial and real datasets shows that our approach outperforms earlier alternative clustering techniques.
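The difference between the two cluster definitions in this abstract can be made concrete. The sketch below only checks whether a given partition satisfies each definition; it is not the authors' clustering algorithm, and the graph, function names, and strict-inequality reading are assumptions for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from an edge list (no self-loops assumed)."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def links_to(graph, v, members):
    """Number of neighbours of v that lie in the vertex set `members`."""
    return sum(1 for u in graph[v] if u in members)


def satisfies_classical(graph, clusters):
    """Every vertex has more links inside its cluster than to the rest of the graph."""
    for own in clusters:
        for v in own:
            inside = links_to(graph, v, own)
            if len(graph[v]) - inside >= inside:
                return False
    return True


def satisfies_relaxed(graph, clusters):
    """Every vertex has more links inside its cluster than to any single other cluster."""
    for i, own in enumerate(clusters):
        for v in own:
            inside = links_to(graph, v, own)
            for j, other in enumerate(clusters):
                if j != i and links_to(graph, v, other) >= inside:
                    return False
    return True


# Hypothetical graph: four triangles; vertex a1 also links to one vertex
# in each of the other three triangles (2 links inside, 3 links outside).
edges = []
for name in "abcd":
    x1, x2, x3 = (name + str(k) for k in (1, 2, 3))
    edges += [(x1, x2), (x2, x3), (x1, x3)]
edges += [("a1", "b1"), ("a1", "c1"), ("a1", "d1")]

graph = build_graph(edges)
clusters = [{n + str(k) for k in (1, 2, 3)} for n in "abcd"]

classical_ok = satisfies_classical(graph, clusters)
relaxed_ok = satisfies_relaxed(graph, clusters)
print(classical_ok, relaxed_ok)
```

Here `a1` is more strongly connected outside its triangle than inside it, so the partition fails the classical definition, yet it still has more links to its own cluster than to any other single cluster.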

4.
We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids). This out-of-context clustering task led us to adapt multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. Also, CPCL showed good adaptability for handling very large and sparse matrices.

5.
Document clustering is an important tool for document collection organization and browsing. In real applications, some limited knowledge about the cluster membership of a small number of documents is often available, such as some pairs of documents belonging to the same cluster. This kind of prior knowledge can serve as constraints for the clustering process. We integrate the constraints into the trace formulation of the sum of squared Euclidean distances function of K-means. Then, the combined criterion function is transformed into trace maximization, which is further optimized by eigen-decomposition. Our experimental evaluation shows that the proposed semi-supervised clustering method can achieve better performance compared to three existing methods.

6.
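Zhang Dan (张丹), He Yue (何跃). 《情报杂志》, 2012, 31(5): 62-65
Social networking service (SNS) sites have accumulated enormous user bases and user relationship networks, which hold great commercial value. Taking the social networks of SNS sites as the research object and combining the basic ideas of cluster analysis and social network analysis, this paper proposes SNAC, a clustering-based social network analysis framework for SNS relationships that is grounded in social psychology theory. It defines "intimacy" as a weight index for SNS relationships, improves the K-Means clustering algorithm based on "cluster saturation", describes the characteristics and small-world properties of SNS relationship networks using network parameters, and verifies the feasibility of the framework experimentally.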

7.
As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents for versatile applications. However, most document clustering methods still suffer from challenges in dealing with the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only retains the merits of FIHC but also improves its accuracy.

8.
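Nie Zhen (聂珍), Wang Huaqiu (王华秋). 《现代情报》, 2012, 32(7): 112-116, 121
This paper takes three measures to improve clustering quality. First, because the feature attributes in each data dimension affect clustering differently, it uses a statistics-based dimension-weighting method for feature selection. Second, it modifies the pitch adjustment probability of the harmony search algorithm and combines the improved harmony search with fuzzy clustering to find optimal cluster centers quickly. Third, it iteratively tests clustering quality under different numbers of centers to obtain the best number of cluster centers. The algorithm is then applied to modeling the interests of library readers, identifying the types of books each reader borrows during the library's daily operation. Experiments show that the algorithm outperforms other algorithms. Such clustering of reader interests can support book recommendation and thereby improve the library's operating efficiency.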

9.
Because of their promising application in gathering information from unreachable locations, wireless sensor networks pose a considerable challenge for data routing, which must maximize communication while remaining energy efficient. For energy-efficient routing, optimization-based clustering protocols are generally preferred in wireless sensor networks. In this paper, we propose an optimization-based algorithm called the Fractional lion (FLION) clustering algorithm for creating an energy-efficient routing path. The proposed clustering algorithm is used to increase the energy and lifetime of the network nodes by selecting the rapid cluster head. In addition, we propose a multi-objective FLION clustering algorithm with a new fitness function based on five objectives: intra-cluster distance, inter-cluster distance, cluster head energy, normal node energy, and delay. The proposed fitness function is used to find the rapid cluster centroid for an efficient routing path. Finally, the performance of the proposed clustering algorithm is compared with existing clustering algorithms such as low energy adaptive clustering hierarchy (LEACH), particle swarm optimization (PSO), artificial bee colony (ABC), and the Fractional ABC clustering algorithm. The results show that the proposed FLION-based multi-objective clustering algorithm maximizes the lifetime of the wireless sensor nodes compared with existing protocols.

10.
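Research on Online Review Mining Based on Improved Feature Extraction and Clustering   Total citations: 1; self-citations: 0; citations by others: 1
[Purpose/Significance] This study addresses two problems under information overload: the low performance of feature extraction from Chinese online product reviews, and the selection of initial centers in feature clustering. [Method/Process] It proposes a weight-based improved Apriori algorithm to generate a candidate set of product features, and then filters the candidates using independent support, a rule for frequent nouns that are not features, and a PMI algorithm based on web search engines. Using HowNet-based semantic similarity and feature-opinion co-occurrence as measures of the association between product features, it proposes an improved K-means algorithm to cluster the product features. [Results/Conclusions] Experiments show that in the feature extraction stage precision is 69%, recall is 92.64%, and the combined value reaches 79.07%. In the feature clustering stage, the proposed improved K-means algorithm achieves better mining performance than the traditional algorithm.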
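The reported extraction figures can be sanity-checked with a short calculation. This is a minimal sketch that assumes the combined value (综合值) is the balanced F-measure, the harmonic mean of precision and recall; the abstract does not state the formula, and `f_measure` is an illustrative name.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract: precision 69%, recall 92.64%.
value = f_measure(0.69, 0.9264)
print(round(value, 4))  # 0.7909
```

Under that assumption the result is about 79.09%, close to the reported 79.07%; the small gap would be consistent with the precision having been rounded to 69% in the abstract.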

11.
There are several recent studies that propose search output clustering as an alternative representation method to ranked output. Users are provided with cluster representations instead of lists of titles and invited to make decisions on groups of documents. This paper discusses the difficulties involved in representing clusters for users’ evaluation in a concise but easily interpretable form. The discussion is based on findings and user feedback from a user study investigating the effectiveness of search output clustering. The overall impression created by the experiment results and users’ feedback is that clusters cannot be relied on to consistently produce meaningful document groups that can easily be recognised by the users. They also seem to lead to unrealistic user expectations.

12.
Clusters face what has been referred to as a ‘cluster paradox’: a situation in which a collective identity breeds cohesion and efficiency in inter-organisational collaboration, yet hinders the variety needed to adapt to disruptive change and prevent lock-in situations. Accordingly, a recurring theme in the literature on cluster evolution and cluster life-cycles is the need for constant renewal to allow clusters to adapt to a changing environment. However, how individual firms enact a process of cluster renewal and consider possible response options is not well understood. Using a French energy cluster as the empirical setting, this paper investigates individual members’ enactment of the renewal in terms of how it could affect their current position, both structurally and relationally, and to what extent members felt that they had agency to steer the process to safeguard their position. The findings show that members’ enactment of the proposed change depends not only on the perceived impact of cluster renewal on the member itself but also on the impact the renewal might have on other members in the firm’s network. The analysis also suggests that cluster renewal leads to a leadership vacuum where it is not clear who, if anyone, will lead the renewal process.

13.
This study employs our proposed semi-supervised clustering method, called Constrained-PLSA, to cluster tagged documents with a small number of labeled documents, and uses two data sets for system performance evaluations. The first data set is a document set whose boundaries among the clusters are not clear, while the second one has clear boundaries among clusters. This study employs abstracts of papers and the tags annotated by users to cluster documents. Four combinations of tags and words are used for feature representations. The experimental results indicate that almost all of the methods can benefit from tags. However, unsupervised learning methods fail to function properly on the data set with noisy information, whereas Constrained-PLSA functions properly. In many real applications, background knowledge is readily available, making it appropriate to employ background knowledge in the clustering process to make the learning faster and more effective.

14.
Jackie Krafft. Research Policy, 2004, 33(10): 1687-1706
The process by which knowledge is created, accumulated and eventually destroyed appears crucial to many industrial dynamics patterns, since it shapes the profile of evolution of industries by favouring the entry of new companies, the co-existence of incumbents and new entrants and, eventually, their selective or joint exit over time. Though problematic, and all too often neglected, the connection between two nodes of interest, Industrial Dynamics on the one hand, and Knowledge Dynamics on the other hand, thus appears as a promising field of research. On the basis of a case study in the info-communications industry, we start by emphasizing that this field of research has direct importance at the empirical level. Knowledge dynamics can create specific models of evolution among firms at the local level, such as non-shakeout patterns within the cluster, which significantly differ from more global patterns of evolution in the info-communications industry, now generally oriented towards trends of decline and bust. We further argue in favour of the development of Knowledge-Based Industrial Dynamics, an approach that lies at the interface of industry and knowledge dynamics, and which can explain how a cluster may decrease the barriers to knowledge of clustered companies and, further, create a specific knowledge dynamics that is able to shape the industrial dynamics. Finally, we document how this process of knowledge dynamics was collectively implemented in our case study on the info-communications cluster and decompose the mechanisms that led to a local non-shakeout pattern of industrial dynamics. We conclude with some remarks on the policy implications.

15.
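Liu Gaoyong (刘高勇), Wang Huiling (汪会玲). 《情报科学》, 2007, 25(6): 929-931, 937
A self-organizing map (SOM) network can be used to cluster texts. Index terms can then be clustered on this basis, yielding a text cluster map and an index-term cluster map. With these two maps, plain text can be self-organized into hypertext: selected knowledge points in the plain text are hyperlinked to the Web documents related to them.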

16.
17.
In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large data sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.

18.
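Studying industrial evolution from the perspective of the agglomeration of mobile factors highlights the essence of industrial evolution. This paper builds a single-factor flow model and a dynamic multi-factor coupling model to analyze the paths by which factor flows drive industrial agglomeration, and how agglomeration completes the dynamic evolution of an industry through economic change and technological change. The study concludes that factors interact and couple dynamically as they flow, leading to the overall evolution of the industrial economic system; that the combined effects of agglomeration promote economic change in the industry; that competition effects during economic change induce technological change; that the dynamic interaction of economic and technological change determines the trend of industrial evolution; and that technological innovation is the decisive factor in whether an industry upgrades or declines.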

19.
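Most research on patent maps in China remains at the application stage, with little study of the basic theory behind their construction. This paper outlines the current categories of patent maps and analyzes the shortcomings of existing construction methods. To increase the credibility and value of patent document information, it uses TF-IDF (term frequency-inverse document frequency) statistical features to map unstructured patent document information into a low-dimensional space and clusters the documents with the clustering by fast search and find of density peaks (CFSFDP) algorithm. It then analyzes the features of the patent documents within each cluster to obtain the developmental relationships among them and maps these relationships to a graph, yielding a patent map represented as a directed graph. The proposed improved construction method uses both structured and unstructured information, so that the patent map reflects the technological development of the target field more truthfully and accurately.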
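The density-peaks step named in this abstract rests on two quantities per point: a local density and the distance to the nearest denser point. The sketch below computes both on toy 2-D points standing in for low-dimensional document vectors. The cutoff `d_c`, the data, and the rule of picking centers by the largest product of the two quantities are assumptions for illustration, not the paper's settings.

```python
from math import dist  # Python 3.8+


def density_peak_scores(points, d_c):
    """Return (rho, delta) for each point.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point with higher rho; for a point
               with no denser point, the largest distance to any point
    """
    n = len(points)
    rho = [
        sum(1 for j in range(n) if j != i and dist(points[i], points[j]) < d_c)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(dist(points[i], points[j]) for j in range(n)))
    return rho, delta


def pick_centers(rho, delta, k):
    """Indices of the k points with the largest rho * delta (a simple heuristic)."""
    gamma = [r * d for r, d in zip(rho, delta)]
    top = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical data: one group around (0, 0) and one around (10, 0).
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (10, 0), (11, 0), (9, 0)]
rho, delta = density_peak_scores(points, d_c=1.2)
centers = pick_centers(rho, delta, k=2)
print(rho, centers)
```

Points that are both dense and far from any denser point stand out as cluster centers; the remaining points would then be assigned to the cluster of their nearest denser neighbor.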

20.
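This paper integrates data-mining cluster analysis into ETC operations management. It analyzes three key features of ETC transaction data, builds a data model, and uses the silhouette coefficient to evaluate the clustering results. Based on the clustering results, ETC users are classified along two dimensions, and methods and suggestions are proposed for ETC promotion and marketing and for ETC system maintenance and management.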
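The silhouette coefficient used here for evaluation can be computed directly from pairwise distances. The sketch below is a plain-Python version under the usual definition, s = (b - a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the lowest mean distance to any other cluster. The toy data are hypothetical and unrelated to the ETC features in the paper.

```python
from math import dist  # Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters.

    Points in singleton clusters get a score of 0 by convention.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members (dist(p, p) == 0 adds nothing)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in pts) / len(pts)
            for other, pts in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 1-D data with an obvious split between {0, 1} and {10, 11}.
points = [(0,), (1,), (10,), (11,)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, while negative values, as with the second labeling, indicate points that sit closer to another cluster than to their own.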
