Similar Literature
 Found 20 similar documents (search time: 718 ms)
1.
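Quality Evaluation of Text Clustering Algorithms   Total citations: 4 (self-citations: 0; citations by others: 4)
Text clustering is one of the effective means of building instances of classification schemes for large-scale text collections. This paper discusses how to evaluate clustering quality quantitatively using standard classification test collections, and experimentally compares the k-Means clustering algorithm, the STC (suffix tree clustering) algorithm, and an Ant-based clustering algorithm. Analysis of the experimental results shows that STC clusters well because it takes full account of the phrase characteristics of text when processing it; that the results of Ant-based clustering are strongly affected by the input parameters; and that introducing text characteristics into the Ant clustering algorithm can improve the quality of its results.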
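The abstract above does not say which quality measure the authors used. As a minimal sketch of what "quantitative evaluation against a labelled test collection" can look like, the following computes cluster purity, one common measure chosen here only as an assumption. The function name `purity` and the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count how many documents of each reference class land in each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    # Each cluster contributes the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, three clusters, three reference classes.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["sport", "sport", "news", "news", "news", "tech"]
print(purity(clusters, labels))
```

Purity rewards many small clusters, so in practice it is usually reported alongside a measure such as F-measure or normalized mutual information.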

2.
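Taking the product sales of a retail enterprise as its subject, this paper analyzes and designs a product sales data warehouse for that enterprise, and describes how online analytical processing (OLAP), a multidimensional data analysis technique built on the data warehouse, is applied within it.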

3.
周秀英 《科教文汇》2014,(34):135-136
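Text linguistics, a major branch of linguistics, emerged after structuralist linguistics and transformational-generative grammar and is now widely applied in language education research. In view of Chinese college students' weak English writing ability, this paper proposes an English writing teaching model based on a discourse perspective and explores what this model implies for the teaching of English writing in China.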

4.
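With the development of data warehouse theory, data warehouse systems have gradually become the solution for a new type of decision-making management information system, and the core of a data warehouse system is online analytical processing (OLAP). This paper introduces OLAP technology, some basic concepts of multidimensional data, the main methods of multidimensional data analysis, and three storage modes for multidimensional data: ROLAP, MOLAP, and HOLAP.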

5.
齐瑞敏  史向红 《科教文汇》2014,(6):50-50,52
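Textual cohesion has an important influence both on English reading comprehension and on the teaching of English reading. Supported by the theory of textual cohesion, this paper constructs a cohesion analysis model and, by analyzing cohesive devices, explores a new teaching model. Through an empirical study that applies cohesion theory to actual teaching, it tests whether cohesion-based teaching can effectively improve college students' English reading ability.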

6.
Clinicians, healthcare providers and suppliers, policy makers and patients are experiencing exciting opportunities in light of new information derived from the analysis of big data sets, a capability that has emerged in recent decades. Due to the rapid increase in publications in the healthcare industry, we have conducted a structured review of healthcare big data analytics. With reference to the resource-based view theory, we focus on how big data resources are utilised to create organizational value and capabilities, and through content analysis of the selected publications we discuss: the classification of big data types related to healthcare, the associated analysis techniques, the value created for stakeholders, the platforms and tools for handling big health data, and future aspects of the field. We present a number of pragmatic examples to show how the advances in healthcare were made possible. We believe that the findings of this review are stimulating and provide valuable information to practitioners, policy makers and researchers while presenting them with certain paths for future research.

7.
Nowadays, access to information requires managing multimedia databases effectively, and so multi-modal retrieval techniques (particularly image retrieval) have become an active research direction. In the past few years, many content-based image retrieval (CBIR) systems have been developed. However, despite the progress achieved in CBIR, the retrieval accuracy of current systems is still limited and often worse than that of purely textual information retrieval systems. In this paper, we propose to combine content-based and text-based approaches to multi-modal retrieval in order to achieve better results and overcome the shortcomings these techniques show when used separately. For this purpose, we use a medical collection that includes both images and non-structured text. We retrieve images from a CBIR system and textual information through a traditional information retrieval system. Then, we combine the results obtained from both systems in order to improve the final performance. Furthermore, we use the information gain (IG) measure to reduce and improve the textual information included in multi-modal information retrieval systems. We have carried out several experiments that combine this reduction technique with a visual and textual information merger. The results obtained are highly promising and show the benefit obtained when textual information is managed to improve conventional multi-modal systems.
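The abstract above names information gain (IG) as its text-reduction measure but does not describe how it is applied. The following is a minimal sketch of the textbook definition of IG for a single term with respect to class labels, assuming each document is represented as a set of tokens. The function names and the toy medical-style data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(term, documents, labels):
    """IG(term) = H(class) - H(class | term present or absent)."""
    with_term = [y for doc, y in zip(documents, labels) if term in doc]
    without_term = [y for doc, y in zip(documents, labels) if term not in doc]
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip an empty branch to avoid dividing by zero
            conditional += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy collection: each document is a set of tokens.
docs = [{"lung", "xray"}, {"lung", "ct"}, {"knee", "xray"}, {"knee", "mri"}]
classes = ["chest", "chest", "limb", "limb"]
print(information_gain("lung", docs, classes))  # term separates the classes
print(information_gain("xray", docs, classes))  # term carries no class signal
```

Terms with low IG are candidates for removal before indexing, which is one plausible reading of "reduce and improve the textual information".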

8.
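A Study on the Application of Least Squares Support Vector Machines in Data Mining   Total citations: 6 (self-citations: 0; citations by others: 6)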
蔡冬松  靖继鹏 《情报科学》2005,23(12):1877-1880
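With the development of data warehouse and online analytical technologies, database-based data mining has become an important means of data processing. The least squares support vector machine (LS-SVM), a new machine learning method, offers global convergence and good generalization ability. This paper applies it to classification and prediction in data mining. Through kernel function selection and parameter optimization, and through comparison with support vector machines, multilayer perceptron neural network models, and discriminant analysis, it shows that LS-SVM is an effective data mining algorithm with relatively high accuracy.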

9.
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image or hybrid content representations. We reformulate the ad-hoc retrieval problem as a document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters. A comprehensive experimental evaluation was conducted considering diverse well-known public datasets, composed of textual, image, and multimodal documents. Performed experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and promoting large gains over the rankers being fused, thus demonstrating the successful capability of the proposal in representing queries based on a unified graph-based model of rank fusions.

10.
Textual entailment is a task for which the application of supervised learning mechanisms has received considerable attention as driven by successive Recognizing Data Entailment data challenges. We developed a linguistic analysis framework in which a number of similarity/dissimilarity features are extracted for each entailment pair in a data set and various classifier methods are evaluated based on the instance data derived from the extracted features. The focus of the paper is to compare and contrast the performance of single and ensemble-based learning algorithms for a number of data sets. We showed that there is some benefit to the use of ensemble approaches but, based on the extracted features, Naïve Bayes proved to be the strongest learning mechanism. Only one ensemble approach demonstrated a slight improvement over Naïve Bayes.

11.
Question classification (QC) involves classifying a given question based on the expected answer type and is an important task in a Question Answering (QA) system. Existing approaches for question classification use the full training dataset to fine-tune the models. Developing large labelled datasets is expensive and time-consuming. Hence, there is a need to develop approaches that can achieve comparable or state-of-the-art performance using limited training instances. In this paper, we propose an approach that uses data augmentation as a tool to generate additional training instances. We evaluate our proposed approach on two question classification datasets, namely the TREC and ICHI datasets. Experimental results show that our proposed approach reduces the requirement for labelled instances (a) by up to 81.7% and achieves a new state-of-the-art accuracy of 98.11 on the TREC dataset and (b) by up to 75% and achieves 67.9 on the ICHI dataset.

12.
Several approaches focus on how to automatically capture the latent features from original diffusion data and predict the future scale of cascades utilizing a black box framework. However, they overlook the underlying mechanism by which each participant becomes involved in the cascade. In this work, we bridge the gap between prediction and understanding of information diffusion by incorporating deep learning techniques and social psychology. To characterize individual participation driven by both subjective and objective impetus and integrate it into the macro-level cascade, we propose an end-to-end model, named PFDID, which is designed based on the field dynamics theory of psychology, including the intrinsic cognition field and the extrinsic environment field. We represent these two field dynamics respectively with the pairwise semantic relation between the message itself and corresponding comment and the forwarder's micro-community activity embedding to provide educated explanations for forwarding behaviour. Afterwards, the cross infusion mechanism is designed to calculate the mutual influence of inhomogeneous field dynamics inside users and cross influence of homogeneous field dynamics among individuals, whose output is fed into the diffusion network aggregation layer for the cascade size prediction. Extensive experiments on two typical social networks, Sina Weibo and Twitter, show that the proposed PFDID outperforms state-of-the-art approaches. Our model achieves excellent prediction results, with MSLE = 1.856 on Sina Weibo and MSLE = 1.962 on Twitter, providing 6.54% and 10.53% relative performance gains, respectively. Furthermore, the interpretability is also discussed based on detailed visualization. We observe that the psychological impetus behind social behaviour varies mainly following two patterns with the spread of information, including gradual change and joint influence. Additionally, the indirect dependencies have also been verified.
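The results above are reported as MSLE (mean squared logarithmic error). The abstract does not give the exact formula, so the sketch below uses one common definition, with a natural logarithm and a +1 offset, purely as an assumption. Cascade-prediction papers sometimes use a different log base or offset, in which case the absolute values are not comparable. The function name `msle` and the sample numbers are hypothetical.

```python
import math


def msle(predicted_sizes, true_sizes):
    """Mean of (log(1 + predicted) - log(1 + true))^2 over all cascades."""
    if len(predicted_sizes) != len(true_sizes) or not true_sizes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for p, t in zip(predicted_sizes, true_sizes)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical cascade sizes: predictions versus observed forward counts.
print(msle([12, 150, 3], [10, 200, 3]))
```

Working in log space keeps a few very large cascades from dominating the error, which is why the metric suits heavy-tailed cascade sizes.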

13.
One of the important problems in text classification is the high dimensionality of the feature space. Feature selection methods are used to reduce the dimensionality of the feature space by selecting the most valuable features for classification. Apart from reducing the dimensionality, feature selection methods have the potential to improve text classifiers' performance in terms of both accuracy and time. Furthermore, they help to build simpler and, as a result, more comprehensible models. In this study we propose new methods for feature selection from textual data, called Meaning Based Feature Selection (MBFS), which are based on the Helmholtz principle from the Gestalt theory of human perception that is used in image processing. The proposed approaches are extensively evaluated by their effect on the classification performance of two well-known classifiers on several datasets and compared with several feature selection algorithms commonly used in text mining. Our results demonstrate the value of the MBFS methods in terms of classification accuracy and execution time.

14.
The advantages of user click data greatly inspire its wide application in fine-grained image classification tasks. In previous click data based image classification approaches, each image is represented as a click frequency vector on a pre-defined query/word dictionary. However, this approach not only introduces high-dimensional issues, but also ignores the part of speech (POS) of a specific word as well as the word correlations. To address these issues, we devise the factorized deep click features to represent images. We first represent images as the factorized TF-IDF click feature vectors to discover word correlation, wherein several word dictionaries of different POS are constructed. Afterwards, we learn an end-to-end deep neural network on click feature tensors built on these factorized TF-IDF vectors. We evaluate our approach on the public Clickture-Dog dataset. It shows that: 1) the deep click feature learned on the click tensor performs much better than traditional click frequency vectors; and 2) compared with many state-of-the-art textual representations, the proposed deep click feature is more discriminative and yields higher classification accuracy.

15.
In the last few years, hybrid generative discriminative approaches have received increasing attention and their capabilities have been demonstrated by several applications in different domains. Hybrid approaches allow the incorporation of prior knowledge about the nature of the data to be classified. Past work on hybrid approaches has focused on Gaussian data, however, and less attention has been given to the non-Gaussian data that appear in many applications. In this article we introduce a class of generative kernels based on finite mixture models for non-Gaussian data classification. This particular class is based on the generalized Dirichlet distribution, which has been shown to be effective for modelling this kind of data. We demonstrate the efficacy of the proposed framework on two challenging applications, namely object detection and content-based image classification via the integration of color and spatial information.

16.
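Land Cover Remote Sensing Classification Based on Support Vector Machines   Total citations: 4 (self-citations: 0; citations by others: 4)
Classification of remote sensing images is the basis for studying land change. Traditional remote sensing image classification suffers from low accuracy and high uncertainty. This paper uses support vector machines (SVM) to classify remote sensing images and compares the results experimentally with traditional maximum likelihood classification. The results show that, across different parameter combinations, the overall accuracy and Kappa index of SVM are generally higher than those of maximum likelihood classification, with the best overall accuracy exceeding that of maximum likelihood classification by 0.9779%. Both SVM and maximum likelihood classification confuse some classes, but SVM does so far less, and its accuracy stays within an acceptable range. For low-density grass, for example, the user's accuracy of maximum likelihood classification drops to 84.68%, whereas that of SVM, although also lower, remains at 92.31%. SVM shows excellent learning ability even with very few samples and is a promising learning method in machine learning.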
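The three figures quoted above (overall accuracy, Kappa index, user's accuracy) are all derived from a confusion matrix. The following minimal sketch shows the standard formulas, assuming rows hold the reference classes and columns hold the classified classes. The function name `accuracy_report` and the two-class matrix are hypothetical and do not reproduce the paper's data.

```python
def accuracy_report(matrix):
    """Return (overall accuracy, Cohen's kappa, per-class user's accuracy).

    matrix[i][j] counts samples whose reference class is i and whose
    classified class is j. Assumes every class was assigned at least once
    and that chance agreement is below 1, otherwise a division by zero occurs.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    overall = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (overall - expected) / (1 - expected)
    # User's accuracy: of the samples mapped to class j, the share truly in j.
    users = [matrix[j][j] / col_totals[j] for j in range(k)]
    return overall, kappa, users


# Hypothetical two-class confusion matrix.
overall, kappa, users = accuracy_report([[45, 5], [10, 40]])
print(overall, kappa, users)
```

Kappa discounts the agreement a random classifier with the same marginals would reach, which is why it is reported next to overall accuracy.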

17.
宋庆元 《科技广场》2005,17(1):53-57
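This paper presents the prototype implementation and design theory of a mining system for chemical data analysis. It describes the theoretical process of developing a prototype online analytical data mining system from the perspective of chemical data analysis. The study uses the OLAP technology provided by the data warehouse for association rule mining and offers a binary encoding technique for data items, which is of some value for improving the processing capacity and reliability of data. The system is expected to extract information about chemical reactions automatically from literature or databases, discover new and useful chemical components, and support functions such as synthesis design and reaction prediction, making it a useful attempt at implementing data mining.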

18.
Semantic representation reflects the meaning of the text as it may be understood by humans. Thus, it contributes to facilitating various automated language processing applications. Although semantic representation is very useful for several applications, few models have been proposed for the Arabic language. In that context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts have been employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined linguistic rules of the Arabic language. The core idea of the proposed model is to represent the meaning of Arabic sentences as a rooted acyclic graph. The textual entailment recognition challenge is considered in order to evaluate the ability of the proposed model to enhance other Arabic NLP applications. The experiments have been conducted using a benchmark Arabic textual entailment dataset, namely ArbTED. The results showed that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to other baseline models. On average, the proposed model achieved 8.6%, 30.2%, 5.3% and 16.2% improvement in terms of accuracy, recall, precision, and F-score results, respectively.

19.
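Focusing on the scope and specific content of data empowerment in driving the transformation and upgrading of manufacturing enterprises, this study provides enterprises with a scientific tool and method for self-diagnosing their data empowerment and understanding their own data use, thereby helping manufacturing enterprises transform and upgrade with the aid of data. Based on the practical needs of such transformation, the study follows the technical route "literature review → reference architecture → evaluation framework → indicator system → weighting by the analytic hierarchy process → application of fuzzy comprehensive evaluation → maturity model". After clarifying the theoretical evolution of data empowerment, the digital transformation of manufacturing enterprises, and the data value chain, and in keeping with the design principles of maturity models, it develops an evaluation indicator system for data empowerment in manufacturing enterprises along three dimensions: data empowerment readiness, data empowerment depth, and data empowerment improvement. Indicator weights are determined with the analytic hierarchy process, and an enterprise's data empowerment level is scored with fuzzy comprehensive evaluation. A corresponding maturity grading model is also constructed, with five levels: leading benchmark, improvement and optimization, comprehensive integration, growth and standardization, and initial planning. The indicator system and maturity model will be revised continually to further improve their generality and accuracy.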
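The weighting and scoring steps named above (analytic hierarchy process, then fuzzy comprehensive evaluation) can be sketched as follows. This is only an illustration under stated assumptions: weights come from the row geometric-mean approximation of AHP, the fuzzy step uses the weighted-average operator, and the pairwise judgments and membership degrees are hypothetical numbers, not the study's data. Only the three dimensions and five maturity levels come from the abstract.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights via normalized row geometric means."""
    geo_means = [math.prod(row) ** (1 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def fuzzy_evaluate(weights, membership):
    """Weighted-average fuzzy evaluation: B = W x R, one value per grade."""
    grades = len(membership[0])
    return [
        sum(w * row[g] for w, row in zip(weights, membership))
        for g in range(grades)
    ]


LEVELS = ["leading benchmark", "improvement and optimization",
          "comprehensive integration", "growth and standardization",
          "initial planning"]

# Hypothetical pairwise judgments for readiness, depth, improvement.
pairwise = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
# Hypothetical membership degrees of each dimension in the five levels.
membership = [[0.1, 0.2, 0.4, 0.2, 0.1],
              [0.0, 0.1, 0.3, 0.4, 0.2],
              [0.0, 0.1, 0.2, 0.4, 0.3]]

weights = ahp_weights(pairwise)
result = fuzzy_evaluate(weights, membership)
print(LEVELS[result.index(max(result))])  # level with the highest membership
```

A full AHP workflow would also check the consistency ratio of the pairwise matrix before trusting the weights; that step is omitted here for brevity.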

20.
In this work, we elaborate on the meaning of metadata quality by surveying efforts and experience gained in the digital library domain. In particular, an overview of the frameworks developed to characterize such a multi-faceted concept is presented. Moreover, the most common quality-related problems affecting metadata during both the creation and the aggregation phases are discussed, together with the approaches, technologies and tools developed to mitigate them. This survey of digital library developments is expected to contribute to the ongoing discussion on data and metadata quality occurring in the emerging yet more general framework of data infrastructures.
