Similar literature
 Found 20 similar documents (search time: 62 ms)
1.
This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.

2.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information from web pages more precisely. After automatically segmenting the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are then assigned to the test document using the k-NN text categorization algorithm. The paper discusses how the characteristics of the chemistry dictionary and the test collection affect categorization efficiency, and introduces a new voting method that uses the collection characteristics to improve categorization performance further. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.

3.
An automatic patent categorization system would be invaluable to individual inventors and patent attorneys, saving them time and effort by quickly identifying conflicts with existing patents. In recent years, it has become increasingly common to classify all patent documents using the International Patent Classification (IPC), a complex hierarchical classification system composed of eight sections, 128 classes, 648 subclasses, about 7200 main groups, and approximately 72,000 subgroups. So far, however, no patent categorization method has been developed that can classify patents down to the subgroup level (the bottom level of the IPC). Therefore, this paper presents a novel categorization method, the three phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy. The experimental results for the TPC algorithm, using the WIPO-alpha collection, indicate that our classification method can achieve 36.07% accuracy at the subgroup level. This is approximately a 25,764-fold improvement over a random guess.
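
As a rough check on the fold-improvement figure above, the gain over a random guess can be computed as accuracy multiplied by the number of candidate classes. This is only a sketch: the function names are hypothetical, the uniform-guess baseline is an assumption, and the abstract does not state the exact subgroup count behind the reported figure.

```python
def fold_improvement(accuracy: float, n_classes: int) -> float:
    """Gain over a uniform random guess, whose accuracy is 1 / n_classes (assumed baseline)."""
    return accuracy * n_classes


def implied_class_count(accuracy: float, fold: float) -> float:
    """Number of equally likely classes that would make `fold` the gain over chance."""
    return fold / accuracy


# 36.07% accuracy against the rounded figure of 72,000 subgroups.
approx = fold_improvement(0.3607, 72_000)

# Class count implied by the reported 25,764-fold improvement.
implied = implied_class_count(0.3607, 25_764)
```

Under these assumptions, the reported 25,764-fold figure corresponds to roughly 71,400 candidate subgroups, which is consistent with "approximately 72,000".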

4.
Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., written in the same language) during both the learning of the text categorization model and category assignment (or prediction) for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate documents in one language, organize them into categories, and subsequently archive documents in different languages into the existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, cross-lingual text categorization deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by the significance of this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual- and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.

5.
This paper explores the incorporation of prior knowledge into support vector machines as a means of compensating for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents, i.e., making virtual documents by combining relevant document pairs for a topic in the training set. A virtual document created in this way is expected not only to preserve the topic but also to improve the topical representation by exploiting relevant terms that are not given high importance in individual real documents. The artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines based on linear, polynomial and radial-basis function kernels showed the method's effectiveness on the Reuters-21578 set for topics with a small number of relevant documents. The proposed method achieved 131%, 34%, and 12% improvements in micro-averaged F1 for 25, 46, and 58 topics with fewer than 10, 30, and 50 relevant documents in learning, respectively. The result analysis indicates that incorporating virtual documents contributes to a steady improvement in performance.
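
The pairing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that relevant document pairs are combined, so pairing every combination and concatenating token lists are assumptions, and `make_virtual_documents` is a hypothetical name.

```python
from itertools import combinations


def make_virtual_documents(relevant_docs):
    """Build one virtual document per pair of relevant documents.

    Each document is a list of tokens; a virtual document is assumed here
    to be the simple concatenation of the two token lists.
    """
    return [first + second for first, second in combinations(relevant_docs, 2)]


# Three toy documents assumed relevant to the same topic.
docs = [["oil", "price"], ["crude", "oil"], ["opec", "price"]]
virtual = make_virtual_documents(docs)
```

With n relevant documents this yields n(n-1)/2 virtual documents, which is why the technique adds the most training material, relative to the original set, for topics that start with only a handful of examples.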

6.
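Zhou Kefang (周克放), Qiao Yongzhong (乔永忠). Science Research Management (《科研管理》), 2021, 42(10): 148-155
The patent invalidation procedure plays an important role in controlling patent quality. Taking as a sample the patents in the information and communication technology (ICT) field against which invalidation requests were filed during 2008-2017, and grouping them by validity outcome, this study analyzes how the "maintained valid" and "declared invalid" groups differ across six indicators and in the legal provisions cited during invalidation proceedings. The results show that technological scope and the number of cited patents are significantly negatively correlated with patent quality, whereas the number of non-patent literature citations is positively correlated with it. Patent quality is also closely related to the patentability of the technology and to the drafting and amendment of patent documents; a patent may be declared invalid because of basic procedural and substantive problems arising during application and amendment. Evaluating patent quality therefore requires attention not only to technical and economic indicators but also a comprehensive examination of the factors that may arise during patent application, grant and invalidation review, in particular the content of the patent documents that may be at issue in invalidation review.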

7.
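Research on patent relevance retrieval and integrated application   Total citations: 1 (self-citations: 1, citations by others: 0)
Abstract: This paper describes the current state and characteristics of patent analysis techniques and specialized application systems, proposes a patent model tree for describing patent documents, and, on the basis of the patent model tree, establishes a patent classification method and a patent similarity retrieval method based on the vector space model. Building on these methods, a patent management system is integrated into a workflow management system: an integration framework is established and an integrated system developed, so that each work unit in an enterprise workflow is integrated with the patent similarity retrieval module. Finally, the approach is applied in the workflow system for the design of an electric retarder at an enterprise.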

8.
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues for the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics make it necessary to reassess existing retrieval techniques, which have been developed mainly for short, unstructured documents such as newspaper articles. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters provide additional evidence for matching the user’s information need. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents carry manually assigned cluster information in the form of international patent classification codes. The international patent classification is a standard taxonomy for classifying patents, and currently has about 69,000 nodes organized into a five-level hierarchical system. Patent documents could therefore provide the best test bed for developing and evaluating cluster-based retrieval techniques. Experiments using the NTCIR-4 patent collection showed that the cluster-based language model can help improve on the cluster-less baseline language model.

9.
The paper proposes a new approach to creating a patent classification system to replace the IPC or UPC system for patent analysis and management. The new approach is based on bibliometric co-citation analysis. The traditional approach to patent management, which is based on either the IPC or the UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adapted to the characteristics of a specific industry. The approach is divided into three phases. Phase I selects appropriate databases for patent searches according to the subject and objective of the study and then selects basic patents. Phase II uses the co-cited frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is a patent classification system based on patent similarities that helps patent managers understand the basic patents for a specific industry, the relationships among categories of technologies and the evolution of a technology category.
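
Phase II can be illustrated with a small sketch that counts, for each pair of basic patents, how many later patents cite both. The raw count is used here as the similarity signal, which is an assumption, since the abstract does not say whether the frequency is normalized. The function name and patent identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_to_cited, basic_patents):
    """Count how often each pair of basic patents is cited together.

    citing_to_cited maps a citing patent to the list of patents it cites.
    Pairs are stored as alphabetically sorted tuples.
    """
    basic = set(basic_patents)
    counts = Counter()
    for cited in citing_to_cited.values():
        shared = sorted(basic & set(cited))
        for pair in combinations(shared, 2):
            counts[pair] += 1
    return counts


# Toy citation data: three later patents citing basic patents A, B and C.
citations = {
    "P10": ["A", "B"],
    "P11": ["A", "B", "C"],
    "P12": ["B", "C"],
}
counts = cocitation_counts(citations, ["A", "B", "C"])
```

The resulting pair counts form a symmetric co-citation matrix, which is the kind of input that a factor analysis such as the one in Phase III would operate on.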

10.
In text categorization, the numbers of documents in different categories often differ, i.e., the class distribution is imbalanced. We propose a unique approach to improving text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples of rare classes (categories with relatively small amounts of training data) by using global semantic information of classes represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved using the transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes. Such re-sampling methods can cause overfitting. Another benefit of our approach is the effective handling of noisy samples. Since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.

11.
In this paper, we propose a new algorithm that incorporates the relationships of concept-based thesauri into document categorization using the k-NN classifier (k-NN). k-NN is one of the most popular document categorization methods because it performs relatively well in spite of its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there is more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing a thesaurus entails structuring categories into hierarchies, since their structure needs to conform to that of the thesaurus in order to capture relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be substantially improved, removing the ambiguity. We first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.

12.
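[Research purpose] High-quality patents are very important for promoting patent commercialization, technology tracking and strategic deployment. Faced with massive volumes of patent data, how to identify high-quality patents accurately, efficiently and automatically, and so lay the groundwork for subsequent patent work such as patent investment and financing and industrial transformation, has become an important research question. [Research method] Taking patent applications accepted by the China National Intellectual Property Administration (国家知识产权局) as the research object and using the patent maintenance period to represent patent quality, the study extracts numerical patent features, embeds them in a patent-core-word network generated from patent text features, and builds a graph convolutional network model to identify high-quality patents automatically. [Research conclusion] Existing research on patent quality concentrates on mining numerical patent features and neglects textual ones; the proposed scheme uses both numerical and textual features when identifying high-quality patents automatically, supplementing current research. In addition, the scheme can complete the patent quality identification task when experts have labeled only a small number of patent documents, addressing the limitation that existing quality-labeling schemes cannot measure patent quality comprehensively. It also extends graph convolutional networks to quality identification in the patent setting, providing a new framework for patent quality research, and the experimental results indicate that the scheme has high practical value.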

13.
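A comparative analysis of licensed and transferred patents in terms of their overall situation, geographical characteristics, patent technology age and technology fields shows that: (1) the number of patents licensed by Chinese universities is far greater than the number transferred; they are mainly invention patents, concentrated in the Performing Operations; Transporting section and the Chemistry; Metallurgy section; (2) the regions exporting and importing licensed and transferred patents are mainly those that are economically developed and rich in university resources; (3) the technology age of transferred patents is greater than that of licensed patents; (4) licensed patents are distributed over a wider range of technology fields than transferred patents, and patents in different technology fields suit different technology transfer modes.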

14.
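To identify potential technology opportunities promptly and effectively, this paper uses text mining and outlier detection to propose a technology opportunity identification method based on patent text. First, the Doc2vec text representation model is used to model patent abstracts so as to capture deeper semantic information. Then a density-based outlier detection algorithm identifies patent directions with potential technology opportunities. Finally, taking the identification of potential technologies in the deep learning field as an example, a patent search query is constructed and 458 patent documents are collected as the dataset. The empirical results yield 10 potential technology opportunities in 4 topic categories, verifying the effectiveness of the patent-based identification method, which can serve as a reference for enterprises' technology application, R&D and innovation.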

15.
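Liu Xin (刘鑫), Yu Xiang (余翔). Science Research Management (《科研管理》), 2016, 37(11): 150-158
After reviewing and summarizing research progress in patent text mining in China and abroad, this paper explores a patent function analysis method based on mining and analyzing specific verb-object (AO) structures in patent text. By defining, extracting and analyzing patent functions, it links patented technologies with relevant industries and thereby identifies potential fields for industrialization from patent text. More importantly, the paper proposes an S-curve and an S-index describing the functional utility of patented technology, refines and improves the quantitative evaluation model for the industrialization suitability of patented technology, and defines three evaluation indicators in the model: the S-index, the absolute importance index (AI) of a patent function and the relative importance index (RI) of a patent function. Finally, taking patents with the function "reduce PM2.5" as an example, the paper verifies the feasibility of the function-analysis-based evaluation model, offering a new approach to choosing industrialization paths for Chinese patented technology.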

16.
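Potential standard-essential patents (SEPs) have extremely high strategic and economic value in future markets. Identifying them early is important for building an innovation-oriented country, optimizing enterprise patent portfolios, accelerating technological innovation, improving industry standing and avoiding patent hold-up. However, there is so far little research on identifying potential SEPs automatically. From the perspective of extracting the semantic features of SEPs, this paper proposes using a Bert-CNN network model, together with context, to extract both the implicit global semantic features and the high-dimensional hierarchical semantic features of known SEPs, identifies potential SEPs from the extracted features, and predicts the standards to which a potential SEP may correspond by computing Bert vector similarity. In the empirical part, a validation dataset built from SEPs published by the European Telecommunications Standards Institute (ETSI) is used to verify the model's performance. The results show that the model's precision, recall and F1 score in large-scale patent data experiments exceed those of existing studies.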

17.
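After briefly introducing the current state of keyword retrieval, this paper analyzes the characteristics of patent literature from the perspective of literature retrieval and discusses three aspects of improving keyword retrieval. Finally, it briefly analyzes the development trends of keyword retrieval in the field of patent literature retrieval.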

18.
Question categorization, which suggests one of a set of predefined categories for a user’s question according to the question’s topic or content, is a useful technique in user-interactive question answering systems. In this paper, we propose an automatic method for question categorization in a user-interactive question answering system. The method includes four steps: feature space construction, topic-wise word identification and weighting, semantic mapping, and similarity calculation. We first construct the feature space from all accumulated questions and calculate the feature vector of each predefined category, which contains certain accumulated questions. When a new question is posted, its semantic pattern is used to identify and weight the important words of the question. After that, the question is semantically mapped into the constructed feature space to enrich its representation. Finally, the similarity between the question and each category is calculated based on their feature vectors. The category with the highest similarity is assigned to the question. The experimental results show that our proposed method achieves good categorization precision and outperforms the traditional categorization methods on the selected test questions.
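
The final similarity step can be sketched as below, assuming plain bag-of-words category vectors and cosine similarity. The abstract does not name the similarity measure, and this sketch omits the semantic-pattern weighting and semantic mapping steps entirely. All function names and sample questions are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def category_vectors(labeled_questions):
    """Sum the term counts of the accumulated questions in each category."""
    vectors = {}
    for category, text in labeled_questions:
        vectors.setdefault(category, Counter()).update(text.lower().split())
    return vectors


def categorize(question, vectors):
    """Assign the category whose vector is most similar to the question."""
    query = Counter(question.lower().split())
    return max(vectors, key=lambda category: cosine(query, vectors[category]))


# Toy accumulated questions with their predefined categories.
history = [
    ("travel", "cheap flights to paris"),
    ("travel", "best hotels in rome"),
    ("cooking", "how to bake bread"),
    ("cooking", "easy pasta sauce recipe"),
]
vectors = category_vectors(history)
predicted = categorize("how to bake pasta", vectors)
```

In this toy data the new question shares four terms with the cooking vector and only one with the travel vector, so it is assigned to cooking.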

19.
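This study details the process by which the structural characteristics of patent cooperation networks affect enterprise patent quality, in particular the underlying mechanism with knowledge search as a mediating variable, to which the existing literature pays little attention. It takes seven integrated-circuit "chokepoint" (卡脖子) technologies, such as large-size silicon wafers, as its research object. Grounded in the knowledge-based view, it uses degree centrality and structural holes to represent cooperation network structure, draws on 2012-2019 Chinese invention patent cooperation data for the seven technologies, constructs five moving time windows of four years each, builds an enterprise patent quality evaluation index system that takes journal impact factors into account, and conducts an empirical analysis using social network analysis, bibliometrics and regression analysis. The results show that in enterprise patent cooperation networks, both degree centrality and structural holes positively promote patent quality; knowledge search plays an important mediating role, and increases in its depth and breadth both positively promote enterprise patent quality; the structural characteristics of the patent cooperation network also have a positive effect on enterprise knowledge search. Chinese integrated-circuit enterprises should therefore cooperate actively with external organizations, including by forming business alliances or building open collaborative R&D platforms; broaden and deepen their knowledge search channels, build a sound knowledge acquisition system and strengthen knowledge reserves to improve patent quality; and increase R&D investment, expand R&D staff, bring in high-quality R&D talent and improve their capacity to absorb, integrate and transform knowledge.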
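
One of the two network measures named above, degree centrality, can be sketched in a few lines using its standard normalized definition: the number of distinct partners divided by the number of other nodes. Structural holes are not covered here. The organization names and the `degree_centrality` helper are hypothetical, and the study's own computation may differ.

```python
def degree_centrality(edges):
    """Normalized degree centrality for an undirected cooperation network.

    edges is a list of (organization, organization) co-patenting ties.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    others = len(neighbors) - 1
    if others <= 0:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / others for node, adjacent in neighbors.items()}


# Toy co-patenting ties among four organizations.
ties = [
    ("FirmA", "FirmB"),
    ("FirmA", "FirmC"),
    ("FirmA", "UnivD"),
    ("FirmB", "FirmC"),
]
centrality = degree_centrality(ties)
```

In this toy network FirmA cooperates with every other organization and so has the maximum centrality of 1.0, while UnivD, with a single partner, has the lowest.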

20.
This paper applies a bibliometric method to the nanowire case in order to reveal technological trends, present the relationships between patents, help researchers discover relatively significant patents, and analyse important relationships between patents so as to identify those with the most commercial potential and those that are critical technologies. The research focuses on nanowires because they are among the most mature nanostructures and one of the most heavily funded fields in nanotechnology. In terms of methodological approach, this study uses a different patent collection method from previous studies. This new method offers a new taxonomy that could make a significant impact on the accuracy of patent data searches and increase the reliability of patent analyses. As patent data are valuable sources for technology innovation and for forecasting technical change, this study utilises nanowire patent documents to pick out the technological trends and to identify nanowire technologies that both have the most commercial potential and are critical at the organisational, national and international levels.
