Similar literature
1.
2.
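To address the problem that traditional information retrieval models do not meet user needs well, this paper analyzes existing related research and proposes a knowledge retrieval model based on a domain ontology. A domain ontology is constructed, documents are semantically annotated, and concept extraction and semantic expansion are applied to query requests, yielding semantic index terms that serve as the knowledge representation of both documents and user requests. The paper further studies a computational model of the semantic relations between words in the domain ontology. Considering the intrinsic relationship between semantic similarity and semantic relatedness, a correlation coefficient is given to measure how well a candidate matches the retrieval target. Finally, the proposed model is validated, and the results show that retrieval performance improves significantly.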

3.
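Traditional Web text classification methods use the similarity of keywords in the text as the basis for classification, which loses much important semantic information and leads to inaccurate results and a heavy computational load. This article therefore proposes a Web text classification method based on semantic similarity: a domain ontology is used to map the keyword-based text feature vector to a set of matching semantic concept feature vectors, a formula for Web text similarity is defined, and a KNN algorithm based on semantic similarity is designed and implemented. Experimental results show that the method represents and processes Web text at the level of semantic concepts, reduces the dimensionality of the text feature space, reduces computation, and improves classification precision.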
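
The pipeline this abstract describes (keywords mapped to ontology concepts, then KNN over concept vectors) can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so cosine similarity is assumed here; the `CONCEPT_OF` table, the `TRAINING` set, and all function names are hypothetical stand-ins, not the authors' implementation.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight vectors (dicts)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def to_concepts(keywords, concept_of):
    """Map keywords to ontology concepts; keywords with no concept are dropped."""
    return dict(Counter(concept_of[k] for k in keywords if k in concept_of))

def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical keyword-to-concept table standing in for a domain ontology.
CONCEPT_OF = {
    "car": "Vehicle", "automobile": "Vehicle", "truck": "Vehicle",
    "engine": "Engine", "motor": "Engine",
    "striker": "Player", "goalkeeper": "Player",
    "match": "Game", "fixture": "Game",
}

# Hypothetical labeled documents, already reduced to concept vectors.
TRAINING = [
    (to_concepts(["car", "engine"], CONCEPT_OF), "autos"),
    (to_concepts(["truck", "motor", "motor"], CONCEPT_OF), "autos"),
    (to_concepts(["striker", "match"], CONCEPT_OF), "sports"),
    (to_concepts(["goalkeeper", "fixture"], CONCEPT_OF), "sports"),
]

query = to_concepts(["automobile", "motor"], CONCEPT_OF)
label = knn_classify(query, TRAINING)
```

Note that the query shares no keyword with any training document, yet it still matches at the concept level; several keywords also collapse into one concept, which is where the reduction in feature-space dimensionality comes from.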

4.
A huge volume of news stories is reported by various news channels on a daily basis. Subscribing to all the stories and keeping track of the important ones day after day is very time-consuming. This paper proposes several approaches to identify important news stories. To this end, we take advantage of the blogosphere as an information source to evaluate the importance of news stories. Blogs reflect the diverse opinions of bloggers about news stories, and the attention that these stories receive can help estimate the importance of the stories. In this paper, we define the popularity of a news story in the blogosphere as the attention it attracts from users. We measure the popularity of the stories in the blogosphere from two viewpoints: content and timeline. In terms of content, we suggest several approaches to estimate language models for a news story and blog posts, and we evaluate the importance of the story using these language models. Furthermore, we generate a temporal profile of a news story by exploring the timeline of blog posts related to the story, and evaluate its importance based on the temporal profile. We experimentally verify the effectiveness of the proposed approaches for identifying top news stories.

5.
Detection As Multi-Topic Tracking   Total citations: 1 (self-citations: 0, citations by others: 1)
The topic tracking task from TDT is a variant of information filtering tasks that focuses on event-based topics in streams of broadcast news. In this study, we compare tracking to another TDT task, detection, which has the goal of partitioning all arriving news into topics, regardless of whether the topics are of interest to anyone, and even when a new topic appears that had not been previously anticipated. There are clear relationships between the two tasks (under some assumptions, a perfect tracking system could solve the detection problem), but they are evaluated quite differently. We describe the two tasks and discuss their similarities. We show how viewing detection as a form of multi-topic parallel tracking can illuminate the performance tradeoffs of detection over tracking.

6.
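How to use the semantic information in ontology-annotated structured documents to organize a P2P network, and so provide P2P support for semantics-based information sharing and querying, is one of the current research hotspots in P2P networks. This paper proposes using the weighted ontology concept vector of the documents stored by a peer as that peer's feature vector, and clustering peers into peer groups by similarity computation, thereby constructing a semantics-based semi-structured P2P network. User queries are routed and forwarded by the group server within each peer group: the group server computes the similarity between the query and each routing table entry and forwards the query to the peer group most likely to contain the target. The paper describes in some detail how the local and global weights of ontology concepts are computed. Because both the construction of the P2P network topology and the query routing are based on semantic information, the network's performance improves considerably compared with keyword-based P2P networks.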

7.
Efficient information searching and retrieval methods are needed to navigate the ever-increasing volumes of digital information. Traditional lexical information retrieval methods can be inefficient and often return inaccurate results. To overcome problems such as polysemy and synonymy, concept-based retrieval methods have been developed. One such method is Latent Semantic Indexing (LSI), a vector-space model, which uses the singular value decomposition (SVD) of a term-by-document matrix to represent terms and documents in k-dimensional space. As with other vector-space models, LSI is an attempt to exploit the underlying semantic structure of word usage in documents. During the query matching phase of LSI, a user's query is first projected into the term-document space, and then compared to all terms and documents represented in the vector space. Using some similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query matching method requires that the similarity measure be computed between the query and every term and document in the vector space. In this paper, the kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching. The kd-tree data structure stores the term and document vectors in such a way that only those terms and documents that are most likely to qualify as nearest neighbors to the query will be examined and retrieved.
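
The pruning idea behind kd-tree query matching can be sketched as follows. This is only an illustration under stated assumptions: the vectors in `DOCS` are hypothetical, already-projected 2-dimensional document vectors (the SVD step is not shown), and Euclidean distance is assumed as the similarity measure, which ranks unit-length vectors in the same order as cosine similarity. The function names are not taken from the paper.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree from (vector, label) pairs; returns nested dicts or None."""
    if not points:
        return None
    axis = depth % len(points[0][0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                    # median point becomes the split node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, (vector, label)) for the stored point closest to query."""
    if node is None:
        return best
    vector, _ = node["point"]
    dist = math.dist(vector, query)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

# Hypothetical document vectors in a k = 2 latent space.
DOCS = [
    ((0.9, 0.1), "doc_a"),
    ((0.1, 0.8), "doc_b"),
    ((0.5, 0.5), "doc_c"),
    ((0.2, 0.1), "doc_d"),
    ((0.8, 0.9), "doc_e"),
]
TREE = build_kdtree(DOCS)
distance, (vector, label) = nearest(TREE, (0.85, 0.2))
```

The `abs(diff) < best[0]` test is what lets the search skip whole subtrees, so the query is compared against only a fraction of the stored vectors instead of all of them.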

8.
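Using CSSCI as the data source, the visualization software CiteSpace is used to draw knowledge maps of the representative authors and representative documents in ontology research, together with a hybrid knowledge map of institutions and subject terms, in order to analyze the research trends and hotspots of ontology in China's library and information science field. The study shows that the knowledge base of the field in China consists of a series of important documents. The main research directions are: construction of domain ontologies for digital libraries; application of ontologies in information retrieval and semantic retrieval; ontology-based knowledge organization and knowledge services; and methods for ontology integration and extraction.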

9.
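Research on Domain Ontology Evaluation at the Concept and Semantic Levels   Total citations: 1 (self-citations: 0, citations by others: 1)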
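Domain ontology evaluation is an important topic in ontology and Semantic Web research. This paper proposes a method that uses edit distance to compute the similarity between concepts in domain ontologies. In addition, by comparing a given domain ontology with a "gold standard" in terms of how the instances of concepts are arranged and how the concepts themselves are arranged hierarchically, the two can be judged for similarity from a semantic perspective. The paper compares an existing military aircraft domain ontology with the 《中国分类主题词表》 (Chinese Classified Thesaurus). Experimental results show that the method computes the similarity of the concept sets of two ontologies fairly accurately and also measures the semantic relations between ontologies well, thereby enabling effective evaluation of domain ontologies.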
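
As a rough illustration of edit-distance-based concept similarity, the sketch below computes the Levenshtein distance between two concept labels and normalizes it to a score in [0, 1]. The normalization by the longer label's length is an assumption made for this example; the abstract does not state which formula the authors use, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def label_similarity(a, b):
    """Assumed normalization: 1.0 for identical labels, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

score = label_similarity("fighter", "fighters")
```

Because the distance is computed character by character, the same functions work unchanged on Chinese concept labels.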

10.
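Automatic survey generation refers to multi-document automatic summarization on a specific topic, ultimately providing concise and important information. Automatic summarization of news topics is one application of multi-document summarization; it helps people quickly grasp the outline of a news event. This paper proposes a named-entity-based method for automatically summarizing news topics. The method first identifies and selects, from the collection of articles on a news topic, named entities that represent the elements of news, such as times, places, people, and organizations, and counts their frequencies after semantic processing. It then computes a combined weight for each sentence from the frequencies of the named entities it contains together with factors such as sentence position and length, and selects the summary sentences. Finally, the sentences are ordered by their timestamps and output as the final news topic summary. Experimental results show that the method is effective and of practical value.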

11.
Research on HowNet-Based Topic Tracking and Tendency Classification   Total citations: 11 (self-citations: 1, citations by others: 10)
金珠  林鸿飞  赵晶 《情报学报》2005,24(5):555-561
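This paper studies how to achieve effective topic tracking and topic stance classification based on information retrieval techniques and HowNet (“知网”). The topic tracking task is: given training news stories related to a topic, the system finds stories related to that topic among subsequent stories. It is a subtask of topic detection and tracking. Addressing the characteristics of topics in the tracking task, the paper discusses several strategies for improving tracking performance, including weight adjustment, event frames, and story expansion. Based on the sentiment system and dynamic role frames in HowNet, it also proposes how to fill the frames and, in combination with a constructed stance concept base, classify stories by their stance on the topic. Experiments show that these methods are effective.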

12.
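Application of the LSI Latent Semantic Indexing Method in Information Retrieval   Total citations: 9 (self-citations: 2, citations by others: 7)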
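This paper introduces an automatic document indexing and retrieval technique called "latent semantic indexing", which is based on a semantic structure of term dependence. Term frequency statistics and singular value decomposition are used to capture the semantic structure of documents and obtain vector representations of index terms, queries, and documents, so that the retrieval system can predict the relevance between documents and queries and thereby perform retrieval.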

13.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. This type of document is referred to as a multiple-record document. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car ads and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.
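
To make the role of logistic regression concrete, here is a minimal sketch that fits a relevance probability from three features per document: a term-frequency score, a density value, and a grouping value. It is not the authors' model. The training rows in `ROWS`, the labels, the learning rate, and the plain batch gradient descent fit are all assumptions chosen to keep the example self-contained.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Relevance probability for feature vector x; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y       # derivative of log-loss w.r.t. the score
            grad[0] += err
            for i, xi in enumerate(x, start=1):
                grad[i] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical training documents: (term-frequency score, density value, grouping value).
ROWS = [
    (0.9, 0.8, 0.7), (0.8, 0.9, 0.9), (0.7, 0.6, 0.8),   # relevant
    (0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2),   # not relevant
]
LABELS = [1, 1, 1, 0, 0, 0]   # discrete relevance judgments

WEIGHTS = fit_logistic(ROWS, LABELS)
p_relevant = predict(WEIGHTS, (0.85, 0.7, 0.8))
p_irrelevant = predict(WEIGHTS, (0.15, 0.2, 0.2))
```

Because the training labels are discrete (0 or 1), the logistic function keeps every prediction between 0 and 1, which a linear regression fit would not guarantee.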

14.
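Research on Multi-Document Automatic Summarization Based on Topic Concepts   Total citations: 4 (self-citations: 0, citations by others: 4)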
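This article describes the research and practice of comprehensive automatic summarization for large document collections. First, HOWNET is used to compute the cohesion of the topic concepts of documents; on this basis, features such as the relatedness between documents and each document's topic importance within the whole collection are processed. Second, the article explains the principle of generating multi-document summaries based on the collection's combined topic terms and combined priority. Experimental results show that, after comprehensive analysis of multi-document news collections, the summaries the system generates effectively reflect the important topic content.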

15.
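Drawing on knowledge bases such as 中国知网, the Chinese Library Classification (《中图法》), and the Chinese Classified Thesaurus (《中国分类主题词表》), this paper builds a domain word ontology knowledge base by conceptualizing domain words, establishing inference rules, and filtering out words that fall below a threshold. Then, according to the semantic and logical relations of the title to be classified, and combined with distance-based rules for computing semantic similarity, it forms an automatic title classification method that applies the domain word ontology. The method compensates to some extent for the scarcity of features in document titles and improves precision and recall.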

16.
高昂  程越  李进  朱虹  邱志平 《情报工程》2017,3(5):043-052
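Based on the hierarchical (linear) classification method, this paper constructs a classification system for breaking online news events and proposes category division principles and a method for defining classification codes for common online news event information. Taking earthquake news events as an example, it gives the construction process for an event ontology corpus and the method for building a news event ontology model, offering ideas and a reference for classifying online news event information and modeling ontology corpora.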

17.
The new Swedish law on legal deposit of electronic documents went into full effect on January 1, 2015. The sheer volume of documents in a wide variety of media types delivered by thousands of publishers (suppliers), such as government agencies, online news media, and publishing houses, poses an exceptional challenge for the National Library of Sweden (NLS). This requires a high level of automation in the data processing, from ingest to validation, transformation, enrichment, and storage, while at the same time attaining metadata of the best possible quality. To meet the challenges encountered, the NLS has developed new electronic systems and workflows that will be explained in this article. We will also touch on what we learned from our initial experiences with e-deposit and some of the issues that appear on the horizon.

18.
罗昊 《图书情报工作》2006,50(12):17-21
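Addressing the problem of language support for Web information retrieval, this paper summarizes ontology research results in China and abroad and introduces ontology into the study of retrieval languages. From the perspectives of principle, structure, and evolution, it carries out theoretical analysis and method design for this new type of retrieval language, the Web ontology language. It proposes two basic theoretical models of the composition of the Web ontology language, and studies and discusses the architecture and construction methods of ontologies and the principles and mechanisms of ontology evolution.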

19.
In this paper, we present a framework that can process a user query for retrieval of information from documents of different properties across multiple domains, with specific application to patent laws and regulations. The framework has three basic components. The first component is ontology mapping and generation: the keywords entered by users are mapped into a subset of relevant keywords by looking up those words in an ontology database. The second component is the joint and cross search in various document domains; in our case, they are patents and scientific publications. The last component is to modify the search results by applying user feedback statistics. The results of feedback will be saved as metadata for future use. A case example is given to demonstrate how results from multiple domain searches can be combined using ontology and cross referencing. We use an example of well-known biotechnology patents on erythropoietin (EPO) and give a detailed analysis of each document domain with this keyword. Relationships between the domains are demonstrated. A user feedback mechanism is also discussed in this paper. The ability to take user feedback into the framework is important: domain knowledge from expert or experienced users could be a very good complement to the proposed system. Both direct and indirect user feedback are discussed.

20.
曾文  徐红姣  李颖  王莉军  赵婧 《情报工程》2016,2(3):037-042
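Text similarity is mainly computed by using TF-IDF to model texts as a term-frequency vector space model (VSM). Taking into account the characteristics of scientific journal literature and patent literature, this paper improves the TF-IDF calculation by replacing word frequency statistics with frequency statistics of scientific and technical terms, and proposes a method for computing the similarity of scientific and technical documents. The method first preprocesses the documents with natural language processing techniques, then automatically extracts technical terms using an automatic term extraction method, and builds a vector space model with the term weighting formula proposed in the paper to compute the similarity between scientific journal articles and patent documents. Experiments on real scientific journal and literature data show that the proposed method outperforms the traditional TF-IDF calculation.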
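
For reference, the baseline that this abstract improves on can be sketched as plain TF-IDF weighting followed by cosine similarity in a vector space model. The sketch uses the standard TF-IDF formula, not the paper's modified term weighting, which the abstract does not give. The three token lists in `DOCS` are hypothetical and assumed to be technical terms already extracted from each document.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: tf * idf} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical pre-extracted technical terms: one journal article and two patents.
DOCS = [
    ["semantic", "retrieval", "ontology"],    # journal article
    ["semantic", "retrieval", "patent"],      # patent on a related topic
    ["battery", "electrode", "patent"],       # patent on an unrelated topic
]
article, related_patent, unrelated_patent = tfidf_vectors(DOCS)
sim_related = cosine(article, related_patent)
sim_unrelated = cosine(article, unrelated_patent)
```

Swapping the input from raw words to extracted technical terms, as the abstract proposes, changes only what goes into `DOCS`; the vector space and cosine comparison stay the same.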
