Similar literature
 Found 20 similar documents (search time: 78 ms)
1.
Content-only queries in hierarchically structured documents should retrieve the most specific document nodes that are exhaustive with respect to the information need. For this problem, we investigate two methods of augmentation, both of which yield high retrieval quality. We define retrieval effectiveness as the ratio of retrieval quality to response time; thus, fast approximations to the 'correct' retrieval result may yield higher effectiveness. We present a classification scheme for algorithms addressing this issue, and adapt known algorithms from standard document retrieval to XML retrieval. As a new strategy, we propose incremental-interruptible retrieval, which allows for instant presentation of the top-ranking documents. We develop a new algorithm implementing this strategy and evaluate the different methods on the INEX collection.

2.
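Building on rule-based information extraction, the authors design an extraction rule document, use XML technology to extract information from Taiwanese scientific and technical literature in PDF format, import the resulting structured data into a SQL Server database, and finally use ASP to build a convenient, intelligent information retrieval platform.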

3.
Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do not, however, consider the additional information and rich annotations provided by the structure of XML documents and their element names. This article presents the XXL search engine, which supports relevance ranking on XML data. XXL is particularly geared toward path queries with wildcards that can span multiple XML collections and contain both exact-match and semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.

4.
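Building on word segmentation, indexing, and structured query language techniques, this paper proposes an information retrieval system based on an XML document database. The system model consists mainly of a word segmentation module, an indexing module, and a query module.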

5.
This study introduces a novel framework for evaluating passage and XML retrieval. The framework focuses on a user's effort to localize relevant content in a result document. Measuring the effort is based on a system-guided reading order of documents. The effort is calculated as the quantity of text the user is expected to browse through. More specifically, this study seeks evaluation metrics for retrieval methods following a specific fetch-and-browse approach, where in the fetch phase documents are ranked in decreasing order of their document score, as in document retrieval. In the browse phase, for each retrieved document, a set of non-overlapping passages representing the relevant text within the document is retrieved. In other words, the passages of the document are reorganized so that the best-matching passages are read first in sequential order. We introduce an application scenario motivating the framework, and propose sample metrics based on the framework. These metrics give a basis for comparing the effectiveness of traditional document retrieval and passage/XML retrieval, and illuminate the benefit of passage/XML retrieval.

6.
In this paper we evaluate the application of data fusion or meta-search methods, combining different algorithms and XML elements, to content-oriented retrieval of XML structured data. The primary approach is the combination of a probabilistic method using logistic regression and the Okapi BM-25 algorithm for estimating document relevance or XML element relevance, in conjunction with Boolean approaches for some query elements. In the evaluation we use the INEX XML test collection to examine the relative performance of individual algorithms and elements and compare these to the performance of the data fusion approaches.
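As a rough illustration of what score-level data fusion can look like, the sketch below merges two ranked runs with min-max normalization followed by CombSUM. This is an assumption made for illustration: the abstract does not say which fusion operator the authors use, and the function names and the two sample runs (`bm25_run`, `logreg_run`) are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one run's scores to [0, 1] so that runs become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so the run carries no ranking information.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(*runs):
    """CombSUM: add the normalized scores each document receives across runs."""
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for the same query from two different rankers.
bm25_run = {"a": 10.0, "b": 5.0, "c": 0.0}
logreg_run = {"a": 0.2, "b": 0.9, "d": 0.1}
ranking = comb_sum(bm25_run, logreg_run)
```

A document missing from one run simply contributes nothing from that run, which favours elements retrieved by several algorithms.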

7.
Evaluating the effectiveness of content-oriented XML retrieval methods   Total citations: 1, self-citations: 0, citations by others: 1
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: instead of retrieving whole documents, they should retrieve document components that are exhaustive to the information need while at the same time being as specific as possible. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai

8.
This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.

9.
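Research on methods for computing XML document similarity   Total citations: 1, self-citations: 0, citations by others: 1
XML (Extensible Markup Language) is becoming the standard for exchanging information among applications on the Web. With the large-scale emergence of semi-structured data in XML format, how to process and manage XML documents has become a research hotspot. Computing the similarity of XML documents is an important topic in XML data processing and a key technique for XML document clustering and retrieval. An XML document consists of a logical structure and textual content, so the similarity between XML documents can be measured by structural features or by content features. This article divides XML document similarity methods into two classes, structure-based methods and methods that combine structure and content, and compares and reviews the existing methods.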
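To make the idea of a structure-based measure concrete, here is a minimal sketch that compares two XML documents by the Jaccard overlap of their root-to-node tag paths. This particular measure is chosen only for illustration and is not claimed to be one of the methods the survey reviews; `tag_paths` and `structural_similarity` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Collect the set of root-to-node tag paths, ignoring text content."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def structural_similarity(xml_a, xml_b):
    """Jaccard coefficient of the two documents' tag-path sets (0.0 to 1.0)."""
    paths_a, paths_b = tag_paths(xml_a), tag_paths(xml_b)
    # The union is never empty because every document has a root element.
    return len(paths_a & paths_b) / len(paths_a | paths_b)


doc_a = "<book><title/><author/></book>"
doc_b = "<book><title/><year/></book>"
similarity = structural_similarity(doc_a, doc_b)
```

Because only tag paths are compared, two documents with identical markup and entirely different text would score 1.0; a combined structure-and-content measure would also have to weigh the text.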

10.
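A study of XML information retrieval   Total citations: 4, self-citations: 0, citations by others: 4
廖述梅, 万常选, 徐升华. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2007, 381(2): 229-234
XML documents are semi-structured data with a hierarchical structure and textual content. Existing Web information retrieval is keyword-based full-text retrieval over HTML documents and cannot support retrieval at the granularity of XML elements; meanwhile, XML database retrieval performs exact matching and offers no ranking of results. Combining information retrieval and database techniques to study XML retrieval has therefore become inevitable. Starting from the problem domain of XML retrieval, this article describes the state and characteristics of XML information retrieval (XML IR) research in China and abroad, and analyzes its current hot topics and difficulties.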

11.
XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full-blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate element length normalization. Even after restricting the minimal size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
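The interaction between smoothing and an explicit length bias can be sketched as follows. The scoring function is an assumption for illustration (query likelihood with linear interpolation smoothing plus a length prior proportional to the element length raised to `beta`); the abstract does not give the authors' exact formula, and the parameter names `lam` and `beta` and the sample token lists are hypothetical.

```python
import math
from collections import Counter


def lm_score(query, element, collection_tf, collection_len, lam=0.15, beta=1.0):
    """Log query likelihood of an element with an explicit length prior.

    lam  -- weight of the element model in the interpolation (assumed value)
    beta -- exponent of the length prior; 0 disables it, larger values
            favour longer elements more strongly
    """
    length = len(element)
    if length == 0:
        return float("-inf")
    tf = Counter(element)
    score = beta * math.log(length)  # log of the prior, length ** beta
    for term in query:
        p_element = tf[term] / length
        p_collection = collection_tf.get(term, 0) / collection_len
        p = lam * p_element + (1 - lam) * p_collection
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


# Two hypothetical elements: a one-word emphasis tag and a longer paragraph.
short_el = ["xml"]
long_el = ["xml", "retrieval", "of", "structured", "documents", "with", "xml", "markup"]
collection = short_el + long_el
ctf, clen = Counter(collection), len(collection)

no_prior = [lm_score(["xml"], e, ctf, clen, beta=0.0) for e in (short_el, long_el)]
with_prior = [lm_score(["xml"], e, ctf, clen, beta=1.0) for e in (short_el, long_el)]
```

Without the prior, the one-word element wins because the query term makes up its entire text; with the prior, the longer element ranks first, which is the kind of effect the abstract attributes to an explicit length bias.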

12.
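XML-based distributed information retrieval   Total citations: 1, self-citations: 0, citations by others: 1
Proposes a method for distributed retrieval of Internet information: using agent programs and XML technology, retrieval request documents are sent simultaneously to multiple websites of the same type, and the result documents they return are received; after unified processing, the results are displayed to the reader.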

13.
TIJAH: Embracing IR Methods in XML Databases   Total citations: 1, self-citations: 0, citations by others: 1
This paper discusses our participation in INEX (the Initiative for the Evaluation of XML Retrieval) using the TIJAH XML-IR system. TIJAH's system design follows a standard layered database architecture, carefully separating the conceptual, logical and physical levels. At the conceptual level, we classify the INEX XPath-based query expressions into three different query patterns. For each pattern, we present its mapping into a query execution strategy. The logical layer exploits score region algebra (SRA) as the basis for query processing. We discuss the region operators used to select and manipulate XML document components. The logical algebra expressions are mapped into efficient relational algebra expressions over a physical representation of the XML document collection using the pre-post numbering scheme. The paper concludes with an analysis of experiments performed with the INEX test collection.
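The pre-post numbering scheme mentioned above can be illustrated with a short sketch: each element receives its preorder and postorder rank, and containment then reduces to two integer comparisons, which is what makes it easy to express in relational algebra. This is a generic illustration of the numbering idea, not TIJAH's actual physical layout; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def pre_post_numbering(root):
    """Assign (pre, post) ranks to every element in one depth-first walk."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # The postorder rank is assigned once all descendants are numbered.
        ranks[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return ranks


def is_descendant(ranks, node, ancestor):
    """node lies inside ancestor iff it starts later and finishes earlier."""
    pre_n, post_n = ranks[node]
    pre_a, post_a = ranks[ancestor]
    return pre_a < pre_n and post_n < post_a


doc = ET.fromstring("<article><sec><p>a</p><p>b</p></sec><bib/></article>")
ranks = pre_post_numbering(doc)
sec, bib = doc[0], doc[1]
first_p = sec[0]
```

Stored as two integer columns per element, the same test becomes a range predicate in a relational join.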

14.
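This article systematically surveys the standards literature retrieval platforms of major international bodies such as ISO, IEC, ITU, and CEN, and analyzes, evaluates, and compares them in four respects (coverage, search methods, cataloguing methods, and search results) to help users choose among them. Finally, it addresses the shortcomings of China's standards retrieval platforms, namely an emphasis on quality over quantity, few search fields, low precision and recall, incomplete coverage, and slow updates, and offers recommendations: improve cataloguing quality, add search fields, provide multilingual search and multiple ways of sorting results, improve timeliness and continuity, and integrate standards literature.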

15.
In this paper the problem of indexing heterogeneous structured documents and of retrieving semi-structured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries that specify soft constraints on both document structure and content. At the indexing level we propose a model that achieves flexibility by constructing personalised document representations based on users' views of the documents. This is obtained by allowing users to specify their preferences on the document sections that they estimate to bear the most interesting information, as well as to linguistically quantify the number of sections which determine the global potential interest of the documents. At the query language level, a flexible query language for expressing soft selection conditions on both document structure and content is proposed.

16.
In this paper, we propose a new term dependence model for information retrieval, which is based on a theoretical framework using Markov random fields. We assume two types of dependencies between the terms given in a query: (i) long-range dependencies that may appear, for instance, within a passage or a sentence in a target document, and (ii) short-range dependencies that may appear, for instance, within a compound word in a target document. Based on this assumption, our two-stage term dependence model captures long-range and short-range term dependencies differently when more than one compound word appears in a query. We also investigate how query structuring with term dependence can improve the performance of query expansion using a relevance model. The relevance model is constructed using the retrieval results of the structured query with term dependence to expand the query. We show that our term dependence model works well, particularly when using query structuring with compound words, through experiments using a 100-gigabyte test collection of web documents mostly written in Japanese. We also show that the performance of the relevance model can be significantly improved by using the structured query with our term dependence model.
Koji Eguchi

17.
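梁柱, 沈思, 叶文豪, 王东波. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2022, 41(2): 167-175
Existing court judgment retrieval systems are of limited use to users outside the legal profession. Intelligent retrieval research in the legal domain has so far addressed only the recommendation and classification of legal provisions based on judgment documents, and lacks work on automatically recommending the judgment documents themselves. This paper therefore proposes a method for intelligently recommending judgment documents using news-like factual text. Drawing on current research, it summarizes the structural and content features of judgment documents, uses news-like factual text to simulate the search queries of users without legal training, builds a corpus of judgment documents containing structural and content features, and automatically recommends relevant judgment documents. The results show that, when the structural and content features of the court-opinion section of judgment documents are used and the news corpus is represented by feature words, the LambdaMART model performs well on text matching and outperforms traditional full-text retrieval.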

18.
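By summarizing recent developments in computer science, artificial intelligence, and patent document processing, this article introduces the latest advances in computer-based patent document retrieval from five perspectives: multilingual mixed retrieval, classification-based retrieval, semantic retrieval, image retrieval, and supporting technologies. Improvements in machine translation and in multilateral common classification systems help raise retrieval efficiency and remove language barriers, while advances in semantic retrieval, image retrieval, and automatic document processing are expected to make possible intelligent computer retrieval systems that serve users at different levels.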

19.
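Research on XML search engines   Total citations: 1, self-citations: 0, citations by others: 1
The article first analyzes why the precision of traditional search engines is low, then introduces XML and the current state of XML search engine research, and discusses in detail the key technologies involved in XML search engines, such as document storage, indexing, and querying. On this basis, it designs an XML search engine model for the current network environment. The authors argue that the model can make full use of the DTD schema information of XML documents and substantially improve query precision.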

20.
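XML retrieval systems and a comparative study*   Total citations: 2, self-citations: 0, citations by others: 2
Discusses the differences between XML retrieval and traditional information retrieval, the goals and tasks of XML retrieval, and the core research problems of XML retrieval systems, and introduces and compares several existing XML retrieval systems.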
