20 similar documents found (search time: 765 ms)
1.
Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do not, however, consider the additional information and rich annotations provided by the structure of XML documents and their element names. This article presents the XXL search engine that supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match and semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve the search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.
2.
This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that, for a content-only topic, utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that, when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics), the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.
3.
XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full blown articles, is a retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length-bias introduced by the amount of smoothing, and show the importance of extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate element length normalization. Even after restricting the minimal size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
4.
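A Full-Text Retrieval Engine Based on Native XML   Total citations: 5, self-citations: 0, citations by others: 5
As XML grows in popularity, demand for XML-based full-text retrieval applications is also expanding rapidly. For these applications, native XML databases are the direction of development. Although commercial native XML databases have appeared, their full-text retrieval performance is still unsatisfactory. This paper proposes a method that, within the framework of a traditional inverted index, also indexes the XML markup, so that a full-text database can store, index, retrieve, and output XML documents natively. The result is a true native XML full-text database that retains the strong performance of traditional full-text databases while meeting the needs of native-XML applications.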
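The idea of indexing XML markup inside an ordinary inverted index can be illustrated with a minimal Python sketch. This is not the engine described in the abstract: the function names (`build_index`, `search`, `elements_with_tag`), the `<tag>` posting convention, and the path-based postings are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: key -> set of (doc_id, element_path) postings."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        _walk(root, "/" + root.tag, doc_id, index)
    return index


def _walk(elem, path, doc_id, index):
    # Index the tag itself (hypothetical "<tag>" key) so markup is searchable.
    index["<" + elem.tag + ">"].add((doc_id, path))
    # Index the text directly inside this element under its path.
    for token in (elem.text or "").lower().split():
        index[token].add((doc_id, path))
    for child in elem:
        _walk(child, path + "/" + child.tag, doc_id, index)
        # Text after a child's closing tag belongs to the parent element.
        for token in (child.tail or "").lower().split():
            index[token].add((doc_id, path))


def search(index, term, tag=None):
    """Return postings for a term, optionally restricted to one element tag."""
    postings = index.get(term.lower(), set())
    if tag is not None:
        postings = {p for p in postings if p[1].split("/")[-1] == tag}
    return sorted(postings)


def elements_with_tag(index, tag):
    """Return every (doc_id, path) whose element has the given tag."""
    return sorted(index.get("<" + tag + ">", set()))


docs = {
    "d1": "<article><title>XML retrieval</title>"
          "<body>inverted index for XML</body></article>",
    "d2": "<article><title>Image search</title>"
          "<body>XML description</body></article>",
}
index = build_index(docs)
title_hits = search(index, "xml", tag="title")
```

Because each posting carries the element path, the same index answers both plain keyword queries and queries restricted to a particular element, which is the property the abstract attributes to indexing the markup.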
5.
Content-only queries in hierarchically structured documents should retrieve the most specific document nodes which are exhaustive
to the information need. For this problem, we investigate two methods of augmentation, which both yield high retrieval quality.
As retrieval effectiveness, we consider the ratio of retrieval quality and response time; thus, fast approximations to the
'correct' retrieval result may yield higher effectiveness. We present a classification scheme for algorithms addressing this
issue, and adopt known algorithms from standard document retrieval for XML retrieval. As a new strategy, we propose incremental-interruptible retrieval, which allows for instant presentation of the top ranking documents. We develop a new algorithm implementing this strategy
and evaluate the different methods with the INEX collection.
6.
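Unlike traditional information retrieval, XML retrieval must operate at the element level, and its core task is building an element-level retrieval model. The main problems in building such a model are the relevance of contextual elements within an XML document, the duplication of information between elements, and the wide variation in element size. The proposed solution is to build an element-level XML retrieval model based on BM25, build a context-based element-level XML retrieval model (BM25E), filter duplicate elements, select the retrievable elements, and handle elements that are too small. 1 table. 1 figure. 19 references.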
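As a rough illustration of element-level scoring, the Python sketch below applies the standard Okapi BM25 formula with each XML element treated as its own retrieval unit. It is not the BM25E model from the abstract: the context-based component and duplicate filtering are omitted, the `min_len` cut-off is only a stand-in for the handling of too-small elements, and the IDF uses the common non-negative variant ln((N - n + 0.5) / (n + 0.5) + 1). The defaults `k1=1.2` and `b=0.75` are conventional values, not taken from the source.

```python
import math
from collections import Counter


def bm25_scores(query, elements, k1=1.2, b=0.75, min_len=1):
    """Score each element (a list of tokens) against the query with plain BM25."""
    # Drop elements shorter than min_len tokens (assumed "too small" rule).
    pool = {eid: toks for eid, toks in elements.items() if len(toks) >= min_len}
    if not pool:
        return {}
    n = len(pool)
    avg_len = sum(len(toks) for toks in pool.values()) / n
    # Document frequency: number of elements containing each term.
    df = Counter()
    for toks in pool.values():
        df.update(set(toks))
    scores = {}
    for eid, toks in pool.items():
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: longer elements need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[eid] = score
    return scores


elements = {
    "sec1": ["xml", "retrieval", "model"],
    "sec2": ["image", "index"],
    "p1": ["xml"],
}
all_scores = bm25_scores(["xml"], elements)
filtered_scores = bm25_scores(["xml"], elements, min_len=2)
```

With no cut-off, the one-token element `p1` outranks the longer `sec1` for the query `xml`, which shows why very small elements need separate handling in element-level retrieval.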
7.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have been tried to solve
this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments.
Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it
is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach
is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study,
we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations
and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form
query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation
shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on
TREC retrieval.
8.
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: instead of retrieving whole documents, document components that are exhaustive to the information need while at the same time being as specific as possible should be retrieved. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric with the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella Kazai
9.
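There are many ways to extract, store, convert, and display the feature values of image objects; the SIMIIRS system mainly uses a database approach and an XML approach. This article discusses XML description methods for image resources, building XML index documents for image information, and searching those XML documents to query and deliver image information.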
10.
Over the last three decades, research in Information Retrieval (IR) shows performance improvement when many sources of evidence
are combined to produce a ranking of documents. Most current approaches assess document relevance by computing a single score
which aggregates values of some attributes or criteria. They use analytic aggregation operators which either lead to a loss
of valuable information, e.g., the min or lexicographic operators, or allow very bad scores on some criteria to be compensated
with good ones, e.g., the weighted sum operator. Moreover, all these approaches do not handle imprecision of criterion scores.
In this paper, we propose a multiple criteria framework using a new aggregation mechanism based on decision rules identifying
positive and negative reasons for judging whether a document should get a better ranking than another. The resulting procedure
also handles imprecision in criteria design. Experimental results are reported showing that the suggested method performs
better than standard aggregation operators.
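The two weaknesses of analytic aggregation operators named above can be seen in a small Python sketch. The criterion scores, the equal weights, and the helper names are made-up illustrations; the decision-rule mechanism proposed in the paper is not reproduced here.

```python
def weighted_sum(scores, weights):
    """Aggregate criterion scores as a weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))


def min_operator(scores):
    """Aggregate criterion scores by keeping only the worst one."""
    return min(scores)


# Hypothetical scores on three criteria, each in [0, 1].
doc_a = [0.9, 0.9, 0.1]  # very bad on the third criterion
doc_b = [0.5, 0.5, 0.5]  # mediocre everywhere
doc_c = [0.5, 0.9, 0.9]  # at least as good as doc_b on every criterion
weights = [1 / 3, 1 / 3, 1 / 3]

# Compensation: the weighted sum ranks doc_a above doc_b
# even though doc_a nearly fails one criterion.
compensated = weighted_sum(doc_a, weights) > weighted_sum(doc_b, weights)

# Information loss: min cannot separate doc_b from doc_c,
# although doc_c dominates doc_b.
tied = min_operator(doc_b) == min_operator(doc_c)
```

Both `compensated` and `tied` evaluate to `True` for these inputs, which is the behaviour the abstract criticises in the weighted sum and min operators respectively.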
11.
Xiangji Huang Fuchun Peng Dale Schuurmans Nick Cercone Stephen E. Robertson 《Information Retrieval》2003,6(3-4):333-362
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text. 
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.
12.
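Relevance feedback has been an active research topic in information retrieval in recent years and is an important form of automatic query expansion. It mainly comprises search-term weighting and search-term selection. This paper introduces the classic term-ranking algorithms used in relevance feedback, compares the performance improvements they bring, and identifies several problems that still need to be solved when relevance feedback is applied in practice.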
13.
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments.
In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match
retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the
tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in
English, German, and Swedish) were automatically translated into the target language (Finnish), using structured target queries.
The effectiveness of the translated queries was compared to that of the monolingual queries. Thirdly, pseudo-relevance feedback
was used to expand the original target queries. CLIR performance was evaluated using three relevance thresholds: stringent,
regular, and liberal. When the regular or liberal threshold was used, a reasonable performance was achieved. Using the stringent threshold,
equally high performance could not be achieved. On all the relevance thresholds the performance of the translated queries
was successfully raised by pseudo-relevance feedback based query expansion. However, the performance of the stringent threshold
in relation to the other thresholds could not be raised by this method.
14.
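Research on Passage Retrieval and Related Algorithms   Total citations: 2, self-citations: 0, citations by others: 2
Summarizes passage retrieval together with the passage segmentation and related algorithms it involves, discusses the difference between text segmentation and passage extraction, introduces and compares several common passage segmentation methods and several classes of passage retrieval algorithms, and on that basis looks ahead to research directions for passage retrieval.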
15.
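XML Retrieval Systems and a Comparative Study*   Total citations: 2, self-citations: 0, citations by others: 2
Discusses the differences between XML retrieval and traditional information retrieval, the goals and tasks of XML retrieval, and the core problems in research on XML retrieval systems, and introduces and compares several existing XML retrieval systems.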
16.
Knowledge transfer for cross domain learning to rank   Total citations: 1, self-citations: 1, citations by others: 0
Depin Chen Yan Xiong Jun Yan Gui-Rong Xue Gang Wang Zheng Chen 《Information Retrieval》2010,13(3):236-253
Recently, learning to rank technology is attracting increasing attention from both academia and industry in the areas of machine
learning and information retrieval. A number of algorithms have been proposed to rank documents according to the user-given
query using a human-labeled training dataset. A basic assumption behind general learning to rank algorithms is that the training
and test data are drawn from the same data distribution. However, this assumption does not always hold true in real world
applications. For example, it can be violated when the labeled training data become outdated or originally come from another
domain different from its counterpart of test data. Such situations bring a new problem, which we define as cross domain learning
to rank. In this paper, we aim at improving the learning of a ranking model in target domain by leveraging knowledge from
the outdated or out-of-domain data (both are referred to as source domain data). We first give a formal definition of the
cross domain learning to rank problem. Following this, two novel methods are proposed to conduct knowledge transfer at feature
level and instance level, respectively. These two methods both utilize Ranking SVM as the basic learner. In the experiments,
we evaluate these two methods using data from benchmark datasets for document retrieval. The results show that the feature-level
transfer method performs better with steady improvements over baseline approaches across different datasets, while the instance-level
transfer method comes out with varying performance depending on the dataset used.
17.
Information Retrieval with a Hybrid Automatic Query Expansion and Data Fusion Procedure   Total citations: 1, self-citations: 0, citations by others: 1
We propose a hybrid information retrieval (IR) procedure that builds on two well-known IR approaches: data fusion and query expansion via relevance feedback. This IR procedure is designed to exploit the strengths of data fusion and relevance feedback and to avoid some weaknesses of these approaches. We show that our IR procedure is built on postulates that can be justified analytically and empirically. Additionally, we offer an empirical investigation of the procedure, showing that it is superior to relevance feedback on some dimensions and comparable on other dimensions. The empirical investigation also verifies the conditions under which the use of our IR procedure could be beneficial.
18.
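Research on Implementing XML Indexing on a Traditional Text Retrieval System   Total citations: 3, self-citations: 0, citations by others: 3
As an important standard for information exchange and storage, XML is receiving growing attention from researchers. As a key part of XML retrieval research, work on XML indexing mechanisms and their implementation has already produced some results. However, most of this work builds on databases or dedicated semi-structured data managers. This paper proposes a method for building an XML index on top of Okapi, a traditional text retrieval system. It first introduces Okapi's index structure, then examines in depth the storage structure and implementation of the XML index, and finally evaluates the index's performance.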
19.