Similar Literature
Found 20 similar articles; search took 31 ms
1.
Query languages for XML such as XPath or XQuery support Boolean retrieval: a query result is a (possibly restructured) subset of XML elements or entire documents that satisfy the search conditions of the query. This search paradigm works for highly schematic XML data collections such as electronic catalogs. However, for searching information in open environments such as the Web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a ranked list of XML elements in descending order of (estimated) relevance. Web search engines, which are based on the ranked retrieval paradigm, do not, however, consider the additional information and rich annotations provided by the structure of XML documents and their element names. This article presents the XXL search engine that supports relevance ranking on XML data. XXL is particularly geared for path queries with wildcards that can span multiple XML collections and contain both exact-match and semantic-similarity search conditions. In addition, ontological information and suitable index structures are used to improve the search efficiency and effectiveness. XXL is fully implemented as a suite of Java classes and servlets. Experiments in the context of the INEX benchmark demonstrate the efficiency of the XXL search engine and underline its effectiveness for ranked retrieval.

2.
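A Discussion of a New Type of Search Engine   Total citations: 1 (self-citations: 0; citations by others: 1)
Luo Sanding (罗三定), Liao Chengfeng (廖程锋). 《情报学报》, 2004, 23(4): 428-432
Traditional search engines cannot understand document content, so their precision is generally low. This paper proposes a new search engine based on RDF and information extraction. The engine uses information extraction to automatically obtain and generate metadata for network resources; the metadata is described in RDF and is transmitted and exchanged over the Internet, while intelligent agents collect and process it and provide retrieval services to users. Because computers can interpret the meaning of the metadata carried by RDF, content-based concept retrieval becomes possible. After analyzing the relevant technical background, the paper presents a structural diagram of the search model, explains the system's principles and advantages, and gives the design of some of its modules.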

3.
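Optimization of the Robot Search Algorithm in Search Engines   Total citations: 15 (self-citations: 0; citations by others: 15)
Current search engines increasingly show their shortcomings. When a user enters specific keywords, the results returned often number in the thousands or even millions and contain a large amount of duplicate and junk information, so users still need considerable time to pick out the pages that interest them. In other cases, important pages that clearly exist on the Web are never discovered by the search engine's robot. To address this, the paper focuses on the search strategy used in search engines and improves the search algorithm so that, during the crawl itself, the robot can fully process the URL list it interacts with frequently. A PageRank value is computed for each page from its content, its HTML structure, and the hyperlink information it contains, so that the URL list can be reordered by importance. Preliminary experimental results indicate that the optimized algorithm can considerably improve the overall performance of the search engine.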
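The idea of reordering a crawler's URL list by importance can be illustrated with a minimal sketch. It assumes a plain link-based PageRank computed by power iteration; the abstract's variant also uses page content and HTML structure, which are not modeled here. The names `pagerank`, `order_frontier`, and the toy link graph are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Link-only PageRank by power iteration (illustrative, not the paper's variant)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def order_frontier(frontier: List[str], rank: Dict[str, float]) -> List[str]:
    """Sort the crawler's pending URL list so higher-ranked URLs come first."""
    return sorted(frontier, key=lambda url: rank.get(url, 0.0), reverse=True)


# Hypothetical three-page link graph: a -> c, b -> c, c -> a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_rank = pagerank(toy_links)
ordered = order_frontier(["b", "a", "c"], toy_rank)
```

In this toy graph, page `c` is linked from both other pages and so moves to the front of the list, while `b`, which has no incoming links, is visited last.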

4.
The infrastructure of a typical search engine can be used to calculate and resolve persistent document identifiers: strings that can each uniquely identify and locate a document on the Internet without reference to its original location (URL). Bookmarking a document using such an identifier allows its retrieval even if the document's URL, and, in many cases, its contents change. Web client applications can offer facilities for users to bookmark a page by reference to a search engine and the persistent identifier instead of the original URL. The identifiers are calculated using a global Internet term index; a document's unique identifier consists of a word or word combination that occurs uniquely in the specific document. We use a genetic algorithm to locate a minimal unique document identifier: the shortest word or word combination that will locate the document. We tested our approach by implementing tools for indexing a document collection, calculating the persistent identifiers, performing queries, and distributing the computation and storage load among many computers.
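A minimal sketch of what a "minimal unique document identifier" means, assuming a tiny in-memory term index. Note that it uses exhaustive search over word combinations, not the genetic algorithm the abstract describes, and that "shortest" is taken here to mean fewest words. The function names and the sample collection are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Set, Tuple


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def minimal_unique_identifier(doc_id: str, docs: Dict[str, str],
                              index: Dict[str, Set[str]],
                              max_words: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the smallest word combination that matches only doc_id.

    Brute force stands in for the genetic algorithm used in the paper.
    """
    words = sorted(set(docs[doc_id].lower().split()))
    for size in range(1, max_words + 1):
        for combo in combinations(words, size):
            # Documents containing every word of the combination.
            matching = set.intersection(*(index[w] for w in combo))
            if matching == {doc_id}:
                return combo
    return None  # no unique combination within max_words


# Hypothetical three-document collection.
sample_docs = {
    "d1": "ranked retrieval of xml",
    "d2": "ranked retrieval of web pages",
    "d3": "xml web catalogs",
}
sample_index = build_index(sample_docs)
id_d1 = minimal_unique_identifier("d1", sample_docs, sample_index)
id_d3 = minimal_unique_identifier("d3", sample_docs, sample_index)
```

Here no single word of `d1` is unique to it, so a two-word combination is needed, whereas `d3` is identified by one word. Exhaustive search grows quickly with vocabulary size, which is a plausible motivation for a heuristic such as a genetic algorithm on a global index.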

5.
6.
7.
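This article systematically surveys the standards literature retrieval platforms of major international bodies such as ISO, IEC, ITU, and CEN, and analyzes, evaluates, and compares them in four respects (coverage, search methods, cataloging methods, and search results) to help users choose among them. Finally, it addresses the shortcomings of China's standards retrieval platforms, namely prioritizing quality over quantity, few search fields, low precision and recall, incomplete coverage, and slow updates, and offers recommendations: improve cataloging quality, add search fields, provide multilingual search and multiple ways of sorting results, improve timeliness and continuity, and integrate standards literature.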

9.
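With the rise and growth of social networks, a large amount of product-related social information has appeared on the Internet. How to combine this social information with product metadata for retrieval and recommendation is an active research question in information retrieval. Taking social book search as an example, this paper proposes a general retrieval method for the problem. First, by analyzing the original book dataset together with social information such as user tags, user ratings, and popularity, it extracts different social features from the books and builds a feature matrix. Next, it computes the similarity between books on each social feature and applies different strategies to re-rank the results returned by the search engine. Finally, it fuses the re-ranked results using learning to rank to obtain the final book search results. In the experiments, the method was trained and tested on the INEX Social Book Search 2015 and 2016 datasets, respectively. The results show that, compared with existing techniques, the method effectively improves book retrieval.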

10.
Information Retrieval from Documents: A Survey   Total citations: 4 (self-citations: 0; citations by others: 4)
Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods. Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.

11.
In information retrieval research, models and systems traditionally assume that a single person is querying and reviewing the results. However, several empirical studies of professional practice identified collaboration during IR as an everyday work pattern, used to solve a shared information need and to benefit from the diverse expertise and experience of the team members. Moreover, most IR systems that are employed in professional work routines are designed for individual use, and prototype collaborative systems are too limited to support use in today's work practice. To bridge this gap, this paper develops and formalizes a decision theoretic approach towards supporting a team of people that explicitly set out together to resolve a shared information need. We develop a formal cost model for collaborative IR that considers the trade-off between estimated relevance of a document and estimated document redundancy. From this cost model, we use a decision theoretic approach to derive the notion of activity suggestions, that is, a formal optimum criterion that describes optimum collaboration strategies in IR as the solution of an integer linear program. Those collaboration strategies are suggested to team members with the aim of facilitating the collaborative performance of information retrieval tasks. We demonstrate the application of our model by means of search result division in two collaborative search tasks. In the conducted experiments, we study the effects of different domain knowledge and resulting relevance assessments of team members in four different conditions. The gathered results indicate that our approach can improve the retrieval effectiveness of teams in recall-oriented tasks.

12.
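Retrieving information in a digital library is tedious work, and because the system cannot recognize a user's individual search preferences, the results are often unsatisfactory. In our digital library information retrieval system, we focus on query personalization. In particular, we handle structured retrieval so that the stored metadata is incorporated into the relevant databases, and, by analyzing the user preferences recorded in user profiles, we describe the role of query rewriting rules in building personalized retrieval.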

13.
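By searching for Olympic sports terms in commonly used Chinese search engines and comparing the results, this article analyzes how well these engines handle queries for specialized sports terminology. The analysis offers a reference for in-depth specialized information queries and aims to promote the development of specialized literature and information services in Chinese search engines.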

14.
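Cross-Boundary Cooperation between Library Information Services and Search Engines   Total citations: 13 (self-citations: 0; citations by others: 13)
Starting from network-centered information services, this article analyzes the technical framework for cross-boundary cooperation between library information services and search engines, and examines information services in terms of three aspects: information resources, librarians, and the time and place of service. Using Google Maps as an example, it summarizes concrete methods and workflows for such cooperation, including data cooperation and system cooperation, as well as the difficulties concerning intellectual property and application models. It concludes with the implications of cross-boundary cooperation for library information services in information integration and personalized service.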

15.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISP), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks like discovering semantically relevant query suggestions and Web page categorization by topic.
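The data structure described above (users, queries, and documents connected by a network of relations) might be sketched as follows. This is only an illustration under stated assumptions: the class `SearchLogGraph`, its method names, the session format, and the frequency-based suggestion rule are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


class SearchLogGraph:
    """Toy graph over users, queries, and documents built from search sessions."""

    def __init__(self) -> None:
        self.user_queries = defaultdict(Counter)     # user  -> queries issued
        self.user_documents = defaultdict(Counter)   # user  -> documents selected
        self.refinements = defaultdict(Counter)      # query -> follow-up queries
        self.query_documents = defaultdict(Counter)  # query -> documents clicked

    def add_session(self, user: str,
                    events: Iterable[Tuple[str, List[str]]]) -> None:
        """Record one session: (query, clicked documents) pairs in issue order."""
        previous = None
        for query, clicked in events:
            self.user_queries[user][query] += 1
            if previous is not None and previous != query:
                # A changed query within a session counts as a refinement edge.
                self.refinements[previous][query] += 1
            for doc in clicked:
                self.user_documents[user][doc] += 1
                self.query_documents[query][doc] += 1
            previous = query

    def suggest(self, query: str, k: int = 3) -> List[str]:
        """Suggest the k most frequent refinements observed after `query`."""
        followers = self.refinements.get(query, Counter())
        return [q for q, _ in followers.most_common(k)]


# Hypothetical log with three users.
graph = SearchLogGraph()
graph.add_session("u1", [("jaguar", []), ("jaguar car", ["d1"])])
graph.add_session("u2", [("jaguar", []), ("jaguar car", ["d1"]),
                         ("jaguar car price", ["d2"])])
graph.add_session("u3", [("jaguar", ["d3"]), ("jaguar animal", ["d3"])])
suggestions = graph.suggest("jaguar")
```

Keeping the four relations in separate maps preserves who issued which query and who selected which document, information that a plain query-to-click table would discard.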

16.
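Building a Metasearch Engine Based on Users' Information Needs   Total citations: 5 (self-citations: 0; citations by others: 5)
Han Yi (韩毅). 《图书情报工作》, 2005, 49(1): 125-127
Current Web search engines pay too little attention to user needs, have low recall and precision, and are incompatible with one another. To address these shortcomings, this paper proposes building a metasearch engine based on user needs, analyzes its basic principles, presents its basic structure, and discusses its operating mechanism and key technologies. It argues that such a metasearch engine can give network information resources a degree of structure, enable their self-organization, and improve the recall and precision of network information retrieval.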

17.
Most current machine learning methods for building search engines are based on the assumption that there is a target evaluation metric that evaluates the quality of the search engine with respect to an end user and the engine should be trained to optimize for that metric. Treating the target evaluation metric as a given, many different approaches (e.g. LambdaRank, SoftRank, RankingSVM, etc.) have been proposed to develop methods for optimizing for retrieval metrics. Target metrics used in optimization act as bottlenecks that summarize the training data and it is known that some evaluation metrics are more informative than others. In this paper, we consider the effect of the target evaluation metric on learning to rank. In particular, we question the current assumption that retrieval systems should be designed to directly optimize for a metric that is assumed to evaluate user satisfaction. We show that even if user satisfaction can be measured by a metric X, optimizing the engine on a training set for a more informative metric Y may result in a better test performance according to X (as compared to optimizing the engine directly for X on the training set). We analyze when there is a significant difference between the two cases in terms of the amount of available training data and the number of dimensions of the feature space.

18.
While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.

19.
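Influenced by the popularity of "search engines", people have become accustomed to calling the "information retrieval systems" used in library and information science "academic search engines". Although the two have much in common at both the technical and application levels, they also differ considerably. Traditional centralized search engines can no longer cope with the rapid explosion of information and with a mass user base, so distributed search engines that can provide "cloud services" have become inevitable. The article covers three topics: the key technologies involved in academic search engines, the architecture of distributed search engines, and the main value of distributed search engines in the big data domain. It closes by introducing typical applications of the distributed search engine RMSCIoud.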

20.
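The article analyzes the relationship between "information retrieval" (情报检索) and "document retrieval" (文献检索) as currently discussed in the information science community, along with several common views on the types of information retrieval, and concludes that information retrieval and document retrieval are essentially the same.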
