首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
New Mexico State University's Computing Research Lab has participated in research in all three phases of the US Government's Tipster program. Our work on information retrieval has focused on research and development of multilingual and cross-language approaches to automatic retrieval. The work on automatic systems has been supplemented by additional research into the role of the IR system user in interactive retrieval scenarios: monolingual, multilingual and cross-language. The combined efforts suggest that universal text retrieval, in which a user can find, access and use documents in the face of language differences and information overload, may be possible.  相似文献   

2.
Kleinbergs HITS algorithm (Kleinberg 1999), which was originally developed in a Web context, tries to infer the authoritativeness of a Web page in relation to a specific query using the structure of a subgraph of the Web graph, which is obtained considering this specific query. Recent applications of this algorithm in contexts far removed from that of Web searching (Bacchin, Ferro and Melucci 2002, Ng et al. 2001) inspired us to study the algorithm in the abstract, independently of its particular applications, trying to mathematically illuminate its behaviour. In the present paper we detail this theoretical analysis. The original work starts from the definition of a revised and more general version of the algorithm, which includes the classic one as a particular case. We perform an analysis of the structure of two particular matrices, essential to studying the behaviour of the algorithm, and we prove the convergence of the algorithm in the most general case, finding the analytic expression of the vectors to which it converges. Then we study the symmetry of the algorithm and prove the equivalence between the existence of symmetry and the independence from the order of execution of some basic operations on initial vectors. Finally, we expound some interesting consequences of our theoretical results.Supported in part by a grant from the Italian National Research Council (CNR) research project Technologies and Services for Enhanced Content Delivery.  相似文献   

3.
Locating and Recognizing Text in WWW Images   总被引:4,自引:0,他引:4  
The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and fuzzy n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.  相似文献   

4.
For the definition of electronic records, the use of new terms, like literary warrant, is not necessary, and for the European perspective even not understandable. If this expression simply means best practice and professional culture in recordkeeping, we only to know what creators did for centuries and still do today and probably will do also in the future, by referring to the archival science, diplomatics and archival practice for clarifying definitions in the recordkeeping environment. A multi-disciplinary approach is still required for the electronic recordkeeping system as it was in the past for traditional records, but the theory and the terminology should be consistent and based on the deep understanding of essential characteristics of records and essential requirements of good recordkeeping to produce in the first place and maintain reliable and authentic records. Of course, a record is more than recorded information created in the course of business activity: a record is the recorded representation of an act produced in a specific form – the form prescribed by the legal system – by a creator in the course of its activity.  相似文献   

5.
The Museum is a perspicuous site for analysing the complex interplay between social, organisational, cultural and political factors which have relevance to the design and use of virtual technologies. Specifically, the introduction of virtual technologies in museums runs up against the issue of the situated character of information use. Across a number of disciplines (anthropology, sociology, psychology, cognitive science) there is growing recognition of the situatedness of knowledge and its importance for the design and use of technology. This awareness is fostered by the fact that technological developments are often associated with disappointing gains for users. The effective use of technology relies on the degree to which it can be embedded in or congruent with the local practices of museum users. Drawing upon field research in two museums of science and technology, both of which are in the process of introducing virtual technologies and exploring the possibilities of on-line access, findings are presented which suggest that the success of such developments will depend on the extent to which they are informed by detailed understanding of practice-practices that are essentially socially constituted in the activities of museum visitors and the daily work of museum professionals.  相似文献   

6.
Detection As Multi-Topic Tracking   总被引:1,自引:0,他引:1  
The topic tracking task from TDT is a variant of information filtering tasks that focuses on event-based topics in streams of broadcast news. In this study, we compare tracking to another TDT task, detection, which has the goal of partitioning all arriving news into topics, regardless of whether the topics are of interest to anyone, and even when a new topic appears that had not been previous anticipated. There are clear relationships between the two tasks (under some assumptions, a perfect tracking system could solve the detection problem), but they are evaluated quite differently. We describe the two tasks and discuss their similarities. We show how viewing detection as a form of multi-topic parallel tracking can illuminate the performance tradeoffs of detection over tracking.  相似文献   

7.
Zusammenfassung. Dadurch, dass Literaturnachweise und Publikationen zunehmend in elektronischer und auch vernetzter Form angeboten werden, haben Anzahl und Größe der von wissenschaftlichen Bibliotheken angebotenen Datenbanken erheblich zugenommen. In den verbreiteten Metasuchen über mehrere Datenbanken sind Suchen mit natürlichsprachlichen Suchbegriffen heute der kleinste gemeinsame Nenner. Sie führen aber wegen der bekannten Mängel des booleschen Retrievals häufig zu Treffermengen, die entweder zu speziell oder zu lang und zu unspezifisch sind. Die Technische Fakultät der Universität Bielefeld und die Universitätsbibliothek Bielefeld haben einen auf Fuzzy- Suchlogik basierenden Rechercheassistenten entwickelt, der die Suchanfragen der Benutzer in Teilsuchfragen an die externen Datenbanken zerlegt und die erhaltenen Teilsuchergebnisse in einer nach Relevanz sortierten Liste kumuliert. Es ist möglich, Suchbegriffe zu gewichten und durch Fuzzy- Aggregationsoperatoren zu verknüpfen, die auf der Benutzeroberfläche durch natürlichsprachliche Fuzzy-Quantoren wie möglichst viele, einige u.a. repräsentiert werden. Die Suchparameter werden in der intuitiv bedienbaren einfachen Suche automatisch nach heuristischen Regeln ermittelt, können in einer erweiterten Suche aber auch explizit eingestellt werden. Die Suchmöglichkeiten werden durch Suchen nach ähnlichen Dokumenten und Vorschlagslisten für weitere Suchbegriffe ergänzt. Wir beschreiben die Ausgangssituation, den theoretischen Ansatz, die Benutzeroberfläche und berichten über eine Evalution zur Benutzung und einen Vergleichstest betreffend die Effizienz der Retrievalmethodik.CR Subject Classification: H.3.3, H.3.5Eingegangen am 3. März 2004 / Angenommen am 19. August 2004, Online publiziert am 18. Oktober 2004  相似文献   

8.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. This type of documents is referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply the logistic regression analysis on relevant probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Hereafter, the relevant probability of each test document is interpolated from the fitting curves. Contrary to other probabilistic retrieval models, our model makes only a weak independent assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ads and another one for obituary Web documents, our probabilistic model achieves the averaged recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.  相似文献   

9.
Exploiting Hierarchy in Text Categorization   总被引:4,自引:3,他引:1  
With the recent dramatic increase in electronic access to documents, text categorization—the task of assigning topics to a given document—has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.  相似文献   

10.
Exploiting the Similarity of Non-Matching Terms at Retrieval Time   总被引:2,自引:0,他引:2  
In classic Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as term mismatch, has been recognised for a long time by the Information Retrieval community and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity enables to enhance classic retrieval models by taking into account non-matching terms. The theoretical advantages and drawbacks of these models are presented and compared with other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.  相似文献   

11.
Variability is a central concept in software product family development. Variability empowers constructive reuse and facilitates the derivation of different, customer specific products from the product family. If many customer specific requirements can be realised by exploiting the product family variability, the reuse achieved is obviously high. If not, the reuse is low. It is thus important that the variability of the product family is adequately considered when eliciting requirements from the customer. In this paper we sketch the challenges for requirements engineering for product family applications. More precisely we elaborate on the need to communicate the variability of the product family to the customer. We differentiate between variability aspects which are essential for the customer and aspects which are more related to the technical realisation and need thus not be communicated to the customer. Motivated by the successful usage of use cases in single product development we propose use cases as communication medium for the product family variability. We discuss and illustrate which customer relevant variability aspects can be represented with use cases, and for which aspects use cases are not suitable. Moreover we propose extensions to use case diagrams to support an intuitive representation of customer relevant variability aspects.Received: 14 October 2002, Accepted: 8 January 2003, This work was partially funded by the CAFÉ project From Concept to Application in System Family Engineering; Eureka ! 2023 Programme, ITEA Project ip00004 (BMBF, Förderkennzeichen 01 IS 002 C) and the state Nord-Rhein-Westfalia. This paper is a significant extension of the paper Modellierung der Variabilität einer Produktfamilie, [15].  相似文献   

12.
With a central focus on thecultural contexts of Pacific island societies,this essay examines the entanglement ofcolonial power relations in local recordkeepingpractices. These cultural contexts include theon-going exchange between oral and literatecultures, the aftermath of colonialdisempowerment and reassertion of indigenousrights and identities, the difficulty ofmaintaining full archival systems in isolated,resource-poor micro-states, and the drivinginfluence of development theory. The essayopens with a discussion of concepts ofexploration and evangelism in cross-culturalanalysis as metaphors for archival endeavour. It then explores the cultural exchanges betweenoral memory and written records, orality, andliteracy, as means of keeping evidence andremembering. After discussing the relation ofrecords to processes of political and economicdisempowerment, and the reclaiming of rightsand identities, it returns to the patterns ofarchival development in the Pacific region toconsider how archives can better integrate intotheir cultural and political contexts, with theaim of becoming more valued parts of theircommunities.  相似文献   

13.
We present an optimization approach that generates k-means like clustering algorithms. The batch k-means and the incremental k-means are two well known versions of the classical k-means clustering algorithm (Duda et al. 2000). To benefit from the speed of the batch version and the accuracy of the incremental version we combine the two in a ping–pong fashion. We use a distance-like function that combines the squared Euclidean distance with relative entropy. In the extreme cases our algorithm recovers the classical k-means clustering algorithm and generalizes the Divisive Information Theoretic clustering algorithm recently reported independently by Berkhin and Becher (2002) and Dhillon1 et al. (2002). Results of numerical experiments that demonstrate the viability of our approach are reported.This research was supported in part by the US Department of Defense, the United States–Israel Binational Science Foundation (BSF), and Northrop Grumman Mission Systems (NG/MS).  相似文献   

14.
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application for many learning approaches, which prove effective. Nevertheless, TC provides many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which can learn from both training and test documents to classify new unseen documents. This algorithm is the Semi-Supervised Fuzzy c-Means (ssFCM). Our experiments use Reuters 21578 database and consist of binary classifications for categories selected from the 115 TOPICS classes of the Reuters collection. Using the Vector Space Model, each document is represented by its original feature vector augmented with external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents and does not interfere with the use of training data.  相似文献   

15.
TIJAH: Embracing IR Methods in XML Databases   总被引:1,自引:0,他引:1  
This paper discusses our participation in INEX (the Initiative for the Evaluation of XML Retrieval) using the TIJAH XML-IR system. TIJAHs system design follows a standard layered database architecture, carefully separating the conceptual, logical and physical levels. At the conceptual level, we classify the INEX XPath-based query expressions into three different query patterns. For each pattern, we present its mapping into a query execution strategy. The logical layer exploits score region algebra (SRA) as the basis for query processing. We discuss the region operators used to select and manipulate XML document components. The logical algebra expressions are mapped into efficient relational algebra expressions over a physical representation of the XML document collection using the pre-post numbering scheme. The paper concludes with an analysis of experiments performed with the INEX test collection.  相似文献   

16.
The Archival Bond   总被引:1,自引:0,他引:1  
This paper presents the concept of archival bond as formulated by archival science and used in a research project carried out at the University of British Columbia, entitled The Preservation of Electronic Records. Being one of the essential components of the record, the concept of archival bond is discussed in the context of the traditional diplomatic and archival definitions of records, and its function in demonstrating the reliability and authenticity of records is shown. The most serious challenge with which we are confronted is to make explicit and preserve intact over the long term the archival bond between electronic and non electronic records belonging in the same aggregations.  相似文献   

17.
Public and privateorganizations depend, for their disciplinaryand surveillance power, on the creation andmaintenance of records. Entire societies maybe emprisoned in Foucauldian panopticism, asystem of surveillance and power-knowledge,based on and practised by registration, filing,and records. Archives resemble temples asinstitutions of surveillance and powerarchitecturally, but they also function assuch, because the panoptical archivedisciplines and controls throughknowledge-power. Inside the archives, therituals, surveillance, and discipline serve tomaintain the power of the archives and thearchivist. But the archives' power is (orshould be) the citizen's power too. Theviolation of human rights is documented in thearchives and the citizen who defends himselfappeals to the archives. People value storage as a means to keep account of thepresent for the future. In order to be useableas instruments of empowerment and liberation,archives have to be secured as storage memoryserving society's future functional memories.  相似文献   

18.
In this paper the problem of indexing heterogeneous structured documents and of retrieving semi-structured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries specifying soft constraints on both documents structure and content. At the indexing level we propose a model that achieves flexibility by constructing personalised document representations based on users views of the documents. This is obtained by allowing users to specify their preferences on the documents sections that they estimate to bear the most interesting information, as well as to linguistically quantify the number of sections which determine the global potential interest of the documents. At the query language level, a flexible query language for expressing soft selection conditions on both the documents structure and content is proposed.  相似文献   

19.
The application of relevance feedback techniques has been shown to improve retrieval performance for a number of information retrieval tasks. This paper explores incremental relevance feedback for ad hoc Japanese text retrieval; examining, separately and in combination, the utility of term reweighting and query expansion using a probabilistic retrieval model. Retrieval performance is evaluated in terms of standard precision-recall measures, and also using number-to-view graphs. Experimental results, on the standard BMIR-J2 Japanese language retrieval collection, show that both term reweighting and query expansion improve retrieval performance. This is reflected in improvements in both precision and recall, but also a reduction in the average number of documents which must be viewed to find a selected number of relevant items. In particular, using a simple simulation of user searching, incremental application of relevance information is shown to lead to progressively improved retrieval performance and an overall reduction in the number of documents that a user must view to find relevant ones.  相似文献   

20.
Abstract

Like many other libraries, the Science, Industry and Business Library (SIBL) of The New York Public Library instructs customers in using the Web. In addition, the library is using the Web to further educate and assist its customers. SIBL provides Web access to its catalogs, a Web menu for the selection of electronic databases, Web guides for doing research in various subjects, and Web-accessible instructional materials. The library is also planning Web-based tutorials for its site which will reach a new, remote audience. Remote access to learning opportunities will enhance and extend traditional library services.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号