1.
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article,
we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering
are captured by different metric families. These formal constraints are validated in an experiment involving human assessments,
and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously
belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modified version of BCubed
that avoids the problems found with other metrics.
Felisa Verdejo
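The abstract above names BCubed without defining it. As a rough illustration, the following is a minimal sketch of the commonly cited non-overlapping BCubed precision and recall, averaged over items. The function name `bcubed` and the dictionary-based input format are assumptions made for this example; the modified overlapping-clustering version proposed in the article is not reproduced here.

```python
from collections import defaultdict


def bcubed(clusters, categories):
    """Return (precision, recall) for a non-overlapping clustering.

    clusters:   dict mapping item -> cluster label (system output)
    categories: dict mapping item -> gold category label
    Both dicts are assumed to cover the same, non-empty set of items.
    """
    by_cluster = defaultdict(set)
    by_category = defaultdict(set)
    for item in clusters:
        by_cluster[clusters[item]].add(item)
        by_category[categories[item]].add(item)

    precision = 0.0
    recall = 0.0
    for item in clusters:
        same_cluster = by_cluster[clusters[item]]
        same_category = by_category[categories[item]]
        # Items that share both the cluster and the category with `item`
        # (the item itself is always counted).
        correct = len(same_cluster & same_category)
        precision += correct / len(same_cluster)
        recall += correct / len(same_category)

    n = len(clusters)
    return precision / n, recall / n


if __name__ == "__main__":
    system = {"a": 1, "b": 1, "c": 1, "d": 2}
    gold = {"a": "x", "b": "x", "c": "y", "d": "y"}
    print(bcubed(system, gold))
```

Because each item contributes its own precision and recall, a large impure cluster is penalized once per item it contains rather than once per cluster.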
2.
Nathan Hollier 《Publishing Research Quarterly》2008,24(3):165-174
This article provides a summary of and commentary on ‘A Lovely Kind of Madness: Small and Independent Publishing in Australia’,
an unpublished report by Kate Freeth, commissioned by the Small Press Underground Networking Community (SPUNC), the representative
body for small and independent publishers in Australia, and released in November 2007. Freeth’s 14,000 word report constitutes
the most detailed and comprehensive study of Australian small and independent publishing since the second volume of Michael
Denholm’s Small Press Publishing in Australia (1991) and provides much primary material for policy makers, scholars, and people working in and around the publishing industry.
3.
Evaluation is a major driving force in advancing the state of the art in language technologies. In particular, methods for automatically assessing the quality of machine output are the preferred means of measuring progress, provided that these metrics have been validated against human judgments. Following recent developments in the automatic evaluation of machine translation and document summarization, we present a similar approach, implemented in a measure called POURPRE, an automatic technique for evaluating answers to complex questions based on n-gram co-occurrences between machine output and a human-generated answer key. Until now, the only way to assess the correctness of answers to such questions has involved manual determination of whether an information “nugget” appears in a system's response. The lack of automatic methods for scoring system output is an impediment to progress in the field, which we address with this work. Experiments with the TREC 2003, TREC 2004, and TREC 2005 QA tracks indicate that rankings produced by our metric correlate highly with official rankings, and that POURPRE outperforms direct application of existing metrics.
Dina Demner-Fushman
4.
On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages (total citations: 1; self-citations: 1; citations by others: 0)
Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various
NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was
on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching
and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods apply
mainly well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization
patterns and some combinations of the aforementioned techniques. Furthermore, we also carried out some initial experiments
on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented.
The evaluation carried out on a data set extracted from a corpus of on-line news articles revealed that achieving lemmatization
accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns
results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved through
integrating some basic techniques, which try to exploit the local context the names appear in. Although our explorations were
focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same
problem for other highly inflectional languages with similar phenomena.
Marcin Sydow
5.
Jacob Soll 《Archival Science》2007,7(4):331-342
This article examines the archival methods developed by Colbert to train his son in state administration. Based on Colbert’s
correspondence with his son, it reveals the practices Colbert thought necessary to collect and manage information in his state
encyclopedic archive during the last half of the 17th century.
6.
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document
set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents
are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents
an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately
complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling when the pools
are too small relative to the total document set size can be biased in that they favor relevant documents that contain topic
title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents
for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this biased
relevance set; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The
paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable
test collections to be built.
Ellen Voorhees
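To make the pooling argument above concrete, here is a minimal sketch of depth-k pooling, in which the top k documents from every submitted run are merged into the set sent for judgment. Depth-k pooling is assumed here as the common form of the technique; the abstract itself does not spell out how the pool is constructed, and the names `depth_k_pool` and `judged_fraction` are hypothetical.

```python
def depth_k_pool(runs, k):
    """Union of the top-k document ids from each ranked run.

    runs: list of ranked lists of document ids for a single topic.
    """
    pool = set()
    for ranking in runs:
        pool.update(ranking[:k])
    return pool


def judged_fraction(pool, collection_size):
    """Share of the whole collection that would be judged for the topic."""
    return len(pool) / collection_size


if __name__ == "__main__":
    runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
    pool = depth_k_pool(runs, k=2)
    # The pool size is fixed by the runs and k, so the judged share
    # shrinks as the collection grows.
    for size in (1_000, 1_000_000):
        print(size, judged_fraction(pool, size))
```

The pool size depends only on the runs and on k, which is why a constant-size pool covers a shrinking share of a growing collection.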
7.
We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every
keystroke display those completions of the last query word that would lead to the best hits, and also display the best such
hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents, and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w ∈ W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on the average, in
time linear in the input plus output size, independent of the size of the underlying document collection. At the same time,
our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate
almost perfectly with our theoretical bound.
Ingmar Weber
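The core problem stated above (find all word-in-document pairs (w, d) with w ∈ W and d ∈ D) can be illustrated with a naive baseline over an ordinary inverted index. This is explicitly not the data structure proposed in the article: it scans every posting list in the word range, so its cost is not bounded by the input plus output size. The function names and the toy documents are assumptions made for the sketch.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Build a sorted vocabulary and postings from doc_id -> list of words."""
    postings = {}
    for doc_id, words in docs.items():
        for word in words:
            postings.setdefault(word, set()).add(doc_id)
    return sorted(postings), postings


def pairs_in_range(vocab, postings, low, high, doc_set):
    """Return all (w, d) with low <= w <= high and d in doc_set, sorted."""
    start = bisect_left(vocab, low)
    end = bisect_right(vocab, high)
    return sorted(
        (word, doc_id)
        for word in vocab[start:end]
        for doc_id in postings[word] & doc_set
    )


if __name__ == "__main__":
    docs = {1: ["search", "engine"], 2: ["seal", "search"], 3: ["sea", "engine"]}
    vocab, postings = build_index(docs)
    # The prefix "sea" is modelled as the alphabetical range ["sea", "sea\uffff"].
    print(pairs_in_range(vocab, postings, "sea", "sea\uffff", {1, 2}))
```

Here the set D plays the role of the hits for the already-typed part of the query, and the range W covers all completions of the prefix being typed.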
8.
9.
A summary overview of the children’s and young adult publishing industry in China with a focus on the size of the market,
ten major publishing houses, copyright and trends. Special emphasis has been placed on specific transactions for the sale of
translation rights from German-language publishers to China and minimal activities of German rights sold to Chinese publishers.
Jing Bartz
10.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
ChengXiang Zhai
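For orientation, the "simple collection-based smoothing" baseline mentioned above can be sketched as linear interpolation between a document model and a collection model (Jelinek-Mercer smoothing). The abstract does not say which collection-based variant was used, so this choice, the function name `jm_smoothed_prob`, and the default weight `lam=0.5` are all assumptions. The graph-based count propagation proposed in the article is not shown.

```python
from collections import Counter


def jm_smoothed_prob(term, doc_tokens, collection_counts, collection_len, lam=0.5):
    """P(term | doc) interpolated with P(term | collection).

    lam is the weight given to the collection model (assumed value).
    """
    doc_counts = Counter(doc_tokens)
    p_doc = doc_counts[term] / len(doc_tokens) if doc_tokens else 0.0
    p_coll = collection_counts[term] / collection_len
    return (1 - lam) * p_doc + lam * p_coll


if __name__ == "__main__":
    collection = Counter({"a": 10, "b": 5, "d": 5})
    total = sum(collection.values())
    doc = ["a", "b", "a", "c"]
    # "d" never occurs in the document but still gets non-zero probability.
    print(jm_smoothed_prob("d", doc, collection, total))
```

A query term missing from a document receives probability mass only from the collection as a whole in this baseline, whereas the article's method draws it from related documents.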
11.
Xiaojun Wan 《Information Retrieval》2008,11(1):25-49
In recent years graph-ranking based algorithms have been proposed for single document summarization and generic multi-document
summarization. The algorithms make use of the “votings” or “recommendations” between sentences to evaluate the importance
of the sentences in the documents. This study aims to differentiate the cross-document and within-document relationships between
sentences for generic multi-document summarization and adapt the graph-ranking based algorithm for topic-focused summarization.
The contributions of this study are two-fold: (1) For generic multi-document summarization, we apply the graph-based ranking
algorithm based on each kind of sentence relationship and explore their relative importance for summarization performance.
(2) For topic-focused multi-document summarization, we propose to integrate the relevance of the sentences to the specified
topic into the graph-ranking based method. Each individual kind of sentence relationship is also differentiated and investigated
in the algorithm. Experimental results on DUC 2002–DUC 2005 data demonstrate the great importance of the cross-document relationships
between sentences for both generic and topic-focused multi-document summarizations. Even the approach based only on the cross-document
relationships can perform better than or at least as well as the approaches based on both kinds of relationships between sentences.
12.
Norbert Fuhr 《Information Retrieval》2008,11(3):251-265
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR)
models, which have dominated IR theory for about 20 years. However, the assumptions underlying the PRP often do not hold,
and its view is too narrow for interactive information retrieval (IIR). In this article, a new theoretical framework for interactive
retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents
to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation.
Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering
of the choices can be derived: the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues
of further research are pointed out.
13.
This article concentrates on the retro-archiving of older digital research data. The ADA approach was developed and used to retro-archive older data files, most of which were between 10 and 30 years old. The origin and
main characteristics of the ADA approach are described in the second section of the article. The third section discusses two
recent data-archiving pilot projects that were conducted in the Netherlands. The first of these projects, the ADA project,
laid the foundation for the ADA approach, which was subsequently applied and tested again in the second project, eDNA, which focused on archaeological data. The final section of the article provides a comparison of the results of these
two projects.
Heiko Tjalsma
14.
Panagiotis Symeonidis Alexandros Nanopoulos Apostolos N. Papadopoulos Yannis Manolopoulos 《Information Retrieval》2008,11(1):51-75
Collaborative Filtering (CF) Systems have been studied extensively for more than a decade to confront the “information overload”
problem. Nearest-neighbor CF is based either on similarities between users or between items, to form a neighborhood of users
or items, respectively. Recent research has tried to combine the two aforementioned approaches to improve effectiveness. Traditional
clustering approaches (k-means or hierarchical clustering) have also been used to speed up the recommendation process. In this paper, we use biclustering
to disclose this duality between users and items, by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters
algorithm, which uses a new similarity measure that achieves partial matching of users’ preferences. We apply nearest-biclusters
in combination with two different types of biclustering algorithms—Bimax and xMotif—for constant and coherent biclustering,
respectively. Extensive performance evaluation results in three real-life data sets are provided, which show that the proposed
method substantially improves the performance of the CF process.
15.
Compound noun segmentation is a key first step in language processing for Korean. Thus far, most approaches require some form of human supervision, such as pre-existing dictionaries, segmented compound nouns, or heuristic rules. As a result, they suffer from the unknown word problem, which can be overcome by unsupervised approaches. However, previous unsupervised methods normally do not consider all possible segmentation candidates, and/or rely on character-based segmentation clues such as bi-grams or all-length n-grams. They are therefore prone to falling into a local solution. To overcome the problem, this paper proposes an unsupervised segmentation algorithm that searches for the most likely segmentation result among all possible segmentation candidates using a word-based segmentation context. As word-based segmentation clues, a dictionary is automatically generated from a corpus. Experiments using three test collections show that our segmentation algorithm is successfully applied to Korean information retrieval, improving on a dictionary-based longest-matching algorithm.
Jong-Hyeok Lee
16.
Through a reading of the archived letters of Henry Garnet (1555–1606), Superior of the Jesuit order in England and suspected
Gunpowder plotter, this article investigates the nature of the archive in relation to narrative theory. Figuring the archive
as one of the number of narrating voices accrued by the individual record, I argue that models of communication such as those
put forward by Roman Jakobson, Wayne C. Booth and Seymour Chatman afford useful insights into the ways in which power is inscribed
and reinscribed in the record through successive acts of reading and rewriting.
Paul Wake is a Senior Lecturer in English Literature at Manchester Metropolitan University. He is the author of Conrad’s Marlow (2007), editor, with Simon Malpas, of The Routledge Companion to Critical Theory (2006), and he has published articles on narrative theory and postmodernism.
Paul Wake
17.
Jennifer S. Milligan 《Archival Science》2007,7(4):359-367
Curious Archives examines the creation of the museum of archives, the Musée de l’Histoire de France, at the Imperial Archives
of France under the direction of Leon de Laborde, 1858–1867. This museum was intended as a crucial tool for publicizing the
Archives and educating the public, but also represented a break from the Archives’ role as administrative storehouse both
in practice and in the popular imagination. The museum’s conception and reception reveal conflicts around the Archives’ mission
and contents, particularly regarding public interest, the potential dangers of public curiosity, and the nature of documentary
and historical knowledge in nineteenth-century France.
18.
On rank-based effectiveness measures and optimization (total citations: 1; self-citations: 0; citations by others: 1)
Many current retrieval models and scoring functions contain free parameters which need to be set—ideally, optimized. The process
of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice
of measure that is to be optimized. The paper proposes a way to think about the process of exploring the space of parameter
values, and how moving around in this space might be expected to affect different measures. One result, concerning local optima,
is demonstrated for a range of rank-based evaluation measures.
Hugo Zaragoza
19.
To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving
to enter the international publication market. The article analyzes the background of the going-global strategy, and sums
up the performance of both Chinese administrations and publishers.
Qing Fang (Corresponding author)
20.
This article is a general introduction to the special issue of Archival Science on “archiving research data”. It summarizes
the different contributions and gives an overview of the main issues in this special field of archiving. One of the leading
questions is how and why research data archives differ from public record offices. In the past, the developments in these
two worlds have been rather separate. There are however signs that they are converging in the digital world. In particular,
this can be seen in the areas of metadata and Internet dissemination as these are strongly influenced by the rapid changes
in information technology. These changes have also led to important new developments in the infrastructure of research data
to which special attention is paid. New concepts such as collaboratories, data curation, Open Access and the Open Archives
Initiative are discussed.
Heiko Tjalsma