Found 20 similar documents (search time: 406 ms)
1.
2.
We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every
keystroke display those completions of the last query word that would lead to the best hits, and also display the best such
hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents, and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w ∈ W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on the average, in
time linear in the input plus output size, independent of the size of the underlying document collection. At the same time,
our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate
almost perfectly with our theoretical bound.
Ingmar Weber
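The core problem stated in the abstract above (given a document set D and an alphabetical word range W, list every word-in-document pair (w, d) with w ∈ W and d ∈ D) can be illustrated with a naive baseline over a plain inverted index. This is only a sketch for orientation, not the data structure the paper proposes: the function name `autocomplete_pairs`, the toy `index`, and the choice to express W as a prefix are all assumptions made for this example.

```python
from bisect import bisect_left


def autocomplete_pairs(index, prefix, docs):
    """Return all (word, doc) pairs where word starts with `prefix` and doc is in `docs`.

    `index` is a hypothetical inverted index: word -> list of document ids.
    """
    words = sorted(index)            # alphabetical vocabulary
    lo = bisect_left(words, prefix)  # first word >= prefix
    pairs = []
    for w in words[lo:]:
        if not w.startswith(prefix):
            break                    # left the alphabetical range W
        for d in index[w]:
            if d in docs:            # keep only documents from D
                pairs.append((w, d))
    return pairs


index = {
    "search": [1, 2, 5],
    "seal": [3],
    "season": [2, 4],
    "set": [1, 4],
}
print(autocomplete_pairs(index, "sea", {2, 4}))
```

Note that this baseline scans the full posting list of every word in the range, so its cost grows with the collection; the abstract's claim is precisely that the proposed structure avoids that dependence.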
3.
Fabio Aiolli, Riccardo Cardin, Fabrizio Sebastiani, Alessandro Sperduti 《Information Retrieval》2009,12(5):559-580
In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between
the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which
represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization
research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose
an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact
on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification
in which the distinction between primary and secondary categories is present; these results are obtained by reformulating
the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label
multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we
improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data
expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also
in the learning phase, thus differentiating their impact on the classifiers to be generated.
4.
Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, Chris Buckley, Donna Harman 《Information Retrieval》2009,12(6):680-694
Experiments were conducted to explore the impact of combining various components of eight leading information retrieval systems.
Each system demonstrated improved effectiveness through the use of blind feedback, also known as pseudo-relevance feedback, a form of query expansion. Blind feedback uses the results of a preliminary retrieval step to augment the efficacy of a
secondary retrieval step. The hybrid combination of primary and secondary retrieval steps from different systems in a number
of cases yielded better effectiveness than either of the constituent systems alone. This positive combining effect was observed
when entire documents were passed between the two retrieval steps, but not when only the expansion terms were passed. Several
combinations of primary and secondary retrieval steps were fused using the CombMNZ algorithm; all yielded significant effectiveness
improvement over the individual systems, with the best yielding an improvement of 13% (p = 10^−6) over the best individual system and an improvement of 4% (p = 10^−5) over a simple fusion of the eight systems.
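For readers unfamiliar with CombMNZ, the fusion rule mentioned in the abstract above sums each document's normalized scores across runs and multiplies that sum by the number of runs that retrieved the document. The sketch below is a minimal illustration under stated assumptions: min-max score normalization is one common choice and is not taken from the abstract, and the function name `comb_mnz` and the two toy runs are hypothetical.

```python
def comb_mnz(runs):
    """Fuse runs with CombMNZ; each run maps doc id -> retrieval score."""
    fused = {}  # sum of normalized scores per document
    hits = {}   # number of runs that retrieved the document
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for doc, score in run.items():
            # Assumption: min-max normalization to [0, 1] within each run.
            fused[doc] = fused.get(doc, 0.0) + (score - lo) / span
            hits[doc] = hits.get(doc, 0) + 1
    scored = [(doc, fused[doc] * hits[doc]) for doc in fused]
    # Highest fused score first; ties broken by document id.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


run_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
run_b = {"d2": 9.0, "d3": 1.0, "d4": 5.0}
for doc, score in comb_mnz([run_a, run_b]):
    print(doc, score)
```

The multiplication by the hit count is what rewards documents retrieved by several systems, which is the intuition behind fusing runs in the first place.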
5.
The definitions of the rational and real-valued variants of the h-index and g-index are reviewed. It is shown how they can be obtained both graphically and by calculation. Formulae are derived expressing the exact relations between the h-variants and between the g-variants. Subsequently these relations are examined. In a citation context the real h-index is often, but not always, smaller than the rational h-index. It is also shown that the relation between the real and the rational g-index depends on the number of citations of the article ranked g + 1. Maximum differences between h, h_r and h_rat on the one hand and between g, g_r and g_rat on the other are determined.
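As background for the variants discussed above, the sketch below computes only the standard integer h-index and g-index that the rational and real-valued variants refine; it does not implement those variants. The function names are hypothetical, and the g-index is restricted here to at most the number of papers, which is an assumption of this example (some definitions pad the list with uncited papers).

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g


cites = [10, 8, 5, 4, 3]
print(h_index(cites), g_index(cites))
```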
6.
A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T, that requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78 based indexes is the reduction of the search
time dependency on m from O(m^2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose
and explore the nature of a recurrent structure in LZ-indexes, the suffix tree. We show that our method is very competitive in practice by comparing it against other state of the art compressed
indexes.
Arlindo L. Oliveira
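Since the index above builds on LZ78 data compression, a short sketch of the LZ78 parsing itself may help: the text is cut into phrases, each consisting of the longest previously seen phrase plus one new character. This illustrates only the parsing step, not the self-index described in the abstract; the function names `lz78_parse` and `lz78_decode` are hypothetical.

```python
def lz78_parse(text):
    """Split `text` into LZ78 phrases encoded as (previous phrase id, next char)."""
    dictionary = {"": 0}  # phrase -> id; id 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending the longest known phrase
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # Leftover text is an already known phrase; emit it via its own prefix.
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (id, char) pairs."""
    table = [""]
    out = []
    for idx, ch in phrases:
        phrase = table[idx] + ch
        table.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_parse("abababa")
print(pairs)
print(lz78_decode(pairs))
```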
7.
Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically
structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical
structure, so far the attention of text classification researchers has mostly focused on algorithms for “flat” classification,
i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical
classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus
be suboptimal, in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm consisting of a hierarchical variant of AdaBoost.MH, a very well-known member of the family of “boosting” learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection
of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification
scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting
round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurs over the tree structure. We present the results of experiments with TreeBoost.MH on three HTC benchmarks, and discuss its computational cost analytically.
Fabrizio Sebastiani
8.
《Communication methods and measures》2013,7(4):323-338
Reliability is an important bottleneck for content analysis and similar methods for generating analyzable data. This is because complex qualitative phenomena such as texts, social interactions, and media images easily escape physical measurement and call for human coders to describe what they read or observe. Owing to coders' inescapable individual differences in background, the data they generate for subsequent analysis are prone to errors not typically found in mechanical measuring devices. However, most agreement measures designed to indicate whether data are sufficiently reliable to warrant subsequent analysis do not differentiate among kinds of disagreement that make data unreliable. This paper distinguishes two kinds of disagreement, systematic disagreement and random disagreement, and suggests measures of them in conjunction with the agreement coefficient α (alpha) (Krippendorff, 2004a, pp. 211–256). These measures, previously proposed for interval data (Krippendorff, 1970), are here developed for nominal data. Their importance lies in their ability to not only aid the development of reliable coding instructions but also warn researchers about two kinds of errors they face when using imperfect data.
9.
Recently, the margins between gaming and feminism have become increasingly contentious (Salter & Blodgett, 2012). This article addresses a cultural moment where masculine gaming culture became aware of and began responding to feminist game scholars by analyzing GamerGate conspiracy documents and social media discussions related to the now infamous “DiGRA fishbowl.” Worries about the opacity of academic practices and a disparaging of feminist knowledge-making practices dominate these documents. By looking at these discussions and practices through the lens of conspiracy theories (Fenster, 2008; Hofstadter, 1952) and counterknowledge (Fiske, 1994) we consider the broader meaning of GamerGate's attention to academia.
10.
《Communication methods and measures》2013,7(3):256-272
Hayes, Glynn, and Shanahan (2005a) introduced the Willingness to Self-Censor Scale as a measure of the extent to which a person uses cues about the climate of opinion when deciding whether to publicly voice opinions. The study reported here provides new validation evidence, collected during actual rather than hypothetical discussions. Each participant interacted with two confederates about a controversial topic. The confederates were trained to produce a discussion climate that was either consistent or inconsistent with the participant's own opinion on the topic. The manipulation of the climate of opinion affected opinion expression only among dispositional self-censors (i.e., those scoring relatively higher on the scale), even after controlling for dispositional shyness. As expected, people who scored relatively low were unaffected by information about the climate of opinion. These results further attest to the construct validity of the Willingness to Self-Censor Scale.
11.
《Communication methods and measures》2013,7(2):137-164
Co-cultural theory provides a theoretical framework that examines the ways that members of co-cultural groups communicate when interacting with members of a dominant culture (Orbe, 1998a). The tenets of the theory were inductively derived via phenomenological analyses of focus group and interview data. Two of the central theoretical components, preferred outcome and communication approach, have been conceptualized as general tendencies that influence communication practices by co-cultural group members within interactions with members of dominant cultural groups. This article reports on the design of a self-report measure of these two components of co-cultural theory and provides evidence from two studies for the construct validity and reliability of the co-cultural theory scales (C-CTS).
12.
We were shooting on the steps of the Metropolitan Museum one night. It was lit romantically, and Jennifer was wearing an evening gown, looking incredibly stunning. Suddenly there must have been a thousand people screaming her name. It was like witnessing this icon. (Ralph Fiennes in the New York Times, 2002, p. 16, emphasis added) This stamp, honoring a Mexican artist who has transcended “la frontera” and has become an icon to Hispanics, feminists, and art lovers, will be a further reminder of the continuous cultural contributions of Latinos to the United States. (Cecilia Alvear, President of National Association of Hispanic Journalists (NAHJ) on the occasion of the introduction of the Frida Kahlo U.S. postage stamp, 2001; emphasis added) “Nothing Like the Icon on the Fridge” (column about Salma Hayek’s Frida by Stephanie Zacharek in the New York Times, 2002).
13.
The archival sliver: Power, memory, and archives in South Africa (total citations: 3; self-citations: 3; citations by others: 0)
Verne Harris 《Archival Science》2002,2(1-2):63-86
Far from being a simple reflection of reality, archives are constructed windows into personal and collective processes. They
at once express and are instruments of prevailing relations of power. Verne Harris makes these arguments through an account
of archives and archivists in the context of South Africa's transition from apartheid to democracy. The account is deliberately
shaped around three themes — race, power, and public records. While he concedes that the constructedness of memory and the
dimension of power are most obvious in the extreme circumstances of oppression and rapid transition to democracy, he argues
that these are realities informing archives in all circumstances. He makes an appeal to archivists to enchant their work by
engaging these realities and by turning always towards the call of and for justice.
This essay draws heavily on four articles published previously by me: “Towards a Culture of Transparency: Public Rights of
Access to Official Records in South Africa”, American Archivist 57.4 (1994); “Redefining Archives in South Africa: Public Archives and Society in Transition, 1990–1996”, Archivaria 42 (1996); “Transforming Discourse and Legislation: A Perspective on South Africa's New National Archives Act”, ACARM Newsletter 18 (1996); and “Claiming Less, Delivering More: A Critique of Positivist Formulations on Archives in South Africa”, Archivaria 44 (1997). I am grateful to Ethel Kriger (National Archives of South Africa) and Tim Nuttall (University of Natal) for offering
sometimes tough comment on an early draft of the essay. I remain, of course, fully responsible for the final text. I presented
a version of it in the “Refiguring the Archive” seminar series, University of the Witwatersrand, Johannesburg, October 1998.
That version was published in revised form in Carolyn Hamilton et al., Refiguring the Archive (Cape Town: David Philip, 2002).
14.
Direct optimization of evaluation measures has become an important branch of learning to rank for information retrieval (IR).
Since IR evaluation measures are difficult to optimize due to their non-continuity and non-differentiability, most direct
optimization methods optimize some surrogate functions instead, which we call surrogate measures. A critical issue regarding
these methods is whether the optimization of the surrogate measures can really lead to the optimization of the original IR
evaluation measures. In this work, we perform formal analysis on this issue. We propose a concept named “tendency correlation”
to describe the relationship between a surrogate measure and its corresponding IR evaluation measure. We show that when a
surrogate measure has arbitrarily strong tendency correlation with an IR evaluation measure, the optimization of it will lead
to the effective optimization of the original IR evaluation measure. Then, we analyze the tendency correlations of the surrogate
measures optimized in a number of direct optimization methods. We prove that the surrogate measures in SoftRank and ApproxRank
can have arbitrarily strong tendency correlation with the original IR evaluation measures, regardless of the data distribution,
when some parameters are appropriately set. However, the surrogate measures in SVM^MAP, DORM^NDCG, PermuRank^MAP, and SVM^NDCG
cannot have arbitrarily strong tendency correlation with the original IR evaluation measures on certain distributions of
data. Therefore SoftRank and ApproxRank are theoretically sounder than SVM^MAP, DORM^NDCG, PermuRank^MAP, and SVM^NDCG, and are expected to result in better ranking performances. Our theoretical findings can explain the experimental results
observed on public benchmark datasets.
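The non-continuity that motivates surrogate measures in the abstract above can be seen directly in NDCG: the measure depends on model scores only through the ranking they induce, so it stays flat under small score changes and jumps when two documents swap places. The sketch below uses one common NDCG formulation (exponential gain, log2 discount), which is an assumption of this example rather than something stated in the abstract; the function names and toy data are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(scores, labels):
    """NDCG of the ranking induced by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ideal = dcg(sorted(labels, reverse=True))
    if ideal == 0:
        return 0.0
    return dcg([labels[i] for i in order]) / ideal


labels = [2, 0, 1]                        # graded relevance of three documents
print(ndcg([0.9, 0.5, 0.40], labels))     # the relevant third document is ranked last
print(ndcg([0.9, 0.5, 0.49], labels))     # score moved, ranking unchanged: same value
print(ndcg([0.9, 0.5, 0.51], labels))     # tiny further move swaps two documents
```

Because the value changes only at such swaps, gradients of NDCG with respect to the scores are zero almost everywhere, which is why direct optimization methods work with smooth surrogates instead.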
15.
Matthew M. Martin, Sydney M. Staggers, Carolyn M. Anderson 《Communication Research Reports》2013,30(3):275-280
This report is a validity study involving the Cognitive Flexibility Scale (Martin & Rubin, 1995). Participants completed an online questionnaire. As predicted, cognitive flexibility was positively related to measures of intellectual flexibility and self-compassion, and negatively related to a measure of dogmatism. The prediction that cognitive flexibility would be negatively related to preference for consistency was not supported.
16.
Keshra Sangwal 《Journal of Informetrics》2013,7(2):487-504
The distributions of citations L, two- (IF2) and five-year impact factors (IF5), and citation half-lives λ of journals published in different selected countries are analyzed using a Langmuir-type relation: y_n = y_0{1 − αKn/(1 + Kn)}, where y_n denotes L_n, IF2_n or IF5_n of the n-ranked journal, y_0 is the value of y_n when journal rank n = 0, α is an empirical effectiveness parameter, and K is the Langmuir constant. It was found that: (1) the general features of the distribution of L_n, IF2_n or IF5_n of the journals published in different individual countries are similar to the results obtained before by the author from the analysis of the citation distribution data of papers of individual authors (K. Sangwal, Journal of Informetrics 7 (2013) 36–49), (2) in contrast to the theoretically expected value of the effectiveness parameter α = 1, the calculated values of α exceed 1 for journals published in different countries, (3) the trends of the distribution of cited half-lives λ_n of journals differ from those of L_n, IF2_n and IF5_n data for different countries, and show one, two or three linear regions, the longest linear regions with low slopes being observed in the case of countries publishing a relatively high number of journals, and (4) the product of the Langmuir constant K and the number N of journals for the processes of citations and two- and five-year impact factors of journals published in different countries is constant for a process. The results suggest that: (1) the values of α > 1 are associated with a process that retards the generation of items (i.e. citations or impact factors), the difference (α − 1) being related to the dissemination of contents of the journals published by a country, and (2) the constancy of KN is related to the publication potential of a country.
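The Langmuir-type relation quoted above, y_n = y_0{1 − αKn/(1 + Kn)}, is easy to evaluate numerically. The sketch below simply codes that formula; the function name and all parameter values are made up for illustration and are not fitted values from the paper.

```python
def langmuir_rank_value(n, y0, alpha, K):
    """Evaluate y_n = y0 * (1 - alpha*K*n / (1 + K*n)) for journal rank n."""
    return y0 * (1.0 - alpha * K * n / (1.0 + K * n))


# Hypothetical parameters: y0 = 100, K = 0.5.
for n in (0, 2, 8):
    print(n, langmuir_rank_value(n, y0=100.0, alpha=1.0, K=0.5),
          langmuir_rank_value(n, y0=100.0, alpha=1.25, K=0.5))
```

With α = 1 the curve decays toward zero without reaching it, whereas for α > 1 the expression reaches zero at the finite rank n = 1/(K(α − 1)), which is rank 8 for the made-up values α = 1.25 and K = 0.5 used here.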
17.
《Communication methods and measures》2013,7(3):223-249
Two studies are utilized to test a revised version of Guerrero, Andersen, Eloy, Spitzberg, and Jorgensen's (1995) communicative responses to jealousy (CRJ) scale and examine how measures from the CRJ associate with relational satisfaction. Study 1 uses exploratory factor analysis to identify a preliminary factor structure. Study 2 uses confirmatory factor analysis to determine whether this factor structure holds across a second sample, as well as structural equation modeling to test hypotheses regarding the associations between communicative responses to jealousy and relational satisfaction. These studies suggest that there are 11 specific communicative responses to jealousy that fall under four superordinate categories: (a) destructive communication, which consists of negative communication, counter-jealousy induction, and violence; (b) constructive communication, which includes integrative communication and compensatory restoration; (c) avoidance, which comprises silence and denial; and (d) rival-focused communication, which includes signs of possession, surveillance, rival contacts, and derogation of the rival. Destructive communication and, to a lesser extent, rival-focused communication associated negatively with relational satisfaction, whereas constructive communication associated positively. Recommendations for using the CRJ scale in future studies are provided.
18.
Conclusion No reasonable person could argue against learning to read. The point of this article is that learning to read is not just
a matter of mastering a few simple skills, nor is literacy just a matter of passing a reading test. Learning to read must
involve acquiring the reading habit. Literacy must be viewed as the regular exercise of reading skills through reading books.
The time-honored reasons why children should read books are now bolstered and supplemented by new research evidence that book
reading can make a unique and powerful contribution to children's reading development.
Our society, then, must provide all possible encouragement and opportunity for children to read books. Access to books is
a necessary condition for becoming a good reader. Reading itself is the key to literacy. Helping America's children build
lifelong reading habits must now be regarded as a true national priority.
Education…has produced a vast population able to read but unable to distinguish what is worth reading —George Macaulay Trevelyan, English Social History
Good habits gather by unseen degrees—As brooks make rivers, rivers run to seas. —John Dryden, Ovid, Metamorphoses
Professor Richard C. Anderson is the center's director. Their research assistant
19.
From a sociolinguistic and discourse-analytic perspective, news stories have often been considered as operating within a similar structural framework to oral narratives (Labov, 1972), sharing formal elements with narratives produced in other contexts (although as Bell (1991) has demonstrated in relation to print news, these elements occur in temporal disorganization). In this paper, in line with other recent treatments of news stories, we suggest that news does not conform to this kind of “narrative” structure as such. Examining data taken from print and live-broadcast TV news through a Sacksian (1995) lens, we argue that it is possible to simplify the analysis of news structure by approaching the news as “stories,” where the story elements are organized around the notions of category, action, and reason rather than as a series of narrative clauses involving orientation, complicating actions, evaluation, and resolution (Bell, 1991; van Dijk, 1988).
20.
From work to text to document (total citations: 1; self-citations: 1; citations by others: 0)
David Beard 《Archival Science》2008,8(3):217-226
The defining trope for the humanities in the last 30 years has been typified by the move from “work” to “text.” The signature text defining this move has been Roland Barthes's seminal essay, “From Work to Text.” But the current move
in library, archival and information studies toward the “document” as the key term offers challenges for contemporary humanities research. In making our own movement from work to text to document, we can explicate fully the complexity of conducting archival humanistic research within disciplinary and institutional contexts
in the twenty-first century. This essay calls for a complex perspective, one that demands that we understand the raw materials
of scholarship are processed by disciplines, by institutions, and by the work of the scholar. When we understand our materials
as constrained by disciplines, we understand them as “works.” When we understand them as constrained by the institutions of
memory that preserve and grant access to them, we understand them as “documents.” And when we understand them as the ground
for our own interpretive activity, we understand them as “texts.” When we understand that humanistic scholarship requires
an awareness of all three perspectives simultaneously (an understanding demonstrated by case studies in historical studies
of the discipline of rhetoric), we will be ready for a richer historical scholarship as well as a richer collaboration between
humanists and archivists.