We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every keystroke display those completions of the last query word that would lead to the best hits, and also display the best such hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w ∈ W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on average, in time linear in the input plus output size, independent of the size of the underlying document collection. At the same time, our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate almost perfectly with our theoretical bound.
Ingmar Weber
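The core problem stated in this abstract (given a document set D and a word range W, list all pairs (w, d) with w ∈ W and d ∈ D) can be illustrated with a naive baseline. This is only a sketch built on a plain inverted index, not the data structure the paper proposes; the function names `build_index` and `completions`, the toy collection, and the use of a prefix to stand for the range W are all assumptions made for illustration.

```python
from bisect import bisect_left

def build_index(docs):
    """Map each word to the set of document ids containing it (an inverted index)."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    vocab = sorted(index)  # sorted vocabulary, so a prefix is a contiguous range
    return vocab, index

def completions(vocab, index, prefix, doc_set):
    """Return all (w, d) pairs where w starts with `prefix` and d is in `doc_set`."""
    pairs = []
    start = bisect_left(vocab, prefix)  # first word >= prefix
    for word in vocab[start:]:
        if not word.startswith(prefix):
            break  # left the alphabetical range W
        for d in sorted(index[word] & doc_set):
            pairs.append((word, d))
    return pairs

# Hypothetical toy collection
docs = {1: "search engine query", 2: "query completion", 3: "search engines"}
vocab, index = build_index(docs)
print(completions(vocab, index, "eng", {1, 3}))
```

Unlike the bound claimed in the abstract, this baseline touches every posting set in the range, so its cost grows with the size of the collection.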

In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

Experiments were conducted to explore the impact of combining various components of eight leading information retrieval systems. Each system demonstrated improved effectiveness through the use of blind feedback, also known as pseudo-relevance feedback, a form of query expansion. Blind feedback uses the results of a preliminary retrieval step to augment the efficacy of a secondary retrieval step. The hybrid combination of primary and secondary retrieval steps from different systems in a number of cases yielded better effectiveness than either of the constituent systems alone. This positive combining effect was observed when entire documents were passed between the two retrieval steps, but not when only the expansion terms were passed. Several combinations of primary and secondary retrieval steps were fused using the CombMNZ algorithm; all yielded significant effectiveness improvement over the individual systems, with the best yielding an improvement of 13% (p = 10^-6) over the best individual system and an improvement of 4% (p = 10^-5) over a simple fusion of the eight systems.
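For readers unfamiliar with CombMNZ, the following is a minimal sketch of the usual formulation: each document's fused score is the sum of its normalized scores across runs, multiplied by the number of runs that retrieved it. The min-max normalization, the function names, and the toy runs are assumptions; the abstract does not say how the study normalized scores.

```python
def min_max(run):
    """Rescale one system's scores to [0, 1]; a constant run maps to 1.0."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def comb_mnz(runs):
    """Rank documents by (sum of normalized scores) * (number of runs retrieving them)."""
    total, hits = {}, {}
    for run in runs:
        for doc, s in min_max(run).items():
            total[doc] = total.get(doc, 0.0) + s
            hits[doc] = hits.get(doc, 0) + 1
    fused = {doc: total[doc] * hits[doc] for doc in total}
    # Highest fused score first; ties broken by document id
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical retrieval runs: document id -> retrieval score
run_a = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
run_b = {"d2": 0.9, "d4": 0.1}
print(comb_mnz([run_a, run_b]))
```

In this toy case `d2` moves to the top because both runs retrieve it, which is the effect the multiplier is meant to reward.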

The definitions of the rational and real-valued variants of the h-index and g-index are reviewed. It is shown how they can be obtained both graphically and by calculation. Formulae are derived expressing the exact relations between the h-variants and between the g-variants. Subsequently these relations are examined. In a citation context the real h-index is often, but not always, smaller than the rational h-index. It is also shown that the relation between the real and the rational g-index depends on the number of citations of the article ranked g + 1. Maximum differences between h, h_r and h_rat on the one hand and between g, g_r and g_rat on the other are determined.
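As background for the variants discussed in this abstract, here is a minimal sketch of the standard integer h-index and g-index. It assumes the common definitions (h: the largest rank h such that the top h papers each have at least h citations; g: the largest rank g such that the top g papers together have at least g² citations, with g capped at the number of papers). The rational and real-valued variants are not reproduced, since the abstract does not state their formulae.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    g, running = 0, 0
    for rank, c in enumerate(ranked, start=1):
        running += c  # cumulative citations of the top `rank` papers
        if running >= rank * rank:
            g = rank
    return g

# Hypothetical citation record
record = [10, 8, 5, 4, 3]
print(h_index(record), g_index(record))
```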

A compressed full-text self-index for a text T of size u is a data structure used to search for patterns P of size m in T. It requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m^2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the suffix tree. We show that our method is very competitive in practice by comparing it against other state-of-the-art compressed indexes.
Arlindo L. Oliveira
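The index described in this abstract builds on LZ78 compression. The sketch below shows only the textbook LZ78 parse, in which each phrase is a previously seen phrase extended by one character; it is not the self-index itself, and the function names are assumptions made for illustration.

```python
def lz78_parse(text):
    """Split `text` into LZ78 phrases, each encoded as (index of earlier phrase, next char)."""
    dictionary = {"": 0}  # phrase -> phrase number; 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending while the phrase is already known
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # Text ended inside a known phrase: emit it with no extension character
        phrases.append((dictionary[current], ""))
    return phrases

def lz78_decode(phrases):
    """Rebuild the text from the (index, char) pairs produced by lz78_parse."""
    table = [""]
    out = []
    for idx, ch in phrases:
        phrase = table[idx] + ch
        table.append(phrase)
        out.append(phrase)
    return "".join(out)

print(lz78_parse("abababa"))
```

For the input `"abababa"` the phrases are `a`, `b`, `ab`, `aba`. Roughly speaking, an occurrence of a pattern may span several such phrases, which is what makes searching in LZ78-based indexes harder than parsing.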

Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical structure, so far the attention of text classification researchers has mostly focused on algorithms for “flat” classification, i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus be suboptimal in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm consisting of a hierarchical variant of AdaBoost.MH, a very well-known member of the family of “boosting” learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurs over the tree structure. We present the results of experiments with TreeBoost.MH on three HTC benchmarks, and discuss analytically its computational cost.
Fabrizio Sebastiani

Reliability is an important bottleneck for content analysis and similar methods for generating analyzable data. This is because complex qualitative phenomena such as texts, social interactions, and media images easily escape physical measurement and call for human coders to describe what they read or observe. Owing to coders' inescapable individual differences in background, the data they generate for subsequent analysis are prone to errors not typically found in mechanical measuring devices. However, most agreement measures designed to indicate whether data are sufficiently reliable to warrant subsequent analysis do not differentiate among the kinds of disagreement that make data unreliable. This paper distinguishes two kinds of disagreement, systematic disagreement and random disagreement, and suggests measures of them in conjunction with the agreement coefficient α (alpha) (Krippendorff, 2004a, pp. 211–256). These measures, previously proposed for interval data (Krippendorff, 1970), are here developed for nominal data. Their importance lies in their ability to not only aid the development of reliable coding instructions but also warn researchers about two kinds of errors they face when using imperfect data.

Recently, the margins between gaming and feminism have become increasingly contentious (Salter & Blodgett, 2012). This article addresses a cultural moment where masculine gaming culture became aware of and began responding to feminist game scholars by analyzing GamerGate conspiracy documents and social media discussions related to the now infamous “DiGRA fishbowl.” Worries about the opacity of academic practices and a disparaging of feminist knowledge-making practices dominate these documents. By looking at these discussions and practices through the lens of conspiracy theories (Fenster, 2008; Hofstadter, 1952) and counterknowledge (Fiske, 1994), we consider the broader meaning of GamerGate's attention to academia.

Hayes, Glynn, and Shanahan (2005a) introduced the Willingness to Self-Censor Scale as a measure of the extent to which a person uses cues about the climate of opinion when deciding whether to publicly voice opinions. The study reported here provides new validation evidence, collected during actual rather than hypothetical discussions. Each participant interacted with two confederates about a controversial topic. The confederates were trained to produce a discussion climate that was either consistent or inconsistent with the participant's own opinion on the topic. The manipulation of the climate of opinion affected opinion expression only among dispositional self-censors (i.e., those scoring relatively higher on the scale), even after controlling for dispositional shyness. As expected, people who scored relatively low were unaffected by information about the climate of opinion. These results further attest to the construct validity of the Willingness to Self-Censor Scale.

Co-cultural theory provides a theoretical framework that examines the ways that members of co-cultural groups communicate when interacting with members of a dominant culture (Orbe, 1998a). The tenets of the theory were inductively derived via phenomenological analyses of focus group and interview data. Two of the central theoretical components, preferred outcome and communication approach, have been conceptualized as general tendencies that influence communication practices by co-cultural group members within interactions with members of dominant cultural groups. This article reports on the design of a self-report measure of these two components of co-cultural theory and provides evidence from two studies for the construct validity and reliability of the co-cultural theory scales (C-CTS).

We were shooting on the steps of the Metropolitan Museum one night. It was lit romantically, and Jennifer was wearing an evening gown, looking incredibly stunning. Suddenly there must have been a thousand people screaming her name. It was like witnessing this icon. (Ralph Fiennes in the New York Times, 2002, p. 16, emphasis added)

This stamp, honoring a Mexican artist who has transcended “la frontera” and has become an icon to Hispanics, feminists, and art lovers, will be a further reminder of the continuous cultural contributions of Latinos to the United States. (Cecilia Alvear, President of the National Association of Hispanic Journalists (NAHJ), on the occasion of the introduction of the Frida Kahlo U.S. postage stamp, 2001; emphasis added)

“Nothing Like the Icon on the Fridge” (column about Salma Hayek’s Frida by Stephanie Zacharek in the New York Times, 2002).

The archival sliver: Power, memory, and archives in South Africa
Far from being a simple reflection of reality, archives are constructed windows into personal and collective processes. They at once express and are instruments of prevailing relations of power. Verne Harris makes these arguments through an account of archives and archivists in the context of South Africa's transition from apartheid to democracy. The account is deliberately shaped around three themes: race, power, and public records. While he concedes that the constructedness of memory and the dimension of power are most obvious in the extreme circumstances of oppression and rapid transition to democracy, he argues that these are realities informing archives in all circumstances. He makes an appeal to archivists to enchant their work by engaging these realities and by turning always towards the call of and for justice. This essay draws heavily on four articles published previously by me: “Towards a Culture of Transparency: Public Rights of Access to Official Records in South Africa”, American Archivist 57.4 (1994); “Redefining Archives in South Africa: Public Archives and Society in Transition, 1990–1996”, Archivaria 42 (1996); “Transforming Discourse and Legislation: A Perspective on South Africa's New National Archives Act”, ACARM Newsletter 18 (1996); and “Claiming Less, Delivering More: A Critique of Positivist Formulations on Archives in South Africa”, Archivaria 44 (1997). I am grateful to Ethel Kriger (National Archives of South Africa) and Tim Nuttall (University of Natal) for offering sometimes tough comment on an early draft of the essay. I remain, of course, fully responsible for the final text. I presented a version of it in the “Refiguring the Archive” seminar series, University of the Witwatersrand, Johannesburg, October 1998. That version was published in revised form in Carolyn Hamilton et al., Refiguring the Archive (Cape Town: David Philip, 2002).

Direct optimization of evaluation measures has become an important branch of learning to rank for information retrieval (IR). Since IR evaluation measures are difficult to optimize due to their non-continuity and non-differentiability, most direct optimization methods optimize some surrogate functions instead, which we call surrogate measures. A critical issue regarding these methods is whether the optimization of the surrogate measures can really lead to the optimization of the original IR evaluation measures. In this work, we perform formal analysis on this issue. We propose a concept named “tendency correlation” to describe the relationship between a surrogate measure and its corresponding IR evaluation measure. We show that when a surrogate measure has arbitrarily strong tendency correlation with an IR evaluation measure, the optimization of it will lead to the effective optimization of the original IR evaluation measure. Then, we analyze the tendency correlations of the surrogate measures optimized in a number of direct optimization methods. We prove that the surrogate measures in SoftRank and ApproxRank can have arbitrarily strong tendency correlation with the original IR evaluation measures, regardless of the data distribution, when some parameters are appropriately set. However, the surrogate measures in SVM MAP, DORM NDCG, PermuRank MAP, and SVM NDCG cannot have arbitrarily strong tendency correlation with the original IR evaluation measures on certain distributions of data. Therefore SoftRank and ApproxRank are theoretically sounder than SVM MAP, DORM NDCG, PermuRank MAP, and SVM NDCG, and are expected to result in better ranking performance. Our theoretical findings can explain the experimental results observed on public benchmark datasets.

This report is a validity study involving the Cognitive Flexibility Scale (Martin & Rubin, 1995). Participants completed an online questionnaire. As predicted, cognitive flexibility was positively related to measures of intellectual flexibility and self-compassion, and negatively related to a measure of dogmatism. The prediction that cognitive flexibility would be negatively related to preference for consistency was not supported.

The distributions of citations L, two-year (IF2) and five-year (IF5) impact factors, and citation half-lives λ of journals published in different selected countries are analyzed using a Langmuir-type relation: y_n = y_0 {1 − αKn/(1 + Kn)}, where y_n denotes L_n, IF2_n or IF5_n of the n-ranked journal, y_0 is the value of y_n when journal rank n = 0, α is an empirical effectiveness parameter, and K is the Langmuir constant. It was found that: (1) the general features of the distribution of L_n, IF2_n or IF5_n of the journals published in different individual countries are similar to the results obtained before by the author from the analysis of the citation distribution data of papers of individual authors (K. Sangwal, Journal of Informetrics 7 (2013) 36–49); (2) in contrast to the theoretically expected value of the effectiveness parameter α = 1, the calculated values of α exceed 1 for journals published in different countries; (3) the trends of the distribution of cited half-lives λ_n of journals differ from those of the L_n, IF2_n and IF5_n data for different countries, and show one, two or three linear regions, with the longest linear regions with low slopes observed in the case of countries publishing a relatively high number of journals; and (4) the product of the Langmuir constant K and the number N of journals for the processes of citations and two- and five-year impact factors of journals published in different countries is constant for a process. The results suggest that: (1) the values of α > 1 are associated with a process that retards the generation of items (i.e. citations or impact factors), the difference (α − 1) being related to the dissemination of contents of the journals published by a country, and (2) the constancy of KN is related to the publication potential of a country.
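To make the Langmuir-type relation concrete, the sketch below evaluates y_n = y_0 {1 − αKn/(1 + Kn)} for a given rank. The function names and the parameter values are purely hypothetical and are not taken from the paper. The second helper follows from setting y_n = 0 in the formula: when α > 1, the curve reaches zero at the finite rank n = 1/(K(α − 1)).

```python
def langmuir_rank_value(n, y0, alpha, K):
    """Evaluate y_n = y0 * (1 - alpha*K*n / (1 + K*n)) for journal rank n."""
    return y0 * (1.0 - alpha * K * n / (1.0 + K * n))

def zero_rank(alpha, K):
    """Rank at which y_n reaches zero; finite only when alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("y_n stays positive for every finite rank when alpha <= 1")
    return 1.0 / (K * (alpha - 1.0))

# Hypothetical parameters, chosen only for illustration
y0, K = 100.0, 0.1
for n in (0, 10, 50):
    print(n, langmuir_rank_value(n, y0, alpha=1.0, K=K))
print(zero_rank(alpha=1.2, K=K))
```

With α = 1 the value falls towards zero only as n grows without bound, whereas with α > 1 it vanishes at a finite rank.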

Two studies are utilized to test a revised version of Guerrero, Andersen, Eloy, Spitzberg, and Jorgensen's (1995) communicative responses to jealousy (CRJ) scale and to examine how measures from the CRJ associate with relational satisfaction. Study 1 uses exploratory factor analysis to identify a preliminary factor structure. Study 2 uses confirmatory factor analysis to determine whether this factor structure holds across a second sample, as well as structural equation modeling to test hypotheses regarding the associations between communicative responses to jealousy and relational satisfaction. These studies suggest that there are 11 specific communicative responses to jealousy that fall under four superordinate categories: (a) destructive communication, which consists of negative communication, counter-jealousy induction, and violence; (b) constructive communication, which includes integrative communication and compensatory restoration; (c) avoidance, which comprises silence and denial; and (d) rival-focused communication, which includes signs of possession, surveillance, rival contacts, and derogation of the rival. Destructive communication and, to a lesser extent, rival-focused communication associated negatively with relational satisfaction, whereas constructive communication associated positively. Recommendations for using the CRJ scale in future studies are provided.

Conclusion. No reasonable person could argue against learning to read. The point of this article is that learning to read is not just a matter of mastering a few simple skills, nor is literacy just a matter of passing a reading test. Learning to read must involve acquiring the reading habit. Literacy must be viewed as the regular exercise of reading skills through reading books. The time-honored reasons why children should read books are now bolstered and supplemented by new research evidence that book reading can make a unique and powerful contribution to children's reading development. Our society, then, must provide all possible encouragement and opportunity for children to read books. Access to books is a necessary condition for becoming a good reader. Reading itself is the key to literacy. Helping America's children build lifelong reading habits must now be regarded as a true national priority.

“Education…has produced a vast population able to read but unable to distinguish what is worth reading” (George Macaulay Trevelyan, English Social History)

“Good habits gather by unseen degrees, As brooks make rivers, rivers run to seas.” (John Dryden, Ovid, Metamorphoses)

Professor Richard C. Anderson is the center's director. Their research assistant

From a sociolinguistic and discourse-analytic perspective, news stories have often been considered as operating within a similar structural framework to oral narratives (Labov, 1972), sharing formal elements with narratives produced in other contexts (although, as Bell (1991) has demonstrated in relation to print news, these elements occur in temporal disorganization). In this paper, in line with other recent treatments of news stories, we suggest that news does not conform to this kind of “narrative” structure as such. Examining data taken from print and live-broadcast TV news through a Sacksian (1995) lens, we argue that it is possible to simplify the analysis of news structure by approaching the news as “stories,” where the story elements are organized around the notions of category, action, and reason rather than as a series of narrative clauses involving orientation, complicating actions, evaluation, and resolution (Bell, 1991; van Dijk, 1988).

From work to text to document
The defining trope for the humanities in the last 30 years has been typified by the move from “work” to “text.” The signature text defining this move has been Roland Barthes's seminal essay, “From Work to Text.” But the current move in library, archival and information studies toward the “document” as the key term offers challenges for contemporary humanities research. In making our own movement from work to text to document, we can explicate fully the complexity of conducting archival humanistic research within disciplinary and institutional contexts in the twenty-first century. This essay calls for a complex perspective, one that demands that we understand that the raw materials of scholarship are processed by disciplines, by institutions, and by the work of the scholar. When we understand our materials as constrained by disciplines, we understand them as “works.” When we understand them as constrained by the institutions of memory that preserve and grant access to them, we understand them as “documents.” And when we understand them as the ground for our own interpretive activity, we understand them as “texts.” When we understand that humanistic scholarship requires an awareness of all three perspectives simultaneously (an understanding demonstrated by case studies in historical studies of the discipline of rhetoric), we will be ready for a richer historical scholarship as well as a richer collaboration between humanists and archivists.
