20 similar documents found; search time: 31 ms
1.
Classifying Amharic webnews  Total citations: 1 (self-citations: 1, other: 0)
Lars Asker, Atelach Alemu Argaw, Björn Gambäck, Samuel Eyassu Asfeha, Lemma Nigussie Habte 《Information Retrieval》2009, 12(3): 416-435
We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the
second most spoken Semitic language in the world (after Arabic) and used for countrywide communication in Ethiopia. It is
highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic
news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched
language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose
of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to
put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps
(SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined
categories of news items, and then at clustering it around query content, when taking 16 queries as class labels. The third
set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification
performance. We compared three representations while constructing classification models based on bagging of decision trees
for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation
using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing
between various categories actually is contained in the nouns, while stemming did not have much effect on the performance
of the classifier.
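As an illustration only, a minimal sketch of the kind of bagged decision-tree classification compared here, contrasting a full-text representation with a nouns-only one; the toy corpus, labels, and hand-picked noun set are assumptions standing in for the authors' Amharic data and POS tagger:

```python
# Sketch: bagging of decision trees over two document representations
# (full text vs. nouns only). Toy data only.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = ["election results announced in addis ababa",
        "football team wins regional cup match",
        "parliament debates new election law",
        "cup final attracts record football crowd"]
labels = ["politics", "sport", "politics", "sport"]

def nouns_only(text):
    # Placeholder for a real POS tagger: keep a hand-picked noun set.
    nouns = {"election", "results", "team", "cup", "parliament",
             "law", "football", "crowd", "match"}
    return " ".join(w for w in text.split() if w in nouns)

for name, corpus in [("full text", docs),
                     ("nouns only", [nouns_only(d) for d in docs])]:
    model = make_pipeline(CountVectorizer(),
                          BaggingClassifier(DecisionTreeClassifier(),
                                            n_estimators=10, random_state=0))
    acc = cross_val_score(model, corpus, labels, cv=2).mean()
    print(f"{name}: accuracy {acc:.2f}")
```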
2.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
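As an illustration only, a minimal sketch of the general idea of propagating term counts through a document graph; the toy graph, neighbor weights, and the mixing weight alpha are assumptions, not the paper's algorithm:

```python
# Sketch: one iteration of term-count propagation over a document graph.
# counts[d] holds term counts; neighbors[d] holds (neighbor, weight)
# pairs with weights summing to 1. alpha is an assumed mixing weight.
from collections import defaultdict

counts = {"d1": {"retrieval": 3, "model": 1},
          "d2": {"smoothing": 2, "model": 2},
          "d3": {"retrieval": 1, "language": 2}}
neighbors = {"d1": [("d2", 0.7), ("d3", 0.3)],
             "d2": [("d1", 1.0)],
             "d3": [("d1", 1.0)]}

def propagate(counts, neighbors, alpha=0.5):
    new_counts = {}
    for d, c in counts.items():
        nc = defaultdict(float)
        for term, n in c.items():          # keep (1 - alpha) of own counts
            nc[term] += (1 - alpha) * n
        for nb, w in neighbors[d]:         # pull alpha-weighted neighbor counts
            for term, n in counts[nb].items():
                nc[term] += alpha * w * n
        new_counts[d] = dict(nc)
    return new_counts

smoothed = propagate(counts, neighbors)    # iterate to reach remoter documents
print(smoothed["d1"])                      # d1 now has mass on "smoothing"
```

Applying `propagate` repeatedly is what lets counts from remotely related documents reach a given document, the key difference from one-shot neighborhood smoothing noted in the abstract.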
3.
Previous studies have shown that weeding a library collection benefits patrons and increases circulation rates. However, the time required to review the collection and make weeding decisions presents a formidable obstacle. This study empirically evaluated methods for automatically classifying weeding candidates. A data set containing 80,346 items from a large-scale weeding project running from 2011 to 2014 at Wesleyan University was used to train six machine learning classifiers to predict a weeding decision of either ‘Keep’ or ‘Weed’ for each candidate. The study found statistically significant agreement (p = 0.001) between classifier predictions and librarian judgments for all classifier types. The naive Bayes and linear support vector machine classifiers had the highest recall (fraction of items weeded by librarians that were identified by the algorithm), while the k-nearest-neighbor classifier had the highest precision (fraction of recommended candidates that librarians had chosen to weed). The variables found to be most relevant were: librarian and faculty votes for retention, item age, and the presence of copies in other libraries.
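As an illustration only, a minimal sketch comparing the precision and recall of the classifier families named above on a keep/weed task; the data is synthetic and the feature construction is an assumption modeled on the variables the study found relevant:

```python
# Sketch: comparing classifiers on a synthetic keep/weed decision using
# the feature types the study found most relevant (retention votes,
# item age, copies in other libraries).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.integers(0, 5, n),     # retention votes
                     rng.integers(1, 60, n),    # item age (years)
                     rng.integers(0, 50, n)])   # copies in other libraries
# Assumed synthetic rule: unwanted, old, widely-held items get weeded.
y = ((X[:, 0] == 0) & (X[:, 1] > 20) & (X[:, 2] > 10)).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
for name, clf in [("naive Bayes", GaussianNB()),
                  ("linear SVM", LinearSVC()),
                  ("kNN", KNeighborsClassifier())]:
    pred = clf.fit(Xtr, ytr).predict(Xte)
    print(f"{name}: precision {precision_score(yte, pred, zero_division=0):.2f}, "
          f"recall {recall_score(yte, pred, zero_division=0):.2f}")
```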
4.
5.
Hierarchical Text Categorization (HTC) is the task of generating (usually by means of supervised learning algorithms) text classifiers that operate on hierarchically
structured classification schemes. Notwithstanding the fact that most large-sized classification schemes for text have a hierarchical
structure, so far the attention of text classification researchers has mostly focused on algorithms for “flat” classification,
i.e. algorithms that operate on non-hierarchical classification schemes. These algorithms, once applied to a hierarchical
classification problem, are not capable of taking advantage of the information inherent in the class hierarchy, and may thus
be suboptimal, in terms of efficiency and/or effectiveness. In this paper we propose TreeBoost.MH, a multi-label HTC algorithm consisting of a hierarchical variant of AdaBoost.MH, a very well-known member of the family of “boosting” learning algorithms. TreeBoost.MH embodies several intuitions that had arisen before within HTC: e.g. the intuitions that both feature selection and the selection
of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification
scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting
round should likewise be updated “locally”. All these intuitions are embodied within TreeBoost.MH in an elegant and simple way, i.e. by defining TreeBoost.MH as a recursive algorithm that uses AdaBoost.MH as its base step, and that recurs over the tree structure. We present the results of experimenting with TreeBoost.MH on three HTC benchmarks, and discuss its computational cost analytically.
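TreeBoost.MH itself is defined recursively over AdaBoost.MH; as a rough illustration of that recursive, per-node structure only, here is a generic hierarchical classifier that trains one boosted learner per internal node and routes documents down the tree. The toy hierarchy, data, and use of scikit-learn's AdaBoost are assumptions, not the authors' algorithm:

```python
# Rough sketch of recursive, per-node training over a class hierarchy.
# Each internal node gets its own boosted classifier that routes a
# document to one of its children; leaves are final categories.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

tree = {"root": ["sports", "science"],
        "science": ["physics", "biology"]}        # assumed toy hierarchy
docs = {"sports": ["match cup goal team"],
        "physics": ["quantum particle energy"],
        "biology": ["cell gene protein"]}

vec = TfidfVectorizer().fit([t for ts in docs.values() for t in ts])

def train_node(node):
    """Recursively train one "local" classifier per internal node."""
    children = tree.get(node)
    if not children:                               # leaf: nothing to train
        return None
    X, y = [], []
    for child in children:                         # local training examples
        for leaf, texts in docs.items():
            if leaf == child or leaf in tree.get(child, []):
                X += texts
                y += [child] * len(texts)
    clf = AdaBoostClassifier(n_estimators=50).fit(vec.transform(X).toarray(), y)
    return {"clf": clf, "children": {c: train_node(c) for c in children}}

def classify(model, text):
    """Route a document down the hierarchy to a leaf category."""
    while model is not None:
        label = model["clf"].predict(vec.transform([text]).toarray())[0]
        model = model["children"][label]
        leaf = label
    return leaf

root = train_node("root")
print(classify(root, "gene protein cell"))         # -> biology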
6.
To overcome the limited generalization ability of a single classification algorithm in focused crawling when faced with multi-topic Web crawling and classification tasks, this work designs an ensemble built from several strong classification algorithms: the focused crawler evaluates and ranks the classifiers online according to the current topic task and selects the best-performing one for classification. Classification experiments were carried out over multiple topic crawling tasks, comparing the accuracy of each individual algorithm with the average classification accuracy of the ensemble, together with a combined analysis of evaluation measures such as classification efficiency. The results show that the strategy mitigates domain locality and generalizes well.
7.
Due to the heavy use of gene synonyms in biomedical text, people have tried many query expansion techniques using synonyms
in order to improve performance in biomedical information retrieval. However, mixed results have been reported. The main challenge
is that it is not trivial to assign appropriate weights to the added gene synonyms in the expanded query; under-weighting
of synonyms would not bring much benefit, while over-weighting some unreliable synonyms can hurt performance significantly.
So far, there has been no systematic evaluation of various synonym query expansion strategies for biomedical text. In this
work, we propose two different strategies to extend a standard language modeling approach for gene synonym query expansion
and conduct a systematic evaluation of these methods on all the available TREC biomedical text collections for ad hoc document
retrieval. Our experimental results show that synonym expansion can significantly improve retrieval accuracy. However, different
query types require different synonym expansion methods, and appropriate weighting of gene names and synonym terms is critical
for improving performance.
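As an illustration of the weighting issue only, a minimal sketch that interpolates original query terms with their synonyms in a query language model under a tunable synonym weight; the interpolation form and toy synonym list are assumptions, not the paper's method:

```python
# Sketch: expanded query model p(w|q) in which the synonyms of a gene
# name receive a fraction `beta` of that term's probability mass.
def expand_query(query_terms, synonyms, beta=0.3):
    p = {}
    share = 1.0 / len(query_terms)
    for t in query_terms:
        syns = synonyms.get(t, [])
        if syns:
            p[t] = p.get(t, 0.0) + share * (1 - beta)
            for s in syns:                 # split beta across the synonyms
                p[s] = p.get(s, 0.0) + share * beta / len(syns)
        else:
            p[t] = p.get(t, 0.0) + share
    return p

# Assumed toy synonym list for one gene name; probabilities sum to 1.
print(expand_query(["p53", "pathway"], {"p53": ["TP53", "trp53"]}))
```

Setting `beta` too low reproduces under-weighting (expansion has little effect); setting it too high reproduces over-weighting (unreliable synonyms dominate the query model).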
8.
T. Couto N. Ziviani P. Calado M. Cristo M. Gonçalves E. S. de Moura W. Brandão 《Information Retrieval》2010,13(4):315-345
Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve
the precision of web searching, or help the interactions between user and search engines. In this paper we explore how linkage
information inherent to different document collections can be used to enhance the effectiveness of classification algorithms.
We have experimented with three link-based bibliometric measures, co-citation, bibliographic coupling and Amsler, on three
different document collections: a digital library of computer science papers, a web directory and an on-line encyclopedia.
Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on
a kNN classifier. In one of the test collections used, we obtained improvements of up to 69.8% of macro-averaged F1 over the traditional text-based kNN classifier, considered as the baseline measure in our experiments. We also present alternative ways of combining bibliometric
based classifiers with text based classifiers. Finally, we conducted studies to analyze the situation in which the bibliometric-based
classifiers failed and show that in such cases it is hard to reach consensus regarding the correct classes, even for human
judges.
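For reference, the three link-based measures can be stated with $P(d)$ the set of documents citing $d$ and $R(d)$ the set of documents cited by $d$; the Jaccard-style normalization shown here is one common formulation, not necessarily the exact one used in the paper:

```latex
\begin{align*}
\mathrm{cocitation}(d_1,d_2) &= \frac{|P(d_1) \cap P(d_2)|}{|P(d_1) \cup P(d_2)|}\\
\mathrm{coupling}(d_1,d_2)   &= \frac{|R(d_1) \cap R(d_2)|}{|R(d_1) \cup R(d_2)|}\\
\mathrm{Amsler}(d_1,d_2)     &= \frac{|(P(d_1)\cup R(d_1)) \cap (P(d_2)\cup R(d_2))|}{|(P(d_1)\cup R(d_1)) \cup (P(d_2)\cup R(d_2))|}
\end{align*}
```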
9.
Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the
retrieval of either short or long documents and, thus, a length-based correction needs to be applied for avoiding any length
bias. In Language Modeling for Information Retrieval, smoothing methods are applied to move probability mass from document
terms to unseen words, which is often dependent upon document length. In this article, we perform an in-depth study of this
behavior, characterized by the document length retrieval trends, of three popular smoothing methods across a number of factors,
and its impact on the length of documents retrieved and retrieval performance. First, we theoretically analyze the Jelinek–Mercer,
Dirichlet prior and two-stage smoothing strategies and, then, conduct an empirical analysis. In our analysis we show how Dirichlet
prior smoothing caters for document length more appropriately than Jelinek–Mercer smoothing which leads to its superior retrieval
performance. In a follow up analysis, we posit that length-based priors can be used to offset any bias in the length retrieval
trends stemming from the retrieval formula derived by the smoothing technique. We show that the performance of Jelinek–Mercer
smoothing can be significantly improved by using such a prior, which provides a natural and simple alternative to decouple
the query and document modeling roles of smoothing. With the analysis of retrieval behavior conducted in this article, it
is possible to understand why Dirichlet prior smoothing performs better than Jelinek–Mercer, and why the performance
of the Jelinek–Mercer method is improved by including a length-based prior.
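For reference, the two smoothing methods at the center of this analysis are standardly defined, with $c(w;d)$ the count of term $w$ in document $d$, $|d|$ the document length, and $p(w \mid C)$ the collection language model:

```latex
p_{\mathrm{JM}}(w \mid d)  = (1-\lambda)\,\frac{c(w;d)}{|d|} + \lambda\, p(w \mid C)
\qquad
p_{\mathrm{Dir}}(w \mid d) = \frac{c(w;d) + \mu\, p(w \mid C)}{|d| + \mu}
```

The Dirichlet form gives an effective interpolation weight of $\mu/(|d|+\mu)$ that shrinks as documents grow longer, which is exactly the length dependence this article analyzes; Jelinek–Mercer applies the same weight $\lambda$ regardless of length.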
10.
Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models).
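As an illustration only, a minimal sketch of the three embedding-training parameters the study varies (context window, dimensionality, negative samples), here using gensim's Word2Vec; the toy sentences stand in for a background corpus of tweets:

```python
# Sketch: training word embeddings while setting the three parameters
# the study varies. `sentences` stands in for a background corpus.
from gensim.models import Word2Vec

sentences = [["vote", "election", "candidate", "poll"],
             ["debate", "campaign", "vote", "ballot"]] * 100

model = Word2Vec(sentences,
                 vector_size=200,   # embedding dimensionality
                 window=10,         # context window size
                 negative=5,        # number of negative samples
                 sg=1,              # skip-gram
                 min_count=1)
print(model.wv["vote"].shape)       # (200,)
```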
11.
Markus Schedl 《Information Retrieval》2012,15(3-4):183-217
Different term weighting techniques such as TF·IDF or BM25 have been used intensely for manifold text-based information retrieval tasks. Their use for modeling term profiles for named entities and subsequent calculation of similarities between these named entities have been studied to a much smaller extent. The recent trend of microblogging made available massive amounts of information about almost every topic around the world. Therefore, microblogs represent a valuable source for text-based named entity modeling. In this paper, we present a systematic and comprehensive evaluation of different term weighting measures, normalization techniques, query schemes, index term sets, and similarity functions for the task of inferring similarities between named entities, based on data extracted from microblog posts. We analyze several thousand combinations of choices for the above mentioned dimensions, which influence the similarity calculation process, and we investigate in which way they impact the quality of the similarity estimates. Evaluation is performed using three real-world data sets: two collections of microblogs related to music artists and one related to movies. For the music collections, we present results of genre classification experiments using as benchmark genre information from allmusic.com. For the movie collection, we present results of multi-class classification experiments using as benchmark categories from IMDb. We show that microblogs can indeed be exploited to model named entity similarity with remarkable accuracy, provided the correct settings for the analyzed aspects are used. We further compare the results to those obtained when using Web pages as data source.
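As an illustration only, a minimal sketch of the basic pipeline the paper evaluates in many variants: build a TF·IDF term profile per named entity from its microblog posts and compare profiles with cosine similarity. The entities and posts are placeholders:

```python
# Sketch: TF-IDF term profiles for named entities, aggregated from
# microblog posts and compared via cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = {                                  # assumed toy data
    "Artist A": "new album tour guitar rock stage",
    "Artist B": "rock concert guitar album release",
    "Artist C": "film premiere director cast movie",
}
names = list(posts)
profiles = TfidfVectorizer().fit_transform([posts[n] for n in names])

sim = cosine_similarity(profiles)
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} ~ {b}: {sim[i, j]:.2f}")
```

The paper's dimensions (weighting measure, normalization, query scheme, index term set, similarity function) correspond to swapping out the vectorizer settings and the similarity function in a pipeline of this shape.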
12.
Automatically extracting semantic information from Chinese patent texts using graph representations  Total citations: 1 (self-citations: 0, other: 1)
[Purpose/Significance] This paper proposes using graph-structured representations to automatically mine semantic information from Chinese patent texts, providing semantic support for content-based intelligent patent analysis. [Method/Process] Two graph-structured models are designed: (1) a keyword-based text graph model, and (2) a dependency-tree-based text graph model. The first is defined by computing similarity relations between keywords; the second is defined by the grammatical relations extracted from each sentence. In a case study, a frequent subgraph mining algorithm is applied to the constructed graphs, and text classifiers that use the mined subgraphs as features are built to test the expressiveness and effectiveness of the graph models. [Results/Conclusion] Applied to patent text datasets from four different technical fields and compared with classical text classifiers, the graph-based classifiers improve classification performance by 2.1%-10.5% while using markedly fewer features. This suggests that the semantic information extracted from patent texts by combining graph representations with graph mining techniques is effective and supports further patent text analysis.
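As an illustration of the first of the two models only (a keyword-based text graph), here is a minimal sketch in which nodes are keywords and edges connect pairs whose similarity exceeds a threshold; the Jaccard similarity over co-occurring documents and the threshold are assumptions, not the paper's definitions:

```python
# Sketch: a keyword text graph. Nodes are keywords; an edge links two
# keywords whose (assumed) Jaccard similarity over the documents they
# co-occur in exceeds a threshold.
import networkx as nx

docs = [{"battery", "electrode", "charge"},
        {"battery", "charge", "circuit"},
        {"circuit", "sensor", "signal"}]

keywords = set().union(*docs)
occurs = {k: {i for i, d in enumerate(docs) if k in d} for k in keywords}

G = nx.Graph()
G.add_nodes_from(keywords)
for a in keywords:
    for b in keywords:
        if a < b:
            jac = len(occurs[a] & occurs[b]) / len(occurs[a] | occurs[b])
            if jac >= 0.5:                     # assumed threshold
                G.add_edge(a, b, weight=jac)
print(sorted(G.edges))
```

Frequent subgraphs mined from graphs of this kind then serve as classification features, as described in the abstract.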
13.
Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR’ed
to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving
Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different
correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error
rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available,
then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction
with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to
be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction
can minimize the need for morphologically sensitive error correction.
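As an illustration only, a minimal sketch of the character 3-gram index terms that the paper finds robust for OCR-degraded Arabic text; the example words are illustrative:

```python
# Sketch: character n-gram index terms. An OCR error that corrupts one
# character leaves most n-grams of a word intact, which is why short
# character n-grams retrieve degraded text reasonably well.
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("كتاب"))          # Arabic "book" -> overlapping 3-grams
print(char_ngrams("retrieval"))     # ['ret', 'etr', 'tri', ...]
```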
14.
Qinglei Wang Yanan Qian Ruihua Song Zhicheng Dou Fan Zhang Tetsuya Sakai Qinghua Zheng 《Information Retrieval》2013,16(4):484-503
Web search queries are often ambiguous or faceted, and the task of identifying the major underlying senses and facets of queries has received much attention in recent years. We refer to this task as query subtopic mining. In this paper, we propose to use surrounding text of query terms in top retrieved documents to mine subtopics and rank them. We first extract text fragments containing query terms from different parts of documents. Then we group similar text fragments into clusters and generate a readable subtopic for each cluster. Based on the cluster and the language model trained from a query log, we calculate three features and combine them into a relevance score for each subtopic. Subtopics are finally ranked by balancing relevance and novelty. Our evaluation experiments with the NTCIR-9 INTENT Chinese Subtopic Mining test collection show that our method significantly outperforms a query log based method proposed by Radlinski et al. (2010) and a search result clustering based method proposed by Zeng et al. (2004) in terms of precision, I-rec, D-nDCG and D#-nDCG, the official evaluation metrics used at the NTCIR-9 INTENT task. Moreover, our generated subtopics are significantly more readable than those generated by the search result clustering method.
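As an illustration only, a minimal sketch of the middle steps of such a pipeline: cluster query-term contexts into candidate subtopics, then rank with a relevance/novelty trade-off. The cluster-size relevance proxy and MMR-style diversification are assumed simplifications, not the paper's three-feature model:

```python
# Sketch: cluster query-term contexts into candidate subtopics, score
# by cluster size (relevance proxy), then diversify greedily.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

fragments = ["apple iphone price drop", "apple iphone new model",
             "apple pie recipe easy", "apple pie baking tips",
             "apple stock market shares", "apple shares earnings"]
X = TfidfVectorizer().fit_transform(fragments)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One candidate per cluster: the fragment closest to the centroid.
candidates = []
for c in range(3):
    members = [i for i, l in enumerate(km.labels_) if l == c]
    sims = cosine_similarity(X[members], km.cluster_centers_[c:c + 1]).ravel()
    candidates.append((len(members), members[int(sims.argmax())]))

# Greedy rank: relevance (cluster size) minus similarity to already
# selected subtopics (novelty).
ranked = []
while candidates:
    score = lambda c: c[0] - max((cosine_similarity(X[c[1]], X[r])[0, 0]
                                  for r in ranked), default=0.0)
    best = max(candidates, key=score)
    candidates.remove(best)
    ranked.append(best[1])
print([fragments[i] for i in ranked])
```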
15.
Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents  Total citations: 1 (self-citations: 1, other: 0)
Bassam H. Hammo 《Information Retrieval》2009,12(3):300-323
16.
Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, José R. Paramá 《Information Retrieval》2007, 10(1): 1-33
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress
natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed
text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged
Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler
and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search
and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching
only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster
than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they
are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants,
which are not so close to the optimal size.
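As an illustration of the End-Tagged Dense Code idea only: the word ranked i by frequency gets a byte sequence whose final byte has its high bit set, so codeword boundaries are self-delimiting and the compressed text can be searched directly. This is a simplified encoder/decoder pair written for clarity, not the authors' implementation:

```python
# Sketch: End-Tagged Dense Code. Continuation bytes lie in [0, 127];
# the final byte of each codeword has the high bit set (>= 128), so
# codewords are self-delimiting in the byte stream.
def etdc_encode(rank):
    # Codeword length k: 128 one-byte codes, 128^2 two-byte codes, ...
    k, block = 1, 128
    while rank >= block:
        rank -= block
        block *= 128
        k += 1
    digits = []
    for _ in range(k):                 # write rank as k base-128 digits
        digits.append(rank % 128)
        rank //= 128
    digits.reverse()
    digits[-1] |= 0x80                 # tag the last byte
    return bytes(digits)

def etdc_decode_stream(data):
    rank, block, offset = 0, 128, 0
    for byte in data:
        rank = rank * 128 + (byte & 0x7F)
        if byte & 0x80:                # end tag: emit one word rank
            yield rank + offset
            rank, block, offset = 0, 128, 0
        else:
            offset += block
            block *= 128

print(etdc_encode(0), etdc_encode(127), etdc_encode(128))
print(list(etdc_decode_stream(b"".join(map(etdc_encode, [0, 127, 128, 5000])))))
```

Because a high bit unambiguously ends each codeword, a pattern's encoded bytes can be matched directly against the compressed stream, which is the fast direct-search property the abstract refers to.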
17.
For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is common to also make use of the local term frequencies at the price of some added complication in the model. Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the exponential-family approximation of the DCM (EDCM), as well as support vector machines (SVM). Although it is usually claimed that incorporating local word frequency in a document improves text classification performance, we here test whether such claims are true or not. In this paper we show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as MM, DCM, EDCM and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word essentially add no useful information to a classifier.
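As an illustration only, a minimal sketch of the comparison being made: the same classifier trained on raw term counts versus binarized presence/absence features. Toy data, with scikit-learn's MultinomialNB standing in for the MM:

```python
# Sketch: multinomial naive Bayes on term counts vs. the same model on
# binarized (presence/absence) features, mirroring the paper's test of
# whether local term frequency helps.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

docs = ["ball goal goal match team", "match team win ball",
        "gene gene cell protein lab", "protein cell lab experiment"] * 5
labels = ["sport", "sport", "bio", "bio"] * 5

for name, binary in [("term counts", False), ("binarized", True)]:
    X = CountVectorizer(binary=binary).fit_transform(docs)
    acc = cross_val_score(MultinomialNB(), X, labels, cv=5).mean()
    print(f"{name}: accuracy {acc:.2f}")
```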
18.
Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating
heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual
data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is
designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for
inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification
process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases
error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially
useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular
problem on the World Wide Web.
19.
V. I. Starodubov, N. G. Kurakova, L. A. Tsvetkova, P. G. Aref’ev, A. F. Kurakov 《Scientific and Technical Information Processing》2012, 39(3): 139-152
The main vectors for reforming Russia's scientific-technological and innovative spheres, which are reflected in program documents that were adopted in late 2011 and early 2012, are discussed in the article. The authors propose to formalize the concepts that were used in these documents, such as world-class research competence and world-leading science and technology sectors. Using the Field Normalized Citation Score (NCSf) as a bibliometric indicator for the analysis of different research areas of Russia's clinical medicine, the authors show the degree of variance between individual research areas within the same national subject field in terms of their correspondence to world-class excellence. The authors also emphasize the necessity of developing Russia's national methodology for auditing Russian science, which will take its Russian-language segment into account where the world's recognized methodologies are not always adequately applicable.
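For reference, a field-normalized citation score of this kind is commonly defined as the ratio of a paper's citation count to the expected citation rate of its field (for the same publication year and document type); the exact baseline used by the authors may differ:

```latex
\mathit{NCS}_f = \frac{c}{\bar{e}_f}, \qquad
\bar{e}_f = \text{world-average citations per paper in field } f
```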
20.
The Combination of Text Classifiers Using Reliability Indicators  Total citations: 4 (self-citations: 0, other: 4)
The intuition that different text classifiers behave in qualitatively different ways has long motivated attempts to build a better metaclassifier via some combination of classifiers. We introduce a probabilistic method for combining classifiers that considers the context-sensitive reliabilities of contributing classifiers. The method harnesses reliability indicators—variables that provide signals about the performance of classifiers in different situations. We provide background, present procedures for building metaclassifiers that take into consideration both reliability indicators and classifier outputs, and review a set of comparative studies undertaken to evaluate the methodology.
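As an illustration of the general idea only, a minimal sketch of a metaclassifier whose inputs are the base classifiers' outputs plus a reliability-indicator feature; the indicator (document length), toy data, and logistic-regression combiner are placeholders, not the paper's probabilistic method:

```python
# Sketch: a metaclassifier over base-classifier outputs plus a
# reliability indicator (here, document length as a toy indicator).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ball goal match", "team win cup final", "gene cell",
        "protein lab cell test", "goal team", "lab gene protein"] * 5
labels = ["sport", "sport", "bio", "bio", "sport", "bio"] * 5

X = CountVectorizer().fit_transform(docs)
nb = MultinomialNB().fit(X, labels)
svm = LinearSVC().fit(X, labels)

# Meta-features: each base classifier's output plus an indicator that
# may signal when each classifier is reliable. (A real setup would
# derive the base outputs from held-out predictions, not training data.)
meta = np.column_stack([
    nb.predict_proba(X)[:, 0],          # NB posterior for one class
    svm.decision_function(X),           # SVM margin
    [len(d.split()) for d in docs],     # reliability indicator
])
combiner = LogisticRegression().fit(meta, labels)
print(combiner.predict(meta[:2]))
```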