首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 250 毫秒
1.
Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI’s and Rocchio’s notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI’s motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.  相似文献   

2.
陈立华 《现代情报》2010,30(3):26-28,31
潜在语义分析是自然语言使用于情报检索系统的理论基础,以此理论建构的空间向量模型是评判检索系统性能优良与否的知识工具。阐述了潜在语义标引(LSI)的基本内容、LSI下影响自然语言检索查准率的因素及向量空间模型检索软件的运行机制。此评述对网络化的情报检索技术的发展起到了一定的参考作用。  相似文献   

3.
The vector space model (VSM) is a textual representation method that is widely used in documents classification. However, it remains to be a space-challenging problem. One attempt to alleviate the space problem is by using dimensionality reduction techniques, however, such techniques have deficiencies such as losing some important information. In this paper, we propose a novel text classification method that neither uses VSM nor dimensionality reduction techniques. The proposed method is a space efficient method that utilizes the first order Markov model for hierarchical Arabic text classification. For each category and sub-category, a Markov chain model is prepared based on the neighboring characters sequences. The prepared models are then used for scoring documents for classification purposes. For evaluation, we used a hierarchical Arabic text data collection that contains 11,191 documents that belong to eight topics distributed into 3-levels. The experimental results show that the Markov chains based method significantly outperforms the baseline system that employs the latent semantic indexing (LSI) method. That is, the proposed method enhances the F1-measure by 3.47%. The novelty of this work lies on the idea of decomposing words into sequences of characters, which found to be a promising approach in terms of space and accuracy. Based on our best knowledge, this is the first attempt to conduct research for hierarchical Arabic text classification with such relatively large data collection.  相似文献   

4.
This paper reports our experimental investigation into the use of more realistic concepts as opposed to simple keywords for document retrieval, and reinforcement learning for improving document representations to help the retrieval of useful documents for relevant queries. The framework used for achieving this was based on the theory of Formal Concept Analysis (FCA) and Lattice Theory. Features or concepts of each document (and query), formulated according to FCA, are represented in a separate concept lattice and are weighted separately with respect to the individual documents they present. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a set of significant concepts to help their retrieval. The learning strategy used was based on relevance feedback information that makes the similarity of relevant documents stronger and non-relevant documents weaker. Test results obtained on the Cranfield collection show a significant increase in average precisions as the system learns from experience.  相似文献   

5.
信息环境的异构性、动态性与海量性使传统基于自然文本的信息检索方法与技术面临极大挑战,集成概念空间理论与潜在语义索引技术能为这种困境提供一些解决方案.在分析概念空间内涵与特征的基础上,利用潜在语义索引原理讨论了概念提取方法、同义词近义词处理方法及基准向量的生成方法,分析了网络条件下基于概念空间的文本分类、聚类检索基本机制,最后给出了完善概念空间的自学习机制.  相似文献   

6.
Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.  相似文献   

7.
传统信息检索方法忽视了文档结构对信息检索过程的影响.本文提出了一种改进的基于文档结构的信息检索方法,该方法首先使用第一类特征域对检索文档集进行过滤,然后使用第二类特征域进行匹配排序;引入AHP方法动态确定各特征域的重要性权重因子;最后使用向量内积计算的方法合成总相似度值.实验结果表明该方法可以提高信息检索的查准率和检索结果的排序合理性.  相似文献   

8.
Pseudo-relevance feedback (PRF) is a well-known method for addressing the mismatch between query intention and query representation. Most current PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents, thus possibly leading to a semantic gap between query representation and document representation. In this work, a PRF framework that combines relevance matching and semantic matching is proposed to improve the quality of feedback documents. Specifically, in the first round of retrieval, we propose a reranking mechanism in which the information of the exact terms and the semantic similarity between the query and document representations are calculated by bidirectional encoder representations from transformers (BERT); this mechanism reduces the text semantic gap by using the semantic information and improves the quality of feedback documents. Then, our proposed PRF framework is constructed to process the results of the first round of retrieval by using probability-based PRF methods and language-model-based PRF methods. Finally, we conduct extensive experiments on four Text Retrieval Conference (TREC) datasets. The results show that the proposed models outperform the robust baseline models in terms of the mean average precision (MAP) and precision P at position 10 (P@10), and the results also highlight that using the combined relevance matching and semantic matching method is more effective than using relevance matching or semantic matching alone in terms of improving the quality of feedback documents.  相似文献   

9.
针对基于语义网络的图像检索中的过反馈问题,本文提出了将语义网络转换为语义矩阵,采用基于多用户相关反馈的投票机制,可有效地克服过反馈.在此基础上提出了一种新颖的新图添加策略和处理语义向量查询的有效方法,最后指出了还需进一步研究的问题.  相似文献   

10.
Relation classification is one of the most fundamental tasks in the area of cross-media, which is essential for many practical applications such as information extraction, question&answer system, and knowledge base construction. In the cross-media semantic retrieval task, in order to meet the needs of cross-media uniform representation and semantic analysis, it is necessary to analyze the semantic potential relationship and construct semantic-related cross-media knowledge graph. The relationship classification technology is an important part of solving semantic correlation classification. Most of existing methods regard relation classification as a multi-classification task, without considering the correlation between different relationships. However, two relationships in the opposite directions are usually not independent of each other. Hence, this kind of relationships are easily confused in the traditional way. In order to solve the problem of confusing the relationships of the same semantic with different entity directions, this paper proposes a neural network fusing discrimination information for relation classification. In the proposed model, discrimination information is used to distinguish the relationship of the same semantic with different entity directions, the direction of entity in space is transformed into the direction of vector in mathematics by the method of entity vector subtraction, and the result of entity vector subtraction is used as discrimination information. The model consists of three modules: sentence representation module, relation discrimination module and discrimination fusion module. Moreover, two fusion methods are used for feature fusion. One is a Cascade-based feature fusion method, and another is a feature fusion method based on convolution neural network. In addition, this paper uses the new function added by cross-entropy function and deformed Max-Margin function as the loss function of the model. The experimental results show that the proposed discriminant feature is effective in distinguishing confusing relationships, and the proposed loss function can improve the performance of the model to a certain extent. Finally, the proposed model achieves 84.8% of the F1 value without any additional features or NLP analysis tools. Hence, the proposed method has a promising prospect of being incorporated in various cross-media systems.  相似文献   

11.
基于机器学习的图像检索机制的研究   总被引:1,自引:1,他引:1  
主要针对当前基于低水平特征的图像检索机制不能捕获图像语义的状况 ,讨论了使用长期学习的方法来学习用户的相关反馈 ,以此推断语义 ,构造语义空间 ,并结合短期学习方法 ,通过运用学习监督机制来推断用户信息需求 ,优化查询 ,逐步提高搜索引擎的性能  相似文献   

12.
In this paper, we introduce a new collection selection strategy to be operated in search engines with document partitioned indexes. Our method involves the selection of those document partitions that are most likely to deliver the best results to the formulated queries, reducing the number of queries that are submitted to each partition. This method employs learning algorithms that are capable of ranking the partitions, maximizing the probability of recovering documents with high gain. The method operates by building vector representations of each partition on the term space that is spanned by the queries. The proposed method is able to generalize to new queries and elaborate document lists with high precision for queries not considered during the training phase. To update the representations of each partition, our method employs incremental learning strategies. Beginning with an inversion test of the partition lists, we identify queries that contribute with new information and add them to the training phase. The experimental results show that our collection selection method favorably compares with state-of-the-art methods. In addition our method achieves a suitable performance with low parameter sensitivity making it applicable to search engines with hundreds of partitions.  相似文献   

13.
We present an image retrieval framework based on automatic query expansion in a concept feature space by generalizing the vector space model of information retrieval. In this framework, images are represented by vectors of weighted concepts similar to the keyword-based representation used in text retrieval. To generate the concept vocabularies, a statistical model is built by utilizing Support Vector Machine (SVM)-based classification techniques. The images are represented as "bag of concepts" that comprise perceptually and/or semantically distinguishable color and texture patches from local image regions in a multi-dimensional feature space. To explore the correlation between the concepts and overcome the assumption of feature independence in this model, we propose query expansion techniques in the image domain from a new perspective based on both local and global analysis. For the local analysis, the correlations between the concepts based on the co-occurrence pattern, and the metrical constraints based on the neighborhood proximity between the concepts in encoded images, are analyzed by considering local feedback information. We also analyze the concept similarities in the collection as a whole in the form of a similarity thesaurus and propose an efficient query expansion based on the global analysis. The experimental results on a photographic collection of natural scenes and a biomedical database of different imaging modalities demonstrate the effectiveness of the proposed framework in terms of precision and recall.  相似文献   

14.
Similarity search with hashing has become one of the fundamental research topics in computer vision and multimedia. The current researches on semantic-preserving hashing mainly focus on exploring the semantic similarities between pointwise or pairwise samples in the visual space to generate discriminative hash codes. However, such learning schemes fail to explore the intrinsic latent features embedded in the high-dimensional feature space and they are difficult to capture the underlying topological structure of data, yielding low-quality hash codes for image retrieval. In this paper, we propose an ordinal-preserving latent graph hashing (OLGH) method, which derives the objective hash codes from the latent space and preserves the high-order locally topological structure of data into the learned hash codes. Specifically, we conceive a triplet constrained topology-preserving loss to uncover the ordinal-inferred local features in binary representation learning. By virtue of this, the learning system can implicitly capture the high-order similarities among samples during the feature learning process. Moreover, the well-designed latent subspace learning is built to acquire the noise-free latent features based on the sparse constrained supervised learning. As such, the latent under-explored characteristics of data are fully employed in subspace construction. Furthermore, the latent ordinal graph hashing is formulated by jointly exploiting latent space construction and ordinal graph learning. An efficient optimization algorithm is developed to solve the resulting problem to achieve the optimal solution. Extensive experiments conducted on diverse datasets show the effectiveness and superiority of the proposed method when compared to some advanced learning to hash algorithms for fast image retrieval. The source codes of this paper are available at https://github.com/DarrenZZhang/OLGH .  相似文献   

15.
Searching for relevant material that satisfies the information need of a user, within a large document collection is a critical activity for web search engines. Query Expansion techniques are widely used by search engines for the disambiguation of user’s information need and for improving the information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback, are the main QE techniques, that employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to bring more relevant documents and for filtering documents that contain search terms but with a different meaning (also known as word polysemy problem) than the user intended. This work, surveys existing query expansion techniques, highlights their strengths and limitations and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that the application of knowledge or corpus-based query expansion techniques on the results of the relevance feedback step improves the information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.  相似文献   

16.
潘晓  段鑫星 《情报科学》2021,39(7):131-135
【目的/意义】针对当前中小企业情报收集系统模型收集情报的准确性、信息检索查全率以及情报分类管理 效率较低的问题,提出基于LDA及模糊VIKOR法的中小企业情报收集系统模型构建。【方法/过程】根据LDA模型 设计并构建中小企业情报收集系统模型架构,通过企业管理架构采集知识资源,将获取的知识分别划分至管理架 构相应模块中,实现企业知识整合管理。根据模糊VIKOR法设计了中小企业情报分类步骤,引入贝叶斯统计的标 准法,获取最佳主题数量,采用Gibbs抽样算法得出分类隐含层主题集合概率整体分布的向量,实现中小企业情报 收集系统分类管理。【结果/结论】实验结果表明,该系统的准确性较高,能够有效提高情报分类管理效率以及信息 检索查全率。【创新/局限】本文采用LDA模型整合管理企业知识,结合模糊VIKOR法分类管理企业情报收集,构建 准确高效的系统模型,但本文构建的系统模型未应用于实际企业中进行反馈与完善。  相似文献   

17.
This paper describes our novel retrieval model that is based on contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes into account of the document contexts instead of implicitly using the document contexts to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques as found in language models to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show that whether the model obeys the probability ranking principle. Our model is promising as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with aggregate relevance principle were effective in combining the estimated preferences, and (b) that estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.  相似文献   

18.
This paper presents an investigation about how to automatically formulate effective queries using full or partial relevance information (i.e., the terms that are in relevant documents) in the context of relevance feedback (RF). The effects of adding relevance information in the RF environment are studied via controlled experiments. The conditions of these controlled experiments are formalized into a set of assumptions that form the framework of our study. This framework is called idealized relevance feedback (IRF) framework. In our IRF settings, we confirm the previous findings of relevance feedback studies. In addition, our experiments show that better retrieval effectiveness can be obtained when (i) we normalize the term weights by their ranks, (ii) we select weighted terms in the top K retrieved documents, (iii) we include terms in the initial title queries, and (iv) we use the best query sizes for each topic instead of the average best query size where they produce at most five percentage points improvement in the mean average precision (MAP) value. We have also achieved a new level of retrieval effectiveness which is about 55–60% MAP instead of 40+% in the previous findings. This new level of retrieval effectiveness was found to be similar to a level using a TREC ad hoc test collection that is about double the number of documents in the TREC-3 test collection used in previous works.  相似文献   

19.
宁琳 《现代情报》2014,34(1):155-158
跨语言检索是一种重要的信息检索手段之一。为了提高跨语言检索效率,采用语义扩展的方法,通过分析其设计思想和工作流程,构建出一种基于语义扩展的跨语言自动检索模型,重点对其语义扩展、知识库和结果聚类等设计进行了阐述,提出了语义理解切分法的分词方法,采用了Single-Pass算法进行聚类,实验结果表明,该模型能有效提高跨语言检索的查全率和查准率。  相似文献   

20.
The text retrieval method using latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term–document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号