共查询到20条相似文献,搜索用时 750 毫秒
1.
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experiment results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text. 相似文献
2.
文本信息检索技术进展和性能评价框架 总被引:6,自引:0,他引:6
曾民族 《现代图书情报技术》1997,13(3):14-18
本文介绍TREC 评价信息检索系统的动向及其在推动研制新型检索系统中所起的作用, 并介绍新型检索系统的模式和特征及国外商品化全文文本检索系统性能评测指标。文中探讨了文本信息检索的性质评价标准问题, 并提出一个中文文本信息检索的系统评价框架。 相似文献
3.
4.
Improving English and Chinese Ad-Hoc Retrieval: A Tipster Text Phase 3 Project Report 总被引:1,自引:1,他引:0
K.L. Kwok 《Information Retrieval》2000,3(4):313-338
Both English and Chinese ad-hoc information retrieval were investigated in this Tipster 3 project. Part of our objectives is to study the use of various term level and phrasal level evidence to improve retrieval accuracy. For short queries, we studied five term level techniques that together can lead to good improvements over standard ad-hoc 2-stage retrieval for TREC5-8 experiments. For long queries, we studied the use of linguistic phrases to re-rank retrieval lists. Its effect is small but consistently positive.For Chinese IR, we investigated three simple representations for documents and queries: short-words, bigrams and characters. Both approximate short-word segmentation or bigrams, augmented with characters, give highly effective results. Accurate word segmentation appears not crucial for overall result of a query set. Character indexing by itself is not competitive. Additional improvements may be obtained using collection enrichment and combination of retrieval lists.Our PIRCS document-focused retrieval is also shown to have similarity with a simple language model approach to IR. 相似文献
5.
国内中文自动分词技术研究综述 总被引:22,自引:0,他引:22
认为分词是文本自动分类、信息检索、信息过滤、文献自动标引、摘要自动生成等中文信息处理的基础与关键技术之一,中文本身复杂性及语言规则的不确定性,使中文分词技术成为分词技术中的难点.全面归纳中文分词算法、歧义消除、未登录词识别、自动分词系统等研究,总结出当前中文分词面临的难点与研究热点. 相似文献
6.
7.
一种面向中文信息检索的汉语自动分词方法 总被引:3,自引:1,他引:3
孙巍 《现代图书情报技术》2006,1(7):33-36
阐述信息检索对汉语分词技术的要求,分析中文信息检索与汉语分词技术结合过程中有待解决的关键问题,并重点针对这些要求及关键问题提出一种面向中文信息检索的汉语自动分词方法。 相似文献
8.
全文检索系统由三大功能模块组成:索引模块、检索模块和存储模块。本文着重分析系统组成和XML数据库的设计、建立倒排索引文件、中文分词等技术难点。同时在此基础之上建立基于Lucene/XML的期刊文献全文检索系统。 相似文献
9.
10.
中文全文检索技术的研究及实现 总被引:9,自引:0,他引:9
本文设计了一个中文全文检索系统 ,在单汉字全文数据库的基础之上进行了全文检索的算法研究 ,提出了针对特定检索策略的计算公式。同时还对检索结果集的排序问题进行了讨论 ,并采用用户反馈信息量 ,使最后检出的结果在应用中不断得到优化 相似文献
11.
论第四种情报检索语言系统 总被引:7,自引:0,他引:7
第四种情报检索语言是自然语言与人工语言结合的一体化语言。第四种情报检索语言系统是一种基于网络的信息检索系统 ,比分类主题一体化情报检索语言系统更高级更新颖 ,是我国 2 1世纪情报检索语言系统研究的方向。加快我国第四种情报检索语言系统研究的关键 ,是解决汉语分词技术问题。参考文献 14。 相似文献
12.
采用Visual studio.NET 开发平台,使用C#程序设计语言以及XML知识描述和数据存储,对网络专题知识组织和知识元自动抽取系统进行开发设计。对该系统的文本信息预处理、快速汉字结合自增长分词、词频全文精确统计等重要功能的设计与实现进行了深入研究。 相似文献
13.
Cross-language information retrieval (CLIR) has so far been studied with the assumption that some rich linguistic resources such as bilingual dictionaries or parallel corpora are available. But creation of such high quality resources is labor-intensive and they are not always at hand. In this paper we investigate the feasibility of using only comparable corpora for CLIR, without relying on other linguistic resources. Comparable corpora are text documents in different languages that cover similar topics and are often naturally attainable (e.g., news articles published in different languages at the same time period). We adapt an existing cross-lingual word association mining method and incorporate it into a language modeling approach to cross-language retrieval. We investigate different strategies for estimating the target query language models. Our evaluation results on the TREC Arabic–English cross-lingual data show that the proposed method is effective for the CLIR task, demonstrating that it is feasible to perform cross-lingual information retrieval with just comparable corpora. 相似文献
14.
基于词序方法的文本相似度计算模型 总被引:1,自引:0,他引:1
针对传统向量空间模型对文本相似度的计算未考虑词序导致偏差的问题,提出使用马尔可夫模型的状态转移矩阵、两两文本的最长公共子序列以及它们的所有公共子串信息来描述词序信息,在此基础上提出一种将马尔可夫状态转移矩阵、最长公共子序列、公共子串和TF-IDF相结合,兼顾词序和词频信息的文本相似度计算方法,并使用英文TREC-9的部分数据集对基于词序方法的文本相似度计算方法进行了测试.试验结果表明:在同等分词及评估条件下,基于词序方法的文本相似度计算结果的准确率相对于单纯采用传统的基于向量空间模型的TF-IDF方法提高了5%~15%. 相似文献
15.
The application of word sense disambiguation (WSD) techniques to information retrieval (IR) has yet to provide convincing retrieval results. Major obstacles to effective WSD in IR include coverage and granularity problems of word sense inventories, sparsity of document context, and limited information provided by short queries. In this paper, to alleviate these issues, we propose the construction of latent context models for terms using latent Dirichlet allocation. We propose building one latent context per word, using a well principled representation of local context based on word features. In particular, context words are weighted using a decaying function according to their distance to the target word, which is learnt from data in an unsupervised manner. The resulting latent features are used to discriminate word contexts, so as to constrict query’s semantic scope. Consistent and substantial improvements, including on difficult queries, are observed on TREC test collections, and the techniques combines well with blind relevance feedback. Compared to traditional topic modeling, WSD and positional indexing techniques, the proposed retrieval model is more effective and scales well on large-scale collections. 相似文献
16.
17.
单汉字标引方法的改进研究 总被引:2,自引:1,他引:1
本文根据信息论中的交互信息,给出了相邻汉字相关度的测量方法,在此基础上提出了基于字串预分割的单汉字标引检索方法,对当前具有代表性的单汉字标引方法进行了改进研究。试验证明本文提出的方法具有较好的性能 相似文献
18.
中文文本关键词自动抽取方法研究 总被引:6,自引:1,他引:5
随着信息技术的发展,中文电子文本信息资源正以惊人的速度急剧增长.文本自动处理技术,通过自动组织海量文献信息资源,能够为用户提供简易有效的信息检索服务.关键词自动抽取是文本自动处理的基础和核心.汉语的特殊性加剧了中文文本关键词自动抽取的难度.本文提出了一种基于N-gram权重计算和关键词筛选算法的中文文本关键词自动抽取方法.该方法不依赖特定的数据集和中文分词技术,可以有效地抽取出任意单篇文本的关键词,而且通过参数调整,应用系统可以灵活地控制标引深度和标引专指度.实验表明,该方法简单、快速、断词错误率低,标引性能明显优于基于中文分词和TF/IDF的方法,可以满足大规模文本的在线处理要求. 相似文献
19.
Due to the heavy use of gene synonyms in biomedical text, people have tried many query expansion techniques using synonyms
in order to improve performance in biomedical information retrieval. However, mixed results have been reported. The main challenge
is that it is not trivial to assign appropriate weights to the added gene synonyms in the expanded query; under-weighting
of synonyms would not bring much benefit, while overweighting some unreliable synonyms can hurt performance significantly.
So far, there has been no systematic evaluation of various synonym query expansion strategies for biomedical text. In this
work, we propose two different strategies to extend a standard language modeling approach for gene synonym query expansion
and conduct a systematic evaluation of these methods on all the available TREC biomedical text collections for ad hoc document
retrieval. Our experiment results show that synonym expansion can significantly improve the retrieval accuracy. However, different
query types require different synonym expansion methods, and appropriate weighting of gene names and synonym terms is critical
for improving performance.
相似文献
Chengxiang ZhaiEmail: |
20.
中文电子病历的分词及实体识别研究 总被引:1,自引:0,他引:1
[目的/意义]健康医疗大数据是我国重要的基础性战略资源,本研究对中文电子病历分词与实体识别的探讨与实证较好地完成了医疗数据的信息抽取任务,对今后医疗大数据在语义层面的应用发展具有重要意义。[方法/过程]本研究首先融合权威词表、官方标准、健康网站数据及其他医学补充词库构建了词语数量级达到10万的医学词表;然后对电子病历的字段进行分词,对比了jieba工具、导入词典后的jieba、无监督学习及AC自动机4种模型的分词效果;最后,以自动分词和人工标注结果为语料,实现基于条件随机场的电子病历实体识别研究,并比较不同实体类别以及不同文本特征下的实体识别效果,选出最优模板。[结果/结论]分词结果显示,AC自动机的效果最好,F值可达82%;实体识别结果表明,"检查"和"疾病"实体的识别效果最好,而"症状"的识别效果不太理想。 相似文献