Similar literature
 Found 18 similar documents (search time: 375 ms)
1.
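Research on a topic-specific information search system based on a genetic algorithm   Total citations: 1, self-citations: 0, citations by others: 1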
罗长寿  康丽  刘国靖 《现代情报》2009,29(3):176-178
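To address the problems of "disorientation" and "overload" in online information resources, this paper analyzes and applies genetic algorithms to build a topic-specific information search system consisting of three parts: a genetic-algorithm-based topical crawler, information processing, and a query service. Experimental results show that the system retrieves web pages that are highly relevant to the topic.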

2.
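To improve on Authorities and Hubs, the traditional Web-graph-based vertical search strategy, this paper proposes a heuristic vertical search strategy that combines web page content evaluation with the Web graph. In addition, a vector space model is introduced to judge the topic relevance of page content, further improving the accuracy of topical page downloads. Experiments show that the algorithm effectively improves the clustering of topical pages; the accuracy of the vertical search engine rises as more pages are downloaded and stabilizes once the number of downloaded pages reaches a certain level. The algorithm is fairly robust and can be applied in vertical search engine systems.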
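The vector space model relevance check mentioned in this abstract can be sketched as follows. This is only a minimal illustration, assuming raw term frequencies and cosine similarity; the function names and the 0.3 threshold are hypothetical, and the paper's actual weighting scheme is not given here.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.3):
    """Hypothetical filter: keep a page whose similarity to the topic passes the threshold."""
    return cosine_similarity(term_vector(page_text), term_vector(topic_text)) >= threshold
```

A crawler could call `is_on_topic` on each fetched page before storing it; a production system would more likely use TF-IDF weights and proper tokenization.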

3.
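To improve the efficiency with which users use websites and the search performance of website ontology models, this paper studies an efficient method for constructing a web page semantic concept tree and expanding search coverage layer by layer. Traditional methods expand searches with the word-similarity algorithms of search engines and reduce the formal context using rules, clustering, and similar techniques; they cannot effectively establish hypernym-hyponym relations between concepts and perform poorly. A layered search-coverage expansion method based on feature matching over a semantic topic tree is proposed. The method builds a Web semantic model and a topic tree, constructs a document term-frequency vector model over mutual-information regions of the feature space, and classifies and abstracts the attribute fields of database records to form concept aggregation points, from which the semantic topic tree is built and search coverage is expanded. A feature-matching algorithm for the semantic topic tree is then constructed to improve the search engine's sensitivity to text features and raise search coverage, achieving layered expansion of text search coverage. Experimental analysis shows that the method yields good text feature classification results and a clear semantic hierarchy, effectively improves recall and precision on text data, and has good application value.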

4.
袁红 《现代情报》2009,40(2):44-51
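[Purpose/Significance] A search strategy is the plan behind search behavior and the core of the search process, and it has long been an important topic in search behavior research. Exploring the patterns in how users apply and switch search strategies is important for optimizing IR system functions and improving users' information search efficiency. [Method/Process] The study defined 8 search tasks drawn from 4 search topics, recruited 30 participants, ran a search experiment, and coded the videos of search behavior. Based on frequency counts of the different search strategies, common patterns of user search strategy transition were constructed. [Results/Conclusions] Access and evaluation strategies are the common strategies in information search, whereas strategies such as modifying the query and learning are used less often. "Forward access → evaluate a single item" and "evaluate search results → forward access" are the most common strategy transition patterns, while transitions such as "forward access → explore" occur with very low probability. In addition, strategy use and strategy transitions differ considerably across the stages of a search, which provides detailed and useful guidance for IR system design.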

5.
万君  吴迪  赵宏霞 《现代情报》2014,34(12):7-11
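This paper takes the click intention of web search users as its research object, proposes a hypothesized model of the factors that influence users' intention to click on bid-ranked advertisements, and tests it empirically using structural equation modeling. The empirical study shows that ad position, content relevance, information richness, product familiarity, and search context all influence users' intention to click on bid-ranked advertisements to varying degrees, with content relevance having the greatest influence, whereas the relationship between preceding and following items has no significant effect on click intention.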

6.
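Research on document relevance based on ontological relatedness between terms   Total citations: 1, self-citations: 0, citations by others: 1
A method is proposed for computing document relevance based on the ontological relatedness between terms. The method uses a tree-structured ontology to compute ontology-based relatedness between terms, derives the ontological relatedness of two groups of words from the relatedness between term groups, and finally combines this with the weights of the documents' index terms to compute the relevance of two documents. From the ontology perspective, the new method incorporates semantic information into the traditional vector space model and improves the accuracy of document relevance computation. Using a computer-domain ontology as experimental data, the new and traditional methods were compared and evaluated comprehensively; the experimental results confirm the effectiveness and soundness of the new method.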

7.
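As Internet technology continues to develop, it is becoming increasingly difficult for users to collect and analyze web pages related to a specific topic. This paper presents a topic-oriented WWW information classification system (WICS), which efficiently collects web pages, classifies them, and then presents the search results to the user. Based on an analysis of typical search engines, the paper introduces Web text mining and applies it in the system. The prototype uses techniques and algorithms such as text preprocessing, indexing, inverted files, and vector space distance measures. Initial experiments show that classifying Web information with the prototype makes it much easier for users to obtain information and improves the relevance and precision of search results.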
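The inverted file named in this abstract can be illustrated with a small sketch. It assumes whitespace tokenization and boolean AND queries; the function names are hypothetical and are not taken from WICS.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

Storing postings per term lets a query touch only the documents that mention its terms instead of scanning the whole collection.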

8.
翁勍力  施水才  赵捧未 《情报杂志》2007,26(9):114-116,119
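To address the massive volume and lack of structure of the results returned by current search engines, a clustering-based mining engine built on meta-search is constructed. It aims to use the results returned by meta-search engines to improve the efficiency of search result clustering and to give users a structured view of the results quickly and effectively, enabling further knowledge discovery. The paper describes the main functions of search engines and mining engines and the differences between them, and applies the vector space model to process meta-search results. It introduces the main current clustering algorithms, K-means partitioning and hierarchical agglomerative clustering, and on that basis proposes a clustering method for meta-search results that combines the two.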

9.
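A game-theoretic model is built to show that, under asymmetric information, topic-specific search engines exhibit search inefficiency in which some high-quality information services cannot be reached. Taking an agricultural-product topic search engine as the experimental subject, the paper discusses information service strategies for resolving this inefficiency at three levels: the design of a customizable framework, the filtering of indexed information, and query suggestions.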

10.
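An algorithm based on the standard shuffled frog leaping algorithm is proposed for extracting the optimal path in edge-local search of ASP database script programs. When extracting this path during ASP database information exchange, path search is likened to the position updates of frogs foraging for food. A search acceleration factor is introduced into the search strategy within each memeplex, which improves the algorithm's global search ability to some extent, and information from the local best individual, the local worst individual, and the global best individual is used to improve the edge-local optimal path search algorithm. Simulation results show that the algorithm substantially reduces time and space costs and improves the speedup ratio. It escapes local optima well and converges quickly, and extracting the optimal search path improves information registration during information exchange, enabling reliable and effective data communication in ASP data exchange.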

11.
Topic distillation is one of the main information needs when users search the Web. Previous approaches to topic distillation treat a single page as the basic searching unit and do not fully utilize the structural information of the Web. In this paper, we propose a novel concept for topic distillation, named sub-site retrieval, in which the basic searching unit is a sub-site instead of a single page. A sub-site is a subset of a website, consisting of a structural collection of pages. The keys to sub-site retrieval are (1) extracting effective features for the representation of a sub-site using both content and structure information, and (2) delivering the sub-site-based retrieval results with a friendly and informative user interface. For the first point, we propose the Punished Integration algorithm, which is based on modeling the growth of websites. For the second point, we design a user interface to better illustrate the search results of sub-site retrieval. In tests on the topic distillation task of TREC 2003 and 2004, sub-site retrieval leads to a significant improvement in retrieval performance over previous methods based on single pages. Furthermore, time complexity analysis shows that sub-site retrieval can be integrated into the index component of search engines.

12.
The widespread availability of the Internet and the variety of Internet-based applications have resulted in a significant increase in the number of web pages. Determining the behaviors of search engine users has become a critical step in enhancing search engine performance. Search engine user behaviors can be determined by content-based or content-ignorant algorithms. Although many content-ignorant studies have been performed to automatically identify new topics, previous results have demonstrated that spelling errors can cause significant errors in topic shift estimates. In this study, we focused on minimizing the number of wrong estimates caused by spelling errors. We developed a new hybrid algorithm combining character n-gram and neural network methodologies, and compared the experimental results with results from previous studies. For the FAST and Excite datasets, the proposed algorithm improved topic shift estimates by 6.987% and 2.639%, respectively. Moreover, we analyzed the performance of the character n-gram method in different respects, including a comparison with the Levenshtein edit-distance method. The experimental results demonstrated that the character n-gram method outperformed the Levenshtein edit-distance method in terms of topic identification.
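The two string measures compared in this abstract can be sketched side by side. This is an illustration only: it assumes padded character trigrams scored with the Dice coefficient, which is one common choice, and it does not reproduce the paper's hybrid n-gram and neural network algorithm.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

A misspelled query such as "serach" still shares several trigrams with "search", so either measure can flag the two queries as the same topic rather than a topic shift.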

13.
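This paper describes in detail the architecture of a vertical search engine for computer education resources, focusing on the crawling strategy of the topical crawler, the topic relevance algorithm, and the design strategy of the topic lexicon. Experimental results show that the maximum response time of Heritrix in the software system is 0.563 seconds, and that both the query precision and the precision of the topic relevance algorithm exceed 60%, so the system can be applied to the Web.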

14.
Around the world, millions of people use the Internet every day for information retrieval. Even for small tasks such as fixing a fan, cooking food, or ironing clothes, people choose to search the web. To meet these information needs, there are billions of web pages scattered across the web, each with a different degree of relevance to the topic of interest (TOI), but this huge size makes manual information retrieval impossible. The page ranking algorithm is an integral part of a search engine, as it arranges the web pages associated with a queried TOI in order of their relevance. It therefore plays an important role in determining search quality and the user experience of information retrieval. PageRank, HITS, and SALSA are well-known page ranking algorithms based on link structure analysis of a seed set, but the rankings they produce are not yet efficient. In this paper, we propose a variant of SALSA, sNorm(p), for the efficient ranking of web pages. Our approach uses a p-norm from the vector norm family in a novel way, because vector norms can efficiently reduce the impact of low authority weights in the hub weight calculation. Our study then compares the rankings given by PageRank, HITS, SALSA, and sNorm(p) to the same pages for the same query. The effectiveness of the proposed approach over state-of-the-art methods is shown using the performance measures Mean Reciprocal Rank (MRR), Precision, Mean Average Precision (MAP), Discounted Cumulative Gain (DCG), and Normalized DCG (NDCG). The experiments are performed on a dataset obtained by pre-processing the results collected from the first few pages retrieved for a query by the Google search engine. Thirty queries are designed based on the type and amount of domain expertise at hand. The extensive evaluation and result analysis use MRR, Precision, MAP, DCG, and NDCG as the statistical performance metrics. Furthermore, the results are statistically verified using a significance test. The findings show that our approach outperforms state-of-the-art methods, attaining an MRR of 0.8666 and a MAP of 0.7957, and thus ranks web pages more efficiently than its counterparts.
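Three of the metrics named in this abstract can be computed as in the sketch below. It assumes each query's results are given as a ranked list of graded relevance values and uses the common log2(rank + 1) discount for DCG; the abstract does not state which DCG variant the authors used, so that choice is an assumption.

```python
import math


def mean_reciprocal_rank(ranked_relevance_lists):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance_lists)


def dcg(rels):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(rels):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

For example, `mean_reciprocal_rank([[1, 0], [0, 1]])` averages 1/1 and 1/2, and `ndcg` reaches 1.0 only when the results are already sorted by relevance.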

15.
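Based on an analysis of P2P user behavior, an automatic mechanism is proposed that distinguishes how much attention a user pays to different topic areas and computes each neighbor node's ability to retrieve documents related to each topic area of interest. Efficiency is improved by forwarding query messages to the k neighbor nodes with the strongest query ability for the given area. The mechanism can also distinguish a user's typical behavior from impromptu behavior and applies different strategies to further improve the efficiency of impromptu queries.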
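The forwarding step described here, picking the k neighbors with the strongest query ability for a topic, can be sketched as below. The nested-dictionary layout of the capability scores and the function name are hypothetical; how the scores are estimated is the paper's contribution and is not modeled.

```python
import heapq


def select_top_k_neighbors(capability, topic, k):
    """Pick the k neighbors with the highest estimated capability for a topic.

    capability: dict mapping neighbor id -> dict of topic -> score
    (a hypothetical layout chosen for this sketch).
    """
    scored = ((scores.get(topic, 0.0), neighbor) for neighbor, scores in capability.items())
    return [neighbor for _, neighbor in heapq.nlargest(k, scored)]
```

Neighbors with no recorded score for the topic are treated as having capability 0.0, so they are chosen only when fewer than k neighbors have any score.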

16.
李志义 《现代情报》2011,31(10):31-35
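A web crawler's page-fetching and optimization strategies directly affect the breadth and depth of page collection, the volume of page preprocessing, and the quality of the search engine. Search engine design should give full consideration to page traversal strategies and should also strengthen research on crawler optimization strategies. This paper proposes five optimization strategies for web crawlers, covering topic focus, priority collection, non-duplicate collection, page revisiting, and distributed crawling, which offer guidance and inspiration for crawler design.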
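Two of the five strategies, priority collection and non-duplicate collection, can be combined in a single crawl frontier. The sketch below is a generic illustration, assuming priorities are supplied by the caller; the class and method names are hypothetical and do not come from the paper.

```python
import heapq


class CrawlFrontier:
    """Priority queue of URLs that never hands out the same URL twice."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def add(self, url, priority):
        """Queue a URL unless it was already queued; higher priority pops first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the highest-priority URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A topical crawler could pass a relevance score as the priority, so that pages judged closer to the topic are fetched first.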

17.
The Web, and especially the major Web search engines, are essential tools for many people in the quest to locate online information. This paper reports results from research that examines characteristics and changes in Web searching from nine studies of five Web search engines based in the US and Europe. We compare interactions between users and Web search engines from the perspectives of session length, query length, query complexity, and content viewed. The results of our research show that (1) users are viewing fewer result pages, (2) searchers on US-based Web search engines use more query operators than searchers on European-based search engines, (3) there are statistically significant differences in the use of Boolean operators and result pages viewed, and (4) one cannot necessarily apply results from studies of one particular Web search engine to another Web search engine. The widespread use of Web search engines, the employment of simple queries, and the decreased viewing of result pages may have resulted from algorithmic enhancements by Web search engine companies. We discuss the implications of the findings for the development of Web search engines and the design of online content.

18.
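Traditional genetic algorithms often perform unsatisfactorily on complex problems with very large search spaces. The authors analyze the difficulties that traditional genetic algorithms may encounter on high-dimensional multimodal problems and, based on the causes of these difficulties, design a parallel distributed genetic algorithm based on PVM, with improvements to fitness evaluation and to the crossover and mutation operators, aiming to strengthen the algorithm's global search ability and speed up convergence. To verify the effectiveness of these measures, a multimodal function was tested in several respects under high-dimensional conditions; the experimental results show that the measures are effective.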

