首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
通过对广西师范大学门户网站的网页进行分层搜索,本文组建了一个以网页为节点、超链为边的有向网络,发现它的出八度分布均呈现良好的幂律分布特性,出度分布在开始部分出现弯曲.根据这些特性,本文提出了一种含有增长,择优和重连机制的有向网络演化模型,模型的演化机制较好的反应了广西师范大学网的这些特性.  相似文献   

2.
以净化网页、提取网页主题内容为目标,提出一个基于网页规划布局的网页主题内容抽取算法。该算法依据原始网页的规划布局,通过构造标签树为网页分块分类,进而通过计算内容块的主题相关度,辨别网页主题,剔除不相关信息,提取网页主题内容。实验表明,算法适用于主题型网页的“去噪”及内容提取,具体应用中有较理想的表现。  相似文献   

3.
A fast and efficient page ranking mechanism for web crawling and retrieval remains as a challenging issue. Recently, several link based ranking algorithms like PageRank, HITS and OPIC have been proposed. In this paper, we propose a novel recursive method based on reinforcement learning which considers distance between pages as punishment, called “DistanceRank” to compute ranks of web pages. The distance is defined as the number of “average clicks” between two pages. The objective is to minimize punishment or distance so that a page with less distance to have a higher rank. Experimental results indicate that DistanceRank outperforms other ranking algorithms in page ranking and crawling scheduling. Furthermore, the complexity of DistanceRank is low. We have used University of California at Berkeley’s web for our experiments.  相似文献   

4.
This paper talks about several schemes for improving retrieval effectiveness that can be used in the named page finding tasks of web information retrieval (Overview of the TREC-2002 web track. In: Proceedings of the Eleventh Text Retrieval Conference TREC-2002, NIST Special Publication #500-251, 2003). These methods were applied on top of the basic information retrieval model as additional mechanisms to upgrade the system. Use of the title of web pages was found to be effective. It was confirmed that anchor texts of incoming links was beneficial as suggested in other works. Sentence–query similarity is a new type of information proposed by us and was identified to be the best information to take advantage of. Stratifying and re-ranking the retrieval list based on the maximum count of index terms in common between a sentence and a query resulted in significant improvement of performance. To demonstrate these facts a large-scale web information retrieval system was developed and used for experimentation.  相似文献   

5.
随着网络的飞速发展,网页数量急剧膨胀,近几年来更是以指数级进行增长,搜索引擎面临的挑战越来越严峻,很难从海量的网页中准确快捷地找到符合用户需求的网页。网页分类是解决这个问题的有效手段之一,基于网页主题分类和基于网页体裁分类是网页分类的两大主流,二者有效地提高了搜索引擎的检索效率。网页体裁分类是指按照网页的表现形式及其用途对网页进行分类。介绍了网页体裁的定义,网页体裁分类研究常用的分类特征,并且介绍了几种常用特征筛选方法、分类模型以及分类器的评估方法,为研究者提供了对网页体裁分类的概要性了解。  相似文献   

6.
7.
In the whole world, the internet is exercised by millions of people every day for information retrieval. Even for a small to smaller task like fixing a fan, to cook food or even to iron clothes persons opt to search the web. To fulfill the information needs of people, there are billions of web pages, each having a different degree of relevance to the topic of interest (TOI), scattered throughout the web but this huge size makes manual information retrieval impossible. The page ranking algorithm is an integral part of search engines as it arranges web pages associated with a queried TOI in order of their relevance level. It, therefore, plays an important role in regulating the search quality and user experience for information retrieval. PageRank, HITS, and SALSA are well-known page ranking algorithm based on link structure analysis of a seed set, but ranking given by them has not yet been efficient. In this paper, we propose a variant of SALSA to give sNorm(p) for the efficient ranking of web pages. Our approach relies on a p-Norm from Vector Norm family in a novel way for the ranking of web pages as Vector Norms can reduce the impact of low authority weight in hub weight calculation in an efficient way. Our study, then compares the rankings given by PageRank, HITS, SALSA, and sNorm(p) to the same pages in the same query. The effectiveness of the proposed approach over state of the art methods has been shown using performance measurement technique, Mean Reciprocal Rank (MRR), Precision, Mean Average Precision (MAP), Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG). The experimentation is performed on a dataset acquired after pre-processing of the results collected from initial few pages retrieved for a query by the Google search engine. Based on the type and amount of in-hand domain expertise 30 queries are designed. The extensive evaluation and result analysis are performed using MRR, [email protected], MAP, DCG, and NDCG as the performance measuring statistical metrics. Furthermore, results are statistically verified using a significance test. Findings show that our approach outperforms state of the art methods by attaining 0.8666 as MRR value, 0.7957 as MAP value. Thus contributing to the improvement in the ranking of web pages more efficiently as compared to its counterparts.  相似文献   

8.
Search engines play an essential role in the usability of Internet-based information systems and without them the web would certainly break down or, at the very least would develop at a much slower rate. Our main objective is to analyze and evaluate the retrieval effectiveness of various indexing and searching strategies on a new web text collection, using a rigorous evaluation methodology. Our second aim is to describe and evaluate different preprocessing techniques that might be implemented in order to improve retrieval effectiveness. As a third objective, this paper will evaluate whether or not hyperlinks may serve as useful sources of evidence in improving retrieval algorithms.  相似文献   

9.
Web信息检索系统中的网页质量分析方法评价   总被引:1,自引:0,他引:1  
李树青  崔慧智 《情报科学》2008,26(5):729-734
改进对高质量网页的检索精度,将会极大提高Web信息检索系统的用户满意度。首先提出了信息检索中的“有用性”指标,并据此论述了基于网页质量分析方法的Web信息检索模型,然后提出了网页质量直接测度指标和网页质量间接测度指标。最后,详细介绍了各种网页质量指标的相关研究内容和方法,并做出了针对性的评价。  相似文献   

10.
本文通过对网页结构和内容特征的深入分析和识别,对噪音网页的过滤方法进行研究和实验。首先利用阈值过滤具有明显特征的噪音网页,而后建立网页特征向量,利用SVM对网页进行分类。采用采集自Web的网页数据进行实验分析,最后得出研究结论,并展望下一步工作。  相似文献   

11.
朱学红  彭婷  谌金宇 《资源科学》2020,42(8):1489-1503
为了考察战略性关键金属贸易网络特征对产业结构升级的影响,本文基于复杂网络分析方法,以1996—2017年7种战略性关键金属贸易数据为基础,定量刻画了全球战略性关键金属贸易网络的拓扑特征,并从入度中心度、出度中心度、接近中心度、中间中心度和特征向量中心度,5个维度全新解构各国在贸易网络中的角色和地位,以此并构建面板回归模型,实证分析贸易网络特征对产业结构升级的影响。研究结果表明:①1996—2017年期间全球战略性关键金属贸易网络具有松散性和异质性,并呈现“小世界”特征;②就个体而言,中国、美国和德国是全球战略性关键金属贸易的重要参与者,在贸易网络中占据中心地位;③接近中心度和中间中心度对一国产业结构升级具有显著的促进作用,且这种影响在低收入国家表现更为突出,而出入度和特征向量中心度的影响则不显著。中国应采取更加开放的贸易政策,并提升在全球价值链中的分工地位,促进产业结构优化升级。  相似文献   

12.
针对主题搜索引擎反馈信息主题相关度低的问题,提出了将遗传算法与基于内容的空间向量模型相结合的搜索策略。利用空间向量模型确定网页与主题的相关度,并将遗传算法应用于相关度判别,提高主题信息搜索的准确率和查全率。在Heritrix框架基础上,利用Eclipse3.3实现了相应功能。实验结果表明,搜索策略改进后的系统抓取主题页面所占比例与原系统相比提高了约30%。  相似文献   

13.
The goal of the study presented in this article is to investigate to what extent the classification of a web page by a single genre matches the users’ perspective. The extent of agreement on a single genre label for a web page can help understand whether there is a need for a different classification scheme that overrides the single-genre labelling. My hypothesis is that a single genre label does not account for the users’ perspective. In order to test this hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects) asking them to assign only a single genre label to each of the web pages. Users could choose from a list of 21 genre labels, or select one of the two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale was to observe the level of agreement on a single genre label per web page, and draw some conclusions about the appropriateness of limiting the assignment to only a single label when doing genre classification of web pages. Results show that users largely disagree on the label to be assigned to a web page.  相似文献   

14.
介绍了网络监控系统的概念,并根据实践需要提出了一种适用于网络监控系统的网页分类技术。该网页分类技术是基于网站本身所具有的结构性,并通过URL充分表现这一特点提出来的。与传统的基于数据挖掘技术的网页分类技术有本质区别。该技术着重于实用性,实现算法只需要少量的计算机资源,是适合网络监控系统的一种网页分类技术。  相似文献   

15.
The exponential growth of information available on the World Wide Web, and retrievable by search engines, has implied the necessity to develop efficient and effective methods for organizing relevant contents. In this field document clustering plays an important role and remains an interesting and challenging problem in the field of web computing. In this paper we present a document clustering method, which takes into account both contents information and hyperlink structure of web page collection, where a document is viewed as a set of semantic units. We exploit this representation to determine the strength of a relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the most well known clustering algorithm as k-Means and Expectation Maximization.  相似文献   

16.
This paper is concerned with automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; however, in reality they are often bogus. It is advantageous if we can automatically extract titles from HTML documents. In this paper, we take a supervised machine learning approach to address the problem. We first propose a specification on HTML titles, that is, a ‘definition’ on HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (direct object model) Tree; in the other method, we utilize features based on vision. We also combine the two methods to further enhance the extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in largest font size as title (22.6–37.4% improvements in terms of F1 score). As application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML documents retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1–30.3% improvements).  相似文献   

17.
JavaScript传统上是单线程的,在HTML页面中执行一个需较长时间运行的脚本会阻塞所有的页面功能直至脚本完成。Web Worker是HTML5提供的JavaScript多线程解决方案。解析了Web Worker的工作原理和过程;提供了Web Worker代码示例和代码调试方法;说明了使用Web Worker如何提高Web应用的性能。由于Web Worker相对较新,目前关于Web Worker的示例和文献非常有限。该研究院提供了Web Worker的参考应用场景及进一步研究和应用的方向。  相似文献   

18.
A new dictionary-based text categorization approach is proposed to classify the chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information more exactly from web pages. After automatic segmentation on the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are finally assigned to the test document by using the k-NN text categorization algorithm. The effects of the characteristics of chemistry dictionary and test collection on the categorization efficiency are discussed in this paper, and a new voting method is also introduced to improve the categorization performance further based on the collection characteristics. The experimental results show that the proposed approach has the superior performance to the traditional categorization method and is applicable to the classification of chemical web pages.  相似文献   

19.
随着互联网的快速发展,恶意网页所造成的危害也越来越大。对典型恶意网页进行了分析与分类,通过对现有的恶意网页检测技术的比较分类,分析了各种检测技术的优缺点。  相似文献   

20.
Recreational queries from users searching for places to go and things to do or see are very common in web and mobile search. Users specify constraints for what they are looking for, like suitability for kids, romantic ambiance or budget. Queries like “restaurants in New York City” are currently served by static local results or the thumbnail carousel. More complex queries like “things to do in San Francisco with kids” or “romantic places to eat in Seattle” require the user to click on every element of the search engine result page to read articles from Yelp, TripAdvisor, or WikiTravel to satisfy their needs. Location data, which is an essential part of web search, is even more prevalent with location-based social networks and offers new opportunities for many ways of satisfying information seeking scenarios.In this paper, we address the problem of recreational queries in information retrieval and propose a solution that combines search query logs with LBSNs data to match user needs and possible options. At the core of our solution is a framework that combines social, geographical, and temporal information for a relevance model centered around the use of semantic annotations on Points of Interest with the goal of addressing these recreational queries. A central part of the framework is a taxonomy derived from behavioral data that drives the modeling and user experience. We also describe in detail the complexity of assessing and evaluating Point of Interest data, a topic that is usually not covered in related work, and propose task design alternatives that work well.We demonstrate the feasibility and scalability of our methods using a data set of 1B check-ins and a large sample of queries from the real-world. Finally, we describe the integration of our techniques in a commercial search engine.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号