首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 296 毫秒
1.
随着网络的飞速发展,网页数量急剧膨胀,近几年来更是以指数级进行增长,搜索引擎面临的挑战越来越严峻,很难从海量的网页中准确快捷地找到符合用户需求的网页。网页分类是解决这个问题的有效手段之一,基于网页主题分类和基于网页体裁分类是网页分类的两大主流,二者有效地提高了搜索引擎的检索效率。网页体裁分类是指按照网页的表现形式及其用途对网页进行分类。介绍了网页体裁的定义,网页体裁分类研究常用的分类特征,并且介绍了几种常用特征筛选方法、分类模型以及分类器的评估方法,为研究者提供了对网页体裁分类的概要性了解。  相似文献   

2.
This research is a part of ongoing study to better understand citation analysis on the Web. It builds on Kleinberg's research (J. Kleinberg, R. Kumar, P. Raghavan, P. Rajagopalan, A. Tomkins, Invited survey at the International Conference on Combinatorics and Computing, 1999) that hyperlinks between web pages constitute a web graph structure and tries to classify different web graphs in the new coordinate space: out-degree, in-degree. The out-degree coordinate is defined as the number of outgoing web pages from a given web page. The in-degree coordinate is the number of web pages that point to a given web page. In this new coordinate space a metric is built to classify how close or far are different web graphs. Kleinberg's web algorithm (J. Kleinberg, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 668–677) on discovering “hub web pages” and “authorities web pages” is applied in this new coordinate space. Some very uncommon phenomenon has been discovered and new interesting results interpreted. This study does not look at enhancing web retrieval by adding context information. It only considers web hyperlinks as a source to analyze citations on the web. The author believes that understanding the underlying web page as a graph will help design better web algorithms, enhance retrieval and web performance, and recommends using graphs as a part of visual aid for search engine designers.  相似文献   

3.
介绍了网络监控系统的概念,并根据实践需要提出了一种适用于网络监控系统的网页分类技术。该网页分类技术是基于网站本身所具有的结构性,并通过URL充分表现这一特点提出来的。与传统的基于数据挖掘技术的网页分类技术有本质区别。该技术着重于实用性,实现算法只需要少量的计算机资源,是适合网络监控系统的一种网页分类技术。  相似文献   

4.
A fast and efficient page ranking mechanism for web crawling and retrieval remains as a challenging issue. Recently, several link based ranking algorithms like PageRank, HITS and OPIC have been proposed. In this paper, we propose a novel recursive method based on reinforcement learning which considers distance between pages as punishment, called “DistanceRank” to compute ranks of web pages. The distance is defined as the number of “average clicks” between two pages. The objective is to minimize punishment or distance so that a page with less distance to have a higher rank. Experimental results indicate that DistanceRank outperforms other ranking algorithms in page ranking and crawling scheduling. Furthermore, the complexity of DistanceRank is low. We have used University of California at Berkeley’s web for our experiments.  相似文献   

5.
JavaScript传统上是单线程的,在HTML页面中执行一个需较长时间运行的脚本会阻塞所有的页面功能直至脚本完成。Web Worker是HTML5提供的JavaScript多线程解决方案。解析了Web Worker的工作原理和过程;提供了Web Worker代码示例和代码调试方法;说明了使用Web Worker如何提高Web应用的性能。由于Web Worker相对较新,目前关于Web Worker的示例和文献非常有限。该研究院提供了Web Worker的参考应用场景及进一步研究和应用的方向。  相似文献   

6.
7.
以净化网页、提取网页主题内容为目标,提出一个基于网页规划布局的网页主题内容抽取算法。该算法依据原始网页的规划布局,通过构造标签树为网页分块分类,进而通过计算内容块的主题相关度,辨别网页主题,剔除不相关信息,提取网页主题内容。实验表明,算法适用于主题型网页的“去噪”及内容提取,具体应用中有较理想的表现。  相似文献   

8.
时念云  杨晨  滕良娟 《情报科学》2006,24(12):1841-1844
传统黄页服务在知识描述方面采用的是基于语法层面的描述,而缺乏对话义的表示、处理等能力,这就导致了目前黄页服务质量低下的缺陷。该文提出使用语义Web技术和Web服务相结合的语义Web服务来解决该问题,即构造语义级别的黄页服务。最后说明了语义Web服务在黄页服务中的应用现状和设想。  相似文献   

9.
The goal of the study presented in this article is to investigate to what extent the classification of a web page by a single genre matches the users’ perspective. The extent of agreement on a single genre label for a web page can help understand whether there is a need for a different classification scheme that overrides the single-genre labelling. My hypothesis is that a single genre label does not account for the users’ perspective. In order to test this hypothesis, I submitted a restricted number of web pages (25 web pages) to a large number of web users (135 subjects) asking them to assign only a single genre label to each of the web pages. Users could choose from a list of 21 genre labels, or select one of the two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale was to observe the level of agreement on a single genre label per web page, and draw some conclusions about the appropriateness of limiting the assignment to only a single label when doing genre classification of web pages. Results show that users largely disagree on the label to be assigned to a web page.  相似文献   

10.
本文论述了Web用户访问模式挖掘中的数据预处理,主要提出了数据预处理中如何识别会话的一种改进算法。该方法通过使用三个因素来构造会话:①根据先验知识,确定会话时间阈值识别会话;②根据页面访问时间统计分布,确定相邻网页访问时间间隔阈值识别会话;③页面内容及站点结构确定页面重要程度识别会话。实验结果表明,相对于传统的单一方法进行会话识别的方法,该方法能够准确的识别会话,更为合理有效。  相似文献   

11.
The exponential growth of information available on the World Wide Web, and retrievable by search engines, has implied the necessity to develop efficient and effective methods for organizing relevant contents. In this field document clustering plays an important role and remains an interesting and challenging problem in the field of web computing. In this paper we present a document clustering method, which takes into account both contents information and hyperlink structure of web page collection, where a document is viewed as a set of semantic units. We exploit this representation to determine the strength of a relation between two linked pages and to define a relational clustering algorithm based on a probabilistic graph representation. The experimental results show that the proposed approach, called RED-clustering, outperforms two of the most well known clustering algorithm as k-Means and Expectation Maximization.  相似文献   

12.
One of the important problems in text classification is the high dimensionality of the feature space. Feature selection methods are used to reduce the dimensionality of the feature space by selecting the most valuable features for classification. Apart from reducing the dimensionality, feature selection methods have potential to improve text classifiers’ performance both in terms of accuracy and time. Furthermore, it helps to build simpler and as a result more comprehensible models. In this study we propose new methods for feature selection from textual data, called Meaning Based Feature Selection (MBFS) which is based on the Helmholtz principle from the Gestalt theory of human perception which is used in image processing. The proposed approaches are extensively evaluated by their effect on the classification performance of two well-known classifiers on several datasets and compared with several feature selection algorithms commonly used in text mining. Our results demonstrate the value of the MBFS methods in terms of classification accuracy and execution time.  相似文献   

13.
This paper is concerned with automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; however, in reality they are often bogus. It is advantageous if we can automatically extract titles from HTML documents. In this paper, we take a supervised machine learning approach to address the problem. We first propose a specification on HTML titles, that is, a ‘definition’ on HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (direct object model) Tree; in the other method, we utilize features based on vision. We also combine the two methods to further enhance the extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in largest font size as title (22.6–37.4% improvements in terms of F1 score). As application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML documents retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1–30.3% improvements).  相似文献   

14.
Web sites often provide the first impression of an organization. For many organizations, web sites are crucial to ensure sales or to procure services within. When a person opens a web site, the first impression is probably made in a few seconds, and the user will either stay or move on to the next site on the basis of many factors. One of the factors that may influence users to stay or go is the page aesthetics. Another reason may involve a user’s judgment about the site’s credibility. This study explores the possible link between page aesthetics and a user’s judgment of the site’s credibility. Our findings indicate that when the same content is presented using different levels of aesthetic treatment, the content with a higher aesthetic treatment was judged as having higher credibility. We call this the amelioration effect of visual design and aesthetics on content credibility. Our study suggests that this effect is operational within the first few seconds in which a user views a web page. Given the same content, a higher aesthetic treatment will increase perceived credibility.  相似文献   

15.
ASP.NET是微软公司的最新一种网页编程代码,它已逐渐成为网站建设者首选的网页编程语言。ASP.NET中连接数据库和操作数据库的就是ADO.NET,ADO.NET是与数据源交互的.NET技术。本文将通过分析ADO.NET,研究其在网站建设中的应用。  相似文献   

16.
王成 《人天科学研究》2011,(10):126-127
随着Internet的飞速发展,恶意网页已经成为影响网络安全的主要问题之一。Rootkit是一种基于Windows分层驱动模型的技术。介绍了一种基于Rootkit技术的恶意网页防护系统的设计,对恶意网页防护的研究具有一定的参考价值。  相似文献   

17.
In this paper, we propose a tensor factorization method, called CLASS-RESCAL, which associates the class labels of data samples with their latent representations. Specifically, we extend RESCAL to produce a semi-supervised factorization method that combines a classification error term with the standard factor optimization process. CLASS-RESCAL assimilates information from all the relations of the tensor, while also taking into account classification performance. This procedure forces the data samples within the same class to have similar latent representations. Experimental results on several real-world social network data indicate this is a promising approach for multi-relational classification tasks.  相似文献   

18.
In the whole world, the internet is exercised by millions of people every day for information retrieval. Even for a small to smaller task like fixing a fan, to cook food or even to iron clothes persons opt to search the web. To fulfill the information needs of people, there are billions of web pages, each having a different degree of relevance to the topic of interest (TOI), scattered throughout the web but this huge size makes manual information retrieval impossible. The page ranking algorithm is an integral part of search engines as it arranges web pages associated with a queried TOI in order of their relevance level. It, therefore, plays an important role in regulating the search quality and user experience for information retrieval. PageRank, HITS, and SALSA are well-known page ranking algorithm based on link structure analysis of a seed set, but ranking given by them has not yet been efficient. In this paper, we propose a variant of SALSA to give sNorm(p) for the efficient ranking of web pages. Our approach relies on a p-Norm from Vector Norm family in a novel way for the ranking of web pages as Vector Norms can reduce the impact of low authority weight in hub weight calculation in an efficient way. Our study, then compares the rankings given by PageRank, HITS, SALSA, and sNorm(p) to the same pages in the same query. The effectiveness of the proposed approach over state of the art methods has been shown using performance measurement technique, Mean Reciprocal Rank (MRR), Precision, Mean Average Precision (MAP), Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG). The experimentation is performed on a dataset acquired after pre-processing of the results collected from initial few pages retrieved for a query by the Google search engine. Based on the type and amount of in-hand domain expertise 30 queries are designed. The extensive evaluation and result analysis are performed using MRR, [email protected], MAP, DCG, and NDCG as the performance measuring statistical metrics. Furthermore, results are statistically verified using a significance test. Findings show that our approach outperforms state of the art methods by attaining 0.8666 as MRR value, 0.7957 as MAP value. Thus contributing to the improvement in the ranking of web pages more efficiently as compared to its counterparts.  相似文献   

19.
试论网页制作   总被引:8,自引:0,他引:8  
徐亚先 《情报科学》2001,19(2):182-184
本文论述了制作网页的意义,网页设计工具,网页设计的一般原则,网页设计的要素分析以及网页设计中应注意的问题。  相似文献   

20.
色彩是网页给人的第一印象,色彩运用的好坏直接影响人们对这个网页的认同与否,设计者在设计网页时除了考虑需要网页本身的特点外,还要遵循一定的艺术规律,从而设计出色彩鲜明、性格独特的网页。因此,如何在网页设计中运用色彩是网页设计的重中之重。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号