Similar documents
20 similar documents retrieved (search time: 512 ms)
1.
2.
3.
This paper proposes to use random walks (RW) to discover the properties of deep web data sources that are hidden behind searchable interfaces. These properties, such as the average degree and population size of both documents and terms, are of interest to the general public and find applications in business intelligence, data integration and deep web crawling. We show that simple RW sampling can outperform uniform random (UR) sampling even disregarding the high cost of UR sampling. We prove that in the idealized case where the degrees follow Zipf's law, the sample size of UR sampling needs to grow in the order of O(N/ln²N) with the corpus size N, while the sample size of RW sampling grows logarithmically. The Reuters corpus is used to demonstrate that term degrees resemble a power-law distribution, thus RW is better than UR sampling. On the other hand, document degrees follow a lognormal distribution and exhibit a smaller variance, so UR sampling is slightly better.
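As a rough illustration of the sampling idea (not the paper's exact estimators), the sketch below runs a simple random walk over a hypothetical document–term bipartite graph; because a random walk visits terms roughly in proportion to their degree, the harmonic mean of the sampled degrees estimates the average term degree. The toy corpus and the estimator choice are assumptions for illustration.

```python
# A minimal sketch, assuming a toy document-term bipartite graph.
import random
from collections import defaultdict

# Hypothetical corpus: document id -> set of terms it contains.
docs = {
    "d1": {"market", "stock", "price"},
    "d2": {"stock", "trade"},
    "d3": {"price", "trade", "index", "market"},
    "d4": {"index", "stock"},
}
terms = defaultdict(set)            # term -> documents containing it
for d, ts in docs.items():
    for t in ts:
        terms[t].add(d)

def random_walk_terms(steps, seed_doc="d1"):
    """Alternate document -> term -> document and record the visited terms."""
    doc, visited = seed_doc, []
    for _ in range(steps):
        term = random.choice(list(docs[doc]))   # follow a random outgoing edge
        visited.append(term)
        doc = random.choice(list(terms[term]))  # hop back to a random document
    return visited

sample = random_walk_terms(2000)
# Degree-proportional sampling: the harmonic mean of sampled degrees estimates
# the true average term degree (number of documents a term appears in).
est_avg_degree = len(sample) / sum(1.0 / len(terms[t]) for t in sample)
true_avg_degree = sum(len(ds) for ds in terms.values()) / len(terms)
print(f"estimated {est_avg_degree:.2f} vs true {true_avg_degree:.2f}")
```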

4.
5.
The non-citation rate refers to the proportion of papers that do not attract any citation over a period of time following their publication. After reviewing the related papers in the Web of Science, Google Scholar and Scopus databases, we find that the current literature on citation distributions focuses more on the distribution of the percentages and citations of papers receiving at least one citation, while there are fewer studies on the time-dependent patterns of the percentage of never-cited papers, on what distribution model can fit these patterns, and on the factors influencing the non-citation rate. Here, we perform an empirical pilot analysis of the time-dependent distribution of the percentages of never-cited papers in a series of consecutive citation time windows following publication in six sample journals, and study the influence of paper length on the chance of a paper getting cited. From this analysis we draw the following general conclusions: (1) a three-parameter negative exponential model fits the time-dependent distribution curve of the percentages of never-cited papers well; (2) in the initial citation time window, the percentage of never-cited papers in each journal is very high, but as the citation time window widens, this percentage drops rapidly at first and then more slowly, and the total decline for most journals is very large; (3) for wide citation time windows, the percentage of never-cited papers for each journal approaches a stable value, after which there are very few changes unless a large number of "Sleeping Beauties" type papers appear; (4) the length of a paper has a great influence on whether it will be cited or not.
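For concreteness, here is a minimal sketch (with made-up data points) of fitting the kind of three-parameter negative exponential model mentioned in conclusion (1), using scipy.optimize.curve_fit; the functional form p(t) = a·exp(−b·t) + c and the numbers are assumptions, not the paper's data.

```python
# A minimal sketch with hypothetical data, not the paper's results.
import numpy as np
from scipy.optimize import curve_fit

def neg_exp(t, a, b, c):
    """Three-parameter negative exponential decay."""
    return a * np.exp(-b * t) + c

t = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)   # window width (years)
p = np.array([78, 52, 37, 28, 22, 19, 17, 16, 15.5, 15.2])   # % never cited (made up)

params, _ = curve_fit(neg_exp, t, p, p0=(80.0, 0.5, 15.0))
a, b, c = params
print(f"fit: p(t) = {a:.1f} * exp(-{b:.2f} t) + {c:.1f}")
# c is the stable long-run percentage the abstract mentions; a + c is the
# (extrapolated) initial level, and b controls how fast the curve flattens.
```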

6.
Frame Synchronized Ring (FSR-bus) is a new high-speed interconnection network developed for a wide range of real-time applications. The medium access control (MAC) algorithm of the FSR has been analyzed with analytical models and simulations; however, these methods have not been powerful enough to prove some interesting properties of the algorithm. In this paper we explain how predicate/transition (Pr/T) nets can be used to model the FSR-bus. In addition, we prove the deadlock freedom and fairness of the MAC by analyzing the Pr/T-net model of the FSR.

7.
The rise of software as a research object is mirrored by increasing interest in quantitative studies of scientific software. However, inconsistent citation practices have led most existing studies of this type to base their analysis of software impact on software name mentions, as identified in full-text publications. Despite its limitations, citation data exists in much greater quantities and covers a broader array of scientific fields than full-text data, and thus can support investigations with much wider scope. This paper analyzes the extent to which citation data can be used to reconstruct the impact of software. Specifically, we identify the variety of citable objects related to the lme4 R package and examine how the package's impact is dispersed across these objects. Our results shed light on a little-discussed challenge of using citation data to measure software impact: even within the category of formal citation, the same software object might be cited in different forms. We consider the implications of this challenge and propose a method to reconstruct the impact of lme4 through its citations nonetheless.

8.
Starting from the notion of h-type indices for infinite sequences, we investigate whether these indices satisfy natural inequalities related to the arithmetic, the geometric and the harmonic mean. If f denotes an h-type index, such as the h-index or the g-index, we investigate inequalities such as min(f(X), f(Y)) ≤ f((X + Y)/2) ≤ max(f(X), f(Y)). We further investigate whether f(min(X, Y)) = min(f(X), f(Y)) and f(max(X, Y)) = max(f(X), f(Y)). It is shown that the h-index satisfies all the equalities and inequalities we investigate, but the g-index does not always, while it is always possible to find a counterexample involving the R-index. This shows that the h-index enjoys a number of interesting mathematical properties as an operator on the partially ordered positive cone (R+) of all infinite sequences with non-negative real values. In a second part we consider decreasing vectors X and Y whose corresponding components differ by at most d. Denoting by D the constant sequence (d, d, d, …) and by Y − D the vector (max(y_r − d, 0))_r, we prove that under certain natural conditions the double inequality h(Y − D) ≤ h(X) ≤ h(Y + D) holds.
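A small numerical check of the first inequality, using the standard h- and g-index definitions on finite, hypothetical citation vectors (the paper works with infinite sequences):

```python
# A minimal sketch with made-up citation vectors; zero-padded finite sequences
# stand in for the infinite sequences of the paper.
def h_index(cites):
    c = sorted(cites, reverse=True)
    return sum(1 for i, x in enumerate(c, start=1) if x >= i)

def g_index(cites):
    c = sorted(cites, reverse=True)
    total, g = 0, 0
    for i, x in enumerate(c, start=1):
        total += x
        if total >= i * i:
            g = i
    return g

X = [10, 8, 5, 4, 3, 1, 0, 0]
Y = [9, 7, 7, 2, 2, 2, 1, 0]
mid = [(x + y) / 2 for x, y in zip(X, Y)]   # the "average" sequence (X + Y)/2

for name, f in (("h", h_index), ("g", g_index)):
    lo, hi, m = min(f(X), f(Y)), max(f(X), f(Y)), f(mid)
    print(f"{name}: min={lo}, f((X+Y)/2)={m}, max={hi}, "
          f"inequality holds: {lo <= m <= hi}")
```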

9.
Knowledge Acquisition, 1992, 4(4): 371–386
This paper illustrates a technique for discovering mutual implications among hierarchically structured data. Such a technique may be applied to both knowledge bases and databases. If the hierarchical structure makes it possible to define granularity levels, mutual implications can be evaluated at any level. Results can be quantitative (i.e. a degree in the range [0, 1]) or qualitative (i.e. a label taken from a user-defined set). If the ground data do not represent a mapping among individuals, i.e. the level of information granularity is not the highest, a local approximation based on T-norms can be used. The process of implication discovery allows one to derive inference rules for expert systems and to detect default values. In addition, it might be successfully used by sophisticated machine learning algorithms.

10.
In an earlier paper by Glänzel and Schubert [Glänzel, W., & Schubert, A. (1988a). Characteristic scores and scales in assessing citation impact. Journal of Information Science, 14(2), 123–127; Glänzel, W., & Schubert, A. (1988b). Theoretical and empirical studies of the tail of scientometric distributions. In L. Egghe, & R. Rousseau (Eds.), Informetrics: Vols. 87/88 (pp. 75–83). Elsevier Science Publishers B.V.], a method for classifying ranked observations into self-adjusting categories was developed. This parameter-free method, called the method of characteristic scores and scales, is independent of any particular bibliometric law. The objective of the present study is twofold. In the theoretical part, the analysis of its properties for the general form of the Pareto distribution is extended and deepened; in the empirical part, the citation history of individual scientific disciplines is studied. The chosen citation window of 21 years makes it possible to analyse dynamic aspects of the method, and proves sufficiently large to also obtain stable patterns for each of the disciplines. The theoretical findings are supplemented by regularities derived from the long-term observations.
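A minimal sketch of how characteristic scores are usually computed (iterated means over the tail of the ranked citation list); the citation counts and the class labels in the comments are illustrative assumptions, not the paper's data:

```python
# A minimal sketch: b1 is the mean citation rate of all papers, b2 the mean of
# the papers at or above b1, b3 the mean of those at or above b2, and so on.
def characteristic_scores(citations, k=3):
    """Return the first k characteristic scores b1..bk of a citation list."""
    scores, pool = [], sorted(citations, reverse=True)
    for _ in range(k):
        if not pool:
            break
        b = sum(pool) / len(pool)               # mean of the current tail
        scores.append(b)
        pool = [c for c in pool if c >= b]      # keep papers at/above the score
    return scores

cites = [0, 0, 1, 1, 2, 3, 3, 5, 8, 13, 21, 55]   # made-up citation counts
b = characteristic_scores(cites)
print("characteristic scores:", [round(x, 2) for x in b])
# Papers below b1 are commonly labeled "poorly cited", between b1 and b2
# "fairly cited", between b2 and b3 "remarkably cited", at/above b3 "outstanding".
```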

11.
12.
13.
Transaction logs from online search engines are valuable for two reasons: first, they provide insight into human information-seeking behavior; second, log data can be used to train user models, which can then be applied to improve retrieval systems. This article presents a study of logs from PubMed®, the public gateway to the MEDLINE® database of bibliographic records from the medical and biomedical primary literature. Unlike most previous studies of general Web search, our work examines user activities with a highly specialized search engine. We encode user actions as string sequences and model these sequences using n-gram language models. The models are evaluated in terms of perplexity and in a sequence prediction task. They help us better understand how PubMed users search for information and provide an enabler for improving users' search experience.
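As an illustration of the modeling approach (not PubMed's actual action alphabet or log format), the sketch below trains an add-one-smoothed bigram model over hypothetical action strings and scores held-out sessions by perplexity:

```python
# A minimal sketch with a hypothetical action alphabet and sessions.
import math
from collections import Counter

# Hypothetical actions: Q = query, A = abstract view, F = full-text click, N = next page
train = ["QAAF", "QNQA", "QAF", "QQNA", "QAAAF"]
test  = ["QAF", "QNA"]

vocab = sorted({ch for s in train for ch in s} | {"^", "$"})   # with start/end marks
bigrams, unigrams = Counter(), Counter()
for s in train:
    padded = "^" + s + "$"
    for a, b in zip(padded, padded[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def prob(a, b):
    """Add-one smoothed bigram probability P(b | a)."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

def perplexity(sessions):
    log_sum, n = 0.0, 0
    for s in sessions:
        padded = "^" + s + "$"
        for a, b in zip(padded, padded[1:]):
            log_sum += math.log2(prob(a, b))
            n += 1
    return 2 ** (-log_sum / n)

print(f"held-out perplexity: {perplexity(test):.2f}")
```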

14.
Journal metrics are employed for the assessment of scientific scholarly journals from a general bibliometric perspective. In this context, the Thomson Reuters journal impact factors (JIFs) are the most widely used citation-based indicators. The 2-year journal impact factor (2-JIF) counts citations to one- and two-year-old articles, while the 5-year journal impact factor (5-JIF) counts citations to one- to five-year-old articles. Nevertheless, these indicators are not comparable across fields of science for two reasons: (i) each field has a different impact maturity time, and (ii) there are systematic differences in publication and citation behavior across disciplines. In fact, the 5-JIF first appeared in the Journal Citation Reports (JCR) in 2007 with the purpose of making impacts more comparable in fields in which impact matures slowly. However, there is no optimal fixed impact maturity time valid for all fields: in some of them two years performs well, whereas in others three or more years are necessary. Therefore, there is a problem when comparing a journal from a field in which impact matures slowly with a journal from a field in which impact matures rapidly. In this work, we propose the 2-year maximum journal impact factor (2M-JIF), a new impact indicator that considers the 2-year rolling citation time window of maximum impact instead of the fixed 2-year window immediately preceding the census year. Finally, an empirical application comparing 2-JIF, 5-JIF and 2M-JIF shows that the maximum rolling target window reduces the between-group variance with respect to the within-group variance in a random sample of about six hundred journals from eight different fields.
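A minimal sketch of the three indicators on hypothetical counts, assuming (as one reading of the abstract) that 2M-JIF takes the best rolling 2-year window inside the 5-year target window:

```python
# A minimal sketch with made-up citation/publication counts.
# cites[k] = citations received in the census year to items published k years earlier
cites = {1: 120, 2: 150, 3: 210, 4: 180, 5: 90}
pubs  = {1: 100, 2: 110, 3: 105, 4: 95,  5: 100}   # items published k years earlier

def window_if(ages):
    """Impact factor restricted to the given publication ages."""
    return sum(cites[a] for a in ages) / sum(pubs[a] for a in ages)

jif2  = window_if([1, 2])                                   # classic 2-year JIF
jif5  = window_if([1, 2, 3, 4, 5])                          # 5-year JIF
jif2m = max(window_if([a, a + 1]) for a in range(1, 5))     # best rolling 2-year window

print(f"2-JIF = {jif2:.2f}, 5-JIF = {jif5:.2f}, 2M-JIF = {jif2m:.2f}")
# For a field where impact matures slowly, the maximizing window sits deeper in
# the past (here ages 3-4), so 2M-JIF penalizes such a journal less than 2-JIF.
```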

15.
In a recent work by Anderson, Hankin, and Killworth (2008), Ferrers diagrams and Durfee squares are used to represent the scientific output of a scientist and to construct a new h-based bibliometric indicator, the tapered h-index (hT). In the first part of this paper we examine hT, identifying its main drawbacks and weaknesses: an arbitrary scoring system and an illusory increase in discrimination power compared to h. Subsequently, we propose a new bibliometric tool, the citation triad (CT), which better exploits the information contained in a Ferrers diagram, giving a synthetic overview of a scientist's publication output. The advantages of this new approach are discussed in detail. The argument is supported by several examples based on empirical data.
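For reference, a minimal sketch of hT as it is usually stated for Anderson et al.'s definition, where the Ferrers-diagram cell in row i and column j contributes 1/(2·max(i, j) − 1); the citation counts are hypothetical:

```python
# A minimal sketch with made-up citation counts.
def tapered_h(citations):
    ranked = sorted(citations, reverse=True)
    total = 0.0
    for i, c in enumerate(ranked, start=1):        # i = paper rank
        for j in range(1, c + 1):                  # j = citation index
            total += 1.0 / (2 * max(i, j) - 1)
    return total

def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

cites = [12, 7, 6, 4, 2, 1, 0]
print(f"h = {h_index(cites)}, hT = {tapered_h(cites):.2f}")
# Every additional citation adds a little to hT, which is the source of both its
# finer granularity and the "illusory discrimination" criticized in the abstract.
```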

16.
Based on the rank-order citation distribution of, e.g., a researcher, one can define certain points on this distribution, thereby summarizing the citation performance of this researcher. Previous work by Glänzel and Schubert defined these so-called "characteristic scores and scales" (CSS) based on average citation data of samples of this ranked publication–citation list. In this paper we define another version of CSS, based on diverse h-type indices such as the h-index, the g-index, Kosmulski's h(2)-index and its g-variant, the g(2)-index. Mathematical properties of these new CSS are proved in a Lotkaian framework. These CSS also provide an improvement over the single h-type indices in the sense that they give h-type index values for different parts of the ranked publication–citation list.

17.
How does educational stage affect the way people find information? In previous research using the Digital Visitors & Residents (V&R) framework for semi-structured interviews, context was a factor in how individuals behaved. This study of 145 online, open-ended surveys examines the impact that one's V&R educational stage has on the likelihood of attending to digital and human sources across four contexts. The contexts vary according to whether the search was professional or personal and whether it was successful or a struggle. The impact of educational stage differs based on context. In some contexts, people at higher educational stages are more likely to attend to digital sources and less likely to attend to human sources. In other contexts, there is no statistically significant difference (p < 0.10) among educational stages. These findings provide support for previous V&R research, while also demonstrating that online surveys can be used to supplement and balance the data collected from semi-structured interviews.

18.
In multiplex networks, each layer may represent a different interaction or the same interaction over different time periods. At present, centrality methods may fail to detect the changes among the different layers (M layers in total). As the minimal unit of a multiplex network is a duplex network (M = 2), layer differences can be clarified via duplex networks. In a duplex network, the layer similarity LSim is defined to measure the similarity between layers via the node similarity of the two layers, and the layer difference is then described by this similarity. The methodology can be extended to multiplex networks by repeated application to duplex networks. Two information networks and two further empirical cases are investigated and verified.
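The abstract does not give the exact LSim formula, so the sketch below is only one plausible reading of the idea: compare two layers by averaging, over nodes, the Jaccard similarity of each node's neighborhoods in the two layers (the edge lists are hypothetical):

```python
# A minimal sketch, not the paper's exact LSim definition.
def layer_similarity(layer_a, layer_b):
    """Average per-node Jaccard similarity of neighborhoods across two layers."""
    def neighbors(edges):
        nbr = {}
        for u, v in edges:
            nbr.setdefault(u, set()).add(v)
            nbr.setdefault(v, set()).add(u)
        return nbr

    na, nb = neighbors(layer_a), neighbors(layer_b)
    nodes = set(na) | set(nb)
    sims = []
    for n in nodes:
        a, b = na.get(n, set()), nb.get(n, set())
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sum(sims) / len(sims)

layer1 = [(1, 2), (2, 3), (3, 4), (1, 4)]   # e.g. an interaction at time t1
layer2 = [(1, 2), (2, 3), (3, 4), (2, 4)]   # the same interaction at time t2
print(f"LSim-style similarity: {layer_similarity(layer1, layer2):.2f}")
```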

19.
Most current h-type indicators use only a single number to measure a scientist's productivity and the impact of his/her published works. Although a single number is simple to calculate, it fails to capture how his/her academic performance varies over time. We empirically study the basic h-index sequence for cumulative publications with consideration of the yearly citation performance (for convenience, referred to as the L-Sequence). The L-Sequence consists of a series of L factors: each factor along a scientist's career span is calculated with the h-index formula, based on the citations received in the corresponding individual year. Thus the L-Sequence shows the scientist's dynamic research trajectory and provides insight into his/her scientific performance at different periods. Furthermore, L, the sum of all factors of the L-Sequence, can be used to evaluate the whole research career as an alternative to other h-index variants. Importantly, partial factors of the L-Sequence can be adapted for different evaluation tasks. Moreover, the L-Sequence could be used to highlight outstanding scientists in a specific period, whose research interests can be used to study the history and trends of a specific discipline.
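A minimal sketch of the L-Sequence construction as described: for each year, apply the h-index formula to the cumulative set of publications, counting only the citations each received in that single year (the data are hypothetical):

```python
# A minimal sketch with a made-up publication/citation history.
def h_index(cites):
    ranked = sorted(cites, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

# papers[p] = (publication_year, {year: citations received in that year})
papers = {
    "p1": (2010, {2010: 1, 2011: 4, 2012: 6, 2013: 3}),
    "p2": (2011, {2011: 2, 2012: 5, 2013: 5}),
    "p3": (2012, {2012: 1, 2013: 2}),
    "p4": (2013, {2013: 4}),
}

l_sequence = []
for year in range(2010, 2014):
    yearly = [per_year.get(year, 0)
              for (pub_year, per_year) in papers.values()
              if pub_year <= year]          # cumulative publications up to this year
    l_sequence.append(h_index(yearly))      # h-index on that single year's citations

print("L-Sequence:", l_sequence, " L =", sum(l_sequence))
```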

20.
Different term weighting techniques such as TF·IDF or BM25 have been used intensively for manifold text-based information retrieval tasks. Their use for modeling term profiles for named entities, and the subsequent calculation of similarities between these named entities, has been studied to a much smaller extent. The recent trend of microblogging has made available massive amounts of information about almost every topic around the world; therefore, microblogs represent a valuable source for text-based named entity modeling. In this paper, we present a systematic and comprehensive evaluation of different term weighting measures, normalization techniques, query schemes, index term sets, and similarity functions for the task of inferring similarities between named entities, based on data extracted from microblog posts. We analyze several thousand combinations of choices for the above-mentioned dimensions, which influence the similarity calculation process, and investigate in which way they impact the quality of the similarity estimates. Evaluation is performed using three real-world data sets: two collections of microblogs related to music artists and one related to movies. For the music collections, we present results of genre classification experiments using genre information from allmusic.com as a benchmark. For the movie collection, we present results of multi-class classification experiments using categories from IMDb as a benchmark. We show that microblogs can indeed be exploited to model named entity similarity with remarkable accuracy, provided the correct settings for the analyzed aspects are used. We further compare the results to those obtained when using Web pages as the data source.
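As a rough sketch of the general pipeline (not the paper's specific weighting or normalization choices), one can build a TF·IDF term profile per entity from the posts mentioning it and compare entities with cosine similarity; the toy posts and entity names are hypothetical:

```python
# A minimal sketch with made-up microblog posts and entity names.
import math
from collections import Counter

posts = {
    "ArtistA": ["great new indie album", "indie guitar set tonight"],
    "ArtistB": ["new rap single dropped", "rap show tonight"],
    "ArtistC": ["indie album review", "soft guitar ballad"],
}

# One "pseudo-document" of term frequencies per entity, plus document frequencies.
docs = {e: Counter(" ".join(ps).split()) for e, ps in posts.items()}
df = Counter(t for tf in docs.values() for t in tf)
n_docs = len(docs)

def tfidf(tf):
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

profiles = {e: tfidf(tf) for e, tf in docs.items()}
for x, y in [("ArtistA", "ArtistC"), ("ArtistA", "ArtistB")]:
    print(f"sim({x}, {y}) = {cosine(profiles[x], profiles[y]):.2f}")
```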
