How to normalize cooccurrence data? An analysis of some well‐known similarity measures
- 13 April 2009
- journal article
- research article
- Published by Wiley in Journal of the American Society for Information Science and Technology
- Vol. 60 (8) , 1635-1651
- https://doi.org/10.1002/asi.21075
Abstract
In scientometric research, the use of cooccurrence data is very common. In many cases, a similarity measure is employed to normalize the data. However, there is no consensus among researchers on which similarity measure is most appropriate for normalization purposes. In this article, we theoretically analyze the properties of similarity measures for cooccurrence data, focusing in particular on four well‐known measures: the association strength, the cosine, the inclusion index, and the Jaccard index. We also study the behavior of these measures empirically. Our analysis reveals that there exist two fundamentally different types of similarity measures, namely, set‐theoretic measures and probabilistic measures. The association strength is a probabilistic measure, while the cosine, the inclusion index, and the Jaccard index are set‐theoretic measures. Both our theoretical and our empirical results indicate that cooccurrence data can best be normalized using a probabilistic measure. This provides strong support for the use of the association strength in scientometric research.
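The abstract names the four measures without stating their formulas. As a minimal sketch, the commonly used definitions can be written in terms of a cooccurrence count and two occurrence counts. The function names, the argument names `c_ij`, `s_i`, and `s_j`, and the example counts are assumptions made for illustration and are not taken from this record.

```python
import math

# Assumed notation (not from the abstract):
#   c_ij : number of cooccurrences of items i and j
#   s_i  : total number of occurrences of item i
#   s_j  : total number of occurrences of item j

def association_strength(c_ij, s_i, s_j):
    # Probabilistic measure: cooccurrences relative to the product of occurrences.
    return c_ij / (s_i * s_j)

def cosine(c_ij, s_i, s_j):
    # Set-theoretic measure: cooccurrences relative to the geometric mean of occurrences.
    return c_ij / math.sqrt(s_i * s_j)

def inclusion_index(c_ij, s_i, s_j):
    # Set-theoretic measure: cooccurrences relative to the smaller occurrence count.
    return c_ij / min(s_i, s_j)

def jaccard_index(c_ij, s_i, s_j):
    # Set-theoretic measure: size of the intersection over size of the union.
    return c_ij / (s_i + s_j - c_ij)

# Made-up counts, purely for illustration.
c_ij, s_i, s_j = 10, 20, 50
for measure in (association_strength, cosine, inclusion_index, jaccard_index):
    print(f"{measure.__name__}: {measure(c_ij, s_i, s_j):.4f}")
```

With these made-up counts the four measures return values on quite different scales, so their values are not directly comparable across measures.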