How to normalize cooccurrence data? An analysis of some well‐known similarity measures

13 April 2009

journal article
research article
Published by Wiley in Journal of the American Society for Information Science and Technology

Vol. 60 (8), 1635-1651
https://doi.org/10.1002/asi.21075

Abstract

In scientometric research, the use of cooccurrence data is very common. In many cases, a similarity measure is employed to normalize the data. However, there is no consensus among researchers on which similarity measure is most appropriate for normalization purposes. In this article, we theoretically analyze the properties of similarity measures for cooccurrence data, focusing in particular on four well‐known measures: the association strength, the cosine, the inclusion index, and the Jaccard index. We also study the behavior of these measures empirically. Our analysis reveals that there exist two fundamentally different types of similarity measures, namely, set‐theoretic measures and probabilistic measures. The association strength is a probabilistic measure, while the cosine, the inclusion index, and the Jaccard index are set‐theoretic measures. Both our theoretical and our empirical results indicate that cooccurrence data can best be normalized using a probabilistic measure. This provides strong support for the use of the association strength in scientometric research.
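
As a rough illustration of the four measures named in the abstract, the sketch below computes each one for a single pair of items. The abstract itself gives no formulas, so the definitions here are the commonly used ones and should be read as assumptions: `c_ij` is the number of cooccurrences of items i and j, and `s_i` and `s_j` are their total occurrence counts. The function names and the example counts are hypothetical.

```python
import math


def association_strength(c_ij: float, s_i: float, s_j: float) -> float:
    """Probabilistic measure: cooccurrences divided by the product of occurrences."""
    return c_ij / (s_i * s_j)


def cosine(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: cooccurrences over the geometric mean of occurrences."""
    return c_ij / math.sqrt(s_i * s_j)


def inclusion_index(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: cooccurrences over the smaller occurrence count."""
    return c_ij / min(s_i, s_j)


def jaccard_index(c_ij: float, s_i: float, s_j: float) -> float:
    """Set-theoretic measure: cooccurrences over the size of the union."""
    return c_ij / (s_i + s_j - c_ij)


# Hypothetical counts: items i and j occur 20 and 50 times and cooccur 10 times.
c_ij, s_i, s_j = 10, 20, 50
results = {
    "association_strength": association_strength(c_ij, s_i, s_j),
    "cosine": cosine(c_ij, s_i, s_j),
    "inclusion_index": inclusion_index(c_ij, s_i, s_j),
    "jaccard_index": jaccard_index(c_ij, s_i, s_j),
}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions, the association strength is proportional to the ratio between the observed number of cooccurrences and the number expected if the two items occurred independently, which is one way to read the abstract's label "probabilistic". The other three measures compare the overlap of two sets with the sizes of those sets.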
