The effect of data pre-processing on understanding the evolution of collaboration networks

Article ID	Journal	Published Year	Pages	File Type
523929	Journal of Informetrics	2015	11 Pages	PDF

Abstract

• Author names were disambiguated algorithmically, by all initials, and by first initial of the given name.
• Algorithmic disambiguation approximated the ground truth better than the initial-based methods.
• Initial-based methods distorted the size, degree, distance, and clustering of the coauthor network.
• The distortion of network properties by initial-based methods became more severe over time.
• Initial-based methods produced degree distributions that seemingly follow a power law.

This paper shows empirically how the choice of certain data pre-processing methods for disambiguating author names affects our understanding of the structure and evolution of co-publication networks. Thirty years of publication records from 125 Information Systems journals were obtained from DBLP. Author names in the data were pre-processed via algorithmic disambiguation. We applied the commonly used all-initials and first-initial based disambiguation methods to the data, generated over-time networks with a yearly resolution, and calculated standard network metrics on these graphs. Our results show that initial-based methods underestimate the number of unique authors, average distance, and clustering coefficient, while overestimating the number of edges, average degree, and ratios of the largest components. These self-reinforcing growth and shrinkage mechanisms amplify over time. This can lead to false findings about fundamental network characteristics such as topology and reasoning about underlying social processes. It can also cause erroneous predictions of trends in future network evolution and suggest unjustified policies, interventions and funding decisions. The findings from this study suggest that scholars need to be more attentive to data pre-processing when analyzing or reusing bibliometric data.
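To make the merging effect concrete, the following is a minimal sketch of how initial-based keys can collapse distinct authors into one node. It is not the authors' pipeline: the function names, the "Given ... Surname" name format, and the three toy papers are all assumptions made for illustration.

```python
from itertools import combinations


def all_initials_key(name):
    """'Wei Min Wang' -> 'wang, wm' (surname plus all given-name initials).

    Assumes names are written 'Given [Middle ...] Surname'.
    """
    parts = name.split()
    surname, given = parts[-1], parts[:-1]
    return f"{surname.lower()}, {''.join(g[0].lower() for g in given)}"


def first_initial_key(name):
    """'Wei Min Wang' -> 'wang, w' (surname plus first given-name initial)."""
    parts = name.split()
    surname, given = parts[-1], parts[:-1]
    return f"{surname.lower()}, {given[0][0].lower()}"


def coauthor_network(papers, key=lambda name: name):
    """Build (nodes, edges) of an undirected coauthor graph.

    `key` maps a raw author name to the identity used as a node.
    """
    nodes, edges = set(), set()
    for authors in papers:
        ids = sorted({key(a) for a in authors})
        nodes.update(ids)
        edges.update(combinations(ids, 2))  # one edge per coauthor pair
    return nodes, edges


def average_degree(nodes, edges):
    return 2 * len(edges) / len(nodes) if nodes else 0.0


# Hypothetical toy data: three different people who share a surname.
papers = [
    ["Wei Wang", "Ana Silva"],
    ["Wen Wang", "Bo Chen"],
    ["Wei Min Wang", "Carl Diaz"],
]

for label, key in [("full name", lambda n: n),
                   ("all-initials", all_initials_key),
                   ("first-initial", first_initial_key)]:
    nodes, edges = coauthor_network(papers, key)
    print(label, len(nodes), "authors, average degree",
          average_degree(nodes, edges))
```

In this toy case, treating the full names as ground truth gives 6 authors with an average degree of 1.0. The all-initials key yields 5 authors (average degree 1.2) and the first-initial key yields 4 (average degree 1.5). This is the same direction of bias the abstract reports for unique authors and average degree.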

Keywords

Disambiguation; Network evolution; Collaboration network