How to evaluate rankings of academic entities using test data

Article ID	Journal	Published Year	Pages	File Type
6933969	Journal of Informetrics	2018	25 Pages	PDF

Abstract

Using two publication databases and four test data sets, we demonstrate the functionality of the framework and analyse the stability and discriminative power of the most common information retrieval evaluation measures. We find that there is no clear winner and that the performance of the evaluation measures depends heavily on the underlying data. Our results show that the average rank is indeed an adequate and stable measure. However, we also show that relatively large performance differences are required to determine with confidence whether one ranking algorithm is significantly superior to another. Lastly, we list alternative measures that also yield stable results and highlight measures that should not be used in this context.
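To make the idea of scoring a ranking by the average rank of its test-data entities concrete, the following is a minimal sketch and not the article's implementation. The function names, the toy paper identifiers, and the choice of a paired sign-flip permutation test for the significance check are all assumptions made for illustration; the abstract does not say which significance test the authors use.

```python
import random
from statistics import mean


def average_rank(ranking, relevant):
    """Mean 1-based position of the test-data entities within a ranking.

    `ranking` is a list of entity ids, best first; `relevant` is the set of
    test-data entities. Lower values mean the test entities sit nearer the top.
    """
    position = {entity: i + 1 for i, entity in enumerate(ranking)}
    return mean(position[e] for e in relevant)


def paired_permutation_test(ranks_a, ranks_b, trials=10_000, seed=0):
    """Two-sided paired sign-flip permutation test (an assumed choice of test).

    `ranks_a[i]` and `ranks_b[i]` are the positions that two ranking
    algorithms assign to the same test entity. Returns an estimated p-value
    for the hypothesis that the mean rank difference is zero.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ranks_a, ranks_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / trials


# Hypothetical toy data: two rankings of seven papers, two of them in the test set.
ranking_a = ["p3", "p1", "p7", "p2", "p5", "p4", "p6"]
ranking_b = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
test_entities = {"p1", "p2"}

score_a = average_rank(ranking_a, test_entities)
score_b = average_rank(ranking_b, test_entities)
```

With only two test entities, as in this toy data, no test could separate the two rankings reliably, which is in line with the abstract's point that large performance differences are needed before one ranking algorithm can be called significantly better.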

Keywords

Significance testing; Test data