Article ID: 1083699
Journal: Journal of Clinical Epidemiology
Published Year: 2008
Pages: 6
File Type: PDF
Abstract

Objective: Any attempt to generalize the performance of a subjective diagnostic method should account for sampling variation in both cases and readers. Most current measures of test performance, especially indices of reliability, address only the variation in cases and are therefore not suitable for generalizing results across the population of readers. We studied the effect of reader variation on two measures of multireader reliability: pair-wise agreement and Fleiss' kappa.

Study Design and Setting: We used a normal hierarchical model with a latent trait (signal) variable to simulate a binary decision-making task performed by different numbers of readers on an infinite sample of cases.

Results: Both measures, especially Fleiss' kappa, have a large sampling variance when estimated from a small number of readers, casting doubt on their accuracy given the number of readers typically used in current reliability studies.

Conclusion: Most current agreement studies are likely limited by the number of readers and are unlikely to produce a reliable estimate of reader agreement.
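To make the setup concrete, the following is a minimal sketch (not the authors' code) of this kind of simulation in Python/NumPy: binary decisions are generated from an assumed latent-signal model with reader-specific thresholds, and mean pair-wise agreement and Fleiss' kappa are computed; redrawing small reader panels shows how widely the estimates spread. All parameter values, distributions, and function names here are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def simulate_ratings(n_cases, n_readers, tau=0.5, sigma=0.5, seed=0):
    """Simulate binary ratings under a simple latent-trait model.

    Each case has a latent signal ~ N(0, 1); each reader applies a
    reader-specific threshold ~ N(0, tau^2) to the signal plus
    reader-case noise ~ N(0, sigma^2).  Illustrative parameterization.
    """
    rng = np.random.default_rng(seed)
    signal = rng.normal(0.0, 1.0, size=(n_cases, 1))
    thresholds = rng.normal(0.0, tau, size=(1, n_readers))
    noise = rng.normal(0.0, sigma, size=(n_cases, n_readers))
    return (signal + noise > thresholds).astype(int)

def pairwise_agreement(ratings):
    """Mean proportion of agreeing reader pairs per case."""
    n_cases, n_readers = ratings.shape
    pos = ratings.sum(axis=1)  # readers rating "positive" per case
    agree = pos * (pos - 1) + (n_readers - pos) * (n_readers - pos - 1)
    return float(np.mean(agree / (n_readers * (n_readers - 1))))

def fleiss_kappa(ratings):
    """Fleiss' kappa for binary ratings (cases x readers)."""
    n_cases, n_readers = ratings.shape
    pos = ratings.sum(axis=1)
    counts = np.stack([n_readers - pos, pos], axis=1)   # per-case category counts
    p_bar = np.mean((np.sum(counts ** 2, axis=1) - n_readers)
                    / (n_readers * (n_readers - 1)))    # observed agreement
    p_j = counts.sum(axis=0) / (n_cases * n_readers)    # category margins
    p_e = np.sum(p_j ** 2)                              # chance agreement
    return float((p_bar - p_e) / (1.0 - p_e))

if __name__ == "__main__":
    # Redraw reader panels (new thresholds each seed) to see how the
    # kappa estimate spreads for small numbers of readers.
    for n_readers in (3, 5, 10):
        kappas = [fleiss_kappa(simulate_ratings(500, n_readers, seed=s))
                  for s in range(200)]
        print(n_readers, round(np.mean(kappas), 3), round(np.std(kappas), 3))
```

Under these assumed parameters, the standard deviation of the kappa estimates across simulated panels shrinks as the number of readers grows, which is the kind of reader-sampling variability the abstract describes.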

Related Topics
Health Sciences; Medicine and Dentistry; Public Health and Health Policy