Article ID | Journal | Published Year | Pages | File Type
---|---|---|---|---
6836699 | Computers in Human Behavior | 2016 | 12 | 
Abstract
Test collections are extensively used to evaluate information retrieval systems in laboratory-based evaluation experiments. In the classic test-collection setting, human assessors provide the relevance judgments, a task that is costly, time-consuming, and scales poorly. Researchers are therefore still challenged to evaluate retrieval systems reliably and at low cost. Crowdsourcing offers a quick and cost-effective way to create relevance judgments. However, it comes with the risk of a heterogeneous pool of workers who produce judgments with varying levels of accuracy. It is therefore essential to understand the factors that affect the reliability of crowdsourced judgments. In this article, we measured various cognitive characteristics of workers and explored how these characteristics affect judgment reliability, comparing the crowdsourced judgments against a human gold standard. We found a significant correlation between judgment reliability and verbal comprehension skill. This association suggests that the reliability of judgments could be improved by separating workers into groups according to their cognitive abilities and filtering out (or including) certain groups. We also found significant associations between judgment reliability and both the self-reported difficulty of the judgments and workers' confidence in the task.
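The abstract describes relating per-worker judgment reliability (agreement with a gold standard) to cognitive measures such as verbal comprehension. The sketch below shows one minimal way such an analysis could be set up; the worker data, scores, and the choice of Pearson correlation are illustrative assumptions, not the authors' actual method or data.

```python
# Illustrative sketch (hypothetical data, not from the study): estimate each
# worker's judgment reliability as agreement with gold-standard labels, then
# test its correlation with a verbal-comprehension score.
import numpy as np
from scipy.stats import pearsonr

# Gold-standard relevance labels for 10 topic-document pairs (1 = relevant).
gold = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

# Hypothetical crowd judgments: one row per worker, one column per pair.
judgments = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],   # worker A
    [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],   # worker B
    [0, 1, 1, 0, 0, 0, 1, 1, 0, 1],   # worker C
    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1],   # worker D
])

# Hypothetical verbal-comprehension test scores, one per worker.
verbal_scores = np.array([78.0, 62.0, 41.0, 70.0])

# Reliability: proportion of a worker's judgments matching the gold standard.
reliability = (judgments == gold).mean(axis=1)

# Pearson correlation between verbal comprehension and reliability.
r, p = pearsonr(verbal_scores, reliability)
print("per-worker reliability:", reliability)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```

A reliability measure like this could also be used to split workers into groups (for example, above or below a comprehension threshold) and to filter out the less reliable group, in the spirit of the grouping idea described in the abstract.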
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Science Applications
Authors
Parnia Samimi, Sri Devi Ravana, Yun Sing Koh