A method for determining the number of documents needed for a gold standard corpus

Article ID	Journal	Published Year	Pages	File Type
10356085	Journal of Biomedical Informatics	2012	11 Pages	PDF

Abstract

► Annotated documents are necessary for NLP machine learning, modeling and testing.
► We create a method to determine a required sample size for the annotation set.
► The probability of word capture from a corpus provides the basis for the method.
► Dictation letters from a pain management medical practice are used as an example.
► We also demonstrate steps for creating a representative sample of dictations.
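
The idea of sizing an annotation set from word capture probability can be illustrated with a small sketch. It is not the article's own formulation, which the highlights do not spell out. It assumes a simplified model in which a word occurs in a fraction `p` of the corpus documents and documents are drawn independently at random, so the chance of seeing the word at least once in `n` documents is `1 - (1 - p)^n`. The function names and the example values for `p` and `target` are hypothetical.

```python
import math


def capture_probability(p: float, n: int) -> float:
    """Chance that a word present in a fraction p of documents appears in
    at least one of n independently sampled documents (assumed model)."""
    return 1.0 - (1.0 - p) ** n


def documents_needed(p: float, target: float) -> int:
    """Smallest n whose capture probability reaches the target."""
    if not (0.0 < p < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p and target must lie strictly between 0 and 1")
    # Closed-form estimate from solving 1 - (1 - p)^n >= target for n.
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p)))
    # Guard against floating-point rounding at the boundary.
    while n > 1 and capture_probability(p, n - 1) >= target:
        n -= 1
    while capture_probability(p, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical values: a word found in 1% of documents, 95% capture target.
    print(documents_needed(p=0.01, target=0.95))
```

Under this assumed model, rarer words push the required sample size up quickly, which is why the rarest word one wants to capture drives the size of the annotation set.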

Keywords

CPT; MPC; PAM; ICD-9-CM; i2b2; NLP; CLEF; Capture probability; Sampling; Natural Language Processing