Article code: 532226
Journal code: 869923
Publication year: 2013
English article: 12-page PDF
Full-text version: free download
English title of the ISI article
Unsupervised profiling of OCRed historical documents
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Vision and Pattern Recognition
English abstract

In search engines and digital libraries, more and more OCRed historical documents are becoming available. Still, access to these texts is often unsatisfactory because of two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation is a barrier to search even if texts are properly reconstructed. As one step towards a solution, we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) “global” information on typical recognition errors found in the OCR output, typical patterns of historical spelling variation, and vocabulary and word frequencies in the underlying text, and (2) “local” hypotheses on OCR errors and historical orthography for particular tokens of the OCR output. We argue that the availability of this kind of knowledge is a key step towards improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, e.g., to automatically fine-tune postcorrection systems or adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and the estimated ranks and scores automatically computed in profiles. As a specific application, we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.


► A two-channel profile for historical language variation and OCR errors.
► Good correlation with real language and error patterns.
► Significant improvement of correction candidate selection.
► Significant impact on efficiency of document postcorrection.
► Possible Industrial Application: automatic adaption of linguistic modeling and character classifiers of OCR engines.
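
The two-channel profile described in the abstract can be pictured as a small data structure. The Python sketch below is illustrative only: the abstract does not specify how the profile is actually estimated, so the class names (`Pattern`, `TokenHypothesis`, `Profile`), the function `build_profile`, the weighting scheme, and the sample tokens are all assumptions made for this example. It keeps per-token interpretations as the "local" channel and sums their normalized weights into "global" counts of historical spelling patterns, OCR error patterns, and vocabulary. As a simplification, each distinct OCR token is counted once rather than once per occurrence.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pattern:
    """A character rewrite, e.g. modern 't' -> historical 'th' (hypothetical type)."""
    left: str
    right: str


@dataclass
class TokenHypothesis:
    """Local channel: one candidate interpretation of a single OCR token."""
    ocr_token: str
    modern_word: str
    hist_patterns: tuple = ()   # historical spelling patterns assumed for this reading
    ocr_errors: tuple = ()      # OCR error patterns assumed for this reading
    weight: float = 1.0         # assumed (unnormalized) plausibility score


@dataclass
class Profile:
    """Global channel (three counters) plus the local hypotheses per OCR token."""
    hist_patterns: Counter = field(default_factory=Counter)
    ocr_errors: Counter = field(default_factory=Counter)
    vocabulary: Counter = field(default_factory=Counter)
    local: dict = field(default_factory=dict)


def build_profile(hypotheses):
    """Aggregate local hypotheses into global, weight-normalized counts."""
    by_token = defaultdict(list)
    for hyp in hypotheses:
        by_token[hyp.ocr_token].append(hyp)

    profile = Profile(local=dict(by_token))
    for token_hyps in by_token.values():
        total = sum(h.weight for h in token_hyps)
        if total <= 0:
            continue  # skip tokens without any usable hypothesis
        for h in token_hyps:
            share = h.weight / total  # each token contributes one unit in total
            profile.vocabulary[h.modern_word] += share
            for p in h.hist_patterns:
                profile.hist_patterns[p] += share
            for p in h.ocr_errors:
                profile.ocr_errors[p] += share
    return profile


# Made-up sample: "tbeil" read as historical "theil" (modern "teil") with an
# OCR confusion of 'h' as 'b', and "vnd" read as a historical spelling of "und".
example = [
    TokenHypothesis("tbeil", "teil", (Pattern("t", "th"),), (Pattern("h", "b"),), 0.8),
    TokenHypothesis("tbeil", "beil", (), (Pattern("", "t"),), 0.2),
    TokenHypothesis("vnd", "und", (Pattern("u", "v"),), (), 1.0),
]
profile = build_profile(example)
```

Calling `profile.hist_patterns.most_common()` or `profile.ocr_errors.most_common()` then gives a ranked list of patterns, which is the kind of ranking the abstract reports comparing against the true distribution. A postcorrection system could use such ranks to order correction candidates, though how the published system does so is not stated here.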

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition - Volume 46, Issue 5, May 2013, Pages 1346–1357