Article code: 530702
Journal code: 869784
Publication year: 2012
English article: 12 pages, PDF
Full-text version: Free download
English title of the ISI article
Word spotting in historical printed documents using shape and sequence comparisons
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Vision and Pattern Recognition
Abstract

Information spotting in scanned historical document images is a very challenging task. The joint use of the mechanical press and of human-controlled inking introduced great variability in ink level within a book or even within a page. Consequently, characters are often broken or merged together and thus become difficult to segment and recognize. The limitations of commercial OCR engines for information retrieval in historical document images have inspired alternative means of identifying given words in such documents. We present a word spotting method for scanned documents that finds the word images similar to a query word, without assuming a correct segmentation of the words into characters. The connected components are first processed to transform a word pattern into a sequence of sub-patterns. Each sub-pattern is represented by a sequence of feature vectors. A modified Edit distance is proposed to perform a segmentation-driven string matching and to compute the Segmentation Driven Edit (SDE) distance between the words to be compared. The set of SDE operations is defined to obtain the word segmentations that are the most appropriate for evaluating their similarity. These operations cope efficiently with broken and touching characters in words. The distortion of character shapes is handled by coupling the string matching process with local shape comparisons that are achieved by Dynamic Time Warping (DTW). The costs of the SDE operations are provided by the DTW distances. A sub-optimal version of the SDE string matching is also proposed to reduce the computation time; it did not lead to a great decrease in performance. A query can be given either by example or as text typed on the keyboard. Textual queries can be used to spot the word directly, without the need to synthesize its image, provided that character prototype images are available.
Results are presented for different documents and compared with other methods, showing the efficiency of our method.
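
The local shape comparison named in the abstract, Dynamic Time Warping, can be illustrated with a minimal sketch. This is the textbook DTW recurrence, not the authors' implementation: the function name `dtw_distance`, the Euclidean local cost, and the toy one-dimensional features are all assumptions made for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Textbook DTW distance between two sequences of feature vectors.

    Each element of seq_a / seq_b is a tuple of floats (one feature vector).
    The Euclidean local cost is an assumption; the abstract does not say
    which local cost or which features are used.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return 0.0 if n == m else math.inf

    # cost[i][j] = cheapest warping of seq_a[:i] onto seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


# Toy example: the second sequence repeats one frame, as a horizontally
# stretched character might; warping absorbs the repetition.
narrow = [(0.0,), (1.0,), (2.0,)]
wide = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(narrow, wide))
```

Because the warping path may repeat frames, two shapes that differ only by local stretching receive a low distance, which is why DTW suits distorted character shapes.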


► Word-spotting enables information retrieval in historical digital libraries.
► The matching of word images tolerates inaccurate segmentation of words into characters.
► Word segmentation is performed in the course of the matching process.
► The method is based on coupling local shape comparisons with the comparison of shape sequences.
► A sub-optimal version of the method speeds up word spotting with only a slight decrease in performance.
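
The idea of segmenting during matching can be sketched as an edit-style dynamic program in which several consecutive sub-patterns of one word may be merged and compared with a single sub-pattern of the other. The sketch below is only an illustration under stated assumptions: the actual set of SDE operations and their costs are not given on this page, so the merge rule, `max_merge`, `gap_cost`, and the function name `sde_like_distance` are hypothetical. In the paper the operation costs come from DTW; here a deliberately crude placeholder, `toy_shape_cost`, stands in for it.

```python
import math


def sde_like_distance(word_a, word_b, shape_cost, max_merge=2, gap_cost=1.0):
    """Edit-style distance over sub-pattern sequences with merge operations.

    word_a, word_b: lists of sub-patterns; each sub-pattern is a list of
    features. shape_cost(p, q) scores two (possibly merged) sub-patterns.
    All parameters are illustrative assumptions, not the published SDE.
    """
    n, m = len(word_a), len(word_b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0:
                best = min(best, d[i - 1][j] + gap_cost)  # drop a sub-pattern of A
            if j > 0:
                best = min(best, d[i][j - 1] + gap_cost)  # drop a sub-pattern of B
            # Match p merged sub-patterns of A against q merged ones of B,
            # merging on at most one side (1-to-1, k-to-1, or 1-to-k).
            for p in range(1, min(max_merge, i) + 1):
                for q in range(1, min(max_merge, j) + 1):
                    if p > 1 and q > 1:
                        continue
                    merged_a = [f for sub in word_a[i - p:i] for f in sub]
                    merged_b = [f for sub in word_b[j - q:j] for f in sub]
                    cost = shape_cost(merged_a, merged_b)
                    best = min(best, d[i - p][j - q] + cost)
            d[i][j] = best
    return d[n][m]


def toy_shape_cost(p, q):
    # Placeholder for a DTW-based cost: compares total "ink" and width only.
    return abs(sum(p) - sum(q)) + abs(len(p) - len(q))


# The first character of the second word is broken into two pieces.
intact = [[1, 2, 3], [4, 4]]
broken = [[1, 2], [3], [4, 4]]
print(sde_like_distance(intact, broken, toy_shape_cost))
print(sde_like_distance(intact, broken, toy_shape_cost, max_merge=1))
```

With merging allowed, the two pieces of the broken character are recombined during matching and compared against the intact one; with `max_merge=1` the program falls back to a plain edit distance and the words look less similar.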

Publisher
Database: Elsevier - ScienceDirect
Journal: Pattern Recognition - Volume 45, Issue 7, July 2012, Pages 2598–2609
Authors
, , ,