Article code: 403474
Journal code: 677241
Publication year: 2015
English article: 12 pages, PDF
Full text: free download
English title of the ISI article
Semi-supervised cluster-and-label with feature based re-clustering to reduce noise in Thai document images
Keywords
Noise reduction, document enhancement, semi-supervised classification, cluster-and-label, Thai document
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
Abstract


• We propose a novel noise reduction method for document images.
• Semi-supervised learning is applied to distinguish noise from character components.
• The proposed method is suitable for non-Latin scripts, e.g., Thai document images.
• We propose an enhanced labeling method for the semi-supervised cluster-and-label approach.
• The proposed methods perform significantly better than the comparison methods.

Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research has applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. Non-Latin scripts such as Thai are considerably more complicated than Latin scripts: they have many levels of character alignment, no word or sentence separators, and variable character sizes. When an image processing technique is applied to a Thai document, it usually also removes the characters that are relatively close to noise. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the noise reduction and character preservation performance of the proposed method with two related noise reduction approaches: a two-phased stroke-like pattern noise (SPN) removal method and a commercial noise reduction software package, ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, achieving F-measures of 86.01 for character components and 97.82 for noise components.
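
The cluster-and-label idea with re-clustering of mixed clusters can be illustrated with a small sketch. This is not the authors' implementation: the plain k-means routine, the purity threshold `min_purity`, the single-feature selection rule in `pick_feature`, and the toy two-feature data are all assumptions made for illustration. Each cluster is labeled by majority vote over its few labeled members. A cluster whose labeled members disagree is re-clustered on the feature that best separates its classes, and each sub-cluster is then labeled separately.

```python
from collections import Counter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    k = min(k, len(points))
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def vote(labels):
    """Majority label and its share among the known (non-None) labels."""
    known = [l for l in labels if l is not None]
    if not known:
        return None, 0.0
    label, count = Counter(known).most_common(1)[0]
    return label, count / len(known)

def pick_feature(points, labels):
    """Assumed selection rule: the feature whose class means are furthest apart."""
    by_class = {}
    for p, l in zip(points, labels):
        if l is not None:
            by_class.setdefault(l, []).append(p)
    means = [[sum(col) / len(ps) for col in zip(*ps)] for ps in by_class.values()]
    gaps = [max(col) - min(col) for col in zip(*means)]
    return gaps.index(max(gaps))

def cluster_and_label(points, labels, k, sub_k=2, min_purity=0.9):
    """Propagate the few known labels to unlabeled points, cluster by cluster."""
    out = list(labels)
    assign = kmeans(points, k)
    for j in set(assign):
        idx = [i for i, a in enumerate(assign) if a == j]
        label, purity = vote([labels[i] for i in idx])
        if label is None:
            continue  # no labeled member: leave this cluster unlabeled
        if purity >= min_purity:
            groups = [idx]  # conventional labeling: one vote per cluster
        else:
            # Mixed cluster: re-cluster its members on one selected feature.
            f = pick_feature([points[i] for i in idx], [labels[i] for i in idx])
            sub = kmeans([(points[i][f],) for i in idx], sub_k)
            groups = [[i for i, s in zip(idx, sub) if s == g] for g in set(sub)]
        for group in groups:
            g_label, _ = vote([labels[i] for i in group])
            if g_label is None:
                g_label = label  # fall back to the parent cluster's vote
            for i in group:
                if out[i] is None:
                    out[i] = g_label
    return out

# Toy components with two hypothetical features; None marks an unlabeled one.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 3.0), (1.0, 3.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["noise", "noise", None, "char", None, "char", None, None]

result = cluster_and_label(points, labels, k=2)
baseline = cluster_and_label(points, labels, k=2, min_purity=0.0)
print(result)
print(baseline)
```

With `min_purity=0.0` the re-clustering step never runs, so `baseline` shows conventional per-cluster voting: the unlabeled component at `(1.0, 3.0)` inherits the majority label "noise" from its mixed cluster, whereas `result` labels it "char" after the sub-cluster vote.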

Publisher
Database: Elsevier - ScienceDirect
Journal: Knowledge-Based Systems - Volume 90, December 2015, Pages 58–69
Authors
, ,