Finding and identifying text in 900+ languages

Article code	Journal code	Publication year	English article	Full-text version
457921	696081	2012	10-page PDF	Free download

Keywords

Discriminative training; Text extraction; Language identification; Smoothing

Related topics

Engineering and basic sciences, Computer engineering, Computer networks and communications

Abstract

This paper presents a trainable open-source utility to extract text from arbitrary data files and disk images which uses language models to automatically detect character encodings prior to extracting strings and for automatic language identification and filtering of non-textual strings after extraction. With a test set containing 923 languages, consisting of strings of at most 65 characters, an overall language identification error rate of less than 0.4% is achieved. False-alarm rates on random data are 0.34% when filtering thresholds are set for high recall and 0.012% when set for high precision, with corresponding miss rates of 0.002% and 0.009% in running text.
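
As a rough illustration of the pipeline the abstract describes (extract candidate strings from raw bytes, then score each string against per-language models), the sketch below may help. It is not the paper's utility or its models. The function names (`extract_strings`, `train`, `identify`), the printable-ASCII rule, the four-byte minimum length, the character-trigram features, the add-one smoothing, and the two toy training samples are all assumptions chosen for brevity. Encoding detection and the filtering thresholds mentioned in the abstract are not modeled here.

```python
import math
import re
from collections import Counter

# Assumption: a "string" is a run of at least 4 printable ASCII bytes.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Assumption: toy training text, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt vor dem faulen hund und die katze ruht im haus",
}


def extract_strings(data: bytes) -> list[str]:
    """Return runs of printable ASCII found in arbitrary binary data."""
    return [m.decode("ascii") for m in PRINTABLE_RUN.findall(data)]


def trigrams(text: str) -> list[str]:
    """Split lower-cased text into overlapping character trigrams."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples: dict[str, str]) -> dict[str, Counter]:
    """Build one trigram-count model per language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def score(counts: Counter, text: str, vocab_size: int) -> float:
    """Average log-probability of the text's trigrams, with add-one smoothing."""
    grams = trigrams(text)
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
    ) / len(grams)


def identify(models: dict[str, Counter], text: str) -> str:
    """Return the language whose model gives the text the highest score."""
    vocab_size = len(set().union(*models.values()))
    return max(models, key=lambda lang: score(models[lang], text, vocab_size))


if __name__ == "__main__":
    models = train(SAMPLES)
    blob = b"\x00\x01der hund und die katze\xff\x02ab\x00the dog and the cat"
    for s in extract_strings(blob):
        print(identify(models, s), repr(s))
```

In a setup like this, a minimum score could also be used to reject strings that match no model well, which is loosely analogous to the filtering of non-textual strings described above. Any such threshold would need tuning on real data.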

Publisher

Database: Elsevier - ScienceDirect
Journal: Digital Investigation - Volume 9, Supplement, August 2012, Pages S34–S43

Authors

Ralf D. Brown
