Free article download: Effective language identification of forum texts based on statistical approaches

Article code	Journal code	Publication year	English article	Full-text version
514940	866917	2016	22-page PDF	Free download

English title of the ISI article

Effective language identification of forum texts based on statistical approaches

Persian translation of the title

شناسایی زبان موثر متون انجمن بر اساس روش های آماری


Keywords

Natural language processing; Automatic language identification; Forum texts; Hybrid approaches; Statistical approaches; N-grams

Related topics

Engineering and Basic Sciences; Computer Engineering; Computer Science Software

Article preview


Abstract

• This investigation deals with the problem of language identification of noisy texts.
• Two statistical approaches are proposed: High Frequency Approach and Nearest Prototype Approach.
• The proposed methods are evaluated on forum datasets containing 32 different languages.
• An experimental comparison is made with LIGA, NTC, Google Translate and Microsoft Word.
• Results show that the proposed approaches outperform these baseline methods in language identification of forum texts.

This investigation deals with the problem of language identification of noisy texts, which can be the first step of many natural language processing or information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance is not convincing in practice.

In this contribution, we propose two statistical approaches: the high frequency approach and the nearest prototype approach. In the first, five language identification algorithms are proposed and implemented, namely: character based identification (CBA), word based identification (WBA), special characters based identification (SCA), a sequential hybrid algorithm (HA1) and a parallel hybrid algorithm (HA2). In the second, we use 11 similarity measures combined with several types of character N-grams.

For the evaluation, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, an experimental comparison is made between the proposed approaches and some reference language identification tools such as LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches outperform the baseline methods of language identification on forum texts.
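As a rough illustration of the nearest prototype idea mentioned in the abstract, the Python sketch below builds one character N-gram profile per language and assigns a text to the language whose profile is most similar. It is not the authors' implementation: the abstract does not say which 11 similarity measures or N-gram types are used, so cosine similarity, trigrams, and every name here (`char_ngrams`, `build_prototypes`, `identify`, the toy `TRAINING` data) are assumptions made for the example.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(samples: dict[str, list[str]], n: int = 3) -> dict[str, Counter]:
    """Merge the n-gram counts of each language's texts into one prototype."""
    prototypes: dict[str, Counter] = {}
    for lang, texts in samples.items():
        profile: Counter = Counter()
        for sample in texts:
            profile.update(char_ngrams(sample, n))
        prototypes[lang] = profile
    return prototypes


def identify(text: str, prototypes: dict[str, Counter], n: int = 3) -> str:
    """Return the language whose prototype is nearest to the text (assumed: cosine)."""
    query = char_ngrams(text, n)
    return max(prototypes, key=lambda lang: cosine(query, prototypes[lang]))


# Toy training data, invented for illustration only (not from the paper).
TRAINING = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short forum post about the weather",
    ],
    "french": [
        "le renard brun saute par-dessus le chien paresseux",
        "ceci est un petit message du forum sur la météo",
    ],
}
PROTOTYPES = build_prototypes(TRAINING)

if __name__ == "__main__":
    print(identify("the weather is nice", PROTOTYPES))
    print(identify("le chien est petit", PROTOTYPES))
```

Swapping `cosine` for another measure, or changing `n`, only touches one function, which is roughly how one would compare several similarity measures and N-gram sizes on the same data.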

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 52, Issue 4, July 2016, Pages 491–512

Authors

Kheireddine Abainia, Siham Ouamour, Halim Sayoud
