| Article ID | Journal ID | Year | English article | Full text |
|---|---|---|---|---|
| 558976 | 1451688 | 2016 | 20-page PDF | Free download |
• An unsupervised language identification approach based on Latent Dirichlet Allocation, with high precision, recall and F-scores.
• Raw n-gram counts are used as features, without any smoothing, pruning or interpolation.
• Purifies the main language from an unknown number of other languages with high precision.
• Compares four measures to find which comes closest to the true (minimum) number of topics, i.e. languages.
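The highlights state that raw n-gram counts are used as features with no smoothing, pruning or interpolation. The paper's exact feature extraction (character vs. word n-grams, segmentation) is not given here, so the following is a minimal sketch assuming character trigrams over lowercased text; the function name is illustrative, not from the paper:

```python
from collections import Counter

def char_ngram_counts(text, n=3):
    """Raw character n-gram counts for one text segment.

    No smoothing, pruning or interpolation is applied: every
    observed n-gram keeps its raw count, and unseen n-grams
    simply do not appear in the Counter.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngram_counts("hello world", n=3)
# "hello world" has 11 characters, hence 9 overlapping trigrams.
```

In an LDA formulation each text segment plays the role of a "document" and each n-gram the role of a "word", so these raw counts feed the topic model directly.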
To automatically build, from scratch, the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, with no inherent linguistic knowledge of the language or languages contained in it, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation in which we take raw n-gram counts as features, without any smoothing, pruning or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. To find the number of languages present, we compared four measures, as well as the Hierarchical Dirichlet Process, on several configurations of the ECI/MCI benchmark. Experiments on the ECI/MCI data and a Wikipedia-based Swahili corpus show that this LDA method, without any annotation, has precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques.
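The abstract says the LDA model is trained with Collapsed Gibbs Sampling, with topics reinterpreted as languages. The paper's reformulation is not reproduced here, so this is a generic textbook collapsed Gibbs sampler for LDA, sketched in pure Python under the assumption that each "document" is a text segment and each "word" is an n-gram id; all names and hyperparameter values are illustrative:

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token-id lists (here, n-gram ids per segment).
    Returns per-document topic counts; each segment can then be
    labelled with its dominant topic ("language").
    """
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    ndk = [[0] * K for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    z = []                                 # topic of every token
    for d, doc in enumerate(docs):         # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # collapsed full conditional p(z_i = t | rest)
                weights = [(ndk[d][t] + alpha) *
                           (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                r = rng.random() * sum(weights)
                k, acc = K - 1, 0.0
                for t in range(K):
                    acc += weights[t]
                    if r < acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk
```

With segments drawn from languages that use largely disjoint n-gram inventories, the sampler separates them: segments of the same language end up with the same dominant topic, which is the behaviour the language identification task relies on.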
Graphical abstract. Left: the generative model of the proposed unsupervised language identification approach based on Latent Dirichlet Allocation (LDA-LI), which takes raw n-gram counts as features without any smoothing, pruning or interpolation. Right: experiments on the ECI/MCI benchmark showing that LDA-LI achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification tools (langID.py, Guess_language, etc.).
Journal: Computer Speech & Language - Volume 39, September 2016, Pages 47–66