Article code: 1118924
Journal code: 1488464
Publication year: 2013
English article: 9 pages, PDF
Full-text version: free download
English title of the ISI article
Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
Related topics
Social Sciences and Humanities, Arts and Humanities, Arts and Humanities (General)
English abstract

This article tackles multilingual automatic alignment. Alignment refers to the process by which segments that are translations of one another are automatically matched. Instead of comparing only pairs of languages at the sentence level, as is usually done to mirror the human translation process, the computer is used here for its capacity to infer semantic alignment from a collection of texts that are translations of the same content. The corpus contains press releases from Europa, the European Community website, available in up to 23 languages. The alignment process takes advantage of frequency similarity between the different linguistic versions of a document by computing matching features for each repeated string in all versions. This is done to find reliable anchors for linking the versions. This raises the question of which granularity, character N-grams or word N-grams, best brings out semantic equivalences when two linguistic versions are compared. Alignment systems are traditionally based on splitting text into word N-grams. Observing the morphological variety of languages, even inside a single linguistic family, quickly shows that word granularity is inadequate for a widely multilingual system, i.e. a language-independent system able to handle inflectional languages as well as positional languages. Instead, when starting from a multilingual collection and then focusing on pairs of texts, we argue that character N-gram alignment is more efficient than word N-gram alignment.
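
The contrast between the two granularities can be illustrated with a minimal sketch. It is not the authors' system: the function names `ngram_counts` and `candidate_anchors`, the two toy sentences, the choice of N, and the rule "repeated at least twice with the same frequency in both versions" are all assumptions made for illustration only.

```python
import unicodedata
from collections import Counter


def ngram_counts(text, n, unit):
    """Count overlapping n-grams over characters or whitespace-separated words."""
    # Lowercase and normalise so accented letters are single code points.
    text = unicodedata.normalize("NFC", text.lower())
    items = list(text) if unit == "char" else text.split()
    joiner = "" if unit == "char" else " "
    return Counter(joiner.join(items[i:i + n]) for i in range(len(items) - n + 1))


def candidate_anchors(version_a, version_b, n, unit):
    """Hypothetical anchor rule: n-grams repeated (count >= 2) in version_a
    that occur with exactly the same frequency in version_b."""
    counts_a = ngram_counts(version_a, n, unit)
    counts_b = ngram_counts(version_b, n, unit)
    return {g for g, c in counts_a.items() if c >= 2 and counts_b.get(g) == c}


# Toy stand-ins for two linguistic versions of one press release (invented text).
english = "the european commission met. the european council agreed."
french = "la commission européenne s'est réunie. le conseil européen a approuvé."

char_anchors = candidate_anchors(english, french, 5, "char")
word_anchors = candidate_anchors(english, french, 1, "word")

print(sorted(char_anchors))
print(sorted(word_anchors))
```

On this toy pair, the character 5-grams " euro" and "europ" recur twice in both versions and survive as candidate anchors, whereas the word-level comparison finds none, because the French inflected forms "européenne" and "européen" are distinct words that each occur only once.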

Publisher
Database: Elsevier - ScienceDirect
Journal: Procedia - Social and Behavioral Sciences - Volume 95, 25 October 2013, Pages 473-481