Article code | Journal code | Publication year | English article | Full-text version
559026 | 875034 | 2014 | 23-page PDF | Free download
English title of the ISI article
A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing
Related topics
Engineering and Basic Sciences > Computer Engineering > Signal Processing
English abstract


• We present a lexicon of multiword expressions compiled for Japanese language processing.
• It contains about 111,000 expressions and roughly 820,000 notational variants of them.
• It gives the syntactic function, structure, and flexibility of each expression.
• Its validity is supported by comparison with large-scale N-gram frequency data (LDC2009T08).
• The lexicon was compiled manually, i.e., outside the statistical “empiricism” of the past 30 years.

Sag et al. (2002) highlighted a key problem that natural language processing (NLP) had long underappreciated: how to handle idiosyncratic multiword expressions (MWEs) such as idioms, quasi-idioms, clichés, quasi-clichés, institutionalized phrases, proverbs, and old sayings. Since then, many attempts have been made to extract these expressions from corpora and construct a lexicon of them. However, no extensive, reliable solution has yet been realized. This paper presents an overview of a comprehensive lexicon of Japanese multiword expressions (Japanese MWE Lexicon: JMWEL), compiled in order to realize linguistically precise, wide-coverage Japanese natural language processing systems. The JMWEL is characterized by significant notational, syntactic, and semantic diversity as well as a detailed description of the syntactic functions, structures, and flexibilities of MWEs. The lexicon contains about 111,000 header entries written in kana (phonetic characters) and almost 820,000 variants of them written in kana and kanji (ideographic characters). The paper argues for the JMWEL's validity mainly by comparing the lexicon with a large-scale Japanese N-gram frequency dataset, LDC2009T08, produced by Google Inc. (Kudo and Kazawa, 2009). The present work is an attempt to provide a tentative answer for Japanese, from outside statistical empiricism, to the question posed by Church (2011): “How many multiword expressions do people know?”
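
To make the abstract's description concrete, the following is a minimal Python sketch of how a header entry in kana with its kana/kanji variants might be represented, and how one could check what share of entries is attested in an N-gram frequency table. Everything here is an assumption for illustration: the `MweEntry` record, its field names, the `attested_share` helper, the syntactic labels, and the counts are hypothetical and are not taken from the JMWEL or from LDC2009T08. The sketch also ignores tokenization, which real N-gram data would require handling.

```python
from dataclasses import dataclass, field


@dataclass
class MweEntry:
    """Hypothetical record; field names are illustrative, not the JMWEL schema."""
    header_kana: str                                    # header entry written in kana
    variants: list[str] = field(default_factory=list)   # kana/kanji spellings
    syntactic_function: str = ""                        # assumed label, e.g. "verbal"


def attested_share(entries, ngram_counts):
    """Return the fraction of entries with at least one spelling in ngram_counts."""
    if not entries:
        return 0.0
    hits = 0
    for entry in entries:
        # An entry counts as attested if its header or any variant has a count > 0.
        spellings = [entry.header_kana, *entry.variants]
        if any(ngram_counts.get(s, 0) > 0 for s in spellings):
            hits += 1
    return hits / len(entries)


# Toy data: the idioms are ordinary Japanese, but the counts are made up.
lexicon = [
    MweEntry("あしをあらう", ["足を洗う", "足をあらう", "あしを洗う"], "verbal"),
    MweEntry("ねこのてもかりたい", ["猫の手も借りたい"], "adjectival"),
]
toy_counts = {"足を洗う": 1200, "あしを洗う": 3}

share = attested_share(lexicon, toy_counts)
```

Grouping all spellings under one kana header means a frequency lookup can succeed on any notational variant, which is one plausible reason a lexicon would record variants explicitly.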

Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Speech & Language - Volume 28, Issue 6, November 2014, Pages 1317–1339