Term extraction from sparse, ungrammatical domain-specific documents

Article code	Journal code	Publication year	English article
384599	660849	2013	11-page PDF

Keywords

Term extraction; Text mining; Business intelligence; Natural language processing

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence

Abstract

Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers’ repair notes. To this end, we present ExtTerm, a novel term extraction system. As our core innovations, we accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by existing TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address a limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction and can therefore compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.

► Novel technique for term extraction from sparse, ungrammatical texts.
► Accurately detects terms, even those with extremely low frequency (sparse).
► Extracts terms that contain any number of words, even those with more than 2 words.
► Exploits open-domain knowledge base to support domain-specific term extraction.
► The proposed technique outperforms a state-of-the-art baseline over real-life texts.
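
The abstract refers to computing the collocation strength between any number of words but does not say how ExtTerm does it. As a rough illustration only, the sketch below uses a standard generalisation of pointwise mutual information (PMI) to n-grams, which compares how often a word sequence occurs with how often it would be expected if its words were independent. This is an assumed stand-in, not the authors' technique; the function name `ngram_pmi` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def ngram_pmi(tokens, n):
    """Score every n-gram in `tokens` with a generalised PMI.

    Illustrative assumption, not ExtTerm's measure:
    PMI(w1..wn) = log2( P(w1..wn) / (P(w1) * ... * P(wn)) )
    """
    if n < 2 or len(tokens) < n:
        return {}

    unigram_counts = Counter(tokens)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    ngram_counts = Counter(ngrams)

    total_tokens = len(tokens)
    total_ngrams = len(ngrams)

    scores = {}
    for gram, count in ngram_counts.items():
        # Observed probability of the whole sequence.
        p_joint = count / total_ngrams
        # Expected probability if the words occurred independently.
        p_independent = math.prod(
            unigram_counts[word] / total_tokens for word in gram
        )
        scores[gram] = math.log2(p_joint / p_independent)
    return scores


# Hypothetical repair-note fragment, already tokenised.
sample_tokens = "hard disk drive failed replace hard disk drive".split()
for gram, score in sorted(ngram_pmi(sample_tokens, 3).items()):
    print(" ".join(gram), round(score, 3))
```

Plain PMI is known to favour rare sequences, so on sparse text it is usually combined with other evidence. It is shown here only to make the idea of an n-word association score concrete.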

Publisher

Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 40, Issue 7, 1 June 2013, Pages 2530–2540

Authors

Ashwin Ittoo, Gosse Bouma
