Article code | Journal code | Publication year | English article | Full text
384599 | 660849 | 2013 | 11 pages (PDF) | Free download
English title of the ISI article
Term extraction from sparse, ungrammatical domain-specific documents
Related topics
Engineering and Basic Sciences > Computer Engineering > Artificial Intelligence
English abstract

Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers’ repair notes. To this end, we present ExtTerm, a novel term extraction system. As our core innovations, we first accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. Second, ExtTerm precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address a limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, compensating for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.


► Novel technique for term extraction from sparse, ungrammatical texts.
► Accurately detects terms, even those with extremely low frequency (sparse).
► Extracts terms that contain any number of words, even those with more than 2 words.
► Exploits an open-domain knowledge base to support domain-specific term extraction.
► The proposed technique outperforms a state-of-the-art baseline over real-life texts.
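
The abstract mentions computing the collocation strength between any number of words but does not give the formula. As a rough illustration only, the sketch below uses a generalized pointwise mutual information (PMI) score, log(P(w1..wn) / (P(w1) * ... * P(wn))), which is a common textbook measure and is assumed here; it is not claimed to be ExtTerm's technique. The function names, the whitespace tokenizer, and the sample repair notes are all hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def collocation_strength(docs, max_n=3):
    """Score every n-gram (2 <= n <= max_n) with a generalized PMI.

    Assumption: this stands in for the paper's (unspecified) measure.
    A higher score means the words co-occur more often than their
    individual frequencies alone would predict.
    """
    unigram = Counter()
    grams = {n: Counter() for n in range(2, max_n + 1)}
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer
        unigram.update(tokens)
        for n in grams:
            grams[n].update(ngrams(tokens, n))

    total_uni = sum(unigram.values())
    scores = {}
    for n, counts in grams.items():
        total_n = sum(counts.values())
        for gram, count in counts.items():
            p_joint = count / total_n
            p_indep = math.prod(unigram[w] / total_uni for w in gram)
            scores[gram] = math.log(p_joint / p_indep)
    return scores


# Hypothetical repair notes, invented for this example.
docs = [
    "hard disk drive failure",
    "replaced hard disk drive",
    "customer reports drive noise",
]
scores = collocation_strength(docs, max_n=3)
for gram in [("hard", "disk"), ("hard", "disk", "drive")]:
    print(" ".join(gram), round(scores[gram], 3))
```

Plain PMI is known to over-reward combinations of very rare words, so on a sparse collection a score like this would likely need to be combined with other evidence, such as the linguistic and open-domain resources the abstract describes.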

Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 40, Issue 7, 1 June 2013, Pages 2530–2540