Normalization of informal text

Article ID	Journal	Published Year	Pages	File Type
558299	Computer Speech & Language	2014	22 Pages	PDF

Abstract

• Normalization of abbreviations in noisy, informal text.
• Collection, filtering and annotation of Twitter status messages.
• Comparison of statistical and machine translation approaches.
• Effects of language model order on accuracy.
• Combination of methods to achieve best results.

This paper describes a noisy-channel approach for the normalization of informal text, such as that found in emails, chat rooms, and SMS messages. In particular, we introduce two character-level methods for the abbreviation modeling aspect of the noisy-channel model: a statistical classifier that uses language-based features to decide whether a character is likely to be removed from a word, and a character-level machine translation model. A two-stage approach is used: in the first stage, possible candidates are generated using the selected abbreviation model, and in the second stage, the best candidate is chosen by decoding with a language model. Overall, we find that this approach works well and is on par with current research in the field.
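
The two-stage idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the vocabulary, the bigram probabilities, the per-character deletion probabilities, and the function names (channel_log_prob, generate_candidates, normalize) are all hypothetical, and a fixed deletion probability stands in for the paper's learned abbreviation models. The sketch scores each candidate word w for an abbreviation a by log P(a | w) + log P(w | previous word) and keeps the highest-scoring one.

```python
import math

VOWELS = set("aeiou")

# Hypothetical vocabulary and bigram probabilities P(word | previous word).
# The numbers are illustrative only and do not come from the paper.
VOCAB = ["tomorrow", "tomato", "tumor", "today"]
BIGRAM = {
    ("you", "tomorrow"): 0.05,
    ("you", "tumor"): 0.0001,
}
UNSEEN_BIGRAM_PROB = 1e-6  # assumed floor for bigrams missing from the table


def channel_log_prob(abbr, word, p_del_vowel=0.6, p_del_cons=0.2):
    """Toy log P(abbr | word) under independent character deletions.

    Characters are aligned greedily from left to right, so this scores a
    single alignment instead of summing over all of them. Returns None
    when abbr cannot be produced from word by deletions alone.
    """
    logp = 0.0
    i = 0  # position of the next abbreviation character to match
    for ch in word:
        p_del = p_del_vowel if ch in VOWELS else p_del_cons
        if i < len(abbr) and abbr[i] == ch:
            logp += math.log(1.0 - p_del)  # character kept
            i += 1
        else:
            logp += math.log(p_del)  # character deleted
    return logp if i == len(abbr) else None


def generate_candidates(abbr):
    """Stage 1: list (word, channel log-probability) for every reachable word."""
    scored = ((w, channel_log_prob(abbr, w)) for w in VOCAB)
    return [(w, lp) for w, lp in scored if lp is not None]


def normalize(abbr, prev):
    """Stage 2: pick the candidate maximizing channel score plus LM score."""
    best_word, best_score = abbr, -math.inf
    for word, channel_lp in generate_candidates(abbr):
        lm_lp = math.log(BIGRAM.get((prev, word), UNSEEN_BIGRAM_PROB))
        if channel_lp + lm_lp > best_score:
            best_word, best_score = word, channel_lp + lm_lp
    return best_word  # falls back to the abbreviation if nothing matched


print(normalize("tmr", prev="you"))
```

With these made-up numbers, the deletion model alone prefers "tumor" for "tmr" because fewer characters have to be dropped, while the language model context shifts the decision to "tomorrow". Real decoding would run over the whole message rather than one token at a time.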

Keywords

Text normalization