Extracting Collocations from Bengali Text Corpus

Article ID	Journal	Published Year	Pages	File Type
493365	Procedia Technology	2012	5 Pages	PDF

Abstract

Automatic collocation extraction is important for many natural language processing applications, including machine translation, word sense disambiguation, information retrieval, language modelling for speech processing, and lexicography. The success of collocation extraction depends on the pre-processing technique. This paper describes a systematic pre-processing technique. The pre-processed data is then used to extract collocations by two methods: Pointwise Mutual Information and Fuzzy Bi-gram Index. The paper focuses mainly on bi-gram extraction from a Bengali news corpus. Longer collocations, i.e., n-grams (n > 2), are then obtained by treating the extracted shorter collocations as individual words.
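
As a rough illustration of the pipeline the abstract outlines, the sketch below scores adjacent word pairs with the standard Pointwise Mutual Information formula, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), and then merges selected pairs into single tokens so that a second pass can surface longer n-grams. It is not the authors' implementation: the function names `pmi_bigrams` and `merge_collocations`, the `min_count` cutoff, the `_` separator, and the maximum-likelihood probability estimates are all assumptions made for illustration. The paper's pre-processing steps and the Fuzzy Bi-gram Index are not reproduced, since the abstract does not define them.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2).

    Probabilities are plain relative frequencies; min_count is a hypothetical
    cutoff that drops rare pairs, whose PMI estimates tend to be unreliable.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def merge_collocations(tokens, collocations, sep="_"):
    """Rewrite the token stream so each selected bigram becomes one token.

    Scanning is left to right and greedy; the separator is an assumption.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Placeholder tokens stand in for a tokenized, pre-processed Bengali corpus.
tokens = ["a", "b", "c", "a", "b", "d"]
scores = pmi_bigrams(tokens, min_count=2)
merged = merge_collocations(tokens, set(scores))
# Running pmi_bigrams again on `merged` would score pairs that contain the
# merged token, which correspond to trigrams in the original text.
```

In practice the pairs kept for merging would be chosen by a score threshold or a top-k cut rather than by taking every pair that passes the frequency cutoff; that choice is left open here.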