The Making of Lingala Corpus: An Under-resourced Language and the Internet

Article ID	Journal	Published Year	Pages	File Type
1110968	Procedia - Social and Behavioral Sciences	2015	9 Pages	PDF

Abstract

Lingala is now the most widespread language in Congo. The Internet provides a large amount of data. This paper attempts to elucidate the issues involved in building a corpus for an under-resourced language for which access to internet texts is difficult. To extract Lingala text from a mass of French text, it was necessary to apply a selection process based on a list of seed words. The raw corpus comprises 6,080,426 tokens. I then standardized the spelling of the data from internet sources; this standardized corpus is stored separately from the raw corpus.
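
The seed-word selection step can be illustrated with a minimal sketch. It is not the paper's actual procedure: the seed list, the function names and the 0.2 threshold are assumptions chosen for illustration. The idea is to keep a text segment only when a sufficient share of its tokens appear in a list of words that are frequent in Lingala and unlikely in French.

```python
import re

# Hypothetical seed list: a few common Lingala words picked for illustration.
# The seed list actually used for the corpus is not given in the abstract.
SEED_WORDS = frozenset(
    {"na", "ya", "mpe", "kasi", "ezali", "nzambe", "bato", "mokolo"}
)

# Runs of Unicode letters (digits and underscores are excluded).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def seed_ratio(text, seeds=SEED_WORDS):
    """Return the fraction of tokens in `text` that are seed words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in seeds)
    return hits / len(tokens)


def select_lingala(segments, threshold=0.2):
    """Keep segments whose seed-word ratio reaches `threshold`.

    The threshold value is an assumption; in practice it would be
    tuned against a hand-checked sample.
    """
    return [s for s in segments if seed_ratio(s) >= threshold]


if __name__ == "__main__":
    sample = [
        "Nzambe ezali na bato mokolo na mokolo",
        "Le président a parlé hier à la télévision",
    ]
    for kept in select_lingala(sample):
        print(kept)
```

A ratio is used rather than a raw hit count so that long French passages containing one or two incidental matches are not selected. Pages that mix both languages would likely need to be split into sentences or paragraphs before filtering.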