Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
1110968 | Procedia - Social and Behavioral Sciences | 2015 | 9 Pages | |
Abstract
Lingala is now the most widespread language in Congo. The Internet provides a large amount of data. This paper attempts to elucidate the issues involved in building a corpus for an under-resourced language for which access to internet texts is difficult. To extract Lingala text from a mass of French text, it was necessary to apply a selection process based on a list of seed words. The raw corpus comprises 6,080,426 tokens. I standardized the spelling of the data drawn from internet sources; this standardized corpus is stored separately from the raw corpus.
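The selection by seed words mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's actual procedure, which the abstract does not detail. The seed list (a handful of common Lingala words), the 0.2 threshold, and the function names `seed_ratio` and `select_lingala` are all assumptions made for illustration: a paragraph is kept when a sufficient share of its tokens appears in the seed list.

```python
import re

# Hypothetical seed list: a few common Lingala words, chosen for
# illustration only. The paper's real seed list is not given in the abstract.
SEED_WORDS = frozenset({"na", "ya", "mpe", "ezali", "kasi", "bato", "mboka"})

# Runs of letters (accented letters included), excluding digits and underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def seed_ratio(text, seeds=SEED_WORDS):
    """Return the share of tokens in `text` that belong to the seed list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in seeds)
    return hits / len(tokens)


def select_lingala(paragraphs, threshold=0.2):
    """Keep paragraphs whose seed-word ratio reaches the (assumed) threshold."""
    return [p for p in paragraphs if seed_ratio(p) >= threshold]


if __name__ == "__main__":
    sample = [
        "Mboka na biso ezali kitoko mpe bato ya malamu",
        "Le gouvernement a annoncé de nouvelles mesures hier",
    ]
    for paragraph in select_lingala(sample):
        print(paragraph)
```

In practice the threshold and the seed list would need tuning against hand-labelled pages, since short function words can also occur in French or in mixed-language text.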