Article ID: 1110968
Journal: Procedia - Social and Behavioral Sciences
Published Year: 2015
Pages: 9
File Type: PDF
Abstract

Lingala is now the most widespread language in Congo. The Internet offers a large amount of data. This paper examines the issues involved in building a corpus for an under-resourced language for which internet texts are difficult to access. Extracting Lingala text from a mass of French text required a selection process based on a list of seed words. The raw corpus comprises 6,080,426 tokens. I standardized the spelling of the data drawn from internet sources, and this standardized corpus is stored separately from the raw corpus.
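
The seed-word selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the paper: the seed list, the tokenization rule, the 0.2 threshold, and the function names `seed_ratio` and `select_lingala` are all assumptions made for illustration. The idea is to keep a sentence when a large enough share of its tokens appear in a list of common Lingala words.

```python
import re

# Hypothetical seed list: a few common Lingala words chosen for illustration.
# This is NOT the seed list used in the paper.
SEED_WORDS = {"na", "ya", "mpe", "ezali", "kasi", "bato", "mingi", "oyo"}

# Runs of letters only (accented letters included); digits and punctuation are skipped.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def seed_ratio(sentence):
    """Return the fraction of tokens in the sentence that are seed words."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in SEED_WORDS)
    return hits / len(tokens)


def select_lingala(sentences, threshold=0.2):
    """Keep sentences whose seed-word ratio reaches the (assumed) threshold."""
    return [s for s in sentences if seed_ratio(s) >= threshold]


sample = [
    "Bato mingi bazali na mboka oyo.",
    "Le gouvernement a annoncé de nouvelles mesures.",
]
selected = select_lingala(sample)
print(selected)
```

In practice the threshold and the seed list would need tuning on real data, since short function words can coincide across languages and mixed French-Lingala sentences are common in web text.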
