Article ID: 485433 | Journal: Procedia Computer Science | Published Year: 2016 | Pages: 8 | File Type: PDF
Abstract

Scarcity of resources may leave under-resourced languages behind in the race to develop data-driven NLP systems. Crowdsourcing has emerged as a technique to bridge this gap, as it offers an approach for collecting such resources in a collaborative manner. Although some Indian languages are widely spoken throughout the world, many of them are resource-poor when measured in terms of the availability of transcribed and annotated resources for building reliable data-driven systems. This paper describes an experience of collecting Hindi speech data through mobile devices using this approach, for building automatic speech recognition and other speech-based retrieval systems. The approach covers a wide variety of microphones and surrounding environments. Besides saving cost and speeding up data collection, it offers the advantage that the framework can be adapted in a language-independent manner to collect different types of resources for various applications, such as word sense disambiguation, named entity recognition, and sentiment analysis. Experiences, analysis, and challenges faced in recording more than 100 speakers are reported.
