On NLP Methods Robust to Noisy Indian Social Media Data


Khudabukhsh A, Palakodety S, Carbonell J. On NLP Methods Robust to Noisy Indian Social Media Data, in AI for Social Good Workshop; 2020.


Much of the computational social science research focusing on issues faced in developing nations concentrates on web content written in a world language, often ignoring a significant chunk of the corpus written in a poorly resourced yet highly prevalent first language of the region concerned. Such omissions are common and convenient because of the sheer mismatch between the linguistic resources available for a world language and those for its low-resource counterpart. However, the path to analyzing English content generated in linguistically diverse regions, such as the Indian subcontinent, is not straightforward either. Social science and AI for social good research focusing on issues of the Indian subcontinent faces two major Natural Language Processing (NLP) challenges: (1) how to extract a (reasonably clean) monolingual English corpus, and (2) how to extend resources and analyses to its low-resource counterpart. In this paper, we share NLP methods and lessons learnt from our multiple projects, and outline future focus areas that could be useful in tackling these two challenges. The discussed results are critical to two important domains: (1) detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and (2) detecting user-generated web content encouraging COVID-19 health compliance.
