On NLP Methods Robust to Noisy Indian Social Media Data

Citation:

Khudabukhsh A, Palakodety S, Carbonell J. On NLP Methods Robust to Noisy Indian Social Media Data, in AI for Social Good Workshop; 2020.

Abstract:

Much of the computational social science research focusing on issues faced in developing nations concentrates on web content written in a world language, often ignoring a significant chunk of a corpus written in a poorly resourced yet highly prevalent first language of the region in concern. Such omissions are common and convenient due to the sheer mismatch between the linguistic resources offered in a world language and those of its low-resource counterpart. However, the path to analyzing English content generated in linguistically diverse regions, such as the Indian subcontinent, is not straightforward either. Social science/AI for social good research focusing on Indian subcontinental issues faces two major Natural Language Processing (NLP) challenges: (1) how to extract a (reasonably clean) monolingual English corpus, and (2) how to extend resources and analyses to its low-resource counterpart. In this paper, we share NLP methods and lessons learnt from our multiple projects, and outline future focus areas that could be useful in tackling these two challenges. The discussed results are critical to two important domains: (1) detecting peace-seeking, hostility-diffusing hope speech in the context of the 2019 India-Pakistan conflict, and (2) detecting user-generated web content encouraging COVID-19 health compliance.
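To make the first challenge concrete, the sketch below illustrates one simple way a noisy code-mixed corpus could be filtered down to a (reasonably clean) monolingual English subset. This is a hypothetical illustration, not the authors' method: it uses an English-stopword-ratio heuristic, and the corpus strings and threshold are invented for the example.

```python
# Hypothetical sketch (not the paper's method): separate English comments
# from Romanized code-mixed text using an English-stopword-ratio heuristic.

ENGLISH_STOPWORDS = {
    "the", "is", "are", "a", "an", "of", "to", "in", "and",
    "for", "on", "with", "that", "this", "we", "it", "was",
}

def english_score(text: str) -> float:
    """Fraction of tokens that are common English stopwords."""
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
    return hits / len(tokens)

def filter_english(corpus, threshold=0.15):
    """Keep comments whose stopword ratio suggests English."""
    return [c for c in corpus if english_score(c) >= threshold]

corpus = [
    "This is a hopeful message for peace in the region",
    "yeh sab log kya bol rahe ho yaar",  # Romanized Hindi, no English stopwords
]
print(filter_english(corpus))
# → ['This is a hopeful message for peace in the region']
```

In practice, a trained language-identification model would replace this toy heuristic, since Romanized Hindi and noisy English share the same Latin script and can defeat simple token lists.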
