People of ACM - Partha Talukdar
October 17, 2023
What is an example of a natural language processing challenge you are working on now at Google Research India?
There are more than 7,000 languages in the world, yet language technologies (such as machine translation, question answering, and speech recognition and synthesis) are available for only a handful of them. The lack of usable language technologies exacerbates the digital divide in society, creating asymmetries in access to information and opportunities depending on the language one knows or was born into. In my group at Google Research India, we are focused on developing inclusive AI and language technologies to make access to information truly universal.
Developing capabilities one language at a time is time-consuming and expensive. Fortunately, recent advances in multilingual language modeling, and in particular Large Language Models (LLMs), have opened the door to developing language technologies across many languages at once. We are working on making LLMs more linguistically inclusive.
This is a challenging problem because data availability is heavily skewed across languages. While we have good methods for developing LLMs for languages with abundant web resources, these methods don't readily apply to the majority of the world's languages, which have limited or no text corpora available on the Internet. We are therefore developing methods that can scale LLM coverage to languages with limited web resources.
Additionally, we are working to strengthen the language data ecosystem. For example, through Project Vaani, we are partnering with the Indian Institute of Science (IISc) Bangalore to capture the speech landscape of India by collecting more than 150,000 hours of speech data from all districts of India and open-sourcing all of it.
Finally, building LLMs in a responsible way while paying attention to region-specific biases is very important. We are working on Project Bindi to recontextualize fairness and bias research across geographies and cultures.
For someone who is unfamiliar with your field, how are innovations in multilingual language modeling advancing natural language processing?
Multilingual language modeling makes it possible to transfer knowledge and training supervision across languages, which lets us bootstrap language technologies for languages with few web resources. For example, it enables a question answering system for Assamese, a language spoken by more than 15 million people in India yet one with limited language technologies, by leveraging training resources in English and other high-web-resource languages. While this opens up a unique opportunity, further research is needed to bring quality up to a level that is useful to end users.
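To make the idea of cross-lingual transfer concrete, here is a minimal sketch using the Hugging Face transformers library. The checkpoint named below is a public XLM-RoBERTa model fine-tuned only on English QA data; it is an illustrative choice made for this example, not the system described in the interview.

```python
# Minimal sketch: zero-shot cross-lingual question answering.
# Assumes the Hugging Face `transformers` library is installed.
# `deepset/xlm-roberta-base-squad2` is a public multilingual model
# fine-tuned on English SQuAD 2.0 only; it stands in for any
# multilingual QA model here.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

# Although the fine-tuning data was English-only, the shared
# multilingual representation learned during pretraining lets the
# model answer questions over text in other languages it has seen,
# such as Assamese.
context = "Guwahati is the largest city of Assam."
question = "What is the largest city of Assam?"
print(qa(question=question, context=context))  # -> {'answer': 'Guwahati', ...}
```

The same call works when `context` and `question` are written in a lower-resource language covered by the model's pretraining corpus; no language-specific fine-tuning data is required, which is what makes this style of transfer attractive for bootstrapping.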
You were part of the NELL project at CMU. What was the motivation and what were the key insights from this project?
Let us consider an example first: in order for an AI system to make sense of the sentence “State Farm stocks tumbled along with Berkshire Hathaway,” it needs knowledge about companies, the stock market, and whatever underlying event may have caused this correlated movement in stock prices. The need for such world knowledge has been recognized since the early days of AI, although assembling it in a scalable way remained a challenge.
The Never-Ending Language Learning (NELL) project, led by Prof. Tom Mitchell at CMU, was an ambitious effort to gather factual world knowledge by reading the web on a never-ending basis. During its run from 2010 to 2018, NELL accumulated a knowledge base of millions of facts by continuously reading web documents in a primarily self-supervised manner. In the pre-LLM era, this was a successful demonstration of a system that learned continuously for nearly a decade without succumbing to semantic drift.
Along with your co-authors, you received an Outstanding Paper Award at the Annual Meeting of the Association for Computational Linguistics (ACL 2019) for the paper “Zero-Shot Word Sense Disambiguation Using Sense Definition Embeddings.” Why is word sense disambiguation a long-standing problem in natural language processing, and what was a key insight of your paper?
Natural language is ambiguous, and a single word can mean different things in different contexts. For example, the word “tie” can mean a level score in a match (as in “He didn’t expect a tie in the match”) or a necktie (as in “He wore a tie to the event”). Each of these meanings or interpretations is called a “sense.” The challenge of Word Sense Disambiguation (WSD) is to automatically identify the sense of a word in a given context. This is difficult because training data does not usually cover all the senses of a word, and the senses that are covered are often skewed in their distribution.
The key insight of our ACL 2019 paper was the successful integration of three types of supervision: sense-annotated data, dictionary definitions, and lexical knowledge bases. Because of this, EWISE (our proposed approach) could handle word senses that were absent at training time (referred to as “zero-shot” in the literature) and also outperformed the existing approaches at the time.
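As a rough illustration of the definition-embedding idea, the sketch below scores a word's context against embeddings of its sense definitions (glosses) and picks the closest one. This is a simplification of the gloss-matching intuition, not the EWISE architecture itself; the sentence-transformers encoder and the gloss strings are assumptions made for the example.

```python
# Simplified sketch: disambiguate a word by comparing its context
# against embeddings of sense definitions (glosses). Because unseen
# senses still have dictionary glosses, this style of scoring can
# label senses that never appeared in the training data.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary public encoder

# Hypothetical sense inventory for "tie", with dictionary-style glosses.
glosses = {
    "tie.draw": "a match or contest that ends with an equal score",
    "tie.necktie": "a band of fabric worn around the neck",
}

context = "He didn't expect a tie in the match."

# Embed the context and every gloss, then pick the gloss whose
# embedding is most similar to the context embedding.
context_emb = encoder.encode(context, convert_to_tensor=True)
gloss_embs = encoder.encode(list(glosses.values()), convert_to_tensor=True)
scores = util.cos_sim(context_emb, gloss_embs)[0]

best_sense = max(zip(glosses, scores), key=lambda kv: float(kv[1]))[0]
print(best_sense)  # -> tie.draw
```

The design point this illustrates is that the supervision lives in the definitions rather than in per-sense labeled examples, which is what makes the zero-shot setting tractable.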
What is another example of an exciting research avenue or new development that will impact your field in the near future?
Coming back to the topic of inclusive large language models (LLMs), I am particularly excited about the integration of multiple modalities into LLMs in addition to text. We are currently working on speech-text models covering 100+ Indic languages, and I am curious to see what kind of cross-modal and cross-lingual transfers are possible and how they may contribute towards linguistic inclusion.
Partha Talukdar is a Senior Staff Research Scientist at Google Research India and an Associate Professor at the Indian Institute of Science (IISc) Bangalore. Talukdar completed his undergraduate degree at BITS Pilani and received a PhD from the University of Pennsylvania. His recent research has focused on inclusive and equitable language technology development through multilingual-multimodal Large Language Modeling. Talukdar also founded Kenome, an enterprise knowledge graph company with the mission of helping enterprises make sense of big dark data.
Talukdar received the ACM India Early Career Researcher Award for combining deep scholarship of natural language processing (NLP), graphical knowledge representation, and machine learning to solve long-standing problems.