Apxor has been hosting open discussions with experts for some time now, but this one was special: it was a session of many firsts. When Dr. Sowmya Vajjala, NLP Researcher at the National Research Council, Canada, accepted our invitation, we were elated to have our first female speaker. Little did we know that we would have a toddler in the session too. As it was a Saturday, her daughter made an appearance, but she made sure the session progressed without much disturbance.
Dr. Sowmya Vajjala is an alumna of the International Institute of Information Technology (IIIT) and went on to pursue her doctoral degree in Germany. She is a superwoman who has donned many hats, including those of data scientist, researcher, assistant professor and author, while keenly pursuing her hobby of reading. She also co-founded Pustakam.net in 2009, which provides a forum for people to write about the books they read.
Sowmya speaks passionately about her pet project, Pustakam.net, and calls it a highly successful experiment: people contribute to the website to this day, and the team has even interviewed famous personalities like Amish and the founders of Gutenberg.
About Her Book
Sowmya recently co-authored the book “Practical Natural Language Processing: A Comprehensive Guide to Building Real World NLP Systems”, published by O’Reilly Media (June 2020). While the book aims to be a go-to resource for industry practitioners, it is also being used as a textbook in applied NLP courses at some Indian and US universities.
Sowmya says, ‘When you look at existing textbooks on NLP, on one hand there are textbooks taught in colleges that talk more about algorithms and their complexity, and on the other hand there are textbooks that talk about how to do text classification using technologies like PyTorch. Our book takes the best of both worlds. It covers everything from how to create the data needed to build a machine learning system for NLP, to deploying a model, to decisions about how to update the model afterwards. It also helps readers understand how to hire an AI team. There are plenty of code examples, case studies and real-world use cases, including the pipeline of Google’s customer ticketing assistant. The book is not just about practice; it has more than 500 references to state-of-the-art research as well.’ When the book has so much to give, it's natural for Sowmya to feel that it is the best book on NLP in the world.
You can purchase this amazing book from here -
About Data Science for Businesses
Talking about how data science can help businesses, Sowmya says, ‘Compared to 5 years ago, data science is now being used in almost all stages of running a business, and it doesn’t need to be a technical business. For example, tools like Apxor use a lot of data science and analytics to help customers understand why their users are leaving. A lot of data science goes into targeted marketing campaigns using machine learning, and even into customer acquisition. Data science is also used heavily in customer support: identifying customer grievances, finding out user sentiment and understanding what the actionable items are. Irrespective of the domain of the business, data science has a role to play in many different stages.’
Sowmya advises that while building a data science team for your business, the first person you should hire is an engineer and not a data scientist, because an average data scientist may not be a good software programmer. She says that it is best to build the infrastructure for machine learning systems first and then hire a data scientist into the team.
About Data Science, Products and Ethics
When asked how products like Facebook use data science to retain users, Sowmya says, ‘Facebook is a large organization and deals with a lot of user-generated content (UGC). It has a number of teams working on data science, ML and NLP. Facebook does a great job of translating posts from other languages. Recently, I saw a post written in Korean by a Korean friend, translated into English and shown to us. These are small use cases where NLP and data science help in keeping users hooked. FB also has a lot of data about the pages we like, the videos we watch, the friends we have, our chats and a lot more. The model builds a profile of us using all these data points, so data science is useful in those aspects as well. Facebook also gives you rich features like searching for, say, “Friends in Paris”, where you search through a graph. When you are not paying for all such features, the goal of FB is to keep us there and to collect all this data from us too.’
So what is Sowmya’s opinion about the ethics behind the data collection models? Sowmya feels that it's a tough question to answer. ‘Even the research community in ML started to notice these things only in the past 3-4 years. Especially when it comes to apps like FB, we are not paying for the services, so their return on investment is our information, collected as data. When we probe further into the ethics of these things, there are certain things to be considered, like:
Fairness - Are these models clearly discriminating against some section of people? There have been reports of face recognition models discriminating against people of color, and of Amazon’s resume extractor being biased against female candidates (it was later withdrawn).
Privacy and Consent - There is a major issue of consent. Are we really taking the consent of all the people whose data we are using? ImageNet, a popular dataset that is widely used to build image recognition models, landed in controversy in April this year when a speaker showed example images of kids playing in a bathtub with their parents. It drew a lot of criticism regarding privacy.
There is a lot of criticism these days regarding how exactly we should collect data. If the data itself is biased, that's one problem; if it has privacy violations, that's another. There is no clear known solution yet. Last week we were reading about this issue in a paper from Google, but at the end of the paper we all felt that there were no clear actionable steps. So no one is really clear yet about what is to be done.’
About NLP for Vernacular Apps
We asked Sowmya how NLP can aid in creating solutions for vernacular apps for Bharat, and how language translations can be created in real time, say for in-app popups or nudges.
Here is a short video of her answer.
About User Sentiment Analysis
How can NLP help in understanding user sentiment when survey answers or reviews are in free-text form?
Sowmya says there are many solutions that products can use for NLP-based sentiment analysis, as it is a very well-studied area of NLP and has been used in practice for a long time.
‘Pay as you go services - There are many different services that help do sentiment analysis. Even big providers like Google, IBM Watson etc. provide APIs for it. If you want to do sentiment analysis, you have the option of using these pay as you go services.
Open source libraries - Because it is such a common problem, there are many open source libraries that you can use, which are mostly free.
Aspect based sentiment analysis - When we write our opinion about something, there may not be just one sentiment attached to it. Somebody might write a restaurant review saying it's very pricey but the food is not that great. How do you derive the sentiment from such reviews? For example, for our book there was a review saying the book is very good but the paper quality is very bad. So what exactly is the sentiment here? In the aspect of content the book is good, but in the aspect of printing quality the book is very bad. Aspect based sentiment analysis is not as prevalent as normal sentiment analysis. It is also dependent on the domain: for restaurants the aspects can be food, price, service etc., while for a phone they could be camera, screen size etc. So it's a challenge to do aspect based sentiment analysis, as you need to be aware of the kinds of things that customers are likely to talk about.’
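To make the idea concrete, here is a minimal sketch of aspect based sentiment analysis applied to the book review Sowmya mentions. It is only an illustration, not a production approach: the keyword lists, the aspect names and the function `aspect_sentiments` are all assumptions made up for this example, and it leans on simple word matching rather than a trained model.

```python
import re

# Hypothetical, hand-picked keyword lists for illustration only.
# A real system would need domain-specific aspects and a trained model.
ASPECT_KEYWORDS = {
    "content": {"book", "content", "writing", "examples"},
    "printing": {"paper", "printing", "binding"},
}
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}


def aspect_sentiments(review):
    """Return {aspect: score}; score > 0 is positive, score < 0 is negative."""
    scores = {}
    # Split into clauses on contrast words and punctuation, so that each
    # clause is likely to talk about a single aspect.
    clauses = re.split(r"\bbut\b|\bhowever\b|[.,;!?]", review.lower())
    for clause in clauses:
        words = set(re.findall(r"[a-z']+", clause))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                scores[aspect] = scores.get(aspect, 0) + score
    return scores


review = "The book is very good but paper quality is very bad."
print(aspect_sentiments(review))  # {'content': 1, 'printing': -1}
```

The sketch also shows why the domain matters: the aspect keywords have to be chosen in advance, so a restaurant or a phone would need a completely different set.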
Here is a good read about How to capture the right user sentiment while the user is using your app.
Regarding GPT-3 and its hype, Sowmya describes to us its advantages and disadvantages.
‘GPT-3 has created a lot of buzz, and many people started feeling that NLP can be solved using GPT-3. But there has also been a lot of criticism from people who work in that area. The major criticism is that we cannot verify the factuality of what GPT-3 produces. For example, suppose I am a journalist producing an article for a newspaper using GPT-3. I might be able to produce a perfectly grammatical, fluently readable article, but I just don’t know whether it is right or wrong. There was an example of text generation regarding Covid-19 a few days back. The prompt given was “Covid-19 vaccine”, and GPT-3 produced a whole lot of statistical data and text, which read almost as if the vaccine was already there. This creates opportunities for fake news and misinformation. So these are some of the problems associated with GPT-3.
Talking about advantages, there are also many areas where it is useful. Spelling and grammar correction is one area where GPT-3 can be used. For people who are writing creative text, it can give some ideas, as all that the neural network does is hallucinate based on the text it has already seen. It also learns a representation of a text, and that numerical representation can be used to build many downstream applications like text classification.’
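The last point, turning text into numbers and feeding those numbers to a downstream task, can be sketched in a few lines. Note the assumption: this example does not use GPT-3 at all. It swaps in plain word counts as a stand-in representation, and the training sentences, labels and function names are hypothetical. Only the overall shape (text, then vector, then classifier) mirrors what Sowmya describes.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in representation: word counts (NOT a GPT-3 representation)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training data, invented for this example.
TRAIN = [
    ("the app keeps crashing on login", "complaint"),
    ("payment failed and the app froze", "complaint"),
    ("love the new design great work", "praise"),
    ("great update the app feels fast", "praise"),
]


def classify(text):
    """Label the text with the class of its most similar training example."""
    vec = embed(text)
    _, best_label = max(TRAIN, key=lambda ex: cosine(vec, embed(ex[0])))
    return best_label


print(classify("the app froze again"))  # complaint
```

With a learned representation in place of `embed`, the rest of the pipeline would stay the same, which is what makes such representations reusable across tasks.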
About the Future of Voice Assistants and Chatbots
Talking about the arrival of voice assistants in B2B and B2C applications, she says that she sees them being used in this area very soon, in ways where you can have restricted conversations with the application. You cannot yet converse with a computer the way you can with human beings, and that is an active area of research. Regarding the future of chatbots, Sowmya says that the major direction for them is to have conversations that are as natural as possible.
We had some more queries for Sowmya regarding Federated Learning, Tips for budding data scientists, Women in Tech etc. We also had a number of questions from our audience, which Sowmya answered effortlessly.
While we are still absorbing all the pieces of knowledge from this discussion bit by bit, do check out the entire video here.