In this post, we discuss what natural language processing (NLP) is and how it differs from computer vision (CV) and the basic data science algorithms we have seen before.
To understand a basic and distinctive feature of text, let's consider an example. Suppose I ask you to recite the alphabet from A to Z as fast as possible.
Record the time you took. Now try to recite the alphabet again, but this time start from Z and move backward towards A. Can you do that at the same speed as before?
The same goes for a song or a poem: reciting it backward is much harder than reciting it in the order you learned it.
So one of the most distinctive features of text is that it is sequential. When it comes to language, it’s not just the words but their order that makes a sentence meaningful and complete.
So what exactly do we want from an NLP task? We want an AI model that understands not only the words but also their order, recognizes the sentiment of a sentence, responds to text, and performs translation.
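To make the point about order concrete, here is a minimal sketch in Python. The two example sentences and the helper name `bag_of_words` are assumptions made up for this illustration. The sketch shows that counting words alone cannot tell apart two sentences that use the same words in a different order.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count how often each word occurs, ignoring word order."""
    return Counter(sentence.lower().split())


# Two made-up sentences with the same words but opposite meanings.
a = "the dog bit the man"
b = "the man bit the dog"

same_counts = bag_of_words(a) == bag_of_words(b)  # word counts are identical
same_order = a.split() == b.split()               # word sequences differ

print(same_counts)  # True
print(same_order)   # False
```

A representation that keeps only word counts treats both sentences as the same input, which is why a model also needs the sequence.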
Language translation in particular has a few properties that set it apart:
- There is usually more than one valid way to translate a given sentence into another language.
- The output length is not fixed and depends on the words being used.
- We therefore need a loss function that can deal with such outputs.
Problems with Text Data
Building models on text data requires a lot of cleaning and preprocessing.
Real-life text data comes from internet sources, surveys, and other places where people with different educational backgrounds write “content”. Consider a case where the data is collected from a YouTube comments section.
Let's list some of the problems we might face as a result:
- Spelling mistakes can ruin the data, as a computer won't treat “beotiful” and “beautiful” as the same word.
- Slang such as “wanna” and “gotcha” would be treated as words distinct from their standard forms.
- “Tasty”, “delicious”, “tastiest”, and “yummy” are four different words but technically convey the same information.
- Many words just increase the vocabulary size and won't help much in model building. Words like “is”, “as”, “a”, and “an” are there for grammatical purposes and add little when making sense of a sentence.
- We need to develop an algorithm that can incorporate the “sequential information” along with understanding words.
- Given that any language has a very large number of words, the dimensionality of NLP tasks tends to be high.
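A few of the problems above can be sketched in plain Python. This is only an illustration under stated assumptions: the word list `KNOWN_WORDS`, the slang table `SLANG`, the stop-word set `STOP_WORDS`, and the two sample comments are all hand-made for this example, and fuzzy matching with the standard library's `difflib` stands in for a real spelling corrector. Synonyms and word order are not handled here.

```python
import difflib
import re

# Hypothetical, hand-made resources; a real project would use far larger ones.
KNOWN_WORDS = ["beautiful", "song", "listen", "want", "to", "this", "is", "a", "i"]
SLANG = {"wanna": "want to", "gotcha": "got you"}
STOP_WORDS = {"is", "as", "a", "an", "the", "to", "i"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def normalize(text):
    """Expand slang, fix near-miss spellings, and drop stop words."""
    tokens = []
    for tok in tokenize(text):
        # Replace slang with its standard form (may yield several words).
        for word in SLANG.get(tok, tok).split():
            if word not in KNOWN_WORDS:
                # Map an unknown word to its closest known word, if any.
                match = difflib.get_close_matches(word, KNOWN_WORDS, n=1, cutoff=0.8)
                word = match[0] if match else word
            if word not in STOP_WORDS:
                tokens.append(word)
    return tokens


comments = ["This is a beotiful song", "I wanna listen to this beautiful song"]

raw_vocab = {tok for c in comments for tok in tokenize(c)}
clean_vocab = {tok for c in comments for tok in normalize(c)}

print(normalize(comments[0]))            # ['this', 'beautiful', 'song']
print(len(raw_vocab), len(clean_vocab))  # 10 5
```

Even on two short comments, cleaning halves the vocabulary, which hints at how much preprocessing can reduce dimensionality on real data.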
Applications of NLP
Here we just list the various applications and fields in which natural language processing is used. We will take up each topic in detail in the posts that follow:
- Sentiment analysis in tweets and product reviews
- Fake news classifier
- Document classifier
- Language translation
- Processing voice commands
- Building responsive chatbots
In further posts, we will discuss text preprocessing techniques such as stemming, lemmatization, and removing duplicates, and then discuss a few algorithms that deal with NLP tasks.