NLP interview Q&A


  1. What do you mean by NLP, and what are some real-time use cases of NLP?
  2. Define structured and unstructured data?
  3. Discuss NLU and NLG?
  4. What are the tools required to perform NLP?
  5. Discuss the difference between NLTK & spaCy?
  6. What are the steps involved in the NLP pipeline?
  7. Discuss the importance of pre-processing techniques in NLP?
  8. What is the difference between NLP and CI?
  9. What are regular expressions and their applications?
  10. What is Information extraction?
  11. What is Text similarity?
  12. What is Text classification?
  13. What is Text summarization?
  14. Is it necessary to convert text into numbers for model training?
  15. What is Tokenization?
  16. What do you mean by stemming?
  17. What is Lemmatization?
  18. What are the differences between stemming and lemmatization?
  19. What are stop words in text processing?
  20. Explain TF-IDF and its purpose?
  21. Discuss Named entity recognition?
  22. Explain how feature engineering is implemented in NLP?
  23. Explain parts of speech in text processing?
  24. What do you mean by Bag of words in NLP?
  25. Discuss N-grams, and where do we use them in real-life scenarios?
  26. What is Syntactic analysis?
  27. Explain Semantic analysis?
  28. Explain how parsing is done in NLP?
  29. What is Latent semantic indexing in NLP?
  30. Discuss NLP metrics?


What do you mean by NLP, and what are some real-time use cases of NLP?

Natural language processing (NLP) is a field of AI and computer science that gives machines the ability to better understand human language and assist in language-related tasks.

For instance, face-to-face conversations, tweets, blogs, emails, websites, and SMS messages all come under natural language. In NLP, we have to find useful information in natural language.

NLP use cases: 

  • Information extraction
  • Text summarization
  • Text classification
  • Text similarity
  • Voice recognition
  • Language translation
  • Chatbots

Define structured and unstructured data?

According to industry estimates, more than 80% of the data being generated is in unstructured format; it may be in the form of text, images, audio, video, etc. A few examples include posts/tweets on social media, chat conversations, news, blogs, product or service reviews on e-commerce sites, and patient records in the healthcare sector.

Structured data: The elements in the data are organized in a pre-defined format, like rows and columns (e.g., an Excel file).

Unstructured data: The elements in the data are not organized in a pre-defined form.

In order to produce significant and actionable insights from text data, we use natural language processing coupled with machine learning and deep learning.


Discuss NLU and NLG?

NLP consists of both natural language understanding (NLU) and natural language generation (NLG) to achieve language-related tasks.

NLU is the ability of a machine to understand and process human speech or text: the capability to make sense of natural language.

NLG is another sub-category of NLP that constructs sentences based on the context.

What are the tools required to perform NLP?

These tools help to perform language-related tasks on text:
  • NLTK
  • spaCy
  • TextBlob
  • Stanford NLP

Discuss the difference between NLTK & spaCy?

  • NLTK stands for Natural Language Toolkit and is a Python library. NLTK is the mother of all NLP libraries, whereas spaCy is a more recently developed NLP library.
  • NLTK supports a wide range of languages compared to spaCy.
  • spaCy is an object-oriented library, while NLTK is a string-processing library.
  • spaCy supports word vectors, while NLTK does not.
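
A minimal sketch contrasting the two APIs on the same sentence (assuming both libraries are installed, with NLTK's punkt data and spaCy's en_core_web_sm model downloaded):

import nltk
import spacy

text = "We are discussing natural language processing concepts."

# NLTK: a string goes in, a plain list of strings comes out
print(nltk.word_tokenize(text))

# spaCy: the pipeline returns a Doc object made of Token objects
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])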


What are the steps involved in the NLP pipeline?


Data acquisition: The procedure of collecting the data required to find insights and patterns. This data may be in the form of text, audio, chat and SMS messages, etc.

Data cleaning: The data we gather will be in different formats, structured and unstructured, so we need to clean it and extract the required data, for example by deleting null values and duplicates.

Pre-processing: In this stage, we perform tokenization, stemming, lemmatization, and more.

Feature engineering: In this step, we create and manipulate the essential features for the model.

Model building: We choose the model that best suits our requirements.

Evaluation: In the evaluation stage, we test how the model performs on new instances. We check the model's accuracy and try to get the best accuracy.

Deployment: In this step, we deploy our model on a server for the users.

Monitor & update: After deploying the model, accuracy may decrease over time, so it is essential to monitor and update the model for better usage.


Discuss the importance of pre-processing techniques in NLP?

The data we gather is a combination of structured and unstructured formats, and there may be a lot of unwanted text. This unimportant text leads to low accuracy and can make the data hard to understand and analyze. So, proper pre-processing must be done on the raw data.

The pre-processing techniques in NLP are:

  • Tokenization
  • Stemming
  • Lemmatization
  • Parts of speech
  • Named entity recognition
  • Bag of words
  • TF-IDF
  • N-grams

What is the difference between NLP and CI?

NLP enables machines to understand and process human language, focusing on the language itself. CI (conversational interface) is a user-facing interface, such as a chatbot or voice assistant, that interacts with users through dialogue. A CI focuses on the conversational experience and typically uses NLP under the hood.

What are regular expressions and their applications?
A regular expression (regex) is a pattern that describes a set of strings matching that pattern. In other words, a regex filters certain strings out of the whole text.
Web scraping and information retrieval are among the applications of regex.
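
As a small illustration, here is a sketch using Python's built-in re module to pull email addresses out of raw text (the pattern is deliberately simplified):

import re

text = "Contact us at support@example.com or sales@example.org for help."

# A simplified email pattern for illustration only; robust email
# matching needs a more careful expression.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['support@example.com', 'sales@example.org']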


What is Information extraction?
Information extraction is about retrieving and filtering the essential information from the whole text. For example, whenever we search for anything in a browser, we get back the related information.

What is Text similarity?
Text similarity is about finding how similar two pieces of text are. For instance, finding the right candidate by applying text similarity between a resume and the job description.
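
A minimal sketch of this idea using TF-IDF vectors and cosine similarity from scikit-learn (the job description and resume strings are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "looking for a python developer with NLP experience"
resume = "experienced python developer skilled in NLP and machine learning"

# Vectorize both texts, then compare the vectors
X = TfidfVectorizer().fit_transform([job_description, resume])
print(cosine_similarity(X[0], X[1]))  # score between 0 and 1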

What is Text classification?
Classifying text on the basis of categories. For example, a news channel's home page is categorized into Sports, Politics, Science & Technology, Health, Entertainment, etc.
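
A minimal text classification sketch with scikit-learn, trained on a tiny set of hypothetical headlines (a real system would need far more data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["team wins the final match",
         "new vaccine approved by regulators",
         "election results announced today",
         "player breaks the world record"]
labels = ["sports", "health", "politics", "sports"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["local team celebrates victory"]))  # likely ['sports']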


What is Text summarization?
Text summarization gives us a summary or verdict of a whole story. For example, the verdict of a movie review.




Is it necessary to convert text into numbers for model training?
Yes, it is essential to convert human language (text or audio) into machine language (numbers), because machines can't understand natural language. After getting the model's predictions, we convert the output back into natural language for our understanding.


What is Tokenization?
In general, tokenization is the first step in the natural language processing pipeline. Tokenization is used to divide a sentence into chunks of words. There are two levels of tokenization: word level and sentence level.
Word-level tokenization:
words = "We are discussing natural language processing concepts"

If we apply tokenization to the above string, we get each word as a token, as below.

words = ['We', 'are', 'discussing', 'natural', 'language', 'processing', 'concepts']

Sentence-level tokenization:

sentence = " Hear peace. See peace. Speak peace. "

  • s1 = " Hear peace "
  • s2 = " See peace "
  • s3 = " Speak peace "

What do you mean by stemming?
Stemming refers to the process of stripping suffixes from words in an attempt to normalize them and reduce them to their non-changing portion. For instance, performing stemming on "computational", "computed", and "computing" gives us "comput", since this is the non-changing part of the words.
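
A minimal sketch with NLTK's PorterStemmer reproducing the example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computational", "computed", "computing"]:
    print(stemmer.stem(word))  # each prints "comput"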

What is Lemmatization?
Lemmatization is similar to stemming, but it takes the context (such as the part of speech) into account and gives us the base word. For example, lemmatizing "gone" and "studied" as verbs gives "go" and "study". Unlike a stem, the output is always a valid dictionary word.
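
A minimal sketch with NLTK's WordNetLemmatizer (assuming nltk.download('wordnet') has been run); note that the part-of-speech hint matters:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("gone", pos="v"))     # go
print(lemmatizer.lemmatize("studied", pos="v"))  # study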


What are the differences between stemming and lemmatization?
Both stemming and lemmatization help us get base words in NLP processing. The difference between them is that stemming often doesn't give us meaningful words, while lemmatization takes the context into account and gives us meaningful base words.

What are stop words in text processing?
Stop words in natural language are words that do not provide any useful information in a given context. For instance, if you are developing an emotion detection engine, in the sentence:
"I am feeling happy today"
"I" and "am" can be removed, since they do not provide any emotion-related information.
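
A minimal sketch removing stop words with NLTK (assuming nltk.download('stopwords') and nltk.download('punkt') have been run):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "I am feeling happy today"
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)  # ['feeling', 'happy', 'today']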

Explain TF-IDF and its purpose?

TF-IDF is a combination of two values: TF (Term frequency) and IDF (Inverse document frequency).
  • D1= " I am happy for your success "
  • D2= " I am sorry for your loss"
  • D3= " He is sorry, he cannot come "
TF = (Number of occurrences of a word) / (Total words in the document)

Term frequency refers to the number of times a word occurs within a document. In the document D1, the term "happy" occurs one time. 

Inverse document frequency for a particular word is based on the total number of documents in a dataset divided by the number of documents in which the word appears.
To reduce the impact of uniqueness, it is common practice to take the log of the IDF value. The final formula for the IDF of a particular word looks like this:

IDF(word) = log( (Total number of documents) / (Number of documents containing the word) )

IDF(happy) = log (3/1) = 0.477

Finally, TF-IDF is the product of the TF and IDF values for a particular term in a document. For "happy" in D1, TF = 1/6 ≈ 0.167, so the TF-IDF value is 0.167 × 0.477 ≈ 0.08.
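
A minimal sketch computing these values by hand, following the formulas above (base-10 log, to match the 0.477 figure):

import math

docs = ["I am happy for your success",
        "I am sorry for your loss",
        "He is sorry, he cannot come"]

def tf(word, doc):
    words = doc.lower().split()
    return words.count(word) / len(words)

def idf(word, docs):
    containing = sum(1 for d in docs if word in d.lower().split())
    return math.log10(len(docs) / containing)

print(tf("happy", docs[0]) * idf("happy", docs))  # ~0.0795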

Discuss Named entity recognition?
Named entity recognition refers to the process of classifying entities into predefined categories such as person, location, organization, etc.

For instance, in the sentence " I completed my master's in Acharya Nagarjuna university located in Guntur. "
  • I ---> Person
  • Master's ---> Education
  • Acharya Nagarjuna university ---> Organization
  • Guntur ---> Location.
An important application of named entity recognition is topic modeling, where, using the information about the entities in the text, the topic of the document can automatically be detected.
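
A minimal sketch using spaCy's pretrained pipeline (assuming en_core_web_sm is downloaded); the exact labels, such as ORG or GPE, depend on the model:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I completed my master's in Acharya Nagarjuna university located in Guntur.")

for ent in doc.ents:
    print(ent.text, ent.label_)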

Explain parts of speech in text processing?
POS (part-of-speech) tagging is another important NLP task. To construct a meaningful and grammatically correct sentence, parts of speech play an important role.
For instance, " laptop, mouse, keyboard " are tagged as nouns. Similarly "eating, playing " are verbs while " good and bad" are tagged as adjectives.

What do you mean by Bag of words in NLP?
Bag-of-words refers to a methodology used for extracting features from text documents. These features can then be used for various tasks, such as training machine learning algorithms.
  • D1= " I am happy for your success "
  • D2= " I am sorry for your loss"
  • D3= " He is sorry, he cannot come "
Each document is then represented by the counts of the vocabulary words it contains. This is called the bag-of-words approach, since the sequence of words in a document isn't taken into account.
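
A minimal bag-of-words sketch over the three documents using scikit-learn's CountVectorizer (note that its default tokenizer drops single-character tokens such as "I"):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am happy for your success",
        "I am sorry for your loss",
        "He is sorry, he cannot come"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.toarray())                         # word counts per document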

Explain how feature engineering is implemented in NLP?
The procedure of converting raw text data into machine-understandable format (numbers) is called feature engineering of text data. The performance and accuracy of machine learning and deep learning algorithms are fundamentally dependent on the type of feature engineering technique used. The different feature engineering methods are listed below, with a small sketch after the list:
  • One Hot encoding
  • Count vectorizer
  • N-grams
  • Co-occurrence matrix
  • Hash vectorizer
  • TF-IDF
  • Word embedding
  • fastText embeddings
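
As promised above, a minimal one-hot encoding sketch over a toy vocabulary (the vocabulary and word are illustrative):

vocab = sorted({"am", "happy", "i", "sorry"})

def one_hot(word):
    # 1 in the position of the word, 0 elsewhere
    return [1 if w == word else 0 for w in vocab]

print(one_hot("happy"))  # [0, 1, 0, 0]
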
Discuss N-grams, and where do we use them in real-life scenarios?
N-grams refers to sets of co-occurring words. The intuition behind the N-gram approach is that words occurring together provide more information than words occurring individually.

S = "I am learning NLP. "

Here, if we create a feature set for this sentence with individual words, it will look like this:
Features = { I, am, learning, NLP }

Now consider the phrase "not bad". If it is split into individual words, it loses its actual meaning, which is close to "good". This problem can be solved by N-grams:
Unigrams are the individual words present in the sentence.
Bigrams are combinations of 2 consecutive words.
Trigrams are combinations of 3 consecutive words.

For example,
Unigram: " I " , " am " , " learning ", " NLP "
Bigrams: "I am". " am learning", " learning NLP "
Trigrams: " I am learning ". " am learning NLP "
 
Number of N-grams in a sentence S with X words: N-grams(S) = X - (N - 1)

We use N-grams in real life for auto-completion and auto-correction.
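
A minimal sketch generating bigrams and trigrams with NLTK:

from nltk import ngrams

tokens = "I am learning NLP".split()
print(list(ngrams(tokens, 2)))  # 4 - (2 - 1) = 3 bigrams
print(list(ngrams(tokens, 3)))  # 4 - (3 - 1) = 2 trigrams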

What is Syntactic analysis?
Syntactic analysis is a technique of analyzing sentences to extract meaning from them. Using syntactic analysis, a machine can analyze and understand the order of words arranged in a sentence.

Parsing: It helps in deciding the structure of a sentence or text in a document. It helps analyze the words in the text based on the grammar of the language.
Word segmentation: Dividing a continuous stream of text into its component words (segments).
Morphological segmentation: The purpose of morphological segmentation is to break words into their base forms.
Stemming: It is the process of removing the suffix from a word to obtain its root word.
Lemmatization: It maps a word to its dictionary base form (lemma) without altering the meaning of the word.


Explain Semantic analysis?
Semantic analysis helps make a machine understand the meaning of a text. It uses various algorithms for the interpretation of words in sentences. 

  • NER is the process of information retrieval that helps identify entities such as the name of a person, organization, place, time, emotion, etc.
  • WSD (word sense disambiguation) helps identify the sense of a word as used in different sentences.
  • NLG is the process of generating text.
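
A minimal word sense disambiguation sketch using NLTK's implementation of the Lesk algorithm (assuming nltk.download('wordnet') and nltk.download('punkt') have been run); the exact synset returned depends on the context:

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("I went to the bank to deposit my money")
print(lesk(context, "bank"))  # a WordNet synset such as Synset('savings_bank.n.02')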

What is Latent semantic indexing in NLP?
Latent semantic indexing is a mathematical technique used to improve the accuracy of the information retrieval process. It aids in the discovery of hidden (latent) relationships between words (semantics) by generating a set of various concepts associated with the terms of a phrase in order to increase information comprehension.
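
A minimal LSI/LSA sketch using TF-IDF followed by truncated SVD in scikit-learn (the documents and the number of concepts are illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am happy for your success",
        "I am sorry for your loss",
        "He is sorry, he cannot come"]

X = TfidfVectorizer().fit_transform(docs)

# Project the documents onto 2 latent concepts
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(X)
print(concepts.shape)  # (3, 2): each document expressed over 2 concepts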

