# NLP
###### tags: `NLP` `BOW` `TFIDF`
[toc]
## Terms:
**Lemmatization:** reduces a word to its dictionary base form (lemma) using vocabulary and morphology, e.g. "running" → "run", "better" → "good".
**Stemming:** chops suffixes off a word to reach a root (stem), e.g. "studies" → "studi". Faster than lemmatization, but the stem may not be a real word.
**Stop word:** a very common word (e.g. "the", "is", "a") that carries little meaning on its own and is usually removed before analysis.
**Tokenization:** splits a string of text into single words = tokens.
**Token:** a single unit of text, typically a word.
**Feature vectors:** the numeric vectors that represent each document after vectorization; this is what a model actually consumes.
**n-gram:** a contiguous sequence of *n* tokens; *n* is how many words you pick at EACH time (n = 1 unigram, n = 2 bigram, ...).
**Parts-of-speech (POS) Tagging:**
Assigning a category tag to each token of a sentence. The most common POS tags identify words as nouns, verbs, and adjectives.
**Text Normalization:** = Cleaning
Normalization is a process that converts a list of words to the same form:
converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on
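
A minimal sketch of these preprocessing steps, using NLTK as one common choice (the sample sentence and resource downloads are illustrative; resource names vary slightly across NLTK versions):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads (safe to re-run).
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The children were running quickly through the parks."

# Normalization (lowercasing) + tokenization.
tokens = word_tokenize(text.lower())

# Stop-word removal (isalpha() also drops punctuation tokens).
words = [t for t in tokens if t.isalpha() and t not in stopwords.words("english")]
print(words)  # ['children', 'running', 'quickly', 'parks']

# Stemming: crude suffix chopping, may produce non-words.
print([PorterStemmer().stem(w) for w in words])  # ['children', 'run', 'quickli', 'park']

# Lemmatization: dictionary base forms (treats tokens as nouns by default).
print([WordNetLemmatizer().lemmatize(w) for w in words])  # ['child', 'running', 'quickly', 'park']

# POS tagging on the original tokens.
print(nltk.pos_tag(word_tokenize(text)))
```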
## Vectorization:
Converts text into numbers so that a computer can process it.
### Bag of Words:
Bag of Words (BoW) extracts features from text by counting word occurrences: each sentence (document) becomes a feature vector whose entries are the frequency of each vocabulary word.
**Drawbacks of BoW**
* Term ordering is not considered
* Rareness of a term is not considered
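
A minimal BoW sketch with scikit-learn's CountVectorizer (the toy corpus is illustrative); note how "the" dominates even though it carries no meaning, which is exactly the drawback listed above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()        # unigrams by default; ngram_range=(1, 2) would add bigrams
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```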
### TF-IDF:
TF-IDF (Term Frequency - Inverse Document Frequency) reflects how important a word is to a document within a corpus.
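
One common formulation (libraries differ in smoothing and normalization details):

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
$$

where $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$. A term that appears in every document gets $\log(N/N) = 0$, so ubiquitous words are weighted down.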

## BoW (CountVectorizer) vs TF-IDF (TfidfVectorizer):
TfidfVectorizer and CountVectorizer are both methods for converting text data into vectors, since models can process only numerical data.
CountVectorizer only counts the number of times a word appears in the document. Raw counts let frequent words dominate and drown out rare words that could have helped us process our data more effectively.
To overcome this, we use TfidfVectorizer.
TfidfVectorizer considers the overall weight of a word across the corpus: it scales the word counts by the inverse of how often the words appear across documents, which penalizes the most frequent words.
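
A side-by-side sketch of the two vectorizers on the same toy corpus (illustrative data; scikit-learn's TF-IDF applies smoothing and L2 normalization by default):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Raw counts: frequent words ("the") simply get the biggest numbers.
counts = CountVectorizer().fit_transform(corpus)
print(counts.toarray())

# TF-IDF: words appearing in every document ("the", "sat", "on") receive a
# lower IDF, so their weight shrinks relative to the distinguishing words
# ("cat"/"mat" vs "dog"/"log").
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))
```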
## Comparison techniques

https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html