# NLP

###### tags: `NLP` `BOW` `TFIDF`

[toc]

## Terms:

**Lemmatization:** reduces a word to its dictionary form (lemma), e.g. "running" → "run", "better" → "good".

**Stemming:** trims word endings with simple rules to reduce a word to its stem, e.g. "studies" → "studi".

**Stop word:** a very common word (e.g. "the", "is", "a") that carries little meaning on its own and is usually removed before analysis.

**Tokenization:** splits the strings of text into single words = tokens.

**Token:** a single word, the unit produced by tokenization.

**Feature vectors:** the numeric vector representation of a text that a model can process.

**n-gram:** a contiguous sequence of n tokens from the text; n is how many words you pick at EACH time.

**Parts-of-speech (POS) Tagging:** assigning a category tag to the tokenized parts of a sentence. The most popular POS tagging is identifying words as nouns, verbs, adjectives, and so on.

**Text Normalization:** = cleaning. Normalization is a process that converts a list of words to the same form: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on.

## Vectorization:
Convert the text to numbers so the computer can understand it.

### Bag of Words:
Bag of Words extracts features from text by counting how often each word occurs.

![](https://i.imgur.com/sD4QMRv.png)

Sentences on the left; the frequency of each word is recorded in the feature vector on the right.

**Drawbacks of BoW**
* Term ordering is not considered
* Rareness of a term is not considered

### TF-IDF:
TF-IDF reflects how important a word is to a document within a corpus. A common formulation is $\text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$, where $\text{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$.

![](https://i.imgur.com/UJipaEK.png)

## BOW (CountVectorizer) vs TF-IDF (TfidfVectorizer):
CountVectorizer and TfidfVectorizer are both methods for converting text data into vectors, since models can only process numerical data. CountVectorizer simply counts the number of times a word appears in a document, which lets very frequent words dominate and ignores rare words that could have helped us process the data more effectively. To overcome this, we use TfidfVectorizer: it considers the overall weight of a word across the documents, so the most frequent words can be penalized. In short, TfidfVectorizer weights the word counts by a measure of how often they appear in the documents. (Short Python sketches of both vectorizers are given at the end of this note.)

## Comparison techniques
![](https://i.imgur.com/Us1VzgC.png)

https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html
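
## Python sketches

The snippets below are minimal sketches of the ideas above; they assume Python 3.9+ with scikit-learn installed, and the tiny corpora and variable names are made up for illustration.

First, normalization and tokenization using only the standard library (real projects usually use NLTK or spaCy instead):

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text and strip punctuation (basic text normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text: str) -> list[str]:
    """Split a normalized string into word tokens."""
    return normalize(text).split()

print(tokenize("The cat sat on the mat!"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```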
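
Bag of Words with scikit-learn's CountVectorizer on a made-up two-sentence corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix of raw term counts

print(vectorizer.get_feature_names_out())   # learned vocabulary (get_feature_names() on older scikit-learn)
print(X.toarray())                          # one count vector per sentence
```

The vectors only record how often each word appears: word order is lost and every term counts equally, which is exactly the drawback listed above.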
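
TF-IDF with TfidfVectorizer on a similar made-up corpus; the IDF factor down-weights words that occur in every document (such as "the") relative to rarer words with the same count:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))                 # one weighted vector per sentence
```

Note that scikit-learn uses a smoothed IDF and L2-normalizes each row by default, so the numbers differ slightly from the plain formula above, but the effect of penalizing very common words is the same.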