<!-- Outline
Intro
NLP and why is it important & interesting
Overview of NLP
Explain what is NLP and related methods
Application examples
Detailed Explanation
Getting Started with text data
Preprocessing
Using Python NLP frameworks on data (Spacy, NLTK)
Common NLP tasks and techniques
Conclusion
-->
# An Introduction to Natural Language Processing
*Image source: [Devopedia](https://devopedia.org/natural-language-processing)*
Natural Language Processing (NLP) is a subset of artificial intelligence that enables computers to understand and process human languages. NLP utilizes Machine Learning (ML) and Deep Learning (DL) tools and techniques, but also has applications outside of them.
While humans understand language through context and communicate in words, computers only understand 1s and 0s. The challenge is not only to make computers “speak” like humans, but also to make them understand patterns in human language at scale. Computer scientists have long been trying to “teach” computers to understand language the way humans do, using machine learning to uncover patterns in large collections of text.
NLP has a broad range of applications in business and medical research, including email filtering, search engine optimization, language translation, and open-ended survey analysis. NLP can also be used to detect fake news or hate speech on social media, and even to identify people expressing suicidal intentions. These use cases show how important it is to understand language computationally.
This blog gives a gentle introduction to the field of Natural Language Processing. We will walk through the steps involved in working with text data in R, including data cleaning, preprocessing, and unsupervised and supervised machine learning approaches.
## Data Cleaning and Preprocessing Techniques
In NLP, we deal with text data that are sequences of words rather than numeric variables. As in most subfields of data science, preprocessing "cleans" text data to remove noise and prepares it to be processed computationally. For example, we can remove special characters, symbols, punctuation, HTML tags, etc. from the raw data, since they contain no information for our model to learn from or for us to analyze.
While the exact preprocessing steps depend on the data and the use case, below we highlight some common and useful methods for working with English or English-like text.
### Lowercase:
An extremely simple and effective preprocessing method: making all text lowercase ensures that the same word is represented uniformly throughout our data.
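As a minimal sketch, base R's `tolower()` does this in a single call:
```r
# Lowercasing with base R -- no extra packages needed
text <- "Natural Language Processing is FUN!"
tolower(text)
#> [1] "natural language processing is fun!"
```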

### Tokenization:
Tokenization is the process of breaking up a text document into individual words called tokens. There are a variety of tokenization methods that split text in different ways. The simplest tokenization is splitting text on whitespace.
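A minimal whitespace-tokenization sketch using base R's `strsplit()`:
```r
# Whitespace tokenization with base R
text <- "natural language processing is fun"
strsplit(text, "\\s+")[[1]]
# yields the tokens "natural", "language", "processing", "is", "fun"
```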

### Stop words removal:
Common words that do not contribute to the information in text are called stop words. Words like 'a', 'the', 'is', 'are' don't add much meaning and can add noise to our data.
While the exact list of stop words depends on the data and task at hand, NLP frameworks have built-in stopword lists that can be used to remove them from our data. We can also define a custom list of stopwords and supply it as an argument to prebuilt stopword removal functions, allowing for much-needed flexibility.
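As a sketch of stop word removal with the tm package (assumed installed), using its built-in English list plus a custom word:
```r
# Stop word removal with the tm package
library(tm)

text <- "this is a sentence about the cats and the dogs"
removeWords(text, stopwords("english"))              # built-in English stop word list
removeWords(text, c(stopwords("english"), "cats"))   # built-in list plus a custom stop word
# removed words leave extra whitespace behind, which stripWhitespace() can clean up
```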

### Stemming:
Words in most languages can be trimmed to their root by removing morphological affixes, leaving only the stem. This reduces inflection and maps all words arising from the same root to a uniform form. For example, 'increase', 'increased', 'increasing', and 'increasingly' all reduce to the same stem.

Word stemming does not check whether the resulting stem is a valid word in the language. That is why a stemmer may map 'cared' to 'car', when it should map to 'care'.
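A minimal stemming sketch with the SnowballC package (assumed installed), which implements the Snowball/Porter stemmers:
```r
# Stemming with the SnowballC package
library(SnowballC)
wordStem(c("increase", "increased", "increasing", "increasingly"), language = "english")
# all four inflected forms are reduced to a common stem,
# which need not be a valid dictionary word
```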
### Lemmatization:
Lemmatization is very similar to stemming, with one major difference. While stemming converts words to a root form, it does not check whether that root is a valid word in the language's dictionary. Lemmatization adds this extra check, giving a valid lemma for the word. For example, stemming may map the word 'pared' to 'par', while lemmatization would map it to 'pare', the correct lemma.
> Lemmatization is very similar to stemming with one major difference - checking if the stem is the valid lemma in the language.
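One option in R is the textstem package; a minimal sketch (assuming textstem and its bundled lemma dictionaries are installed):
```r
# Lemmatization with the textstem package
library(textstem)
lemmatize_words(c("cared", "caring", "increasingly"))
# each word is replaced by its dictionary lemma where one is known
```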

### N-grams
To better understand the structure and meaning of language, it is not enough to look at individual words. We can capture sentence structure and meaning better by studying sequences of words. This is possible using N-grams, which are combinations of multiple words occurring together in a sequence.
> N-grams are combinations of multiple words occurring together in a sequence
N-grams with N=1 are unigrams, N=2 are bigrams, N=3 are trigrams, and so on. Choosing N can be tricky: a higher N captures broader context and more sequence information, but can also introduce noise if it is too broad. We want to choose N so that enough context is included without capturing too much.
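A minimal bigram sketch with the quanteda package (assumed installed):
```r
# Building bigrams with the quanteda package
library(quanteda)
toks <- tokens("natural language processing is fun")
tokens_ngrams(toks, n = 2)
# produces bigrams such as "natural_language", "language_processing", ...
```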

## Text Vectorization
Computers can ultimately only deal with numbers. The process of converting text into numbers is called text vectorization. After preprocessing, we need to represent the text numerically, i.e., encode it as numbers that algorithms can work with. There are many methods of vectorizing text; here we cover two important ones.
### Bag of words(BOW):
It is one of the simplest text vectorization techniques. The intuition behind BOW is that two sentences are considered similar if they contain a similar set of words.
> In NLP tasks, each text sentence is called a document, and a collection of such documents is referred to as a text corpus.
BOW constructs a vocabulary of the d unique words in the corpus (the collection of all tokens in the data) and represents each document as a d-dimensional vector of word counts.
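A minimal bag-of-words sketch with the tm package (assumed installed), building a document-term matrix from a toy two-document corpus:
```r
# Bag of words with the tm package
library(tm)

docs   <- c("the cat sat on the mat", "the dog chased the cat")
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)   # rows = documents, columns = terms, cells = counts
inspect(dtm)
```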

### TF-IDF:
TF-IDF stands for Term Frequency (TF) - Inverse Document Frequency (IDF). Term frequency measures how often a word appears in a document (the probability of finding that word in the document). Inverse document frequency indicates how unique the word is across the corpus; if a word appears in every document, it has a low IDF.
TF-IDF is the product of TF and IDF, giving more weight to words that occur often in one document but rarely overall.
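With tm, the same document-term matrix can instead be built with TF-IDF weights; a sketch reusing the `corpus` from the bag-of-words example:
```r
# TF-IDF weighting with the tm package
tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(tfidf)   # terms appearing in every document (e.g. "the") receive a weight of 0
```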
The choice of vectorization method will depend on the task at hand.
## Machine Learning Approaches to Language Processing
Various machine learning approaches can be used to build pipelines and models on the cleaned and preprocessed data. These models can generalize and handle novel cases.
We can train supervised machine learning models for tasks such as sequence and text classification, translation, and more. Supervised learning requires a manually labeled dataset for training.
If an algorithm is applied to large sets of data to extract meaning without the need for pre-annotation, that is unsupervised machine learning. Unsupervised techniques are best suited to deciphering patterns in large-scale data based on the characteristics of the text itself; clustering documents into topics is a simple example. By understanding the difference between supervised and unsupervised machine learning, we can get the best results from each.
### Unsupervised:
In unsupervised machine learning, we train the model without annotated or tagged outputs. Clustering, in which we group similar documents together, is one example. Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data: it finds natural groups of items (topics) in the data. [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a popular topic modeling algorithm that represents each document as a mixture of topics based on its word counts.
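A minimal topic-modeling sketch with the topicmodels package (assumed installed), reusing the count-based `dtm` from the bag-of-words example; in practice a much larger corpus is needed for meaningful topics:
```r
# Topic modeling with Latent Dirichlet Allocation (topicmodels package)
library(topicmodels)

# LDA works on raw term-frequency counts, not TF-IDF weights
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
terms(lda_model, 5)   # the five most probable terms for each topic
```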
### Supervised:
In supervised machine learning, we provide labeled examples of what the algorithm should look for when training the model. The trained model can then be applied to a larger set of documents and retrained as more labeled data becomes available, letting it learn more from the documents.
The most popular supervised machine learning algorithms are support vector machines, Bayesian networks, random forests, maximum entropy, conditional random fields, and neural networks/deep learning.
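To make this concrete, here is a hedged sketch of a supervised text classifier using tm together with the randomForest package (assumed installed); `labeled_docs` is a hypothetical data frame with `text` and `label` columns, not part of the original post:
```r
# Supervised text classification sketch with tm + randomForest
library(tm)
library(randomForest)

# `labeled_docs` is a hypothetical labeled dataset (columns: text, label)
corpus <- VCorpus(VectorSource(labeled_docs$text))
dtm    <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

model <- randomForest(x = as.matrix(dtm), y = as.factor(labeled_docs$label), ntree = 100)
predict(model, newdata = as.matrix(dtm))   # in practice, predict on a held-out test set
```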
There also exist base models, trained on enormous corpora, that provide word embeddings we can use as vector inputs to a neural network. For example, [BERT](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270) is a state-of-the-art language model used to train supervised NLP models.
## NLP and Text Mining Packages in R
There are many packages that are already available to use for Natural Language Processing, and here are some highlights:
### koRpus
koRpus is a package for analyzing text, including automatic language detection and indices of lexical diversity. It also offers a plugin for the RKWard GUI and IDE, which provides graphical dialogs for its basic features.
### lsa
lsa (Latent Semantic Analysis) provides routines for performing latent semantic analysis in R. The idea behind the package is that text has a latent semantic structure that is obscured by word usage such as synonymy and polysemy.
### OpenNLP
OpenNLP provides an R interface to Apache OpenNLP, which is a collection of NLP tools written in Java. OpenNLP supports various tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution.
### Quanteda
Quanteda is a fast, flexible, and comprehensive framework for quantitative text analysis in R. It provides functionality for corpus management, creating and manipulating tokens and n-grams, exploring keywords in context, forming and manipulating sparse document-feature matrices, and more.
Other notable packages for NLP in R include RWeka, spacyr, stringr, text2vec, tm, and wordcloud.
## Application Example | Fake News Classification
To demonstrate the pipeline and workflow of a typical NLP task, [this Colab](https://colab.research.google.com/drive/1KncNB1AC5ZI644mR3JBcHRQqY0LQ1gLY?usp=sharing) notebook classifies news articles as fake or not. We use tm, a popular text mining package in R, to preprocess the text data and fit a random forest model that achieves 0.77 accuracy.
## Further Readings and Directions
1. NLP Research Papers are not too hard to read, and really interesting! Here are some good links -
- [100 must-read NLP papers](https://github.com/mhagiwara/100-nlp-papers)
- [NLP papers with code](https://paperswithcode.com/area/natural-language-processing)
2. Python has some good packages for NLP tasks, such as [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/).
## Conclusion
You should now have the tools and background required to start working with text data in R! Natural Language Processing can be incredibly powerful for gaining insights from language data. What we have demonstrated is a simple application of what NLP can do. There are many interesting datasets and topics on Kaggle (such as sentiment analysis of movie reviews and scripts of sitcoms) that you can explore to get started with NLP.
## Acknowledgments
Created by: Aryan Srivastava and Haolin Chen