# Natural Language Processing (NLP)
NLP is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. It enables machines to comprehend, interpret, and respond to natural language input, whether it's in written or spoken form. Machine Learning, on the other hand, equips computers with the ability to learn from data and improve their performance on specific tasks without being explicitly programmed.
The integration of NLP and Machine Learning has revolutionized the way we interact with computers. It has opened up possibilities for applications such as machine translation, sentiment analysis, chatbots, voice assistants, and much more. By harnessing the power of algorithms and statistical models, we can extract valuable insights from vast amounts of textual data, automate language-related tasks, and even create systems that can communicate with us in a human-like manner.
At the heart of NLP lies the challenge of understanding the complexities of human language. Unlike programming languages with rigid rules and precise syntax, natural language is rich in ambiguity, context, and nuances. Words can have multiple meanings, sentences can be interpreted differently based on their context, and language evolves and changes over time.
Machine Learning approaches in NLP enable computers to tackle these intricacies by learning patterns from large datasets. Techniques such as neural networks, deep learning, and probabilistic models have significantly advanced the field, allowing machines to grasp the subtleties of language and generate meaningful responses.

Throughout this chapter, we will delve into various topics, including preprocessing and feature extraction techniques, language modeling, sequence-to-sequence models, sentiment analysis, and language generation. We will explore the fundamental concepts behind these methods, discuss their applications, and provide practical examples to help solidify your understanding.

### NLP Pipeline
The NLP pipeline is a sequence of interconnected steps designed to process and analyze natural language text. It typically begins with text preprocessing, which involves tasks like tokenization, removing stop words, and stemming. Next, the pipeline moves to feature extraction, where relevant information is extracted from the text, such as part-of-speech tags or named entities. After that, the text is fed into machine learning models for tasks like sentiment analysis, named entity recognition, or text classification. Finally, post-processing steps may be applied, such as generating summaries or translating the text. The NLP pipeline enables the automation of language-related tasks, facilitating efficient and accurate natural language understanding.

Here is a simple example:
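
Below is a minimal sketch of such a pipeline using NLTK (an assumed library choice, not necessarily the one used in the original example): it tokenizes a sentence, removes stop words, stems the remaining tokens, and extracts part-of-speech tags as simple features.

```python
# A hedged, minimal NLP-pipeline sketch: preprocessing + light feature extraction.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# one-time downloads (resource names can differ slightly across NLTK versions)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

text = "The quick brown foxes are jumping over the lazy dogs."

# 1) Preprocessing: tokenization, stop word removal, stemming
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
stems = [PorterStemmer().stem(t) for t in filtered]

# 2) Feature extraction: part-of-speech tags for the remaining tokens
pos_tags = nltk.pos_tag(filtered)

print(stems)     # e.g. ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']
print(pos_tags)  # e.g. [('quick', 'JJ'), ('brown', 'NN'), ...]
```
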

### Zipf’s law
Zipf's Law, named after linguist George Kingsley Zipf, states that in a large corpus of natural language text, the frequency of a word is inversely proportional to its rank in the frequency distribution. In simpler terms, a small number of words occur very frequently, while the majority of words are rare. For example, the most common word in English, such as "the," appears much more frequently than less common words like "elephant" or "elucidate." Zipf's Law has been observed across different languages and domains, and it has significant implications for tasks like language modeling, information retrieval, and text processing in natural language processing research.
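
In other words, frequency × rank stays roughly constant. A quick way to eyeball this on any large plain-text corpus (the file name below is just a placeholder) is to count word frequencies and print the top-ranked words:

```python
# Rough check of Zipf's law: word frequency should fall off roughly as 1/rank,
# so freq * rank should stay in the same ballpark for the top-ranked words.
# "corpus.txt" is a placeholder; substitute any large plain-text file.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:2d}  {word:12s} freq={freq:8d}  freq*rank={freq * rank}")
```
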

## Normalization & Vectorization
### Normalization
Normalization in NLP refers to the process of transforming text into a standard, normalized form to remove inconsistencies and variations. It helps ensure that different representations of the same word or phrase are treated as identical, allowing for more accurate analysis and comparison.
Normalization techniques commonly used in NLP include **lowercasing**, **stemming** (trimming words down to their root form, e.g., "running" → "run"), and **lemmatization** (mapping words to their dictionary form, e.g., "better" → "good").

Another technique is **stop word removal**, where the idea is to remove commonly occurring words (e.g., "and," "the," "is") that typically don't contribute much to the meaning.

### Vectorization
Vectorization in NLP refers to the process of representing textual data as numerical vectors that machine learning algorithms can process. It involves converting words, sentences, or documents into numerical representations that capture semantic and contextual information.
#### Bag of Words
The Bag-of-Words (BoW) model is a popular vectorization technique where a document is represented as a "bag" or collection of its constituent words, disregarding grammar and word order. The BoW model creates a vocabulary from the entire corpus and counts the frequency of each word in a document. The resulting vector represents the presence or absence of words in the document.
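
A minimal sketch of the idea with scikit-learn's `CountVectorizer` (an assumed implementation choice):

```python
# Bag-of-Words: build a vocabulary over the corpus and count each word per document.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one row of word counts per document
```
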


#### TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is another vectorization approach that accounts for the importance of words in a document relative to the entire corpus. TF-IDF assigns a weight to each word in a document based on its frequency (term frequency) and inversely proportional to its occurrence in other documents (inverse document frequency). This weighting scheme helps highlight words that are more discriminative and relevant to a particular document.

If we seek to differentiate each document by the words that compose it, those words that are present in all of them do not provide information (information theory). Therefore, it is necessary to measure not only how much a word appears in an instance (document), but also how frequent that word is in the entire corpus.
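
A common form of the weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t (scikit-learn uses a smoothed, normalized variant). A minimal sketch:

```python
# TF-IDF: terms shared by every document are down-weighted relative to
# document-specific terms that occur the same number of times.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # "cat"/"dog"/"mat"/"log" score higher than the shared "sat"/"on"
```
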

### Code examples
Enough of the theory, let's see some coding examples:
https://colab.research.google.com/drive/1f45q4Ojc7U87p56pyJ0Z_VR0Q_-7mwKB
## Word embeddings
### Word2Vec
Word embedding is a technique in natural language processing (NLP) that aims to represent words as dense, real-valued vectors in a high-dimensional space. The key idea behind word embedding is to capture semantic and syntactic relationships between words, so that words with similar meanings are represented as vectors that are closer to each other in this space.

For example, consider the words "cat" and "dog." With a well-trained word embedding model, these words would likely have similar vector representations because they share related semantic meaning, both referring to common domestic animals. Similarly, words like "dog" and "puppy" might have vectors close to each other due to their related meanings.

The website https://projector.tensorflow.org/ allows you to visualize and explore high-dimensional data embeddings in a user-friendly way. You can upload your own word embeddings or use pre-trained ones to visualize relationships between words or other data points, gaining insights into their similarities, clusters, and patterns.

### Algorithms
CBOW (Continuous Bag-of-Words) and Skip-gram are two popular algorithms used in word embedding models, specifically in the Word2Vec framework.

- **CBOW** aims to predict a target word based on its surrounding context words. It takes a window of context words and tries to predict the target word at the center of that window. This approach is useful for learning word representations that capture the overall meaning of a word based on its context.
- **Skip-gram** works in the opposite way. It takes a target word and tries to predict the context words within a certain window around the target word. Skip-gram is effective for learning word representations that can capture different contextual uses of a word.
Both CBOW and Skip-gram algorithms leverage a neural network architecture to train word embeddings. The models are trained by adjusting the weights of the neural network through backpropagation, where the goal is to minimize the prediction error between the predicted and actual words.
*Example:*

> CBOW is computationally efficient and tends to work well with frequent words, while Skip-gram is better at handling rare words and capturing fine-grained semantic relationships.
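
A minimal sketch with gensim (an assumed library; the toy corpus is far too small to yield meaningful vectors): the `sg` flag switches between the two algorithms.

```python
# gensim Word2Vec: sg=0 trains CBOW, sg=1 trains Skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                    # (50,) dense vector for "cat"
print(skipgram.wv.most_similar("cat", topn=3)) # nearest words in the embedding space
```
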
### Architecture
#### **CBOW**
In the Continuous Bag-of-Words (CBOW) architecture of Word2Vec, the goal is to predict a target word based on its surrounding context words. The architecture consists of several layers, including input, embedding, lambda, and dense layers. Let's explore each of these components:
- **Input Layer**: The input layer represents the context words surrounding the target word. These context words are typically one-hot encoded or represented using integer indices, indicating their position in the vocabulary.
- **Embedding Layer:** The embedding layer maps the one-hot encoded context words to dense vectors of fixed dimensions. It learns the word embeddings by adjusting the weights during the training process. The embedding layer transforms the input into a dense vector representation, capturing semantic and contextual information.
- **Lambda Layer**: The lambda layer computes the average or sum of the embedding vectors of the context words. This layer takes the embedded context words and combines them to generate a single vector representation.
- **Dense Layers**: The dense layers are fully connected layers that receive the vector representation from the lambda layer as input. These layers perform non-linear transformations and capture complex relationships between the input and the target word. The number and size of dense layers can vary depending on the complexity of the task.

The output of the dense layers is typically passed through a softmax activation function, which converts the output into a probability distribution over the vocabulary. The probabilities indicate the likelihood of each word being the target word.
During the training process, the weights of the embedding and dense layers are adjusted using backpropagation and optimization techniques like stochastic gradient descent, minimizing the prediction error between the predicted and actual target words.
By training the CBOW architecture, the Word2Vec model learns word embeddings that capture semantic relationships and contextual information, enabling downstream NLP tasks to benefit from these dense and meaningful representations.
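
A hedged Keras-style sketch of this architecture (the vocabulary size, embedding dimension, and window below are illustrative assumptions):

```python
# CBOW sketch: average the context-word embeddings, then predict the center word via softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 10_000   # assumed vocabulary size
embed_dim = 100       # assumed embedding dimensionality (the "feature size")
window = 2            # context words taken on each side of the target

cbow = models.Sequential([
    layers.Input(shape=(2 * window,)),                   # indices of the context words
    layers.Embedding(vocab_size, embed_dim),             # context indices -> dense vectors
    layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),  # average the context embeddings
    layers.Dense(vocab_size, activation="softmax"),      # probability of each word being the target
])
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
cbow.summary()
```
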
#### **Skip-gram**
In the Skip-gram architecture of Word2Vec, the objective is to predict context words given a target word. The architecture involves input, embedding, merge, and dense layers. Let's delve into each component:
- **Input Layer**: The input layer receives the target word and context word pairs. These pairs are typically represented using one-hot encoding or integer indices, indicating the position of the words in the vocabulary.
- **Embedding Layers**: The target word and context word pairs are individually passed through embedding layers. These layers convert the one-hot encoded or integer representations into dense word embeddings. Each word is assigned a fixed-dimensional vector representation, capturing semantic and contextual information.
- **Merge Layer**: The merge layer computes the dot product of the embeddings of the target word and the context word. This dot product operation measures the similarity or relatedness between the two words.
- **Dense Layers**: The dot product value from the merge layer is then passed through dense sigmoid layers. The sigmoid activation function squashes the dot product value to a range between 0 and 1. This output represents the probability of the context word given the target word.

During training, the output of the dense layer is compared with the actual label (0 or 1), and the loss is computed. Backpropagation is then performed, updating the weights of the embedding layer and optimizing the model using techniques such as stochastic gradient descent. This process allows the model to learn meaningful word embeddings that capture relationships between target and context words.
By training the Skip-gram architecture, Word2Vec creates word embeddings that encode semantic information and context, providing a rich representation for words. These embeddings can be utilized in various downstream NLP tasks, enhancing their performance and accuracy.
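
A corresponding hedged sketch of the Skip-gram setup, where each (target, context) pair is classified as a genuine or negative pair (again, the sizes are illustrative assumptions):

```python
# Skip-gram sketch: embed the target and context words, take their dot product,
# and squash it with a sigmoid into the probability that the pair is genuine.
from tensorflow.keras import layers, Model

vocab_size = 10_000   # assumed vocabulary size
embed_dim = 100       # assumed embedding dimensionality

target_in = layers.Input(shape=(1,), name="target_word")
context_in = layers.Input(shape=(1,), name="context_word")

target_vec = layers.Embedding(vocab_size, embed_dim)(target_in)    # (batch, 1, embed_dim)
context_vec = layers.Embedding(vocab_size, embed_dim)(context_in)  # (batch, 1, embed_dim)

dot = layers.Dot(axes=-1)([target_vec, context_vec])   # similarity between the two embeddings
prob = layers.Dense(1, activation="sigmoid")(layers.Flatten()(dot))

skipgram = Model([target_in, context_in], prob)
skipgram.compile(optimizer="adam", loss="binary_crossentropy")
skipgram.summary()
```
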
### Feature size
In Word2Vec, the feature size refers to the dimensionality of the word embeddings generated by the model. It determines the length of the dense vector representation assigned to each word. The feature size is a hyperparameter set before training, typically chosen based on the size of the training dataset and the complexity of the language. **Higher feature sizes can capture more intricate relationships but may require more training data and computational resources**. The feature size should strike a balance between capturing sufficient semantic information and avoiding overfitting. **Adding more dimensions or adding more training data provides diminishing improvements.**

### Window size
In Word2Vec, the window size refers to the number of context words considered on either side of a target word during training. It determines the extent of the local context used to predict the target word. **A larger window size captures more global context but may dilute the specific word relationships, while a smaller window size focuses on more immediate context.**
Here is a window size of 5:
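
A small sketch of how (target, context) pairs are collected with a window of 5 words on each side of the target (the sentence is just an illustrative example):

```python
# Collect (target, context) pairs using a window of 5 words on each side of the target.
def context_pairs(tokens, window=5):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

sentence = "natural language processing lets machines understand human language better".split()
for target, context in context_pairs(sentence)[:10]:
    print(target, "->", context)
```
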

### Code examples
Enough of the theory, let's see some coding examples:
https://colab.research.google.com/drive/1vDG1l3QcNZSJi6zFwSKB2MvJDr2cfbmh
Here are some recommendations for embeddings:
https://colab.research.google.com/drive/1in7eWwduy3t1Tx8R08enM6Bn1r1hgR1d
## Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to effectively process sequential and temporal data. Unlike traditional feedforward neural networks, RNNs have a feedback loop that allows information to persist across different time steps.
RNNs are well-suited for tasks involving sequences, such as natural language processing, speech recognition, and time series analysis. They possess a hidden state that acts as a memory, allowing them to capture dependencies and patterns in sequential data.
At each time step, an RNN takes an input and combines it with the previous hidden state to produce an output and update the hidden state. This recurrent structure allows RNNs to consider the context of previous inputs when making predictions.
However, traditional RNNs suffer from the vanishing or exploding gradient problem, which hampers their ability to learn long-term dependencies. To address this, variations of RNNs have been developed, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which incorporate gating mechanisms to control the flow of information and alleviate the gradient issues.
By leveraging their recurrent nature, RNNs excel in tasks like sequence generation, sentiment analysis, machine translation, and speech synthesis. They have proven to be powerful models for capturing temporal patterns and dependencies, enabling machines to process and generate sequential data effectively.
### Architecture
The architecture of a Recurrent Neural Network (RNN) consists of recurrent layers that allow the network to maintain and propagate information across sequential data.

> `A` looks at the `xt` and then outputs a value `ht`. A loop allows information to be passed from one step of the network to the next.

It is important to note that, for text, an RNN takes one vector at a time and passes the hidden-state information on to the next vector, and so on. So if each vector holds a word, the RNN would look like this:

> The fading color of the word `what` (first vector) illustrates short-term memory, which is a problem caused by the vanishing gradient.

The key components of an RNN architecture are as follows:
- **Input:** At each time step, an input is provided to the RNN. This input can be a single value, a vector, or even a sequence of vectors representing sequential data.
- **Hidden State:** The hidden state of the RNN is a memory component that captures and encodes information from previous time steps. It represents the network's understanding of the context and provides a form of memory to process sequential data.
- **Recurrent Connections:** The recurrent connections in an RNN enable information to be passed from one time step to the next. They allow the hidden state to persist and carry forward information learned from previous time steps.
- **Activation Function:** An activation function is applied to the hidden state at each time step, introducing non-linearity to the network. Common choices include the sigmoid, tanh, or ReLU (Rectified Linear Unit) activation functions.
- **Output:** The output of the RNN can be obtained at each time step or at the final time step, depending on the task at hand. It can be a single value, a vector, or even a sequence of vectors depending on the desired output representation.

> `x<t>` is the input, `g1` is the hidden state or output of the first layer, `a<t>` are the recurrent connections, the circle within the circle is the activation function, `g2` is the output for the second layer, `y<t>` is the final state, `Wii` are the weights and `bi` are the bias terms.
The architecture of an RNN can vary depending on the specific task and requirements. Different types of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), introduce additional layers or modifications to the basic RNN architecture to address challenges like vanishing gradients and improve the model's ability to capture long-term dependencies.
Overall, the architecture of an RNN enables the network to process sequential data by maintaining and updating a hidden state, allowing it to learn and capture temporal information and dependencies within the input sequence.
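
A hedged NumPy sketch of a single recurrent step (the weight names and sizes are illustrative): the new hidden state mixes the current input with the previous hidden state, and an output is read off from that hidden state.

```python
# One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y
import numpy as np

input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden (recurrent) weights
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1  # hidden -> output weights
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # update the hidden state (memory)
    y_t = W_hy @ h_t + b_y                           # read an output from the hidden state
    return h_t, y_t

# Process a sequence of 6 input vectors, carrying the hidden state forward.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(6, input_dim)):
    h, y = rnn_step(x, h)
print(h.shape, y.shape)  # (8,) (3,)
```
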
### Types
Here are examples of different types of Recurrent Neural Networks (RNNs) based on their input-output mappings:

- **One-to-One**: The most basic form, where a single input maps to a single output, essentially behaving like a traditional feedforward neural network. Example: classifying a single fixed-size input, such as standard image classification.
- **One-to-Many**: In this type, the RNN takes a single input and generates a sequence of outputs. Example: Image captioning, where the RNN takes an image as input and generates a descriptive sentence.
- **Many-to-One**: Here, the RNN takes a sequence of inputs and produces a single output. Example: Sentiment classification, where the RNN takes a sequence of words as input and predicts the sentiment of the entire sentence.
- **Many-to-Many**: In this type, the RNN processes a sequence of inputs and produces a sequence of outputs, where the input and output lengths can vary. Example: Machine translation, where the RNN takes a sequence of words in one language and generates a sequence of translated words in another language.
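
In Keras-style code (a hedged sketch, not tied to any particular task), the practical difference between many-to-one and many-to-many is mostly whether the recurrent layer returns only its final hidden state or one output per time step:

```python
# Many-to-one vs many-to-many with a SimpleRNN: return_sequences controls
# whether we keep one output per time step or only the last one.
from tensorflow.keras import layers, models

timesteps, features = 20, 8  # illustrative sequence length and input size

many_to_one = models.Sequential([
    layers.Input(shape=(timesteps, features)),
    layers.SimpleRNN(16),                       # only the final hidden state
    layers.Dense(1, activation="sigmoid"),      # e.g. sentiment of the whole sequence
])

many_to_many = models.Sequential([
    layers.Input(shape=(timesteps, features)),
    layers.SimpleRNN(16, return_sequences=True),                    # one output per time step
    layers.TimeDistributed(layers.Dense(5, activation="softmax")),  # e.g. a label per step
])

print(many_to_one.output_shape)   # (None, 1)
print(many_to_many.output_shape)  # (None, 20, 5)
```
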
### Backpropagation through time
In order to train an RNN, one must first define a loss function (L) that estimates the error between the output and the true label and minimizes it using forward pass and backward pass.
The technique is carried out as follows for a single time step: the input is fed in, processed through a hidden layer (state), and the estimated label is produced. The loss function is then calculated to determine how much the true label and the estimated label differ from one another. The forward pass is completed by computing the total loss, L. In the second phase, known as the backward pass, the various derivatives are computed.

#### How it works:
Operations and functions that may be succinctly expressed as a computational graph are at the core of backpropagation. Consider the function f = z(x+y) as an example. Below is a representation of its computational graph:

In essence, a computational graph is a directed graph with operations and functions acting as nodes. The forward pass, which involves computing the outputs from the inputs, is typically displayed above the graph's edges.
In the backward pass, we calculate the gradients of the output with respect to the inputs and display them below the edges. Here, we compute gradients as we work our way from the output back to the inputs. Let's perform the backward pass for this example.
Throughout this section, let's write ∂a/∂b for the derivative of a with respect to b, and let q = x + y denote the intermediate node of the graph.

We begin by computing ∂f/∂f, which equals 1. Proceeding backward, we then compute ∂f/∂q, which equals z, and ∂f/∂z, which equals q. Finally, we compute ∂f/∂x and ∂f/∂y.
As you can see, we can't calculate ∂f/∂x and ∂f/∂y directly, so we use the chain rule: we first calculate ∂q/∂x and then multiply that result by the ∂f/∂q computed in the previous step to get ∂f/∂x. Here, the gradients ∂f/∂q, ∂f/∂x, and ∂q/∂x are referred to as the upstream, downstream, and local gradients, respectively.

```
downstream gradient = local gradient × upstream gradient
```
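
A quick numeric check of this rule for f = z(x + y), using arbitrary illustrative values:

```python
# Backward pass for f = z * (x + y) done by hand, with x = 2, y = 3, z = 4 (illustrative values).
x, y, z = 2.0, 3.0, 4.0

# forward pass
q = x + y          # intermediate node
f = z * q          # output, f = 20.0

# backward pass
df_dq = z               # upstream gradient arriving at q
df_dz = q               # gradient with respect to z
dq_dx = 1.0             # local gradient of q with respect to x
df_dx = df_dq * dq_dx   # downstream = local * upstream -> 4.0
df_dy = df_dq * 1.0     # same reasoning for y           -> 4.0
print(df_dx, df_dy, df_dz)  # 4.0 4.0 5.0
```
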
#### Backpropagation in RNN:

**Forward pass:** In the forward pass, the input vector and the hidden state vector from the prior timestep are multiplied by the corresponding weight matrices and are added together by the addition node at a specific timestep. Then, after passing through a non-linear function, they are duplicated: one is used as an input for the following time step, and the other is placed in the classification head where it is multiplied by a weight matrix to produce the logits vector before the cross-entropy loss is calculated.
**Backward pass:** Beginning at the end, we compute the gradient of the classification loss with respect to the logits vector. This gradient travels backward to the matrix multiplication node, where the gradients with respect to the weight matrix and the hidden state are computed. The gradient with respect to the hidden state then travels backward to the copy node, where it meets the gradient flowing back from the following time step. Because an RNN effectively processes sequences one step at a time, gradients flow backward across time steps during backpropagation; we refer to this as backpropagation through time. At the copy node, these two gradients are therefore added together.
The gradient then flows backward to the tanh non-linearity node, where the local gradient is d tanh(x)/dx = 1 - tanh²(x). The gradient then goes to the addition node, where it is distributed between the matrix multiplication nodes of the input vector and of the previous hidden state vector. Unless there is a special requirement, we rarely compute the gradient with respect to the input vector. Instead, we compute the gradient with respect to the previous hidden state vector, which then flows back to the previous time step.
### Vanishing and Exploding Gradients
Since we backpropagate gradients both through layers and through time, training an RNN is not simple. As the equation below shows, at each time step we must add up the contributions from all the time steps that came before the present one:
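
A standard way to write this (a sketch; the notation in the original figure may differ) is:

$$
\frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
\left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
\frac{\partial h_k}{\partial W}
$$

The repeated product of Jacobians ∂h_i/∂h_{i-1} is the term that causes trouble: if its factors are consistently smaller than 1 it shrinks toward zero, and if they are consistently larger than 1 it blows up.
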

Vanishing and exploding gradients are two frequent issues that come up when backpropagating through time. There are two problematic cases in the equation above:

- **Vanishing gradients** refer to the first case, where the term approaches 0 exponentially fast: the gradients calculated during backpropagation become extremely small as they propagate backward through time. This can happen when the weights in the network cause the gradient to shrink with each time step. As a result, the updates to the earlier layers of the network become negligible, making it difficult for the network to learn long-term dependencies.
The vanishing gradients problem hampers the ability of the RNN to propagate information over long sequences. Layers that are further back in time receive weak gradients, and thus the weights associated with those layers are not effectively updated. Consequently, the RNN struggles to capture and retain information from earlier time steps, limiting its ability to model long-term dependencies.
- **Exploding gradients**, on the other hand, occur when the gradients during backpropagation become extremely large, making the term increase exponentially. This can happen when the weight matrices are such that they cause the gradient to increase exponentially with each time step. As a result, the updates to the weights can become very large, leading to unstable training and difficulty in finding an optimal solution.
The exploding gradients problem can cause the weights to update by a large magnitude, resulting in overshooting the optimal weights and causing the network to diverge or fail to converge. This instability in training can make it challenging to optimize the network effectively.
#### [Addressing Vanishing and Exploding Gradients](https://www.analyticsvidhya.com/blog/2021/06/the-challenge-of-vanishing-exploding-gradients-in-deep-neural-networks/)
1. Truncated Backpropagation Through Time
2. Proper weight initialization
3. Using non-saturating activation functions
4. Batch normalization
5. Gradient Clipping
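
As a quick illustration of the last item, gradient clipping in Keras (a hedged sketch; other frameworks expose an equivalent option):

```python
# Gradient clipping: cap the size of the gradients before each update so one
# exploding gradient cannot blow up the weights.
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(None, 8)),   # variable-length sequences of 8-dimensional vectors
    layers.SimpleRNN(32),
    layers.Dense(1, activation="sigmoid"),
])

# clipnorm rescales any gradient whose L2 norm exceeds 1.0;
# clipvalue would instead cap each gradient component.
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
              loss="binary_crossentropy")
```
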
### [LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
### Code examples