# [Study] How Embedding & EmbeddingBag are used in NLP
###### tags: `research-DLRM`
[TOC]
## Problem description (NLP as an example)
Consider the following training sentences fed into a "sentiment analysis" model. The number 0 or 1 at the beginning indicates the sentiment of that sentence: 0 for negative and 1 for positive.
```
0 ~ This was a BAD movie.
1 ~ I liked this film!
0 ~ Just awful
...
```
Assume the batch size is 3, which means only 3 sentences are fed into the embedding lookup at a time.
:::info
Embedding is the mapping of concepts, objects or items into a vector space.
:::
## Word Embedding Approach
### Tokenizer
- Break sentence into several words
- Handle punctuation
- Convert to lower case
As a result, the training data is converted to the following form:
| this | was | a | bad | movie |
| ---- | ----- | ------- | ------- | ------- |
| i | liked | this | film | \<pad\> |
| just | awful | \<pad\> | \<pad\> | \<pad\> |
This table will then be converted to numbers. Each word will be mapped to a unique ID.
| 4 | 3 | 2 | 10 | 12 |
| ---- | ----- | ------- | ------- | ------- |
| 8 | 5 | 9 | 11 | 1 |
| 7 | 6 | 1 | 1 | 1 |
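
The tokenization and ID-mapping steps above can be sketched in Python. This is a minimal sketch; the ID numbering it produces is arbitrary and does not match the hand-picked IDs in the tables above:

```python
import re

def tokenize(sentence):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^a-z\s]", " ", sentence.lower()).split()

sentences = ["This was a BAD movie.", "I liked this film!", "Just awful"]
tokenized = [tokenize(s) for s in sentences]

# Pad every sentence to the length of the longest one.
max_len = max(len(t) for t in tokenized)
padded = [t + ["<pad>"] * (max_len - len(t)) for t in tokenized]

# Assign each word a unique ID; reserve 1 for <pad>.
vocab = {"<pad>": 1}
for t in tokenized:
    for w in t:
        vocab.setdefault(w, len(vocab) + 1)

ids = [[vocab[w] for w in sent] for sent in padded]
```

After this step, `ids` is a rectangular 3×5 list of integers, ready for an embedding lookup.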
### Embedding lookup
Matrix $E$ is an $m \times n$ matrix created for embedding lookup, where $m$ is the vocabulary size and $n$ is the embedding dimension.
For instance, if 13 vocabulary words are mapped into $R^2$, the embedding table is as follows:
```
tensor([[ 0., 13.],
[ 1., 14.],
[ 2., 15.],
[ 3., 16.],
[ 4., 17.],
[ 5., 18.],
[ 6., 19.],
[ 7., 20.],
[ 8., 19.],
[ 9., 20.],
[10., 23.],
[11., 24.],
[12., 25.]])
```
and the sentence "this was a bad movie" will be mapped into $R^2$:
| this | was | a | bad | movie |
| ---- | ----- | ------- | ------- | ------- |
| 4 | 3 | 2 | 10 | 12 |
|[ 4., 17.]|[ 3., 16.]|[ 2., 15.]|[10., 23.]|[12., 25.]|
| i | liked | this | film | \<pad\> |
| ---- | ----- | ------- | ------- | ------- |
| 8 | 5 | 9 | 11 | 1 |
|[ 8., 19.]|[ 5., 18.]|[ 9., 20.]|[11., 24.]|[1., 14.]|
| just | awful | \<pad\> | \<pad\> | \<pad\> |
| ---- | ----- | ------- | ------- | ------- |
| 7 | 6 | 1 | 1 | 1 |
|[ 7., 20.]|[ 6., 19.]|[ 1., 14.]|[ 1., 14.]|[ 1., 14.]|
Finally, these 3 sentences are converted to
$$ \text{This was a bad movie} = \left\{
\begin{matrix}
4 & 3 & 2 & 10 & 12 \\
17 & 16 & 15 & 23 & 25 \\
\end{matrix}
\right\}
$$
$$ \text{I liked this film} = \left\{
\begin{matrix}
8 & 5 & 9 & 11 & 1 \\
19 & 18 & 20 & 24 & 14 \\
\end{matrix}
\right\}
$$
$$ \text{Just awful} = \left\{
\begin{matrix}
7 & 6 & 1 & 1 & 1 \\
20 & 19 & 14 & 14 & 14 \\
\end{matrix}
\right\}
$$
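
The lookup above can be reproduced with PyTorch's `nn.Embedding`. A minimal sketch, using the fixed example table from earlier (in a real model these weights are learned, not hand-picked):

```python
import torch

# Fixed example embedding table E (13 words mapped into R^2).
weights = torch.tensor([
    [ 0., 13.], [ 1., 14.], [ 2., 15.], [ 3., 16.], [ 4., 17.],
    [ 5., 18.], [ 6., 19.], [ 7., 20.], [ 8., 19.], [ 9., 20.],
    [10., 23.], [11., 24.], [12., 25.],
])
embedding = torch.nn.Embedding.from_pretrained(weights)

# One row of IDs per (padded) sentence, batch size 3.
ids = torch.tensor([
    [4, 3, 2, 10, 12],   # this was a bad movie
    [8, 5, 9, 11,  1],   # i liked this film <pad>
    [7, 6, 1,  1,  1],   # just awful <pad> <pad> <pad>
])
vectors = embedding(ids)  # shape: (3, 5, 2)
```

Note that every ID, including the padding ID 1, is looked up, so the output keeps one vector per token.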
## Embedding Bag Approach
### Tokenizer
Tokenization works the same way as before, but the sentences are concatenated into a single flat sequence, so no padding is needed. The training data is converted to the following form:
| this | was | a | bad | movie | i | liked | this | film | just | awful |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
This table will then be converted to numbers. Each word will be mapped to a unique ID. The `^` markers below indicate where each sentence starts.
| 4 | 3 | 2 | 10 | 12 | 8 | 5 | 9 | 11 | 7 | 6 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ^ | | | | | ^ | | | | ^ | |
The "offset" table records the starting index of each sentence in the flattened input.

| Offset | 0 | 5 | 9 |
| ------ | - | - | - |

| Input | 4 | 3 | 2 | 10 | 12 | 8 | 5 | 9 | 11 | 7 | 6 |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Index | [0] | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9] | [10] |
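
Building the flat input tensor and its offsets can be sketched as follows (a hedged sketch; real pipelines usually do this inside a collate function):

```python
import torch

# Tokenized sentences already mapped to IDs; padding is not needed
# because EmbeddingBag works on one flat sequence plus offsets.
sent_ids = [[4, 3, 2, 10, 12], [8, 5, 9, 11], [7, 6]]

# Flatten all IDs into a single input tensor.
input_ids = torch.tensor([i for s in sent_ids for i in s])

# Each offset is the starting index of a sentence in the flat input.
offsets = torch.tensor([0] + [len(s) for s in sent_ids[:-1]]).cumsum(dim=0)
```

The cumulative sum of the sentence lengths (shifted by one) yields exactly the offsets `[0, 5, 9]` from the table above.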
### EmbeddingBag lookup
- Break down sentences
- input[0:5] = tensor([4, 3, 2, 10, 12])
- input[5:9] = tensor([8, 5, 9, 11])
- input[9:] = tensor([7, 6])
- Embedding table lookup
- Look up rows {4, 3, 2, 10, 12} of matrix $E$, which gives a $5 \times 2$ matrix.
$$\left\{
\begin{matrix}
4 & 17 \\
3 & 16 \\
2 & 15 \\
10 & 23 \\
12 & 25 \\
\end{matrix}
\right\}
$$
- Look up rows {8, 5, 9, 11} of matrix $E$, which gives a $4 \times 2$ matrix.
$$\left\{
\begin{matrix}
8 & 19 \\
5 & 18 \\
9 & 20 \\
11 & 24 \\
\end{matrix}
\right\}
$$
- Look up rows {7, 6} of matrix $E$, which gives a $2 \times 2$ matrix.
$$\left\{
\begin{matrix}
7 & 20 \\
6 & 19 \\
\end{matrix}
\right\}
$$
- Apply the reduction operation (assume sum)
- The resulting matrix for each sentence is summed along its first dimension, producing one feature vector per sentence.
- These 3 sentences are converted to the following matrix:
$$ \text{feature} = \left\{
\begin{matrix}
31 & 96 \\
33 & 81 \\
13 & 39 \\
\end{matrix}
\right\}
$$
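
PyTorch's `nn.EmbeddingBag` fuses the lookup and the per-sentence reduction into a single call. A minimal sketch that reproduces the feature matrix above, using the same fixed example table:

```python
import torch

# Same fixed example embedding table E as in the Embedding section.
weights = torch.tensor([
    [ 0., 13.], [ 1., 14.], [ 2., 15.], [ 3., 16.], [ 4., 17.],
    [ 5., 18.], [ 6., 19.], [ 7., 20.], [ 8., 19.], [ 9., 20.],
    [10., 23.], [11., 24.], [12., 25.],
])
bag = torch.nn.EmbeddingBag.from_pretrained(weights, mode="sum")

# Flat input IDs and the starting offset of each sentence.
input_ids = torch.tensor([4, 3, 2, 10, 12, 8, 5, 9, 11, 7, 6])
offsets = torch.tensor([0, 5, 9])

# Lookup and per-sentence sum happen in one fused call.
features = bag(input_ids, offsets)  # shape: (3, 2)
```

Unlike the two-step Embedding approach, the intermediate per-token matrices are never materialized, which is why EmbeddingBag is more memory-efficient for this workload.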
## Reference
- [Sentiment Analysis Using a PyTorch EmbeddingBag Layer](https://visualstudiomagazine.com/Articles/2021/07/06/sentiment-analysis.aspx?Page=1)