# [Study] How Embedding & EmbeddingBag are used in NLP
###### tags: `research-DLRM`
[TOC]
## Problem description (NLP as an example)
Consider the following training sentences fed into a "sentiment analysis" model. The number 0 or 1 at the beginning indicates the sentiment of that sentence: 0 for negative and 1 for positive.
```
0 ~ This was a BAD movie.
1 ~ I liked this film!
0 ~ Just awful
...
```
Assume the batch size is 3, which means only 3 sentences are fed into the embedding lookup at a time.
:::info
Embedding is the mapping of concepts, objects or items into a vector space.
:::
## Word Embedding Approach
### Tokenizer
- Break sentence into several words
- Handle punctuation
- Convert to lower case
As a result, the training data is converted to the following form:
| this | was | a | bad | movie |
| ---- | ----- | ------- | ------- | ------- |
| i | liked | this | film | \<pad\> |
| just | awful | \<pad\> | \<pad\> | \<pad\> |
This table will then be converted to numbers. Each word will be mapped to a unique ID.
| 4 | 3 | 2 | 10 | 12 |
| ---- | ----- | ------- | ------- | ------- |
| 8 | 5 | 9 | 11 | 1 |
| 7 | 6 | 1 | 1 | 1 |
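
The tokenization and ID-mapping steps above can be sketched in Python. This is a minimal sketch; the ID numbering it produces is arbitrary and does not match the hand-picked IDs in the tables above:

```python
import re

def tokenize(sentence):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^a-z\s]", " ", sentence.lower()).split()

sentences = ["This was a BAD movie.", "I liked this film!", "Just awful"]
tokenized = [tokenize(s) for s in sentences]

# Pad every sentence to the length of the longest one.
max_len = max(len(t) for t in tokenized)
padded = [t + ["<pad>"] * (max_len - len(t)) for t in tokenized]

# Assign each word a unique ID; reserve 1 for <pad>.
vocab = {"<pad>": 1}
for t in tokenized:
    for w in t:
        vocab.setdefault(w, len(vocab) + 1)

ids = [[vocab[w] for w in sent] for sent in padded]
```

After this step, `ids` is a rectangular 3×5 list of integers, ready for an embedding lookup.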
### Embedding lookup
Matrix $E$ is an $m \times n$ matrix created for embedding lookup, where $m$ is the vocabulary size and $n$ is the embedding dimension.
For instance, if 13 vocabulary words are mapped into $R^2$, the embedding table is as follows:
```
tensor([[ 0., 13.],
[ 1., 14.],
[ 2., 15.],
[ 3., 16.],
[ 4., 17.],
[ 5., 18.],
[ 6., 19.],
[ 7., 20.],
[ 8., 19.],
[ 9., 20.],
[10., 23.],
[11., 24.],
[12., 25.]])
```
and the sentence "this was a bad movie" will be mapped into $R^2$:
| this | was | a | bad | movie |
| ---- | ----- | ------- | ------- | ------- |
| 4 | 3 | 2 | 10 | 12 |
|[ 4., 17.]|[ 3., 16.]|[ 2., 15.]|[10., 23.]|[12., 25.]|
| i | liked | this | film | \<pad\> |
| ---- | ----- | ------- | ------- | ------- |
| 8 | 5 | 9 | 11 | 1 |
|[ 8., 19.]|[ 5., 18.]|[ 9., 20.]|[11., 24.]|[1., 14.]|
| just | awful | \<pad\> | \<pad\> | \<pad\> |
| ---- | ----- | ------- | ------- | ------- |
| 7 | 6 | 1 | 1 | 1 |
|[ 7., 20.]|[ 6., 19.]|[ 1., 14.]|[ 1., 14.]|[ 1., 14.]|
Finally, these 3 sentences are converted to
$$ \text{This was a bad movie} = \left\{
\begin{matrix}
4 & 3 & 2 & 10 & 12 \\
17 & 16 & 15 & 23 & 25 \\
\end{matrix}
\right\}
$$
$$ \text{I liked this film} = \left\{
\begin{matrix}
8 & 5 & 9 & 11 & 1 \\
19 & 18 & 20 & 24 & 14 \\
\end{matrix}
\right\}
$$
$$ \text{Just awful} = \left\{
\begin{matrix}
7 & 6 & 1 & 1 & 1 \\
20 & 19 & 14 & 14 & 14 \\
\end{matrix}
\right\}
$$
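
The lookup above can be reproduced with PyTorch's `nn.Embedding`. A minimal sketch, using the fixed example table from earlier (in a real model these weights are learned, not hand-picked):

```python
import torch

# Fixed example embedding table E (13 words mapped into R^2).
weights = torch.tensor([
    [ 0., 13.], [ 1., 14.], [ 2., 15.], [ 3., 16.], [ 4., 17.],
    [ 5., 18.], [ 6., 19.], [ 7., 20.], [ 8., 19.], [ 9., 20.],
    [10., 23.], [11., 24.], [12., 25.],
])
embedding = torch.nn.Embedding.from_pretrained(weights)

# One row of IDs per (padded) sentence, batch size 3.
ids = torch.tensor([
    [4, 3, 2, 10, 12],   # this was a bad movie
    [8, 5, 9, 11,  1],   # i liked this film <pad>
    [7, 6, 1,  1,  1],   # just awful <pad> <pad> <pad>
])
vectors = embedding(ids)  # shape: (3, 5, 2)
```

Note that every ID, including the padding ID 1, is looked up, so the output keeps one vector per token.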
## Embedding Bag Approach
### Tokenizer
Tokenization works the same way as before, but the sentences are concatenated into a single flat sequence, so no padding is needed. The training data is converted to the following form:
| this | was | a | bad | movie | i | liked | this | film | just | awful |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
This table will then be converted to numbers. Each word will be mapped to a unique ID. The `^` markers below indicate where each sentence starts.
| 4 | 3 | 2 | 10 | 12 | 8 | 5 | 9 | 11 | 7 | 6 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ^ | | | | | ^ | | | | ^ | |
The "offset" table records the starting index of each sentence in the flattened input.

| Offset | 0 | 5 | 9 |
| ------ | - | - | - |

| Input | 4 | 3 | 2 | 10 | 12 | 8 | 5 | 9 | 11 | 7 | 6 |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Index | [0] | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9] | [10] |
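
Building the flat input tensor and its offsets can be sketched as follows (a hedged sketch; real pipelines usually do this inside a collate function):

```python
import torch

# Tokenized sentences already mapped to IDs; padding is not needed
# because EmbeddingBag works on one flat sequence plus offsets.
sent_ids = [[4, 3, 2, 10, 12], [8, 5, 9, 11], [7, 6]]

# Flatten all IDs into a single input tensor.
input_ids = torch.tensor([i for s in sent_ids for i in s])

# Each offset is the starting index of a sentence in the flat input.
offsets = torch.tensor([0] + [len(s) for s in sent_ids[:-1]]).cumsum(dim=0)
```

The cumulative sum of the sentence lengths (shifted by one) yields exactly the offsets `[0, 5, 9]` from the table above.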
### EmbeddingBag lookup
- Break down sentences
- input[0:5] = tensor([4, 3, 2, 10, 12])
- input[5:9] = tensor([8, 5, 9, 11])
- input[9:] = tensor([7, 6])
- Embedding table lookup
- Look up rows {4, 3, 2, 10, 12} of matrix $E$, which gives a $5 \times 2$ matrix.
$$\left\{
\begin{matrix}
4 & 17 \\
3 & 16 \\
2 & 15 \\
10 & 23 \\
12 & 25 \\
\end{matrix}
\right\}
$$
- Look up rows {8, 5, 9, 11} of matrix $E$, which gives a $4 \times 2$ matrix.
$$\left\{
\begin{matrix}
8 & 19 \\
5 & 18 \\
9 & 20 \\
11 & 24 \\
\end{matrix}
\right\}
$$
- Look up rows {7, 6} of matrix $E$, which gives a $2 \times 2$ matrix.
$$\left\{
\begin{matrix}
7 & 20 \\
6 & 19 \\
\end{matrix}
\right\}
$$
- Apply the reduction operation (assume sum)
- The resulting matrix for each sentence is summed along its first dimension, producing one feature vector per sentence.
- These 3 sentences are converted to the following matrix:
$$ \text{feature} = \left\{
\begin{matrix}
31 & 96 \\
33 & 81 \\
13 & 39 \\
\end{matrix}
\right\}
$$
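
PyTorch's `nn.EmbeddingBag` fuses the lookup and the per-sentence reduction into a single call. A minimal sketch that reproduces the feature matrix above, using the same fixed example table:

```python
import torch

# Same fixed example embedding table E as in the Embedding section.
weights = torch.tensor([
    [ 0., 13.], [ 1., 14.], [ 2., 15.], [ 3., 16.], [ 4., 17.],
    [ 5., 18.], [ 6., 19.], [ 7., 20.], [ 8., 19.], [ 9., 20.],
    [10., 23.], [11., 24.], [12., 25.],
])
bag = torch.nn.EmbeddingBag.from_pretrained(weights, mode="sum")

# Flat input IDs and the starting offset of each sentence.
input_ids = torch.tensor([4, 3, 2, 10, 12, 8, 5, 9, 11, 7, 6])
offsets = torch.tensor([0, 5, 9])

# Lookup and per-sentence sum happen in one fused call.
features = bag(input_ids, offsets)  # shape: (3, 2)
```

Unlike the two-step Embedding approach, the intermediate per-token matrices are never materialized, which is why EmbeddingBag is more memory-efficient for this workload.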
## Reference
- [Sentiment Analysis Using a PyTorch EmbeddingBag Layer](https://visualstudiomagazine.com/Articles/2021/07/06/sentiment-analysis.aspx?Page=1)