# Proposal
## Modeling Word Embedding Drift From Sparse Data
#### Motivation
In a recommender system for online retail, "vacation" might be associated with "swimwear" in summer, while in winter "vacation" might be associated with "skiing". Thus, even if "gloves" or "snow boots" appear only rarely in our data, we might infer from the proximity of "skiing" to "vacation" that "gloves" and "snow boots" should also be close to "vacation" during winter.
#### Step 1: Synthetic Dataset Creation
0. Freeze a vocabulary of 100 synthetic words.
1. Make a dataset of 1000 mini-datasets as follows (a minimal sketch follows this list).
- a) Treating the words as nodes of an undirected graph, create a random graph $G_1$ with a random weight on each edge. You may keep the graph sparse by setting most of the edge weights to 0.
- b) Make synthetic text $D_1$ by performing a random walk on the graph, with transition probabilities proportional to edge weights.
- c) Sample a "drift" for the graph (e.g., changing the weight of one of the edges). You will need to define what counts as a reasonable drift.
- d) This will give you a drifted graph $G_2$.
- e) From $G_2$, make a synthetic dataset $D_2^\text{small}$ containing only 50 of the 100 words.
- f) Similarly, make a fuller dataset $D_2^\text{full}$, this time containing all 100 words.
- g) Keep 800 mini-datasets for training, 100 for validation, and 100 for testing.
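Below is a minimal sketch of steps (a)–(f), assuming a NumPy-only setup in which the graph is a symmetric weight matrix, the synthetic text is a weighted random walk, and drift is a perturbation of a few edge weights. The function names and parameters (`sparsity`, `length`, `n_edges`, `scale`) are illustrative choices, not part of the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = [f"w{i}" for i in range(100)]   # frozen vocabulary of 100 synthetic words
N = len(VOCAB)

def random_graph(sparsity=0.9):
    """Symmetric weight matrix over the vocabulary; most edges are zero (sparse)."""
    W = np.triu(rng.random((N, N)), 1)           # random upper-triangular weights
    W[rng.random((N, N)) < sparsity] = 0.0       # drop most edges to keep the graph sparse
    return W + W.T                               # symmetrize (undirected graph)

def random_walk(W, length=2000, restrict=None):
    """Generate synthetic 'text' by a random walk with edge-weight-proportional transitions."""
    allowed = np.arange(N) if restrict is None else np.asarray(restrict)
    mask = np.zeros(N, dtype=bool)
    mask[allowed] = True
    walk, cur = [], rng.choice(allowed)
    for _ in range(length):
        walk.append(VOCAB[cur])
        w = np.where(mask, W[cur], 0.0)          # only step to allowed words
        cur = rng.choice(N, p=w / w.sum()) if w.sum() > 0 else rng.choice(allowed)
    return walk

def drift(W, n_edges=5, scale=0.5):
    """One possible notion of drift: perturb the weights of a few random edges."""
    W2 = W.copy()
    for i, j in rng.integers(0, N, size=(n_edges, 2)):
        if i != j:
            W2[i, j] = W2[j, i] = max(0.0, W2[i, j] + rng.normal(0.0, scale))
    return W2

# one mini-dataset; repeat 1000 times and split 800 / 100 / 100
G1 = random_graph()
D1 = random_walk(G1)                                   # text from the original graph
G2 = drift(G1)                                         # drifted graph
small_vocab = rng.choice(N, size=50, replace=False)
D2_small = random_walk(G2, restrict=small_vocab)       # only 50 of the 100 words
D2_full = random_walk(G2)                              # all 100 words
```

Whether the walk for $D_2^\text{small}$ should skip the held-out words or re-normalize over the restricted sub-vocabulary is a design choice; the sketch above re-normalizes.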
#### Step 2: Estimate Drift
For each mini-dataset:
2. Train word embedding $E_1^\text{full}$ using $D_1$.
3. Train word embedding $E_2^\text{small}$ for the 50 words using $D_2^\text{small}$.
4. Train word embedding $E_2^\text{full}$ for all the words using $D_2^\text{full}$.
5. Using the drift in the embeddings of the 50 shared words, i.e. from $E_1^\text{small}$ (the rows of $E_1^\text{full}$ restricted to those 50 words) to $E_2^\text{small}$, as input, try to predict $E_2^\text{full}$ as the learning target. You may use any neural network, or a more recent architecture such as a transformer.
- a) Look up a simple implementation of transformers. Play with it.
- b) In a nutshell, a transformer takes a set of inputs and returns a set of outputs, and each output depends on all the inputs in a non-linear way:
$$
y_{1:T} \leftarrow \text{Transformer}(x_{1:T})
$$
So if there are $T$ inputs, there will be $T$ outputs, each output corresponding to one input.
- c) Thus, we need to construct a set input to give to the transformer.
$$
E_2^\text{small} \cup E_1^\text{full}
$$
where $E_2^\text{small}$ contributes $M = 50$ tokens and $E_1^\text{full}$ contributes $N = 100$ tokens. This gives $M + N$ outputs, but we only want $N$ of them, so we ignore the $M$ outputs corresponding to $E_2^\text{small}$ and keep the $N$ outputs corresponding to $E_1^\text{full}$.
- d) To train, minimize an MSE loss between the predictions and the targets $E_2^\text{full}$ (a minimal sketch follows below).
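Here is a minimal PyTorch sketch of (b)–(d), assuming the embeddings are handed to the model as tensors of shape `(batch, tokens, dim)`; `DriftPredictor` and all hyperparameters below are illustrative, not prescribed by the proposal.

```python
import torch
import torch.nn as nn

class DriftPredictor(nn.Module):
    """Set-to-set model: the input is the union of E2_small (M tokens) and E1_full (N tokens);
    only the N outputs aligned with E1_full are kept and regressed onto E2_full."""
    def __init__(self, dim=32, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, e2_small, e1_full):
        # e2_small: (B, M, dim), e1_full: (B, N, dim)
        x = torch.cat([e2_small, e1_full], dim=1)   # (B, M + N, dim) set input
        h = self.encoder(x)                         # (B, M + N, dim) set output
        h_full = h[:, e2_small.size(1):, :]         # ignore the M outputs, keep the N
        return self.head(h_full)                    # predicted E2_full, shape (B, N, dim)

# toy training step with random tensors standing in for the embeddings
M, N, dim = 50, 100, 32
model = DriftPredictor(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
e2_small, e1_full, e2_full = (torch.randn(8, s, dim) for s in (M, N, N))
loss = nn.functional.mse_loss(model(e2_small, e1_full), e2_full)  # MSE against E2_full targets
loss.backward()
opt.step()
```

Since a transformer without positional encodings is permutation-equivariant, this really is a set model; in practice one would likely also add word-identity embeddings so the model can tell which small-vocabulary token corresponds to which full-vocabulary token.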
#### Careful
1. Make sure that when you train $E_2^\text{small}$ and $E_2^\text{full}$, they are aligned with $E_1^\text{full}$. Otherwise, due to random-seed effects, there may be no clear pattern for predicting $E_2^\text{full}$ from $E_1^\text{full}$. Maybe this alignment is not needed if the transformer figures out that such randomness can exist.
2. Alignment can be ensured by using $E_1^\text{full}$ as the initialization when training the word embeddings $E_2^\text{full}$ and $E_2^\text{small}$ (see the sketch below).
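A sketch of this alignment, assuming gensim 4.x `Word2Vec` and reusing `D1`, `D2_small`, `D2_full` from the Step 1 sketch; `train_from`, `chunk`, and their parameters are illustrative choices.

```python
from gensim.models import Word2Vec

def chunk(tokens, size=20):
    """Split a flat token list (e.g. a random walk from Step 1) into 'sentences' for gensim."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def train_from(e1, corpus, dim=32, seed=0, epochs=20):
    """Train an embedding on `corpus`, initializing every word that `e1` knows from e1's
    vectors, so the new embedding stays aligned with e1's coordinate system."""
    e2 = Word2Vec(vector_size=dim, min_count=1, seed=seed, workers=1)
    e2.build_vocab(corpus)
    for word, idx in e2.wv.key_to_index.items():        # copy shared vectors from E1
        if word in e1.wv.key_to_index:
            e2.wv.vectors[idx] = e1.wv.vectors[e1.wv.key_to_index[word]]
    e2.train(corpus, total_examples=e2.corpus_count, epochs=epochs)
    return e2

# per mini-dataset (D1, D2_small, D2_full from the Step 1 sketch)
e1_full = Word2Vec(chunk(D1), vector_size=32, min_count=1, seed=0, workers=1, epochs=20)
e2_small = train_from(e1_full, chunk(D2_small))
e2_full = train_from(e1_full, chunk(D2_full))
```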
<!-- $$
\mathcal{V}_\text{partial} = w_{1:M}\\
\mathcal{V}_\text{whole} = w_{1:N}\\
$$
$$
e_{1:N}^{D_1^\text{large}}\\
e_{1:M}^{D_2^\text{small}}\\
e_{1:N}^{D_2^\text{large}}
$$ -->
<!--
## Modeling Word-Embedding Drift from Sparse Data
<!-- Given :
$$
x_{1:N, t}
$$
where $N$ is the vocabulary size.
Generative Model:
$$
p(x_{1:N, t} | x_{1:N, t-1}) = p(x_{1:N, t}|x_{1:N, t-1}, z_{t-1,t}) p(z_{t-1,t}|x_{1:N, t-1}) \\= \big[\prod_{n=1}^N p(x_{n, t}|x_{n, t-1}, z_{t-1,t}) \big]p(z_{t-1,t}|x_{1:N, t-1})
$$
Inference step :
$$
q(x_{M:N, t} | x_{1:N, t-1}, x_{1:M-1, t})
$$
-->
<!-- Main Idea : -->
<!-- ###### Main Idea
- At every time-step, we receive only little data which do not use all words in variety of contexts.
- By adjusting the word embedding based on estimated drift, we should be better able to prevent performance drop over time.
- -->
<!--
###### Method
Training time,
$$
e = \text{Transformer}(x_{1:M, t-1}, x_{1:M,t})
$$
$$
\hat{x}_{n,t} \leftarrow \text{RNN}(x_{n, t-1}, e)
$$
To train, minimize the MSE loss
$$
|| x_{n,t} - \hat{x}_{n,t}||^2
$$
Testing Time,
$$
e = \text{Transformer}(x_{1:M, t-1}, x_{1:M,t})
$$
$$
\hat{x}_{n,t} \leftarrow \text{RNN}(x_{n, t-1}, e)
$$
#### Evaluation Methodology
1. We shall do a downstream task.
2. And we will evaluate how the downstream task performance loss is mitigated when we use our model is used to adjust the embedding.
###### Downstream Task Possibilities
1. Product Recommendation data temporal?
2. Sentiment
3. Toy Dataset (I prefer this!)
###### Baselines and Models
1. Use same embedding as previous time-step
2. Baseline method to adjust embedding
3. Our method
###### Metric
Same as the metric of the downstream task.
- For example, recommendation. nDCG
- For example, sentiment classification. Accuracy
-->