# Proposal
## Modeling Word Embedding Drift From Sparse Data
#### Motivation
In a recommender system for online retail, "vacation" might be associated with "swimwear" in summer, while in winter "vacation" might be associated with "skiing". Thus, even if "gloves" or "snow boots" appear only rarely in our data, we might infer from the proximity of "skiing" to "vacation" that "gloves" and "snow boots" should also be close to "vacation" during winter.
#### Step 1: Synthetic Dataset Creation
0. Freeze a vocabulary of 100 synthetic words.
1. Make a dataset of 1000 mini-datasets as follows (a minimal sketch follows this list).
- a) Treating the words as nodes of an undirected graph, create a random graph $G_1$ with a random weight on each edge. You may keep the graph sparse by setting most of the edge weights to 0.
- b) Make synthetic text $D_1$ by performing a random walk on the graph, with transition probabilities proportional to edge weights.
- c) Sample a "drift" for the graph (e.g., changing the weight of one of the edges). You will need to define what counts as a reasonable drift.
- d) This will give you a drifted graph $G_2$.
- e) From $G_2$, make a synthetic dataset $D_2^\text{small}$ containing only 50 of the 100 words.
- f) Similarly, make a fuller dataset $D_2^\text{full}$, this time containing all 100 words.
- g) Keep 800 mini-datasets for training, 100 for validation, and 100 for testing.
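Below is a minimal sketch of steps (a)–(f), assuming a NumPy-only setup in which the graph is a symmetric weight matrix, the synthetic text is a weighted random walk, and drift is a perturbation of a few edge weights. The function names and parameters (`sparsity`, `length`, `n_edges`, `scale`) are illustrative choices, not part of the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = [f"w{i}" for i in range(100)]   # frozen vocabulary of 100 synthetic words
N = len(VOCAB)

def random_graph(sparsity=0.9):
    """Symmetric weight matrix over the vocabulary; most edges are zero (sparse)."""
    W = np.triu(rng.random((N, N)), 1)           # random upper-triangular weights
    W[rng.random((N, N)) < sparsity] = 0.0       # drop most edges to keep the graph sparse
    return W + W.T                               # symmetrize (undirected graph)

def random_walk(W, length=2000, restrict=None):
    """Generate synthetic 'text' by a random walk with edge-weight-proportional transitions."""
    allowed = np.arange(N) if restrict is None else np.asarray(restrict)
    mask = np.zeros(N, dtype=bool)
    mask[allowed] = True
    walk, cur = [], rng.choice(allowed)
    for _ in range(length):
        walk.append(VOCAB[cur])
        w = np.where(mask, W[cur], 0.0)          # only step to allowed words
        cur = rng.choice(N, p=w / w.sum()) if w.sum() > 0 else rng.choice(allowed)
    return walk

def drift(W, n_edges=5, scale=0.5):
    """One possible notion of drift: perturb the weights of a few random edges."""
    W2 = W.copy()
    for i, j in rng.integers(0, N, size=(n_edges, 2)):
        if i != j:
            W2[i, j] = W2[j, i] = max(0.0, W2[i, j] + rng.normal(0.0, scale))
    return W2

# one mini-dataset; repeat 1000 times and split 800 / 100 / 100
G1 = random_graph()
D1 = random_walk(G1)                                   # text from the original graph
G2 = drift(G1)                                         # drifted graph
small_vocab = rng.choice(N, size=50, replace=False)
D2_small = random_walk(G2, restrict=small_vocab)       # only 50 of the 100 words
D2_full = random_walk(G2)                              # all 100 words
```

Whether the walk for $D_2^\text{small}$ should skip the held-out words or re-normalize over the restricted sub-vocabulary is a design choice; the sketch above re-normalizes.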
#### Step 2: Estimate Drift
For each mini-dataset:
2. Train word embedding $E_1^\text{full}$ using $D_1$.
3. Train word embedding $E_2^\text{small}$ for the 50 words using $D_2^\text{small}$.
4. Train word embedding $E_2^\text{full}$ for all the words using $D_2^\text{full}$.
5. Using the drift in the embeddings of the 50 shared words, i.e. from $E_1^\text{small}$ (the rows of $E_1^\text{full}$ restricted to those 50 words) to $E_2^\text{small}$, as input, try to predict $E_2^\text{full}$ as the learning target. You may use any neural network, or a more recent architecture such as a transformer.
- a) Look up a simple implementation of transformers. Play with it.
- b) In a nutshell, a transformer takes a set of inputs and returns a set of outputs, and each output depends on all the inputs in a non-linear way:
$$
y_{1:T} \leftarrow \text{Transformer}(x_{1:T})
$$
So if there are $T$ inputs, there will be $T$ outputs, each output corresponding to one input.
- c) Thus, we need to construct a set input to give to the transformer.
$$
E_2^\text{small} \cup E_1^\text{full}
$$
where $E_2^\text{small}$ contributes $M = 50$ tokens and $E_1^\text{full}$ contributes $N = 100$ tokens. This gives $M + N$ outputs, but we only want $N$ of them, so we ignore the $M$ outputs corresponding to $E_2^\text{small}$ and keep the $N$ outputs corresponding to $E_1^\text{full}$.
- d) To train, minimize an MSE loss between the predictions and the targets $E_2^\text{full}$ (a minimal sketch follows below).
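Here is a minimal PyTorch sketch of (b)–(d), assuming the embeddings are handed to the model as tensors of shape `(batch, tokens, dim)`; `DriftPredictor` and all hyperparameters below are illustrative, not prescribed by the proposal.

```python
import torch
import torch.nn as nn

class DriftPredictor(nn.Module):
    """Set-to-set model: the input is the union of E2_small (M tokens) and E1_full (N tokens);
    only the N outputs aligned with E1_full are kept and regressed onto E2_full."""
    def __init__(self, dim=32, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, e2_small, e1_full):
        # e2_small: (B, M, dim), e1_full: (B, N, dim)
        x = torch.cat([e2_small, e1_full], dim=1)   # (B, M + N, dim) set input
        h = self.encoder(x)                         # (B, M + N, dim) set output
        h_full = h[:, e2_small.size(1):, :]         # ignore the M outputs, keep the N
        return self.head(h_full)                    # predicted E2_full, shape (B, N, dim)

# toy training step with random tensors standing in for the embeddings
M, N, dim = 50, 100, 32
model = DriftPredictor(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
e2_small, e1_full, e2_full = (torch.randn(8, s, dim) for s in (M, N, N))
loss = nn.functional.mse_loss(model(e2_small, e1_full), e2_full)  # MSE against E2_full targets
loss.backward()
opt.step()
```

Since a transformer without positional encodings is permutation-equivariant, this really is a set model; in practice one would likely also add word-identity embeddings so the model can tell which small-vocabulary token corresponds to which full-vocabulary token.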
#### Careful
1. Make sure that when you train $E_2^\text{small}$ and $E_2^\text{full}$, they are aligned with $E_1^\text{full}$. Otherwise, due to random-seed effects, there may be no clear pattern for predicting $E_2^\text{full}$ from $E_1^\text{full}$. Maybe this alignment is not needed if the transformer figures out that such randomness can exist.
2. Alignment can be ensured by using $E_1^\text{full}$ as the initialization when training the word embeddings $E_2^\text{full}$ and $E_2^\text{small}$ (see the sketch below).
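A sketch of this alignment, assuming gensim 4.x `Word2Vec` and reusing `D1`, `D2_small`, `D2_full` from the Step 1 sketch; `train_from`, `chunk`, and their parameters are illustrative choices.

```python
from gensim.models import Word2Vec

def chunk(tokens, size=20):
    """Split a flat token list (e.g. a random walk from Step 1) into 'sentences' for gensim."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def train_from(e1, corpus, dim=32, seed=0, epochs=20):
    """Train an embedding on `corpus`, initializing every word that `e1` knows from e1's
    vectors, so the new embedding stays aligned with e1's coordinate system."""
    e2 = Word2Vec(vector_size=dim, min_count=1, seed=seed, workers=1)
    e2.build_vocab(corpus)
    for word, idx in e2.wv.key_to_index.items():        # copy shared vectors from E1
        if word in e1.wv.key_to_index:
            e2.wv.vectors[idx] = e1.wv.vectors[e1.wv.key_to_index[word]]
    e2.train(corpus, total_examples=e2.corpus_count, epochs=epochs)
    return e2

# per mini-dataset (D1, D2_small, D2_full from the Step 1 sketch)
e1_full = Word2Vec(chunk(D1), vector_size=32, min_count=1, seed=0, workers=1, epochs=20)
e2_small = train_from(e1_full, chunk(D2_small))
e2_full = train_from(e1_full, chunk(D2_full))
```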
<!-- $$
\mathcal{V}_\text{partial} = w_{1:M}\\
\mathcal{V}_\text{whole} = w_{1:N}\\
$$
$$
e_{1:N}^{D_1^\text{large}}\\
e_{1:M}^{D_2^\text{small}}\\
e_{1:N}^{D_2^\text{large}}
$$ -->
<!--
## Modeling Word-Embedding Drift from Sparse Data
<!-- Given :
$$
x_{1:N, t}
$$
where $N$ is the vocabulary size.
Generative Model:
$$
p(x_{1:N, t} | x_{1:N, t-1}) = p(x_{1:N, t}|x_{1:N, t-1}, z_{t-1,t}) p(z_{t-1,t}|x_{1:N, t-1}) \\= \big[\prod_{n=1}^N p(x_{n, t}|x_{n, t-1}, z_{t-1,t}) \big]p(z_{t-1,t}|x_{1:N, t-1})
$$
Inference step :
$$
q(x_{M:N, t} | x_{1:N, t-1}, x_{1:M-1, t})
$$
-->
<!-- Main Idea : -->
<!-- ###### Main Idea
- At every time-step, we receive only little data which do not use all words in variety of contexts.
- By adjusting the word embedding based on estimated drift, we should be better able to prevent performance drop over time.
- -->
<!--
###### Method
Training time,
$$
e = \text{Transformer}(x_{1:M, t-1}, x_{1:M,t})
$$
$$
\hat{x}_{n,t} \leftarrow \text{RNN}(x_{n, t-1}, e)
$$
To train, minimize the MSE loss
$$
|| x_{n,t} - \hat{x}_{n,t}||^2
$$
Testing Time,
$$
e = \text{Transformer}(x_{1:M, t-1}, x_{1:M,t})
$$
$$
\hat{x}_{n,t} \leftarrow \text{RNN}(x_{n, t-1}, e)
$$
#### Evaluation Methodology
1. We shall do a downstream task.
2. And we will evaluate how the downstream task performance loss is mitigated when we use our model is used to adjust the embedding.
###### Downstream Task Possibilities
1. Product Recommendation data temporal?
2. Sentiment
3. Toy Dataset (I prefer this!)
###### Baselines and Models
1. Use same embedding as previous time-step
2. Baseline method to adjust embedding
3. Our method
###### Metric
Same as the metric of the downstream task.
- For example, recommendation. nDCG
- For example, sentiment classification. Accuracy
-->