# Todos
## Problem Formulation
Tabular data prediction regards a row of the table $T_{i,:}$ as a data sample and a column of the table $T_{:,j}$ as a feature field. An entry of a table $T_{i,j}$ represents the feature value of the $j$-th field for the $i$-th data sample.
We focus on tabular data prediction on a single table; multiple tables can be joined into a single table using various techniques, e.g., Deep Feature Synthesis.
## 1. Column-wise: AIM (Arbitrary-Order Interaction Machines)
### Methodology
**Interaction Component**
A beneficial feature interaction set: $I_i=\{x_{i1}, x_{i2},\dots,x_{ik}\}.$
We can use **one-layer message passing** to obtain the product feature interactions.
$$
\forall x_{ij} \in I_i:\; s_{ij} = \mathbf{1}_{x_{ij}>0},\; h_{ij} = \lvert x_{ij}\rvert,\\
m_i = \operatorname{AGG}_{x_{ij}\in I_i}\{\log(h_{ij})\},\\
s_i = \left(\deg(i)-\sum_{j}s_{ij}\right)\bmod 2,\\
h_i = (-1)^{s_i}\odot \exp(m_i).
$$
For each input, we generate $N$ feature interaction sets $\mathbf{I}=\{I_1, \dots, I_N\}$.
The above process is highly parallelizable: it can be seen as one-layer message passing on a directed graph.
If we use linear aggregators, the computational complexity is $O(Ed)$, where $E$ is the number of edges of the constructed graph, i.e., $E=\sum_{I\in \mathbf{I}}\lvert I\rvert$. If we use attention-based aggregators, the complexity is $O(Ed^2)$.
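A rough PyTorch sketch of this one-layer message passing, assuming the interaction sets are flattened into an edge list (`edge_index[0, e]` = member feature, `edge_index[1, e]` = interaction-set id); all names are illustrative:

```python
import torch

def product_interactions(x, edge_index, num_sets, eps=1e-12):
    """Turn each interaction set I_i into the element-wise product of its members.

    x:          [F, d] feature-value embeddings
    edge_index: [2, E] long tensor; edge e connects feature edge_index[0, e]
                to interaction set edge_index[1, e]
    """
    src, dst = edge_index
    h = x[src].abs().clamp_min(eps)      # h_{ij} = |x_{ij}|, guarded against log(0)
    s = (x[src] > 0).float()             # sign indicators s_{ij}
    d = x.size(1)

    # m_i = sum_j log h_{ij}  (linear aggregator = scatter-add over edges)
    m = torch.zeros(num_sets, d).index_add_(0, dst, h.log())
    # parity of the number of negative members, per set and per dimension
    deg = torch.zeros(num_sets, 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
    pos = torch.zeros(num_sets, d).index_add_(0, dst, s)
    sign = 1.0 - 2.0 * ((deg - pos) % 2)

    return sign * m.exp()                # h_i = (-1)^{s_i} ⊙ exp(m_i)
```

With an attention-based aggregator, the plain scatter-add over $\log(h_{ij})$ would be replaced by a softmax-weighted sum over the same edges.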
Another important problem is how to generate $\mathbf{I}$, which we address in the pruning component.
**Pruning Component**
Principle: Sufficiency and Low Redundancy.
* Feature-value-wise
1. Regularize via the Deep InfoMax loss and a contrastive loss.
*Deep InfoMax mainly contributes a new loss: it reduces estimating the mutual information between two variables to computing the KL divergence between the joint distribution and the product of the marginals. In practice, it uses a discriminator to tell the two distributions apart.

2. Other pruning methods, e.g., attention-based ones.
* Field-wise (Lighter)
1. A simple way is to first precompute the MI between every pair of fields and then use a greedy algorithm to decide the feature combination sets (a sketch follows after this list).
I will survey the literature for more pruning methods.
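A rough sketch of the field-wise greedy selection, assuming discretized columns and using MI with the label as the sufficiency term; the function and parameter names are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_interaction_sets(df, label, max_order=3, num_sets=8,
                            redundancy_penalty=0.5, seed=0):
    """Greedily build feature-interaction sets from precomputed MI scores.

    df:    pandas DataFrame of discretized feature columns
    label: pandas Series of (discrete) labels
    Each candidate field is scored by MI with the label (sufficiency) minus a
    penalty on MI with fields already in the set (low redundancy).
    """
    fields = list(df.columns)
    mi_label = {f: mutual_info_score(df[f], label) for f in fields}
    mi_pair = {(f, g): mutual_info_score(df[f], df[g])
               for i, f in enumerate(fields) for g in fields[i + 1:]}

    def pair_mi(f, g):
        return mi_pair.get((f, g), mi_pair.get((g, f), 0.0))

    rng = np.random.default_rng(seed)
    sets = []
    for _ in range(num_sets):
        chosen = [rng.choice(fields)]    # random seed field so the sets differ
        while len(chosen) < max_order:
            best, best_score = None, -np.inf
            for f in fields:
                if f in chosen:
                    continue
                score = mi_label[f] - redundancy_penalty * sum(pair_mi(f, g) for g in chosen)
                if score > best_score:
                    best, best_score = f, score
            chosen.append(best)
        sets.append(tuple(chosen))
    return sets
```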
**Overall**
DeepFM can be seen as a reduced version of AIM: if we restrict $\mathbf{I}$ to pair-wise feature interactions and single features, AIM degenerates to DeepFM. Therefore, in theory we could beat DeepFM given a good pruning component. Moreover, we generate arbitrary-order product feature interactions, which are strictly $n$-way instead of only $2$-way.
**Baselines to compare with**: DeepFM, xDeepFM, AutoFIS, PNN, AFM, etc.
**Other**: We can further combine AIM with the multi-rows trick introduced below.
## 2. Row-wise
Motivations:
1. Why we need multi-rows
- Previous methods: make predictions based only on the target input features and the model parameters.
$$p(y^*\mid x,\theta).$$
- A single-row model might not be able to remember the relations between data points well.
- For example, $\theta$ is trained over the whole data space, while feature distributions tend to be long-tailed; for rarely-hit features, the model can be biased.
- In recommendation scenarios, modeling user behavior sequences has proven effective.
- Our method: consider direct dependencies between data points.
$$p(y^*\mid x,D_{\text{train}},\theta).$$
Related Work: NPT, Hopular, SAINT, RIM, PET, etc.
2. Why we need retrieval.
- Homogeneity is important: retrieved rows are similar to the target row, so their features and labels are more informative than arbitrary rows.
- It saves resources: we only compute over a small retrieved set instead of the whole table.
**Base model**: $f(\cdot)$.
**Index Factory**: $R(\cdot)$.
**Single row prediction process**: $T_{i,:} \rightarrow f(\cdot)\rightarrow h_i \rightarrow p(\cdot)\rightarrow \hat{y}_i$.
**Multi-rows usage**:
$$
N(t) = R(T_{t,:}),\\
h_t = f(T_{t,:}),\\
a_{ti} = \frac{\exp(h_t^\top h_i)}{\sum_{j \in N(t)}\exp(h_t^\top h_j)},\\
n_t = \sum_{i \in N(t)} a_{ti}\cdot \Phi(y_i),\\
\hat{y}_t = p(h_t, n_t).
$$
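A rough PyTorch sketch of this multi-row prediction step, assuming $p$ takes the concatenation $[h_t \,\Vert\, n_t]$ and that the label embeddings $\Phi(y_i)$ of the training rows are precomputed (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_row_predict(x_t, train_X, train_y_emb, f, p, R, k=16):
    """Retrieval-augmented prediction for a single target row.

    f:           base encoder, row -> [d]
    R:           retriever returning the indices N(t) of k neighbor rows
    p:           prediction head applied to [h_t || n_t]
    train_y_emb: label embeddings Φ(y_i) of the training rows, [n, d_y]
    """
    idx = R(x_t, k)                           # N(t) = R(T_{t,:})
    h_t = f(x_t)                              # h_t = f(T_{t,:}),  [d]
    h_nb = f(train_X[idx])                    # neighbor representations, [k, d]
    a = F.softmax(h_nb @ h_t, dim=0)          # a_{ti} ∝ exp(h_t · h_i)
    n_t = a @ train_y_emb[idx]                # n_t = Σ a_{ti} Φ(y_i)
    return p(torch.cat([h_t, n_t], dim=-1))   # ŷ_t = p(h_t, n_t)
```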
**Base model**: $y_{base} = \Phi_\theta(X_t).$
**Knowledge patterns**: $\mathbf{K}=\{K_i\}.$
$\forall K_i\in\mathbf{K}$:
$$
N_i(t) = K_i(X_t,\theta),\\
\alpha_{tj} = \frac{\exp(\mathbf{h}_t^\top\mathbf{h}_j)}{\sum_{j' \in N_i(t)}\exp(\mathbf{h}_t^\top\mathbf{h}_{j'})},\\
\mathbf{n}^i_t = \sum_{j \in N_i(t)} \alpha_{tj}\cdot \mathbf{h}_j,\\
\mathbf{kn}^i_t = \Bigl[ \sum_{j \in N_i(t)} \alpha_{tj}\cdot \mathbf{h}_j \,\Big\Vert\, \sum_{j \in N_i(t)} \alpha_{tj}\cdot \mathbf{y}_j\Bigr].
$$
Then,
$$
\{\alpha_{ti}\} \leftarrow \operatorname{Attn}(\{\mathbf{n}^i_t\}, \mathbf{h}_t),\\
\mathbf{kn}_t = \sum_i\alpha_{ti}\cdot\mathbf{kn}^i_t,\\
\hat{y}_t = \Phi_\theta(X_t) + \operatorname{MLP}(\mathbf{kn}_t).
$$
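A rough PyTorch sketch of this two-level aggregation, with $\operatorname{Attn}$ simplified to the same dot-product attention used within a pattern (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def knowledge_pattern_predict(h_t, base_logit, pattern_neighbors, mlp):
    """Two-level aggregation over knowledge patterns.

    h_t:               [d] representation of the target row X_t
    base_logit:        scalar Phi_theta(X_t) from the base model
    pattern_neighbors: list of (H_i, Y_i), one per pattern K_i, where
                       H_i: [n_i, d]   neighbor representations h_j
                       Y_i: [n_i, d_y] neighbor label embeddings y_j
    mlp:               torch.nn.Module mapping [d + d_y] -> scalar
    """
    feats, kns = [], []
    for H_i, Y_i in pattern_neighbors:
        alpha = F.softmax(H_i @ h_t, dim=0)            # α_{tj} within pattern i
        n_i = alpha @ H_i                              # n^i_t = Σ α_{tj} h_j
        kns.append(torch.cat([n_i, alpha @ Y_i], -1))  # kn^i_t = [n^i_t || Σ α_{tj} y_j]
        feats.append(n_i)

    feats, kns = torch.stack(feats), torch.stack(kns)  # [P, d], [P, d + d_y]
    beta = F.softmax(feats @ h_t, dim=0)               # attention over the P patterns
    kn_t = beta @ kns                                  # kn_t = Σ_i β_i kn^i_t
    return base_logit + mlp(kn_t)                      # ŷ_t = Φ_θ(X_t) + MLP(kn_t)
```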
Optimal knowledge patterns:
$$\max\; I(\mathbf{k}_1, \mathbf{k}_2,\dots, \mathbf{k}_N;y)-\alpha \sum_{i=1}^N H(\mathbf{k}_i\mid y)-\beta \sum_{i\neq j}I(\mathbf{k}_i;\mathbf{k}_j;y).$$
Its approximation:
$$\max\; \sum_{i=1}^N I(\mathbf{k}_i;y)-\sigma\sum_{i\neq j}I(\mathbf{k}_i;\mathbf{k}_j).$$
$$\max \sum_i I(\mathbf{k}_i;y)\approx\max \frac1N\sum_i \left(\log D_\theta(\mathbf{k}_i, c^+)+\log\left(1-D_\theta(\mathbf{k}_i, c^-)\right)\right),\\
\min \sum_{i\neq j}I(\mathbf{k}_i;\mathbf{k}_j) \approx \max \frac1N\sum_i \left( \log D_w\left(\mathbf{k}_i',\mathbf{k}_i''\right)+\log\left(1-D_w(\mathbf{k}_i',\mathbf{k}_j'')\right)\right).
$$
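A rough PyTorch sketch of the two discriminator-based terms, assuming a simple bilinear discriminator, label-conditioned contexts $c^+/c^-$, and two views $\mathbf{k}'/\mathbf{k}''$ of each knowledge vector; every name here is an illustrative stand-in:

```python
import torch
import torch.nn as nn

class BilinearDiscriminator(nn.Module):
    """D(a, b) = sigmoid(a^T W b): scores whether (a, b) is a dependent pair."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim_a, dim_b) * 0.01)

    def forward(self, a, b):
        return torch.sigmoid((a @ self.W * b).sum(-1))

def mi_losses(D_label, D_pattern, k, c_pos, c_neg, k_view1, k_view2, eps=1e-8):
    """Discriminator-based surrogates for the two terms above (both to be minimized).

    k:                [B, N, d] knowledge vectors k_i for a batch
    c_pos, c_neg:     [B, d_c]  contexts paired with the true / a mismatched label
    k_view1, k_view2: [B, N, d] two views of the knowledge vectors
    """
    B, N, _ = k.shape
    # Sufficiency: k_i should let D_label separate c^+ from c^-  (maximize I(k_i; y))
    pos = D_label(k.reshape(B * N, -1), c_pos.repeat_interleave(N, 0))
    neg = D_label(k.reshape(B * N, -1), c_neg.repeat_interleave(N, 0))
    sufficiency = -(torch.log(pos + eps) + torch.log(1 - neg + eps)).mean()

    # Low redundancy: views of the same pattern agree, views of different patterns do not
    j = (torch.arange(N) + 1) % N                      # a simple "i != j" pairing
    same = D_pattern(k_view1.reshape(B * N, -1), k_view2.reshape(B * N, -1))
    diff = D_pattern(k_view1.reshape(B * N, -1), k_view2[:, j].reshape(B * N, -1))
    redundancy = -(torch.log(same + eps) + torch.log(1 - diff + eps)).mean()

    return sufficiency, redundancy
```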
*$n_t$ should serve as a residual: even if we retrieve garbage, it should not degrade the base prediction.
**Benefits brought by GNNs (PET)**: (just a discussion; it does not mean we need to use them. We could borrow the most useful part and simplify it into a trick.)
- Stabilize multi-row distributions. GNNs have good stabilization and generalization ability.
- FATE offers a proof of this with a non-linear GCN.
- Since GNNs and label propagation (LP) share the same assumptions, we can use features and labels in parallel. Also, LP is stronger than simple attention-based aggregation (more costly, though). Even on graphs with weaker homophily, LP is strong.
## Deployment Feasibility
1. Replace attention with SimHash, as a form of soft filtering:
**References**:
https://arxiv.org/pdf/2108.04468.pdf
https://arxiv.org/pdf/2205.10249.pdf
**ETA**:

**SDIM**:

- Retrieval uses the Hamming distance between SimHash fingerprints; computing a Hamming distance is an atomic $O(1)$ operation, so retrieval time is linear (see the sketch below).
- **The SimHash/bucketing part is independent of the candidate item, so SimHash can be run concurrently during the recall/pre-ranking stages and cached ahead of time; at prediction time the fingerprints can be fetched directly, so the whole pipeline can be online and end-to-end.**
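A rough NumPy sketch of the SimHash-fingerprint + Hamming-distance retrieval step (random-hyperplane hashing; the bit width and function names are illustrative, not the exact ETA/SDIM implementation):

```python
import numpy as np

def simhash_fingerprints(embeddings, num_bits=64, seed=0):
    """Random-hyperplane SimHash: each bit is the sign of one random projection."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], num_bits))
    bits = (embeddings @ planes > 0).astype(np.uint8)      # [n, num_bits]
    return np.packbits(bits, axis=1)                       # packed so XOR works on bytes

def hamming_topk(query_fp, behavior_fps, k=32):
    """Retrieve the k behaviors whose fingerprints are closest in Hamming distance."""
    xor = np.bitwise_xor(behavior_fps, query_fp)           # differing bits, per byte
    dist = np.unpackbits(xor, axis=1).sum(axis=1)          # popcount per behavior
    return np.argsort(dist)[:k]
```

As noted above, `simhash_fingerprints` can be run ahead of time and cached; only the Hamming-distance ranking happens at prediction time.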
Why ETA & SDIM work:
- Feasible for modeling long user behavior sequences.
- Unlike two-stage methods (e.g., SIM), they alleviate the divergent-objective and distribution-shift problems.
## Other Possible Improvements & Paper Entry Points
1. OOD (out-of-distribution) generalization.
- There is a 2022 CV paper that uses KNN for OOD (mainly using KNN distances for detection): https://arxiv.org/pdf/2204.06507.pdf
- Online retrieval uses the latest embeddings, which can alleviate distribution shift.
- We can also design solutions to the OOD problem at the learning-methodology level.
- Survey: https://arxiv.org/pdf/2108.13624.pdf
- Disentangled representation learning, causality, etc.
2. Introduce labels and analyze, from an energy-based-model perspective, the benefits that KNN and labels bring to the OOD problem.
3. From an efficiency perspective, we can explore hash functions with fewer collisions that better mimic target attention.
- Feature Hashing for Large Scale Multitask Learning: https://alex.smola.org/papers/2009/Weinbergeretal09.pdf
## 3. Retrieval
### Sparse Retrieval
Retrieve relevant rows in explicit space according to feature co-occurrences, e.g., TF-IDF values.
**Strength**: Light and accurate for tables with fewer fields.
**Weakness**: On tables with a large number of fields, sparse retrieval performs poorly, since: 1) it treats the distances between any two feature values of a field as equal; 2) it is vulnerable to garbage columns.
- **An easy fix**: Only retrieve with the top-K fields that have the highest mutual information with the label, e.g., user attribute columns and item-id columns (sketch below).
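A rough sketch of sparse retrieval with this top-K-fields fix, treating each row as a bag of `field=value` tokens (pandas/scikit-learn; names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sparse_retrieve(train_df, query_row, fields, k=10):
    """Rank training rows by TF-IDF cosine similarity over 'field=value' tokens.

    fields: the pre-selected columns to retrieve with, e.g. the top-K fields
            by mutual information with the label, as suggested above.
    """
    def row_to_doc(row):
        return " ".join(f"{f}={row[f]}" for f in fields)

    docs = train_df.apply(row_to_doc, axis=1).tolist()
    vectorizer = TfidfVectorizer(token_pattern=r"[^ ]+", lowercase=False)
    train_vecs = vectorizer.fit_transform(docs)
    query_vec = vectorizer.transform([row_to_doc(query_row)])

    scores = cosine_similarity(query_vec, train_vecs).ravel()
    return scores.argsort()[::-1][:k]      # indices of the k most similar rows
```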
Some evidence:

### Dense Retrieval
Retrieve relevant rows in latent space according to cosine distances between feature vectors.
**Strength**: Given expressive representations, it could
* be aware of the "real" distances between different feature values,
* apply different term weights to different fields,
* be more robust to garbage columns.
**Weakness**: Relies on expressive representations.
* I tried pure dense retrieval on dota2games, phillipines, etc. (tables with hundreds of columns), and it retrieved nearly garbage.
* Ray continued with sparse + dense retrieval on the Taoqi datasets and got good results.
**Industrial applications**:
Approximate KNN. Some open-source tools:
- Python bindings: Faiss (see the sketch below), N2.
- Java/Lucene based: OpenSearch.
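A minimal Faiss sketch of approximate dense retrieval over row embeddings (the index type and parameters are illustrative):

```python
import faiss
import numpy as np

d = 128                                              # embedding dimension (illustrative)
xb = np.random.rand(100_000, d).astype("float32")    # database row embeddings from f(.)
xq = np.random.rand(16, d).astype("float32")         # query-row embeddings

# On L2-normalized vectors, L2 distance gives the same ranking as cosine similarity.
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

index = faiss.IndexHNSWFlat(d, 32)                   # approximate KNN via an HNSW graph
index.add(xb)
distances, neighbors = index.search(xq, 16)          # top-16 neighbor row ids per query
```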
### Other thoughts
Retrieval itself relies on strong cross-row dependencies. Retrieval helps a lot on sequential tasks, and temporal information could be important for retrieval.
* Retrieve from histories, since we can only see the past.
* Retrieve from the whole data pool: a larger retrieval space, but it may suffer from a mismatched inductive bias, since in the real world we can only see the past.
## Action Items
1. Prepare for the interview.
2. Try AIM. I have constructed the pruning process; it should be fast to try.
3. Try the multi-rows tricks.
4. Do more surveying.
Some chitchat:
My VPN keeps collapsing recently. Very inconvenient :( Any recommendations?