# Flow to ingest a model into Kokoyi: a porter's guide
We can imagine a Kokoyi user, or more precisely one of the users who will help move M0 to M1 (let's call them porters), importing a new model in the following steps: after going through the Kokoyi quick start guide and a few basic examples, they begin with a few questions:
* **What is the task?** There is not a lot of variety: it can be an end task itself, for instance single-instance classification, regression, autoregression, or seq2seq over sequences, or the goal is to learn a low-dimensional representation or infer a latent variable. The set of tasks is fixed.
* **What is the objective function?** There is more variety here; some work focuses exclusively on coming up with good loss functions. We can provide a few of the most popular ones, for instance log likelihood and various distance functions that measure the gap between probabilities (e.g. cross entropy). Defining the signature of the objective function defines the model input/output at learning time.
* **What is the optimization routine?** Most of them are pretty straightforward, but some break into different phases, in particular when there are multiple sets of parameters, as in GAN or cycle training.
* **Where do the samples come from?** The easiest case is when samples come from a training set. Some algorithms internally generate samples from an initial state, as in reinforcement learning. Yet others mix the two, as in DAgger.
Now the porter can write the model in plain LaTeX, with Kokoyi syntax warm in their mind (this way we can see what the user prefers, rather than asking them to stick with Kokoyi).
This decomposition leads to a few interesting questions, for instance: can we automatically generate code for some of the steps? Can we provide templates that are easily modifiable?
## The Kokoyi OpTemp
The following is a preliminary set of ML tasks. We should add more.
<!--
It seems that we can define a set of objective function and optimization templates that work for a majority of the models. A user can choose one template (or a combination of them), and then proceed to define a model.
-->
### Single Instance Tasks
#### Classification
* Input: $x \in \mathcal{X}$
* Model: $p_\theta(y|x)$
* Inference output: $y \sim p_\theta(y|x)$
* Objective function (maximum likelihood): $F(x, y; \theta) = \log p_\theta(y|x)$
* Optimization: $\theta^* = argmax_\theta E_{(x, y) \sim D}[F(x, y; \theta)]$
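As a first check on whether code for such a template could be generated automatically, here is a minimal PyTorch sketch of the classification objective and one optimization step; the `model` returning class logits and the use of `cross_entropy` as the negative log likelihood are assumptions of the sketch, not part of the template.

```python
import torch.nn.functional as nnF

def objective(model, x, y):
    """F(x, y; theta) = log p_theta(y | x), averaged over the batch."""
    logits = model(x)                      # unnormalized class scores, shape (B, C)
    return -nnF.cross_entropy(logits, y)   # cross_entropy is the mean negative log likelihood

def train_step(model, optimizer, x, y):
    """One stochastic step of theta* = argmax_theta E_{(x, y) ~ D}[F(x, y; theta)]."""
    optimizer.zero_grad()
    loss = -objective(model, x, y)         # gradient descent on the negated objective
    loss.backward()
    optimizer.step()
    return loss.item()
```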
#### Regression
#### Set Prediction
#### Denoising Autoencoder
* Input: $x \in \mathcal{X}$
* Model:
* $\tilde x = x + c(x)$; $c(x)$ is a corruption function
* $h = h_{\theta_h}(\tilde x)$
* $\hat x = g_{\theta_g}(h)$
* Inference output: $h = h_{\theta_h}(x)$
* Objective function: $F(x; \theta_h, \theta_g) = \|x - g_{\theta_g}(h_{\theta_h}(x + c(x)))\|$
* Optimization: $\theta_h^*, \theta_g^* = argmin_{\theta_h, \theta_g} E_{x \sim D(x)}[F(x; \theta_h, \theta_g)]$
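A corresponding PyTorch sketch of the denoising autoencoder template; `encoder`, `decoder`, and `corrupt` are hypothetical stand-ins for $h_{\theta_h}$, $g_{\theta_g}$, and $c$, and the additive Gaussian corruption is only one possible choice of $c$.

```python
import torch

def dae_objective(encoder, decoder, corrupt, x):
    """F(x; theta_h, theta_g) = ||x - g(h(x + c(x)))||, averaged over the batch."""
    x_tilde = x + corrupt(x)                    # corrupted input
    h = encoder(x_tilde)                        # low-dimensional representation h
    x_hat = decoder(h)                          # reconstruction
    return torch.norm(x - x_hat, dim=-1).mean()

def gaussian_corrupt(x, sigma=0.1):
    """One possible corruption c(x): additive Gaussian noise (an assumption)."""
    return sigma * torch.randn_like(x)
```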
### Sequences
#### Autoregression
* Input: $x \in \mathcal{X}^L$, where $L$ is the maximum length of sequences and $\mathcal{X}$ is the vocabulary (a set of tokens).
* Model: $p_\theta(x_{1:n}) = \prod_{t=1}^n p_\theta(x_t|x_{1:t-1})$
* Inference output: $\hat x_{t+1} \sim p_\theta(x_{t+1}|\hat x_{1:t})$
* Objective function: $F(x; \theta) = \log p_\theta(x_{1:n}) = \sum_{t=1}^{n} \log p_\theta(x_t|x_{1:t-1})$
* Optimization: $\theta^* = argmax_\theta E_{x \sim D(x)}[F(x; \theta)]$
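A rough PyTorch sketch of the autoregressive template; the batched, teacher-forced calling convention (a prefix maps to next-token logits) is an assumption of the sketch.

```python
import torch
import torch.nn.functional as nnF

def autoregressive_objective(model, x):
    """F(x; theta) = sum_t log p_theta(x_t | x_{1:t-1}), averaged over the batch.

    `model(x[:, :-1])` is assumed to return logits of shape (B, n-1, |X|)."""
    logits = model(x[:, :-1])                                   # predict tokens 2..n
    log_probs = nnF.log_softmax(logits, dim=-1)
    target = x[:, 1:]
    token_log_probs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1).mean()                   # sum over t, mean over batch

@torch.no_grad()
def generate(model, prefix, steps):
    """hat x_{t+1} ~ p_theta(x_{t+1} | hat x_{1:t})."""
    x = prefix
    for _ in range(steps):
        logits = model(x)[:, -1]                                # distribution over the next token
        nxt = torch.multinomial(nnF.softmax(logits, dim=-1), 1)
        x = torch.cat([x, nxt], dim=1)
    return x
```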
#### Seq2Seq
* Input: $(x, y) \in \mathcal{X}^L \times \mathcal{Y}^L$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output vocabularies
> Only log probability and logit are computed in practice. The conditional probability is never computed.
* Model: $p_\theta (y_{1 : n} | x_{1 : m}) = \prod_{t = 1}^n p_\theta (y_t | y_{1 : t - 1}, x_{1 : m})$
> How to implement beam search?
* Inference output: $\hat y_{t + 1} \sim p_\theta (y_{t + 1} | \hat y_{1 : t}, x_{1 : m})$
> How to implement log-sigmoid and log-softmax in Kokoyi?
* Objective function: $F(x, y; \theta) = \log p_\theta (y_{1 : n} | x_{1 : m}) = \sum_{t = 1}^n \log p_\theta (y_t | y_{1 : t - 1}, x_{1 : m})$
* Optimization: $\theta^* = argmax_\theta E_{(x,y) \sim D(x, y)}[F(x, y; \theta)]$
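A sketch of the seq2seq objective under teacher forcing; PyTorch's `log_softmax` plays the role asked about in the note above, `model(x, y_in)` returning per-position logits is an assumed interface, and beam search is not addressed here.

```python
import torch.nn.functional as nnF

def seq2seq_objective(model, x, y):
    """F(x, y; theta) = sum_t log p_theta(y_t | y_{1:t-1}, x_{1:m}), averaged over the batch.

    `model(x, y_in)` is assumed to return logits for every output position,
    conditioned on the full source x and the shifted target (teacher forcing)."""
    y_in, y_out = y[:, :-1], y[:, 1:]
    logits = model(x, y_in)                                     # shape (B, n-1, |Y|)
    log_probs = nnF.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, y_out.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1).mean()
```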
#### Masked Learning in Sequence
* Input: $(x, b) \in \mathcal{X}^L \times \{0, 1\}^L$
* Model: $p_\theta (x_{\{t \in [L] : b_t = 1\}} | x_{1 : L}) = \prod_{t = 1}^L p_\theta (x_t | x_{1 : L})^{b_t}$
* Inference output: no inference is needed since this is pre-training?
* Objective function: $F(x, b; \theta) = \log p_\theta (x_{\{t \in [L] : b_t = 1\}} | x_{1 : L}) = \sum_{t = 1}^L b_t \log p_\theta (x_t | x_{1 : L})$
* Optimization: $\theta^* = argmax_\theta E_{x \sim D(x),\ b \sim \text{L-way Bernoulli}}[F(x, b; \theta)]$
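A sketch of the masked-learning objective in PyTorch; the per-position logits interface and the 15% masking rate are assumptions, and constructing the actual masked input is left to the caller.

```python
import torch
import torch.nn.functional as nnF

def masked_lm_objective(model, x, b):
    """F(x, b; theta) = sum_t b_t log p_theta(x_t | x_{1:L}), averaged over the batch.

    `model` is assumed to map the (masked) sequence to per-position logits of
    shape (B, L, |X|); `b` is a 0/1 mask marking the positions to predict."""
    logits = model(x)
    log_probs = nnF.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)
    return (b * token_log_probs).sum(dim=-1).mean()

def sample_mask(shape, p=0.15):
    """b ~ L-way Bernoulli; the masking rate p is an assumption."""
    return torch.bernoulli(torch.full(shape, p))
```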
#### RL: policy gradient
* Input: $(s, a, r) \in \mathcal{S} \times \mathcal{A} \times \mathcal{R}$
* Model: $p_\theta (a | s) = \pi_\theta (a | s)$, where $\pi_\theta$ is a neural network parameterized by $\theta$
* Inference output: $a \sim p_\theta (a | s)$ (can be more greedy)
> The tricky part is to sample trajectories...
* Objective function: $F(s, a, r; \theta) = r \log \pi_\theta (a | s)$, whose gradient $\mathbb{E}_{s, a, r} [r \nabla_\theta \log \pi_\theta (a | s)]$ is the policy gradient
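In practice the objective is implemented as a surrogate loss whose gradient equals the policy gradient; a minimal sketch, assuming `policy` returns a `torch.distributions.Distribution` and that the `returns` have already been collected by rolling out the current policy (the trajectory sampling noted above is not shown):

```python
def policy_gradient_loss(policy, states, actions, returns):
    """Surrogate loss whose gradient is -E[r * grad_theta log pi_theta(a|s)].

    `policy(states)` is assumed to return a torch.distributions.Distribution
    over actions; `returns` are the (discounted) rewards for each (s, a) pair."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)
    # Minimizing this loss ascends the policy-gradient objective.
    return -(returns * log_probs).mean()
```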
#### Multi-set Prediction
> here we use [DETR](https://arxiv.org/pdf/2005.12872.pdf); this is in the wrong subsection. where should it be?
* Input: $x \in \mathbb{R}^{w \times h}$
* Output: $\hat y = \{\hat y_i\}_{i=1}^N$
* Objective function:
* $\hat\sigma = argmin_{\sigma \in \Sigma_N} \sum_{i=1}^N L(y_i, \hat y_{\sigma(i)})$; $\sigma$ ranges over permutations of the integers $1, \dots, N$
* $F(x, y; \theta) = \sum_{i=1}^N L(y_i, \hat y_{\hat\sigma(i)})$, i.e. the matching loss evaluated at the optimal permutation
* Optimization: $\theta^* = argmin_\theta E_{(x, y) \sim D}[F(x, y; \theta)]$
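A sketch of the bipartite matching step using SciPy's Hungarian solver; the cost matrix $cost[i, j] = L(y_i, \hat y_j)$ is assumed to be precomputed, and only the matched entries carry gradients (the matching itself is treated as non-differentiable, as in DETR).

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost):
    """sigma_hat = argmin_sigma sum_i cost[i, sigma(i)], via the Hungarian algorithm."""
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(col, device=cost.device)      # col[i] = sigma_hat(i)

def set_prediction_objective(cost):
    """F = sum_i L(y_i, y_hat_{sigma_hat(i)}), summed over the matched pairs."""
    sigma_hat = hungarian_match(cost)
    idx = torch.arange(cost.shape[0], device=cost.device)
    return cost[idx, sigma_hat].sum()
```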
### Learning on Graphs
### Learning Frameworks
#### GAN
> The strange thing about GAN is that we cannot write the objective function over a single sample instance. Thoughts?
* Input: $x \in \mathcal{X}$
* Model:
* Training: $D_\phi(x)$, $G_\theta(z)$
* Generation/inference: $G_\theta(z)$
* Objective function: $F(x; \theta, \phi) = \log D_\phi(x) + E_{z \sim \mathcal{N}(0, I)}[\log (1 - D_\phi(G_\theta(z)))]$
* Optimization: $\theta^*, \phi^* = argmin_\theta argmax_\phi E_{x \sim D(x)}[F(x; \theta, \phi)]$
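Because the optimization alternates between the two parameter sets, generated code would likely be a two-phase update; a minimal sketch, assuming `D` outputs probabilities in $(0, 1)$ and $z$ is drawn from a unit Gaussian:

```python
import torch

def gan_step(D, G, d_opt, g_opt, x, z_dim):
    """One alternating update for min_theta max_phi E_x[F(x; theta, phi)]."""
    eps = 1e-8                                     # numerical safety for the logs
    z = torch.randn(x.shape[0], z_dim, device=x.device)

    # Discriminator phase: ascend F in phi (descend -F).
    d_opt.zero_grad()
    d_loss = -(torch.log(D(x) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    d_loss.backward()
    d_opt.step()

    # Generator phase: descend F in theta (only the second term depends on theta).
    g_opt.zero_grad()
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```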
#### Contrastive Learning
#### VAE
* Input: $x \in \mathcal{X}$
* Model:
* Generator: $p_\theta(x|z)$
* Prior: $p(z)$ (no parameters, a multivariate unit Gaussian)
* Approximate posterior: $q_\phi(z|x)$
* Inference output: $\hat x \sim p_\theta(x|z)$ with $z \sim p(z)$
* Objective function: $F(x; \theta, \phi) = E_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \| p(z))$
* Optimization: $\theta^*, \phi^* = argmax_{\theta, \phi}E_{x \sim D}[F(x; \theta, \phi)]$
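A minimal PyTorch sketch of the VAE objective with a single reparameterized sample of $z$; the interfaces (`encoder` returning the mean and log-variance of the Gaussian $q_\phi(z|x)$, `decoder` returning a `torch.distributions.Distribution` over $x$) are assumptions of the sketch.

```python
import torch

def vae_objective(encoder, decoder, x):
    """F(x; theta, phi) = E_{z~q_phi(z|x)}[log p_theta(x|z)] - KL(q_phi(z|x) || p(z)),
    estimated with one reparameterized sample of z and averaged over the batch."""
    mu, log_var = encoder(x)                                    # parameters of q_phi(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # reparameterization trick
    recon_log_prob = decoder(z).log_prob(x).sum(dim=-1)         # log p_theta(x|z)
    # Closed-form KL between N(mu, sigma^2) and the unit Gaussian prior p(z).
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1)
    return (recon_log_prob - kl).mean()
```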
#### Cycle training
* Input: $x \in \mathcal{X}$, $y \in \mathcal{Y}$
* Model: $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$, $g_\phi: \mathcal{Y} \rightarrow \mathcal{X}$
* Objective function: $F(x, y; \theta, \phi) = \|x - g_\phi(f_\theta(x))\| + \|y - f_\theta(g_\phi(y))\|$
* Optimization: $\theta^*, \phi^* = argmin_{\theta, \phi} E_{(x,y) \sim D(x,y)}[F(x,y;\theta, \phi)]$
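A short sketch of the cycle-consistency objective; `f` and `g` stand in for $f_\theta$ and $g_\phi$, and taking the norm per sample and averaging over the batch is an assumption.

```python
import torch

def cycle_objective(f, g, x, y):
    """F(x, y; theta, phi) = ||x - g(f(x))|| + ||y - f(g(y))||, averaged over the batch."""
    forward_cycle = torch.norm(x - g(f(x)), dim=-1)     # X -> Y -> X reconstruction error
    backward_cycle = torch.norm(y - f(g(y)), dim=-1)    # Y -> X -> Y reconstruction error
    return (forward_cycle + backward_cycle).mean()
```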