Learning strategy: Empirical Risk Minimization
### General Setup for the Theory Solver
Now suppose a theory $T$ contains $m$ input variables, whose assignment $z_{in}$ is given, and $n$ output variables, whose assignment $z_{out}$ is unknown. Suppose a solver $s_T$ can solve the theory $T$ and can be used in two modes.
When given both $z_{in}$ and some $z_{out}$, the solver checks the pair: $(t, U) = s_T.check(z_{in}, z_{out})$ with $t\in\{0, 1\}$, where $t=0$ and $t=1$ indicate that the problem is `UNSAT` and `SAT`, respectively. $U$ is a set of unsat cores when $t=0$, i.e. `UNSAT`.
When given $z_{in}$ only, the solver returns $z_{out} = s_T(z_{in})$: the correct assignment if the problem is satisfiable, i.e. `SAT`, and baseline values (all 0s) otherwise.
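The concrete theory and solver are domain-specific. As an illustration, the following Python sketch implements this interface for a toy addition theory (two one-hot digits as the input variables, their 5-bit sum as the output variables), matching the MNIST-addition example used below. The class name `AdditionSolver` is hypothetical, the solver is a brute-force toy, and its "unsat core" is only a crude stand-in derived from the nearest satisfying assignment.

```python
import numpy as np

class AdditionSolver:
    """Hypothetical toy stand-in for the solver s_T of an addition theory:
    z_in holds two one-hot encoded digits (m = 20 input variables) and z_out
    is the 5-bit binary encoding of their sum (n = 5 output variables)."""

    m, n = 20, 5

    def _encode(self, total):
        # Binary encoding of an integer sum into the n output variables.
        return np.array([(total >> k) & 1 for k in range(self.n)], dtype=int)

    def solve(self, z_in):
        # s_T(z_in): the correct z_out if z_in is satisfiable (exactly one bit
        # set per digit), baseline zeros otherwise.
        d1, d2 = z_in[:10], z_in[10:]
        if d1.sum() != 1 or d2.sum() != 1:
            return np.zeros(self.n, dtype=int)
        return self._encode(int(np.argmax(d1)) + int(np.argmax(d2)))

    def check(self, z_in, z_out):
        # s_T.check(z_in, z_out): returns (t, U) with t = 1 for SAT and t = 0
        # for UNSAT. U is a crude stand-in for an unsat core: the input
        # variables that would have to flip to reach the nearest assignment
        # consistent with z_out.
        best = None
        for d1 in range(10):
            for d2 in range(10):
                if not np.array_equal(self._encode(d1 + d2), z_out):
                    continue
                cand = np.zeros(self.m, dtype=int)
                cand[d1], cand[10 + d2] = 1, 1
                if best is None or (cand != z_in).sum() < (best != z_in).sum():
                    best = cand
        if best is None:                      # no digit pair can produce z_out
            return 0, set(range(self.m))
        U = {j for j in range(self.m) if z_in[j] != best[j]}
        return (1, set()) if not U else (0, U)
```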
### Network with a Solver Layer
Let $(x_i, y_i)$ be the input and the label of the $i$-th instance in a training set with $N$ instances. In the MNIST-addition example, $x_i$ is a pair of digit images and $y_i$ is their sum. Let $f(\cdot;\theta)$ be a network that takes an input $x\in\mathbb{R}^d$, and let $z = sign[f(x;\theta)] \in \{0, 1\}^m$ be its binarized output. We will now use $z$ as the assignment of the input variables for the solver. For simplicity, we overload the notation of the last section and write $y' = s(z)$.
### Training Target: ERM
$$\min_\theta \mathcal{L} = \min_\theta \frac{1}{N} \sum_{i=0}^{N-1} L(y'_i, y_i;\theta)$$
where $L$ is the loss function, e.g. a binary cross entropy, and $y'_i = s(z_i)$ with $z_i = sign[f(x_i;\theta)]$.
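For instance, treating the label as a vector of $n$ output bits, the per-instance binary cross entropy takes the form (one standard choice; the exact form of $L$ is not fixed above):

$$L(y'_i, y_i;\theta) = -\sum_{j=1}^{n}\Big[y_{i,j}\log y'_{i,j} + (1-y_{i,j})\log\big(1-y'_{i,j}\big)\Big]$$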
### Architecture
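The pipeline implied by the setup above (no figure is given in the text) is

$$x \;\xrightarrow{\;f(\cdot;\theta)\;}\; \mathbb{R}^m \;\xrightarrow{\;sign\;}\; z \in \{0,1\}^m \;\xrightarrow{\;s\;}\; y',$$

i.e. a standard network produces real-valued scores, the sign binarizes them into an assignment of the solver's input variables, and the solver layer maps that assignment to the predicted output $y'$, which is compared against the label $y$.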

### SGD Update Rule
For simplicity, assume a batch size of 1, so we only deal with one instance $(x, y)$ per iteration. At iteration $\tau$, the parameter $\theta$ is updated as follows (a code sketch of the full update follows the steps):
Step 0: In the forward pass, solve for $y' = s(z) = s(sign[f(x;\theta)])$.
Step 1: Compute the empirical risk $\mathcal{L}(y', y)$ and the gradient $\frac{\partial \mathcal{L}}{\partial y'}$.
Step 2: Compute $y^* = sign[y' - 2 \times sign[\frac{\partial \mathcal{L}}{\partial y'}]]$, i.e. flip each output bit that the loss gradient wants changed.
Step 3: Compute $(t, U) = s.check(z, y^*)$, i.e. check whether the current input assignment $z$ is consistent with the corrected output $y^*$.
Step 4: Compute $g = BCE(ReLU(z), z)$ element-wise, where $BCE$ is the binary cross entropy, so that $g$ is a vector over the $m$ input variables.
Step 5: Set $\frac{\partial \mathcal{L}}{\partial z_j} = -g_j$ if `t==0` (`UNSAT`) and $j \in U$; otherwise set $\frac{\partial \mathcal{L}}{\partial z_j} = g_j$.
Step 6: $\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \theta}$ and $\theta_{\tau+1} = \theta_\tau - \eta \frac{\partial \mathcal{L}}{\partial \theta}$, where $\eta$ is the learning rate.
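Putting the steps together, below is a minimal PyTorch sketch of this update rule as a custom autograd layer. The names (`SolverLayerFn`, the reuse of the toy `AdditionSolver` above) are illustrative rather than from the text, and two points the steps leave open are resolved by assumption: $\partial z / \partial \theta$ is passed straight through the sign, and Step 4's BCE is taken element-wise between the clipped real-valued pre-activation and the binarized $z$ (applied literally to a binary $z$, the BCE would be identically zero).

```python
import torch

class SolverLayerFn(torch.autograd.Function):
    """Sketch of Steps 0-6: forward runs the solver, backward builds dL/dz
    from the unsat core; see the assumptions stated above."""

    @staticmethod
    def forward(ctx, logits, solver):
        z = (logits > 0).float()                       # Step 0a: z = sign[f(x; theta)]
        y_pred = torch.tensor(solver.solve(z.numpy().astype(int)),
                              dtype=torch.float32)     # Step 0b: y' = s(z)
        ctx.solver = solver
        ctx.save_for_backward(logits, z, y_pred)
        return y_pred

    @staticmethod
    def backward(ctx, grad_y):                         # grad_y = dL/dy' from Step 1
        logits, z, y_pred = ctx.saved_tensors
        # Step 2: flip the output bits the loss wants changed.
        y_star = ((y_pred - 2 * torch.sign(grad_y)) > 0).float()
        # Step 3: is the flipped output consistent with the current z?
        t, U = ctx.solver.check(z.numpy().astype(int),
                                y_star.numpy().astype(int))
        # Step 4: element-wise BCE; the pre-activation is clipped into (0, 1)
        # so the BCE is well-defined (an assumption, see the lead-in).
        p = torch.clamp(logits.detach(), 1e-6, 1 - 1e-6)
        g = -(z * torch.log(p) + (1 - z) * torch.log(1 - p))
        # Step 5: negate the gradient on the unsat-core variables.
        grad_z = g.clone()
        if t == 0:
            for j in U:
                grad_z[j] = -g[j]
        # Step 6 is left to the optimizer: grad_z flows back through f(x; theta).
        return grad_z, None


# One SGD iteration with batch size 1 on dummy data (names are illustrative).
net = torch.nn.Linear(2 * 784, 20)                     # stand-in encoder f(x; theta)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x, y = torch.randn(2 * 784), torch.tensor([1., 0., 0., 1., 0.])
y_pred = SolverLayerFn.apply(net(x), AdditionSolver()) # Step 0
loss = torch.nn.functional.binary_cross_entropy(y_pred, y)   # empirical risk L(y', y)
opt.zero_grad(); loss.backward(); opt.step()           # Steps 1-6
```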