Learning strategy: Empirical Risk Minimization

### General Setup for Theory Solver

Suppose a theory $T$ contains $m$ input variables, whose assignment $z_{in}$ is given, and $n$ output variables, whose assignment $z_{out}$ is unknown. Suppose a solver $s_T$ can solve the theory $T$. When given both $z_{in}$ and some $z_{out}$, $(t, U) = s_T.check(z_{in}, z_{out})$, where $t \in \{0, 1\}$ indicates that the problem is `UNSAT` or `SAT`, respectively. $U$ is a set of unsat cores if $t = 0$, i.e. `UNSAT`. When given $z_{in}$ only, the solver returns $z_{out} = s_T(z_{in})$; $z_{out}$ is the correct assignment if the problem is satisfiable, i.e. `SAT`, and baseline values (all 0s) otherwise.

### Network with a Solver Layer

Let $(x_i, y_i)$ be the inputs and labels in a training set with $N$ instances. In the MNIST-addition example, $x_i$ is a pair of digit images and $y_i$ is their sum. Let $z = sign[f(x;\theta)]$ be a network that takes an input $x \in \mathbb{R}^d$ and outputs a vector $z \in \{0, 1\}^m$. We will now use $z$ as the assignment of the input variables for the solver. For simplicity, we overload the notation from the previous paragraph and write $y' = s(z)$.

### Training Target: ERM

$$\min_\theta \mathcal{L} = \min_\theta \frac{1}{N} \sum^{N-1}_{i=0} L(y'_i, y_i;\theta)$$

where $L$ is the per-instance loss function, e.g. binary cross entropy, and $y'_i = s(z_i)$, $z_i = sign[f(x_i;\theta)]$.

### Architecture

![](https://i.imgur.com/MjOcC3b.png)

### SGD Update Rule

For simplicity, assume the network uses a batch size of 1, so we deal with a single instance $(x, y)$ per iteration. At iteration $\tau$, the parameters $\theta$ are updated as follows:

Step 0: In the forward pass, solve for $y' = s(z) = s(sign[f(x;\theta)])$.

Step 1: Compute the empirical risk $\mathcal{L}(y', y)$ and $\frac{\partial \mathcal{L}}{\partial y'}$.

Step 2: $y^* = sign[y' - 2 \times sign[\frac{\partial \mathcal{L}}{\partial y'}]]$.

Step 3: Compute $(t, U) = s.check(z, y^*)$.

Step 4: Compute $g = BCE(ReLU(z), z)$, where BCE is the binary cross entropy.

Step 5: $\frac{\partial \mathcal{L}}{\partial z_j} = -g_j$ if `t == 0` and $j \in U$; otherwise, $\frac{\partial \mathcal{L}}{\partial z_j} = g_j$.

Step 6: $\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z} \times \frac{\partial z}{\partial \theta}$, and $\theta_{\tau+1} = \theta_\tau - \eta \frac{\partial \mathcal{L}}{\partial \theta}$, where $\eta$ is the learning rate.
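
To make the solver interface above concrete, here is a minimal sketch of the two calling conventions, `check(z_in, z_out) -> (t, U)` and `s_T(z_in) -> z_out`. The class name `TheorySolver` and the use of PyTorch tensors are assumptions for illustration only; a real implementation would translate the assignments into constraints for an SMT/SAT backend.

```python
from typing import Set, Tuple

import torch


class TheorySolver:
    """Hypothetical interface for the solver s_T described above."""

    def check(self, z_in: torch.Tensor, z_out: torch.Tensor) -> Tuple[int, Set[int]]:
        """Return (t, U): t = 1 if (z_in, z_out) is SAT, t = 0 if UNSAT.

        U is a set of input-variable indices drawn from the unsat cores
        (empty when t = 1).
        """
        raise NotImplementedError  # backend-specific (e.g. an SMT solver)

    def __call__(self, z_in: torch.Tensor) -> torch.Tensor:
        """Solve for z_out given z_in; return the all-zero baseline if UNSAT."""
        raise NotImplementedError  # backend-specific
```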
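
Building on that interface, the following is a sketch (not a reference implementation) of how Steps 0-6 could be wired into PyTorch: Steps 2-5 become the `backward` of a custom autograd function, and Step 6 is then handled by the usual optimizer step. The names `SolverLayer` and `train_step` are made up for the example, and the straight-through estimator used for the sign is an assumption so that $\frac{\partial z}{\partial \theta}$ in Step 6 is well defined; the formulas in Steps 2, 4, and 5 are kept verbatim.

```python
import torch
import torch.nn.functional as F


class SolverLayer(torch.autograd.Function):
    """Forward: Step 0. Backward: Steps 2-5 of the update rule."""

    @staticmethod
    def forward(ctx, z, solver):
        y_pred = solver(z)                        # Step 0: y' = s(z)
        ctx.save_for_backward(z, y_pred)
        ctx.solver = solver
        return y_pred

    @staticmethod
    def backward(ctx, grad_output):               # grad_output = dL/dy'
        z, y_pred = ctx.saved_tensors
        zd = z.detach()
        # Step 2: propose corrected outputs y* by flipping against dL/dy'
        y_star = torch.sign(y_pred - 2.0 * torch.sign(grad_output))
        # Step 3: ask the solver whether (z, y*) is jointly satisfiable
        t, U = ctx.solver.check(zd, y_star)
        # Step 4: g = BCE(ReLU(z), z), element-wise
        g = F.binary_cross_entropy(torch.relu(zd), zd, reduction="none")
        # Step 5: negate the gradient on input variables that appear in an unsat core
        grad_z = g.clone()
        if t == 0 and U:
            idx = torch.tensor(sorted(U), dtype=torch.long)
            grad_z[idx] = -g[idx]
        return grad_z, None                       # no gradient w.r.t. the solver


def train_step(model, solver, x, y, optimizer):
    """One SGD iteration on a single instance (batch size 1)."""
    logits = model(x)                             # f(x; theta), shape (m,)
    z_hard = (logits > 0).float()                 # sign[f(x; theta)] mapped to {0, 1}
    z = z_hard + logits - logits.detach()         # straight-through trick (assumption)
    y_pred = SolverLayer.apply(z, solver)         # Step 0
    loss = F.binary_cross_entropy(y_pred, y.float())  # Step 1: empirical risk
    optimizer.zero_grad()
    loss.backward()                               # Steps 2-6: dL/dz via the custom backward, then dL/dtheta
    optimizer.step()                              # theta <- theta - eta * dL/dtheta
    return loss.item()
```

Packaging Steps 2-5 as the backward of `SolverLayer` keeps the solver calls outside the autograd graph, while Step 6 falls out of the standard `loss.backward()` / `optimizer.step()` pair.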