Matt Fredrikson
# POPL rebuttal

## Overview

We thank the reviewers for their thoughtful comments, and we appreciate your feedback. We will start by addressing questions that came up in multiple reviews about how our paper relates to prior work, and provide more detailed answers to specific questions raised by individual reviewers at the end. We summarize the changes that we will make to the paper after the overview.

### Relation to and comparison with prior methods

Several reviewers posed questions about how our work relates to other methods for integrating symbolic knowledge into learning, and whether quantitative comparisons are appropriate and necessary in our evaluation. Of the methods that we are aware of and that were mentioned by reviewers, we see them as falling into three camps: solver layers (Wang et al. 2020, Vlastelica et al. 2020), differentiable logic (refs 1, 4, 5 in Review C), and neural deductive databases (Manhaeve et al. 2020, Huang et al. 2021). Each type of method has strengths and weaknesses that manifest differently across use cases, and in some cases offers distinct capabilities that the others do not support. We believe that there is value to the community in pursuing and publishing research on all of these directions. Our paper falls into the "solver layer" bucket. Our rationale for omitting a quantitative comparison is given below, but if reviewers believe that such a comparison would be informative, we are happy to make this change.

#### Solver layers

The two examples of prior work in this area are SATNet (Wang et al. 2020) and the method of Vlastelica et al. (2020). The former describes a neural network layer that provides an approximate MaxSAT solver, implemented using a differentiable SDP solver; the latter, a layer that solves a certain class of combinatorial optimization problems (shortest path, traveling salesman, etc.).
Neither example aims to provide the same functionality as our layer, and a quantitative comparison might yield results that are difficult to interpret objectively.

* In the case of SATNet, training the solver layer amounts to learning the CNF clauses that are solved in the forward pass. The authors' open-source implementation of the approach does not provide a clear interface for manually specifying a set of clauses. Our experiments focus on settings in which the training data does not contain sufficient information to learn the complete logical constraints, which would put SATNet at an unfair disadvantage when measuring accuracy on test data. On the other hand, we would expect the runtime performance of SATNet to dominate our approach, as it is parallelized and can take advantage of hardware acceleration, whereas our implementation uses a single-threaded instance of Z3 that cannot be accelerated with CUDA.
* The approach of Vlastelica et al. was not designed to handle logical constraints without a linear optimization objective; our approach was not designed to expose Z3's discrete optimizer beyond the use of MaxSMT with constraint weights derived from network activations. Thus, none of the problems studied in our evaluation are solvable using their work, and the problems that their method was designed to handle are beyond the scope of our current approach.

In summary, the capabilities provided by prior work in this area do not permit a meaningful apples-to-apples comparison with the approach described in our paper.

#### Neural deductive databases

Another set of related techniques, "neural probabilistic logic programming", may be applicable to some of the decomposable problems that we focus on. The approach taken by these methods is quite different: rather than integrating a solver into the architecture of a network, one writes a probabilistic datalog program in which certain predicates are computed by neural networks.
For problems where the logical theory in our SMTLayer can be encoded as probabilistic datalog, we expect the prediction performance of these approaches to be comparable; e.g., for MNIST Addition, [Huang et al. 2021] (Scallop) report ~97% accuracy, whereas we report ~98%. The earlier work of [Manhaeve et al. 2020] (DeepProbLog) reports similar results. In terms of training cost, although [Huang et al. 2021] evaluate on slightly older hardware accelerators than those used in our evaluation, they report training for ~1000 seconds to reach that accuracy, whereas our approach required ~400 seconds. [Huang et al. 2021] also reported that DeepProbLog took 14 hours to complete three epochs of training on MNIST Addition.

If reviewers believe that a comparison on visual algebra would be informative, we will add this to the paper. However, our view is that neural differentiable logic programs and solver layers are two substantially different approaches to solving problems that involve logical structure and perception. When they are applicable to the same problem, we expect that their quantitative performance (i.e., accuracy) will be comparable, and it is likely that engineering effort can be devoted to making their costs comparable as well. The important differentiating factor is not as easily comparable: just as some tasks are better handled by a deductive database than by an SMT solver, or vice versa, some tasks are likely better handled by DeepProbLog/Scallop than by SMTLayer, or vice versa. Pushing research forward on both methods is valuable.

#### Differentiable Logic

We believe that the setting and approach in our work are not competitive with the cited work on differentiable logic, and may be complementary to it. While work on differentiable logic is motivated by incorporating symbolic knowledge into learning, it often has distinct aims, and best facilitates distinct capabilities.
Also, these approaches require that constraints be representable in differentiable logic, whereas we do not have that restriction. Much work on differentiable logic [1, 4 in Review C] looks to encode symbolic knowledge into differentiable loss functions, which can be used as an optimization objective to train a model. In contrast, we aim to integrate a solver into the computation performed by a model in a way that allows symbolic knowledge to inform both training and the inferences made at test time. This means two things.

* Models trained against DL losses do not have access to the formula or a solver during inference, whereas in our work, a solver does a portion of the inference. This makes our approach better suited to tasks where inference involves computations that solvers are good at (as stated by Reviewer C, "we can use solvers to do arithmetic more intelligently"). While DL approaches might also be applied to these tasks, doing so would mean training a model to do what solvers are already good at, which we agree is not a preferred approach.
* Some of the most compelling uses of DL, as Reviewer C suggests, involve problems that arise from model certification, e.g., encoding robustness, continuity, similarity, and other properties primarily of *models* and their behavior, symbolically in the loss. We do not currently envision a way to encode such properties within a solver layer. However, we can easily see why one may want to train a model that includes a solver layer using a DL loss that, for instance, promotes Lipschitzness (cf. l. 524-529).

Thus, we view our work and the work on DL as complementary. We will expand our related work section to include this discussion of DL.

## Change list

* Add an overview subsection to section 3, giving a "big picture" perspective of our proposed setting and the theoretical results.
* Clarify the meaning and practical ramifications of the restrictions imposed on formulas $\phi$ used by the solver, in reference to comments by Reviewers C and D.
* Add a more thorough discussion of how our approach relates to neural deductive databases and differentiable logic in the introduction and related work.
* If the reviewers believe it to be necessary, add experimental comparisons with related work.

## Response

### Reviewer A

> The theoretical section also needs framing of a bigger picture.

We appreciate the reviewer's perspective, and will add an overview subsection to section 3 to address it. Regarding your particular question about why we do not place computational restrictions on the grounding function: we attempted to motivate this in section 3.1, where we point to the likely possibility that the data is generated by natural processes (e.g., handwritten digits, medical images), so the corresponding grounding function may not be a simple function, or one whose computational properties are known. Our new overview will elaborate on this and other assumptions and conditions of the ERM($\phi$) setting, and relate them to our result to give a better sense of the big picture.

> how is solving this using a DNN + SMT different from just using the DNN to "read" the sudoku and then separately using the SMT solver to solve the sudoku?

At inference time (i.e., when predictions are being made on new data), our approach does effectively use a DNN to read the data, and then uses the SMT solver to make a prediction using the information "read in" by the DNN. Closely related work, such as that of Vlastelica et al. and Wang et al. (SATNet), works similarly, although rather than an SMT solver, they study combinatorial optimizers and approximate MaxSAT solvers, respectively. However, the essential challenge that remains is how to construct a DNN that produces results that are usable by the solver.
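This inference-time pipeline can be caricatured in a few lines of Python. Every name below is an illustrative stand-in, not the paper's implementation: a lookup table plays the role of the trained network, and direct arithmetic plays the role of the SMT solver for a theory like $a + b = y$.

```python
# Caricature of a solver-layer forward pass: a DNN "reads" each input into
# a symbol, and a solver turns the symbols into the final prediction.

def ground(image):
    # Stand-in for the trained DNN: map a raw input to a symbolic digit.
    fake_dnn = {"img_three": 3, "img_four": 4}
    return fake_dnn[image]

def solve_phi(a, b):
    # Stand-in for the solver: the unique y satisfying phi(a, b, y): a + b = y.
    return a + b

def predict(img1, img2):
    # The composed hypothesis: ground each input, then let the solver predict.
    return solve_phi(ground(img1), ground(img2))

print(predict("img_three", "img_four"))  # 7
```

The difficulty discussed next is not this composition itself, but obtaining a `ground` function that produces solver-usable symbols without symbol-level labels.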
If one has access to training labels that are granular and detailed enough to associate a full set of ground terms to populate the SMT solver's constraints (e.g., in the case of Sudoku, labels for all of the digits that appear in the hint), then one can solve a standard supervised learning problem. Our work, and the related work mentioned previously, assumes that such granular labels are not available, and provides a solution for training DNNs with solver-compatible representations.

> it is not clear whether the tasks of grounding and prediction are linked (as in interleaved) or not.

In the example problems studied in our paper, grounding always precedes prediction. They are tightly coupled in the sense that they must agree on a representation of the data; e.g., the grounding function for a Sudoku problem needs to produce digits in a format that the solver can take in.

> It's unclear how this work is just different from the 'obvious' solution of grounding first, then predicting ... why is it difficult then?

The core challenge is in learning a function that grounds correctly. As mentioned earlier, we assume that labels sufficient to associate ground terms with training data are not available, so in order to make use of a set of logical constraints for prediction, one first needs to train a function that produces these labels.

> Also, is the hardness result because there's an assumption of no deep sharing between the layers?

The hardness result (Theorem 3.10) concerns the problem of generating a sample that could be used to directly train a grounding function, and does not depend on how a model that would be trained on that sample is constructed. The SMTLayer approach presented in section 4 does not negate this result, but instead proposes an alternative way to learn a function that is compatible with a theory used for prediction.
Rather than construct a grounding sample to use directly for training, SMTLayer uses information in unsat cores, or provided by a MaxSMT solver, to provide training updates to a DNN.

> Would it be accurate to say that this work studies when grounding is possible and advantageous compared to approaches that try to avoid any leakage between the layers?

The first part of this statement is accurate in regard to section 3 of our paper; section 4 follows with a set of practical techniques for training neural networks that approximate grounding functions. We are unsure about the reference to approaches that avoid leakage between the layers---if the reviewer could provide an example of such an approach, we may be able to give a more definitive response.

> It sounds like this pre-training step is needed because the SMTLayer makes training very slow. Does this mean that SMTLayer is too cost-prohibitive to use time-wise?

Whether the approach is cost-prohibitive likely depends on context, but to give an idea of the difference with and without pre-training: with pre-training, only 5 epochs were needed to obtain the results in Table 1, whereas without pre-training, up to 10-15 epochs may be necessary to achieve comparable results. As per-epoch times range in the single-digit minutes on current hardware, the difference is non-trivial but may not be decisive in all settings. We will add these results to future versions of the paper.

### Reviewer B

> However, I don't fully understand the idea of the backward function, and I think more explanations are needed. In particular, I don't get l744/765: why is \hat{y} calculated this way? What's "corrected layer output"? Where does 2 come from?

The "corrected layer output" refers to a value that produces a smaller loss than the output value produced by the layer in the forward pass. For the final layer of the network, this means the correct prediction.
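This correction can be sketched for a single boolean output as follows (a purely illustrative fragment with hypothetical names, not the paper's code):

```python
# Corrected layer output for a scalar logit interpreted as a boolean:
# subtracting twice the sign of the loss gradient from the sign of the
# logit flips the logit's boolean value exactly when the forward-pass
# prediction was wrong, and preserves it otherwise.

def sign(x):
    return 1.0 if x >= 0 else -1.0

def corrected_output(logit, grad_loss):
    # grad_loss = d(loss)/d(logit): opposite sign to the logit when the
    # prediction was correct, same sign when it was incorrect.
    return sign(logit) - 2.0 * sign(grad_loss)
```

For a correct positive prediction (`logit > 0`, `grad_loss < 0`) this yields `3.0`, still interpretable as "true"; for an incorrect one (`grad_loss > 0`) it yields `-1.0`, flipping the boolean.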
Consider a scalar logit output, which we interpret as a boolean (i.e., "false" if negative, "true" if positive). If this output was correct, then the gradient of the loss will have the opposite sign of the logit, and if it was incorrect, its sign will match. To obtain the correct output, we subtract the sign of this gradient from the sign of the logit; we multiply the former by 2 to ensure that the result is strictly positive or negative, so that it can likewise be interpreted as a boolean. This is currently described in the proof of Theorem 4.1, and we will add a description to the main body of section 4 as well.

> The network setting is a bit specific as the SMTLayer is always used at the top of a deep neural network. However, the algorithms as described seem to work when used a bit deeper inside the network. Is that true?

Yes, it is correct that the algorithms work when the layer is integrated lower in the network. We have considered some applications where doing so may prove beneficial, such as a generative model with a symbolic, interpretable latent space. However, our focus in this paper is on decomposable prediction tasks like those formalized in Def. 3.1, and we believe that suitable models for these tasks are most straightforward with the symbolic layer at the top of the network.

> According to the discussion in related work, the idea of integrating solvers and neural networks is not new, but the experiment does not contain a comparison with existing work. Is that because existing work is not applicable in your experiment settings? Please explain.

We hope that we were able to address this question with the discussion in our overview. We are happy to add comparisons that the reviewer believes would be informative.

### Reviewer C

> You impose the requirement that all satisfying assignments for y are unique. It is not very clear from the text how limiting this restriction is.
> Its technical role is clear from this paper, but how restrictive it is from the point of view of applications is less clear.

We believe that this is not a restriction for learning problems that assume the target hypothesis is a function as opposed to a relation: given a set of features, there is one correct label or prediction associated with it. The assumed condition imposes this requirement. If the grounding function associates one set of ground terms with the given features, then requiring that $\phi$ have one satisfying assignment for its "output" variable $y$ associated with that set of ground terms ensures that composing $\phi$ with the grounding function maintains this functional behavior. In hindsight, we could have more aptly referred to this requirement as imposing a "functional" restriction rather than "uniqueness". We will address this in future versions of the paper.

> How would your work relate to the training methods suggested within the differentiable logic trend?

We hope that we were able to address this question in our overview.

> Please explain why you had to use bitvector grounding in all examples?

Our natural language example uses propositional ground terms, as there is no need to introduce bitvectors for that problem. MNIST addition and visual algebra both use bitvector groundings, as Z3 is able to solve the corresponding constraints in the theory of bitvector arithmetic more efficiently than those in the theory of integer arithmetic. It is not necessary to use bitvectors, but doing so in these examples makes sense for performance reasons. If the reviewers feel that the illustrated examples of these problems earlier in the paper would be clearer if presented in the theory of integers, then we are happy to make this change.

> We can use solvers to do arithmetic more intelligently ...
> a less convoluted problem would be to define logical properties that arise from the (certification of) learning systems naturally: eg robustness, similarity of classes, etc. Examples of such properties are given in eg [1] and [2].

We agree wholeheartedly that models need not be trained to do what solvers are good at. This observation is a primary motive for our work, and Def. 3.1 aims to encapsulate a broad class of problems that match the description. We note that the challenge posed by this observation is to train models with representations that solvers can use, and this may require very granular labels (we acknowledge that MNIST addition is a toy problem in this regard; see l. 295-303). Our techniques aim to address this without requiring additional labels. Regarding evaluation of properties arising from certification, please see our response in the overview; we do not believe that our techniques are directly applicable to most of these.

> I think the proposed algorithm is too simplistic, it simply uses an SMT solver to generate counterexamples and then uses cross entropy on them. It is not a novel idea at all. See e.g. [3].

This description is not precise enough to distinguish either backward pass algorithm (Algs. 3, 4) from a broad class of techniques that do not work. We regret any confusion on this matter that may have come about due to the text on lines 789-796, where we attempt to convey a high-level insight that is common to both algorithms. We will address this part of the text. We did try a method similar to the one suggested by the reviewer early in our work. Unfortunately, we found that backward passes that most closely match the reviewer's description typically fail to converge to a good solution on the problems studied in our evaluation. By contrast, our algorithms use information from an unsat core, or from a MaxSMT solution that incorporates the network's activations, to identify a limited subset of parameters to which updates are provided.
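To illustrate why a core localizes the update, consider the following toy, brute-force stand-in for the solver. The theory (here $\phi$: $a + b = y \wedge a < 5$, with $y$ fixed by the label), the values, and all function names are hypothetical; a real SMT solver returns such cores directly rather than by enumeration.

```python
# Toy illustration of unsat-core localization: only the network outputs
# that appear in a minimal subset conflicting with the label need updating.
from itertools import combinations

def consistent(assignment, label):
    # Theory phi: a + b = y and a < 5, with y fixed to the training label.
    return assignment["a"] + assignment["b"] == label and assignment["a"] < 5

def unsat_core(assumptions, label, domain=range(10)):
    # Find a minimal subset of the network's groundings that cannot be
    # extended to satisfy phi with the given label.
    names = list(assumptions)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            fixed = {k: assumptions[k] for k in subset}
            free = [k for k in names if k not in subset]
            def extend(partial, rest):
                # Can any completion of the free variables satisfy phi?
                if not rest:
                    return consistent(partial, label)
                k, *tail = rest
                return any(extend({**partial, k: v}, tail) for v in domain)
            if not extend(fixed, free):
                return set(subset)
    return set()

# The network grounded a=7, b=2, but phi requires a < 5: the core blames
# only a, so only a's logits would receive a correction.
print(unsat_core({"a": 7, "b": 2}, label=9))  # {'a'}
```

When the groundings are only jointly inconsistent (e.g., `a=3, b=5` with label 9), the core contains both variables and both receive corrections.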
We are familiar with [3], but are unable to locate a technique or proposal that appears to coincide with Algorithms 3 or 4. We are happy to give proper attribution to prior work, and if we have missed something, would appreciate a pointer to the specific element of that work with overlap.

> Theorem 3.10 has some unquantified terms in its statement. Eg, what are z and y in \phi (z,y) with the theorem statement? The body of the proof for that theorem mentions encoding of features of D as CNFs. Where was this defined or explained?

We write $\phi(z,y)$ to denote that $\phi$ is a formula with free occurrences of variables $z$ and $y$. The proof relies on the features of $\mathcal{D}$ *encoding* a CNF as a matrix with $\{-1,0,1\}$ entries that represent, for each clause, whether a literal appears negated, is absent, or appears positive, as described on l. 423-425. The last paragraph of the proof (l. 437-443) explains how a solution to the grounding problem described on l. 422-436 can be converted into a solution to the SAT problem represented by the CNF that appears in the grounding problem.

> ... statement of Theorem 3.11: why is that condition on \phi? why epsilon is divided by 2? why it has to be 0-1 loss?

That condition on $\phi$ ensures that the prediction label is a function of the ground terms $z$, and that the correct ground terms are uniquely defined by the prediction label. The point of the theorem is to delineate a class of decomposable problems that can be directly solved by ERM, with minimal integration of a theory solver in training. The statement requires $\epsilon/2$ for similar reasons as Lemma 3.7 does: we bound the true loss of the ERM solution by $\epsilon/2$ from its empirical loss, and then the empirical loss of any solution by $\epsilon/2$ from that; summing the two yields the $\epsilon$ bound. It uses the 0-1 loss, as our focus throughout the paper is on classification problems where each set of features has one correct prediction label.
A similar result would likely be achievable assuming a bounded loss function, with a related constant factor appearing on the right of the inequality; we believe that this additional generality may complicate the result without providing much utility.

> Remark on lines 524 - 529 cannot be easily parsed. You'd need to explain at least: isoperimetric inequalities, c-isoperimetry, Lipschitz functions, ... In Theorem 4.1 you need to define M-Lipschitz property for loss functions and convexity of sets.

We added this remark to point out an interesting direction that we believe may be fruitful for those familiar with the results in [Bubeck & Sellke 2021] to explore, but we did not intend to give the impression that this is a result of the current work. We will address these issues by adding to the background section.

### Reviewer D

> What could be broken if the targeted problem does not have a unique solution? This is possible due to either the problem's nature or the user-provided logic formula being under-constrained.

First, we believe that there may be a misunderstanding about the uniqueness requirement. In particular, the reviewer writes:

>> Many simple reasoning tasks do not have unique solutions, even for simple MNIST addition. For example, $a + b = 10$ (assuming $a, b$ range from 0 to 9), there will be nine different solutions ... Having constraints like `a=b` to make the solution unique seems quite artificial.

We do not require that there be one satisfying assignment for all of the non-"output" variables in $\phi$ given the label. Rather, we require that given values for all of the variables that the neural network provides, there is at most one solution for the remaining "output" variables. In the above example, if $\phi$ is $a + b = y$, and the neural network attempts to ground $a, b$ so that the prediction is given by $y$, then there is at most one solution for $y$ after grounding. This example meets our requirement.
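To make the requirement concrete, a brute-force enumeration (a pure-Python stand-in for the solver query, with illustrative names and domains) confirms that $\phi: a + b = y$ is functional in $y$ once $a$ and $b$ are grounded:

```python
# Enumerate the output variable's domain in place of a solver: for the
# theory phi(a, b, y): a + b = y, each grounding of (a, b) admits exactly
# one satisfying y, so composing phi with the grounding function yields a
# function, not a relation.

def satisfying_ys(a, b, y_domain=range(19)):
    return [y for y in y_domain if a + b == y]

# Every grounding of two digits determines y uniquely.
assert all(len(satisfying_ys(a, b)) == 1
           for a in range(10) for b in range(10))
```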
This restriction limits our approach to settings where the correct hypothesis for the learning problem is a function, but it does not, for example, require constraining $a=b$ in MNIST addition. We agree that requiring symbolic knowledge to yield a unique solution per label would be very limiting. When training on an MNIST addition sample where the two summands are the same digit, **we do not encode $a=b$ in the constraints used by SMTLayer**; we use the same logic for MNIST addition as DeepProbLog. The purpose of such an experiment is to evaluate how well the approach performs with an impoverished or heavily biased training sample.

Regarding your question about what would be broken if the functional requirement on $\phi$ is violated: this restriction is used in the theoretical analysis given in Section 3, but it does not impact our algorithms in Section 4. In cases where $\phi$ is relational in $y$, rather than functional, the forward pass algorithms would simply select an arbitrary value from among those that satisfy $\phi$. In the backward pass, if the $\hat{y}$ computed on l. 3 of Algs. 3 & 4 were inconsistent with the values of $z$ provided by previous layers, then both algorithms would provide an update as they would if $\phi$ were functional. If $\hat{y}$ were consistent with $z$, then they would provide the update initialized on l. 2 of Algs. 3 & 4.

> By design, ERM($\phi$) seems to be a stricter generalization of SATNet. Does it outperform SATNet on the Visual Sudoku task?

This is correct: SATNet learns the clauses and also provides gradient signal to the network, whereas our approach assumes that the logical formula is provided by the user. At the time of submission, we did not have the resources to evaluate both Visual Sudoku and the Liar's Puzzle in time for the deadline; we opted to include the latter, as it provided more variety in our examples.
We have since evaluated Visual Sudoku, and confirm that our approach (1) performs as well as SATNet on rich training data, and (2) significantly outperforms it on training data that is not comprehensive enough to learn the rules of Sudoku. For example, it learns very quickly on data comprised of a limited set of examples where nearly all of the Sudoku puzzle has already been completed.

> Image segmentation can be challenging but has been manually performed in the MNIST Addition task and Visual Algebra task (#1 configuration). DeepProbLog has the same setup (i.e., segmented images + manually provided rules). It is fairly reasonable to compare with DeepProbLog at least on these two tasks. Why not do so?

We agree that this is a reasonable comparison, and are happy to include it if the reviewer believes that it would be informative. Please see the overview for more discussion of this topic.

> In algorithms 3 and 4, I wonder whether it is necessary to introduce $\hat{y}$. Is it essentially $\mathrm{sign}(y^*)$ instead? What is the difference and advantages (if any)?

Your understanding is correct: $\mathrm{sign}(\hat{y})$ is the same as $\mathrm{sign}(y^*)$. We debated writing it as you suggest, but decided otherwise because the interface that a layer module is typically given, e.g., in PyTorch, does not receive the ground-truth training label, but only the gradient of the loss. We wanted to present algorithms that remain faithful to this widely adopted interface, but we will add text to clarify this part of the algorithms along the lines of your remark.

> How is $h'$ related to $\phi$?

We will expand on this in the paper. One can view $h'$ as a functional "substitute" for $\phi$. As you say, this is the part of the hypothesis that bridges the gap between $h$ and $h_f$, and it can be replaced by a logical formula and a solver.

> perhaps $g$ is not really necessary, or there could be a simplified definition afterward, making the formalization much easier to follow.
We appreciate this point, and will modify the text accordingly. We wanted to present a general definition, which could potentially be referenced by future work that studies the more general setting. We may present the suggested simplified definition in the main body of the paper, and a more general variant in an appendix with some discussion.

> Lemma 3.7 could not be found (or at least not obvious) in the referred textbook, in Chapter 3 of Shalev-Shwartz and Ben-David

Our apologies, this is a typo. It should be Lemma 4.2 in Chapter 4.

> In Theorem 3.14, the relation between $\delta_1$ and $\delta_2$ is a bit subtle. Perhaps there is a tighter bound like $\mathrm{min}(\delta_1(n),\delta_2(n))$? It is not immediately clear.

There may be a tighter bound if one can relate Defs. 3.8 and 3.12, but it seems that this might require more assumptions on $\mathcal{H}$. We will continue to think about the tightness of this bound, and update the paper with our findings if appropriate.

> Interpretability could be a really cool part of ERM($\phi$) ...

We agree! This is one of the aspects of this direction that originally motivated us to explore it, and it is a focus of our ongoing work. We can confirm that in our experiments, the neural representation does indeed match the expected symbolic structure, but there are some subtleties that need more investigation. For example, the representation can deviate from expectation some of the time under covariate shift (e.g., configuration 2 of visual algebra), but there may still be some interpretable structure to pick out. We look forward to exploring this further, and to releasing our library so that others in the community can explore it as well.
