### General Answer
We thank Reviewers wxaY, mZpC, qtJo, and aT2u for their valuable feedback. As noted by the Reviewers, our method is a practical approach to building a sparse representation of a trained neural network that relies on the dual parameterization. This parameterization can further be used to update the representation, avoiding costly retraining from scratch, which is often impractical in sequential settings. We showed empirically, through a broad and diverse set of experiments, that our method achieves competitive performance on several tasks, ranging from uncertainty quantification to continual and reinforcement learning.
We have replied to the Reviewers' comments to make clarifications where necessary and to detail the improvements we have made to our manuscript. Our largest change is improving the clarity of our results: in particular, we have aligned the takeaways from our results with SFR's main motivations. To this end, we have made the following changes:
1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representational power with fewer inducing points (shown in the supplementary PDF).
2. We include a new table in the Appendix (shown in the rebuttal's supplement), which clearly demonstrates our main points: 1) SFR matches the performance of BNN/GLM and 2) SFR outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.
Based on the Reviewers' feedback, we have also made some minor changes to the manuscript:
1. In App. A.2, we provide additional details on the computational complexity of SFR's training, inference, and fast updates, i.e., updating the posterior with new data using the dual conditioning (Lines 219-299). This highlights the low computational cost of updating the posterior via the dual parameterization. <!-- , which is not possible for the GP subset method from Immer et al. (2021)[28]. -->
2. In Sec. 3, we have improved the description of our method, making the derivation of the posterior and how we sparsify it explicit. <!-- Further, we have clarified that the GP subset method from Immer et al. (2021)[28] does not use the dual parameterization in Eq. 4 and, as such, cannot use the fast updates, i.e. the dual conditioning. -->
3. In Sec. 5, we clarify the motivations behind our experiments.
___
___
### Reviewer wxaY (6: Weak Accept; conf: 3)
We thank Reviewer wxaY for their detailed feedback and for highlighting the promise of our dual variables. In light of the comments, we have updated our manuscript and now detail the changes here.
**Q1:** *"The paper did not discuss in detail about their computation cost such as numbers of GPU, training time, sampling cost, etc."*
**A1:** We agree that it is important to discuss our method's computation cost. We refer the Reviewer to App. A.2, where we have already detailed
1. The computational complexity of training and performing inference with our method;
2. The GPUs that we used for our experiments.
However, in light of these reviews, we have added further details on the computational complexity of training/inference/prediction/updating for our method and the baselines. We have also updated App. A.2 to state that we used a single GPU. This provides extra clarity and addresses the issue.
Note that we do not report training/inference/prediction/update times because computation time is highly hardware dependent. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can leverage different batch sizes; thus, computation time varies across the different clusters we used for our experiments.
**Q2:** *"The experiments in this paper are primitive tasks in corresponding field, with relatively straightforward neural network implementation. It could be better if there is some ablation study on some more challenging downstream tasks with more complicated model structures."*
**A2:** We designed our experiments section with two main goals in mind: *(i)* to show that our method is a valid alternative to the Laplace-approximation-based methods in the uncertainty quantification literature, and *(ii)* to show the broad range of applications for which our method is suitable. We agree with the Reviewer that it would be interesting to see the performance of our method on more complicated and challenging tasks, and we see this as a valuable direction for future work.
**Q3:** *"The paper should also briefly introduce the feasibility to train the sparse function space in comparison to weight-space neural network training process."*
**A3:** SFR is a post hoc method like the Laplace approximation: we compute the dual variables that define the sparse GP used at prediction time, or in the downstream tasks, starting from the MAP weights obtained with standard gradient-based NN training.
**Q4:** *"In order to be meaningful enough representation of neural network(keep most information), is there a mathematically equivalent operation in sparse function space to gradient descent(or autodiff) in weight-space neural network. If so, the training cost might reduce significantly."*
**A4:** It should be possible to perform updates to the Jacobian (*i.e.*, update in function space). This is an interesting direction that we hope to explore in future work. We thank the Reviewer for the interesting question and will mention it in the paper as future work.
**Q5:** *"Can this approach be applied from scratch(on randomly initialized neural network right before any training)?"*
**A5:** Yes, it can, but the resulting kernel function will not have learnt the features of the data, so it will revert to the prior kernel (induced by the random initial weights of the NN). We refer the Reviewer to Lines 335/341 in the paper.
**Q6:** *"The paper pointed out that despite an approach similar to sparse Gaussian Process, SFR does not provide a straightforward method for specifying prior covariance function. The author suggested this could be addressed by choice of model architecture and activation functions, very abstract but thanks to the black-box characteristic of neural network, seemingly empirically correct."*
**A6:** This comment is spot on. There is ongoing work investigating which prior assumptions different model architectures and activation functions encode. We refer the Reviewer to [i]–[iv] for more details.
**Q7:** *"Laplace GGN linearization results in locally linear approximation, the ability of global approximation of these GP-like approach should be explored more in the future."*
**A7:** We agree that the difference between the weight-space and function-space approximations offers both advantages and disadvantages. In the continual learning example, regularizing the posterior to be similar in function space is beneficial; it helps explain why function-space regularization outperforms weight-space equivalents such as Online-EWC. Viewing the equivalent posteriors through the different optimization problems they define is an exciting research direction, but it is outside the scope of this paper.
[i] *"Kernel methods for deep learning"*, NIPS 2009.
[ii] *"Stationary activations for uncertainty calibration in deep learning"*, NeurIPS 2020.
[iii] *"Periodic activation functions induce stationarity"*, NeurIPS 2021.
[iv] *"Priors in Bayesian deep learning: A review"*, International Statistical Review 2022.
___
### Reviewer mZpC (3: Reject; conf: 4)
We thank Reviewer mZpC for their valuable comments, for highlighting the importance of the research direction addressed in our submission, and for noting how our method advances prior work on sparse GPs with the NTK.
**Q1:** *"(major) From a technical point of view, it is difficult for me to see what is new. [...] it was difficult to see any new techniques or ideas that may meet the bar for the conference*"
**A1:** We respectfully disagree with Reviewer mZpC's assessment that our paper lacks novelty. We argue that our method is a valuable combination of different techniques that forms a new approach with the benefits of GPs, the dual parameterization, and neural networks.
We think this is especially the case because our method offers the benefits of sparse GPs alongside the representation learning of neural networks. In contrast to other Bayesian deep learning approaches (e.g., the Laplace approximation, variational inference, and deep ensembles), our method can be updated with new data without retraining the neural network or requiring access to the training set. As such, other researchers will likely use our ideas and build on them in areas such as continual learning, online learning, and transfer learning. Moreover, we demonstrated that SFR can be used in both continual and reinforcement learning. To the best of our knowledge, this is the first time a GP representation of a NN has been used for RL.
We also refer Reviewer mZpC to the other Reviewers' contribution scores (3, 3, 2), which indicate that they have seen the value of our paper's contribution.
<!-- First to use dual parameterisation for NN, first to show sparse GP of NN, first to show adding data to NN via sparse GP conditioning -->
**Q2:** *"(major) The experiment design needs better motivation. [...] as the focus of the paper is on sparsification, the experiments should have focused more on reduction in complexity, and answer the question of how much can be scale GPs for deep learning,[...] can this approach compete against other approaches in Bayesian Deep Learning like deep ensembles, etc."*
**A2:** We thank Reviewer mZpC for the valuable feedback. As our work builds on the GP subset method from Immer et al. (2021) [28], we evaluated our method on the same supervised learning experiments. To demonstrate the flexibility and robustness of our method, we also tested it on a broad and diverse set of continual learning and reinforcement learning tasks.
We also note that we already compare to deep ensembles in the RL experiments, as this is a common baseline in that literature; we refer the Reviewer to Sec. 5.4 and Fig. 4. However, a broader comparison to deep ensembles is out of the scope of this work for several reasons: *(i)* this paper focuses on learning GP representations because they can be used to update the model with new data without retraining, *(ii)* we focus primarily on post hoc methods, and *(iii)* the applicability of ensemble methods is questionable in many of the application domains we consider due to storage/training costs.
Finally, we have also updated the first paragraph of Sec. 5 to better introduce and clarify the motivation behind our experiments.
**Q3:** *"Showing that introduced sparsity may not harm the performance seems rather repetitive of literature on sparse GPs."*
**A3:** We respectfully disagree, as we believe this is an important experiment to include given that we are the first to report such experiments for neural networks. Moreover, the fact that such experiments are standard in the sparse GP literature indicates that it is common practice to include them.
<!-- Nevertheless, we have improved the motivation of our experiments by linking them back to our paper's main contributions: providing NNs with uncertainty estimates, the ability to incorporate new data without retraining, and preventing catastrophic forgetting. To this end, we have included an extra experiment table comparing how the methods incorporate new data. We report results for SFR and the GP Subset method, which can both use SFR's dual conditioning (Lines 219–229), to incorporate new data without retraining. We also compare to Laplace GLM/BNN which requires the NN to be retrained on the new data. -->
**Q3:** *"(minor) The title "sparse function space representation of neural networks" might be misleading. In Bayesian Deep Learning literature, function space representations and priors are usually referred to as the follow-up works of "functional variational neural networks" [...]. Including keywords like GPs, dual parameterization, etc may highlight the content of the paper."*
**A3:** We thank the Reviewer for their feedback on the title of our manuscript. A title such as '**Neural Networks as Sparse Gaussian Processes via Dual Parameters**' could indeed fit the message, and if this gets support from the other reviewers/AC we are not against updating the title. Moreover, the function space we refer to in our paper is not related to the functional variational NN literature.
**Q4:** *"(minor) In lines 67-70, it is stated that the sparse formulation of GPs for the linearization formulation of the neural network is under-explored. Some clear references are however missing. For example, one paper from 2021 [i] shows the sparse formulation using mixtures of GP experts. Well-known SN-GP [ii] uses last-layer approximation, which can also be thought of as sparse GPs."*
**A5:** We thank the Reviewer for pointing out these two additional references, which we have now included.
**Q5:** *"Why should "uncertainty quantification in RL" should be included in the related work section? To me, this part was rather occupying space. I would have rather expanded the section on the methods."*
**Q6:** *"On lines 154-157: certain derivation is referred to a book. If possible, more insight should be conveyed here, because this part seems rather one of the key ideas."*
**A6/A7:** We agree with the Reviewer on these minor points. We have moved the "Uncertainty quantification in RL" paragraph to the Appendix and further detail the derivation in the Method section.
[i] *"Trust Your Robots! Predictive Uncertainty Estimation of Neural Networks with Sparse Gaussian Processes"*, in CoRL 2022.
[ii] *"A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness"*, JMLR 2023.
___
### Reviewer qtJo (4: Borderline reject; conf: 3)
We thank Reviewer **qtJo** for their valuable feedback and for pointing out unclear parts of our manuscript. We have updated our submission to address the concerns raised in their review, and we now discuss each point in turn to explain the changes and clarify the unclear parts.
**Q1:** *"[...] incorporating new data samples into dual variables in the full GP expression (eq. 7) is also doable but just very time-consuming, so the advantage of the proposed solution is that it reduces the time complexity of incorporating new samples, not the argument that it augments the full DP expression with the capability of including new samples."*
**A1:** The Reviewer is correct that the dual parameters also allow for incorporating new data points in the full GP setting. However, the computational cost grows with the number of data points, which is the downside of using the full GP. Contrasting Eq. (8) with Eq. (4) in the paper, the complexity switches from a dependence on the number of data points to a dependence on the number of inducing points $m$, which we control. As the Reviewer points out, incorporating new data points has a lower complexity: e.g., assuming we have $n$ original data points and $n_{\textit{new}}$ new points to incorporate, the complexity for the full model is $O((n + n_{\textit{new}})^{3})$, while it is reduced to $O((n_{\textit{new}} \; m)^{2})$ with the sparse model. This lower complexity makes our sparse model suitable for most practical applications.
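To make the update mechanism concrete, below is a minimal illustrative sketch (not our actual implementation). It assumes, for illustration, that the sparse dual parameters accumulate additively over data points as kernel-weighted sums of per-point dual terms; the exact expressions are given by Eq. 7 in the paper, and the names `kernel` and `per_point_duals` are hypothetical placeholders.
```python
import numpy as np

def incorporate_new_data(alpha_u, B_u, Z, X_new, per_point_duals, kernel):
    """Illustrative sketch: add the contribution of new data to the sparse dual
    parameters, assuming they are kernel-weighted sums of per-point dual terms.
    The cost grows with the number of new points and the m inducing points,
    not with the size n of the original training set."""
    a_new, b_new = per_point_duals(X_new)   # hypothetical per-point dual terms, shapes (n_new,)
    K_zx = kernel(Z, X_new)                 # cross-covariance between inducing points and new data, (m, n_new)
    alpha_u = alpha_u + K_zx @ a_new        # additive update of alpha_u
    B_u = B_u + (K_zx * b_new) @ K_zx.T     # additive update of B_u via weighted outer products
    return alpha_u, B_u                     # any cached Cholesky of (K_zz + B_u) is then re-factorized
```
The key point of the sketch is that the existing dual parameters are reused as-is and only the new points are processed, which is what makes the sparse update cheap relative to re-conditioning the full GP.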
**Q2:** *"It might be interesting to conduct experiments to test the significance of whether or not to use inducing points. iirc, reference 42 mentioned in the submission handles the GP without inducing points, not sure if the code is available. I do think that using inducing points would reduce the computational cost, but it also introduces errors since they are randomly sampled from the existing datasets."*
**A2:** Approximations are typically required to scale GP inference to large data sets. The approximation in [42] differs from our sparse GP approximation: it relies on a conjugate gradient (CG) approximation with additional '*finite-difference Fisher vector products*,' which, as the authors acknowledge, introduces a further error on top of the standard CG approximation. Both approaches are approximations to the full GP.
Our paper focuses on building a flexible sparse GP approximation of the neural network that allows for supervised and continual learning, whereas [42] focuses on transfer learning tasks; this makes a direct comparison difficult, as the authors did not publish results on SL tasks. For SL and CL, we focus on baselines that have published results and code in those domains, making the comparisons fair.
**Q3:** *"the paper argues that we would only need to find dual variables for new data samples to incorporate them, but that is only true under the linearization assumption, and standard finetuning or training offers higher-order changes in a neural network. Would the authors recommend only updating the dual variables to incorporate new samples rather than finetuning or retraining neural networks in terms of performance?"*
**A3:** We thank the Reviewer for the interesting question. As correctly noted by the Reviewer and discussed in Sec. 6, our method relies on the Laplace-GGN approach, which linearizes the network around the MAP solution. However, this local approximation is useful in many practical situations. As discussed in our paper, in sequential settings it is often not feasible to retrain the neural network from scratch to incorporate new data. SFR offers an effective way to capture "some of" the information brought by the new data without retraining the model in weight space (which requires access to **all** of the training data).
In light of this comment, we have added extra details to the manuscript to indicate when the dual updates might be preferred to retraining the neural network from scratch (and vice-versa).
___
### Reviewer aT2u (6: Weak Accept; conf: 3)
We thank Reviewer **aT2u** for finding our approach technically sound and our paper well-written. We improved our manuscript using their valuable feedback.
**Q1:** *"From Table 1/2, it seems that the method from [Immer et al] performs just as well or better in many of the testing scenarios, so what is the motivation for SFR?"*
**A1:** SFR follows the GP subset method, whose aim is to equip NNs with the benefits of GPs (e.g., uncertainty quantification, a function-space representation, and the ability to add new data with multivariate normal conditioning). SFR is further motivated by the facts that 1) it scales better than the GP subset method from Immer et al. (2021) [28] (because it needs fewer inducing points), and 2) its dual parameterization offers a more computationally efficient mechanism for incorporating new data without retraining.
<!-- Note that these "fast" updates are made possible by the dual parameterization and are not available for the GP subset method from Immer et al. (2021) [28]. -->
<!-- This comment highlights some issues with the clarity of our results tables. Let us first clarify the aim of our results:
1. Show that SFR matches the performance (NLPD) of the Laplace approximation with BNN/GLM predictions,
2. Show that SFR outperforms the GP subset method with fewer inducing points.
-->
In light of this comment, we have improved the clarity of our UCI results tables/figures by making the following changes (shown in the rebuttal's supplement):
1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representational power with fewer inducing points.
2. We include a new table of results for the UCI data sets in the Appendix, which clearly demonstrates that SFR 1) matches the performance of BNN/GLM and 2) outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.
<!-- **Q2:** *"Figure 3 demonstrates some advantages at lower M (inducing points) but the difference often vanishes by M=512."*
**A2:** This is correct and shows that SFR obtains the same predictive performance as the GP Subset method with fewer inducing points. As such, SFR scales better as it requires less inducing points. -->
<!-- In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. -->
<!-- **A2:** We agree with the Reviewer that Fig. 3 and its description need to be clarified to explain the behavior of our approach with the increasing number of inducing points. The importance of converging faster with a low number of inducing points for the UCI data sets comes from the low dimensionality of these data sets. In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. We have included this figure in the rebuttal's supplement. -->
**Q2:** *"does SFR outperform [Immer et al] on continual learning (Table 3)?"*
**A2:** Immer et al. (2021) [28] do not present any application of their method to continual learning. Adapting their method to the CL setting would require designing a new method, which is beyond the scope of this paper.
<!-- In light of this, our paper first compares SFR to their proposals (BNN and GLM), matching or improving the results in the supervised learning (SL) setting (see the updated Table 1 results). We then go a step further than SL and show two possible use cases of our method: i) building a function-space regularizer to mitigate the effect of forgetting on various standard continual learning data sets, and ii) quantifying the uncertainty of a dynamics model for exploration in model-based RL. -->
**Q3:** *"it seems that computational complexity is a claimed advantage of the sparse GP formulation. If so, it would be nice to have a comparison of the computation time/resources required by different methods in the main paper."*
**A3:** We make no claim that SFR has computational complexity advantages over any of the other approaches. On the contrary, our method has the same computational complexity as the GP subset method. Instead, we claim that we can match the predictive performance of BNN/GLM with far fewer inducing points than the GP subset method, which we show empirically in Table 4 and Figure 3.
Let us also bring the Reviewer's attention to App. A.2, where we detail the computational costs involved in computing the dual parameters from the trained neural network. Note that we do not report computation times because they are highly hardware dependent. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can leverage different batch sizes; thus, computation time varies across hardware.
In light of this comment, we have added additional details on the computational complexity of our method and the baselines. This provides extra clarity and addresses this issue.
**Q4:** *"Many of the CL comparisons (Table 3) seem to be somewhat older. E.g., many of the permuted MNIST results in table 1 of Dhawan et al seem stronger. Is there a reason not to include those newer comparisons?"*
**A4:** As we specify in L283, our CL experiments consider only the single-head (SH) setting, which is more challenging and realistic than the multi-head (MH) setting. The results reported in Dhawan et al. (2023) [i] consider the MH setting (see App. E.2 of [i]).
The MH scenario is considered "easier" than SH (see the discussion in [70]); moreover, SH is also well known to be problematic for parametric approaches such as the one proposed in [i]. In fact, in the MH setting, the CL method also has access at test time to a task identifier to discriminate among the different tasks experienced so far in the CL training regime.
Our results align with those reported by Rudner et al. (2022) [62] and Pan et al. (2020) [55].
**Q5:** *"L273 sentence ends abruptly"*
**A5:** Thanks for pointing this out; we have added the missing link.
**Q6:** *"How $q(\mathbf{u})$ is determined (L168)."*
**A6:** The distribution $q(\mathbf{u})$ is usually parameterized as $\mathcal{N}(\mathbf{u};\mathbf{m},\mathbf{V})$, but we instead use the dual parameterization, which allows us to form the posterior from a converged neural network. The form of $q(\mathbf{u})$ is detailed in App. B.1. We thank the Reviewer for the question and have updated the main paper to reference App. B.1.
**Q7:** *"In general, what is the motivation for training such a GP on data (versus standard NN finetuning), if uncertainty is not a concern?
As mentioned, the validity of the kernel (determined by the jacobian) will decrease if new data is too different from what the original NN weights were trained on, and the GP is incapable of new "deep feature learning.""*
**A7:** Our method linearizes a trained NN around the MAP solution and turns it into the corresponding GP. This formulation lets us compute uncertainty over the model's predictions and is also the starting point for building a function-space regularizer, which is helpful in the CL setting, and for quantifying uncertainty in model-based RL. Completely retraining an already trained NN to include the information in new samples may be inefficient or impossible in such sequential settings. This is an important motivation for choosing our method over competing approaches in the literature.
The restriction of the kernel's validity to the distribution of the training data is a well-known limitation of this class of methods. As we discuss in Sec. 6 of our manuscript, addressing it is an important direction for future work on such methods.
[i] *"Efficient Parametric Approximations of Neural Network Function Space Distance"*, ICML 2023.
___
___
#### Complexity
- [ ] Add table on complexity of training/inference/prediction/updates? + algorithm box
> "where defining the cost of a forward pass (and backward pass) as $[F\!P]$, this can be roughly sketched as $\mathcal{O}(2 E N[F\!P])$, where $E$ is the number of training epochs and $N$ the number of training samples. Thus, there is no additional cost involved in the training phase of the neural network."
> "Computing the kernel Gram matrix for a batch of $B$ samples with this implementation has a computational complexity of $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$, where $C$ is the number of outputs and $P$ is the number of parameters of the neural network."
> (from Appendix A.2)
###### Notation
$N$ → the number of elements in the training set
$E$ → number of training epochs
$[FP]$ → computational complexity of a forward/backward pass of the NN
$B$ → number of elements we are considering for inference/prediction/fast-update
$C$ → number of outputs of the NN
$M$ → number of inducing points (for the GP subset method, the size of the subset of training samples)
$S$ → number of posterior samples
#### Training time
The first stage is training the NN. This requires, for each of the $N$ elements in the training set, a forward pass and a backward pass over the NN, repeated for $E$ epochs.
***Training complexity:*** $\mathcal{O}(2 E N [FP])$
#### Inference time
(I believe the complexity claims in the main paper are not completely correct at the moment.)
We have the trained NN from the training phase; we then need the Jacobian-contraction NTK to compute the kernel Gram matrix, iterating with batch size $B$ over the $N$ training samples (a batched sketch follows at the end of this subsection).
**NTK complexity:** $\mathcal{O}(N C [FP] + N^2 C^2 P)$
**Per batch of $B$ samples:** $\mathcal{O}(B C [FP] + B^2 C^2 P)$
Then we need to compute the outer products to form $\mathbf{B}_\mathbf{u}$:
**Iterating through the outer products:** $\mathcal{O}(N M^2)$
***Inference complexity:*** $\mathcal{O}(NC[FP] + N^2 C^2 P + N M^2)$
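For reference, a minimal sketch of one way to compute a kernel Gram block in batches via Jacobian contraction with `torch.func` (illustrative only; the function name and the use of `torch.func` are assumptions, not necessarily our actual implementation):
```python
import torch
from torch.func import functional_call, jacrev, vmap

def ntk_block(model, params, x1, x2):
    """Jacobian-contraction kernel block J(x1) J(x2)^T for a C-output network,
    computed batch-against-batch: roughly O(B C [FP]) for the Jacobians plus
    O(B^2 C^2 P) for the contraction over the P parameters."""
    def f(p, x):
        # Forward pass on a single input, returning the (C,) output vector.
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

    def flat_jac(x):
        # Per-sample Jacobian w.r.t. all parameters, flattened to shape (C, P).
        jac = jacrev(f)(params, x)
        return torch.cat([j.reshape(j.shape[0], -1) for j in jac.values()], dim=-1)

    J1 = vmap(flat_jac)(x1)  # (B1, C, P)
    J2 = vmap(flat_jac)(x2)  # (B2, C, P)
    return torch.einsum("icp,jdp->icjd", J1, J2)  # (B1, C, B2, C)

# Usage sketch:
# params = {k: v.detach() for k, v in model.named_parameters()}
# K_block = ntk_block(model, params, x_batch_1, x_batch_2)
```
Iterating such blocks over the data gives the batched computation referred to above, while only holding a couple of batches of Jacobians in memory at a time.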
#### Prediction time
At prediction time, we no longer need to worry about computing $\mathbf{B}_\mathbf{u}$ and the other $\mathbf{K}_\mathbf{zz}$ terms, which are cached in the form of their Cholesky decompositions.
We iterate through the new data points in batches of $B$ elements to compute $\mathbf{K}_\mathbf{**}$ and $\mathbf{K}_\mathbf{z*}$, with complexity
$\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P)$
This is a bit tricky, but basically we need to compute the two tensors/matrices, meaning that we pass both the inducing points and the new batch through the NN, and then evaluate the predictive equations, which introduce an additional $\mathcal{O}(MBS)$.
***Prediction complexity:*** $\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P + MBS)$
#### Fast Updates
???
#### Other methods
????
#### Table with complexities
| Method | Training | Inference | Prediction | Updates |
| -------- | -------- | -------- | -------- | -------- |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(NC[FP] + N^2 C^2 P + N M^2)$ | $\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P + MBS)$ | $??$ |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GP subset | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GLM? | $??$ | $??$ | $??$ | $??$ |
| BNN? | $??$ | $??$ | $??$ | $??$ |
<!--
* Where:
* $M$ -> num. inducing points or subset size
* $S$ -> samples from the posterior (we used 100 in our experiments),
* $B$ -> batch of new samples
At prediction time we have already computed and cached $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1}$ and $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$, then we can directly apply the predictive equations that require: i) the computation of the kernel for the new samples $\mathcal{O}( B C [F\!P] + B^2 C^2 P)$, ii) the matrix multiplication required by the predictive equations brings an additional $\mathcal{O}(B M S)$ costs, where we take in consideration the number of samples taken from the posterior.
**TODO:**
* is the inference cost really dominated by the $\mathcal{O}(N M^2)$ cost? We are summing over all the N training samples to compute both $\mathbf{B}_\mathbf{u}$ and $\mathbf{\alpha}_\mathbf{u}$n but then also computing the kernel for each batch of training samples, i.e., $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$.
* how to compute the complexity for fast-updates?
-->
- [ ] Fast-update table UCI Datasets
- [ ] Add algorithm to show what is calculated and when
1. Train NN (like normal)
2. Inference, i.e. fit the sparse GP posterior
- Calculate the dual variables ($\mathbf{\alpha}_{\mathbf{u}}$, $\mathbf{B}_{\mathbf{u}}$) using Eq. 7
- Calculate the matrices below, which can be cached as they do not depend on the test input (see the sketch after step 3).
$$\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1} \quad \text{and} \quad\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$$
3. Predict
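A minimal numpy sketch of step 2 (caching the two test-independent terms displayed above). The jitter and function name are our own illustrative additions, and the predictive equations used in step 3 are not reproduced here:
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cache_test_independent_terms(K_zz, alpha_u, B_u, jitter=1e-6):
    """Pre-compute the matrices that do not depend on the test inputs, so that
    prediction only requires kernel evaluations at the new points."""
    m = K_zz.shape[0]
    I = np.eye(m)
    K_zz_inv = cho_solve(cho_factor(K_zz + jitter * I), I)          # K_zz^{-1}
    K_zz_B_inv = cho_solve(cho_factor(K_zz + B_u + jitter * I), I)  # (K_zz + B_u)^{-1}
    cached_mat = K_zz_inv + K_zz_B_inv                              # first cached term above
    cached_vec = K_zz_inv @ alpha_u                                 # K_zz^{-1} alpha_u
    return cached_mat, cached_vec
```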
___
___
## Aidan Notes
### Future TODOs
- [ ] More complicated model architectures (more than MLP/CNN)
- [ ] CIFAR10/CIFAR100 results