### General Answer
We thank Reviewers wxaY, mZpC, qtJo, and aT2u for their valuable feedback. As noted by the Reviewers, our method is a practical approach to building a sparse representation of a trained neural network that relies on the dual parameterization. This parameterization can further be used to update the representation, avoiding costly retraining from scratch, which is often impractical in sequential settings. We showed empirically, through a broad and diverse set of experiments, that our method achieves competitive performance on several tasks, ranging from uncertainty quantification to continual and reinforcement learning.
We have replied to the Reviewers' comments to make clarifications where necessary and to detail the improvements we have made to our manuscript. Our largest change is improving the clarity of our results: in particular, we have aligned the takeaways from our results with SFR's main motivations. To this end, we have made the following changes:
1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representational power with fewer inducing points (shown in the supplementary PDF).
2. We include a new table in the Appendix (shown in the rebuttal's supplement), which clearly demonstrates our main points: 1) SFR matches the performance of BNN/GLM and 2) SFR outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.
Based on the Reviewers' feedback, we have also made some minor changes to the manuscript:
1. In App. A.2, we provide additional details on the computational complexity of SFR's training, inference, and fast updates, i.e., updating the posterior with new data using the dual conditioning (Lines 219-299). This highlights the low computational cost of updating the posterior via the dual parameterization. <!-- , which is not possible for the GP subset method from Immer et al. (2021)[28]. -->
2. In Sec. 3, we have improved the description of our method, making the derivation of the posterior and how we sparsify it explicit. <!-- Further, we have clarified that the GP subset method from Immer et al. (2021)[28] does not use the dual parameterization in Eq. 4 and, as such, cannot use the fast updates, i.e. the dual conditioning. -->
3. In Sec. 5, we clarify the motivations behind our experiments.
___
___
### Reviewer wxaY (6: Weak Accept; conf: 3)
We thank Reviewer wxaY for their detailed feedback and for highlighting the promise of our dual variables. In light of the comments, we have updated our manuscript and now detail the changes here.
**Q1:** *"The paper did not discuss in detail about their computation cost such as numbers of GPU, training time, sampling cost, etc."*
**A1:** We agree that it is important to discuss our method's computation cost. We refer the Reviewer to App. A.2, where we have already detailed
1. The computational complexity of training and performing inference with our method;
2. The GPUs that we used for our experiments.
However, in light of these reviews, we have added further details on the computational complexity of training/inference/prediction/updating for our method and the baselines. We have also updated App. A.2 to state that we used a single GPU. This provides extra clarity and addresses the issue.
Note that we do not report training/inference/prediction/update times because computation time is highly hardware dependent. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can leverage different batch sizes; thus, computation time varies across the different clusters we used for our experiments.
**Q2:** *"The experiments in this paper are primitive tasks in corresponding field, with relatively straightforward neural network implementation. It could be better if there is some ablation study on some more challenging downstream tasks with more complicated model structures."*
**A2:** We designed our experiments section with two main goals in mind: *(i)* to show that our method is a valid alternative to the Laplace-approximation-based methods in the uncertainty quantification literature, and *(ii)* to show the broad range of applications for which our method is suitable. We agree with the Reviewer that it would be interesting to see the performance of our method on more complicated and challenging tasks, and we see this as a valuable direction for future work.
**Q3:** *"The paper should also briefly introduce the feasibility to train the sparse function space in comparison to weight-space neural network training process."*
**A3:** SFR is a post hoc method like the Laplace approximation: we compute the dual variables that define the sparse GP used at prediction time, or in the downstream tasks, starting from the MAP weights obtained with standard gradient-based NN training.
**Q4:** *"In order to be meaningful enough representation of neural network(keep most information), is there a mathematically equivalent operation in sparse function space to gradient descent(or autodiff) in weight-space neural network. If so, the training cost might reduce significantly."*
**A4:** It should be possible to perform updates to the Jacobian (*i.e.*, update in function space). This is an interesting direction that we hope to explore in future work. We thank the Reviewer for the interesting question and will mention it in the paper as future work.
**Q5:** *"Can this approach be applied from scratch(on randomly initialized neural network right before any training)?"*
**A5:** Yes, it can, but the resulting kernel function will not have learnt the features of the data, so it will revert to the prior kernel (induced by the random initial weights of the NN). We refer the Reviewer to Lines 335/341 in the paper.
**Q6:** *"The paper pointed out that despite an approach similar to sparse Gaussian Process, SFR does not provide a straightforward method for specifying prior covariance function. The author suggested this could be addressed by choice of model architecture and activation functions, very abstract but thanks to the black-box characteristic of neural network, seemingly empirically correct."*
**A6:** This comment is spot on. There is ongoing work investigating which prior assumptions different model architectures and activation functions encode. We refer the Reviewer to [i]–[iv] for more details.
**Q7:** *"Laplace GGN linearization results in locally linear approximation, the ability of global approximation of these GP-like approach should be explored more in the future."*
**A7:** We agree that the difference between the weight-space and function-space approximations offers both advantages and disadvantages. In the continual learning example, regularizing the posterior to be similar in function space is beneficial; it helps explain why function-space regularization outperforms weight-space equivalents such as Online-EWC. Viewing the equivalent posteriors through the different optimization problems they define is an exciting research direction, but it is outside the scope of this paper.
[i] *"Kernel methods for deep learning"*, NIPS 2009.
[ii] *"Stationary activations for uncertainty calibration in deep learning"*, NeurIPS 2020.
[iii] *"Periodic activation functions induce stationarity"*, NeurIPS 2021.
[iv] *"Priors in Bayesian deep learning: A review"*, International Statistical Review 2022.
___
### Reviewer mZpC (3: Reject; conf: 4)
We thank Reviewer mZpC for their valuable comments, for highlighting the importance of the research direction addressed in our submission, and for noting how our method advances prior work on sparse GPs with the NTK.
**Q1:** *"(major) From a technical point of view, it is difficult for me to see what is new. [...] it was difficult to see any new techniques or ideas that may meet the bar for the conference*"
**A1:** We respectfully disagree with Reviewer mZpC's assessment that our paper lacks novelty. We argue that our method is a valuable combination of different techniques that forms a new approach with the benefits of GPs, the dual parameterization, and neural networks.
We think this is especially the case because our method offers the benefits of sparse GPs alongside the representation learning of neural networks. In contrast to other Bayesian deep learning approaches (e.g., the Laplace approximation, variational inference, and deep ensembles), our method can be updated with new data without retraining the neural network or requiring access to the training set. As such, other researchers will likely use our ideas and build on them in areas such as continual learning, online learning, and transfer learning. Moreover, we demonstrated that SFR can be used in both continual and reinforcement learning. To the best of our knowledge, this is the first time a GP representation of a NN has been used for RL.
We also refer Reviewer mZpC to the other Reviewers' contribution scores (3, 3, 2), which indicate that they have seen the value of our paper's contribution.
<!-- First to use dual parameterisation for NN, first to show sparse GP of NN, first to show adding data to NN via sparse GP conditioning -->
**Q2:** *"(major) The experiment design needs better motivation. [...] as the focus of the paper is on sparsification, the experiments should have focused more on reduction in complexity, and answer the question of how much can be scale GPs for deep learning,[...] can this approach compete against other approaches in Bayesian Deep Learning like deep ensembles, etc."*
**A2:** We thank Reviewer mZpC for the valuable feedback. As our work builds on the GP subset method from Immer et al. (2021) [28], we evaluated our method on the same supervised learning experiments. To demonstrate the flexibility and robustness of our method, we also tested it on a broad and diverse set of continual learning and reinforcement learning tasks.
We also note that we already compare to deep ensembles in the RL experiments, as this is a common baseline in that literature; we refer the Reviewer to Sec. 5.4 and Fig. 4. However, a broader comparison to deep ensembles is out of the scope of this work for several reasons: *(i)* this paper focuses on learning GP representations because they can be used to update the model with new data without retraining, *(ii)* we focus primarily on post hoc methods, and *(iii)* the applicability of ensemble methods is questionable in many of the application domains we consider due to storage/training costs.
Finally, we have also updated the first paragraph of Sec. 5 to better introduce and clarify the motivation behind our experiments.
**Q3:** *"Showing that introduced sparsity may not harm the performance seems rather repetitive of literature on sparse GPs."*
**A3:** We respectfully disagree, as we believe this is an important experiment to include given that we are the first to report such experiments for neural networks. Moreover, the fact that such experiments are standard in the sparse GP literature indicates that it is common practice to include them.
<!-- Nevertheless, we have improved the motivation of our experiments by linking them back to our paper's main contributions: providing NNs with uncertainty estimates, the ability to incorporate new data without retraining, and preventing catastrophic forgetting. To this end, we have included an extra experiment table comparing how the methods incorporate new data. We report results for SFR and the GP Subset method, which can both use SFR's dual conditioning (Lines 219–229), to incorporate new data without retraining. We also compare to Laplace GLM/BNN which requires the NN to be retrained on the new data. -->
**Q3:** *"(minor) The title "sparse function space representation of neural networks" might be misleading. In Bayesian Deep Learning literature, function space representations and priors are usually referred to as the follow-up works of "functional variational neural networks" [...]. Including keywords like GPs, dual parameterization, etc may highlight the content of the paper."*
**A3:** We thank the Reviewer for their feedback on the title of our manuscript. A title such as '**Neural Networks as Sparse Gaussian Processes via Dual Parameters**' could indeed fit the message, and if this gets support from the other reviewers/AC we are not against updating the title. Moreover, the function space we refer to in our paper is not related to the functional variational NN literature.
**Q4:** *"(minor) In lines 67-70, it is stated that the sparse formulation of GPs for the linearization formulation of the neural network is under-explored. Some clear references are however missing. For example, one paper from 2021 [i] shows the sparse formulation using mixtures of GP experts. Well-known SN-GP [ii] uses last-layer approximation, which can also be thought of as sparse GPs."*
**A5:** We thank the Reviewer for pointing out these two additional references, which we have now included.
**Q5:** *"Why should "uncertainty quantification in RL" should be included in the related work section? To me, this part was rather occupying space. I would have rather expanded the section on the methods."*
**Q6:** *"On lines 154-157: certain derivation is referred to a book. If possible, more insight should be conveyed here, because this part seems rather one of the key ideas."*
**A6/A7:** We agree with the Reviewer on these minor points. We have moved the "Uncertainty quantification in RL" paragraph to the Appendix and further detail the derivation in the Method section.
[i] *"Trust Your Robots! Predictive Uncertainty Estimation of Neural Networks with Sparse Gaussian Processes"*, in CoRL 2022.
[ii] *"A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness"*, JMLR 2023.
___
### Reviewer qtJo (4: Borderline reject; conf: 3)
We thank Reviewer **qtJo** for their valuable feedback and for pointing out unclear parts of our manuscript. We have updated our submission to address the concerns raised in their review, and we now discuss each point in turn to explain the changes and clarify the unclear parts.
**Q1:** *"[...] incorporating new data samples into dual variables in the full GP expression (eq. 7) is also doable but just very time-consuming, so the advantage of the proposed solution is that it reduces the time complexity of incorporating new samples, not the argument that it augments the full DP expression with the capability of including new samples."*
**A1:** The Reviewer is correct that the dual parameters also allow for incorporating new data points in the full GP setting. However, the computational cost grows with the number of data points, which is the downside of using the full GP. Contrasting Eq. (8) with Eq. (4) in the paper, the complexity switches from a dependence on the number of data points to a dependence on the number of inducing points $m$, which we control. As the Reviewer points out, incorporating new data points has a lower complexity: e.g., assuming we have $n$ original data points and $n_{\textit{new}}$ new points to incorporate, the complexity for the full model is $O((n + n_{\textit{new}})^{3})$, while it is reduced to $O((n_{\textit{new}} \; m)^{2})$ with the sparse model. This lower complexity makes our sparse model suitable for most practical applications.
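To make the update mechanism concrete, below is a minimal illustrative sketch (not our actual implementation). It assumes, for illustration, that the sparse dual parameters accumulate additively over data points as kernel-weighted sums of per-point dual terms; the exact expressions are given by Eq. 7 in the paper, and the names `kernel` and `per_point_duals` are hypothetical placeholders.
```python
import numpy as np

def incorporate_new_data(alpha_u, B_u, Z, X_new, per_point_duals, kernel):
    """Illustrative sketch: add the contribution of new data to the sparse dual
    parameters, assuming they are kernel-weighted sums of per-point dual terms.
    The cost grows with the number of new points and the m inducing points,
    not with the size n of the original training set."""
    a_new, b_new = per_point_duals(X_new)   # hypothetical per-point dual terms, shapes (n_new,)
    K_zx = kernel(Z, X_new)                 # cross-covariance between inducing points and new data, (m, n_new)
    alpha_u = alpha_u + K_zx @ a_new        # additive update of alpha_u
    B_u = B_u + (K_zx * b_new) @ K_zx.T     # additive update of B_u via weighted outer products
    return alpha_u, B_u                     # any cached Cholesky of (K_zz + B_u) is then re-factorized
```
The key point of the sketch is that the existing dual parameters are reused as-is and only the new points are processed, which is what makes the sparse update cheap relative to re-conditioning the full GP.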
**Q2:** *"It might be interesting to conduct experiments to test the significance of whether or not to use inducing points. iirc, reference 42 mentioned in the submission handles the GP without inducing points, not sure if the code is available. I do think that using inducing points would reduce the computational cost, but it also introduces errors since they are randomly sampled from the existing datasets."*
**A2:** Approximations are typically required to scale GP inference to large data sets. The approximation in [42] differs from our sparse GP approximation: it relies on a conjugate gradient (CG) approximation with additional '*finite-difference Fisher vector products*,' which, as the authors acknowledge, introduces a further error on top of the standard CG approximation. Both approaches are approximations to the full GP.
Our paper focuses on building a flexible sparse GP approximation of the neural network that allows for supervised and continual learning, whereas [42] focuses on transfer learning tasks; this makes a direct comparison difficult, as the authors did not publish results on SL tasks. For SL and CL, we focus on baselines that have published results and code in those domains, making the comparisons fair.
**Q3:** *"the paper argues that we would only need to find dual variables for new data samples to incorporate them, but that is only true under the linearization assumption, and standard finetuning or training offers higher-order changes in a neural network. Would the authors recommend only updating the dual variables to incorporate new samples rather than finetuning or retraining neural networks in terms of performance?"*
**A3:** We thank the Reviewer for the interesting question. As correctly noted by the Reviewer and discussed in Sec. 6, our method relies on the Laplace-GGN approach, which linearizes the network around the MAP solution. However, this local approximation is useful in many practical situations. As discussed in our paper, in sequential settings it is often not feasible to retrain the neural network from scratch to incorporate new data. SFR offers an effective way to capture "some of" the information brought by the new data without retraining the model in weight space (which requires access to **all** of the training data).
In light of this comment, we have added extra details to the manuscript to indicate when the dual updates might be preferred to retraining the neural network from scratch (and vice-versa).
___
### Reviewer aT2u (6: Weak Accept; conf: 3)
We thank Reviewer **aT2u** for finding our approach technically sound and our paper well-written. We improved our manuscript using their valuable feedback.
**Q1:** *"From Table 1/2, it seems that the method from [Immer et al] performs just as well or better in many of the testing scenarios, so what is the motivation for SFR?"*
**A1:** SFR follows the GP subset method, whose aim is to equip NNs with the benefits of GPs (e.g., uncertainty quantification, a function-space representation, and the ability to add new data with multivariate normal conditioning). SFR is further motivated by the facts that 1) it scales better than the GP subset method from Immer et al. (2021) [28] (because it needs fewer inducing points), and 2) its dual parameterization offers a more computationally efficient mechanism for incorporating new data without retraining.
<!-- Note that these "fast" updates are made possible by the dual parameterization and are not available for the GP subset method from Immer et al. (2021) [28]. -->
<!-- This comment highlights some issues with the clarity of our results tables. Let us first clarify the aim of our results:
1. Show that SFR matches the performance (NLPD) of the Laplace approximation with BNN/GLM predictions,
2. Show that SFR outperforms the GP subset method with fewer inducing points.
-->
In light of this comment, we have improved the clarity of our UCI results tables/figures by making the following changes (shown in the rebuttal's supplement):
1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representational power with fewer inducing points.
2. We include a new table of results for the UCI data sets in the Appendix, which clearly demonstrates that SFR 1) matches the performance of BNN/GLM and 2) outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.
<!-- **Q2:** *"Figure 3 demonstrates some advantages at lower M (inducing points) but the difference often vanishes by M=512."*
**A2:** This is correct and shows that SFR obtains the same predictive performance as the GP Subset method with fewer inducing points. As such, SFR scales better as it requires less inducing points. -->
<!-- In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. -->
<!-- **A2:** We agree with the Reviewer that Fig. 3 and its description need to be clarified to explain the behavior of our approach with the increasing number of inducing points. The importance of converging faster with a low number of inducing points for the UCI data sets comes from the low dimensionality of these data sets. In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. We have included this figure in the rebuttal's supplement. -->
**Q2:** *"does SFR outperform [Immer et al] on continual learning (Table 3)?"*
**A2:** Immer et al. (2021) [28] do not present any application of their method to continual learning. Adapting their method to the CL setting would require designing a new method, which is beyond the scope of this paper.
<!-- In light of this, our paper first compares SFR to their proposals (BNN and GLM), matching or improving the results in the supervised learning (SL) setting (see the updated Table 1 results). We then go a step further than SL and show two possible use cases of our method: i) building a function-space regularizer to mitigate the effect of forgetting on various standard continual learning data sets, and ii) quantifying the uncertainty of a dynamics model for exploration in model-based RL. -->
**Q3:** *"it seems that computational complexity is a claimed advantage of the sparse GP formulation. If so, it would be nice to have a comparison of the computation time/resources required by different methods in the main paper."*
**A3:** We make no claim that SFR has computational complexity advantages over any of the other approaches. On the contrary, our method has the same computational complexity as the GP subset method. Instead, we claim that we can match the predictive performance of BNN/GLM with far fewer inducing points than the GP subset method, which we show empirically in Table 4 and Figure 3.
Let us also bring the Reviewer's attention to App. A.2, where we detail the computational costs involved in computing the dual parameters from the trained neural network. Note that we do not report computation times because they are highly hardware dependent. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can leverage different batch sizes; thus, computation time varies across hardware.
In light of this comment, we have added additional details on the computational complexity of our method and the baselines. This provides extra clarity and addresses this issue.
**Q4:** *"Many of the CL comparisons (Table 3) seem to be somewhat older. E.g., many of the permuted MNIST results in table 1 of Dhawan et al seem stronger. Is there a reason not to include those newer comparisons?"*
**A4:** As we specify in L283, our CL experiments consider only the single-head (SH) setting, which is more challenging and realistic than the multi-head (MH) setting. The results reported in Dhawan et al. (2023) [i] consider the MH setting (see App. E.2 of [i]).
The MH scenario is considered "easier" than SH (see the discussion in [70]); moreover, SH is also well known to be problematic for parametric approaches such as the one proposed in [i]. In fact, in the MH setting, the CL method also has access at test time to a task identifier to discriminate among the different tasks experienced so far in the CL training regime.
Our results align with those reported by Rudner et al. (2022) [62] and Pan et al. (2020) [55].
**Q5:** *"L273 sentence ends abruptly"*
**A5:** Thanks for pointing this out; we have added the missing link.
**Q6:** *"How $q(\mathbf{u})$ is determined (L168)."*
**A6:** The distribution $q(\mathbf{u})$ is usually parameterized as $\mathcal{N}(\mathbf{u};\mathbf{m},\mathbf{V})$, but we instead use the dual parameterization, which allows us to form the posterior from a converged neural network. The form of $q(\mathbf{u})$ is detailed in App. B.1. We thank the Reviewer for the question and have updated the main paper to reference App. B.1.
**Q7:** *"In general, what is the motivation for training such a GP on data (versus standard NN finetuning), if uncertainty is not a concern?
As mentioned, the validity of the kernel (determined by the jacobian) will decrease if new data is too different from what the original NN weights were trained on, and the GP is incapable of new "deep feature learning.""*
**A7:** Our method linearizes a trained NN around the MAP solution and turns it into the corresponding GP. This formulation lets us compute uncertainty over the model's predictions and is also the starting point for building a function-space regularizer, which is helpful in the CL setting, and for quantifying uncertainty in model-based RL. Completely retraining an already trained NN to include the information in new samples may be inefficient or impossible in such sequential settings. This is an important motivation for choosing our method over competing approaches in the literature.
The restriction of the kernel's validity to the distribution of the training data is a well-known limitation of this class of methods. As we discuss in Sec. 6 of our manuscript, addressing it is an important direction for future work on such methods.
[i] *"Efficient Parametric Approximations of Neural Network Function Space Distance"*, ICML 2023.
___
___
#### Complexity
- [ ] Add table on complexity of training/inference/prediction/updates? + algorithm box
> "where defining the cost of a forward pass (and backward pass) as $[F\!P]$, this can be roughly sketched as $\mathcal{O}(2 E N[F\!P])$, where $E$ is the number of training epochs and $N$ the number of training samples. Thus, there is no additional cost involved in the training phase of the neural network."
> "Computing the kernel Gram matrix for a batch of $B$ samples with this implementation has a computational complexity of $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$, where $C$ is the number of outputs and $P$ is the number of parameters of the neural network."
> (from Appendix A.2)
###### Notation
$N$ → the number of elements in the training set
$E$ → number of training epochs
$[FP]$ → computational complexity of a forward/backward pass of the NN
$B$ → number of elements we are considering for inference/prediction/fast-update
$C$ → number of outputs of the NN
$M$ → number of inducing points (for the GP subset method, the size of the subset of training samples)
$S$ → number of posterior samples
#### Training time
The first stage is training the NN. This requires, for each of the $N$ elements in the training set, a forward pass and a backward pass over the NN, repeated for $E$ epochs.
***Training complexity:*** $\mathcal{O}(2 E N [FP])$
#### Inference time
(I believe the complexity claims in the main paper are not completely correct at the moment.)
We have the trained NN from the training phase; we then need the Jacobian-contraction NTK to compute the kernel Gram matrix, iterating with batch size $B$ over the $N$ training samples (a batched sketch follows at the end of this subsection).
**NTK complexity:** $\mathcal{O}(N C [FP] + N^2 C^2 P)$
**Per batch of $B$ samples:** $\mathcal{O}(B C [FP] + B^2 C^2 P)$
Then we need to compute the outer products to form $\mathbf{B}_\mathbf{u}$:
**Iterating through the outer products:** $\mathcal{O}(N M^2)$
***Inference complexity:*** $\mathcal{O}(NC[FP] + N^2 C^2 P + N M^2)$
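For reference, a minimal sketch of one way to compute a kernel Gram block in batches via Jacobian contraction with `torch.func` (illustrative only; the function name and the use of `torch.func` are assumptions, not necessarily our actual implementation):
```python
import torch
from torch.func import functional_call, jacrev, vmap

def ntk_block(model, params, x1, x2):
    """Jacobian-contraction kernel block J(x1) J(x2)^T for a C-output network,
    computed batch-against-batch: roughly O(B C [FP]) for the Jacobians plus
    O(B^2 C^2 P) for the contraction over the P parameters."""
    def f(p, x):
        # Forward pass on a single input, returning the (C,) output vector.
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

    def flat_jac(x):
        # Per-sample Jacobian w.r.t. all parameters, flattened to shape (C, P).
        jac = jacrev(f)(params, x)
        return torch.cat([j.reshape(j.shape[0], -1) for j in jac.values()], dim=-1)

    J1 = vmap(flat_jac)(x1)  # (B1, C, P)
    J2 = vmap(flat_jac)(x2)  # (B2, C, P)
    return torch.einsum("icp,jdp->icjd", J1, J2)  # (B1, C, B2, C)

# Usage sketch:
# params = {k: v.detach() for k, v in model.named_parameters()}
# K_block = ntk_block(model, params, x_batch_1, x_batch_2)
```
Iterating such blocks over the data gives the batched computation referred to above, while only holding a couple of batches of Jacobians in memory at a time.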
#### Prediction time
At prediction time, we no longer need to worry about computing $\mathbf{B}_\mathbf{u}$ and the other $\mathbf{K}_\mathbf{zz}$ terms, which are cached in the form of their Cholesky decompositions.
We iterate through the new data points in batches of $B$ elements to compute $\mathbf{K}_\mathbf{**}$ and $\mathbf{K}_\mathbf{z*}$, with complexity
$\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P)$
This is a bit tricky, but basically we need to compute the two tensors/matrices, meaning that we pass both the inducing points and the new batch through the NN, and then evaluate the predictive equations, which introduce an additional $\mathcal{O}(MBS)$.
***Prediction complexity:*** $\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P + MBS)$
#### Fast Updates
???
#### Other methods
????
#### Table with complexities
| Method | Training | Inference | Prediction | Updates |
| -------- | -------- | -------- | -------- | -------- |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(NC[FP] + N^2 C^2 P + N M^2)$ | $\mathcal{O}(C [FP] (M + B) + B^2 C^2 P + B M C^2 P + MBS)$ | $??$ |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GP subset | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GLM? | $??$ | $??$ | $??$ | $??$ |
| BNN? | $??$ | $??$ | $??$ | $??$ |
<!--
* Where:
* $M$ -> num. inducing points or subset size
* $S$ -> samples from the posterior (we used 100 in our experiments),
* $B$ -> batch of new samples
At prediction time we have already computed and cached $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1}$ and $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$, then we can directly apply the predictive equations that require: i) the computation of the kernel for the new samples $\mathcal{O}( B C [F\!P] + B^2 C^2 P)$, ii) the matrix multiplication required by the predictive equations brings an additional $\mathcal{O}(B M S)$ costs, where we take in consideration the number of samples taken from the posterior.
**TODO:**
* is the inference cost really dominated by the $\mathcal{O}(N M^2)$ cost? We are summing over all the N training samples to compute both $\mathbf{B}_\mathbf{u}$ and $\mathbf{\alpha}_\mathbf{u}$n but then also computing the kernel for each batch of training samples, i.e., $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$.
* how to compute the complexity for fast-updates?
-->
- [ ] Fast-update table UCI Datasets
- [ ] Add algorithm to show what is calculated and when
1. Train NN (like normal)
2. Inference, i.e. fit the sparse GP posterior
- Calculate the dual variables ($\mathbf{\alpha}_{\mathbf{u}}$, $\mathbf{B}_{\mathbf{u}}$) using Eq. 7
- Calculate the matrices below, which can be cached as they do not depend on the test input (see the sketch after step 3).
$$\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1} \quad \text{and} \quad\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$$
3. Predict
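A minimal numpy sketch of step 2 (caching the two test-independent terms displayed above). The jitter and function name are our own illustrative additions, and the predictive equations used in step 3 are not reproduced here:
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cache_test_independent_terms(K_zz, alpha_u, B_u, jitter=1e-6):
    """Pre-compute the matrices that do not depend on the test inputs, so that
    prediction only requires kernel evaluations at the new points."""
    m = K_zz.shape[0]
    I = np.eye(m)
    K_zz_inv = cho_solve(cho_factor(K_zz + jitter * I), I)          # K_zz^{-1}
    K_zz_B_inv = cho_solve(cho_factor(K_zz + B_u + jitter * I), I)  # (K_zz + B_u)^{-1}
    cached_mat = K_zz_inv + K_zz_B_inv                              # first cached term above
    cached_vec = K_zz_inv @ alpha_u                                 # K_zz^{-1} alpha_u
    return cached_mat, cached_vec
```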
___
___
## Aidan Notes
### Future TODOs
- [ ] More complicated model architectures (more than MLP/CNN)
- [ ] CIFAR10/CIFAR100 results