### General Answer

We thank the Reviewers wxaY, mZpC, qtJo, and aT2u for their valuable feedback. As noted by the Reviewers, our method is a practical approach to building a sparse representation of a trained neural network that relies on the dual parameterization. This parameterization can then be used to update the representation, avoiding costly retraining from scratch, which is often not practical in sequential settings. With a broad and diverse set of experiments, we empirically showed that our method achieves competitive performance across several tasks, ranging from uncertainty quantification to continual and reinforcement learning.

We have replied to the Reviewers' comments to make clarifications where necessary and to detail the improvements we have made to our manuscript. Our largest change is improving the clarity of our results. In particular, we have aligned the takeaways from our results with SFR's main motivations. To this end, we have made the following changes:

1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representative power with fewer inducing points (shown in the supplementary pdf).
2. We include a new table in the Appendix (shown in the rebuttal's supplement), which clearly demonstrates our main points: 1) SFR matches the performance of BNN/GLM, and 2) SFR outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.

Based on the Reviewers' feedback, we have also made some minor changes to the manuscript:

1. In App. A.2, we provide additional details on the computational complexity of SFR's training, inference, and fast updates, i.e., updating the posterior with new data using the dual conditioning (Lines 219-299). This shows the power (low complexity) of updating the posterior using the dual parameterization. <!-- , which is not possible for the GP subset method from Immer et al. (2021)[28]. -->
2. In Sec. 3, we have improved the description of our method, making the derivation of the posterior and how we sparsify it explicit. <!-- Further, we have clarified that the GP subset method from Immer et al. (2021)[28] does not use the dual parameterization in Eq. 4 and, as such, cannot use the fast updates, i.e. the dual conditioning. -->
3. In Sec. 5, we clarify the motivations behind our experiments.
<!-- 4. **TODO** I think we should remove this -->

___
___

### Reviewer wxaY (6: Weak Accept; conf: 3)

We thank Reviewer wxaY for their detailed feedback and for highlighting the promise of our dual variables. In light of the comments, we have updated our manuscript and now detail the changes here.

**Q1:** *"The paper did not discuss in detail about their computation cost such as numbers of GPU, training time, sampling cost, etc."*

**A1:** We agree that it is important to discuss our method's computation cost. We refer the Reviewer to App. A.2, where we have already detailed:
1. the computational complexity of training and performing inference with our method;
2. the GPUs that we used for our experiments.

However, in light of these reviews, we have added additional details on the computational complexity of training/inference/prediction/updating for our method and the baselines. We have also updated App. A.2 to state that we used a single GPU. This provides extra clarity and addresses this issue.

Note that we do not include training/inference/prediction/update times because computation time is highly dependent on the hardware. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can use different batch sizes; thus, computation time varies across the different clusters we used for our experiments.
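For illustration only, the batched accumulation of the dual parameters can be sketched as below. The callables `kernel_fn` and `per_example_duals` (the latter standing in for the per-example terms of Eq. 7) are hypothetical placeholders, not the paper's implementation.

```python
import torch

def fit_dual_parameters(kernel_fn, per_example_duals, Z, loader):
    """Accumulate the dual parameters (alpha_u, B_u) over mini-batches, so GPU
    memory scales with the batch size B rather than the data set size N.

    kernel_fn(Z, X)         -> (M, B) kernel block between inducing points Z and batch X
    per_example_duals(X, y) -> per-example dual terms (alpha_i, beta_i), each of shape (B,),
                               standing in for Eq. (7); both callables are placeholders.
    """
    M = Z.shape[0]
    alpha_u = torch.zeros(M, 1)
    B_u = torch.zeros(M, M)
    for X, y in loader:
        K_zb = kernel_fn(Z, X)                 # (M, B) cross-covariances for this batch
        a_i, b_i = per_example_duals(X, y)     # (B,), (B,)
        alpha_u += K_zb @ a_i.unsqueeze(-1)    # sum_i k(z, x_i) * alpha_i
        B_u += (K_zb * b_i) @ K_zb.T           # sum_i k(z, x_i) * beta_i * k(x_i, z)
    return alpha_u, B_u
```

Only the (M, B) kernel block and the per-example terms of the current batch need to be held in memory at once, which is why the batch size controls the memory footprint.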
**Q2:** *"The experiments in this paper are primitive tasks in corresponding field, with relatively straightforward neural network implementation. It could be better if there is some ablation study on some more challenging downstream tasks with more complicated model structures."*

**A2:** We designed our experiments section with two main goals in mind: *(i)* to show that our method constitutes a valid alternative to the Laplace-approximation-based methods in the uncertainty quantification literature, and *(ii)* to show the broad range of applications our method is suited to. We agree with the Reviewer that it would be interesting to see the performance of our method on more complicated and challenging tasks, and we see this direction as a valuable path for future work.

**Q3:** *"The paper should also briefly introduce the feasibility to train the sparse function space in comparison to weight-space neural network training process."*

**A3:** SFR is a post hoc method like Laplace, meaning that we compute the dual variables that define the sparse GP used at prediction time, or in the downstream tasks, starting from the MAP weights obtained with standard gradient-descent techniques for training NNs.

**Q4:** *"In order to be meaningful enough representation of neural network(keep most information), is there a mathematically equivalent operation in sparse function space to gradient descent(or autodiff) in weight-space neural network. If so, the training cost might reduce significantly."*

**A4:** It should be possible to perform updates to the Jacobian (*i.e.*, an update in function space). This is an interesting direction that we hope to explore in future work. We thank the Reviewer for the interesting question and will include it in the paper as future work.

**Q5:** *"Can this approach be applied from scratch(on randomly initialized neural network right before any training)?"*

**A5:** Yes, it can, but the resulting kernel function will not have learnt the features of the data, so it will revert to the prior kernel (induced by the random initial weights of the NN). We refer the Reviewer to Lines 335/341 in the paper.

**Q6:** *"The paper pointed out that despite an approach similar to sparse Gaussian Process, SFR does not provide a straightforward method for specifying prior covariance function. The author suggested this could be addressed by choice of model architecture and activation functions, very abstract but thanks to the black-box characteristic of neural network, seemingly empirically correct."*

**A6:** This comment is spot on. There is ongoing work looking into what prior assumptions different model architectures and activation functions encode. We refer the Reviewer to [i]–[iv] for more details.

**Q7:** *"Laplace GGN linearization results in locally linear approximation, the ability of global approximation of these GP-like approach should be explored more in the future."*

**A7:** We agree that the weight-space and function-space approximations offer different advantages and disadvantages. In the continual learning example, regularizing the posterior to be similar in function space is beneficial; it helps to explain why function-space regularization outperforms weight-space equivalents such as Online-EWC. Viewing the equivalent posteriors through the different optimization problems they define is an exciting research direction, but it is outside the scope of the paper.
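As a rough illustration of the function-space regularization idea referred to in A7, a generic squared penalty on a set of memory inputs is sketched below. This is not SFR's actual objective (which is derived from its sparse GP posterior); all names here are illustrative.

```python
import torch

def function_space_penalty(model, X_mem, f_old, weight=1.0):
    """Generic function-space regularizer for continual learning: penalize how far
    the current network's outputs drift from the stored predictions f_old at a set
    of memory/inducing inputs X_mem. Illustrative sketch only; the plain squared
    penalty is an assumption, not SFR's regularizer."""
    f_new = model(X_mem)                           # current predictions at memory points
    return weight * ((f_new - f_old) ** 2).mean()

# Sketch of use when training on a new task (task_loss, X_batch, y_batch are illustrative):
# loss = task_loss(model(X_batch), y_batch) + function_space_penalty(model, X_mem, f_old)
```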
[i] *"Kernel methods for deep learning"*, NIPS 2009.
[ii] *"Stationary activations for uncertainty calibration in deep learning"*, NeurIPS 2020.
[iii] *"Periodic activation functions induce stationarity"*, NeurIPS 2021.
[iv] *"Priors in Bayesian deep learning: A review"*, International Statistical Review 2022.

___

### Reviewer mZpC (3: Reject; conf: 4)

We thank Reviewer mZpC for their valuable comments, for highlighting the importance of the research direction treated in our submission, and for noting that our method represents an advancement over prior work on sparse GPs with the NTK.

**Q1:** *"(major) From a technical point of view, it is difficult for me to see what is new. [...] it was difficult to see any new techniques or ideas that may meet the bar for the conference."*

**A1:** We respectfully disagree with Reviewer mZpC's notion that our paper lacks novelty. We argue that our method is a valuable combination of different techniques that forms a new approach, bringing together the benefits of GPs, the dual parameterization, and neural networks. We think this is especially the case as our method offers the benefits of sparse GPs alongside the representation learning of neural networks. In contrast to other Bayesian deep learning approaches (e.g., the Laplace approximation, variational inference, and deep ensembles), our method can be updated given new data without retraining the neural network or access to the training set. As such, other researchers will likely use our ideas and build on them in areas such as continual learning, online learning, and transfer learning. Moreover, we demonstrated that SFR can be used in both continual and reinforcement learning. To the best of our knowledge, this is the first time a GP representation of a NN has been used for RL. We also refer Reviewer mZpC to the other reviewers' contribution scores (3, 3, 2), as they have seen the value of our paper's contribution.

<!-- First to use dual parameteisation for NN, first to show sparse GP of NN, first to show adding data to NN via sparse GP conditioning -->

**Q2:** *"(major) The experiment design needs better motivation. [...] as the focus of the paper is on sparsification, the experiments should have focused more on reduction in complexity, and answer the question of how much can be scale GPs for deep learning,[...] can this approach compete against other approaches in Bayesian Deep Learning like deep ensembles, etc."*

**A2:** We thank Reviewer mZpC for the valuable feedback. As our work builds on the GP subset method from Immer et al. (2021) [28], we evaluated our method on the same supervised learning experiments. To demonstrate the flexibility and robustness of our method, we also tested it on a broad and diverse set of continual learning and reinforcement learning tasks. We note that we do already compare to deep ensembles in the RL experiments, as this is a common baseline in that literature; we refer the Reviewer to Sec. 5.4 and Fig. 4. However, a broader comparison to deep ensembles is out of the scope of this work for several reasons: *(i)* this paper is focused on learning GP representations, as they can be used to update the model on new data without retraining; *(ii)* we primarily focus on post hoc methods; and *(iii)* the applicability of ensemble methods is questionable in many of the application domains we consider due to storage/training cost. Finally, we have also updated the first paragraph of Sec. 5 to better introduce and clarify the motivations behind our experiments.

**Q3:** *"Showing that introduced sparsity may not harm the performance seems rather repetitive of literature on sparse GPs."*

**A3:** We respectfully disagree, as we think this is an important experiment to include given that we are the first to report such experiments for neural networks. Moreover, such experiments are common practice in the sparse GP literature.

<!-- Nevertheless, we have improved the motivation of our experiments by linking them back to our paper's main contributions: providing NNs with uncertainty estimates, the ability to incorporate new data without retraining, and preventing catastrophic forgetting. To this end, we have included an extra experiment table comparing how the methods incorporate new data. We report results for SFR and the GP Subset method, which can both use SFR's dual conditioning (Lines 219–229), to incorporate new data without retraining. We also compare to Laplace GLM/BNN which requires the NN to be retrained on the new data. -->

**Q4:** *"(minor) The title "sparse function space representation of neural networks" might be misleading. In Bayesian Deep Learning literature, function space representations and priors are usually referred to as the follow-up works of "functional variational neural networks" [...]. Including keywords like GPs, dual parameterization, etc may highlight the content of the paper."*

**A4:** We thank the Reviewer for their feedback on the title of our manuscript. A title such as '**Neural Networks as Sparse Gaussian Processes via Dual Parameters**' could indeed fit the message, and if this gets support from the other reviewers/AC, we are not against updating the title. Moreover, the function space we refer to in our paper is not related to the functional variational NN literature.

**Q5:** *"(minor) In lines 67-70, it is stated that the sparse formulation of GPs for the linearization formulation of the neural network is under-explored. Some clear references are however missing. For example, one paper from 2021 [i] shows the sparse formulation using mixtures of GP experts. Well-known SN-GP [ii] uses last-layer approximation, which can also be thought of as sparse GPs."*

**A5:** We thank the Reviewer for pointing out these two additional references, which we have now included.

**Q6:** *"Why should "uncertainty quantification in RL" should be included in the related work section? To me, this part was rather occupying space. I would have rather expanded the section on the methods."*

**Q7:** *"On lines 154-157: certain derivation is referred to a book. If possible, more insight should be conveyed here, because this part seems rather one of the key ideas."*

**A6/A7:** We agree with the Reviewer on these minor points. We have moved "Uncertainty quantification in RL" to the Appendix, and we further detail the derivation in the Method section.

[i] *"Trust Your Robots! Predictive Uncertainty Estimation of Neural Networks with Sparse Gaussian Processes"*, CoRL 2022.
[ii] *"A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness"*, JMLR 2023.
___

### Reviewer qtJo (4: Borderline reject; conf: 3)

We thank Reviewer **qtJo** for their valuable feedback and for pointing out some unclear points in our manuscript. We have updated our submission to address the concerns raised in their review, and we now discuss each point in turn to explain the changes and clarify the unclear parts.

**Q1:** *"[...] incorporating new data samples into dual variables in the full GP expression (eq. 7) is also doable but just very time-consuming, so the advantage of the proposed solution is that it reduces the time complexity of incorporating new samples, not the argument that it augments the full DP expression with the capability of including new samples."*

**A1:** The Reviewer is correct that the dual parameters also allow incorporating new data points in the full GP setting. However, the computational cost grows with the number of data points, which is the downside of using the full GP. Contrasting Eq. (8) with Eq. (4) in the paper, the complexity switches from a dependence on the number of data points to the number of inducing points $m$, which we now control. As the Reviewer points out, incorporating new data points then has a lower complexity: assuming we have $n$ original data points and $n_{\textit{new}}$ new points to incorporate, the complexity in the full model is $O((n + n_{\textit{new}})^{3})$, while it is reduced to $O((n_{\textit{new}} \; m)^{2})$ in the sparse model. This lower complexity makes our sparse model suitable for most practical applications.
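For a rough sense of scale, plugging purely illustrative numbers (not taken from the paper) into the two costs above:

```python
# Back-of-the-envelope comparison of the update costs quoted above.
# n, n_new and m are illustrative values only, not numbers from the paper.
n, n_new, m = 10_000, 100, 512

full_gp_update = (n + n_new) ** 3        # O((n + n_new)^3)
sparse_update = (n_new * m) ** 2         # O((n_new * m)^2)

print(f"full GP: {full_gp_update:.1e}  sparse: {sparse_update:.1e}  "
      f"ratio: ~{full_gp_update / sparse_update:.0f}x")
```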
**Q2:** *"It might be interesting to conduct experiments to test the significance of whether or not to use inducing points. iirc, reference 42 mentioned in the submission handles the GP without inducing points, not sure if the code is available. I do think that using inducing points would reduce the computational cost, but it also introduces errors since they are randomly sampled from the existing datasets."*

**A2:** GP inference approximations are always employed to scale to large data sets. The approximation in [42] differs from our sparse GP approximation: it uses a conjugate gradient (CG) approximation with an additional '*finite-difference Fisher vector products*' step, which the authors admit introduces a further error on top of the standard CG approximation. Both approaches are approximations to the full GP. Our paper focuses on building a flexible sparse GP approximation of the neural network that allows for supervised and continual learning, whereas [42] focuses on transfer learning tasks, making a direct comparison difficult as the authors did not publish results on SL tasks. For SL and CL, we focus on baselines that have directly published results and code in that domain, making comparisons fair.

**Q3:** *"the paper argues that we would only need to find dual variables for new data samples to incorporate them, but that is only true under the linearization assumption, and standard finetuning or training offers higher-order changes in a neural network. Would the authors recommend only updating the dual variables to incorporate new samples rather than finetuning or retraining neural networks in terms of performance?"*

**A3:** We thank the Reviewer for the interesting question. As correctly noted by the Reviewer and discussed in Sec. 6, our method relies on the Laplace-GGN approach that linearizes the network around the MAP solution. Nevertheless, this local approximation is useful in many practical situations. As discussed in our paper, in sequential settings it is often not feasible to retrain the neural network from scratch to incorporate new data. SFR offers an effective way to incorporate "some of" the information brought by the new data without retraining the model in weight space (which requires access to **all** of the training data). In light of this comment, we have added extra details to the manuscript to indicate when the dual updates might be preferred to retraining the neural network from scratch (and vice versa).

___

### Reviewer aT2u (6: Weak Accept; conf: 3)

<!-- We thank Reviewer **aT2u** for finding our approach technically sound and our paper well-written and understandable. We followed their valuable feedback to clarify and improve our manuscript. We now address the concerns raised in both the Weaknesses and Questions sections. -->

We thank Reviewer **aT2u** for finding our approach technically sound and our paper well-written. We have improved our manuscript using their valuable feedback.

**Q1:** *"From Table 1/2, it seems that the method from [Immer et al] performs just as well or better in many of the testing scenarios, so what is the motivation for SFR?"*

**A1:** SFR follows the GP subset method, whose aim is to equip NNs with the benefits of GPs (e.g., uncertainty quantification, a function-space representation, and the ability to add new data with multivariate normal conditioning). SFR is further motivated by the facts that 1) it scales better than the GP subset method from Immer et al. (2021) [28] (because it needs fewer inducing points), and 2) its dual parameterization offers a more computationally efficient mechanism for incorporating new data without retraining.

<!-- Note that these "fast" updates are made possible by the dual parameterization and are not available for the GP subset method from Immer et al. (2021) [28]. -->
<!-- This comment highlights some issues with the clarity of our results tables. Let us first clarify the aim of our results: 1. Show that SFR matches the performance (NLPD) of the Laplace approximation with BNN/GLM predictions, 2. Show that SFR outperforms the GP subset method with fewer inducing points. -->

In light of this comment, we have improved the clarity of our UCI results tables/figures by making the following changes (shown in the rebuttal's supplement):

1. We have updated Figure 3 to show the number of inducing points as a percentage of the training data. We believe this is a clearer way to demonstrate SFR's representative power with fewer inducing points.
2. We include a new table of results for the UCI data set in the Appendix, which clearly demonstrates that SFR 1) matches the performance of BNN/GLM and 2) outperforms the GP subset when using fewer inducing points. Importantly, this table contains less information, so it is easier to read.

<!-- **TODO** -->
<!-- 1. need less M than GP subset -->
<!-- 2. can leverage the dual parameterization for fast updates -->
<!-- **Q2:** *"Figure 3 demonstrates some advantages at lower M (inducing points) but the difference often vanishes by M=512."* **A2:** This is correct and shows that SFR obtains the same predictive performance as the GP Subset method with fewer inducing points. As such, SFR scales better as it requires less inducing points. -->
<!-- In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. -->
<!-- **A2:** We agree with the Reviewer that Fig. 3 and its description need to be clarified to explain the behavior of our approach with the increasing number of inducing points. The importance of converging faster with a low number of inducing points for the UCI data sets comes from the low dimensionality of these data sets. In light of this comment, we have changed the x-axis to show the number of inducing points as a percentage of the number of training data. We have included this figure in the rebuttal's supplement. -->

**Q2:** *"does SFR outperform [Immer et al] on continual learning (Table 3)?"*

**A2:** Immer et al. (2021) [28] do not present any application of their methods to continual learning. Adapting their method to work in the CL setting would require designing a new method, which is beyond the scope of this paper.

<!-- In light of this, our paper first compares SFR to their proposals (BNN and GLM), matching or improving the results in the supervised learning (SL) setting (see the updated Table 1 results). We then go a step further than SL and show two possible use cases of our method: i) building a function-space regularizer to mitigate the effect of forgetting on various standard continual learning data sets, and ii) quantifying the uncertainty of a dynamics model for exploration in model-based RL. -->

**Q3:** *"it seems that computational complexity is a claimed advantage of the sparse GP formulation. If so, it would be nice to have a comparison of the computation time/resources required by different methods in the main paper."*

**A3:** We make no claim that SFR has computational complexity advantages over any of the other approaches. On the contrary, our method has the same computational complexity as the GP subset method. Instead, we claim that we can match the predictive performance of BNN/GLM with far fewer inducing points than the GP subset method, which we show empirically in Table 4 and Figure 3. Let us also bring the Reviewer's attention to App. A.2, where we detail the computational costs involved in computing the dual parameters from the trained neural network. Note that we do not include computation times because they are highly dependent on the hardware. As mentioned in Lines 594/595, we calculate the dual parameters for batches of the training data, keeping the GPU's memory occupation in check. Depending on the GPU's memory, we can use different batch sizes; thus, computation time varies across hardware. In light of this comment, we have added additional details on the computational complexity of our method and the baselines. This provides extra clarity and addresses this issue.

<!-- We do not claim that computational complexity is an advantage of our formulation but we do claim that we can match the predictive performance of BNN/GLM with far fewer inducing inputs than the GP subset method. However, this is empirically shown in Table 4 and Figure 3. -->
<!-- **A4:** We do not claim computational complexity advantage for our sparse formulation. However, we detail how relying on a set of inducing points, together with the sparse GP formulation, is the preferred choice to better model all the training data to relying on a subset of the training set. Moreover, in App. A.2, we discuss the computational costs involved in computing the dual parameters starting from the trained neural network (i.e., to compute $\mathbf{\alpha}_\mathbf{u}$ and $\mathbf{B}_\mathbf{u}$). Following the Reviewer's feedback, we improve this section by discussing the complexity of inference, i.e., what is required to compute the predictive mean and variance, and how we compute the batched version of the method. This is because computing the Jacobian on the entire training data set is prohibitively expensive in both memory and computational time. The computational speed to compute the dual parameters is hardware dependent because the Jacobian computation is a heavy-memory operation, thus having access to large-memory GPUs may be a way to reduce the time needed to compute $\mathbf{B}_\mathbf{u}$. -->

**Q4:** *"Many of the CL comparisons (Table 3) seem to be somewhat older. E.g., many of the permuted MNIST results in table 1 of Dhawan et al seem stronger. Is there a reason not to include those newer comparisons?"*

**A4:** As we specify in L283, our CL experiments consider only the single-head (SH) setting, which is more challenging and realistic than the multi-head (MH) setting. The performance reported in Dhawan et al. (2023) [i] is for the MH setting (see App. E.2 of [i]). The MH scenario is considered "easier" than SH (see the discussion in [70]); moreover, SH is also well known to be problematic for parametric approaches, such as the one proposed in [i]. In fact, in the MH setting, the CL method also has access at test time to a task identifier to discriminate among the different tasks experienced so far in the CL training regime. Our results align with those reported by Rudner et al. (2022) [62] and Pan et al. (2020) [55].

**Q5:** *"L273 sentence ends abruptly"*

**A5:** Thanks for pointing this out; we have added the missing link.

**Q6:** *"How $q(\mathbf{u})$ is determined (L168)."*

**A6:** The distribution $q(\mathbf{u})$ is usually parameterized as $\mathcal{N}(\mathbf{u};\mathbf{m},\mathbf{V})$, but we choose the dual parameterization, which allows us to form the posterior from a converged neural network. The form of $q(\mathbf{u})$ is shown in detail in App. B.1. We thank the Reviewer for the question and have updated the main paper to reference App. B.1.

<!-- We follow [64] in assuming conditional independence $q(\mathbf{f},\mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) q(\mathbf{u})$. Given $q(\mathbf{u})$, the marginal distribution at all other points is easily determined given the conditional independence assumption. -->
<!-- **A7:** We have improved the description and split out the equation to be standalone. We follow the derivation detailed in [64] that assumes conditional independence $q(\mathbf{f},\mathbf{u}) = p(\mathbf{f} \mid \mathbf{u}) q(\mathbf{u})$. Given the distribution $q(\mathbf{u})$, the marginal distribution at all other points is easily determined given the conditional independence assumption. The distribution $q(\mathbf{u})$ is usually parameterized as $\mathcal{N}(\mathbf{u};\mathbf{m},\mathbf{V})$, but we choose to use the dual parameterization allowing us to form the posterior from a converged neural network. The form of $q(\mathbf{u})$ is shown in detail in App. B.1, and we will link the section in the main paper. -->

**Q7:** *"In general, what is the motivation for training such a GP on data (versus standard NN finetuning), if uncertainty is not a concern? As mentioned, the validity of the kernel (determined by the jacobian) will decrease if new data is too different from what the original NN weights were trained on, and the GP is incapable of new "deep feature learning.""*

**A7:** Our method turns a trained NN into its linearized approximation at the MAP and the corresponding GP kernel. This formulation lets us compute uncertainty over the model predictions and is also the starting point for building a function-space regularizer that is helpful in the CL setting and for quantifying uncertainty in model-based RL. Completely retraining an already trained NN to include the information in new samples may be inefficient or impossible in such sequential settings. This is an important motivation for choosing our method over other competing approaches in the literature. The restriction of the kernel's validity to the distribution of the training data is a well-known limitation of this type of method. As we discuss in Sec. 6 of our manuscript, this represents an important direction for future work.

<!-- **A7:** Our method turns a trained neural network into the corresponding linearized approximation at the MAP of the corresponding GP kernel. This formulation lets us compute uncertainty over the model predictions and is also the starting point to build a function-space regularizer helpful in the CL setting and quantifying the uncertainty in model-based RL. Doing a complete retraining of an already trained neural network to include the information in the new samples may be inefficient or impossible for such sequential settings. This perspective represents an important motivation to choose our method over other competing approaches in the literature. -->
<!-- The limitation in the validity of the kernel to the distribution of the training data is a well-known limitation of this typology of methods, as we discuss in Sec. 6 of our manuscript, and this represents an important direction for future research directions meant to improve the generalizability of such methods. -->
<!-- We thank Reviewers aT2u and qtJo for their interesting questions regarding when to prefer the fast updates (via dual conditioning) vs finetuning or retraining from scratch. As correctly noted by the Reviewer and discussed in Sec. 6, our method relies on the Laplace-GGN approach that linearizes the network around the MAP solution. However, this local approximation becomes useful in many practical situations. As discussed in our paper, for sequential settings is often not feasible to retrain the neural network from scratch to incorporate new data. SFR offers an effective approach to consider "some of" the information brought by the new data without retraining the model in weight space (which requires access to all of the training data). In light of this comment, we have added extra details to the manuscript to indicate when the dual updates might be preferred to retraining the neural network from scratch (and vice-versa). -->
[i] *"Efficient Parametric Approximations of Neural Network Function Space Distance"*, ICML 2023.

___
___

#### Complexity

- [ ] Add table on complexity of training/inference/prediction/updates? + algorithm box

> "where defining the cost of a forward pass (and backward pass) as $[F\!P]$, this can be roughly sketched as $\mathcal{O}(2 E N[F\!P])$, where $E$ is the number of training epochs and $N$ the number of training samples. Thus, there is no additional cost involved in the training phase of the neural network."

> "Computing the kernel Gram matrix for a batch of $B$ samples with this implementation has a computational complexity of $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$, where $C$ is the number of outputs and $P$ is the number of parameters of the neural network."

> (from Appendix A.2)

###### Notation

- $N$ → number of elements in the training set
- $E$ → number of training epochs
- $[F\!P]$ → computational complexity of a forward/backward pass of the NN
- $B$ → number of elements we are considering for inference/prediction/fast-update
- $C$ → number of outputs of the NN
- $P$ → number of parameters of the NN
- $M$ → number of inducing points (for the GP subset method, the size of the training subset)
- $S$ → number of posterior samples

#### Training time

The first stage is the training of the NN. This requires, for each of the $N$ elements in the training set, a forward pass and a backward pass over the NN, repeated for $E$ epochs.

***Training complexity:*** $\mathcal{O}(2 E N [F\!P])$

#### Inference time

(I believe the claims in the main paper are not completely fine for now.)

Starting from the trained NN, we need the Jacobian-contraction NTK to compute the kernel Gram matrix, iterating with batch size $B$ over the $N$ training samples.

**NTK complexity:** $\mathcal{O}(N C [F\!P] + N^2 C^2 P)$
**On a batch of $B$ samples:** $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$

Then we need to compute the outer products for $\mathbf{B}_\mathbf{u}$:

**iterating through the outer products:** $\mathcal{O}(N M^2)$

***Inference complexity:*** $\mathcal{O}(NC[F\!P] + N^2 C^2 P + N M^2)$

#### Prediction time

At prediction time we no longer need to worry about computing $\mathbf{B}_\mathbf{u}$ and the other $\mathbf{K}_\mathbf{zz}$ terms, which are cached in the form of their Cholesky decompositions. We iterate through the new data points in batches of $B$ elements to compute $\mathbf{K}_\mathbf{**}$ and $\mathbf{K}_\mathbf{z*}$, with complexity $\mathcal{O}(C [F\!P] (M + B) + B^2 C^2 P + B M C^2 P)$. This is a bit tricky, but basically we need to compute these two tensors/matrices, meaning that we pass both the inducing points and the new batch through the NN and then apply the predictive equations, which introduce an additional $\mathcal{O}(M B S)$.

***Prediction complexity:*** $\mathcal{O}(C [F\!P] (M + B) + B^2 C^2 P + B M C^2 P + M B S)$

#### Fast Updates

???

#### Other methods

????
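As a concrete illustration of the Jacobian-contraction NTK above, the per-batch kernel block can be computed with `torch.func`. This is a dense, memory-hungry sketch suitable for a small MLP; it is an assumed implementation, not the paper's batched code.

```python
import torch
from torch.func import functional_call, jacrev, vmap

def empirical_ntk(model, params, x1, x2):
    """Empirical NTK block K[i, j, c, d] = J_c(x1_i) . J_d(x2_j), where J is the
    Jacobian of the network outputs w.r.t. the parameters.
    Illustrative only: full dense per-example Jacobians are materialized."""
    def fnet_single(p, x):
        # Run a single example through the network using the supplied parameters.
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

    # Per-example Jacobians: one tensor per parameter, shape (batch, C, *param.shape).
    jac1 = vmap(jacrev(fnet_single), (None, 0))(params, x1)
    jac2 = vmap(jacrev(fnet_single), (None, 0))(params, x2)
    jac1 = [j.flatten(2) for j in jac1.values()]   # each (B1, C, P_k)
    jac2 = [j.flatten(2) for j in jac2.values()]   # each (B2, C, P_k)

    # Contract over the parameter dimension and sum the per-parameter blocks.
    return sum(torch.einsum("icp,jdp->ijcd", j1, j2) for j1, j2 in zip(jac1, jac2))

# Example usage (params as a plain dict of detached tensors):
# params = {k: v.detach() for k, v in model.named_parameters()}
# K_block = empirical_ntk(model, params, x_batch_1, x_batch_2)   # (B1, B2, C, C)
```

Assembling the full $N \times N$ Gram matrix block by block with batch size $B$ is what yields the $\mathcal{O}(N C [F\!P] + N^2 C^2 P)$ term quoted above.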
#### Table with complexities

| Method | Training | Inference | Prediction | Updates |
| -------- | -------- | -------- | -------- | -------- |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(NC[F\!P] + N^2 C^2 P + N M^2)$ | $\mathcal{O}(C [F\!P] (M + B) + B^2 C^2 P + B M C^2 P + MBS)$ | $??$ |
| SFR | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GP subset | $\mathcal{O}(2 E N[F\!P])$ | $\mathcal{O}(N M^2)$ | $\mathcal{O}( B C [F\!P] + B^2 C^2 P + B M S)$ | $??$ |
| GLM? | $??$ | $??$ | $??$ | $??$ |
| BNN? | $??$ | $??$ | $??$ | $??$ |

<!--
* Where:
    * $M$ -> num. inducing points or subset size
    * $S$ -> samples from the posterior (we used 100 in our experiments),
    * $B$ -> batch of new samples

At prediction time we have already computed and cached $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1}$ and $\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$, then we can directly apply the predictive equations that require: i) the computation of the kernel for the new samples, $\mathcal{O}( B C [F\!P] + B^2 C^2 P)$; ii) the matrix multiplications required by the predictive equations, which bring an additional $\mathcal{O}(B M S)$ cost, where we take into consideration the number of samples taken from the posterior.

**TODO:**
* is the inference cost really dominated by the $\mathcal{O}(N M^2)$ cost? We are summing over all the $N$ training samples to compute both $\mathbf{B}_\mathbf{u}$ and $\mathbf{\alpha}_\mathbf{u}$, but then also computing the kernel for each batch of training samples, i.e., $\mathcal{O}(B C [F\!P] + B^2 C^2 P)$.
* how to compute the complexity for fast-updates?
-->

- [ ] Fast-update table (UCI Datasets)
- [ ] Add algorithm to show what is calculated and when
    1. Train the NN (as normal).
    2. Inference, i.e., fit the sparse GP posterior:
        - Calculate the dual variables ($\mathbf{\alpha}_{\mathbf{u}}$, $\mathbf{B}_{\mathbf{u}}$) using Eq. 7.
        - Calculate the matrices that can be cached as they do not depend on the test input (a caching sketch follows after the notes below):
        $$\mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1} + (\mathbf{K}_{\mathbf{z}\mathbf{z}} + \mathbf{B}_{\mathbf{u}})^{-1} \quad \text{and} \quad \mathbf{K}_{\mathbf{z}\mathbf{z}}^{-1}\mathbf{\alpha}_{\mathbf{u}}$$
    3. Predict.

___
___

## Aidan Notes

### Future TODOs

- [ ] More complicated model architectures (more than MLP/CNN)
- [ ] CIFAR10/CIFAR100 results
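Returning to step 2 of the algorithm above, a minimal linear-algebra sketch of the caching step, assuming $\mathbf{K}_{\mathbf{z}\mathbf{z}}$, $\mathbf{B}_\mathbf{u}$, and $\mathbf{\alpha}_\mathbf{u}$ have already been computed (the jitter term is an added assumption for numerical stability):

```python
import torch

def cache_posterior_terms(K_zz, B_u, alpha_u, jitter=1e-6):
    """Precompute the test-independent quantities from step 2:
    K_zz^{-1} + (K_zz + B_u)^{-1}  and  K_zz^{-1} alpha_u.
    The Cholesky factors are returned too, so later solves stay cheap.
    Shapes: K_zz, B_u are (M, M); alpha_u is (M, 1)."""
    M = K_zz.shape[0]
    eye = torch.eye(M, dtype=K_zz.dtype)
    L_zz = torch.linalg.cholesky(K_zz + jitter * eye)        # chol(K_zz)
    L_zb = torch.linalg.cholesky(K_zz + B_u + jitter * eye)  # chol(K_zz + B_u)
    cached_mat = torch.cholesky_solve(eye, L_zz) + torch.cholesky_solve(eye, L_zb)
    cached_vec = torch.cholesky_solve(alpha_u, L_zz)         # K_zz^{-1} alpha_u
    return cached_mat, cached_vec, L_zz, L_zb
```

Both factorizations cost $\mathcal{O}(M^3)$ once and are independent of the test inputs, which is why they can be cached before prediction.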
