<p align="center"> <img src="https://hackmd.io/_uploads/rJlqhgsaXlx.png" alt="logo_new"/> </p> <h1 align="center">Reproducing "Laplace Redux – Effortless Bayesian Deep Learning"</h1> <div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 30px;"> <div style="flex: 0 0 auto"> <img src="https://hackmd.io/_uploads/SJWCyitXge.png) " alt="University Logo" width="400" height="auto"> </div> <div style="flex: ;"> **Authors:** Yongcheng Huang (5560950) Yiming Chen (5541786) Zeyu Fan (6179487) **Group Number:** 20 </div> </div> ## 1. Introduction Modern deep neural networks often show overconfidence when dealing with unknown data[^1]. Bayesian methods can provide uncertainty estimates to solve this problem, but their mainstream methods are often complex and costly to implement. The paper "Laplace Redux" reintroduces Laplace approximation (LA) - a classic, simple and efficient Bayesian inference method. The authors point out that LA can transform a standard trained neural network into a Bayesian model with extremely low computational overhead, thereby achieving reliable uncertainty quantification without sacrificing performance. The core is: based on the optimal weight obtained by standard training, a Gaussian distribution is fitted to approximate the posterior and used for probabilistic prediction. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/HJmSd29Xxl.png" alt="LA Visualization" style="width:100%;"> <p><strong>Figure 1:</strong> Visualization of Laplace Approximation</p> </div> Our project aimed to verify the core claims of the paper, and we reproduced two of its key results: <div style="display: flex; align-items: flex-start; gap: 30px; margin: 20px 0;"> <div style="flex: 1; padding-left:0px;"> ### 1.1 Paper Table 1 The paper points out that compared with the standard MAP (Maximum A Posteriori) model, LA, especially LA* (LA is Laplace approximation by default setting while LA* is the most robust setting in terms of distribution shift), significantly reduces the model's overconfidence when facing out-of-distribution **(OOD)** data. At the same time, they also outperform various baseline methods (introduced and elaborates in the Methodology section). </div> <div style="flex: 0 0 50%;padding-top: 70px;text-align: center"> <img src="https://hackmd.io/_uploads/SkY9WiK7ex.png " alt="Table 1: Out-of-Distribution Detection Performance Comparison" width="100%" height="auto"> <p><strong>Table 1:</strong> OOD detection performance averaged over all test sets<sup style="color:blue">[1]</sup> </p> </div> </div> <div align="center"> ### 1.2 Paper Figure 6 Model Calibration Performance Across Real-World Datasets <img src="https://hackmd.io/_uploads/B1TTeoYmee.png " alt="Figure 6: Calibration performance comparison showing LA methods versus MAP, DE, and temperature scaling across five datasets" width="700" height="auto"> <div style="text-align: center"> <p><strong>Figure 2:</strong> Figure 6 in the paper<sup style="color:blue">[1]</sup>. Assessing real-world distribution shift robustness on five datasets from the WILDS benchmark</p> </div> In all five real-world datasets, the LA method is much better than the MAP model in calibration. More importantly, the performance of LA is comparable to the much more computationally expensive deep ensemble <strong>(DE)</strong> and commonly used temperature scaling methods, especially on <strong>OOD</strong> datasets. 
This proves that LA is an efficient method to improve the robustness of models in complex scenarios. </div> ## 2. Methodology Our reproduction methodology adheres strictly to the experimental framework established in the original research. We systematically replicate the specified network architectures, benchmark datasets, and evaluation protocols to ensure methodological consistency and scientific rigor. ### 2.1 Datasets **For Table 1:** MNIST[^2], CIFAR-10[^3], and the corresponding OOD datasets (e.g., FashionMNIST[^4], SVHN[^5]). **For Figure 6:** CivilComments[^8] from the WILDS[^11] benchmark. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/Skd81a9mgl.png" alt="Datasets Overview" style="width:90%;"> <p><strong>Figure 3:</strong> Datasets Overview</p> </div> ### 2.2 Models We use the models specified in the paper: LeNet for MNIST and WideResNet for CIFAR-10 in the Table 1 reproduction, and the pre-trained DistilBERT[^9] model for the WILDS benchmark. ### 2.3 Compared Methods For **Table 1**, we **reproduced** the following methods. For the first three, we strictly follow the authors' logic: 1. **Maximum A Posteriori Estimation (MAP)**: Standard point-estimate network trained with cross-entropy loss plus weight decay. We use Adam and Nesterov-SGD to train **LeNet** and **WRN (WideResNet)-16-4**, respectively, with initial learning rates of $1\times10^{-3}$ and $0.1$, annealed via cosine decay over $100$ and $150$ epochs, respectively. The weight decay is set to $5\times10^{-4}$.$$\theta_{\text{MAP}} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$ 2. **Laplace Approximation (LA and LA\*):** We apply the last-layer Laplace approximations exactly as in the paper: we fit the **Hessian** via BackPACK and compute the analytic posterior predictive via Gaussian integrals.$$p(\theta | D) \approx \mathcal{N}(\theta; \theta_{MAP}, \Sigma) \quad \text{with} \quad \Sigma := \left(\nabla_{\theta}^2 \mathcal{L}(D; \theta)\big|_{\theta_{MAP}}\right)^{-1}$$ ![image](https://hackmd.io/_uploads/S10d609Qeg.png) <div style="text-align: center"> <p><strong>Figure 4: </strong><sup style="color:blue">[1]</sup> Hessian approximation structures in Laplace Redux.</p> </div> 3. **Deep Ensembles (DE):** We train five MAP networks (see Models) independently to form the ensemble.$$\{\theta_1^*, \theta_2^*, \ldots, \theta_M^*\} = \{\arg\max_{\theta_i} p(D|\theta_i)p(\theta_i)\}_{i=1}^M$$ For the following baselines, beyond following the authors' logic, we go further to meet the **Hyperparams check** and **New algorithm variant** criteria: we explore and evaluate different variants of each method, and check hyperparameters to find better settings or combinations than those reported by the authors. 4. **Variational Bayes (VB):** We follow the paper's logic and use the Bayesian-Torch library, a diagonal Gaussian variational posterior, and the flipout estimator to train the network. We then set the prior precision to about $5\times10^{-4}$ to match the **MAP** weight decay, as described in the original paper.
We conduct a systematic **hyperparameter search** over *epochs* and $τ$ (the number of training epochs and the KL-term downscaling factor, respectively) and identify parameter combinations that reduce out-of-distribution confidence while preserving in-distribution accuracy, a measurable improvement over the configuration reported in the original paper. $$q(\theta) = \prod_{i} \mathcal{N}(\theta_i; \mu_i, \sigma_i^2)$$ 5. **Stochastic Weight Averaging-Gaussian (SWAG / SWG):** We follow the paper's logic: starting from a MAP-pretrained model, we run SGD with a constant learning rate ($1×10^{-2}$ on MNIST, $2×10^{-3}$ on CIFAR-10; these are the authors' grid-searched optimal learning rates, which we validate) for $40$ epochs and collect one snapshot per epoch. At test time, we draw $30$ MC samples from the fitted Gaussian posterior and average the softmax outputs. We explore two **new algorithm variants** through systematic comparison across distinct methodological dimensions. **SWAG Covariance Estimation Variants:** Our analysis contrasts the low-rank plus diagonal approach, which implements the full SWAG methodology: $$\Sigma = \Sigma_{\mathrm{diag}} + \frac{1}{K-1}\sum_{i=1}^K(\theta_{t_i}-\mu)(\theta_{t_i}-\mu)^\top$$ This formulation incorporates rank-$r$ truncation by retaining the top-$r$ snapshot deviation vectors within the low-rank term, against the diagonal-only alternative: $$\Sigma_{\mathrm{diag}} = \mathrm{diag}\bigl(M_2 - \mu^2\bigr)$$ The diagonal-only variant uses exclusively the diagonal variances $(M_2-\mu^2)$, omitting the low-rank component to assess its contribution to uncertainty quantification performance. **Batch Normalization Update Strategies:** We evaluate contrasting approaches to batch normalization parameter updates during inference. The standard methodology employs "once per sweep" batch normalization updates across the entire training dataset, following the original authors' established protocol, versus the alternative "per-sample" batch normalization update strategy that incorporates test and out-of-distribution data. This comparison quantifies the impact of the resulting data leakage on out-of-distribution confidence calibration, providing insight into the robustness of different normalization strategies under varying data exposure conditions. ![swag](https://hackmd.io/_uploads/SJvlG65Qel.png) <div style="text-align: center;"> <p><strong>Figure 5:</strong> Visualization of SWAG process</p> </div> For **Paper Figure 6**: We reproduce Figure 6 for the CivilComments benchmark, using DistilBERT as the backbone model, following the exact setup from the paper. However, using the full-Hessian Laplace mode resulted in persistent Cholesky decomposition errors ("non-positive definite"), both in our own implementation and with the original code, even after applying common stabilizations such as jittering. We observed this error only when using >50% of the dataset, making debugging resource-intensive. To preserve reproducibility, we switched to the Kron approximation, so we compare the last-layer LA (~~Full~~ **Kron**) with the MAP baseline, Temperature Scaling, and DE. **Temperature Scaling** is a **post-hoc calibration** method that adjusts the confidence of predictions without changing model accuracy: $$ p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$ - **T > 1** compresses logits, making predictions more uncertain. - **T < 1** amplifies logits, making predictions sharper and more confident.
The optimal temperature T is chosen by minimizing the **Negative Log-Likelihood (NLL)** on a held-out validation set: $$ T^* = \arg\min_{T > 0} \left[ - \frac{1}{N} \sum_i \log p(y_i^* \mid x_i; T) \right] $$ Since this technique does not alter predicted class labels, it preserves classification accuracy but **improves probability calibration**, leading to lower NLL and ECE; it is especially effective on in-distribution (ID) data. ### 2.4 Evaluation Metrics **Paper Table 1:** Average Confidence and AUROC for OOD detection. **Paper Figure 6:** Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE). * **Negative Log-Likelihood (NLL):** Measures the quality of probabilistic predictions. Lower is better. For a predicted probability distribution $p(y \mid x)$ and ground-truth label $y^*$, it is computed as: $$ \text{NLL} = - \frac{1}{N} \sum_{i=1}^N \log p(y_i^* \mid x_i) $$ * **Expected Calibration Error (ECE):** Measures how well predicted probabilities reflect true correctness likelihood. It partitions predictions into bins by confidence and computes the weighted average difference between confidence and accuracy: $$ \text{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| $$ where $B_m$ is the set of samples in bin $m$, and $\text{acc}(B_m)$, $\text{conf}(B_m)$ are the bin's accuracy and average confidence. ## 3. Implementation ### Implementation in Python with PyTorch The primary implementation is developed using Python and the PyTorch framework, leveraging the `laplace-torch` library introduced in the paper. * **Repository:** [laplace_rep](https://github.com/D4vidHuang/laplace_rep) * **Key Libraries:** * `torch` * `laplace-torch` * `torchvision` * `wilds` We first built our reproduction framework independently, relying only on the paper's description and without referencing the authors' code. A subsequent comparison with the original codebase shows improved clarity and modularity in our reproduction. The authors' code relies on the unified but tightly coupled `uq.py` script, while we implemented separate files for each method and dataset. This enhances readability, maintainability, and ease of extension, meeting the **New code variant** criterion in the course guidelines. Specifically, we isolate the logic for training, evaluation, and model definition, and provide dedicated scripts for each uncertainty quantification (UQ) method. This separation reduces unnecessary dependencies between components and facilitates efficient debugging and experimentation. In the remainder of this blog post, you'll encounter function names such as `train_XXX` and `evaluate_XXX`. The `XXX` in these function names serves as a placeholder for the specific dataset being used. For instance, `train_MNIST` refers to the training process performed on the `MNIST` dataset, while `evaluate_CIFAR10` denotes the evaluation conducted on the `CIFAR-10` dataset. Below is a brief overview of our implementations for each method: **MAP:** We train a standard LeNet or WideResNet model using cross-entropy loss and weight decay. The training logic is implemented in `train_XXX_MAP.py`. The evaluation, including OOD confidence and AUROC, is done in `evaluate_XXX_MAP.py`, which supports both MAP and LA evaluation. **DE:** We train multiple (default: 5) independent MAP models with different random seeds. These models are then ensembled by averaging their softmax probabilities. Implemented in `train_XXX_de.py` and `evaluate_XXX_de.py`.
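To make the ensembling step concrete, here is a minimal sketch of the prediction logic (the function names below are illustrative rather than the actual names in our repository): we average the member softmax outputs and report the mean maximum softmax probability, which is the confidence metric used in Table 1.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, loader, device="cpu"):
    """Deep Ensemble prediction: average the softmax outputs of all members."""
    for m in models:
        m.eval()
    all_probs = []
    for x, _ in loader:
        x = x.to(device)
        # Stack member predictions: (M, batch, classes), then average over members.
        member_probs = torch.stack([F.softmax(m(x), dim=-1) for m in models], dim=0)
        all_probs.append(member_probs.mean(dim=0).cpu())
    return torch.cat(all_probs, dim=0)

def mean_confidence(probs):
    """Mean maximum softmax probability, i.e. the 'Confidence' column of Table 1."""
    return probs.max(dim=-1).values.mean().item()
```

The AUROC values are then obtained by scoring both ID and OOD samples with this maximum softmax probability and treating the ID-vs-OOD separation as a binary detection problem.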
**VB:** We use the Bayesian-Torch framework and implement the variational posterior as a diagonal Gaussian. We use Flipout to reduce variance, and scale the KL-divergence by $τ/N$, as described in the paper. The implementation is in `train_XXX_VB.py` and `evaluate_XXX_vb.py`. **SWAG:** We collect snapshots during continued training with a constant learning rate from a pretrained MAP model. Then we estimate the posterior by computing the low-rank + diagonal covariance of the weights. We support both full SWAG and diag-only SWAG variants via a flag. The training and posterior sampling are in `train_XXX_SWAG.py`. For MNIST we separate evaluation from training in `evaluate_MNIST_SWAG.py`. However, for CIFAR, due to the presence of BatchNorm layers in WRN, evaluating with a newly constructed SWAG model may cause a KeyError due to missing BN parameters, so we integrate training and evaluation into a single script `train_cifar10_SWAG.py` to preserve the BN state. **LA:** We apply the laplace-torch library to fit a Gaussian posterior around a pretrained MAP solution. Both vanilla LA and optimized LA* are supported, with configurable Hessian approximations and backends. Implemented in `apply_LA_XXX_MAP.py`. Initially, we observed that applying LA* was significantly slower than LA and did not yield noticeably better results. The latter was puzzling: although LA* uses the full Hessian and is computationally heavier, it should ideally offer improvements. Upon closer inspection, we realized that the difference stemmed from a hidden default: the backend parameter was set to BackpackGGN by default. While this works well for standard LA, it is suboptimal for LA*, which internally benefits more from EF (empirical Fisher) estimation via BackpackEF. After switching the backend to BackpackEF for LA*, we found that: 1. LA* became much faster to apply, almost on par with LA. 2. Its performance improved consistently, especially for OOD detection. The former result is particularly interesting, as one might expect EF to be more expensive to compute. We attribute it to PyTorch's efficient support for batch-wise gradient computation vs. GGN's heavier hook-based estimation. ## 4. Results This section presents the results of our reproduction, directly comparing them to the findings in the original paper. ### 4.1. Reproduction of Table 1: OOD Detection The following tables show our reproduced results for OOD detection on MNIST and CIFAR-10, using our Python implementation. #### 4.1.1. MNIST → OOD This section details the results of our out-of-distribution (OOD) detection experiments on the MNIST dataset. The authors report that, on the MNIST dataset, the LA and LA* methods successfully reduce the overconfidence problem of MAP while remaining competitive with other Bayesian baseline methods on the AUROC metric. They particularly emphasize that LA* performs best on the confidence metric (56.1±0.5), significantly outperforming baseline methods such as VB (73.2±0.8), CSGHMC (69.2±1.7) and SWAG (75.8±0.3). In terms of AUROC, LA* reaches 96.4±0.2, which is comparable to other methods, showing that it effectively improves calibration while maintaining detection performance.
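Before turning to the numbers, the sketch below summarizes how the LA and LA* predictions evaluated in this section are produced with `laplace-torch`, mirroring the backend discussion above. Treat it as illustrative rather than the verbatim contents of `apply_LA_XXX_MAP.py`; exact argument names may differ slightly between library versions.

```python
import torch
from laplace import Laplace
from laplace.curvature import BackPackGGN, BackPackEF

def fit_last_layer_laplace(model, train_loader, robust=False):
    """Post-hoc last-layer Laplace approximation around a trained MAP model.

    robust=False -> LA  (Kronecker-factored Hessian, GGN backend)
    robust=True  -> LA* (full last-layer Hessian, empirical-Fisher backend)
    """
    la = Laplace(
        model,
        "classification",
        subset_of_weights="last_layer",
        hessian_structure="full" if robust else "kron",
        backend=BackPackEF if robust else BackPackGGN,
    )
    la.fit(train_loader)
    # Tune the prior precision by maximizing the marginal likelihood ('marglik').
    la.optimize_prior_precision(method="marglik")
    return la

@torch.no_grad()
def predict_probs(la, x):
    # Closed-form (probit) approximation to the Gaussian-integral predictive.
    return la(x, link_approx="probit")
```

Calling the fitted `Laplace` object in this way corresponds to the analytic posterior predictive via Gaussian integrals described in Section 2.3; the MAP weights themselves are left untouched.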
<div style="display: flex; align-items: flex-start; gap: 15px;"> <div style="flex: 1;"> | Method | Confidence ↓ | AUROC ↑ | |--------|:------------:|:-------:| | MAP | 74.7 ± 0.9 | 96.4 ± 0.1 | | DE | 65.7 ± 0.6 | 97.5 ± 0.1 | | VB | 73.4 ± 0.9 | 96.3 ± 0.2 | | SWG | 75.7 ± 0.7 | 96.4 ± 0.1 | | LA | 67.7 ± 0.8 | 96.1 ± 0.2 | | LA* | 56.4 ± 0.8 | 96.1 ± 0.2 | **Table 2:** OOD detection performance (MNIST) averaged over all test sets </div> <div style="flex: 1; text-align: center;"> ![MNIST_comparison](https://hackmd.io/_uploads/HkvHviq7ll.png) **Figure 6:** Performance comparison between our results and original paper results for OOD detection on MNIST dataset. </div> </div> <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">Method</th> <th style="text-align: center;">EMNIST Conf ↓</th> <th style="text-align: center;">FMNIST Conf ↓</th> <th style="text-align: center;">KMNIST Conf ↓</th> <th style="text-align: center;">EMNIST AUROC ↑</th> <th style="text-align: center;">FMNIST AUROC ↑</th> <th style="text-align: center;">KMNIST AUROC ↑</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">MAP</td> <td style="text-align: center;">83.9 ± 0.6</td> <td style="text-align: center;">63.0 ± 2.6</td> <td style="text-align: center;">77.1 ± 0.7</td> <td style="text-align: center;">93.2 ± 0.2</td> <td style="text-align: center;">98.9 ± 0.2</td> <td style="text-align: center;">97.1 ± 0.1</td> </tr> <tr> <td style="text-align: left;">DE</td> <td style="text-align: center;">76.5 ± 0.4</td> <td style="text-align: center;">55.2 ± 1.1</td> <td style="text-align: center;">65.6 ± 0.4</td> <td style="text-align: center;">94.7 ± 0.1</td> <td style="text-align: center;">99.3 ± 0.1</td> <td style="text-align: center;">98.5 ± 0.0</td> </tr> <tr> <td style="text-align: left;">VB</td> <td style="text-align: center;">80.7 ± 0.7</td> <td style="text-align: center;">66.0 ± 2.5</td> <td style="text-align: center;">73.5 ± 0.7</td> <td style="text-align: center;">93.2 ± 0.4</td> <td style="text-align: center;">98.4 ± 0.4</td> <td style="text-align: center;">97.3 ± 0.3</td> </tr> <tr> <td style="text-align: left;">SWG</td> <td style="text-align: center;">84.7 ± 0.5</td> <td style="text-align: center;">64.0 ± 2.4</td> <td style="text-align: center;">78.5 ± 0.5</td> <td style="text-align: center;">93.2 ± 0.2</td> <td style="text-align: center;">98.9 ± 0.2</td> <td style="text-align: center;">97.1 ± 0.1</td> </tr> <tr> <td style="text-align: left;">LA</td> <td style="text-align: center;">75.5 ± 0.5</td> <td style="text-align: center;">58.3 ± 2.5</td> <td style="text-align: center;">69.2 ± 0.6</td> <td style="text-align: center;">93.1 ± 0.3</td> <td style="text-align: center;">98.5 ± 0.4</td> <td style="text-align: center;">96.7 ± 0.2</td> </tr> <tr> <td style="text-align: left;">LA*</td> <td style="text-align: center;">62.5 ± 0.7</td> <td style="text-align: center;">49.7 ± 2.2</td> <td style="text-align: center;">56.8 ± 0.8</td> <td style="text-align: center;">93.5 ± 0.3</td> <td style="text-align: center;">98.1 ± 0.5</td> <td style="text-align: center;">96.7 ± 0.2</td> </tr> </tbody> </table> <p><strong>Table 3:</strong> MNIST OOD detection results (detailed)</p> </div> <div style="text-align: center;"> <figure> <img src="https://hackmd.io/_uploads/By6SRBamlx.png" alt="Detailed MNIST OOD detection results"> <figcaption><strong>Figure 7:</strong> Detailed MNIST OOD detection results</figcaption> </figure> </div> **Detailed Discussion** **MAP, DE, and LA/LA\***: 
For these methods, our implementation adhered strictly to the methodology and hyperparameter configurations outlined in the original study. Our results demonstrate strong quantitative agreement with the performance metrics reported by the authors. This successful replication confirms the reproducibility of their findings for these baseline and ensemble techniques. **Variational Bayes (VB)**: While the foundational logic of the Variational Bayes (VB) method was maintained, we found that a direct application of the hyperparameters specified in the source material (specifically, `epochs` = 100 and a KL-term downscaling (temperature) factor $τ$ = 0.1) did not yield optimal performance in our experimental environment. Consequently, we performed a targeted hyperparameter search to identify a more suitable configuration. Our experiments, detailed below, indicated that tuning these values was essential to properly balance the model's fit to the data against the regularizing influence of the prior, ultimately allowing us to achieve results consistent with the authors' conclusions. <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">epochs</th> <th style="text-align: left;">50</th> <th style="text-align: left;">75</th> <th style="text-align: left;">100</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">ID Acc ↑</td> <td style="text-align: center;">99.3 ± 0.0</td> <td style="text-align: center;">99.3 ± 0.0</td> <td style="text-align: center;">99.3 ± 0.0</td> </tr> <tr> <td style="text-align: left;">OOD Conf ↓</td> <td style="text-align: center;">72.9 ± 0.7</td> <td style="text-align: center;">74.5 ± 0.9</td> <td style="text-align: center;">75.1 ± 1.3</td> </tr> </tbody> </table> <p><strong>Table 4:</strong> VB performance (MNIST) over different epochs (τ = 10)</p> </div> <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">τ</th> <th style="text-align: left;">0.1</th> <th style="text-align: left;">1</th> <th style="text-align: left;">5</th> <th style="text-align: left;">10</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">ID Acc ↑</td> <td style="text-align: center;">99.2 ± 0.0</td> <td style="text-align: center;">99.3 ± 0.0</td> <td style="text-align: center;">99.3 ± 0.0</td> <td style="text-align: center;">99.3 ± 0.0</td> </tr> <tr> <td style="text-align: left;">OOD Conf ↓</td> <td style="text-align: center;">76.9 ± 1.2</td> <td style="text-align: center;">76.5 ± 0.9</td> <td style="text-align: center;">75.2 ± 1.0</td> <td style="text-align: center;">72.9 ± 0.7</td> </tr> </tbody> </table> <p><strong>Table 5:</strong> VB performance (MNIST) over different τ values (epochs = 50)</p> </div> <div style="display: flex; align-items: flex-start; gap: 15px; margin: 20px 0;"> <div style="flex: 1; text-align: center;"> ![VBHP_MNIST](https://hackmd.io/_uploads/SJ0gjjpQle.png) **Figure 8:** Hyperparameter sensitivity analysis for the Variational Bayes (VB) method on MNIST. (a) Effect of training epochs on ID accuracy and OOD confidence with fixed τ = 10. (b) Effect of temperature τ on performance metrics with fixed epochs = 50. \(c) Comparison between optimal configuration (50 epochs, τ = 10) and original settings (100 epochs, τ = 0.1). (d) OOD confidence landscape showing the interaction between epochs and temperature parameters. (e) Effect of (epochs, τ) combination on OOD confidence.
</div> </div> <div style="text-align: center; margin: 20px 0;"> ![VBPerformance](https://hackmd.io/_uploads/rkqyZIp7ex.png) **Figure 9:** Detailed performance comparison of VB under different hyperparameter settings. The bar charts illustrate the trade-off between ID accuracy (which remains stable) and OOD confidence across varying epochs and temperature values. </div> Our hyperparameter analysis reveals a fundamental trade-off in the behavior of Variational Bayes. The interaction between training epochs and temperature parameter τ critically influences the balance between posterior certainty and prior influence. Specifically, configurations with extended training periods and lower temperature values cause the variational posterior to converge toward the MAP estimate, resulting in overconfident predictions similar to deterministic models. This manifests as higher OOD confidence scores (75.1% at 100 epochs with τ = 10). Conversely, shorter training durations combined with higher temperature values preserve the regularizing effect of the prior distribution, maintaining the posterior's uncertainty and yielding more conservative confidence estimates on OOD data (72.9% at 50 epochs with τ = 10). The temperature parameter τ exhibits a particularly pronounced effect: increasing τ from 0.1 to 10 reduces OOD confidence by approximately 4 percentage points while maintaining consistent ID accuracy. This demonstrates that proper temperature scaling enables VB to achieve better-calibrated uncertainty estimates without sacrificing classification performance. Our experiments identify the configuration of 50 epochs with τ = 10 as optimal for MNIST, successfully reproducing the authors' quantitative results while providing superior uncertainty calibration compared to the originally reported settings. **SWAG:** We investigate two variants: covariance estimation and BN update strategy. <div style="display: flex; align-items: flex-start; gap: 15px;"> <div style="flex: 1;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">Cov Estimation</th> <th style="text-align: left;">Low-rank + Diag</th> <th style="text-align: left;">Diag-only</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">ID Acc ↑</td> <td style="text-align: center;">99.2 ± 0.1</td> <td style="text-align: center;">99.2 ± 0.1</td> </tr> <tr> <td style="text-align: left;">OOD Conf ↓</td> <td style="text-align: center;">76.0 ± 0.4</td> <td style="text-align: center;">76.7 ± 0.5</td> </tr> <tr> <td style="text-align: left;">Train Time ↓</td> <td style="text-align: center;">~110s</td> <td style="text-align: center;">~100s</td> </tr> </tbody> </table> <p style="text-align: center;"><strong>Table 6:</strong> SWAG performance (MNIST) with different covariance estimation strategies</p> </div> <div style="flex: 1; text-align: center;"> ![SWAGMultiMetricComparison](https://hackmd.io/_uploads/rJBzrLpmel.png) **Figure 10:** Multi-metric comparison of SWAG covariance estimation methods on MNIST. </div> </div> ![SWAGCovarianceMNIST](https://hackmd.io/_uploads/SkYZr8pmee.png) **Figure 11:** SWAG covariance estimation strategies comparison on MNIST dataset. (a) Performance metrics showing comparable ID accuracy and OOD confidence between methods. (b) Training time comparison revealing only ~10% overhead for low-rank approximation. \(c) Performance-efficiency trade-off analysis. (d) Relative performance using low-rank + diagonal as baseline. 
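For reference, the sampling step behind the two covariance variants compared in Table 6 can be sketched as follows. We work with flattened parameter vectors; the variable names are illustrative, and the $\tfrac{1}{2}$ scale on both terms follows the standard SWAG formulation.

```python
import torch

def swag_sample(mean, sq_mean, deviations=None, scale=0.5):
    """Draw one weight sample from the SWAG posterior.

    mean:       running average of snapshot weights,  shape (d,)
    sq_mean:    running average of squared weights,   shape (d,)
    deviations: optional (K, d) matrix of snapshot deviations for the low-rank
                term; pass None for the diagonal-only variant.
    """
    var = torch.clamp(sq_mean - mean**2, min=1e-30)  # diagonal variances (M2 - mu^2)
    sample = mean + (scale**0.5) * var.sqrt() * torch.randn_like(mean)
    if deviations is not None:  # low-rank + diagonal SWAG
        K = deviations.shape[0]
        z = torch.randn(K)
        sample = sample + (scale**0.5) * (deviations.t() @ z) / ((K - 1) ** 0.5)
    return sample
```

At test time we draw 30 such samples, write each back into the network, and average the resulting softmax outputs, as described in Section 2.3.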
The results demonstrate that both covariance estimation approaches yield comparable performance. While diagonal-only estimation could be considered sufficient for simple architectures like LeNet on basic datasets such as MNIST, the low-rank plus diagonal approximation provides marginally better uncertainty estimates with only a ~10% increase in training time. Given this favorable performance-cost trade-off, we employed the low-rank plus diagonal approach for both MNIST and CIFAR-10 experiments. As shown in Figure 10 and Figure 11, the multi-metric analysis reveals that the low-rank plus diagonal method achieves slightly lower OOD confidence (76.0% vs 76.7%), indicating better-calibrated uncertainty estimates. The radar chart visualization (Figure 10) further illustrates that both methods maintain identical ID accuracy while the low-rank approximation offers superior OOD performance. The minimal computational overhead (~10 seconds difference in training time, as depicted in Figure 11b) is negligible compared to the improved uncertainty quantification. Therefore, while diagonal-only SWAG demonstrates sufficient performance for simple tasks, the comprehensive benefits of the low-rank plus diagonal approach justify its adoption as our default configuration. <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">BN update</th> <th style="text-align: left;">once per sweep</th> <th style="text-align: left;">per sample</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">ID Acc ↑</td> <td style="text-align: center;">99.2 ± 0.1</td> <td style="text-align: center;">99.2 ± 0.1</td> </tr> <tr> <td style="text-align: left;">OOD Conf ↓</td> <td style="text-align: center;">76.0 ± 0.4</td> <td style="text-align: center;">82.4 ± 0.7</td> </tr> <tr> <td style="text-align: left;">Test Time ↓</td> <td style="text-align: center;">~10s</td> <td style="text-align: center;">2500-3000s</td> </tr> </tbody> </table> <p><strong>Table 7:</strong> SWAG performance (MNIST) with different BN update strategies</p> </div> ![SWAGBatchNormMNIST](https://hackmd.io/_uploads/S1gehoT7xx.png) **Figure 12:** SWAG batch normalization update strategies comparison on MNIST. (a) Performance comparison showing identical ID accuracy but degraded OOD confidence for per-sample updates. (b) Test time comparison on logarithmic scale highlighting the 275× computational overhead. \(c) Summary table with key findings and recommendations. The per-sample batch normalization update strategy proves computationally prohibitive, increasing inference time by approximately 275× (from 10 seconds to 2500-3000 seconds). As illustrated in Figure 12, this approach not only incurs massive computational overhead but also degrades OOD detection performance, with confidence scores increasing from 76.0% to 82.4%. This counterintuitive result stems from data leakage: updating batch normalization statistics on a per-sample basis inadvertently incorporates test and OOD data characteristics into the model's normalization parameters, leading to overconfident predictions on out-of-distribution samples. Table 7 and Figure 12 clearly demonstrate that while both strategies achieve identical ID accuracy (99.2%), the per-sample approach fails on both efficiency and effectiveness metrics. The logarithmic scale visualization in Figure 12b emphasizes the impractical nature of per-sample updates for real-world applications. 
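For clarity, the adopted "once per sweep" strategy can be sketched as follows; `load_sample_fn` is a hypothetical helper that writes one SWAG weight sample into the model, and `update_bn` is PyTorch's standard SWA utility. The key point is that the BatchNorm statistics are refreshed with a single pass over the training data per weight sample, so test and OOD batches never touch the normalization statistics.

```python
import torch
import torch.nn.functional as F
from torch.optim.swa_utils import update_bn

@torch.no_grad()
def swag_predict_once_per_sweep(model, load_sample_fn, train_loader, x, n_samples=30):
    """SWAG prediction with the 'once per sweep' BatchNorm update strategy."""
    probs = 0.0
    for _ in range(n_samples):
        load_sample_fn(model)           # write one sampled weight vector into the model
        update_bn(train_loader, model)  # recompute BN running stats: one sweep, training data only
        model.eval()
        probs = probs + F.softmax(model(x), dim=-1)
    return probs / n_samples
```

The per-sample alternative would instead refresh the statistics on every incoming test batch, which is exactly what introduces both the roughly 275× slowdown and the data leakage discussed above.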
Therefore, we adopted the low-rank plus diagonal covariance estimation with once-per-sweep batch normalization updates for our SWAG implementation, successfully reproducing the authors' quantitative results while maintaining computational feasibility. #### 4.1.2. CIFAR-10 → OOD On the CIFAR-10 dataset, the authors observed a trend similar to MNIST. LA* again achieved the best results on the confidence metric (55.7±1.2), greatly alleviating the overconfidence problem, and LA (69.0±1.3) also outperformed most baseline methods. The authors pointed out that although VB performed better on confidence (58.8±0.7), its AUROC performance was poor (88.7±0.3), and the LA method achieved a better balance between the two indicators. <div style="display: flex; align-items: flex-start; gap: 15px;"> <div style="flex: 1;"> | Method | Confidence ↓ | AUROC ↑ | |--------|:------------:|:-------:| | MAP | 72.2 ± 2.2 | 92.6 ± 0.6 | | DE | 63.6 ± 1.2 | 93.9 ± 0.2 | | VB | 62.5 ± 0.6 | 84.3 ± 0.6 | | SWG | 64.7 ± 3.0 | 90.5 ± 2.2 | | LA | 66.6 ± 0.3 | 87.5 ± 0.3 | | LA* | 38.6 ± 0.3 | 89.5 ± 0.3 | **Table 8:** OOD detection performance (CIFAR-10) averaged over all test sets </div> <div style="flex: 1; text-align: center;"> ![CIFAR-10_comparison](https://hackmd.io/_uploads/rkCxUjq7el.png) **Figure 13:** Performance comparison between our results and original paper results for OOD detection on CIFAR-10 dataset. Despite dataset modifications and training adjustments, our implementation qualitatively reproduces the relative performance patterns across methods. </div> </div> <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">Method</th> <th style="text-align: center;">SVHN Conf ↓</th> <th style="text-align: center;">CIFAR-100 Conf ↓</th> <th style="text-align: center;">SVHN AUROC ↑</th> <th style="text-align: center;">CIFAR-100 AUROC ↑</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">MAP</td> <td style="text-align: center;">66.5 ± 4.2</td> <td style="text-align: center;">77.9 ± 0.3</td> <td style="text-align: center;">95.4 ± 1.0</td> <td style="text-align: center;">89.8 ± 0.2</td> </tr> <tr> <td style="text-align: left;">DE</td> <td style="text-align: center;">57.7 ± 2.4</td> <td style="text-align: center;">69.5 ± 0.2</td> <td style="text-align: center;">96.4 ± 0.5</td> <td style="text-align: center;">91.4 ± 0.1</td> </tr> <tr> <td style="text-align: left;">VB</td> <td style="text-align: center;">63.3 ± 1.4</td> <td style="text-align: center;">61.6 ± 0.4</td> <td style="text-align: center;">84.6 ± 1.1</td> <td style="text-align: center;">84.1 ± 0.1</td> </tr> <tr> <td style="text-align: left;">SWG</td> <td style="text-align: center;">58.6 ± 5.0</td> <td style="text-align: center;">70.8 ± 1.8</td> <td style="text-align: center;">94.1 ± 2.7</td> <td style="text-align: center;">86.9 ± 2.1</td> </tr> <tr> <td style="text-align: left;">LA</td> <td style="text-align: center;">62.9 ± 0.4</td> <td style="text-align: center;">70.3 ± 0.2</td> <td style="text-align: center;">90.0 ± 0.3</td> <td style="text-align: center;">85.1 ± 0.4</td> </tr> <tr> <td style="text-align: left;">LA*</td> <td style="text-align: center;">35.4 ± 0.4</td> <td style="text-align: center;">41.9 ± 0.2</td> <td style="text-align: center;">92.1 ± 0.3</td> <td style="text-align: center;">86.9 ± 0.3</td> </tr> </tbody> </table> <p><strong>Table 9:</strong> CIFAR-10 OOD detection results (detailed)</p> </div> <div style="text-align: center;">
![CIFAR10OODDetection](https://hackmd.io/_uploads/HJtOR8aXeg.png) **Figure 14:** CIFAR-10 OOD detection performance visualization. (a) OOD confidence scores comparison across methods and datasets. (b) AUROC performance showing discriminative ability. \(c) Average performance metrics highlighting LA* superiority. (d-e) Heatmaps revealing performance patterns. (f) Confidence-AUROC trade-off scatter plot. </div> **General Discussion:** Due to the inaccessibility of the LSUN dataset (large size and website unavailability), we utilized SVHN and CIFAR-100 as OOD test sets for CIFAR-10. These datasets provide complementary evaluation scenarios: CIFAR-100 shares structural similarities with CIFAR-10, resulting in moderate detection difficulty, while SVHN presents significant distributional differences, creating more challenging OOD detection tasks. This combination effectively captures the spectrum of OOD detection challenges, making our evaluation comprehensive despite the LSUN omission. A critical finding in our experiments is that the WRN-16-4 architecture requires at least 150 epochs for proper convergence, compared to the 100 epochs reported in the original paper. This training adjustment fundamentally alters the MAP baseline and consequently affects all comparative results. Therefore, our reproduction achieves qualitative validation of the methods' relative performance rather than exact quantitative replication. <figure style="text-align: center; margin: 2em;"> <img src="https://hackmd.io/_uploads/SyxzeDTQex.png" alt="OOD Confidence Score" style="max-width: 80%;"> <figcaption style="margin-top: 0.5em;"><strong>Figure 15:</strong> OOD Confidence Scores (CIFAR-10)</figcaption> </figure> <figure style="text-align: center; margin: 2em;"> <img src="https://hackmd.io/_uploads/Hk5MxwT7xe.png" alt="OOD Detection Performance" style="max-width: 80%;"> <figcaption style="margin-top: 0.5em;"><strong>Figure 16:</strong> OOD Detection Performance Comparison (SVHN and CIFAR-10)</figcaption> </figure> **Detailed Discussion:** **MAP:** Following the authors' implementation strictly, we observed that the increased complexity of both the WRN-16-4 model and CIFAR-10 dataset leads to higher variance in MAP solutions, as evidenced by the larger standard errors in Table 8. The convergence issue necessitated extended training, which may contribute to the baseline differences from the original paper. **LA and LA\*:** Both Laplace approximation variants successfully reduce predictive confidence on OOD data while significantly stabilizing the MAP solutions. Standard LA performs as expected, reducing OOD confidence moderately while maintaining good AUROC scores. LA* demonstrates dramatic confidence reduction (38.6% average), potentially over-compensating for uncertainty. We attribute this behavior to the 'marglik' hyperparameter optimization method used in our implementation. While we attempted to employ cross-validation methods, they are not available in the current Laplace library. The authors likely used grid search for hyperparameter tuning, which proved computationally prohibitive for our resources. **DE:** Deep Ensembles perform consistently with theoretical expectations, achieving the best balance between low OOD confidence (63.6%) and high AUROC (93.9%). However, the method inherits the instability issues from its constituent MAP models, resulting in higher variance compared to MNIST experiments. **VB:** Variational Bayes required extensive hyperparameter tuning for the WRN-16-4 architecture on CIFAR-10. 
Our systematic exploration revealed optimal performance at 100 epochs with τ = 3680: <div style="text-align: center;"> <table style="margin: auto;"> <thead> <tr> <th style="text-align: left;">(epochs, τ)</th> <th style="text-align: left;">(75, 3680)</th> <th style="text-align: left;">(100, 3680)</th> <th style="text-align: left;">(100, 100)</th> <th style="text-align: left;">(150, 3680)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left;">ID Acc ↑</td> <td style="text-align: center;">88.6</td> <td style="text-align: center;">90.2</td> <td style="text-align: center;">91.3</td> <td style="text-align: center;">90.8</td> </tr> <tr> <td style="text-align: left;">OOD Conf ↓</td> <td style="text-align: center;">61.7</td> <td style="text-align: center;">62.5</td> <td style="text-align: center;">73.4</td> <td style="text-align: center;">66.9</td> </tr> </tbody> </table> <p><strong>Table 10:</strong> VB performance (CIFAR-10) over different (epochs, τ) combinations</p> </div> The results in Table 10 confirm the same behavioral pattern observed in the MNIST experiments: shorter training with larger temperature values maintains a stronger prior influence, increasing predictive uncertainty on OOD data. However, this deviation from the maximum posterior also slightly reduces ID accuracy. The optimal trade-off occurs at 100 epochs with τ = 3680, achieving performance comparable to Laplace approximation methods. **SWAG:** Consistent with our MNIST experiments, we employed low-rank + diagonal covariance estimation with once-per-sweep batch normalization updates. While SWAG demonstrates competitive OOD confidence scores, it exhibits higher variance due to its dependence on potentially unstable MAP initializations. The method's performance validates our implementation choices while highlighting the propagation of baseline instabilities through MAP-dependent approaches. **Summary:** Despite dataset substitutions and training modifications, our CIFAR-10 experiments successfully demonstrate the qualitative performance relationships among uncertainty quantification methods. LA* achieves the lowest OOD confidence, DE provides the best AUROC performance, and VB offers balanced uncertainty estimation after careful tuning. The increased model and dataset complexity compared to MNIST manifests as higher performance variance across all methods, emphasizing the importance of robust uncertainty quantification in challenging scenarios.
### 4.2. Reproduction of Paper Figure 6: Real-World Distribution Shift The following analysis presents our reproduction results for last-layer Laplace approximation, MAP, Deep Ensemble, and Temperature Scaling methods under real-world distribution shifts from the WILDS benchmark. #### 4.2.1. Results from Python Implementation <div style="display: flex; gap: 0px;"> <div style="flex: 1;"> Our reproduction demonstrates that Temperature Scaling achieved the lowest ECE as expected, since it directly optimizes calibration objectives. As shown in Figure 17 on the right, the Laplace (Kron) method outperformed the MAP baseline, maintaining the overall trend reported in the original paper, though with less pronounced improvements. While our results appear less "glowing" than the original findings, the fundamental behavioral patterns hold consistently. The Kronecker approximation proved to be a stable alternative to the full-Hessian method, and Temperature Scaling's superior performance confirms theoretical expectations about calibration-focused optimization approaches. </div> <div style="flex: 1; text-align: center;"> ![CivilComments NLL](https://hackmd.io/_uploads/rklPoI97xg.png) <p><strong>Figure 17:</strong> CivilComments Results Reproduction</p> </div> </div> #### **4.2.2. Implementation Challenges** **Cholesky Decomposition Error:** During reproduction with identical settings, we encountered a persistent "Cholesky non-positive definite" error, even when using the authors' original code. This indicates a fundamental numerical stability issue rather than an implementation problem. **Attempted Solutions:** - Added jitter to diagonal elements - Modified underlying matrix decomposition methods - Manual correction of negative eigenvalues in the Hessian inverse All approaches either failed or felt ad-hoc, undermining reproducibility. The error only occurs when using >50% of the dataset, complicating debugging as full training runs are required to test each fix. **Solution:** We switched from full-Hessian Laplace to the **Kronecker (Kron) approximation**, which provides numerical stability while maintaining the core methodology. #### **4.2.3. Reproducibility Implications** This reproduction highlights important considerations: - Even original author code may contain numerical instabilities - Practical reproduction sometimes requires methodological adaptations - Kronecker approximations offer better stability-accuracy trade-offs for real-world applications Despite the numerical challenges, our reproduction validates the core behavioral patterns reported in the original work. ## 5. Julia Implementation Here is the link to our Julia implementation repo: https://github.com/D4vidHuang/laplace_julia.git. We should specifically mention that our implementation builds on the results of Taija's project, available at the following GitHub repo: https://github.com/JuliaTrustworthyAI/LaplaceRedux.jl.git.
### 5.1 Overview and Motivation <div style="display: flex; align-items: flex-start; gap: 30px;"> <div style="flex: 1; text-align: center;"> ![image](https://hackmd.io/_uploads/B1jZznM4xx.png) <p><strong>Figure 18:</strong> Julia Implementation</p> </div> <div style="flex: 1;"> #### **Julia Implementation: Comparative Analysis and Enhancement** This section presents our development of a new Julia implementation for Bayesian neural networks based on the Laplace Redux methodology. Building upon existing work [^12], we created a significantly enhanced and streamlined framework that addresses critical limitations in the current Julia ecosystem for Bayesian deep learning, particularly focusing on fixing issues present in the original Laplace Redux Julia library. </div> </div> Our Julia implementation serves multiple purposes in this reproduction study. First, it provides an independent computational validation of our Python results through a completely reimplemented framework. Second, it demonstrates the feasibility of creating more accessible tools for Bayesian neural networks in Julia by significantly simplifying the implementation process. Most importantly, our work addresses and resolves several critical bugs and numerical instabilities that existed in the original Laplace Redux Julia codebase, creating a more robust foundation for future research. ![image](https://hackmd.io/_uploads/SyryfnMVgl.png) <div style="text-align: center"><p><strong>Figure 19:</strong> GUI of Julia implementation</p></div> Our primary contribution is the development of a fixed Julia implementation that faithfully reconstructs the Laplace Redux methodology as specified in the original paper. Unlike previous versions, our framework emphasizes clarity, numerical stability, and ease of use. We rebuilt the implementation from an existing library [^12], adhering closely to the original mathematical formulations to create a cleaner, more intuitive API that simplifies the application of Laplace approximations in Julia. Through systematic analysis, we identified and corrected several fundamental bugs in the existing Julia library, including persistent matrix decomposition failures, memory management issues, and numerical precision errors. Furthermore, our streamlined GUI reduces the typical implementation overhead for Laplace approximations, making Bayesian neural networks more accessible. By leveraging the Laplace Redux library's linear algebra and implementing robust numerical procedures, our framework achieves superior stability, effectively handling the ill-conditioned matrices and edge cases that caused frequent failures in past implementations. ### 5.2 Validation and Performance Analysis <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJdfu097xl.png" alt="image" width="300"/> <strong>Figure 20: Architecture of the Julia Framework (adapted from the LaplaceRedux.jl project[^12])</strong> </div> While computational resources and time constraints prevented us from conducting a complete experimental reproduction of all paper results in Julia, our targeted validation experiments yielded highly encouraging outcomes. Our implementation successfully reproduced key methodological components of the Laplace Redux approach, demonstrating consistency with both the theoretical framework and our Python implementation results.
The validation experiments we conducted showed that our Julia implementation maintains numerical accuracy comparable to the reference Python implementation while offering improved stability in challenging scenarios. Specifically, our framework successfully handled several test cases that caused failures in the original Julia Laplace Redux library, validating the effectiveness of our bug fixes and architectural improvements. Performance analysis revealed that while our Julia implementation excels in CPU-bound matrix operations and provides superior numerical precision, the mature GPU ecosystem surrounding PyTorch still offers advantages for large-scale neural network training phases. This finding reinforces our design decision to focus on creating a robust, mathematically sound framework rather than optimizing purely for computational speed. ### 5.3 Impact and Future Directions By addressing fundamental issues in existing libraries and providing a streamlined, paper-faithful implementation, our Julia implementation represents a significant advancement in the accessibility and reliability of Bayesian neural network tools within the Julia ecosystem. The successful resolution of critical bugs and our simplified implementation process demonstrate the potential for creating more robust scientific computing tools, showing that with careful attention to numerical stability and API design, Julia can be a compelling platform for sophisticated Bayesian inference tasks. We acknowledge that our validation is limited by computational constraints; a complete experimental reproduction would require substantial resources and is an important direction for future work. While major stability issues have been addressed, continued development will be valuable for refinement. Nevertheless, our enhanced framework opens promising research directions, such as integration with Julia's GPU capabilities, development of advanced prior specification tools, and exploration of hybrid Julia-Python workflows. The solid foundation we have established positions the Julia community to more effectively contribute to advancing Bayesian deep learning methodologies. ## 6. Conclusion ### 6.1 Summary of Reproduction Achievements This reproduction project validates the key claims of *Laplace Redux – Effortless Bayesian Deep Learning*. We confirm that the Laplace Approximation (LA) offers an efficient and practical path to Bayesian deep learning, requiring minimal changes to standard training. **Table 1 (MNIST):** We closely matched the paper's results, with LA* reducing OOD confidence from 74.7% to 56.4%, confirming improved calibration. **Table 1 (CIFAR-10):** While numerical values differ due to training differences, we reproduced the relative performance ordering, and LA* significantly reduced OOD confidence from 72.2% to 38.6%. **Figure 6 (WILDS):** Despite full-Hessian issues, our Kronecker approximation reproduced the main result: LA outperforms MAP under distribution shift. We extended the paper via VB hyperparameter tuning and SWAG variant analysis, moving beyond replication to deeper methodological insight. ### 6.2 Technical Contributions and Validation Our work improved on the original implementation's structure by separating training, evaluation, and model definitions across modular scripts, enhancing clarity and reproducibility. We demonstrated that diagonal-only SWAG performs well on simple models, while full SWAG remains superior with only modest overhead.
We discovered that per-sample BN updates can leak test information and are much slower than per-sweep updates, harming performance. In VB, we showed that tuning the hyperparameters is critical, identifying optimal values (*epochs* = 50, $τ$ = 10 for MNIST, and *epochs* = 100, $τ$ = 3680 for CIFAR-10). Our cross-platform Julia implementation further validated algorithmic correctness and revealed its numerical benefits. ### 6.3 Limitations and Methodological Considerations CIFAR-10 models required longer training (150 vs. 100 epochs) for convergence, explaining numerical discrepancies with the paper. Full-Hessian methods failed in practice due to Cholesky errors, even when using the original code. We addressed this by switching to Kronecker approximations. Limited computational resources restricted exhaustive grid searches and ensembles, and LSUN unavailability led us to use only SVHN and CIFAR-100. Despite these limitations, we successfully validated the paper's central claims. ### 6.4 Implications for Bayesian Deep Learning Our findings confirm that LA offers a compelling balance of simplicity, efficiency, and reliability. It improves OOD calibration post-hoc without retraining, often matching or outperforming more complex methods like Deep Ensembles at lower cost. LA's consistent performance across datasets and platforms, and our success integrating it into standard pipelines, support its practicality for real-world uncertainty estimation. Future work may explore LA on large-scale architectures like transformers, improve numerical stability in full-Hessian variants, and assess its impact in safety-critical applications. Standardizing hyperparameter tuning across Bayesian methods would also enhance fair comparison and real-world usability. ## 7. Individual Contributions | Name | Contributions | |------------------------------|--------------------------------------------------------------------------------------------------------| | Yongcheng Huang | Table 1, Julia | | Yiming Chen | Figure 6 | | Zeyu Fan | Table 1, Julia | ## 8. Comparison with Group 3 We reproduced the **Table 1 benchmarks** with close fidelity: the robust last-layer **LA\*** markedly lowered OOD confidence on both MNIST and CIFAR-10, and our **Figure 6** replication kept the original calibration trend by swapping a fragile full-Hessian for a **Kronecker-factored** one. Beyond replication we tuned **Variational Bayes** to curb OOD over-confidence without harming ID accuracy and, through **SWAG diagnostics**, found that a low-rank + diagonal covariance with a single batch-norm sweep offers the best speed–calibration compromise. The entire pipeline is available at our [repo](https://github.com/D4vidHuang/laplace_rep). From Group 3's poster we noted two thoughtful extensions, **Subspace Laplace** and a **SWAG-Laplace** hybrid, aimed at exploiting low-dimensional curvature and combining robust means with precise curvature. Their ideas complement our own: we could fold subspace insights into our Kronecker framework, while they might gain from our VB tuning and SWAG batch-norm lessons. Together these perspectives broaden the practical toolbox for effortless Bayesian deep learning. ### References [^1]: Daxberger E, Kristiadi A, Immer A, et al. Laplace redux-effortless bayesian deep learning[J]. Advances in neural information processing systems, 2021, 34: 20089-20103. [^2]: Deng L. The mnist database of handwritten digit images for machine learning research [best of the web][J]. IEEE signal processing magazine, 2012, 29(6): 141-142. [^3]: A. 
Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," Technical Report, University of Toronto, 2009. [Online]. Available: https://paperswithcode.com/dataset/cifar-10 [^4]: Xiao H, Rasul K, Vollgraf R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms[J]. arXiv preprint arXiv:1708.07747, 2017. [^5]: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng Reading Digits in Natural Images with Unsupervised Feature Learning NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. [^6]: Bandi, P. et al. From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE transactions on medical imaging 38, 550–560 (2018). [^7]: Christie G, Fendley N, Wilson J, et al. Functional map of the world[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6172-6180. [^8]: Borkan D, Dixon L, Sorensen J, et al. Nuanced metrics for measuring unintended bias with real data for text classification[C]//Companion proceedings of the 2019 world wide web conference. 2019: 491-500. [^9]: Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a Distilled Version of Bert: Smaller, Faster, Cheaper and Lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS, 2019. [^10]: Ayush K, Uzkent B, Burke M, et al. Generating interpretable poverty maps using object detection in satellite images[J]. arXiv preprint arXiv:2002.01612, 2020. [^11]: Koh P W, Sagawa S, Marklund H, et al. Wilds: A benchmark of in-the-wild distribution shifts[C]//International conference on machine learning. PMLR, 2021: 5637-5664. [^12]: https://github.com/JuliaTrustworthyAI/LaplaceRedux.jl.git , The original implementation of Julia Laplace Redux from Taija.