# General response
We would like to thank all the reviewers for their valuable effort and helpful comments. We are particularly encouraged that reviewers find the hyperspherical energy score a **promising** and **better-suited** OOD detection score due to **its connection to the log-likelihood** (oN3L, td2X, Vf4g), that the method is **empirically supported** through **extensive** and **well designed and conducted experiments** (JWN7, oN3L, td2X, Vf4g), and that it is strengthened by **extensive ablations** (td2X). We are also glad that the reviewers find our paper **clear** and **easy to follow** (JWN7, oN3L, td2X, Vf4g).
The theoretical and practical significance of our work forms a key part of the positive feedback we have received. Our method, the first to connect hyperspherical representations and Helmholtz free energy for OOD detection, offers unique contributions and a rigorous theoretical interpretation. In practice, our method outperforms existing OOD detection scores in both speed and detection performance on benchmarks such as ImageNet-1k. Compared to the original energy score [1] --- one of the most commonly used scoring functions today --- our hyperspherical energy steers the OOD community in a new direction with theoretical soundness and significant performance benefits (a 40% reduction in FPR95$\downarrow$ on CIFAR-100).
We respond to each reviewer's comments in detail below. We will also revise the manuscript according to the reviewers' suggestions, which will make the paper more comprehensive for readers.
[1] Liu et al., Energy-based Out-of-distribution Detection. NeurIPS 2020.
# Reviewer JWN7
We sincerely thank the reviewer for the detailed review and constructive criticism. We also appreciate the reviewer's recognition of our paper's clarity and strong results. We address the question below in detail.
> **Q1: Difference between hyperspherical energy and CIDER [1]**
CIDER and our approach differ significantly in the out-of-distribution (OOD) detection method at **test time**. The design of the OOD scoring function serves as the central intellectual component in much of the OOD literature, and it is what we contribute and focus on in this paper. The difference between CIDER and our approach is discussed in the introduction (**L59-L69**). Specifically, CIDER relies on a _non-parametric_ KNN score [2], which requires a nearest-neighbor search in the learned embedding space. In contrast, our OOD detection relies on a novel hyperspherical energy formulation, which can be viewed as a _parametric_ OOD scoring function. To further highlight the significance and novelty of our approach:
- **Theoretical significance.** Our method is the first to establish the connection between hyperspherical representations and Helmholtz free energy for OOD detection. Our OOD scoring function enjoys a rigorous theoretical interpretation from a log-likelihood perspective, while CIDER's does not. The derivation and interpretation of hyperspherical energy provided in Section 3.1 is an entirely new contribution relative to CIDER. From a training-time perspective, we also derive new insight into how the learning objective induces lower hyperspherical energy (Section 3.2), which directly connects to our proposed OOD scoring function.
- **Practical significance.** We also show empirically that our proposed OOD score achieves competitive performance on different OOD detection benchmarks and is computationally efficient compared to CIDER (Section 4). For example, on the large-scale ImageNet benchmark, our method is more than 10x faster than CIDER while simultaneously reducing the average FPR95 by 11.85% and establishing state-of-the-art performance. Moreover, hyperspherical energy is privacy-friendly: unlike the KNN-based CIDER score, it does not require test-time access to a pool of labeled ID data, which can be infeasible in settings where data privacy is a concern.
Lastly, we do not claim novelty on learning hyperspherical embeddings, since the von Mises-Fisher (vMF) distribution is long established in directional statistics [3] and pre-dates CIDER. We also properly credited and cited the prior works [2,4] that enable efficient optimization using the prototype update scheme (please see L161, L198). These are not our core contributions -- rather, they enable our work to propose a novel hyperspherical energy method for OOD detection.
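For intuition on the parametric vs. non-parametric contrast, below is a minimal sketch (assuming the inner-product form of the per-class energy from Section 3.1; all sizes and names are illustrative, not our released code):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
C, d, k, tau = 10, 128, 50, 0.1

def unit(v):  # project onto the unit hypersphere
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

z = unit(rng.normal(size=d))                  # test embedding
mu = unit(rng.normal(size=(C, d)))            # one prototype per ID class
id_bank = unit(rng.normal(size=(50_000, d)))  # labeled ID embeddings (KNN only)

# Parametric (ours): test time touches only the C prototypes.
E_zc = -mu @ z                                # per-class energy E(z, c)
E_z = -tau * logsumexp(-E_zc / tau)           # sample-wise energy; low => ID-like

# Non-parametric (KNN / CIDER): test time needs the whole ID embedding pool.
dists = np.linalg.norm(id_bank - z, axis=1)
knn_score = np.sort(dists)[k - 1]             # distance to the k-th nearest neighbor
```

The parametric score depends only on the $C$ prototypes, which is the source of both the >10x speedup and the privacy benefit noted above.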
[1] Ming et al., How to Exploit Hyperspherical Embeddings for Out-of-Distribution Detection? ICLR 2023.
[2] Sun et al., Out-of-Distribution Detection with Deep Nearest Neighbors. ICML 2022.
[3] P.E. Jupp and K.V. Mardia. Directional Statistics. Wiley Series in Probability and Statistics. Wiley, 2009.
[4] Li et al., Mopro: Webly supervised learning with momentum prototypes. ICLR 2020.
# Reviewer oN3L
We sincerely appreciate your positive feedback and insightful comments. Your acknowledgment of our work's clarity and unique contribution means a lot to us. We address the questions below in detail.
> **Q1. OOD data's hyperspherical energy**
Thank you for the insightful question. Our method operates under the assumption that OOD data lies in the low-likelihood (high hyperspherical energy) region of the hyperspherical space, relative to the ID data with higher likelihood (lower hyperspherical energy). This assumption is natural and commonly adopted by likelihood-based approaches. Indeed, our empirical results on CIFAR-10, CIFAR-100, and the more challenging ImageNet-1k benchmark confirm that OOD data exhibits high hyperspherical energy. Effectively, the low FPR is a result of separable hyperspherical energy distributions between ID and OOD data.
On the theoretical side, there is a tradeoff between the data used for training and the kind of guarantees we can provide. For example, in the most practical and unrestrictive setting with ID data only, we can indeed only guarantee inducing lower hyperspherical energy for ID data (as you said); this is inherently limited by the data exposed to the learner. Alternatively, one could extend our framework by considering a more restrictive data setting that assumes auxiliary outlier data and explicitly optimizes for high hyperspherical energy on the outlier training data. This, in turn, could provide some degree of guarantee on the OOD data. For this study, we intentionally scoped our work to the former case (which is more general), and we believe the latter is an interesting extension to look into in the near future.
> **Q2: Why do generative models like VAE and Flow suffer from the overestimation of the likelihood for OOD data [1,2], while this work does not?**
That is another insightful question. We would like to briefly mention several studies aiming to understand this phenomenon. For example, Ren et al. [3] showed that deep generative models' reliance on the background can undesirably lead to high likelihood estimates for OOD samples. Kirichenko et al. [4] showed that generative models overfit the training data, especially the background pixels that are non-essential for determining the image semantics.
In contrast, our method does not suffer from this issue since the training objective is, in essence, _supervised_ learning rather than _unsupervised_ generative modeling like VAE or Flow. As seen in Equation (12), the learning objective has access to the ground-truth semantic label for each ID sample. Hence, the latent embeddings are shaped and guided by the image semantics without being prone to overfitting the background like deep generative models. In general, methods relying on supervised learning offer stronger performance than their generative counterparts. What makes our method compelling is that it offers the best of both worlds --- hyperspherical energy displays competitive empirical performance while enjoying a theoretical interpretation from a log-likelihood perspective.
[1] Nalisnick et al., Do Deep Generative Models Know What They Don't Know? ICLR 2019.
[2] Li et al., Out-of-Distribution Detection with an Adaptive Likelihood Ratio on Informative Hierarchical VAE. NeurIPS 2022.
[3] Ren et al., Likelihood Ratios for Out-of-Distribution Detection. NeurIPS 2019.
[4] Kirichenko et al., Why Normalizing Flows Fail to Detect Out-of-Distribution Data. NeurIPS 2020.
> **Q3: On the relationship between $P_X(x)$ and $P_Z(z)$**
The discrepancy arises because our method is not a deep generative model by nature (which optimizes $p(x)$), but rather a supervised model (which optimizes the posterior probability $p(y|x)$). Precisely because supervised models do not explicitly optimize $p(x)$, the issue you bring up applies to all OOD detection methods built on supervised learning; this is also why classification-based OOD detection is fundamentally challenging.
Fortunately, hyperspherical energy offers a theoretical guarantee that OOD data can be detected if it lies in the low-likelihood region of the hyperspherical space. Hence, compared to prior OOD detection methods based on supervised models, ours makes a significant step forward by connecting to a log-likelihood interpretation (which usually requires generative modeling).
Mitigating the issue you mentioned may require a hybrid model that combines supervised learning and generative modeling --- the latter of which can often be difficult to optimize in practice. In contrast, our method, based purely on supervised learning, can be tractably optimized, is easy to use in practice, and offers strong empirical performance (outperforming SOTA baselines).
> **Typos**
All fixed - thank you for the careful read!
# Reviewer td2X
We sincerely appreciate the reviewer's acknowledgment of the conciseness of our work and the extensiveness of our experiments. We are grateful for the thorough comments and suggestions provided. We address the questions in detail below.
> **Rapid explanation of key developments**
It is helpful to hear your feedback on this. Though we were constrained by the page limit, we plan to expand the details in the revised supplementary material. To elaborate, we assess the gradient of the loss function with respect to the model parameters $\theta$ for an embedding $\mathbf{z}$ and its corresponding class label $y$, as illustrated in Equation (14). Applying the chain rule, we compute the partial derivative with respect to $\theta$, as presented in Equations (14.1, 14.2). For Equation (14.3), we substitute in the Gibbs-Boltzmann distribution, as specified in Equation (2), and then rewrite it to obtain Equation (14.4).
$$
\begin{align}
\frac{\partial \mathcal{L}(\mathbf{z}, y ; \theta)}{\partial \theta} &= \frac{1}{\tau} \frac{\partial E(\mathbf{z}, y)}{\partial \theta} + \sum_{j=1}^C \frac{\partial \exp(-E(\mathbf{z}, j) / \tau)}{\partial \theta} \cdot \frac{1}{\sum_{c=1}^C \exp(-E(\mathbf{z}, c) / \tau)} \tag{14.1} \\ &= \frac{1}{\tau} \frac{\partial E(\mathbf{z}, y)}{\partial \theta} - \frac{1}{\tau} \sum_{j=1}^C \frac{\partial E(\mathbf{z}, j)}{\partial \theta} \cdot \frac{\exp(-E(\mathbf{z}, j) / \tau)}{\sum_{c=1}^C \exp(-E(\mathbf{z}, c) / \tau)} \tag{14.2} \\ &= \frac{1}{\tau} \frac{\partial E(\mathbf{z}, y)}{\partial \theta} - \frac{1}{\tau} \sum_{j=1}^C \frac{\partial E(\mathbf{z}, j)}{\partial \theta} \cdot p(Y=j \mid \mathbf{z}) \tag{14.3} \\ &= \frac{1}{\tau}
\Big(\frac{\partial E(\mathbf{z}, y)}{\partial \theta} \big(1 - p(Y=y \mid \mathbf{z})\big)
- \sum_{j\neq y} \frac{\partial E(\mathbf{z}, j)}{\partial \theta}\, p(Y=j \mid \mathbf{z})\Big) \tag{14.4}
\end{align}
$$
The equation exhibits a contrastive structure: the energy of the correct label $y$ is lowered while the energies of all other labels are raised during loss optimization. Furthermore, the sample-wise hyperspherical energy of ID data, $E(\mathbf{z}) = -\tau \cdot \log \sum_{c=1}^C \exp(-E(\mathbf{z}, c) / \tau)$, is dominated by $E(\mathbf{z}, y)$, the energy under the ground-truth label. Hence, training overall induces lower hyperspherical energy for ID data. By further connecting to our Lemma 3.1, this low hyperspherical energy directly translates into a high ($\uparrow$) log-likelihood $\log p(\mathbf{z})$ for the training data distribution.
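To spell out the domination step (a one-line rewrite of the identity above, not taken verbatim from the paper), factor out the ground-truth term:
$$
E(\mathbf{z}) = E(\mathbf{z}, y) - \tau \log\Big(1 + \sum_{j \neq y} \exp\big((E(\mathbf{z}, y) - E(\mathbf{z}, j)) / \tau\big)\Big) \approx E(\mathbf{z}, y),
$$
since training drives $E(\mathbf{z}, y) \ll E(\mathbf{z}, j)$ for $j \neq y$, so every exponent is large and negative and the logarithm vanishes.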
> **Discussion on other hyperspherical approaches**
Great suggestion. Our method differs from SSD+ [1] in terms of both the training-time loss and the test-time scoring function.
- At training time, SSD+ directly uses the off-the-shelf SupCon (Supervised Contrastive) loss [2], which does not explicitly model the latent representations as vMF distributions: instead of promoting instance-to-prototype similarity, SupCon promotes instance-to-instance similarity among positive pairs, so its formulation does not directly correspond to a vMF distribution. Geometrically speaking, our framework operates directly on the vMF distribution, a key property that enables our hyperspherical energy with its log-likelihood interpretation.
- At testing time, SSD+ uses the Mahalanobis distance, whereas we propose a novel hyperspherical energy score that is compatible with the learned embedding geometry.
As suggested, we will include the following table in our revised paper, summarizing the key distinctions among the different hyperspherical approaches (a minimal loss sketch follows the table).
||SSD+|KNN+|CIDER|Ours|
|-|-|-|-|-|
|Training-time loss function|SupCon|SupCon|vMF|vMF|
|Test-time scoring function|Mahalanobis (parametric)|KNN (non-parametric)|KNN (non-parametric)|Hyperspherical energy (parametric)|
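For concreteness, here is a minimal sketch of the vMF-style instance-to-prototype objective (our paraphrase of Equation (12); names and the temperature default are illustrative, not our released code):

```python
import torch
import torch.nn.functional as F

def vmf_prototype_loss(z, prototypes, y, tau=0.1):
    """Instance-to-prototype loss under the vMF view: each class c is a vMF
    component with mean direction mu_c, and z is pulled toward mu_y."""
    z = F.normalize(z, dim=-1)             # embeddings on the unit sphere
    mu = F.normalize(prototypes, dim=-1)   # class prototypes on the unit sphere
    logits = z @ mu.T / tau                # = -E(z, c) / tau
    return F.cross_entropy(logits, y)
```

SupCon, by contrast, contrasts each instance against other instances in the batch rather than against fixed class prototypes, so no vMF density is directly attached to its objective.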
> **Clarification on CIFAR-10 performance**
We would like to note that CIFAR-10 is a very saturated benchmark. We report it in the Appendix because it is widely considered an almost-solved task, and our performance is on par with CIDER (the baseline most closely related to ours). Under the same embeddings, our method achieves an FPR95 of 14.16%, similar to CIDER (13.85%). While FPR95 can display some sensitivity due to being threshold-dependent, the AUROC (which averages over all thresholds) is almost identical to CIDER's (97.62 vs. 97.51).
We believe more challenging benchmarks such as ImageNet-1k and CIFAR-100 better demonstrate the efficacy of an OOD detection approach, as discussed extensively in the main paper.
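For reference, the threshold dependence follows directly from the metric definitions; a generic sketch (not our evaluation code) is:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95 under the convention that a higher score means more ID-like."""
    thresh = np.percentile(id_scores, 5)  # threshold keeping 95% of ID samples
    return np.mean(ood_scores >= thresh)  # OOD wrongly kept => false positive

def auroc(id_scores, ood_scores):
    labels = np.r_[np.ones_like(id_scores), np.zeros_like(ood_scores)]
    return roc_auc_score(labels, np.r_[id_scores, ood_scores])
```

FPR95 depends on a single cutoff estimated from the ID scores, whereas AUROC integrates over all cutoffs, which explains its greater stability.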
> **Comparison with ReAct + DICE**
We appreciate your suggestion and have conducted experiments on CIFAR-10, CIFAR-100, and ImageNet-1k to compare our method against the combination of DICE [3] and ReAct [4]. We set the percentile $p$ for ReAct and the sparsity parameter for DICE following their respective validation strategies. Our findings indicate that, even with this combination, the OOD detection performance remains lower than that of the hyperspherical energy score; please refer to the results below. Furthermore, in our revised paper, we plan to integrate the ReAct + DICE results into Tables 2, 3, and 5 for a more comprehensive comparison.
**CIFAR-10**
| | SVHN | | Places365 | | LSUN | | iSUN | | Texture | | Average | |
|:--------------------- |:--------- |:--------- |:--------- | --------- | -------- | --------- | --------- | --------- |:--------- |:--------- |:--------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| ReAct | 48.21 | 92.20 | 48.11 | 90.97 | 23.03 | 95.96 | 22.02 | 96.38 | 48.90 | 91.19 | 38.05 | 93.34 |
| DICE | 65.34 | 89.66 | 50.44 | 89.81 | 3.95 | 99.21 | 34.98 | 94.87 | 59.22 | 88.50 | 42.79 | 92.41 |
| ReAct + DICE | 48.44 | 91.32 | 61.97 | 87.65 | 9.95 | 98.11 | 21.76 | 96.42 | 42.59 | 91.62 | 36.94 | 93.02 |
| Hyperspherical energy | **3.89** | **99.28** | **32.59** | **94.14** | **3.05** | **99.29** | **16.02** | **97.20** | **15.27** | **97.64** | **14.16** | **97.51** |
**CIFAR-100**
| | SVHN | | Places365 | | LSUN | | iSUN | | Texture | | Average | |
|:--------------------- |:--------- |:--------- |:--------- | --------- | -------- | --------- | --------- | --------- |:--------- |:--------- |:--------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| ReAct | 67.07 | 86.83 | 80.98 | 77.39 | 62.89 | 86.90 | 61.62 | 86.61 | 75.69 | 82.85 | 69.65 | 84.12 |
| DICE | 53.23 | 89.70 | 83.03 | 75.49 | 44.63 | 91.38 | 74.87 | 79.05 | 84.68 | 73.29 | 68.09 | 81.78 |
| ReAct + DICE | 33.05 | 93.49 | 90.38 | 67.91 | 40.92 | 90.46 | 78.90 | 79.36 | 63.24 | 83.30 | 61.30 | 82.90 |
| Hyperspherical energy | **17.81** | **96.39** | **76.68** | **76.01** | **8.48** | **98.25** | **57.39** | **86.21** | **32.07** | **93.25** | **38.49** | **90.02** |
**ImageNet-1k**
| | iNaturalist | | SUN | | Places | | Textures | | Average | |
|:--------------------- |:----------- |:--------- |:---------- |:--------- |:--------- |:--------- |:--------- |:--------- |:--------- |:--------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| ReAct | 20.38 | 96.22 | **24.20** | **94.20** | **33.85** | **91.58** | 47.30 | 89.80 | 31.43 | 92.95 |
| DICE | 25.63 | 94.49 | 35.15 | 90.83 | 46.49 | 87.48 | 31.72 | 90.30 | 34.75 | 90.77 |
| ReAct + DICE | 18.64 | 96.24 | 25.45 | 93.94 | 36.86 | 90.67 | 28.07 | 92.74 | 27.25 | 93.40 |
| Hyperspherical energy | **8.76** | **98.00** | 36.95 | 91.52 | 49.33 | 87.67 | **11.45** | **96.56** | **26.62** | **93.44** |
> **Hyperspherical energy with ReAct**
Thank you for the suggestion. We have conducted experiments on ImageNet-1k combining ReAct [4] with the hyperspherical energy score. The results are provided below for your reference, followed by a sketch of the combination we evaluated. We observe a slight improvement.
| | iNaturalist | | SUN | | Places | | Textures | | Average | |
|:--------------------------------------| ----------- |:--------- |:------- | ----------| ------- | --------- | -------- | --------- |:------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| Hyperspherical energy + ReAct ($p=90\%$) | 13.31 | 97.55 | 30.65 | 93.53 | 44.08 | 89.53 | 14.45 | 97.02 | 25.62 | 94.41 |
| Hyperspherical energy + ReAct ($p=95\%$) | 10.34 | 97.94 | 29.37 | 93.61 | 43.74 | 89.72 | 12.13 | 97.21 | 23.89 | 94.62 |
| Hyperspherical energy + ReAct ($p=99\%$) | 9.18 | 98.05 | 33.46 | 92.54 | 47.38 | 88.55 | 10.57 | 97.23 | 25.15 | 94.09 |
| Hyperspherical energy | 8.76 | 98.00 | 36.95 | 91.52 | 49.33 | 87.67 | 11.45 | 96.56 | 26.62 | 93.44 |
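For clarity, the combination we evaluated follows the pattern below (a hedged sketch: we assume ReAct truncation is applied to the penultimate activations before re-normalization, with `clamp_c` set to the $p$-th percentile of ID activations estimated offline; names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def react_hyperspherical_energy(feat, clamp_c, mu, tau=0.1):
    feat = np.minimum(feat, clamp_c)      # ReAct: truncate large activations
    z = feat / np.linalg.norm(feat)       # re-project onto the unit sphere
    E_zc = -mu @ z                        # per-class hyperspherical energy E(z, c)
    return -tau * logsumexp(-E_zc / tau)  # sample-wise energy E(z)
```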
[1] Sehwag et al., SSD: A unified framework for self-supervised outlier detection. ICLR 2021
[2] Khosla et al., Supervised contrastive learning. NeurIPS 2020.
[3] Sun et al., DICE: Leveraging Sparsification for Out-of-Distribution Detection, ECCV 2022.
[4] Sun et al., ReAct: Out-of-distribution Detection With Rectified Activations, NeurIPS 2021.
# Reviewer Vf4g
We truly value your acknowledgment of the clarity and organization of our paper, as well as the significance of our goal to address issues regarding unconstrained energy scores from a log-likelihood perspective. Your assessment is encouraging and insightful. We address the questions below in detail.
> **Difference w.r.t. SIREN [1]**
We highlight the major differences below:
- SIREN uses the maximum class-conditional likelihood, which is mathematically different from the hyperspherical energy score (see the sketch after this list). Unlike SIREN's score, hyperspherical energy enjoys a rigorous theoretical interpretation from a log-likelihood perspective, making it theoretically sound for OOD detection (see Section 3.1 for details). Our derivation and interpretation of hyperspherical energy in Section 3.1 is an entirely new contribution relative to SIREN.
- SIREN primarily focuses on a representation-shaping loss at training time. In contrast, we focus on a novel test-time OOD scoring function; our method is the first to establish the connection between hyperspherical representations and Helmholtz free energy for OOD detection. From a training-time perspective, we also derive new insight into how the learning objective induces lower hyperspherical energy (Section 3.2), which directly connects to our proposed OOD scoring function.
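The mathematical difference in the first bullet can be stated compactly (an illustrative contrast, not either paper's released code):

```python
import numpy as np
from scipy.special import logsumexp

def siren_style_score(z, mu):  # maximum class-conditional likelihood
    return np.max(mu @ z)

def hyperspherical_energy_score(z, mu, tau=0.1):
    return tau * logsumexp(mu @ z / tau)  # = -E(z): free-energy aggregation
```

As $\tau \to 0$ the log-sum-exp collapses to the max, but at a finite temperature it aggregates evidence from all classes and inherits the log-likelihood interpretation of Section 3.1.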
> **Visualization analysis for large-scale dataset**
That is a great suggestion. As part of the authors' response, we have included a PDF that illustrates the learned embeddings via a UMAP visualization on ImageNet. We indeed observe compact representations, where each sample is effectively pulled close to its corresponding class prototype. A notable separation between in-distribution (ID) and OOD classes is also evident, suggesting that OOD samples exhibit a high hyperspherical energy score.

> **Comparison with FeatureNorm [2]**
Thank you for your suggestion! Following your advice, we will add results for FeatureNorm [2] in our revised manuscript, updating Tables 2, 3, 4, and 5 for a more extensive comparison. We provide the updated results in the tables below for your reference. We use the ImageNet-1k results as reported in [2]; for CIFAR-10 and CIFAR-100, we run the experiments using block 4.1 as the selected block. We observe that our method consistently outperforms FeatureNorm across all benchmarks.
**CIFAR-10**
| | SVHN | | Places365 | | LSUN | | iSUN | | Texture | | Average | |
|:---------------------- | ---------- |:--------- |:---------- | --------- | --------- | --------- | ---------- | --------- |:--------- |:--------- |:--------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| FeatureNorm | 8.79 | 98.27 | 76.75 | 79.84 | 0.16 | 99.92 | 37.67 | 94.17 | 29.96 | 94.08 | 30.67 | 93.26 |
| Hyperspherical energy | 3.89 | 99.28 | 32.59 | 94.14 | 3.05 | 99.29 | 16.02 | 97.20 | 15.27 | 97.64 | 14.16 | 97.51 |
**CIFAR-100**
| | SVHN | | Places365 | | LSUN | | iSUN | | Texture | | Average | |
|:---------------------- | ---------- |:--------- |:---------- | --------- | --------- | --------- | ---------- | --------- |:--------- |:--------- |:--------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| FeatureNorm | 52.69 | 87.95 | 95.26 | 55.62 | 5.96 | 98.74 | 99.33 | 38.51 | 62.11 | 76.16 | 63.07 | 71.40 |
| Hyperspherical energy | 17.81 | 96.39 | 76.68 | 76.01 | 8.48 | 98.25 | 57.39 | 86.21 | 32.07 | 93.25 | 38.49 | 90.02 |
**ImageNet-1k**
| | iNaturalist | | SUN | | Places | | Textures | | Average | |
|:--------------------------------------| ----------- |:--------- |:------- | ----------| ------- | --------- | -------- | --------- |:------- | --------- |
| **Method** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** | **FPR** | **AUROC** |
| FeatureNorm | 22.01 | 95.76 | 42.93 | 90.21 | 56.80 | 84.99 | 20.07 | 95.39 | 35.45 | 91.59 |
| Hyperspherical energy | 8.76 | 98.00 | 36.95 | 91.52 | 49.33 | 87.67 | 11.45 | 96.56 | 26.62 | 93.44 |
[1] Du et al., Siren: Shaping representations for detecting out-of-distribution objects. NeurIPS 2022.
[2] Yu et al., Block Selection Method for Using Feature Norm in Out-of-Distribution Detection. CVPR 2023.