## Reviewer #Yqvj

We appreciate Reviewer #Yqvj for recognizing our work's strengths and contributions and for pointing out the aspects that could be improved. We address these points as follows.

**W1. Compare our method with existing BN-related defenses.**

We thank the reviewer for the opportunity to clarify how our method addresses fundamental limitations of existing BN-related defenses. Below we provide detailed comparisons with BNP and UNIT.

**1. BNP [1]** compares the Batch Normalization statistics recorded during training (on poisoned data) against statistics computed from a small clean dataset using KL divergence. Neurons whose KL divergence exceeds $mean + u \times std$ are pruned. Limitations of BNP: (i) The implementation (from their released code base) computes the clean statistics from only one batch (`iter(dataloader).next()`), resulting in high-variance, unreliable KL divergence estimates. It can completely miss backdoor patterns if the sampled batch lacks poisoned samples or contains atypical clean samples. (ii) BNP assumes each backdoor neuron exhibits significantly different statistics between clean and poisoned data (Assumption 2: $|\mu - \hat{\mu}| \gg \varepsilon$). This fails when the backdoor is distributed across many neurons, making per-neuron shifts too subtle and too variable across neurons to be reliably detected; a single threshold is not sufficient to capture such distributed activations. Therefore, BNP only observes *what neurons output* (statistical differences in activation distributions) and cannot detect backdoors whose activation patterns intentionally mimic those of clean samples, as in clean-label attacks (COMBAT, Narcissus).

**2. UNIT [2]** approximates tight benign activation distribution boundaries for each neuron using 5% clean data, then clips activation values exceeding these boundaries during inference. **It assumes backdoor triggers cause abnormally large activations compared to benign inputs.** Limitations of UNIT: (i) COMBAT and Narcissus synthesize triggers containing features as persistent as the original semantic features of the target class [Zhu et al., AAAI 2023; Wang et al., arXiv]. The backdoor activations are *not* substantially larger than benign activations -- they deliberately occupy the same distribution space as legitimate class features, directly violating UNIT's core assumption. As a result, UNIT's boundary-based clipping cannot differentiate "high activation from a backdoor trigger" from "high activation from a genuine class feature" when the two distributions overlap. (ii) UNIT operates purely at the activation level, without any notion of the origin or training dynamics of neuron behavior. When backdoor features and authentic semantic features produce similar activation patterns, UNIT faces an impossible dilemma: (a) loose boundaries fail to remove the backdoor, or (b) tight boundaries damage clean accuracy by clipping legitimate high activations.
Therefore, UNIT only observes *activation magnitudes* and cannot detect backdoors deliberately designed to produce benign-like activation patterns within the normal activation distribution.

**Our FIM-based defense.** Instead of analyzing *what neurons output* (BNP: statistics, UNIT: magnitudes), our method analyzes *how neurons were trained*. The Fisher Information Matrix (FIM) captures the optimization dynamics and parameter sensitivity to the training data when we align the BN statistics of a reinitialized model with those of the backdoored model, revealing the backdoor's training-time fingerprint. UniBP was designed based on the following observations:

(i) Affine parameters in BN layers exhibit abnormally high Fisher information because they are explicitly optimized to memorize trigger patterns from a small poisoned subset while learning the main task. This behavior remains largely invariant across different backdoor attacks. We simulate the backdoor learning process by aligning the BN-layer statistics of a reinitialized model to those of the backdoored model, and from the resulting learning trajectory we can identify which neurons are responsible for jointly maintaining both the clean task and the backdoor task. This methodology is fundamentally different from detecting abnormal activations at the neuron level as in UNIT, or from computing the KL divergence between a model's statistics on clean data and those of the backdoored model as in BNP.

(ii) By comparing the FIM of the backdoored model with that of the reinitialized model, we establish a reference for "backdoor training dynamics." This eliminates (a) BNP's lack of a baseline (it only compares two views of the same backdoored model), and (b) UNIT's blind spot of using only absolute activation thresholds without scrutinizing the backdoor learning process. For clean-label attacks where backdoor and benign features are semantically similar, only the *backdoor training dynamics* (captured by the FIM difference from the clean baseline) reveal the backdoor.

(iii) Unlike BNP's single-batch vulnerability, FIM computation aggregates gradient information across multiple samples or the full optimization trajectory, providing stable, representative estimates even when individual batches lack poisoned samples or exhibit high variance.
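To make the scoring step concrete, below is a minimal PyTorch sketch of the BN-statistics alignment and FIM accumulation described above. It is a simplified illustration under stated assumptions, not our exact implementation: the names (`align_bn_stats`, `fim_bn_affine`), the restriction to `BatchNorm2d`, the weight `lam`, and the batch budget `steps` are assumptions.

```python
# Illustrative sketch only (assumed names/hyperparameters, not the paper's code).
import torch
import torch.nn as nn

def align_bn_stats(model, reference, x, lam=1.0):
    """Alignment loss: match `model`'s per-batch BN input statistics to
    `reference`'s running BN statistics on a clean batch x.
    Assumes module registration order matches forward execution order."""
    bns_m = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    bns_r = [m for m in reference.modules() if isinstance(m, nn.BatchNorm2d)]
    captured, hooks = [], []

    def grab(module, inputs, output):
        a = inputs[0]                                   # (N, C, H, W)
        captured.append((a.mean(dim=(0, 2, 3)),
                         a.var(dim=(0, 2, 3), unbiased=False)))

    for m in bns_m:
        hooks.append(m.register_forward_hook(grab))
    model(x)                                            # forward pass fills `captured`
    for h in hooks:
        h.remove()

    return sum(((mu - r.running_mean) ** 2).sum()
               + lam * ((var - r.running_var) ** 2).sum()
               for (mu, var), r in zip(captured, bns_r))

def fim_bn_affine(model, reference, clean_loader, steps=100):
    """Empirical diagonal FIM of BN affine parameters: accumulate squared
    gradients of the alignment loss over many clean batches."""
    bn_params = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d) and module.affine:
            bn_params[name + ".weight"] = module.weight
            bn_params[name + ".bias"] = module.bias
    scores = {n: torch.zeros_like(p) for n, p in bn_params.items()}

    for _, (x, _) in zip(range(steps), clean_loader):
        model.zero_grad()
        align_bn_stats(model, reference, x).backward()
        for n, p in bn_params.items():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2       # multi-sample aggregation
    return scores
```

Running this routine once with the backdoored model in the `model` role and once with the reinitialized model yields the two FIMs compared in observation (ii); BN affine parameters whose scores differ most between the two are the pruning candidates.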
---

We theoretically compare our method with BNP and UNIT in the following table:

| **Aspect** | **BNP** | **UNIT** | **Ours (FIM-based)** |
|------------|---------|----------|-----------------------|
| Detection Signal | KL divergence (statistics) | Activation bounds (magnitude) | Fisher Information (optimization) |
| Targeted Layers | BN Layers | All Layers | BN Layers |
| What It Observes | Output statistical differences | Output activation sizes | Training dynamics & parameter sensitivity |
| Baseline Model | None | None | Reinitialized model (clean reference) |
| Clean-Label Attacks | Partial (if stats differ) | Fails | Works (optimization fingerprint) |
| Estimation Stability | Single batch (high variance) | Stable | Multi-sample aggregation |

Below is a quantitative comparison of our method with UNIT and BNP:

| Method | Metric | BadNet | LC | COMBAT | SBL | Wanet | Narcissus | Adaptive-Batch | Input-aware | Refool | AVG-DER |
|--------|--------|--------|-----|---------|------|--------|-----------|----------------|-------------|---------|----------|
| **Ours** | C-Acc | 89.82 | 89.09 | 87.90 | 86.94 | 91.57 | 88.37 | 88.31 | 90.61 | 89.70 | |
| | ASR | 1.47 | 2.36 | 7.18 | 5.68 | 4.74 | 14.32 | 3.76 | 5.44 | 1.90 | |
| | DER | 95.66 | 98.82 | 90.63 | 89.79 | 96.85 | 87.80 | 95.735 | 95.27 | 94.53 | **93.90** |
| **UNIT** | C-Acc | 84.66 | 81.36 | 79.70 | 65.64 | 88.05 | 87.79 | 87.57 | 80.05 | 86.75 | |
| | ASR | 0.89 | 8.07 | 22.67 | 2.58 | 3.12 | 68.44 | 1.76 | 5.94 | 23.52 | |
| | DER | 93.37 | 94.55 | 78.78 | 80.69 | 95.90 | 60.45 | 96.365 | 89.85 | 82.245 | **85.80** |
| **BNP** | C-Acc | 91.27 | 83.31 | 91.40 | 90.33 | 65.69 | 93.11 | 92.34 | 89.72 | 91.74 | |
| | ASR | 13.12 | 0 | 24.23 | 90.08 | 47.38 | 85.68 | 9.52 | 0.88 | 3.55 | |
| | DER | 90.56 | 99.56 | 83.85 | 49.91 | 62.59 | 54.48 | 94.87 | 97.22 | 94.68 | **80.86** |
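For readability, we note how the DER values in these tables relate to C-Acc and ASR. The reported numbers are consistent with the commonly used defense effectiveness rating (this is our reading of the tables, using the pretrained BadNet model reported under W4 below):

$$\mathrm{DER} = \frac{1}{2}\Big[\max\big(0,\ \Delta \mathrm{ASR}\big) - \max\big(0,\ \Delta \text{C-Acc}\big)\Big] + 50\%,$$

where $\Delta \mathrm{ASR}$ and $\Delta \text{C-Acc}$ are the drops in attack success rate and clean accuracy relative to the backdoored model; e.g., for our method on BadNet, $\frac{1}{2}\big[(94.41 - 1.47) - (91.44 - 89.82)\big] + 50 = 95.66$.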
In conclusion, our FIM-based method fundamentally differs by detecting *training-time optimization pathology* rather than *inference-time output patterns*. This enables a robust defense against clean-label attacks (COMBAT, Narcissus) where existing methods fail.

**W2. Adaptation to models without BN layers.**

We created a specialized version of UniBP for non-BN models such as Vision Transformers by recognizing that ViTs use LayerNorm instead of BatchNorm and follow a Pre-LN architecture in which normalization precedes computation (LN → MLP) rather than following it (Conv → BN). Since LayerNorm does not maintain running statistics like BatchNorm, we manually collect reference statistics by hooking LayerNorm inputs during forward passes on clean data, computing the mean and variance across feature dimensions over multiple batches. Our key architectural insight is that in Pre-LN Transformers, the MLP feed-forward blocks between consecutive LayerNorm layers directly determine the input distribution of the subsequent normalization -- therefore, we strategically target only the MLP parameters (fc1, fc2 weights/biases) for FIM computation while explicitly excluding the LayerNorm parameters themselves through `_is_mlp_param()` filtering.

We attach hooks to the LayerNorm layers to capture their input statistics and compute an alignment loss `L = ||μ_input - μ_ref||² + λ||σ_input - σ_ref||²` against a reinitialized clean baseline model, which serves as our reference for "normal training dynamics." When this loss is backpropagated, gradients flow through the LayerNorm back to the upstream MLP blocks, and the accumulated squared gradients (FIM scores) reveal which MLP parameters are most critical to producing backdoor-specific activation distributions -- parameters with high FIM resist alignment with clean statistics because they were optimized on poisoned data. Finally, we prune only the top-ranked MLP parameters via noise injection, preserving the normalization layers while disrupting the backdoor pathway hidden in the feed-forward sublayers where Transformer backdoors typically reside. A sketch of this procedure is given below.
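The following is a minimal sketch of the LayerNorm-statistics collection and alignment loss just described, assuming a PyTorch ViT whose normalization layers are `nn.LayerNorm` instances; the function names, indexing scheme, and averaging over batches are illustrative assumptions rather than our exact code.

```python
# Illustrative sketch only (assumed names; real code also uses _is_mlp_param()).
import torch
import torch.nn as nn

def collect_reference_stats(model, clean_loader, num_batches=20):
    """Hook every LayerNorm input and average per-feature mean/std over clean batches."""
    lns = [m for m in model.modules() if isinstance(m, nn.LayerNorm)]
    sums = [None] * len(lns)
    hooks = []

    def grab(i):
        def hook(module, inputs, output):
            a = inputs[0]                              # (batch, tokens, features)
            mu, sd = a.mean(dim=(0, 1)), a.std(dim=(0, 1))
            sums[i] = (mu, sd) if sums[i] is None else (sums[i][0] + mu, sums[i][1] + sd)
        return hook

    for i, m in enumerate(lns):
        hooks.append(m.register_forward_hook(grab(i)))
    with torch.no_grad():
        for _, (x, _) in zip(range(num_batches), clean_loader):
            model(x)
    for h in hooks:
        h.remove()
    return [(mu / num_batches, sd / num_batches) for mu, sd in sums]

def ln_alignment_loss(model, x, ref_stats, lam=1.0):
    """L = ||mu_input - mu_ref||^2 + lam * ||sigma_input - sigma_ref||^2, summed over LN layers."""
    lns = [m for m in model.modules() if isinstance(m, nn.LayerNorm)]
    captured = [None] * len(lns)
    hooks = []

    def grab(i):
        def hook(module, inputs, output):
            a = inputs[0]
            captured[i] = (a.mean(dim=(0, 1)), a.std(dim=(0, 1)))
        return hook

    for i, m in enumerate(lns):
        hooks.append(m.register_forward_hook(grab(i)))
    model(x)
    for h in hooks:
        h.remove()
    return sum(((mu - mu_r) ** 2).sum() + lam * ((sd - sd_r) ** 2).sum()
               for (mu, sd), (mu_r, sd_r) in zip(captured, ref_stats))
```

Backpropagating this loss through the backdoored ViT and accumulating squared gradients on the MLP parameters yields the FIM scores described above.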
| Metric | BadNet FT=0.1 | BadNet FT=0.05 | LC FT=0.1 | LC FT=0.05 |
|--------|----------------|----------------|-----------|------------|
| C-Acc (Pretrained) | 0.9286 | 0.9062 | 0.8736 | 0.8731 |
| ASR (Pretrained) | 0.9542 | 0.9302 | 1.0000 | 1.0000 |
| C-Acc (Ours) | 0.9325 | 0.9310 | 0.9529 | 0.9467 |
| ASR (Ours) | 0.0034 | 0.0101 | 0.0230 | 0.0006 |
| DER (Ours) | 0.9754 | 0.96005 | 0.9885 | 0.9997 |

From the table, we can see that our method effectively eliminates backdoors across all attack scenarios (ASR reduced to near zero) while preserving or even improving clean accuracy, demonstrating that targeting MLP parameters via LayerNorm statistics successfully disrupts backdoor pathways without degrading model performance. We will add a subsection discussing this LayerNorm-based variant.

**W3. Computational overhead analysis.**

We record the running time of all defense baselines on 5,000 CIFAR-10 images with PreActResNet under identical hardware conditions, and report the results in the table below.

| Metric | FT | ANP | NAD | FST | TSBD | BNP | I-BAU | RNP | UNIT | Ours |
|---------------------|------|-----|-----|-----|------|-----|-------|-----|------|------|
| Running Time (s) | 95 | 414 | 129 | 157 | 1529 | 153 | 132 | 181 | 421 | 228 |
| Avg. DER | 70.38 | 89.42 | 72.39 | 89.86 | 84.76 | 80.86 | 89.68 | 76.75 | 85.80 | 93.90 |

Despite achieving the highest DER among all evaluated defenses (93.90%), our method maintains a moderate and practical running time of 228 seconds -- substantially faster than TSBD (1529 s) and ANP (414 s), and in the same ballpark as lighter baselines such as NAD (129 s) and BNP (153 s). Notably, several methods with lower DER (e.g., ANP, UNIT) require considerably more computation, indicating that our approach offers a more favorable robustness-efficiency trade-off. These results suggest that our defense is not only effective but also computationally reasonable for realistic deployment.

**W4. Accuracy trade-off and discussion.**

We agree that UniBP may incur a slightly larger clean-accuracy drop than some baselines, but we view this as an inherent and well-known trade-off for pruning-based defenses under zero adversary knowledge: any method that aggressively suppresses backdoor-related capacity without access to the true trigger or strong side information will inevitably sacrifice some clean performance (e.g., ANP and NAD). However, a critical point is that **UniBP is the only defense effective against all tested backdoor attacks**, including challenging sample-specific and adaptive variants where other methods fail to provide adequate protection, whereas methods that preserve marginally higher clean accuracy often leave non-trivial residual backdoor risk.

We further found that this trade-off is **mitigable rather than fundamental**. We conducted an additional study in which an extra r% of the training data is added to the fine-tuning set, and a third fine-tuning step is performed after UniBP completes. As shown in the table below, adding a small amount of additional clean data for fine-tuning is enough to recover or even improve clean accuracy over the pretrained model, while keeping ASR near zero and DER very high at ~100%. This demonstrates that, in practice, practitioners can achieve a favorable balance between robustness and utility at minimal extra cost. We will clarify this trade-off and explicitly discuss such mitigation strategies in the revised version.

| **Additional Data Ratio** | **BadNet ACC** | **BadNet ASR** | **BadNet DER** | **LC ACC** | **LC ASR** | **LC DER** |
|---------------------------|----------------|----------------|----------------|------------|------------|------------|
| Pretrained | 91.44 | 94.41 | -- | 84.19 | 100.00 | -- |
| r = 0.00 | 89.82 | 1.47 | 95.66 | 89.09 | 2.36 | 98.82 |
| r = 0.01 | 92.03 | 0.92 | 96.745 | 92.62 | 0.07 | 99.965 |
| r = 0.02 | 91.84 | 1.12 | 96.645 | 92.61 | 0.07 | 99.965 |
| r = 0.05 | 91.60 | 1.00 | 96.705 | 92.75 | 0.04 | 99.98 |

From the table, we can see that minimal additional clean data fully recovers clean accuracy while maintaining highly effective backdoor suppression, demonstrating that the initial accuracy trade-off is easily mitigated in practical deployment scenarios.

**W5. Additional baselines.**

Below is a quantitative comparison of our method and the suggested baselines UNIT [2], BNP [1], and RNP [5]:

| Method | Metric | BadNet | LC | COMBAT | SBL | Wanet | Narcissus | Adaptive-Batch | Input-aware | Refool | AVG-DER |
|--------|--------|--------|-----|---------|------|--------|-----------|----------------|-------------|---------|----------|
| **Ours** | C-Acc | 89.82 | 89.09 | 87.90 | 86.94 | 91.57 | 88.37 | 88.31 | 90.61 | 89.70 | |
| | ASR | 1.47 | 2.36 | 7.18 | 5.68 | 4.74 | 14.32 | 3.76 | 5.44 | 1.90 | |
| | DER | 95.66 | 98.82 | 90.63 | 89.79 | 96.85 | 87.80 | 95.735 | 95.27 | 94.53 | **93.90** |
| **UNIT** | C-Acc | 84.66 | 81.36 | 79.70 | 65.64 | 88.05 | 87.79 | 87.57 | 80.05 | 86.75 | |
| | ASR | 0.89 | 8.07 | 22.67 | 2.58 | 3.12 | 68.44 | 1.76 | 5.94 | 23.52 | |
| | DER | 93.37 | 94.55 | 78.78 | 80.69 | 95.90 | 60.45 | 96.365 | 89.85 | 82.245 | **85.80** |
| **BNP** | C-Acc | 91.27 | 83.31 | 91.40 | 90.33 | 65.69 | 93.11 | 92.34 | 89.72 | 91.74 | |
| | ASR | 13.12 | 0 | 24.23 | 90.08 | 47.38 | 85.68 | 9.52 | 0.88 | 3.55 | |
| | DER | 90.56 | 99.56 | 83.85 | 49.91 | 62.59 | 54.48 | 94.87 | 97.22 | 94.68 | **80.86** |
| **RNP** | C-Acc | 87.63 | 80.78 | 92.89 | 87.57 | 90.34 | 92.97 | 90.28 | 86.48 | 54.11 | |
| | ASR | 3.76 | 99.93 | 93.09 | 20.57 | 0.17 | 91.52 | 11.67 | 0.73 | 0 | |
| | DER | 93.42 | 48.33 | 50.165 | 82.66 | 98.52 | 51.50 | 92.765 | 95.67 | 77.685 | **76.75** |

From the table, UniBP achieves the highest average DER (93.90%), clearly outperforming UNIT, BNP, and RNP while maintaining competitive clean accuracy. Notably, it stays robust under clean-label attacks such as COMBAT and Narcissus.

**W6. Adaptive attacks.**

We thank the Reviewer for raising this important concern. However, we respectfully disagree with the assessment that "the proposed method may fail easily" under adaptive attacks targeting BN statistics. On the contrary, our experiments demonstrate that UniBP remains robust precisely because it exploits a fundamental property of backdoor mechanisms that cannot be easily circumvented. To directly address this concern, we implemented the **strongest possible adaptive attack** one could design against our defense: an adversary with full knowledge of our method who explicitly regularizes the attack to preserve benign BN statistics. Specifically, assuming the attacker has access to the clean BN statistics ($\mu^c_\ell$ and $v^c_\ell$), we augment the BadNets objective with a BN-alignment regularization term (a sketch of this objective is given below):

$$\mathcal{L}_{\text{adaptive}} = \mathbb{E}_{(\mathbf{x}, y)\sim\mathcal{D}} \left[ \mathrm{D}_{\mathrm{CE}}(y, f(\delta(\mathbf{x}))) \right] + \gamma \sum_{\ell=1}^{L} \mathbb{E}_{\mathbf{x}\sim \mathcal{X}} \Big[ \|\hat{\mu}_\ell - \mu_{\ell}^c\|_2 + \lambda\|\hat{v}_\ell - v_{\ell}^c\|_2 \Big].$$
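For concreteness, a minimal sketch of this BN-aligned poisoning objective is given below, assuming a PyTorch CNN with `BatchNorm2d` layers and a precomputed list of clean per-layer statistics; the names (`adaptive_poison_loss`, `apply_trigger`) are illustrative, and for brevity the penalty is computed on the same (triggered) batch.

```python
# Illustrative sketch of the adaptive attacker's objective (assumed names;
# apply_trigger() stands in for the attacker's trigger function delta(x)).
import torch
import torch.nn as nn
import torch.nn.functional as F

def adaptive_poison_loss(model, x, y, clean_stats, apply_trigger,
                         gamma=1.0, lam=1.0):
    """CE loss on triggered inputs + gamma * BN-statistics alignment penalty.
    clean_stats: list of (mu_c, var_c) per BN layer, in module order."""
    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    captured, hooks = [], []

    def grab(module, inputs, output):
        a = inputs[0]
        captured.append((a.mean(dim=(0, 2, 3)),
                         a.var(dim=(0, 2, 3), unbiased=False)))

    for m in bns:
        hooks.append(m.register_forward_hook(grab))
    logits = model(apply_trigger(x))          # delta(x): poisoned batch
    for h in hooks:
        h.remove()

    ce = F.cross_entropy(logits, y)           # backdoor objective D_CE(y, f(delta(x)))
    reg = sum(torch.norm(mu - mu_c) + lam * torch.norm(var - var_c)
              for (mu, var), (mu_c, var_c) in zip(captured, clean_stats))
    return ce + gamma * reg
```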
We systematically evaluate this adaptive attack across a wide range of regularization strengths $\gamma$:

| Method | Metric | $\gamma=0$ | $\gamma=0.01$ | $\gamma=0.1$ | $\gamma=1.0$ | $\gamma=10.0$ | $\gamma=100.0$ |
|------------|--------|------------|---------------|--------------|--------------|----------------|-----------------|
| Pretrained | ACC | 91.44 | 91.18 | 89.34 | 89.06 | 88.78 | NaN |
| Pretrained | ASR | 94.41 | 94.20 | 96.24 | 95.62 | 95.41 | NaN |
| Ours | ACC | 89.82 | 90.55 | 87.06 | 87.22 | 86.67 | NaN |
| Ours | ASR | 1.47 | 2.39 | 3.66 | 8.48 | 3.01 | NaN |

From the results, we can see that across all viable regularization strengths, UniBP maintains ASR below 9% while the backdoored model retains ASR above 94%, demonstrating that the defense remains highly effective even when the attacker explicitly targets our detection mechanism.

**Explanation.** The attacker faces a fundamental trade-off: backdoor attacks inherently require trigger-dependent feature patterns that distinguish poisoned from benign samples to enable misclassification, and these patterns necessarily manifest as distributional shifts in BN statistics. Attempting to suppress these shifts to evade detection directly undermines the core mechanism that makes the backdoor work. Since jointly achieving high ASR while keeping BN statistics indistinguishable from clean data is a tightly constrained objective with no demonstrated solution in prior work, we believe UniBP is not easily broken by realistic BN-aware adaptive attacks; at the very least, an effective attack of this kind would require a dedicated, non-trivial design effort.

[1] Zheng, Runkai, et al. "Pre-activation distributions expose backdoor neurons." Advances in Neural Information Processing Systems 35 (2022): 18667-18680.

[2] Cheng, Siyuan, et al. "UNIT: Backdoor mitigation via automated neural distribution tightening." ECCV 2024.

[3] Huynh, Tran, et al. "COMBAT: Alternated training for effective clean-label backdoor attacks." AAAI 2024.

[4] Zeng, Yi, et al. "Narcissus: A practical clean-label backdoor attack with limited information." ACM CCS 2023.

[5] Li, Yige, et al. "Reconstructive neuron pruning for backdoor defense." ICML 2023.
## Reviewer #uAhn

<!-- The key observation of this paper is that BN layers capture backdoor-related distributional shifts, and can be used to identify and mitigate backdoors. However, this observation has been made in prior works such as [8, 11]. The idea of pruning a set of neurons and then doing fine-tuning to remove backdoors is also similar to [9]. The authors should clarify the novelty of their findings and approach compared to these prior works. What are the key differences and contributions of UniBP that set it apart from existing methods? -->

**W1. Comparison with UNIT, BNP and RNP**

- We thank the reviewer for this important question. While prior works have explored BN layers for backdoor detection, our approach differs fundamentally in what we observe and how we use this information. Below we clarify the key limitations of existing methods and then our novel contributions.
- Limitations of prior methods:
  - **BNP [1]** compares the Batch Normalization statistics recorded during training (on poisoned data) against statistics computed from a small clean dataset using KL divergence. Neurons whose divergence exceeds $mean + u \times std$ are pruned. Limitations of BNP: (i) The implementation (from their released code) computes the clean statistics from only one batch (`iter(dataloader).next()`), resulting in high-variance, unreliable KL divergence estimates. It can completely miss backdoor patterns if the sampled batch lacks poisoned samples or contains atypical clean samples. (ii) BNP assumes each backdoor neuron exhibits significantly different statistics between clean and poisoned data (Assumption 2: $|\mu - \hat{\mu}| \gg \varepsilon$). This fails when the backdoor is distributed across many neurons, making per-neuron shifts too subtle and too variable across neurons to be reliably detected; a single threshold is not sufficient to capture such distributed activations. (iii) BNP assumes BN normalizes the current layer's convolution output (Conv → BN → ReLU). However, PreAct ResNet places BN *before* convolution (BN → ReLU → Conv). When BNP zeros out BN weights based on KL divergence, it disrupts the *previous* layer's output distribution rather than pruning the actual backdoor neurons in the current layer -- analogous to cutting the wrong wire in a circuit. Therefore, BNP only observes *what neurons output* (statistical differences in activation distributions) and cannot detect backdoors whose activation patterns intentionally mimic those of clean samples, as in clean-label attacks (COMBAT, Narcissus).
  - **UNIT [2]** approximates tight benign activation distribution boundaries for each neuron using 5% clean data, then clips activation values exceeding these boundaries during inference. **It assumes backdoor triggers cause abnormally large activations compared to benign inputs.** Limitations of UNIT: (i) COMBAT and Narcissus synthesize triggers containing features as persistent as the original semantic features of the target class [Zhu et al., AAAI 2023; Wang et al., arXiv]. The backdoor activations are *not* substantially larger than benign activations -- they deliberately occupy the same distribution space as legitimate class features, directly violating UNIT's core assumption. As a result, UNIT's boundary-based clipping cannot differentiate "high activation from a backdoor trigger" from "high activation from a genuine class feature" when the two distributions overlap.
  (ii) UNIT operates purely at the activation level, without any notion of the origin or training dynamics of neuron behavior. When backdoor features and authentic semantic features produce similar activation patterns, UNIT faces an impossible dilemma: (a) loose boundaries fail to remove the backdoor, or (b) tight boundaries damage clean accuracy by clipping legitimate high activations. Therefore, UNIT only observes *activation magnitudes* and cannot detect backdoors deliberately designed to produce benign-like activation patterns within the normal activation distribution.
  - **RNP [3]** employs two-phase pruning: (1) neuron unlearning maximizes the cross-entropy loss on clean samples to disrupt clean neurons while preserving backdoor neurons, and (2) filter recovering uses learnable masks to identify backdoor filters. The methodology's limitations are as follows: (i) Clean-label backdoor attacks such as COMBAT and Narcissus embed backdoors within natural feature representations, breaking RNP's assumption that backdoor neurons behave differently during reconstruction, and the basic gradient ascent in the first step is not sufficient to model the relationship between the clean task and the backdoor task; this leads to the failures shown in our quantitative table below. (ii) Narcissus's naturally distributed triggers cannot be distinguished from legitimate features using clean-sample reconstruction alone. (iii) The method is hyperparameter-sensitive, requiring careful tuning of pruning ratios and learning rates, which becomes intractable when backdoor features mimic normal neuron behavior.
- We provide a theoretical and quantitative comparison with BNP, UNIT, and RNP as follows.
| **Aspect** | **BNP** | **UNIT** | **RNP** | **Ours (FIM-based)** |
|------------|---------|----------|---------|----------------------|
| Detection Signal | KL divergence (statistics) | Activation bounds (magnitude) | Parameter changes w.r.t. cross-entropy loss | Fisher Information (optimization) |
| Targeted Layers | BN Layers | All Layers | All Layers | BN Layers |
| What It Observes | Output statistical differences | Output activation sizes | Unlearning behavior | Training dynamics & parameter sensitivity |
| Baseline Model | None | None | None | Reinitialized model (clean reference) |
| Clean-Label Attacks | Partial (if stats differ) | Fails | Fails | Works (optimization fingerprint) |
| Estimation Stability | Single batch (high variance) | Stable | Stable | Multi-sample aggregation |

Below is a quantitative comparison of our method and these baselines:

| Method | Metric | BadNet | LC | COMBAT | SBL | Wanet | Narcissus | Adaptive-Batch | Input-aware | Refool | AVG-DER |
|--------|--------|--------|-----|---------|------|--------|-----------|----------------|-------------|---------|----------|
| **Ours** | C-Acc | 89.82 | 89.09 | 87.90 | 86.94 | 91.57 | 88.37 | 88.31 | 90.61 | 89.70 | |
| | ASR | 1.47 | 2.36 | 7.18 | 5.68 | 4.74 | 14.32 | 3.76 | 5.44 | 1.90 | |
| | DER | 95.66 | 98.82 | 90.63 | 89.79 | 96.85 | 87.80 | 95.735 | 95.27 | 94.53 | **93.90** |
| **UNIT** | C-Acc | 84.66 | 81.36 | 79.70 | 65.64 | 88.05 | 87.79 | 87.57 | 80.05 | 86.75 | |
| | ASR | 0.89 | 8.07 | 22.67 | 2.58 | 3.12 | 68.44 | 1.76 | 5.94 | 23.52 | |
| | DER | 93.37 | 94.55 | 78.78 | 80.69 | 95.90 | 60.45 | 96.365 | 89.85 | 82.245 | **85.80** |
| **BNP** | C-Acc | 91.27 | 83.31 | 91.40 | 90.33 | 65.69 | 93.11 | 92.34 | 89.72 | 91.74 | |
| | ASR | 13.12 | 0 | 24.23 | 90.08 | 47.38 | 85.68 | 9.52 | 0.88 | 3.55 | |
| | DER | 90.56 | 99.56 | 83.85 | 49.91 | 62.59 | 54.48 | 94.87 | 97.22 | 94.68 | **80.86** |
| **RNP** | C-Acc | 87.63 | 80.78 | 92.89 | 87.57 | 90.34 | 92.97 | 90.28 | 86.48 | 54.11 | |
| | ASR | 3.76 | 99.93 | 93.09 | 20.57 | 0.17 | 91.52 | 11.67 | 0.73 | 0 | |
| | DER | 93.42 | 48.33 | 50.165 | 82.66 | 98.52 | 51.50 | 92.765 | 95.67 | 77.685 | **76.75** |

- **Our novel contributions:** Instead of analyzing *what neurons output* (BNP: statistics, UNIT: magnitudes), our method analyzes *how neurons were trained*. The Fisher Information Matrix (FIM) captures the optimization dynamics and parameter sensitivity to the training data when we align the BN statistics of a reinitialized model toward those of the backdoored model, revealing the backdoor's training-time fingerprint. Specifically, UniBP was designed based on the following observations: (i) Affine parameters in BN layers exhibit abnormally high Fisher information because they are explicitly optimized to memorize trigger patterns from a small poisoned subset while learning the main task. This behavior remains largely invariant across different backdoor attacks. We simulate the backdoor learning process by aligning the BN-layer statistics of a reinitialized model to those of the backdoored model, and from the resulting learning trajectory we can identify which neurons are responsible for jointly maintaining both the clean task and the backdoor task.
  This methodology is fundamentally different from detecting abnormal activations at the neuron level as in UNIT, from computing the KL divergence between a model's statistics on clean data and those of the backdoored model as in BNP, and from identifying important neurons that maximize the cross-entropy loss as in RNP. (ii) By comparing the FIM of the backdoored model with that of the reinitialized model, we establish a reference for "backdoor training dynamics." This eliminates (a) BNP's lack of a baseline (it only compares two views of the same backdoored model), and (b) UNIT's blind spot of using only absolute activation thresholds without understanding the training context. For clean-label attacks where backdoor and benign features are semantically similar, only the *backdoor training dynamics* (captured by the FIM difference from the clean baseline) reveal the backdoor. (iii) Unlike BNP's single-batch vulnerability, FIM computation aggregates gradient information across multiple samples or the full optimization trajectory, providing stable, representative estimates even when individual batches lack poisoned samples or exhibit high variance (see the sketch below). To this end, our method outperforms the discussed baselines, and we will update this discussion in our paper accordingly.
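To illustrate point (iii), the toy snippet below contrasts a single-batch estimate of per-channel statistics (as in BNP's released code) with a multi-batch aggregate; it is purely illustrative and is not taken from either codebase.

```python
# Toy illustration: single-batch vs. multi-batch estimation of per-channel statistics.
import torch

def single_batch_stats(feats_loader):
    x = next(iter(feats_loader))              # one batch of features, shape (N, C)
    return x.mean(dim=0), x.var(dim=0)        # high-variance estimate

def aggregated_stats(feats_loader, num_batches=50):
    mean_sum, sq_sum, count = 0.0, 0.0, 0
    for _, x in zip(range(num_batches), feats_loader):
        mean_sum = mean_sum + x.sum(dim=0)
        sq_sum = sq_sum + (x ** 2).sum(dim=0)
        count += x.shape[0]
    mean = mean_sum / count
    var = sq_sum / count - mean ** 2          # much lower estimator variance
    return mean, var
```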
<!-- (2) Limited Evaluation: The evaluation is limited to a small set of attacks (5) and baselines (5). It would be helpful to see results on a wider range of attacks[1,2,3,4,5], including more recent adaptive attacks that may specifically target embedding distribution[6,7]. Additionally, evaluating on recent defense baselines are important to showcase the superiority of UniBP[8,9,10]. -->

**W2. Limited evaluation**

We sincerely appreciate this constructive feedback. Following the reviewers' suggestions, we have substantially expanded our evaluation to include a more comprehensive set of attacks and defense baselines.

(i) **Additional attacks.** We added four more sophisticated attacks to our evaluation suite, selected based on codebase availability and attack diversity:
  - Narcissus [5]: a clean-label attack that crafts naturally distributed triggers
  - Adaptive-Batch [8]: an adaptive attack designed to evade batch-based defenses
  - Input-aware [6]: a dynamic backdoor attack that adapts to input characteristics
  - Refool [7]: a reflection-based backdoor attack using natural transformations

(ii) **Additional defense baselines.** We incorporated four recent defense methods, I-BAU [9], RNP [3], BNP [1], and UNIT [2], as suggested.

This brings our total evaluation to 9 attack types and 10 defense methods, providing a comprehensive assessment of backdoor defense effectiveness. The table below presents results on the four newly added attacks across all ten defenses (covering both the additional attacks and the additional baselines). Our method achieves the highest average DER (93.33) across these challenging scenarios.

Table: Results with the newly added attacks and defenses.

| Method | Metric | Narcissus | Adaptive-Batch | Input-aware | Refool | AVG-DER |
|--------|--------|-----------|----------------|-------------|--------|---------|
| **Ours** | C-Acc | 88.37 | 88.31 | 90.61 | 89.70 | -- |
| | ASR | 14.32 | 3.76 | 5.44 | 1.90 | -- |
| | DER | 87.80 | 95.735 | 95.27 | 94.53 | **93.33** |
| **I-BAU** | C-Acc | 89.27 | 89.84 | 89.67 | 87.87 | -- |
| | ASR | 33.01 | 1.26 | 50.90 | 2.02 | -- |
| | DER | 78.905 | 97.75 | 72.18 | 93.555 | **85.60** |
| **UNIT** | C-Acc | 87.79 | 87.57 | 80.05 | 86.75 | -- |
| | ASR | 68.44 | 1.76 | 5.94 | 23.52 | -- |
| | DER | 60.45 | 96.365 | 89.85 | 82.245 | **82.23** |
| **BNP** | C-Acc | 93.11 | 92.34 | 89.72 | 91.74 | -- |
| | ASR | 85.68 | 9.52 | 0.88 | 3.55 | -- |
| | DER | 54.48 | 94.87 | 97.22 | 94.68 | **85.31** |
| **RNP** | C-Acc | 92.97 | 90.28 | 86.48 | 54.11 | -- |
| | ASR | 91.52 | 11.67 | 0.73 | 0.00 | -- |
| | DER | 51.50 | 92.765 | 95.67 | 77.685 | **79.41** |
| **FT** | C-Acc | 92.35 | 92.19 | 91.39 | 91.64 | -- |
| | ASR | 89.81 | 99.94 | 96.50 | 15.54 | -- |
| | DER | 52.045 | 49.585 | 50.00 | 88.68 | **60.08** |
| **ANP** | C-Acc | 89.55 | 89.55 | 86.17 | 86.92 | -- |
| | ASR | 86.77 | 86.77 | 0.16 | 0.12 | -- |
| | DER | 52.165 | 54.85 | 95.80 | 94.03 | **74.21** |
| **NAD** | C-Acc | 91.27 | 91.18 | 92.31 | 90.67 | -- |
| | ASR | 88.06 | 90.03 | 98.80 | 9.27 | -- |
| | DER | 52.38 | 54.035 | 50.00 | 91.33 | **61.94** |
| **FST** | C-Acc | 92.18 | 92.04 | 92.67 | 91.70 | -- |
| | ASR | 93.91 | 0.40 | 0.00 | 3.93 | -- |
| | DER | 49.91 | 99.28 | 97.99 | 94.49 | **85.42** |
| **TSBD** | C-Acc | 92.85 | 92.40 | 93.18 | 92.24 | -- |
| | ASR | 82.16 | 4.07 | 5.43 | 1.77 | -- |
| | DER | 56.12 | 97.625 | 95.28 | 95.57 | **86.15** |

We can see from the table that:
1. Our method achieves the highest average DER (93.33), outperforming all baselines.
2. Unlike methods that are effective on specific attacks but fail on others (e.g., FST: 99.28 on Adaptive-Batch but 49.91 on Narcissus), our approach maintains consistently high performance across all scenarios.

We believe this comprehensive evaluation addresses the reviewer's concern and demonstrates the broad applicability and effectiveness of UniBP; we will update our manuscript accordingly.

<!-- (3) Dependence on Batch Normalization: UniBP relies heavily on the presence of Batch Normalization layers to identify and mitigate backdoors. However, many modern architectures, such as Vision Transformers (ViT) and ConvNeXt, do not use BN layers. This limits the applicability of UniBP to a narrower set of models. The authors should discuss how UniBP could be adapted or extended to work with models that do not use BN, or provide empirical results on such architectures. -->

<!-- (4) Adaptive Attacks: The paper does not sufficiently address the potential for adaptive attacks that could specifically target the UniBP defense. If an attacker is aware of the UniBP method, they may design triggers or training strategies that evade detection by BN parameter pruning. The authors should discuss potential adaptive attack scenarios and evaluate the robustness of UniBP against such attacks. -->

**W4. Adaptive attacks**

We thank the Reviewer for raising this important concern regarding the robustness of UniBP against adaptive adversaries. To directly address this, we implemented and evaluated against **the strongest conceivable adaptive attack**: an adversary with full knowledge of our defense mechanism who explicitly regularizes the backdoor training to preserve benign BN statistics.

**Adaptive attack design.** Assuming the attacker has access to the clean BN statistics ($\mu^c_\ell$ and $v^c_\ell$), we augment the standard BadNets objective with a BN-alignment regularization term:

$$\mathcal{L}_{\text{adaptive}} = \mathbb{E}_{(\mathbf{x}, y)\sim\mathcal{D}} \left[ \mathrm{D}_{\mathrm{CE}}(y, f(\delta(\mathbf{x}))) \right] + \gamma \sum_{\ell=1}^{L} \mathbb{E}_{\mathbf{x}\sim \mathcal{X}} \Big[ \|\hat{\mu}_\ell - \mu_{\ell}^c\|_2 + \lambda\|\hat{v}_\ell - v_{\ell}^c\|_2 \Big].$$

This forces the attacker to explicitly minimize the divergence between backdoored and clean BN statistics during training.

**Evaluation results.** We systematically evaluate this adaptive attack across regularization strengths $\gamma \in [0.01, 10]$:

| Method | Metric | $\gamma=0$ | $\gamma=0.01$ | $\gamma=0.1$ | $\gamma=1.0$ | $\gamma=10.0$ |
|------------|--------|------------|---------------|--------------|--------------|----------------|
| Pretrained | ACC | 91.44 | 91.18 | 89.34 | 89.06 | 88.78 |
| Pretrained | ASR | 94.41 | 94.20 | 96.24 | 95.62 | 95.41 |
| Ours | ACC | 89.82 | 90.55 | 87.06 | 87.22 | 86.67 |
| Ours | ASR | 1.47 | 2.39 | 3.66 | 8.48 | 3.01 |

Our observations are as follows. (1) Across all viable regularization strengths, UniBP keeps ASR below 9% while the backdoored models retain ASR above 94%, demonstrating robustness even when attackers explicitly target our detection mechanism. (2) The attacker faces a **fundamental trade-off**: backdoor attacks inherently require trigger-dependent feature patterns that distinguish poisoned from benign samples to enable misclassification. These discriminative patterns necessarily manifest as distributional shifts in BN statistics.
Suppressing these shifts to evade detection directly undermines the backdoor's efficacy -- the attacker cannot simultaneously maintain high ASR and preserve benign BN statistics. (3) At high regularization ($\gamma=10$), we observe ASR fluctuation (3.01% vs. 8.48% at $\gamma=1.0$), suggesting the optimization landscape becomes highly unstable when both objectives are forced, which further validates the inherent difficulty of this adaptive strategy.

To the best of our knowledge, *no prior work has demonstrated an effective solution to this trade-off between maintaining backdoor effectiveness and evading BN-based detection*. While we acknowledge that more sophisticated trigger-design strategies (e.g., learnable perturbations optimized jointly with the BN regularization) remain an open research direction, our results provide strong evidence that UniBP establishes a robust defense against realistic, BN-aware adaptive attackers. We will add these experiments and this discussion to the revised manuscript.

[1] Zheng, Runkai, et al. "Pre-activation distributions expose backdoor neurons." Advances in Neural Information Processing Systems 35 (2022): 18667-18680.

[2] Cheng, Siyuan, et al. "UNIT: Backdoor mitigation via automated neural distribution tightening." ECCV 2024.

[3] Li, Yige, et al. "Reconstructive neuron pruning for backdoor defense." ICML 2023.

[4] Huynh, Tran, et al. "COMBAT: Alternated training for effective clean-label backdoor attacks." AAAI 2024.

[5] Zeng, Yi, et al. "Narcissus: A practical clean-label backdoor attack with limited information." ACM CCS 2023.

[6] Nguyen, Tuan Anh, and Anh Tran. "Input-aware dynamic backdoor attack." NeurIPS 2020.

[7] Liu, Yunfei, et al. "Reflection backdoor: A natural backdoor attack on deep neural networks." ECCV 2020.

[8] Qi, Xiangyu, et al. "Revisiting the assumption of latent separability for backdoor defenses." ICLR 2023.

[9] Zeng, Yi, et al. "Adversarial unlearning of backdoors via implicit hypergradient." ICLR 2022.

## Reviewer #88MC

<!-- Weaknesses: -->
<!-- W1. Some typos exist, e.g., "??" in line 211 and "Batch-norm affine reset" in line 250 is not consistent with the other title (The first letter of each word is not capitalized). -->

**W1. Some typos exist**

We conducted a careful proofreading pass, fixed the typos and errors raised by the reviewer, and will update our paper accordingly.

**W2. Comparable coloring in blue**

<!-- For the results presentation, it is unfair to color only UniBP in blue for the comparable performance. The baseline performance should also be highlighted and fairly show the comparison. -->

We will also highlight in blue the baselines with comparable performance, for a fairer comparison and better readability.

**W3. Scalability experiments**

<!-- Some scalable experiments may help better illustrate the effectiveness, e.g., performance in the ViT model or the ImageNet dataset. The CIFAR-10 and GTSRB are too small, making the generalizability of UniBP unclear. -->

We thank the reviewer for this important point. To address the concern about generalizability, we have conducted additional experiments on Vision Transformer architectures and the larger Tiny-ImageNet dataset. The results demonstrate that UniBP's effectiveness extends well beyond CIFAR-10 and GTSRB.

---

**Extension to Vision Transformers**

We adapted UniBP for Vision Transformers by accounting for the architectural differences from CNNs.
ViTs use LayerNorm instead of BatchNorm and follow a Pre-LN architecture where normalization precedes computation (LN → MLP) rather than following it (Conv → BN). Since LayerNorm does not maintain running statistics, we manually collect reference statistics by hooking LayerNorm inputs during forward passes on clean data.

**The key insight** is that in Pre-LN Transformers, the MLP feed-forward blocks between consecutive LayerNorm layers directly determine the input distribution of the subsequent normalization. Since Transformer backdoors typically hide in these feed-forward layers where trigger-specific computations occur, we strategically target only the MLP parameters (fc1, fc2 weights/biases) for FIM computation while excluding the LayerNorm parameters through `_is_mlp_param()` filtering. We attach hooks to the LayerNorm layers and compute an alignment loss `L = ||μ_input - μ_ref||² + λ||σ_input - σ_ref||²` against a clean baseline model. When backpropagated, gradients flow through the LayerNorm to the upstream MLP blocks, and the accumulated squared gradients (FIM scores) identify which MLP parameters are most critical to backdoor-specific activation patterns. We then prune only the top-ranked MLP parameters via noise injection, disrupting the backdoor pathway while preserving the normalization layers; a sketch of this ranking and pruning step is given below.
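Below is a minimal sketch of the ranking and noise-injection step under the assumptions stated earlier (PyTorch, a timm-style ViT whose MLP blocks expose `fc1`/`fc2`); the naming pattern in `_is_mlp_param`, the pruning ratio, the noise scale, and `alignment_loss_fn` (the LayerNorm alignment loss sketched in our response to Reviewer #Yqvj, W2) are illustrative assumptions rather than our exact values.

```python
# Illustrative sketch of FIM-based ranking and noise-injection pruning of MLP parameters.
import torch

def _is_mlp_param(name):
    # Assumption: timm-style ViT blocks expose mlp.fc1 / mlp.fc2 submodules.
    return ".mlp.fc1." in name or ".mlp.fc2." in name

def prune_mlp_by_fim(model, clean_loader, alignment_loss_fn,
                     prune_ratio=0.05, noise_std=0.5, num_batches=50):
    """Rank MLP parameters by accumulated squared gradients of the alignment
    loss, then perturb the top-ranked entries with Gaussian noise."""
    mlp_params = {n: p for n, p in model.named_parameters() if _is_mlp_param(n)}
    fim = {n: torch.zeros_like(p) for n, p in mlp_params.items()}

    for _, (x, _) in zip(range(num_batches), clean_loader):
        model.zero_grad()
        alignment_loss_fn(model, x).backward()          # scalar alignment loss
        for n, p in mlp_params.items():
            if p.grad is not None:
                fim[n] += p.grad.detach() ** 2          # empirical FIM diagonal

    # Global threshold: the score of the k-th largest entry across all MLP params.
    all_scores = torch.cat([v.flatten() for v in fim.values()])
    k = max(1, int(prune_ratio * all_scores.numel()))
    thresh = torch.topk(all_scores, k).values.min()

    with torch.no_grad():
        for n, p in mlp_params.items():
            mask = (fim[n] >= thresh).to(p.dtype)       # 1.0 at top-ranked entries
            p.add_(noise_std * torch.randn_like(p) * mask)
    return fim
```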
**Table: Results with Vision Transformer Architecture**

| Metric | BadNet FT=0.1 | BadNet FT=0.05 | LC FT=0.1 | LC FT=0.05 |
|--------|----------------|----------------|-----------|------------|
| C-Acc (Pretrained) | 0.9286 | 0.9062 | 0.8736 | 0.8731 |
| ASR (Pretrained) | 0.9542 | 0.9302 | 1.0000 | 1.0000 |
| C-Acc (Ours) | 0.9325 | 0.9310 | 0.9529 | 0.9467 |
| ASR (Ours) | 0.0034 | 0.0101 | 0.0230 | 0.0006 |
| DER (Ours) | 0.9754 | 0.96005 | 0.9885 | 0.9997 |

The method effectively eliminates backdoors (ASR near zero) while maintaining or improving clean accuracy, confirming that our approach transfers successfully to Transformer architectures.

---

**Scalability to Larger Datasets**

We further evaluated UniBP on Tiny-ImageNet, which is substantially larger and more complex than CIFAR-10 and GTSRB.

**Table: Results with Tiny-ImageNet**

| Method | Metric | BadNet | LC | Wanet | Adaptive-Batch | Input-aware | AVG-DER |
|-------------|--------|--------|----|--------|-----------------|--------------|---------|
| Pretrained | C-Acc | 0.4712 | 0.5678 | 0.5497 | 0.5349 | 0.4682 | |
| | ASR | 0.9416 | 0.6770 | 0.9970 | 0.9993 | 0.9524 | |
| FT | C-Acc | 0.5516 | 0.5655 | 0.5271 | 0.4984 | 0.5160 | |
| | ASR | 0.9039 | 0.6733 | 0.4800 | 0.9632 | 0.9881 | |
| | DER | 0.51885 | 0.50070 | 0.74720 | 0.49980 | 0.50000 | 0.55331 |
| ANP | C-Acc | 0.4712 | 0.5468 | 0.5422 | 0.4911 | 0.4221 | |
| | ASR | 0.9416 | 0.5918 | 0.7696 | 0.6636 | 0.0976 | |
| | DER | 0.50000 | 0.53210 | 0.60995 | 0.64595 | 0.90435 | 0.63847 |
| NAD | C-Acc | 0.4977 | 0.5653 | 0.5244 | 0.5003 | 0.5197 | |
| | ASR | 0.2953 | 0.7011 | 0.5159 | 0.9634 | 0.9936 | |
| | DER | 0.82315 | 0.49875 | 0.72790 | 0.50065 | 0.50000 | 0.61009 |
| FST | C-Acc | 0.2696 | 0.2844 | 0.2875 | 0.2918 | 0.2425 | |
| | ASR | 0.0031 | 0.0057 | 0.0006 | 0.0000 | 0.0000 | |
| | DER | 0.86845 | 0.69395 | 0.86710 | 0.87810 | 0.86335 | 0.83419 |
| TSBD | C-Acc | 0.5264 | 0.5497 | 0.5273 | 0.4867 | 0.5465 | |
| | ASR | 0.5495 | 0.1634 | 0.0072 | 0.0038 | 0.0012 | |
| | DER | 0.69605 | 0.74775 | 0.98370 | 0.97365 | 0.97560 | 0.87535 |
| Ours | C-Acc | 0.4873 | 0.5022 | 0.4827 | 0.4558 | 0.4602 | |
| | ASR | 0.1749 | 0.0279 | 0.0051 | 0.0402 | 0.0387 | |
| | DER | 0.88335 | 0.79175 | 0.96245 | 0.94000 | 0.95285 | 0.90608 |

UniBP achieves the highest average DER (90.61%) across all five attack types, outperforming TSBD (87.54%) and FST (83.42%). More importantly, UniBP consistently suppresses ASR to low levels across diverse attacks (from simple BadNet to sophisticated adaptive and input-aware attacks) while maintaining reasonable clean accuracy. FST, though achieving near-zero ASR on some attacks, suffers from catastrophic accuracy drops to ~27%, rendering it impractical. UniBP maintains clean accuracy around 46-50%, offering a much better robustness-utility balance at scale. Together with the ViT results, these experiments demonstrate that UniBP works effectively across different model architectures (CNNs and Transformers), dataset scales (CIFAR-10 to Tiny-ImageNet), and diverse attack types, addressing the generalizability concern raised by the reviewer.

**Q1. How can UniBP be applied to transformer-based or normalization-free architectures?**

Thank you for this thoughtful question. While UniBP leverages the BatchNorm statistics of CNNs, the core insight -- monitoring activation-distribution shifts induced by backdoors -- is architecture-agnostic. For transformer-based models, similar principles can be applied using alternatives to BatchNorm, such as LayerNorm statistics or intermediate token distributions (as shown above). For normalization-free architectures, surrogate statistical signals (e.g., layer-wise activation histograms or activation entropy) could be explored. Extending UniBP to these settings is a promising direction, and we have added it to our discussion section.

**Q2. Can the method be combined with data-free defense approaches to further relax the clean data requirement?**

We appreciate the suggestion. As noted by the Reviewer, UniBP currently assumes access to a small clean subset, which is a common setup in recent defenses (e.g., I-BAU, RNP, ANP). To relax this assumption, combining UniBP with data-free techniques is a viable direction.
For instance, one could employ generative models (e.g., GANs or diffusion models) trained on benign data to approximate clean samples and recover BN statistics. The main challenge lies in ensuring that the generated samples faithfully preserve the statistical structure of the original training data. We acknowledge this as an exciting direction for future work and will add it to the discussion.

## Reviewer #nswQ

<!-- The following uncited paper adjusts batch-normalization parameters for backdoor defense. So, it should have been cited and compared against. [1] X. Li et al. Backdoor Mitigation by Correcting the Distribution of Neural Activations. Elsevier Neurocomputing 614, 21 January 2025. http://arxiv.org/abs/2308.09850 -->

**W1. Uncited paper (BNA)**

Thank you for pointing this out. We will add the BNA paper [1] to the related work section and include a conceptual comparison. The key differences between BNA and our method are:

- BNA explicitly relies on the distributional shift between clean and triggered data at each neuron and then minimizes the KL divergence between these two distributions.
- However, this setting violates our threat model and makes a direct comparison unfair: BNA requires an estimated trigger function (Algorithm 1 in the BNA paper), which is not available in our scenario and is not straightforward to obtain in practice.
- In contrast, our method does **not** require any triggered dataset or explicit trigger estimation. UniBP only observes how BN-layer parameters behave when a freshly reinitialized model is aligned toward the BN statistics of a backdoored model, making it more practical and less dependent on strong assumptions about the attacker's trigger.

<!-- Re. line 145: Though some papers do assume a strong adversary (insider) who controls the training process, typically backdoor poisoning can be effectively accomplished by just inserting poisoned examples into the training dataset. -->

**W2. Strong adversary (insider) assumption**

We appreciate this comment and would like to clarify our threat model as follows. While many backdoor attacks can be accomplished through data poisoning alone, we explicitly consider stronger adversarial scenarios (such as the insider threat in Pham et al., SBL [2]) to demonstrate the robustness of our defense. We emphasize that our paper shows UniBP is effective against both standard data-poisoning attacks and these more extreme insider-threat scenarios. We believe that considering stronger threat models actually strengthens our contribution, for two reasons: (1) if a defense can effectively handle extreme scenarios with insider adversaries who control the training process, it provides stronger guarantees for the more common case of data-poisoning attacks; (2) as noted by Reviewers #uAhn and #Yqvj, effective backdoor defenses must consider adaptive adversaries who are aware of the defense mechanism and can attempt to bypass it, and evaluating against stronger threat models helps demonstrate resilience against such adaptive strategies. Therefore, rather than limiting our contribution, *this threat-model assumption strengthens the broader applicability and robustness of UniBP* across a spectrum of attack scenarios, from standard data poisoning to sophisticated insider threats.
<!-- Re. line 148,149: The statement is odd because a very large number of prior papers on backdoor defense, particularly inversion/reverse-engineering approaches, make exactly this "post training" assumption. -->

**W3. "Post-training" assumption**

We agree that the post-training setting is not new and has been widely adopted in prior backdoor defenses (e.g., I-BAU, FST, ANP, RNP). Our intention was not to claim that this assumption is unusual, but to clarify the threat model we focus on, and we will revise the wording accordingly. We argue that the post-training (model-only) regime is both practical and important for the following reasons: (i) in many real-world scenarios, users obtain a pre-trained model from an external provider or model zoo and have no control over, or access to, the original training pipeline; (ii) retraining from scratch on large-scale data is often prohibitively expensive or infeasible; and (iii) several strong baselines already operate under this assumption, which makes it a natural setting for fair comparison. Our contribution is therefore not the assumption itself, but a defense that remains effective in this realistic, model-only, post-training scenario, without requiring access to the original training data or trigger information.

**W4. Comparison with I-BAU**

We thank the reviewer for this suggestion. We have added I-BAU as an additional baseline and present the comprehensive comparison below:

| Method | Metric | BadNet | LC | COMBAT | SBL | Wanet | Narcissus | Adaptive-Batch | Input-aware | Refool | AVG-DER |
|--------|--------|--------|-----|---------|-----|--------|-----------|----------------|-------------|--------|---------|
| Ours | C-Acc | 89.82 | 89.09 | 87.90 | 86.94 | 91.57 | 88.37 | 88.31 | 90.61 | 89.70 | |
| | ASR | 1.47 | 2.36 | 7.18 | 5.68 | 4.74 | 14.32 | 3.76 | 5.44 | 1.90 | |
| | DER | 95.66 | 98.82 | 90.63 | 89.79 | 96.85 | 87.80 | 95.735 | 95.27 | 94.53 | **93.90** |
| I-BAU | C-Acc | 88.13 | 86.33 | 91.01 | 88.20 | 86.52 | 89.27 | 89.84 | 89.67 | 87.87 | |
| | ASR | 7.91 | 2.45 | 1.98 | 0.76 | 20.04 | 33.01 | 1.26 | 50.90 | 2.02 | |
| | DER | 91.595 | 98.775 | 94.78 | 92.88 | 86.675 | 78.905 | 97.75 | 72.18 | 93.555 | **89.68** |

UniBP achieves a higher average DER (93.90%) than I-BAU (89.68%). Critically, I-BAU fails against sample-specific attacks such as Narcissus (ASR 33.01%) and Input-aware (ASR 50.90%) due to its reliance on a universal-trigger assumption, i.e., it searches for common adversarial perturbations across samples, which do not exist when each poisoned sample has a unique trigger. In contrast, UniBP maintains consistently low ASR across all attack types, making it the only defense effective against both universal and sample-specific backdoors.

<!-- **Intuitive explanation.** This performance gap occurs because I-BAU relies on a trigger synthesis and unlearning approach with a small set of clean data. The method assumes that synthesized triggers can effectively represent the actual backdoor triggers, and that unlearning these synthesized triggers on clean samples will remove the backdoor behavior. However, sample-specific triggers (Narcissus, Wanet, Input-aware) create instance-dependent backdoor mappings where each poisoned sample has a unique or dynamically generated trigger pattern. This variability makes it extremely difficult for I-BAU's trigger synthesis procedure to accurately capture the diverse trigger space, leading to incomplete backdoor removal. Moreover, I-BAU's gradient-based unlearning on a limited clean dataset may not sufficiently distinguish between backdoor-related and task-relevant features when triggers are designed to be imperceptible or adaptive. Sample-specific attacks deliberately entangle backdoor features with benign representations to evade detection and removal techniques that assume a fixed, universal trigger pattern. -->
In contrast, UniBP leverages BatchNorm statistics to detect the distributional anomalies induced by any backdoor trigger, which remains effective regardless of trigger specificity. By operating on feature-distribution shifts rather than synthesizing individual triggers, UniBP maintains consistent effectiveness across all attack types, including sample-specific and adaptive variants (a toy illustration of this distribution-shift view is appended after the references below).

**W5. Font size**

<!-- The fonts in the figures and tables are too small. -->

Thank you for pointing this out. We will increase the font sizes in all figures and tables as far as the conference format allows and will prioritize readability in the camera-ready version.

[1] X. Li et al. Backdoor Mitigation by Correcting the Distribution of Neural Activations. Neurocomputing, vol. 614, January 2025.

[2] Hoang Pham, The-Anh Ta, Anh Tran, and Khoa D Doan. Flatness-aware Sequential Learning Generates Resilient Backdoors. In European Conference on Computer Vision (ECCV), pp. 89–107. Springer, 2024.
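Finally, the toy illustration referenced in our W4 response above: a minimal, hypothetical probe (not the UniBP algorithm) that compares each BatchNorm layer's stored running statistics with the statistics of the activations a given batch actually produces. The helper name and the scoring are our own simplifications.

```python
# Toy probe of the feature-distribution-shift intuition (not the UniBP algorithm):
# per BN layer, compare the stored running statistics with the batch statistics
# of the activations actually reaching that layer.
import torch
import torch.nn as nn

@torch.no_grad()
def bn_shift_scores(model: nn.Module, batch: torch.Tensor):
    scores, handles = {}, []

    def make_hook(name, bn):
        def hook(module, inputs, output):
            x = inputs[0]
            dims = [0] + list(range(2, x.dim()))   # average over batch and spatial dims
            mu = x.mean(dim=dims)
            # Normalized gap between observed and stored per-channel means.
            gap = (mu - bn.running_mean).abs() / (bn.running_var.sqrt() + 1e-5)
            scores[name] = gap.mean().item()
        return hook

    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            handles.append(m.register_forward_hook(make_hook(name, m)))

    model.eval()
    model(batch)
    for h in handles:
        h.remove()
    return scores
```

Comparing these per-layer scores on a trusted clean batch versus a suspected batch highlights where the feature distribution has drifted, without requiring any trigger synthesis.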
