# Why Focal Loss Replaced Hard Negative Mining (HNM)

## πŸ”Ή 1. HNM is Discrete & Heuristic

* **SSD approach**: after computing losses, sort the negatives and keep only the top *k* (usually 3Γ— the number of positives); a minimal sketch of this selection appears at the end of this note.
* **Hard cutoff**: some negatives are kept, the rest are dropped, purely by rank.
* Issues:
  * ❌ Hard, non-differentiable selection (the sort/top-k step).
  * ❌ Requires a hand-set hyperparameter (the 3:1 negative-to-positive ratio).
  * ❌ Potentially unstable if the dataset's class distribution shifts.

## πŸ”Ή 2. Focal Loss is Continuous & Adaptive

* Formula (using the $p_t$ notation from the RetinaNet paper):

  $$
  FL(p_t) = -(1 - p_t)^\gamma \log(p_t), \qquad
  p_t =
  \begin{cases}
  p & \text{if } y = 1 \\
  1 - p & \text{otherwise}
  \end{cases}
  $$

  where *p* is the predicted foreground probability.
* Behavior:
  * βœ… Easy negatives ($p \approx 0$, so $p_t \approx 1$) β†’ weight β‰ˆ 0.
  * βœ… Hard negatives ($p \approx 0.5$, so $p_t \approx 0.5$) β†’ retain strong weight.
* Works per example, **no sorting required** (see the sketch at the end of this note).

## πŸ”Ή 3. Scalability to Many Classes

* **HNM**: a fixed ratio per image gets tricky with 80+ classes (e.g., COCO).
* **Focal Loss**: adjusts each example's contribution **automatically per sample**; no hand-tuned ratios needed.

## πŸ”Ή 4. Efficiency

* **HNM**:
  * Compute all losses.
  * Sort the negatives.
  * Select the top *k*.
  * ⚠️ The extra sort-and-select step adds overhead and implementation complexity.
* **Focal Loss**:
  * Just apply the $(1 - p_t)^\gamma$ modulating factor.
  * βœ… Lightweight, elementwise, GPU-friendly.

## πŸ”Ή 5. Stability & Performance

* Fully differentiable β†’ smoother optimization.
* Empirical results:
  * RetinaNet with focal loss **beats SSD with HNM** on COCO AP and is more robust to extreme foreground-background imbalance.

## βœ… Bottom Line

* **HNM (SSD)** = *manual*: "throw away most negatives, keep a few hard ones."
* **Focal Loss (RetinaNet)** = *automatic*: "keep all of them, but weight them smartly and smoothly."

πŸ‘‰ **Focal Loss is now preferred** because it is elegant, efficient, differentiable, and scales better to complex datasets.
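
To make the contrast concrete, here is a minimal PyTorch sketch of SSD-style hard negative mining. The function name, tensor shapes, and the `neg_pos_ratio` argument are illustrative assumptions rather than SSD's actual code; the point is the hard, rank-based cutoff.

```python
import torch

def hard_negative_mining(cls_loss: torch.Tensor,
                         is_positive: torch.Tensor,
                         neg_pos_ratio: int = 3) -> torch.Tensor:
    """Pick which anchors contribute to the classification loss, SSD-style.

    cls_loss      : (N,) per-anchor classification loss, already computed.
    is_positive   : (N,) bool mask of anchors matched to a ground-truth box.
    neg_pos_ratio : keep at most this many negatives per positive (the 3:1 ratio).

    Returns a (N,) bool mask covering all positives plus the hardest negatives.
    """
    num_pos = int(is_positive.sum())
    num_neg = neg_pos_ratio * max(num_pos, 1)

    # Rank negatives by loss: this sort/top-k step is the hard, non-differentiable cutoff.
    neg_loss = cls_loss.clone()
    neg_loss[is_positive] = float("-inf")                 # exclude positives from the ranking
    _, order = neg_loss.sort(descending=True)             # hardest negatives first
    rank = torch.empty_like(order)
    rank[order] = torch.arange(len(order), device=order.device)
    keep_neg = rank < num_neg                             # everything below the cutoff is discarded

    return is_positive | keep_neg
```

The returned mask is then used to select the per-anchor losses, which are summed and normalized by the number of positives.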
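
And here is a matching sketch of the Ξ±-balanced focal loss from the RetinaNet paper (Ξ³ = 2 and Ξ± = 0.25 are the paper's defaults). The function name and the placement of the normalization are illustrative; note that there is no sorting or selection, only an elementwise reweighting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor,
               targets: torch.Tensor,
               gamma: float = 2.0,
               alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits  : (N,) raw foreground scores for each anchor.
    targets : (N,) float labels, 1.0 = foreground, 0.0 = background.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce                # easy examples get weight ~0
    # Sum over all anchors and normalize by the number of positives (RetinaNet convention).
    return loss.sum() / targets.sum().clamp(min=1.0)
```

Every anchor stays in the sum; confident background predictions simply contribute almost nothing, which is exactly the "keep all, but weight smartly" behavior described above.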