# Why Focal Loss Replaced Hard Negative Mining (HNM)

## πŸ”Ή 1. HNM is Discrete & Heuristic

* **SSD approach**: after computing losses, sort the negatives and keep only the top *k* (usually 3Γ— the number of positives); a minimal sketch of this selection appears at the end of this note.
* **Hard cutoff**: some negatives are kept, the rest are dropped, purely by rank.
* Issues:
  * ❌ Hard, non-differentiable selection (the sort/top-k step).
  * ❌ Requires a hand-set hyperparameter (the 3:1 negative-to-positive ratio).
  * ❌ Potentially unstable if the dataset's class distribution shifts.

## πŸ”Ή 2. Focal Loss is Continuous & Adaptive

* Formula (using the $p_t$ notation from the RetinaNet paper):

  $$
  FL(p_t) = -(1 - p_t)^\gamma \log(p_t), \qquad
  p_t =
  \begin{cases}
  p & \text{if } y = 1 \\
  1 - p & \text{otherwise}
  \end{cases}
  $$

  where *p* is the predicted foreground probability.
* Behavior:
  * βœ… Easy negatives ($p \approx 0$, so $p_t \approx 1$) β†’ weight β‰ˆ 0.
  * βœ… Hard negatives ($p \approx 0.5$, so $p_t \approx 0.5$) β†’ retain strong weight.
* Works per example, **no sorting required** (see the sketch at the end of this note).

## πŸ”Ή 3. Scalability to Many Classes

* **HNM**: a fixed ratio per image gets tricky with 80+ classes (e.g., COCO).
* **Focal Loss**: adjusts each example's contribution **automatically per sample**; no hand-tuned ratios needed.

## πŸ”Ή 4. Efficiency

* **HNM**:
  * Compute all losses.
  * Sort the negatives.
  * Select the top *k*.
  * ⚠️ The extra sort-and-select step adds overhead and implementation complexity.
* **Focal Loss**:
  * Just apply the $(1 - p_t)^\gamma$ modulating factor.
  * βœ… Lightweight, elementwise, GPU-friendly.

## πŸ”Ή 5. Stability & Performance

* Fully differentiable β†’ smoother optimization.
* Empirical results:
  * RetinaNet with focal loss **beats SSD with HNM** on COCO AP and is more robust to extreme foreground-background imbalance.

## βœ… Bottom Line

* **HNM (SSD)** = *manual*: "throw away most negatives, keep a few hard ones."
* **Focal Loss (RetinaNet)** = *automatic*: "keep all of them, but weight them smartly and smoothly."

πŸ‘‰ **Focal Loss is now preferred** because it is elegant, efficient, differentiable, and scales better to complex datasets.
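
To make the contrast concrete, here is a minimal PyTorch sketch of SSD-style hard negative mining. The function name, tensor shapes, and the `neg_pos_ratio` argument are illustrative assumptions rather than SSD's actual code; the point is the hard, rank-based cutoff.

```python
import torch

def hard_negative_mining(cls_loss: torch.Tensor,
                         is_positive: torch.Tensor,
                         neg_pos_ratio: int = 3) -> torch.Tensor:
    """Pick which anchors contribute to the classification loss, SSD-style.

    cls_loss      : (N,) per-anchor classification loss, already computed.
    is_positive   : (N,) bool mask of anchors matched to a ground-truth box.
    neg_pos_ratio : keep at most this many negatives per positive (the 3:1 ratio).

    Returns a (N,) bool mask covering all positives plus the hardest negatives.
    """
    num_pos = int(is_positive.sum())
    num_neg = neg_pos_ratio * max(num_pos, 1)

    # Rank negatives by loss: this sort/top-k step is the hard, non-differentiable cutoff.
    neg_loss = cls_loss.clone()
    neg_loss[is_positive] = float("-inf")                 # exclude positives from the ranking
    _, order = neg_loss.sort(descending=True)             # hardest negatives first
    rank = torch.empty_like(order)
    rank[order] = torch.arange(len(order), device=order.device)
    keep_neg = rank < num_neg                             # everything below the cutoff is discarded

    return is_positive | keep_neg
```

The returned mask is then used to select the per-anchor losses, which are summed and normalized by the number of positives.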
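
And here is a matching sketch of the Ξ±-balanced focal loss from the RetinaNet paper (Ξ³ = 2 and Ξ± = 0.25 are the paper's defaults). The function name and the placement of the normalization are illustrative; note that there is no sorting or selection, only an elementwise reweighting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor,
               targets: torch.Tensor,
               gamma: float = 2.0,
               alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits  : (N,) raw foreground scores for each anchor.
    targets : (N,) float labels, 1.0 = foreground, 0.0 = background.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce                # easy examples get weight ~0
    # Sum over all anchors and normalize by the number of positives (RetinaNet convention).
    return loss.sum() / targets.sum().clamp(min=1.0)
```

Every anchor stays in the sum; confident background predictions simply contribute almost nothing, which is exactly the "keep all, but weight smartly" behavior described above.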