Experiments Peter 22-09-2022
---

ResNet18 (11 million params) on binary CIFAR-10 (car vs horse)
---

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4 min |
| +0 "damaged" models | 98.70 | 0.06565 | 0.01115 | 0.9991 | 4 min |

ResNet18 (11 million params) on binary CIFAR-10 (cat vs dog)
Full-batch Gradient Descent
---

**Setup**

* optimizer base model w_star:
    * prior_prec = 2
    * momentum-sgd
    * learning_rate=0.03, cosine annealing to 0 over 150 epochs, momentum=0.8
    * batchsize = 200
* optimizer perturbation:
    * **full-batch gradient descent with momentum**
    * **batchsize = 10000**
    * **learning_rate=0.06**, **no lr scheduler**, momentum=0.8
* averaging method:
    * average predicted probabilities
* dataset:
    * cat vs. dog
    * N=10000 training examples, 2000 test examples
    * data augmentation: horizontal flips, random crops
* perturbation-settings:
    * poolsize varying
    * perturbation-batchsize P $\sim$ Uniform(M/2, M)
    * variances for ranking from exact functional-Laplace (matrix-inversion lemma)
* initial model is trained to 99.87% training accuracy

**Baseline 1: No-Damage**

* retrain from $w_*$ on all data
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |

**Baseline 2: Uniform-N (naive memory-perturbation)**

* data *not ranked* (random perturbation)
* init: retrain from $w_*$
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |

**Experiment 1: Uniform-M**

* data *ranked with exact criterion*
* init: retrain from $w_*$
* poolsize M=10%
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |

**Experiment 2: Ranked-M**

* data *ranked with exact criterion*
* batches sampled with probability $\propto$ ranking score (see the sketch after this list)
* init: retrain from $w_*$
* poolsize M=0%
* n_epochs=20
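For reference, a minimal sketch of how a perturbation batch could be drawn under the Uniform-M and Ranked-M settings above. It assumes `scores` holds the per-example ranking scores (the exact functional-Laplace variances); the function and argument names are illustrative, not the actual experiment code.

```python
import numpy as np

def sample_perturbation_batch(scores, pool_frac=0.3, ranked=False, rng=None):
    """Pick the example indices to remove for one perturbed ("damaged") model.

    scores    : per-example ranking scores, shape (N,), higher = more influential
    pool_frac : poolsize M as a fraction of N (e.g. 0.3 for M=30%)
    ranked    : False -> Uniform-M (uniform within the top-M pool),
                True  -> Ranked-M (probability proportional to score)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(scores)
    m = int(pool_frac * n)                       # pool size M
    pool = np.argsort(scores)[::-1][:m]          # top-M ranked examples
    p_size = rng.integers(m // 2, m + 1)         # P ~ Uniform(M/2, M)
    if ranked:
        probs = scores[pool] / scores[pool].sum()
        batch = rng.choice(pool, size=p_size, replace=False, p=probs)
    else:
        batch = rng.choice(pool, size=p_size, replace=False)
    return batch                                 # examples left out when retraining

# Example: 30% pool, Uniform-M
# removed = sample_perturbation_batch(scores, pool_frac=0.3, ranked=False)
```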
Experiments Peter 16-09-2022
---

ResNet18 (11 million params) on binary CIFAR-10 (cat vs dog)
---

**Setup**

* optimizer w_star:
    * prior_prec = 2
    * momentum-sgd
    * learning_rate=0.03, cosine annealing to 0 over 150 epochs, momentum=0.8
    * batchsize = 200
* optimizer perturbation:
    * as in optimizer w_star, just different "n_epochs"
* averaging method:
    * average predicted probabilities
* dataset:
    * cat vs. dog
    * N=10000 training examples, 2000 test examples
    * data augmentation: horizontal flips, random crops
* perturbation-settings:
    * poolsize varying
    * perturbation-batchsize P $\sim$ Uniform(M/2, M)
    * variances for ranking from exact functional-Laplace (matrix-inversion lemma)
* initial model is trained to 99.87% training accuracy

**Baseline 1: No-Damage**

* retrain from $w_*$ on all data
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +4 models | 80.70 | 0.51184 | 0.09302 | 0.8913 | 6 min |
| +8 models | 81.15 | 0.46487 | 0.07085 | 0.8946 | 9 min |
| +32 models | 82.05 | 0.41901 | 0.05452 | 0.9007 | 23 min |

**Baseline 2: Uniform-N (naive memory-perturbation)**

* data *not ranked* (random perturbation)
* init: retrain from $w_*$
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +4 perturbed models | 81.30 | 0.49920 | 0.07626 | 0.8917 | 6 min |
| +8 perturbed models | 81.00 | 0.46294 | 0.06707 | 0.8921 | 9 min |
| +32 perturbed models | 81.60 | 0.42447 | 0.05717 | 0.8986 | 23 min |

**Experiment 1: Uniform-M**

* data *ranked with exact criterion*
* init: retrain from $w_*$
* poolsize M=30%
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +4 perturbed models | 80.45 | 0.54873 | 0.08838 | 0.8894 | 6 min |
| +8 perturbed models | 81.40 | 0.48248 | 0.07273 | 0.8944 | 9 min |
| +32 perturbed models | 82.05 | 0.43932 | 0.06049 | 0.8979 | 23 min |

Training from scratch
---

**Baseline 3: No-Damage - Training from scratch**

* retrain on all data
* training each model from scratch
* n_epochs=150

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +32 models | 82.50 | 0.46883 | 0.07365 | 0.9073 | 137.6 min |

**Baseline 4: Uniform-N (naive memory-perturbation) - Training from scratch**

* data *not ranked* (random perturbation)
* training each perturbation model from scratch
* poolsize M=15%
* n_epochs=150

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +32 perturbed models | 82.50 | 0.45712 | 0.07470 | 0.9068 | a bit less than 137.6 min |

**Experiment 2: Uniform-M - Training from scratch**

* data *ranked with exact criterion*
* training each model from scratch
* poolsize M=15%
* n_epochs=150

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +32 perturbed models | 82.25 | 0.48381 | 0.08192 | 0.9049 | a bit less than 137.6 min |

**To-Try's**

* epochs for retraining
* smaller initial learning rate when retraining
* GD experiments to control for the effect of SGD noise
* other params: poolsize M, perturbation-batchsize P
* influence-weighted sampling from M
* class-balanced sampling of the perturbation batch
* other classes than cat vs dog (this pair is particularly hard)
* perturbing the SWAG trajectory
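The "+k models" rows above all use the averaging method from the setup: predicted probabilities (not logits or votes) are averaged over the ensemble members, and the metrics are computed on that average. A minimal PyTorch sketch, assuming `models` is the list of retrained networks and `x` a test batch (names are illustrative):

```python
import torch

@torch.no_grad()
def ensemble_probs(models, x):
    # average predicted class probabilities over the ensemble members;
    # acc / nll / ECE / ROC are then evaluated on the returned average
    probs = [torch.softmax(m(x), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)
```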
Removing large batches of low-ranked data
---

Perturbation batches of low-ranked data of the size of 35%-70% of all data.

* data *ranked with exact criterion*, **ranking inverted** (low-ranked data removed)
* training each model starting from $w_*$
* n_epochs=20

**Baseline 5: No-Damage** (different seed than Baseline 1)

* retrain from $w_*$ on all data
* n_epochs=20

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +16 models | 81.60 | 0.43449 | 0.05694 | 0.8961 | X |

**Experiment 3:**

* poolsize M=70%, perturbation batches between 35% and 70% of the training data

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +16 perturbed models | 81.15 | 0.42059 | 0.05083 | 0.8944 | |

Conclusion: ECE and ROC improved in comparison to the "no-damage" baseline.

**Experiment 4:**

* poolsize M=90%, perturbation batches between 35% and 70% of the training data

|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3 min |
| +16 perturbed models | 81.00 | 0.42595 | 0.04639 | 0.8936 | |
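For completeness, a sketch of the ECE metric reported in all tables. The notes do not state the binning scheme, so equal-width confidence bins (a common default) are an assumption here, and the function name is illustrative.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, C) predicted probabilities, labels: (N,) integer class labels."""
    conf = probs.max(axis=1)                  # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)     # examples falling into this bin
        if mask.any():
            # bin weight * |accuracy - average confidence| within the bin
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```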