Experiments Peter 22-09-2022
---
ResNet18 (11 million params) on binary CIFAR-10 (car vs horse)
---
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| +0 "damaged" models | 98.70 | 0.06565 | 0.01115 | 0.9991 | 4 min |

ResNet18 (11 million params) on binary CIFAR-10 (cat vs dog)
---
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4 min |

Full-batch Gradient Descent
---
**Setup**
* optimizer for base model $w_*$:
  * prior_prec = 2
  * momentum-sgd
  * learning_rate=0.03, cosine annealing to 0 over 150 epochs, momentum=0.8
  * batchsize = 200
* optimizer for perturbation models:
  * **full-batch gradient descent with momentum**
  * **batchsize = 10000**
  * **learning_rate=0.06**, **no lr scheduler**, momentum=0.8
* averaging method:
  * average predicted probabilities
* dataset:
  * cat vs. dog
  * N=10000 training examples, 2000 test examples
  * data augmentation: horizontal flips, random crops
* perturbation-settings (see the sketch after this list):
  * poolsize varying
  * perturbation-batchsize P $\sim$ Uniform(M/2, M)
  * variances for ranking from exact functional-Laplace (matrix-inversion-lemma)
* initial model is trained to 99.87% training accuracy
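A rough sketch of how the damage-and-retrain loop above fits together. `retrain_fn` and the exact ranking computation are placeholders for code not shown here, so treat this as an assumption-laden outline rather than the actual implementation:

```python
import numpy as np

def build_perturbed_ensemble(w_star, rank_scores, retrain_fn, n_models,
                             pool_frac=0.10, n_epochs=20, seed=0):
    """Sketch of the damage-and-retrain loop from the setup above.

    rank_scores: per-example variances from the exact functional-Laplace
                 ranking (higher score = kept in the removal pool).
    retrain_fn:  hypothetical callable(w_init, keep_indices, n_epochs) -> model,
                 standing in for the actual retraining code.
    pool_frac:   poolsize M as a fraction of the N training examples.
    """
    rng = np.random.default_rng(seed)
    N = len(rank_scores)
    M = int(pool_frac * N)
    pool = np.argsort(-rank_scores)[:M]          # top-M ranked examples form the pool

    models = []
    for _ in range(n_models):
        P = int(rng.integers(M // 2, M + 1))     # perturbation-batchsize P ~ Uniform(M/2, M)
        removed = rng.choice(pool, size=P, replace=False)
        keep = np.setdiff1d(np.arange(N), removed)
        models.append(retrain_fn(w_star, keep, n_epochs))  # retrain from w_* on "damaged" data
    return models
```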
**Baseline 1: No-Damage**
* retrain from $w_*$ on all data
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
**Baseline 2: Uniform-N (naive memory-perturbation)**
* data *not ranked* (random perturbation)
* init: retrain from $w_*$
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
**Experiment 1: Uniform-M**
* data *ranked with exact criterion*
* init: retrain from $w_*$
* poolsize M=10%
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
**Experiment 2: Ranked-M**
* data *ranked with exact criterion*
* batches sampled with probability proportional to the ranking score (see the sketch below)
* init: retrain from $w_*$
* poolsize M=0%
* n_epochs=20
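For Ranked-M, a minimal sketch of sampling the perturbation batch with probability proportional to the ranking score (names are hypothetical, not the project's actual code):

```python
import numpy as np

def sample_ranked_batch(rank_scores, pool, batch_size, seed=0):
    """Draw a perturbation batch from `pool` with P(i) proportional to rank_scores[i]."""
    rng = np.random.default_rng(seed)
    probs = rank_scores[pool] / rank_scores[pool].sum()
    return rng.choice(pool, size=batch_size, replace=False, p=probs)
```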
Experiments Peter 16-09-2022
---
ResNet18 (11 million params) on binary CIFAR-10 (cat vs dog)
---
**Setup**
* optimizer for base model $w_*$:
  * prior_prec = 2
  * momentum-sgd
  * learning_rate=0.03, cosine annealing to 0 over 150 epochs, momentum=0.8
  * batchsize = 200
* optimizer for perturbation models:
  * same as for $w_*$, only with a different `n_epochs`
* averaging method:
  * average predicted probabilities (see the sketch after this list)
* dataset:
  * cat vs. dog
  * N=10000 training examples, 2000 test examples
  * data augmentation: horizontal flips, random crops
* perturbation-settings:
  * poolsize varying
  * perturbation-batchsize P $\sim$ Uniform(M/2, M)
  * variances for ranking from exact functional-Laplace (matrix-inversion-lemma)
* initial model is trained to 99.87% training accuracy
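The "+K (perturbed) models" rows below report the ensemble obtained by averaging predicted probabilities over the members. A minimal PyTorch sketch of that averaging step, with model and loader names left as placeholders:

```python
import torch

@torch.no_grad()
def averaged_probs(models, loader, device="cpu"):
    """Average predicted class probabilities over all ensemble members."""
    member_probs = []
    for model in models:
        model.to(device).eval()
        probs = [torch.softmax(model(x.to(device)), dim=-1) for x, _ in loader]
        member_probs.append(torch.cat(probs))
    return torch.stack(member_probs).mean(dim=0)   # shape [n_test, n_classes]
```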
**Baseline 1: No-Damage**
* retrain from $w_*$ on all data
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +4 models | 80.70 | 0.51184 | 0.09302 | 0.8913 | 6 min |
| +8 models | 81.15 | 0.46487 | 0.07085 | 0.8946 | 9 min |
| +32 models | 82.05 | 0.41901 | 0.05452 | 0.9007 | 23 min |
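For reference, the ECE column is the usual binned expected calibration error; a minimal sketch for the binary case (the 15 equal-width bins are an assumption, not necessarily the binning used for these numbers):

```python
import numpy as np

def expected_calibration_error(p_pos, labels, n_bins=15):
    """Binned ECE: bin-weighted |accuracy - confidence| gap.

    p_pos:  predicted probability of the positive class, shape [N]
    labels: 0/1 ground-truth labels, shape [N]
    """
    conf = np.maximum(p_pos, 1.0 - p_pos)          # confidence of the predicted class
    correct = ((p_pos >= 0.5).astype(int) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```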
**Baseline 2: Uniform-N (naive memory-perturbation)**
* data *not ranked* (random perturbation)
* init: retrain from $w_*$
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +4 perturbed models | 81.30 | 0.49920 | 0.07626 | 0.8917 | 6 min |
| +8 perturbed models | 81.00 | 0.46294 | 0.06707 | 0.8921 | 9 min |
| +32 perturbed models | 81.60 | 0.42447 | 0.05717 | 0.8986 | 23 min |
**Experiment 1: Uniform-M**
* data *ranked with exact criterion*
* init: retrain from $w_*$
* poolsize M=30%
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +4 perturbed models | 80.45 | 0.54873 | 0.08838 | 0.8894 | 6 min |
| +8 perturbed models | 81.40 | 0.48248 | 0.07273 | 0.8944 | 9 min |
| +32 perturbed models | 82.05 | 0.43932 | 0.06049 | 0.8979 | 23 min |
Training from scratch
---
**Baseline 3: No-Damage - Training from scratch**
* train on all data (no examples removed)
* training each model from scratch
* n_epochs=150
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +32 models | 82.50 | 0.46883 | 0.07365 | 0.9073 |137.6 min |
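The from-scratch runs reuse the base-model optimizer settings (momentum SGD, lr 0.03 cosine-annealed to 0 over 150 epochs, momentum 0.8). A minimal PyTorch sketch of that schedule, with the per-epoch training step left as a placeholder and the handling of prior_prec an assumption not shown here:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=2)                      # fresh initialization (from scratch)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.8)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150, eta_min=0.0)

for epoch in range(150):
    # train_one_epoch(model, optimizer, train_loader)  # placeholder for the actual epoch loop
    scheduler.step()                                   # cosine annealing to 0 over 150 epochs
```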
**Baseline 4: Uniform-N (naive memory-perturbation) - Training from scratch**
* data *not ranked* (random perturbation)
* training each perturbation model from scratch
* poolsize M=15%
* n_epochs=150
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +32 perturbed models | 82.50 | 0.45712 | 0.07470 | 0.9068 | a bit less than 137.6 min |
**Experiment 2: Uniform-M - Training from scratch**
* data *ranked with exact criterion*
* training each model from scratch
* poolsize M=15%
* n_epochs=150
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +32 perturbed models | 82.25 | 0.48381 | 0.08192 | 0.9049 | a bit less than 137.6 min |
**Things to try**
* epochs for retraining
* smaller initial learning rate when retraining
* GD experiments to control for the effect of SGD noise
* other params: poolsize M, perturbation-batchsize P
* influence-weighted sampling from M
* class-balanced sampling of the perturbation-batch (sketched below)
* class pairs other than cat vs dog (this one is particularly hard)
* perturbing the SWAG trajectory
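One of the items above, class-balanced sampling of the perturbation batch, could look roughly like the following sketch (assuming binary labels; purely illustrative, not something that has been run):

```python
import numpy as np

def class_balanced_batch(pool, labels, batch_size, seed=0):
    """Draw (roughly) half of the perturbation batch from each class within the pool."""
    rng = np.random.default_rng(seed)
    picks = []
    for c in np.unique(labels[pool]):
        members = pool[labels[pool] == c]
        k = min(batch_size // 2, len(members))
        picks.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(picks)
```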
Removing large batches of low-ranked data
---
Perturbation batches consist of low-ranked data amounting to 35%-70% of all training data.
* data *ranked with exact criterion*, **ranking inverted** (low-ranked data removed)
* training each model starting from $w_*$
* n_epochs=20
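A minimal sketch of forming such a large, low-ranked perturbation batch (inverted ranking); the names and fractions are placeholders taken from the settings below:

```python
import numpy as np

def low_ranked_batch(rank_scores, pool_frac=0.70, min_frac=0.35, max_frac=0.70, seed=0):
    """Remove a large batch of the *lowest*-ranked examples (inverted ranking)."""
    rng = np.random.default_rng(seed)
    N = len(rank_scores)
    pool = np.argsort(rank_scores)[: int(pool_frac * N)]             # lowest scores first
    P = int(rng.integers(int(min_frac * N), int(max_frac * N) + 1))  # 35%-70% of all data
    return rng.choice(pool, size=min(P, len(pool)), replace=False)
```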
**Baseline 5: No-Damage** (different seed than baseline 1)
* retrain from $w_*$ on all data
* n_epochs=20
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +16 models | 81.60 | 0.43449 | 0.05694 | 0.8961 | X |
**Experiment 3:**
* poolsize M=70%, perturbation batches between 35-70% of training data.
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +16 perturbed models | 81.15 | 0.42059 | 0.05083 | 0.8944 |
Conclusion: NLL and ECE improved compared to the "no-damage" baseline, while accuracy and ROC decreased slightly.
**Experiment 4:**
* poolsize M=90%, perturbation batches between 35-70% of training data.
|Name |acc ↑|nll ↓|ECE ↓|ROC ↑|Training Time|
|-----|--------|---|---|---|-------------|
| initial model | 80.15 | 1.04978 | 0.15829 | 0.8837 | 4.3min |
| +16 perturbed models | 81.00 | 0.42595 | 0.04639 | 0.8936 |