LRDE DSD (PyTorch)

Draft 1

  • Training pipeline (MNIST)
    • Better display epoch
  • Establish NN (LeNet)
  • Plot training loss
  • Add plot_wb()
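
A minimal sketch of what plot_wb() could look like (the helper itself is not shown in these notes): one histogram per weight/bias tensor, with the "save option" from Draft 2 modeled as an optional save_path argument. Names and defaults are illustrative, not the repo's actual API.

    import matplotlib.pyplot as plt

    def plot_wb(model, save_path=None):
        """Plot one histogram per weight/bias tensor of a PyTorch model."""
        params = [(n, p.detach().cpu().flatten().numpy())
                  for n, p in model.named_parameters()]
        fig, axes = plt.subplots(len(params), 1,
                                 figsize=(6, 2 * len(params)), squeeze=False)
        for (name, values), ax in zip(params, axes[:, 0]):
            ax.hist(values, bins=100)
            ax.set_title(name, fontsize=8)
        fig.tight_layout()
        if save_path is not None:   # "plot_wb with save option" (Draft 2)
            fig.savefig(save_path)
        else:
            plt.show()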

Draft 2

  • Add gradient masking (see the sketch after this list)
    • Work out a small example to debug
  • Don't mask the first convolution!
  • plot_wb with save option
  • Try to adapt to train_dsd
    • Try learning rate scheduler
    • ModelCheckpoint
    • CSVLogger
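
A minimal sketch of the gradient-masking idea for the sparse phase of DSD, assuming per-layer magnitude pruning. The 30% sparsity value and the function names are illustrative, not the repo's actual implementation; the first convolution is left untouched, as noted above.

    import torch
    import torch.nn as nn

    def build_masks(model, sparsity=0.3):
        """One binary mask per prunable layer; 0 marks the smallest-magnitude weights."""
        masks, first_conv_skipped = {}, False
        for name, module in model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                if isinstance(module, nn.Conv2d) and not first_conv_skipped:
                    first_conv_skipped = True           # leave the first convolution untouched
                    continue
                w = module.weight.detach().abs()
                threshold = torch.quantile(w.flatten(), sparsity)
                masks[name] = (w > threshold).float()
                module.weight.data.mul_(masks[name])    # prune once when the sparse phase starts
        return masks

    def mask_gradients(model, masks):
        """Zero the gradients of pruned weights so they stay at 0 (call after loss.backward())."""
        for name, module in model.named_modules():
            if name in masks and module.weight.grad is not None:
                module.weight.grad.mul_(masks[name])

In the training loop the order would be loss.backward(), mask_gradients(model, masks), optimizer.step(); with Adam or SGD+momentum it is safer to also re-multiply the weights by their mask after optimizer.step(), since optimizer state can move pruned weights away from 0.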

Draft 3

  • Clean implementation of DSD
    • Separate into files

Draft 4

  • Try to reproduce Adam result with NN
  • Try to reproduce SGD result with NN
  • Try to reproduce Adam-dsd result with NN
  • Try to reproduce SGD-dsd result with NN

  • NN + no dataug + batch_size=32
    • Good accuracy, but overfits quickly
  • NN + no dataug + batch_size=128
    • Okay accuracy, but overfits quickly
  • NN + dataug + batch_size=32
    • Good accuracy, no overfitting (can train for more epochs)
  • VGG13 + no dataug + batch_size=32
    • Doesn't train (stuck around 30% accuracy)

TODO:

  • Launch training on MNIST with the NN, without a learning rate scheduler. Does it work?
    • Yes
  • Reproduce the same learning rate scheduler as in TensorFlow (see the scheduler sketch at the end of this list)
    • Expected: (target LR schedule plot — image not available)
    • Result: (obtained LR schedule plot — image not available)
  • Launch NN training on MNIST with the learning rate scheduler.
  • Try LR scheduler + NN + SGD on the FER+ dataset
    • val_acc = 0.762
  • Implement DSD
  • Run NN + adam DSD
  • Run NN + sgd DSD.
    • Commit 4.yaml first.
  • VGG13 + sgd DSD + MNIST
    • Check that the weight distribution is good.
  • class weight + train VGG13
  • Try with mobilenet
    • Change the first convolution to take 1 input channel instead of 3 (link; see the sketch at the end of this list).
    • Adapt config file to choose dataset
    • Adapt config file to create model from config file.
    • Make it train on MNIST + adam
    • Make it train on FER+ + adam
    • Make it train on FER+ + DSD + sgd
  • If MobileNet works, plan all runs and meanwhile implement a more classical VGG.
  • Try VGG13 + adam
  • Try LR scheduler with VGG13 + SGD
    • If it doesn't work, use VGG16
  • Log test_accuracy to MLflow.
  • Try VGG16 + adam
    • Doesn't work at all
  • Try LR scheduler + VGG16 + sgd
    • Does work, but overfits around epoch 16.
  • Compress VGG16 + MobileNetv2

  • experiments:
    • 1: MobilenetV2 Adam
      • Overfits -> check/add a learning rate scheduler.
    • 2: MobilenetV2 Adam-dsd
    • 3: MobilenetV2 Sgd
    • 4: MobilenetV2 Sgd-dsd
    • 5: VGG16: SGD
    • 6: VGG16: SGD-dsd

  • pipreqs /project/path -> generates requirements.txt based on imports.
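
Scheduler sketch referenced above: the exact TensorFlow schedule being reproduced is not written down in these notes, so the decay values below are purely illustrative; the point is that a Keras-style epoch -> factor rule maps directly onto torch.optim.lr_scheduler.LambdaLR.

    import torch
    from torch import nn, optim

    model = nn.Linear(10, 2)                                  # placeholder model
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Illustrative step decay: lr = lr0 * drop ** floor(epoch / step)
    drop, step = 0.5, 10
    scheduler = optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: drop ** (epoch // step))

    for epoch in range(30):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10)).pow(2).mean()       # dummy loss standing in for the real one
        loss.backward()
        optimizer.step()
        scheduler.step()                                      # advance the schedule once per epoch
        print(epoch, scheduler.get_last_lr())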
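
MobileNet sketch referenced above: swapping the stem convolution of torchvision's MobileNetV2 so it takes 1-channel MNIST images instead of 3-channel RGB. The features[0][0] index matches current torchvision; treat it as an assumption to check against the installed version.

    import torch
    from torch import nn
    from torchvision import models

    model = models.mobilenet_v2(num_classes=10)     # 10 classes for MNIST

    # Replace the stem Conv2d so the network accepts grayscale (1-channel) input.
    old = model.features[0][0]                      # first Conv2d of the stem
    model.features[0][0] = nn.Conv2d(1, old.out_channels,
                                     kernel_size=old.kernel_size,
                                     stride=old.stride,
                                     padding=old.padding,
                                     bias=False)

    x = torch.randn(8, 1, 96, 96)                   # dummy grayscale batch
    print(model(x).shape)                           # torch.Size([8, 10])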

MobileNetV2 DSD:

  • 8.9 MB (8,903,688 bytes)

MobileNetV2:

  • 8.9 MB (8,903,268 bytes)

DSD Experiments (blog post)

  • dataset FER+2013

1) Naive

  • 4 runs:
    • sgd
    • sgd-dsd
    • adam
    • adam-dsd

Previous conclusion: same performance with or without DSD

2) Going further

  • Hypothesis: when deploying/packaging, is it better to keep DSD over the baseline? (Since it has more weights at 0 -> lighter / less data to transfer over the network.)
    • Compare without quantization baseline/DSD (high priority)
    • Compare with quantization baseline/DSD (low priority)

    For each case, report quality (val loss | F1-score) and size (MB) indicators.
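
A minimal sketch of the size part of this comparison, assuming we just zip the saved state_dict (DEFLATE) with and without dynamic quantization. The tiny two-layer model and the 30% zero rate only stand in for the real MobileNetV2/VGG16 checkpoints; the quality indicators (val loss | F1-score) would come from the usual evaluation loop and are not sketched here.

    import os, zipfile
    import torch
    from torch import nn

    def zipped_size_mb(model, tag):
        """Save the state_dict, zip it (DEFLATE) and return the compressed size in MB."""
        raw = f"{tag}.pt"
        torch.save(model.state_dict(), raw)
        with zipfile.ZipFile(f"{tag}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(raw)
        return os.path.getsize(f"{tag}.zip") / 1e6

    baseline = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    dsd = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    with torch.no_grad():                           # fake the DSD effect: ~30% of weights exactly 0
        for p in dsd.parameters():
            p.mul_((torch.rand_like(p) > 0.3).float())

    for tag, m in [("baseline", baseline), ("dsd", dsd)]:
        q = torch.quantization.quantize_dynamic(m, {nn.Linear}, dtype=torch.qint8)
        print(f"{tag}: {zipped_size_mb(m, tag):.2f} MB zipped | "
              f"{zipped_size_mb(q, tag + '_quant'):.2f} MB quantized+zipped")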


TALK

  • Recap pipeline
  • Previously done
  • Current hypothesis
  • Paper recap
    • Goal
    • Pros/Cons
  • Results
    • Enumerate settings of training
    • Explain that we did not manage to make VGG16 converge with Adam.
    • Compare val_[loss/acc] of VGG16/MobileNetV2 [sgd/sgd-dsd] + [adam/adam-dsd]
      • Conclusion: no gain in accuracy
    • Compare MobileNetV2 val_acc in a 2x2 matrix: sgd/sgd-dsd vs adam/adam-dsd
      • Conclusion: better to use Adam.
    • Compare train/val loss/acc of MobilenetV2 [sgd/sgd-dsd]
      • Conclusion: DSD acts as a form of regularization
    • After quantization, there is a gain in zipped file size (13%)
      • Mobilenet Normal/DSD -> zip -> compare file size
      • Mobilenet Normal/DSD -> quantization -> zip -> compare file size
      • Conclusion: with quantization, DSD offers a gain in storage size.
  • Conclusion
  • Further work
    • Go back to an earlier stage of the pipeline
    • Very large dataset!

Recover mlruns/ folder

  • Go to 19-03-2021/

  • Depending on which framework you want, run:

    • virtualenv lrde-env-[pytorch|tf2] && source lrde-env-[pytorch|tf2]/bin/activate && pip install -r requirements-[pytorch|tf2].txt
  • If docker container container-lrde-19-03-2021 already exists:

    • sudo docker ps -a and copy CONTAINER_ID
    • sudo docker start CONTAINER_ID
  • Else:

    • Create container with mlruns/ folder
      • sudo docker pull 3outeille/lrde-2021:19-03-2021
      • sudo docker run -d --name container-lrde-19-03-2021 3outeille/lrde-2021:19-03-2021 tail -f /dev/null
  • sudo docker cp container-lrde-19-03-2021:/experiments/ .

  • Run ./recover_mlruns.sh [pytorch|tf2]

  • Stop docker container

    • sudo docker ps -a and copy CONTAINER_ID
    • sudo docker stop CONTAINER_ID
  • You can now use MLflow in your browser.

    • cd src/[pytorch|tf2] && mlflow ui
  • Download pytorch-mlruns to 19_3

  • Just clean all paths to make it work locally from 19_03_2021
    and build an image: /home/sphird/Document/19_03_2021/src/[tf2]