# DataOps Using DVC

Fu-Chun Hsu
Senior Data Scientist
Workforce Optimization Analytics
National Australia Bank

----

## Prerequisites

1. Git
2. Introductory machine learning or data science

---

### If your data project looks like this, you have to pay attention

![](https://miro.medium.com/max/2584/1*kIm20JjkUG1t8839WqVGsA.png)

----

### What is really happening

![](https://i.imgur.com/rSazziS.png)

----

### Collaboration Issues (Optional)

- Source code and data versioning
- Experiment time log
- **Navigating through experiments**
- Reproducibility
- Managing and sharing large data files (e.g. teammate A has a GPU but teammate B does not)

----

## Benefits of Data/Model VC

1. **Capture and save** data artifacts the same way as code,
2. **Track and switch** between different versions of the data easily,
3. Answer the question of how data artifacts (e.g. ML models) were built in the first place,
4. Compare those artifacts with each other,
5. Bring best practices to the team and get everyone on the same page.

----

### Now we are in sync

![](https://i.imgur.com/xFnsivt.png)

---

## DVC - Open-Source Data Version Control Tool

----

### Main Concept

![](https://i.imgur.com/85xqgMJ.png)

----

### A Gift for Data Science Teams

![](https://i.imgur.com/2I35Lud.png)

----

### It works like Git

![](https://i.imgur.com/gsizqfj.png)

> DVC uses a command structure similar to Git's.

----

### Storage places of DVC

- Local disk
- SSH server
- Cloud storage (S3, GCP)

![](https://i.imgur.com/GV77kf8.png)

----

### Bi-version control architecture of DVC

![](https://i.imgur.com/fX1Pgur.png)

### Reproducing the ML process

![](https://i.imgur.com/9Od2w1r.png)

1. `-d` defines *dependencies*: here, an input file and a Python script
2. `-o` records output files: here, an output data directory
3. The executed command is a Python script

----

### A typical scenario

1. If prepare.py is changed -> SCM (Git) will track the change
2. If data.xml is changed -> DVC will track the change

---

## DVC Workflows

----

### Init

```bash
$ dvc init
Adding '.dvc/state' to '.dvc/.gitignore'.
Adding '.dvc/lock' to '.dvc/.gitignore'.
Adding '.dvc/config.local' to '.dvc/.gitignore'.
Adding '.dvc/updater' to '.dvc/.gitignore'.
Adding '.dvc/updater.lock' to '.dvc/.gitignore'.
Adding '.dvc/state-journal' to '.dvc/.gitignore'.
Adding '.dvc/state-wal' to '.dvc/.gitignore'.
Adding '.dvc/cache' to '.dvc/.gitignore'.

You can now commit the changes to git.
```

----

```bash
$ git status
        new file:   .dvc/.gitignore
        new file:   .dvc/config
```

> The DVC config file is therefore tracked by Git.

----

### Add Data

```bash
# Track the dataset directory with DVC (creates data.dvc)
$ dvc add data
# Train the model, then track the resulting model file (creates model.h5.dvc)
$ python train.py
$ dvc add model.h5
```

----

### Commit and tag through Git

```bash
# Git stores only the small .dvc pointer files, not the data itself
$ git add .gitignore model.h5.dvc data.dvc metrics.json
$ git commit -m "model first version, 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"
```

----

### Commit a second model

```bash
$ git add model.h5.dvc data.dvc metrics.json
$ git commit -m "model second version, 2000 images"
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
```

----

### Switching Model/Data

```bash
# Restore the code and .dvc files of v1.0, then sync data and model to match
$ git checkout v1.0
$ dvc checkout
```

> Keep the current code and go back to the previous dataset only

```bash
$ git checkout v1.0 data.dvc
$ dvc checkout data.dvc
```

----

### Run

![](https://i.imgur.com/F36Fgg1.png)

----

### Reproduce a pipeline with respect to its dependencies

![](https://i.imgur.com/IBUkBVT.png)

----

### Reproducible and shareable through any Git repository

![](https://i.imgur.com/dW8hQQc.png)

----

### Monitoring

```bash
# Show metrics across all Git tags
$ dvc metrics show -T
baseline-experiment:
    auc.metric: 0.588426
bigram-experiment:
    auc.metric: 0.602818
```

---

## Other features

----

### DVC uses MD5 hashes to index and store files

![](https://i.imgur.com/bFsvlzg.png)

---

## More resources

[Video Resources](https://dvc.org/doc/understanding-dvc/resources)

---

## Thank You
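
---

### Appendix: MD5 indexing, sketched in Python

The slide on MD5 indexing says that files are indexed and stored by their hash. The sketch below illustrates that general idea (content-addressed storage) and is not DVC's actual implementation. The function names `file_md5` and `store_in_cache`, the cache location, and the two-character directory split are assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(path: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and return the cached path.

    Hypothetical layout: <cache_dir>/<first 2 hex chars>/<remaining 30 chars>.
    """
    md5 = file_md5(path)
    target = cache_dir / md5[:2] / md5[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return target


# Demo in a throwaway directory: two identical files and one changed file.
workdir = Path(tempfile.mkdtemp())
cache = workdir / "cache"
first = workdir / "data_v1.csv"
copy = workdir / "data_v1_copy.csv"
second = workdir / "data_v2.csv"
first.write_text("id,label\n1,cat\n")
copy.write_text("id,label\n1,cat\n")
second.write_text("id,label\n1,cat\n2,dog\n")

cached_first = store_in_cache(first, cache)
cached_copy = store_in_cache(copy, cache)      # same content, same cache entry
cached_second = store_in_cache(second, cache)  # new content, new cache entry
```

Because the cache path is derived from the content, two files with identical bytes share one cache entry, and any change to the data produces a new entry. A small pointer file holding the hash is then enough for Git to record which version of the data belongs to which commit.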