# Software Skills & Data Science

<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ericmjl/software-ds

---

## Who am I?

- 📍 Principal Data Scientist, DSAI, Moderna
- 🎓 ScD, MIT Biological Engineering
- 🧬 Inverse protein, mRNA, and small molecule design

---

## Lessons from work

1. 🤝 We collaborate.
2. 📦 Our stuff needs to be portable.

> Data science is no longer solo work, but teamwork.

The corollary is this: discipline enables better collaborative work.

---

## What do these lessons imply for _how_ we need to work?

----

### 🤝 Collaborative implies sharing

- Any teammate must be able to jump onto any other project easily.
- Highly standardized workflows, tools, and idioms.
- Zero ambiguity required when talking about data and code.

How do we enable these?

----

### 📄 Project structure templates

- All projects are initialized **identically**.
- Customizations happen afterwards.
- Common customizations are upstreamed.

----

### 🛠 PyDS-CLI

![](https://i.imgur.com/RfnZ6yJ.jpg)

----

### 🛠 CCDS

![](https://i.imgur.com/XA7FFqm.jpg)

----

### ✅ Truthy

- Code is versioned with git and packaged up.
- Data is versioned separately with git-like hashes.
- Analysis code always references a specific data hash.

----

### 0️⃣-ambiguity data loading

```python
from custom_source import load_data

df = load_data(commit="5j39fdm")
```

💯% reproducible.

----

### 👍 Single Source of Truth

Avoid this conversation:

> "Which version of that function were you talking about? The one in `Untitled12.ipynb`?"
> "No, the one in `Untitled13.ipynb`!"

----

Inside `custom_source/functions.py`:

```python
def that_function():
    return stuff
```

----

Inside `Untitled12.ipynb`:

```python
from custom_source.functions import that_function
```

Inside `Untitled13.ipynb`:

```python
from custom_source.functions import that_function
```

----

### ✅ Verify Correctness

If you modify a function that others depend on, it should still work for the existing use cases.

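
As a concrete illustration, here is a minimal sketch of such a regression test. The function `normalize_weights` and its behaviour are hypothetical stand-ins for a shared function in `custom_source/functions.py`; they are not from the original project. It is defined inline only so that the example runs on its own.

```python
# Hypothetical shared function (stand-in for something in custom_source).
def normalize_weights(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


# pytest collects any function whose name starts with `test_`.
def test_existing_use_case():
    # Behaviour that teammates already rely on must not change.
    assert normalize_weights([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_zero_total_is_rejected():
    # Pin down the failure mode too, so it stays explicit after refactors.
    try:
        normalize_weights([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero total")
```

If someone later changes `normalize_weights`, running `pytest` flags any break in the existing use cases. Stripped of specifics, the same pattern reads:
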
```python
import pytest
from custom_source.functions import that_function


def test_that_function():
    result = that_function()
    assert result == something_correct
```

----

### ✅ Verify Correctness

If you create, modify, and return dataframes, make sure that they follow expectations.

```python
import pandera as pa
from custom_source.schemas import this_schema


@pa.check_output(this_schema)
def load_data(commit):
    ...
    return df
```

----

### 💻 Portable

> But the code works on my system?!

But if it doesn't work on someone else's system...?

----

### 🧠 Complexity

😫 **Problem:** Most projects have a complex set of dependencies that aren't covered by one tool (e.g. `pip`).

----

### 🧠 Complexity

😇 **Solution:** Explicitly specify all dependencies via configuration files:

- `environment.yml` or `requirements.txt` for project dependencies.
- `Dockerfile` for system-level dependencies.

----

### 📦 Containers

Containers let you ship your dependency stack _explicitly_.

```Dockerfile
# Explicit version number!
# Standardize on some base image.
FROM condaforge/mambaforge:4.12.0-0

# Signal to next person that the project needs a `conda` environment.
COPY environment.yml /tmp/environment.yml

# Never deal with custom environment names in a Docker container.
# Always install to base.
RUN mamba env update -f /tmp/environment.yml -n base

# add additional steps below.
```

----

### ☁️ Cloud

- ✅ Treat your dev system like livestock 🐮.
- ❎ Don't treat your dev system like pets 🐶.
- ✅ Your laptop can be your pet 🐶 and a glorified chromebook.
- ❎ Your laptop shouldn't be your dev system.

Work on the cloud. Create/destroy instances at will. Learn how to recreate environments. That will force portability.

----

### 💯% Remote ☁️ Development

![](https://i.imgur.com/7zTayuL.jpg)

----

### 📒 Documentation

Good project documentation enables others to quickly gain context.

Your future self will thank you.
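
At the code level, documentation can start with docstrings. The following is a minimal sketch, not the project's actual implementation: the in-memory `_STORE` and the `save_data` helper are hypothetical, introduced only so that a documented `load_data(commit=...)` can run on its own. The deck does not show how the real `custom_source.load_data` works.

```python
import hashlib

# Hypothetical in-memory data store: short content hash -> raw bytes.
_STORE = {}


def save_data(raw: bytes) -> str:
    """Store ``raw`` and return a short, git-like content hash for it."""
    digest = hashlib.sha1(raw).hexdigest()[:7]
    _STORE[digest] = raw
    return digest


def load_data(commit: str) -> bytes:
    """Load the exact data snapshot identified by ``commit``.

    Parameters
    ----------
    commit
        Short content hash, as returned by ``save_data``.

    Returns
    -------
    bytes
        The stored bytes, unchanged.

    Raises
    ------
    KeyError
        If no snapshot with that hash exists.
    """
    return _STORE[commit]
```

A teammate can then run `help(load_data)` and learn what `commit` means without opening the source.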
----

### 📒 Documentation

![](https://diataxis.fr/_images/diataxis.png)

<small>Image credit: diataxis.fr</small>

---

## 🧠 Mental Models

1. Standardize practices and tooling.
2. Test your code.
3. Document your work.
4. Reference single & explicit sources of truth.

These will enable you to collaborate effectively and ship your work productively.

---

## Thank you

---

## Resources

- [Data Science Bootstrap Notes](https://ericmjl.github.io/data-science-bootstrap-notes/get-bootstrapped-on-your-data-science-projects/)
- [Everything gets a package](https://ericmjl.github.io/blog/2022/3/31/everything-gets-a-package-yes-everything-gets-a-package/)
- [Python Packages](https://py-pkgs.org)
- [Big Book of Python](https://www.bigbookofpython.com)
- [Diataxis Framework](https://diataxis.fr)