Software Skills & Data Science

slide: https://hackmd.io/@ericmjl/software-ds


Who am I?

  • 📍 Principal Data Scientist, DSAI, Moderna
  • 🎓 ScD, MIT Biological Engineering.
  • 🧬 Inverse protein, mRNA, and small molecule design

Lessons from work

  1. 🤝 We collaborate.
  2. 📦 Our stuff needs to be portable.

Data science is no longer solo work, but teamwork. The corollary is this: discipline enables better collaborative work.


What do these lessons imply for how we need to work?


🤝 Collaborative implies sharing

  • Any teammate must be able to jump onto any other project easily.
  • Highly standardized workflows, tools, and idioms.
  • Zero ambiguity required when talking about data and code.

How do we enable these?


📄 Project structure templates

  • All projects are initialized identically.
  • Customizations happen afterwards.
  • Common customizations are upstreamed.

🛠 PyDS-CLI


🛠 CCDS


✅ Truthy

  • Code is versioned with git and packaged up.
  • Data is versioned separately with git-like hashes.
  • Analysis code always references a specific data hash.

0️⃣-ambiguity data loading

from custom_source import load_data 

df = load_data(commit="5j39fdm")

💯% reproducible.
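
One way to picture hash-addressed data loading is a small content-addressed store. This is only a sketch under stated assumptions: `save_data`, the `store` directory argument, and the 12-character SHA-256 prefix are all hypothetical, and the real `custom_source.load_data` may work quite differently.

```python
import hashlib
import tempfile
from pathlib import Path


def save_data(raw: bytes, store: Path) -> str:
    """Write raw bytes into the store and return a short content hash."""
    digest = hashlib.sha256(raw).hexdigest()[:12]
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(raw)
    return digest


def load_data(commit: str, store: Path) -> bytes:
    """Load the exact bytes saved under `commit`, verifying integrity."""
    raw = (store / commit).read_bytes()
    if hashlib.sha256(raw).hexdigest()[:12] != commit:
        raise ValueError(f"Data for {commit} does not match its hash")
    return raw


# Hypothetical usage: two versions of a dataset get two different hashes.
store = Path(tempfile.mkdtemp()) / "data_store"
v1 = save_data(b"id,value\n1,10\n", store)
v2 = save_data(b"id,value\n1,10\n2,20\n", store)
```

Because the identifier is derived from the content, an analysis that names `v1` can never silently pick up the rows added in `v2`.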


👍 Single Source of Truth

Avoid this conversation:

"Which version of that function were you talking about? The one in Untitled12.ipynb?"

"No, the one in Untitled13.ipynb!"


Inside custom_source/functions.py:

def that_function():
    return stuff

Inside Untitled12.ipynb:

from custom_source.functions import that_function

Inside Untitled13.ipynb:

from custom_source.functions import that_function

✅ Verify Correctness

If you modify a function that others depend on, it should still work for the existing use cases.

import pytest 
from custom_source.functions import that_function 

def test_that_function():
    result = that_function()
    assert result == something_correct 
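
A more concrete version of the same pattern might look like the sketch below. `normalize_sequence` is a hypothetical shared helper invented for illustration; the test is a plain function with `assert` statements, which pytest collects by its `test_` prefix.

```python
def normalize_sequence(seq: str) -> str:
    """Hypothetical shared helper: strip whitespace and upper-case a sequence."""
    return "".join(seq.split()).upper()


def test_normalize_sequence_existing_use_cases():
    # Behaviour that teammates' notebooks are assumed to depend on already.
    assert normalize_sequence("acgu") == "ACGU"
    assert normalize_sequence(" ac gu\n") == "ACGU"
    assert normalize_sequence("") == ""
```

If someone later changes `normalize_sequence`, this test fails as soon as an existing use case breaks.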

✅ Verify Correctness

If you create, modify, and return dataframes, make sure that they follow expectations.

import pandera as pa
from custom_source.schemas import this_schema 

@pa.check_output(this_schema)
def load_data(commit):
    ...
    return df
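
The idea behind an output-checking decorator can be sketched in plain Python. This is a simplified stand-in, not pandera's implementation: it validates rows held as a list of dicts rather than a dataframe, and `THIS_SCHEMA`, `check_output`, and `load_bad_data` are hypothetical names used only for illustration.

```python
from functools import wraps

# Hypothetical schema: column name -> expected Python type.
THIS_SCHEMA = {"sample_id": str, "concentration": float}


def check_output(schema):
    """Decorator that validates every returned row against `schema`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            for i, row in enumerate(rows):
                if set(row) != set(schema):
                    raise ValueError(
                        f"row {i}: columns {sorted(row)} != {sorted(schema)}"
                    )
                for column, expected in schema.items():
                    if not isinstance(row[column], expected):
                        raise TypeError(
                            f"row {i}: {column!r} is not {expected.__name__}"
                        )
            return rows
        return wrapper
    return decorator


@check_output(THIS_SCHEMA)
def load_data(commit):
    # Stand-in for a real loader; returns rows as a list of dicts.
    return [{"sample_id": "s1", "concentration": 0.5}]


@check_output(THIS_SCHEMA)
def load_bad_data(commit):
    # Deliberately wrong: concentration should be a float.
    return [{"sample_id": "s1", "concentration": "high"}]
```

Calling `load_data("5j39fdm")` returns its rows unchanged, while `load_bad_data("5j39fdm")` raises a `TypeError` before the bad data reaches any downstream analysis.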

💻 Portable

But the code works on my system?!

But what if it doesn't work on someone else's system?


🧠 Complexity

😫 Problem: most projects have a complex set of dependencies that aren't covered by one tool (e.g. pip).


🧠 Complexity

😇 Solution: Explicitly specify all dependencies via configuration files:

  • environment.yml or requirements.txt for project dependencies
  • Dockerfile for system-level dependencies.
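
One way to keep a requirements.txt explicit is to check that every dependency pins an exact version. The sketch below is a deliberate simplification: it only recognizes `name==version` lines, so extras with markers, URLs, and options such as `-r` would be reported as unpinned. `unpinned_requirements` and the example file contents are hypothetical.

```python
import re
from typing import List

# Matches lines like "pandas==1.4.2"; anything looser counts as unpinned.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.!+]+$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that do not pin an exact version."""
    problems = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


# Hypothetical requirements.txt contents.
example = """
# project dependencies
pandas==1.4.2
numpy>=1.20
pytest
"""

problems = unpinned_requirements(example)
```

A check like this could run in continuous integration so that a loosely specified dependency is caught before it reaches a teammate's machine.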

📦 Containers

Containers let you ship your dependency stack explicitly.

# Explicit version number!
# Standardize on some base image.
FROM condaforge/mambaforge:4.12.0-0

# Signal to next person that the project needs a `conda` environment.
COPY environment.yml /tmp/environment.yml
# Never deal with custom environment names in a Docker container.
# Always install to base.
RUN mamba env update -f /tmp/environment.yml -n base

# Add additional steps below.

☁️ Cloud

  • ✅ Treat your dev system like livestock 🐮.
  • ❎ Don't treat your dev system like pets 🐶.
  • ✅ Your laptop can be your pet 🐶 and a glorified chromebook.
  • ❎ Your laptop shouldn't be your dev system.

Work in the cloud. Create and destroy instances at will. Learn how to recreate environments. That will force portability.


💯% Remote ☁️ Development


📒 Documentation

Good project documentation enables others to quickly gain context.

Your future self will thank you.


📒 Documentation

Image credit: diataxis.fr


🧠 Mental Models

  1. Standardize practices and tooling.
  2. Test your code.
  3. Document your work.
  4. Reference single & explicit sources of truth.

These will enable you to collaborate effectively and ship your work productively.


Thank you


Resources
