# Software Skills & Data Science
slide: https://hackmd.io/@ericmjl/software-ds
---
## Who am I?
- 📍 Principal Data Scientist, DSAI, Moderna
- 🎓 ScD, MIT Biological Engineering.
- 🧬 Inverse protein, mRNA, and small molecule design
---
## Lessons from work
1. 🤝 We collaborate.
2. 📦 Our stuff needs to be portable.
> Data science is no longer solo work, but teamwork. The corollary is this: discipline enables better collaborative work.
---
## What do these lessons imply for _how_ we need to work?
----
### 🤝 Collaborative implies sharing
- Any teammate must be able to jump onto any other project easily.
- Highly standardized workflows, tools, and idioms.
- Zero ambiguity required when talking about data and code.
How do we enable these?
----
### 📄 Project structure templates
- All projects are initialized **identically**.
- Customizations happen afterwards.
- Common customizations are upstreamed.
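
Identical initialization can be made concrete with a small scaffolding function. This is only a minimal sketch, not how PyDS-CLI or CCDS work internally; the directory names, file names, and the `scaffold_project` helper are all hypothetical.

```python
from pathlib import Path

# Hypothetical standard layout; a real template will likely differ.
STANDARD_DIRS = ("data", "docs", "notebooks", "src", "tests")
STANDARD_FILES = ("README.md", "environment.yml", "pyproject.toml")


def scaffold_project(root: Path) -> list[str]:
    """Create the same skeleton for every new project and return its entries."""
    root.mkdir(parents=True, exist_ok=True)
    for name in STANDARD_DIRS:
        (root / name).mkdir(exist_ok=True)
    for name in STANDARD_FILES:
        (root / name).touch(exist_ok=True)
    # Sorted so that two freshly scaffolded projects compare equal.
    return sorted(p.name for p in root.iterdir())
```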
----
### 🛠 PyDS-CLI
![](https://i.imgur.com/RfnZ6yJ.jpg)
----
### 🛠 CCDS
![](https://i.imgur.com/XA7FFqm.jpg)
----
### ✅ Truthy
- Code is versioned with git and packaged up.
- Data is versioned separately with git-like hashes.
- Analysis code always references a specific data hash.
----
### 0️⃣-ambiguity data loading
```python
from custom_source import load_data
df = load_data(commit="5j39fdm")
```
💯% reproducible.
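
A minimal sketch of how hash-pinned loading could work, assuming each data file is stored under the SHA-256 digest of its content. The `DATA_STORE` directory and the `save_versioned` and `load_versioned` helpers are hypothetical and return raw bytes rather than a dataframe; they are not the implementation behind `custom_source`.

```python
import hashlib
from pathlib import Path

DATA_STORE = Path("data_store")  # hypothetical location for versioned data


def save_versioned(content: bytes, store: Path = DATA_STORE) -> str:
    """Store content under its SHA-256 digest and return that digest."""
    digest = hashlib.sha256(content).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(content)
    return digest


def load_versioned(commit: str, store: Path = DATA_STORE) -> bytes:
    """Load the exact bytes recorded under `commit`, verifying integrity."""
    content = (store / commit).read_bytes()
    if hashlib.sha256(content).hexdigest() != commit:
        raise ValueError(f"Data under {commit} does not match its hash")
    return content
```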
----
### 👍 Single Source of Truth
Avoid this conversation:
> "Which version of that function were you talking about? The one in `Untitled12.ipynb`?"
> "No, the one in `Untitled13.ipynb`!"
----
Inside `custom_source/functions.py`:
```python
def that_function():
    stuff = ...  # placeholder for the real computation
    return stuff
```
----
Inside `Untitled12.ipynb`:
```python
from custom_source.functions import that_function
```
Inside `Untitled13.ipynb`:
```python
from custom_source.functions import that_function
```
----
### ✅ Verify Correctness
If you modify a function that others depend on, it should still work for the existing use cases.
```python
# Collected and run by pytest.
from custom_source.functions import that_function


def test_that_function():
    something_correct = ...  # placeholder for the known-good value
    result = that_function()
    assert result == something_correct
```
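As a concrete illustration, suppose the shared function were a small normalization helper. The `normalize` function and its expected values are hypothetical; the point is that each existing use case is pinned down by a test, so a later change that breaks one fails loudly.

```python
def normalize(values: list[float]) -> list[float]:
    """Scale values so they sum to 1 (hypothetical stand-in for a shared function)."""
    total = sum(values)
    if total == 0:
        raise ValueError("Cannot normalize values that sum to zero")
    return [v / total for v in values]


def test_normalize_known_case():
    # An existing use case that must keep working after any refactor.
    assert normalize([1.0, 1.0, 2.0]) == [0.25, 0.25, 0.5]


def test_normalize_rejects_zero_total():
    try:
        normalize([0.0, 0.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for zero total")
```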
----
### ✅ Verify Correctness
If you create, modify, and return dataframes, make sure that they follow expectations.
```python
import pandera as pa
from custom_source.schemas import this_schema
@pa.check_output(this_schema)
def load_data(commit):
    ...  # build `df` from the data pinned to `commit`
    return df
```
```
----
### 💻 Portable
> But the code works on my system?!
But what if it doesn't work on someone else's system?
----
### 🧠 Complexity
😫 **Problem:** most projects have a complex set of dependencies that aren't covered by one tool (e.g. `pip`).
----
### 🧠 Complexity
😇 **Solution:** Explicitly specify all dependencies via configuration files:
- `environment.yml` or `requirements.txt` for project dependencies
- `Dockerfile` for system-level dependencies.
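
One way to keep a `requirements.txt` explicit is to flag entries without an exact version pin. The sketch below is deliberately simplistic and assumes plain `name==version` lines; the `unpinned_requirements` helper is hypothetical and does not handle every form that `pip` accepts.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact `==` version pin."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


example = "pandas==2.2.2\nnumpy>=1.26  # too loose\nscipy\n"
print(unpinned_requirements(example))  # → ['numpy>=1.26', 'scipy']
```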
----
### 📦 Containers
Containers let you ship your dependency stack _explicitly_.
```Dockerfile
# Explicit version number!
# Standardize on some base image.
FROM condaforge/mambaforge:4.12.0-0
# Signal to next person that the project needs a `conda` environment.
COPY environment.yml /tmp/environment.yml
# Never deal with custom environment names in a Docker container.
# Always install to base.
RUN mamba env update -f /tmp/environment.yml -n base
# Add additional steps below.
```
----
### ☁️ Cloud
- ✅ Treat your dev system like livestock 🐮.
- ❎ Don't treat your dev system like pets 🐶.
- ✅ Your laptop can be your pet 🐶 and a glorified Chromebook.
- ❎ Your laptop shouldn't be your dev system.
Work on the cloud. Create/destroy instances at will. Learn how to recreate environments. That will force portability.
----
### 💯% Remote ☁️ Development
![](https://i.imgur.com/7zTayuL.jpg)
----
### 📒 Documentation
Good project documentation enables others to quickly gain context.
Your future self will thank you.
----
### 📒 Documentation
![](https://diataxis.fr/_images/diataxis.png)
<small>Image credit: diataxis.fr</small>
---
## 🧠 Mental Models
1. Standardize practices and tooling.
2. Test your code.
3. Document your work.
4. Reference single & explicit sources of truth.
These will enable you to collaborate effectively and ship your work productively.
---
## Thank you
---
## Resources
- [Data Science Bootstrap Notes](https://ericmjl.github.io/data-science-bootstrap-notes/get-bootstrapped-on-your-data-science-projects/)
- [Everything gets a package](https://ericmjl.github.io/blog/2022/3/31/everything-gets-a-package-yes-everything-gets-a-package/)
- [Python Packages](https://py-pkgs.org)
- [Big Book of Python](https://www.bigbookofpython.com)
- [Diataxis Framework](https://diataxis.fr)