---
tags: oss, mybinder
---
# IPFS & mybinder.org
## What is mybinder.org?
MyBinder is a bit like Heroku, but for Jupyter Notebooks & other narrative executable documents. You have a git repository with your notebooks / code / environment specifications, and Binder builds & launches these so that any user can explore them interactively.
For example, going to [this binder link](https://beta.mybinder.org/v2/gh/minrk/ligo-binder/master?filepath=index.ipynb) will open an interactive executable version of the notebooks in [this repository](https://github.com/minrk/ligo-binder).
## Problem
Git is great for storing code and notebooks, but terrible for storing large datasets. A lot of narrative documents are explorations of data, and that data is often large. For such documents to be fully reproducible, there should be an easy way to distribute the data alongside the code.
## How can [IPFS](https://ipfs.io/) help?
For our purposes, IPFS is a distributed file system that lets us distribute data easily & refer to it easily.
The easiest way to explore this is with code examples.
Imagine you have a 200MB CSV file called `data.csv` that your notebook needs for its analysis. This is a public dataset that you got from figshare or a similar provider. Along with the DOI, the provider gives you an *IPFS hash* of this dataset: an immutable, content-derived reference to it.
You can read this file simply with:
```python
import csv

filepath = "/ipfs/Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ/data.csv"

# The /ipfs mount resolves the hash to the file's contents on first access.
with open(filepath) as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```
That's it! The first time this runs on an IPFS-enabled system, it'll reach out and fetch the files referred to by that hash (using P2P techniques similar to BitTorrent). So the same code works on your laptop and on various Binders!
This requires the IPFS daemon to be running on your computer, which isn't too hard to set up. If you don't want to do that, you can still acquire the data by making an HTTP request to `https://gateway.ipfs.io/ipfs/Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ/data.csv` (or any other *IPFS gateway*).
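For example, here is a minimal sketch of fetching the file through a gateway with the `requests` package (the package choice is my assumption; any HTTP client works):

```python
import requests

# Any public IPFS gateway can serve content by hash; no local daemon needed.
url = ("https://gateway.ipfs.io/ipfs/"
       "Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ/data.csv")

# Stream the download so a 200MB file isn't held in memory all at once.
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("data.csv", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
```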
This works not just for individual files, but for whole directories as well; for example, this [XKCD archive](https://ipfs.io/ipfs/Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ) and [arxiv](https://ipfs.io/ipfs/QmfXH9XtP7xmoTH8WAp4HNSduqWMwLTH8B8TvbTkdgzNAa). There is also a [Python client library](https://github.com/ipfs/py-ipfs-api) and a REST API you can use directly.
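As a rough sketch of what the client library looks like in use (assuming the `ipfsapi` package from that repository, with a local daemon on the default API port):

```python
import ipfsapi

# Connect to a locally running IPFS daemon (the API listens on port 5001 by default).
api = ipfsapi.connect('127.0.0.1', 5001)

# cat() returns the raw bytes behind a hash (or hash/path, as assumed here).
data = api.cat('Qmb8wsGZNXt5VXZh1pEmYynjB6Euqpq3HYyeAdw2vScTkQ/data.csv')
print(len(data), 'bytes')
```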
## How do we make this happen?
IPFS feels like an excellent fit for our use case! However, a lot of work needs to be done before it can be generically useful. The biggest challenge is that IPFS requires the data to be hosted in at least one place *somewhere* (*pinned*, in IPFS terminology, similar to *seeding* in BitTorrent; see the sketch after this list). Here are some steps we need to take to make this happen:
1. Bootstrap a data hosting solution by hosting some of this data ourselves (ideally with the help of the IPFS team) in an accessible way. This, plus Binder integration, will be an existence proof that helps convince providers to adopt IPFS.
 * In the long run, data publishers (such as figshare) need to start providing IPFS links; until they do, this won't be as useful. This is a long-term goal, since Binder can't be the only one hosting these datasets.
2. Create / find better GUI methods of getting IPFS set up on local computers (Linux, macOS, and Windows).
3. Integrate IPFS (especially via the `/ipfs` filesystem mount) into `mybinder.org`, so folks can just start using it there.
4. Create example analysis / repositories showing users how to use this.
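To make step 1 concrete, here is a hedged sketch of what hosting (pinning) a dataset looks like, again assuming the `ipfsapi` package: adding a file to a node stores and pins it there, and the returned hash is derived from the content, so anyone adding the same file gets the same reference.

```python
import ipfsapi

api = ipfsapi.connect('127.0.0.1', 5001)

# Adding a file stores it on this node and pins it by default,
# so the node keeps serving it to the rest of the network.
result = api.add('data.csv')
print(result['Hash'])  # the immutable, content-derived reference
```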
## Who can help make this happen?
Eventually we need to convince the data libraries of the world. Until then, we can blaze a trail with help from the Jupyter/Binder folks and the IPFS folks :)