For further information see https://github.com/jupyterhub/team-compass/issues/277
# Exploratory Survey Paper on Projects using MyBinder
## Introduction
Motivation / relevance of this work.
Survey of related work.
### Research Questions
* How reproducible are repositories with binder specifications?
* What are the patterns of usage?
* What makes a repository successful?
* What are best practices?
## Methodology
## Analysis
### Description of the MyBinder.org Events Archive dataset:
+ Statistics about content providers (proportion of repos & proportion of launches)
+ Statistics about programming languages
+ Statistics about recognized repo2docker environment files
+ Statistics about URL parameters in badges: which environments are used (JupyterLab, RStudio, ...) and are specific files targeted?
+ Cleaning of the dataset, e.g. how to deal with forks?
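To make this concrete, here is a minimal sketch that computes the proportion of launches and of distinct repos per content provider. It assumes the public archive layout at https://archive.analytics.mybinder.org/ (one JSON-lines file per day) and event fields named `provider` and `spec`; the field names should be verified against the archive's schema.

```python
import pandas as pd

DAY = "2019-06-01"  # example date, chosen arbitrarily
url = f"https://archive.analytics.mybinder.org/events-{DAY}.jsonl"

# Each line of the archive file is one launch event (assumed fields: provider, spec).
events = pd.read_json(url, lines=True)

# Proportion of launches per content provider (GitHub, GitLab, Gist, Zenodo, ...).
launch_share = events["provider"].value_counts(normalize=True)

# Proportion of distinct repositories per provider (spec is roughly "owner/repo/ref").
repo_share = (
    events.drop_duplicates(subset=["provider", "spec"])["provider"]
    .value_counts(normalize=True)
)

print(pd.DataFrame({"launches": launch_share, "repos": repo_share}))
```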
### How reproducible are repositories?
Estimate p(still builds or launches | has launched previously).
Q: Is this section redundant or problematic, since the whole implicit premise is that this probability is (close to) 1?
Tim thinks that this will be interesting. The goal is p = 1, but we will be far off that. I think defining the denominator will be an interesting thing to think about: if the denominator includes all repos with >= 1 successful launch, p will be very low. If you require n_launches > 10, p will be higher, and at n_launches > 200 it will be higher again.
> [name=min] I think p(builds | has built before) is more interesting as time gets longer, since we do things like bump default Python version in repo2docker, which regularly have consequences for repos that pin e.g. numpy and pandas but not Python. This is guaranteed to break on default Python updates eventually. I think it's also important to test with fresh builds with latest repo2docker, since builds in the mybinder.org image cache can't be assumed to live forever. This is the kind of thing Vilde is exploring. Selecting for repos last built before the default Python bump, or specifically building commits of repos from before then is interesting.
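One possible way to estimate this, sketched below under assumptions: rebuild a sample of previously launched repos with a fresh repo2docker (`jupyter-repo2docker --no-run` builds without starting a container, `--ref` pins a commit) and report the success rate for different launch-count thresholds. The input table and its column names (`repo_url`, `ref`, `n_launches`) are hypothetical placeholders that would be derived from the events archive.

```python
import subprocess

import pandas as pd

# Hypothetical input: one row per repo with >= 1 successful launch,
# derived from the events archive (column names are placeholders).
repos = pd.read_csv("repos.csv")  # columns: repo_url, ref, n_launches


def builds(repo_url: str, ref: str, timeout: int = 1800) -> bool:
    """Return True if a fresh repo2docker can build the repo at the given ref."""
    try:
        result = subprocess.run(
            ["jupyter-repo2docker", "--no-run", "--ref", ref, repo_url],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


repos["still_builds"] = [builds(r.repo_url, r.ref) for r in repos.itertuples()]

# The estimate depends strongly on the denominator (minimum launch count).
for threshold in (1, 10, 200):
    subset = repos[repos.n_launches >= threshold]
    print(threshold, len(subset), subset.still_builds.mean())
```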
An even harder question is to require not only that the repo builds, but that the built image is "useful". Automatically determining whether an image is "useful" may be unsolvable in general.
> [name=min] unsolvable in general, but there might be some heuristics that make a decent "guess", e.g.
>
> - are there notebooks?
> - do they run to completion without errors? Strong indicator of success
> - do they complete *to some degree*? Weaker indicator of success,
> since some notebooks might require user input. Classification
> of errors might give some indication of:
> - environment is wrong (repo2docker issue or bad spec)
> - notebook just has bugs
> - notebook expects user modification/input
> - notebook includes deliberate demonstration of errors
> - hard crash probably means resource issues
> - do imports fail? Strong indicator of failure.
>
> this gives us some decent buckets of "probably works", "might not work because X", and "unable to test" for e.g. repos without notebooks.
>
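A rough sketch of these heuristics follows. The bucket names, the error classification, and the use of nbclient to execute notebooks are illustrative choices, not a settled method; in practice the notebooks would have to be executed inside (or against) the built image.

```python
from pathlib import Path

import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError


def classify_repo(repo_dir: str) -> str:
    """Bucket a checked-out repo by whether its notebooks run to completion."""
    notebooks = list(Path(repo_dir).rglob("*.ipynb"))
    if not notebooks:
        return "unable to test (no notebooks)"

    outcomes = []
    for path in notebooks:
        nb = nbformat.read(path, as_version=4)
        try:
            NotebookClient(nb, timeout=600).execute()
            outcomes.append("ok")
        except CellExecutionError as err:
            # Failing imports strongly suggest a broken environment; other
            # errors may be notebook bugs, expected user input, or deliberate
            # demonstrations of errors.
            if "ModuleNotFoundError" in str(err) or "ImportError" in str(err):
                outcomes.append("environment broken")
            else:
                outcomes.append("runtime error")

    if all(o == "ok" for o in outcomes):
        return "probably works"
    if any(o == "environment broken" for o in outcomes):
        return "might not work (environment issue)"
    return "might not work (notebook errors)"
```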
See also [Exploring Code Health in a Corpus of Jupyter Notebooks](https://nbgallery.github.io/health_paper.html) and the discussion at https://discourse.jupyter.org/t/creating-a-future-infrastructure-for-notebooks-to-be-submitted-and-peer-reviewed/3534/12.
### Patterns of temporal usage:
Can we distinguish one-time workshop repositories from more permanently used ones?
Known use styles:
- reproducible/interactive publications (one-time publication)
- short-term workshops (sometimes once, some repeated)
- interactive documentation (live, kept up-to-date)
- kernel-only (e.g. spacy, thebe)
Can we classify them based on use and/or contents?
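One possible starting point, sketched below: compute per-repo daily launch counts and apply crude rules (a short active span, or a high concentration of launches in a few days, suggests a one-time workshop or publication). The thresholds are arbitrary placeholders and the class labels are illustrative.

```python
import pandas as pd


def usage_pattern(events: pd.DataFrame) -> pd.Series:
    """Label each repo spec with a crude temporal-usage class."""
    events = events.copy()
    events["day"] = pd.to_datetime(events["timestamp"]).dt.floor("D")
    daily = (
        events.groupby(["spec", "day"]).size().rename("launches").reset_index()
    )

    def classify(group: pd.DataFrame) -> str:
        span_days = (group["day"].max() - group["day"].min()).days + 1
        active_days = group["day"].nunique()
        # Share of all launches that fall on the three busiest days.
        peak_share = group["launches"].nlargest(3).sum() / group["launches"].sum()
        if span_days <= 7 or peak_share > 0.8:
            return "one-time workshop / publication"
        if active_days / span_days > 0.3:
            return "sustained use (documentation, kernel-only, ...)"
        return "intermittent (repeated workshops?)"

    return daily.groupby("spec").apply(classify)
```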
### Successful (with many launches) repositories:
+ have many commits?
+ many stars or clones on GitHub?
+ include Markdown cells and plots?
+ their authors have many other repositories?
+ their authors have many followers?
Q: How do we deal with outliers such as the repositories behind try.jupyter.org?
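Testing these hypotheses would require joining launch counts with GitHub metadata. Below is a sketch using the GitHub REST API to fetch stars, forks, and owner follower counts; unauthenticated requests are heavily rate-limited, so a token would be needed for a full crawl.

```python
from typing import Optional

import requests


def github_metadata(spec: str, token: Optional[str] = None) -> dict:
    """Fetch stars, forks, and owner follower count for a launch spec of the
    form "owner/repo/ref" (the ref may itself contain slashes)."""
    owner, repo, _ref = spec.split("/", 2)
    headers = {"Authorization": f"token {token}"} if token else {}

    repo_info = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}", headers=headers
    ).json()
    owner_info = requests.get(
        f"https://api.github.com/users/{owner}", headers=headers
    ).json()

    return {
        "stars": repo_info.get("stargazers_count"),
        "forks": repo_info.get("forks_count"),
        "owner_followers": owner_info.get("followers"),
    }


# Example call with an illustrative spec:
# github_metadata("ipython/ipython-in-depth/master")
```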
## Discussion
### Shortcomings
## References
* Apply natural language processing to analyse responses to the mybinder.org user survey: https://github.com/sgibson91/mybinder.org-user-survey-nlp (via @sgibson91)
* Pimentel, João Felipe, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. "A large-scale study about quality and reproducibility of Jupyter notebooks." In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 507-517. IEEE, 2019.
* Forde, J., Head, T., Holdgraf, C., Panda, Y., Nalvarte, G., Ragan-Kelley, B. and Sundell, E., 2018. Reproducible research environments with repo2docker.
* Adam Rule, Aurélien Tabard, and James D. Hollan. 2018. Exploration and Explanation in Computational Notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA, Paper 32, 1–12. DOI:https://doi.org/10.1145/3173574.3173606
* Moraila, G., Shankaran, A., Shi, Z. and Warren, A.M., 2014. Measuring reproducibility in computer systems research. Technical report, University of Arizona.
* Neglectos. 2018. A Preliminary Analysis on the Use of Python Notebooks. https://blog.bitergia.com/2018/04/02/a-preliminary-analysis-on-the-use-of-python-notebooks/
* Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M. and Damian, D., 2014, May. The promises and perils of mining GitHub. In Proceedings of the 11th working conference on mining software repositories (pp. 92-101).