Description

Let's face it: the overwhelming majority of current scientific code is siloed away into one-off scripts and notebooks, where the only real mechanism for reusing and building upon it is good old copy and paste. To keep "building upon the shoulders of giants", we need to achieve not only reproducibility of individual results, but also true reusability of research methods that can be shared, built upon, and deployed by users across the world.

At this BoF, we invite the community to share their tools and workflows for reusable science, and hope to explore how we can encourage users to expand beyond the current notebook-centric monoculture and toward holistic, open, modular and interoperable approaches to conducting research and developing scientific code.

The ideas and discussion at the BoF and in this document will inform future guides and resources on this topic, to be hosted on central community platforms like the Scientific Python organization.

Schedule

5:45 - What is reusable research and why is it important, and what are the goals & outcomes of the BoF?
5:50 - What tools and techniques do people have to share for effective reusable research?
6:05 - How can we integrate reusable research into existing workflows?
6:20 - How do we teach students and researchers about reusable research, and encourage its use?
6:35 - Additional discussion, plugs, "talk to me afters" & closing

What is reusable research and why is it important, and what are the goals & outcomes of the BoF?

  • Reusable research
    • Can be not only replicated, but also built upon and extended easily by both the author and others
    • "Building upon the shoulders of giants" is the foundation of both open science and open source
  • One-off scripts and notebooks are typically not very reusable; you generally cannot easily:
    • import them
    • specify dependencies
    • extend them
    • use them for another project (without copy/paste and managing multiple versions)
  • Additionally, for notebooks specifically, you cannot easily:
    • track them in VCS (with clean diffs)
    • lint, type check, test or format them with standard Python tools
    • interoperate with most other non-notebook-specific ecosystems
    • Etc.
  • Primary BoF goals (and topics)
    • Share tools, techniques and resources for reusable research
    • Discuss how we can better integrate them into common workflows
    • Determine how we can effectively teach and encourage their use among both newer students and established scientists
  • BoF outcomes
    • This document, documenting our discussions, ideas and resources from the BoF, which will be made public afterwards
    • Will inform potential future guides and resources hosted on a central community location (e.g. Scientific Python)
    • Serve as a jumping-off point for potential further events at future conferences

Even if you don't get a chance to speak up yourself, feel free to add any notes you like to the relevant categories!

What tools, techniques and resources do people have to share for effective reusable research?

  • There's a tool called nbflake8 to lint notebooks

    • Would be cool to have a Ruff-based tool too
  • Can be difficult to easily compare outputs between notebooks created by different researchers

  • This is an idea I've been working on for 10 years

    • We put the stuff we want to be modular in a regular Python module, and then have a Jupyter notebook that shows an example using the code
    • Have a collection of modular calculations that start with a Python function, decorate it with a class and then it connects to the framework that
    • https://github.com/usnistgov/iprPy
  • I'd probably add devcontainers: when you work with a lab group or collaborate with a team, they let you work together and see everything on their screen. Live Share in VS Code is also a really cool, similar feature.

  • One of the things we do on our project is document everything, and one step we are still struggling with is reducing a notebook to the type of report NASA is typically looking for

  • jbednar: I'd argue that a notebook is not a unit of reproducible research; a project is (notebooks or scripts + environment + record of commands to run there). See "8 Levels of Reproducibility" and Conda Project.

  • Adding a plug for papermill - super useful tool for parameterizing and executing notebooks programmatically
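
The "regular Python module plus example notebook" pattern mentioned above, where a plain function is decorated with a class so a framework can find it, might look roughly like the sketch below. This is only an illustration of the general idea and not how iprPy works: the `Calculation` class, the `REGISTRY` dictionary, the `calculations.py` file name and the `kinetic_energy` function are all hypothetical names made up for this example.

```python
"""calculations.py: a hypothetical module holding reusable calculations."""

# Hypothetical registry mapping a calculation's name to its wrapper object.
REGISTRY = {}


class Calculation:
    """Wrap a plain function so a framework can look it up and run it by name."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__
        self.__doc__ = func.__doc__
        # Registering at decoration time means importing the module is enough
        # for the calculation to become discoverable.
        REGISTRY[self.name] = self

    def __call__(self, *args, **kwargs):
        # Calling the wrapper behaves exactly like calling the original function.
        return self.func(*args, **kwargs)


@Calculation
def kinetic_energy(mass, velocity):
    """Return the kinetic energy in joules for a mass (kg) and velocity (m/s)."""
    return 0.5 * mass * velocity**2


# An example notebook would then only need a cell along the lines of:
#     from calculations import kinetic_energy
#     kinetic_energy(2.0, 3.0)
result = REGISTRY["kinetic_energy"](2.0, 3.0)
print(result)
```

The point of the split is that the module can be imported, tested and versioned like any other Python code, while the notebook stays a short, readable demonstration of how to call it.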

How can we integrate reusable research into existing workflows?

  • I really like the cookie template that Henry (III) has for packaging
    • A lot of my workflows are just messing around with my data
    • Having something like a package structure from the get go will help make it easier to not miss things
  • Following up on that, I'm in nuclear engineering and we often have two-week projects with Jupyter at the center
    • We have a cookiecutter template that has Sphinx, and a directory structure, and metadata that looks familiar and has everything set up by default
    • This particularly helps ensure that different colleagues and team members are on the same page about how things are done
  • Been using a data-driven cookiecutter template to have a structured way of deciding where to put things
    • This helps ensure consistency in terms of what things are named, and the order to run things
  • There's a really cool tool called "Show your work" that comes out of the astrophysics community, that's more in line with wanting to produce a paper at the end but include all the steps that show your work along the way
    • Show your work gives you a template so you can show your work at the end
    • Is built on a tool called Snakemake
    • Show your work sets up the template and the paper
    • Really helpful guide for getting started and ensuring all your projects have the same structure
    • Axel, who gave a talk on Wednesday, published their Gammapy paper using this tool
    • For a related tool aimed specifically at citing open source authors, check out duecredit, which collects the references embedded in the code you actually use so you can cite the software and methods behind it
  • Followup question: How is this different from Quarto?
    • Quarto is much more general, whereas show your work was specifically built to allow users to produce a PDF in LaTeX at the end
  • I do something very old fashioned: I write an aaa_readme.txt file where I keep a diary of what I was doing on that project, so if I take a break from working on it, I can go back to those notes and remind myself of what I was doing
  • On that note, notebooks are supposed to be this thing to make programming literate
    • However, beginners use them interactively because they don't know how to use debuggers, and they don't always remember the literate part
    • nbdev is my favorite tool for that, but getting people accustomed to best practices can also be helpful for reproducibility
  • I love notebooks, and also love modules, and love the flow of code from notebooks into modules once it approaches that point
    • Thinking of modules as a unit of documented, tested code, but which doesn't mean a lot on its own, whereas combined with a notebook, it gives them context and meaning
    • If your community is afraid of modules, then try to make creating them easier, rather than avoiding them, so that you have fully reimportable Python modules.
  • For students, the notebooks often turn into a fancy scratch pad or script file, and once they get stuff that works, they can move that stuff out into modules
    • Then the notebooks start to morph into examples and the history of what the work was that can be interpreted by other researchers
    • Tools like autodoc in VSCode can be a great way to reduce the friction for students, as they just add the triple quotes and VSCode expands the rest
  • Wrt nbdev, you can develop your code and let it grow, and then eventually you can export the parts of your code as modules at the end
    • Downside of the documentation is it talks about everything as packages, but you can use it for individual notebooks and modules
    • We're talking about students here, and I was hesitant to show it to my students since they're early Python programmers, but it was actually quite easy to have that one line at the end
    • Do you have a page where you document this? I'm still learning Python and would like to learn more about this and teach it to my students
    • nbconvert is a similar CLI tool that can convert notebooks to many different formats, including a Python script. This is also similar to the built-in VS Code feature
    • I did a tutorial here and can share that; the documentation is pretty intimidating but it would be great to have that in a smaller scale setting
  • Juanita: A cool output of the BoF is this document listing a bunch of tools, which can be the input to a series of guides and tool lists online
  • Question: How do I get started?
  • Issue: Documenting the parameters of your modules; without it it's very difficult to use them. jbednar: That's what param.holoviz.org is for. :-)
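
As a rough illustration of the last point about documenting module parameters, here is a minimal standard-library sketch in which each parameter carries its own description and is validated when it is set. It is a stand-in for the idea, not an example of the Param library's API; the `SmoothingParams` class, its fields and the `describe` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class SmoothingParams:
    """Hypothetical parameters for a smoothing step in an analysis module."""

    window: int = field(
        default=5, metadata={"doc": "Window length in samples; must be odd and >= 1."}
    )
    mode: str = field(
        default="mean", metadata={"doc": "Aggregation to apply: 'mean' or 'median'."}
    )

    def __post_init__(self):
        # Fail early with a readable message instead of deep inside the analysis.
        if self.window < 1 or self.window % 2 == 0:
            raise ValueError(f"window must be a positive odd integer, got {self.window}")
        if self.mode not in ("mean", "median"):
            raise ValueError(f"mode must be 'mean' or 'median', got {self.mode!r}")


def describe(params_cls):
    """Build a plain-text parameter reference from the field metadata."""
    lines = []
    for f in fields(params_cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        lines.append(f"{f.name} ({type_name}, default={f.default!r}): {f.metadata['doc']}")
    return "\n".join(lines)


help_text = describe(SmoothingParams)
print(help_text)
```

Keeping the description next to the default and the validation rule means the documentation cannot drift far from what the code accepts, which is the same problem dedicated parameter libraries aim to solve more thoroughly.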

How do we teach students and researchers about reusable research, and encourage its use?

  • It's one thing when it's students, but how do you do that when it's your whole organizational culture that needs to change?
    • Juanita: I am a student myself, and no one ever really talked to me about IDEs and explained what they were and why you'd want to use one.
      • It's important for teachers to actually teach them about using the proper tools
      • But I have no idea when it comes to coworkers using these things
  • With respect to the team situation, the most effective way I found is nerd sniping
    • You figure out what is the biggest pain point for the team, and it's usually something that should be automated
    • So I've tricked people into using better practices by showing them how these tools can fix that problem
    • (Juanita) Yeah, I think it's really just awareness. If you show someone a cool tool, most folks will make the decision to adopt it on their own, but there will always be folks who might not want that
  • I think students mostly get introduced to notebooks through classes in contexts that are very different from how they would use them for their research
    • More a question really, as I don't have a good resource for that to hand to a student if they have a question or are confused about that
    • (Juanita) I think that should be part of the curriculum: why are people learning machine learning using Jupyter notebooks without learning how to use Jupyter notebooks?
    • Many folks don't come from a traditional computer science background and may not know about all these tools, so we get a lot of benefit from students bringing in new ideas
  • CAM: I feel the fact that students are only exposed to notebooks really makes them not necessarily want to reach for other tools even when they would be more appropriate down the line
    • (Response) I feel we should be encouraging students to use an IDE like JupyterLab that offers many of those IDE-like features while also allowing them to take advantage of the notebook's interactive features
    • Juanita: I was a Spyder developer (and CAM is too), and I feel that we should show students how to use those tools like debugging and make it easier for them to do that, but give them the choice whether they want to use those tools. I think the right approach is not necessarily telling them what tool to use, but having documentation and exposure to those tools so students can pick the best option for them.
      • It's true, we want to give students options, but many might not need a debugger
  • I work in the library here at UT, and we often only have an hour to introduce users to Python
    • We use Google Colab (notebooks) because it makes it a lot easier for students to get started with Python over having to download and install an IDE
    • And then students tend to be familiar with that tool and continue to use it
    • (Juanita) I'm a big fan of using videos to help reach students rather than relying on them reading the documentation, as I feel they are much more likely to watch them
  • I am a particle physicist and I ask all my students to use Jupytext.
    • This helps students convert notebooks to Python files that can be committed to Git.
    • The code can be committed as Python.
    • In Jupyter, we can right-click a Python file, open it as a notebook, and continue working on it.
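
To make the "commit the code as Python" idea above more concrete, here is a simplified, standard-library-only sketch of turning a notebook's JSON into a plain script that uses `# %%` cell markers, the "percent" convention that Jupytext and several editors understand. It is an assumption-laden illustration of the concept, not Jupytext's implementation (in practice you would run something like `jupytext --to py:percent notebook.ipynb`); the `notebook_to_percent_script` function and the tiny example notebook are made up for this sketch.

```python
import json


def notebook_to_percent_script(notebook_json):
    """Convert notebook JSON text into a script with "# %%" cell markers."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        # The notebook format stores "source" as a string or a list of lines.
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + source.rstrip("\n"))
        elif cell.get("cell_type") == "markdown":
            # Markdown becomes comments so the result stays valid Python.
            commented = "\n".join(("# " + line).rstrip() for line in source.splitlines())
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


# A tiny, made-up notebook with one markdown cell and one code cell.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Load data\n", "Small demo."]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["x = [1, 2, 3]\n", "total = sum(x)"]},
    ],
})

script = notebook_to_percent_script(example)
print(script)
```

Because outputs and execution counts are dropped, the resulting file gives clean diffs in Git and can be linted, formatted and imported like any other Python file.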

Additional discussion, plugs, "talk to me afters" & closing

  • One other thing I want to add is students might have familiarity with Python or R, but Git is a completely different animal, and it is quite challenging to factor it into education
    • My wife is a writer, and she would really benefit from Git but it's really hard to get her to use it
      • Yeah, we may not be aware of how inefficient the workflows we use are, because that's all we know
  • Feel free to add more questions, comments, and feedback in the Slack channel or this document