Introduction to Workflows with CWL: Lesson Development Sprint

26-27 March 2020

Sign-In

Name / Location (Timezone) / GitHub username

  • Toby Hodges / Heidelberg, Germany (UTC+1) / tobyhodges
  • Fotis Psomopoulos / Thessaloniki, Greece (UTC+2) / fpsom
  • Beatriz Serrano Solano / Heidelberg, Germany (UTC+1) / beatrizserrano
  • Kersten Breuer / Heidelberg, Germany (UTC+1) / KerstenBreuer
  • Tom Tubbesing / Bielefeld, Germany (UTC+1) / ttubb
  • Christian Henke / Bielefeld, Germany (UTC+1) / maitai
  • Yi Sun / Heidelberg, Germany (UTC+1) / sunyi000
  • Hervé Ménager / Paris, France (UTC+1) / hmenager
  • Manabu Ishii / Tokyo, Japan (UTC+9) / manabuishii
  • Renato Alves / Heidelberg, Germany (UTC+1) / unode
  • Michael R. Crusoe / Berlin, Germany (UTC+1) / mr-c

Logistics

Schedule (most relevant for participants in Europe)

  • 08:00 UTC 2020-03-26: Start
  • 08:15: Introduction to Reverse Instructional Design (recorded)
  • 08:45: Discussion and planning
  • 09:00: Lesson development
  • 09:45: Break part 1 (individual)
  • 10:00: Break part 2 (group)
  • 10:15: Lesson development
  • 11:00: [Lunch] Break (individual)
  • 12:00: Report out & discussion (recorded)
  • 12:15: Lesson development
  • 13:30: Break part 1 (individual)
  • 13:45: Break part 2 (group)
  • 14:00: Lesson development
  • 14:30: Report out & discussion (recorded)
  • 15:00: Wrap-up time for participants in Europe
  • 16:00: End of day 1 for participants in Europe (+handover?)
  • 08:00 UTC 2020-03-27: Start of (optional!) day 2
  • 12:00: End of (optional!) day 2

Introduction to Curriculum Development

  • Slides at https://common-workflow-lab.github.io/cwl-novice-tutorial/files/lesson-dev-sprint-intro.html

  • Manabu's experience with learners in Japan: try to motivate learners based on their background

    • work with them on a project/tool relevant to them
    • try to have them writing their own CWL description by the end
    • at least one tool and one workflow by the end
  • Kersten: what computational level does our novice have?

    • could have three different entry points:
      • wet lab biologists, command line/Bash?
  • Hervé: experience I've had trying to get people into "workflow thinking": one group is people without much research computing experience, and I recommend those people use Galaxy/a graphical interface. The other group is very comfortable working in a terminal, writing shell scripts, etc. There are common patterns in the misconceptions and mistakes that people make when transitioning from this interactive/scripting style of writing protocols to working with workflows.

    • ways tasks are coordinated

Audience Definition Questions

(From The Carpentries Curriculum Development Handbook.)

  1. What is the expected educational level of your audience?

Research Software Engineers
Undergraduates
PhD Students
Postdocs
Senior Researchers (possibly not)

  • PhD students and post-docs probably have the most to gain
  • RSEs are the optimal target audience because robustness of pipelines is a primary concern for them
  • PIs/senior researchers - don't target them directly but don't actively exclude them
    • would it make sense to have additional material for these people - "Top Ten Things a PI needs to know about workflows"?
  2. What type of exposure do your audience members have to the technologies you plan to teach?
  • biologist without (much) command line experience wants to chain particular steps together - knows a few common tools and options, but hasn't written a script
  • bioinformatician who wants to scale their pipeline
  • is it easier for different people to get into workflows/CWL depending on their operating system?
    • not directly, but Linux users are typically more familiar with command line etc
  • an audience without command line experience often doesn't understand that tools/actions can run without the user directly triggering them (e.g. by double-clicking)
    • when I've had discussions about workflows with groups of mixed background/technical expertise, having a graphical interface has helped those without technical experience understand the concepts. But that's the end of their involvement - the actual workflow development is done by those with technical background.
    • are we aiming at "Introduction to Workflows" or "Introduction to Workflow Design"/"Introduction to Building Workflows?"

Practical - how to build a workflow
Theoretical layer - why use workflows?

Researchers submit a paper and reviewers ask for a reproducible workflow description

  • why workflows are a solution to what's expected in science these days - (short) sales pitch
  3. What types of tools do they already use?
  4. What are the pain points they are currently experiencing?
  5. What types of data does your target audience work with?
    • What are the commonalities in the datasets your target audience will encounter?

What are the different models that people can have of (research) computing?

  • remote execution: local computation not necessary
  • containerisation: software installation not necessary
  • developer with no biological knowledge
    • what prior knowledge of biology/[domain] are we expecting/assuming with the example workflows we choose?
  • issues regarding reproducibility - perception is that research computing is deterministic. The first time you try to (re-)implement a computational protocol from the description in a research paper you start to grasp that this is far from trivial
  • what are the circumstances under which someone with no computational research experience needs to learn about workflows but won't benefit from learning shell scripting etc first
    • e.g. Nanopore users wanting to make minor adjustments to existing workflows
      • a similar audience of Galaxy users probably exists
  • at which point are people forced to make the switch from shell scripts to defined workflows? when is the optimal time for them to learn?
    • one group: experience with other workflow engines and want to make their workflow interoperable - this is why they learn CWL specifically
    • other group has no idea about workflows - how do we convince them CWL is the best one to start with?

Assuming competent practitioner of command line/shell scripting: What types of tools do they already use?

  • BASH scripts to combine commands into an executable file
    • variables
    • pipes -> streaming
  • Maybe even Makefiles
  • Python/R scripts -> conditionals and logic
  • remote computing (ssh)
  • UI-based text editor

What are the pain points they are currently experiencing?

  • knowing when to write a structured pipeline/workflow
  • have been given legacy protocol from someone else
  • asked by a reviewer to provide workflow description/container/whatever

Learning Objectives

  • aiming to teach good workflow design

Lesson Structure

Other topics

The Carpentries is not so well known in Japan.
Manabu writes:

In Japan, people in the IT field use https://dotinstall.com/ where each lesson is a short recorded video (3-5 min).
In the bioinformatics field, there is http://togotv.dbcls.jp/en/

Tom's amazing spreadsheet

I went through all the CWL-related Questions (on biostars/discourse) from Jan. 2019 to now to get a feel for what users are struggling with. Also tried to categorize them while scanning through them (don't know how well that turned out though). Here is the list in case somebody wants to take a gander:
https://docs.google.com/spreadsheets/d/1LxIYysQLi4_fcDy-F5Q3XMQUKI10EYLQz0hitRIKY28/edit#gid=0

Breakout group 1

Profiles

Non Computational

  • no command line at all

CopyMaster

  • copies & pastes to command line /(maybe Jupyter notebook)
  • everything is manual
  • experience with application-specific macros (imaging, Excel)

Novice "hack in my homedir"

  • CopyMaster and ..
  • basic unix commands (cp, mv, mkdir)
  • write and execute simple Bash scripts (mostly by copy-and-pasting commands from external resources)

Intermediate homedir hacker

  • Novice and ..
  • pipelining with awk, sed, etc.
  • reusable scripts
  • homedir application installs
  • Make

HPC users, but not using workflows

  • can submit jobs

Workflow user, but not yet CWL

  • Used Nextflow, or snakemake

Pain points for each perspective:

(before CWL/workflows)

  • Non Computational

  • Novice homedir hacker

    • Installing applications
  • Intermediate homedir hacker

    • Keeping track of data, experiments, metadata.
    • Hard to scale things - need to do everything manually
    • Don't even know the structure of distributed computing
    • Don't know about environment variables and their impact (PATH etc)
    • Shared vs. local filesystem
    • How many things can affect reproducibility (random seeds, number of threads, compiler, ..)

knowing what things the workflow platform does take care of vs. what you have to take care of

  • format conversions (your responsibility)
  • data movement / tracking (the platform does this)

(after using CWL )

In general: must be more explicit. It is less intuitive.

Can use local applications, conda packages, or Docker-format software containers; this is nice but maybe confusing

  • Non Computational

    • still don't know the value of workflows
  • CopyMaster

    • file paths
  • Novice homedir hacker

    • baseCommand vs inputs vs arguments
    • controlling the contents of current working directory
  • Intermediate

    • how to insert custom scripts into a workflow
    • how and when to split up scripts
    • how to connect different I/O formats into a workflow
    • Making environment variables explicit, and understanding that they work differently in CWL even if you don't use Docker
  • HPC

    • Input files are read only by default
    • Intermediate file management / not having a unified filesystem
    • (Paths are special, not just a string)
  • Other workflow language/system user

    • Mapping between concepts

Types of data for each perspective:

  • Non Computational
    • Images
    • FASTQs
    • Spreadsheets
    • plots
  • CopyMaster
    • FASTQs
    • CSV / TSVs
  • Novice homedir hacker
    • reference files
    • scripts
  • Intermediate
    • configuration files
    • "big" data
  • HPC user
    • "bigger" data, and more files

Do we need to limit the tutorial to people with fundamental research computing knowledge (working in the shell/shell scripting)?
Y xxxxxxx
N x

Learner Profiles

Researcher who expects to publish protocol

  • who are they?
    • life sciences background
    • already learned essential computational research skills - attended a Software Carpentry workshop last year?
    • developing a new experimental protocol
    • writing a pipeline/protocol to analyse the data generated
      • this combines multiple command line/bioinformatics tools in a few shell scripts that run on their research group's local server
  • what problem are they having?
    • when they publish this research, they expect to be asked to include a reproducible description of the data analysis
    • the funding body requires them to publish details of their analyses in full
    • need to adapt existing script(s) to work in local HPC environment, as incoming experimental data requires upscaling of analysis
  • how will the tutorial help them?
    • after applying what they've learned in the tutorial, their analysis pipeline will be
      • more portable between compute environments
      • easy to share alongside publication of research findings (separate methods paper?)
    • after applying what they've learned in the tutorial, they will be able to provide provenance information for research findings on request, e.g. from a reviewer/funder/collaborator
      • more robust to changes in tool/dependency versions and adjustments to the protocol itself

RSE who inherited protocol and now needs to deploy/scale up

  • who are they?
    • bioinformatics background
    • [some details about their training here]
    • just joined a new lab and inherited several scripts from departing postdoc
    • leading development on a new tool
  • what problem are they having?
    • all of these things will need to be deployed/deployable in a cloud environment soon
    • since the postdoc left, several key dependencies have been updated and the pipeline currently doesn't run
      • it feels like every time they manage to fix these problems, another update is released and everything breaks all over again
    • the group leader wants to change the short read alignment tool used in a key step in the existing workflow
  • how will the tutorial help them?
    • after applying what they've learned in the tutorial, their analysis pipeline will be
      • more portable between compute environments
      • more robust to changes in tool/dependency versions
      • more maintainable
        • robust to adjustments to the protocol itself
        • quicker to make these adjustments

The Tom/Kersten Hybrid who wants to learn workflows

  • wanted to learn computational research skills
  • was assigned task of implementing pipeline for variation of ChIP-seq analysis
  • had never connected to a remote machine from the command line
  • supervisor was aware of some of the benefits of implementing this kind of workflow
    • pipeline would end up being run often and on large scale
    • why CWL? to ensure resulting workflow could be run anywhere
  • a lot of wet lab hours during Masters but wanted to also get experience with bioinformatics/data analysis
  • took on a Masters project to migrate existing sequence analysis pipeline to run on de.NBI Cloud
  • CWL: most general solution, most flexible once workflow was described

Learner Profiles PR (merged): https://github.com/common-workflow-lab/cwl-novice-tutorial/pull/5

Learning objectives

Prerequisites

Before taking this tutorial learners should have basic knowledge of the following concepts:

  • foo

Goals

After following one of these tutorials, learners will be able to:

  • Know that all output files must be explicitly captured and how to do so

    • How to capture the output files
    • need to specify the files which you want to capture from the tools
    • bulk capture the output for debugging purposes
      • identify which files are actually needed
    • output files in the specific directory or working directory
    • output files in the same directory which has input files
    • stdout & stdin
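
A minimal sketch of explicit output capture in a CommandLineTool (the tool and filenames are illustrative, not taken from the tutorial):

```yaml
# Hypothetical example: only files listed under `outputs` are kept;
# everything else in the working directory is discarded by the runner.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
stdout: line_count.txt          # capture standard output into a named file
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  counts:
    type: stdout                # the captured stdout file becomes an output
```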
  • Recognize when the same step is being run but the input files vary (or maybe a parameter varies, or both) and that this is the "scatter" pattern. Know how to implement this using "scatter"

    • What is a CWL scatter
    • difference between scattering and parallel execution
    • running the same program on each file
    • running the same program the same way except for one parameter
    • Advanced: multidimensional scatter
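
A sketch of the scatter pattern, assuming a hypothetical count_lines.cwl tool description with an input_file input and a counts output:

```yaml
# Illustrative workflow fragment: the same step runs once per input file.
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  fastq_files: File[]
steps:
  count_lines:
    run: count_lines.cwl        # hypothetical tool description
    scatter: input_file         # one invocation per element of fastq_files
    in:
      input_file: fastq_files
    out: [counts]
outputs:
  all_counts:
    type: File[]
    outputSource: count_lines/counts
```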
  • Be able to split a bash script into a CWL workflow

    • difference between a "control flow" (bash script) and a "data flow" (CWL - and others)
    • identify the inputs and outputs of the script
    • identify the tasks, i.e. the tools being run
    • identify the links, i.e. the data flowing in and out of these tools
    • (not sure about this one) identify and remove infrastructure-specific details (e.g. if tools are launched with SLURM commands in a bash script, or loaded with Docker or Conda)
  • Be able to explain the difference between a CWL tool description and a CWL workflow (description)

    • difference between a tool and the cwl-document that acts as a wrapper for that tool
    • tool wrapper document: describes input/output semantics of command line tool
    • workflow document: describes the input/output of a workflow and specifies the flow of data between tools
  • Know what a sub workflow is, how to make one, and when to use them. (KB, HM)

  • Be able to make understandable and valid names for inputs and outputs (not "input3")

    • always give use case-oriented names or names that describe the content
    • avoid naming them after the tool that produced them or the file format
    • so instead of:
      • fastq1
      • bam
      • bed
    • go for:
      • read1
      • aligned_reads
      • regions_of_interest
  • Describe all the requirements for running a tool: environment variables, and more

    • https://www.commonwl.org/v1.1/CommandLineTool.html#Runtime_environment
    • Assume the program (baseCommand) is in the system PATH
    • Aren't allowed to change the PATH
    • any other environment variables necessary for execution must be set explicitly
    • need a file next to (in the same directory as) another file? use secondaryFiles or InitialWorkDirRequirement
    • Don't hard code the number of threads, use $(runtime.cores)
    • Need network access? Not so great, but use NetworkAccess if you must
    • Referencing a file path? Use type: File not a string
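
A sketch pulling several of these points together (the tool name and secondary file suffix are assumptions for illustration only):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: my_aligner           # hypothetical; assumed to be on the PATH
requirements:
  EnvVarRequirement:
    envDef:
      ALIGNER_TMPDIR: $(runtime.tmpdir)   # set needed env vars explicitly
  ResourceRequirement:
    coresMin: 2
arguments: ["--threads", $(runtime.cores)]  # don't hard-code thread counts
inputs:
  reference:
    type: File                    # a File, not a plain string path
    secondaryFiles: [.fai]        # index file expected next to the reference
    inputBinding:
      position: 1
outputs: []
```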
  • Realize that a workflow is a dependency graph

  • Be able to include their own script as a step in a CWL workflow

    • make script executable and add to path
      • advantage: quick
      • downside:
        • the scripts are not shipped with the CWL document itself
        • not portable to other execution environments without installing the script first
    • pack the scripts into a docker container and make them executable there:
      • advantage: portable
      • downside:
        • only works for people that are using containers
    • use InitialWorkDirRequirement to list script content directly in the CWLToolWrapper
    • distribute as a separate package via pip / cran /
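
The InitialWorkDirRequirement option listed above, sketched with a made-up script body:

```yaml
# Ship the script inside the tool description so the CWL file is
# self-contained and portable.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [bash, count_reads.sh]
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: count_reads.sh
        entry: |
          #!/bin/bash
          # hypothetical script: count FASTQ records
          grep -c "^@" "$1"
inputs:
  reads:
    type: File
    inputBinding:
      position: 1
outputs:
  count:
    type: stdout
```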
  • Be able to graph/visualize their workflow, both by hand, and with an automated visualizer

    • use cwlviewer online
    • generate Graphviz diagram using cwltool
    • exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper
  • Work around a bad software container (one that requires being run as the root user inside the container, or expects certain directory paths) (FP, KB)

  • Be able to interpret CWL error messages to recognize and fix simple bugs in their workflow code

    • (this is difficult! will need to collect examples first)
  • Know how to use CWL v1.2 dynamic workflow conditionals (MRC, HM, TT)

    • Not yet released, expected in Q2 2020
  • Be able to document purpose, intent, and other factors within their workflow

    • "doc"
    • "label"
    • Workflow level, step level, sub workflow level, tool level, inputs and outputs everywhere.
    • Not required nor recommended to fill these out everywhere! Use as needed
    • Show example using the CWL viewer
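
A sketch of label/doc at the workflow and input level (all contents invented for illustration):

```yaml
cwlVersion: v1.0
class: Workflow
label: Read alignment demo
doc: |
  Aligns sequencing reads against a reference genome.
  Teaching example only.
inputs:
  reads:
    type: File
    label: Raw reads
    doc: Raw sequencing reads in FASTQ format
outputs: []
steps: {}
```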
  • Know that workflow development can be iterative, can involve sketching, prototypes; that it doesn't have to happen all at once

    • Whiteboard sketch
    • bash script
    • makefile
    • CWL
  • How CWL can help you give credit for all the tools you used
    See also https://github.com/common-workflow-language/cwl-utils/blob/master/cwl_utils/cite_extract.py

  • Run their workflow on a {Sun Grid Engine, LSF, Slurm,..} HPC system (FP, MRC, TT, MI)

    • explain the benefit and restrictions of HPC systems (high throughput, network restrictions, etc)
    • provide example options (e.g. Toil)
  • Convert a GNU Makefile to a CWL workflow (FP, MRC, TT, HM, KB, MI)

  • Be able to customize a workflow at any of the many levels

    • Change the input object
    • Change the default values at the workflow level
    • Change hard coded values at the workflow level
    • Change default value at the Workflow step level
    • Change hard coded values at the Workflow step level
    • Change default values in the CLT description
    • Change hard coded values in the CLT description
    • Change the container
    • Change the tool source itself
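
For example, a step-level override of a default value (all names are hypothetical):

```yaml
# Workflow fragment: the step-level default takes precedence over the
# default declared inside aligner.cwl for the same input.
steps:
  align:
    run: aligner.cwl            # hypothetical tool description
    in:
      reads: reads
      threads:
        default: 4              # step-level default value
    out: [alignments]
```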

Examples:

  • Locate controls for navigating data in OpenRefine
  • Find options to work with data through the OpenRefine dropdown menus
  • Split cells which contain multiple bits of data so that each piece of data is in its own cell

(from https://librarycarpentry.org/lc-open-refine/03-working-with-data/index.html)

  • Employ the grep command to search for information within files.
  • Print the results of a command to a file.
  • Construct command pipelines with two or more stages.

(from https://datacarpentry.org/shell-genomics/04-redirection/index.html)

More feedback from Manabu

Rabix Composer is interesting to people in Japan

  • It is shiny
  • attractive to non-command line users
  • (e.g. a medical doctor)
  • but has limitations like not supporting "stdout"
  • Not yet ready for daily use

VS Code live share + CWL language server
https://github.com/tom-tan/cwl-for-remote-container-template

CWL syntax (YAML-based) is quite different from other languages. After successfully writing one workflow, learners are more confident.

Also discussed the "zatsu" method (quick and dirty, but useful!)
https://github.com/tom-tan/zatsu-cwl-generator

Starting CWL with `Zatsu` method! - Qiita

  • At least for writing a CommandLineTool, the Zatsu method is a very useful starting point.
  • We will try to create a Zatsu method for CWL Workflows.

from Manabu:

  • Some people are confused about output files. In CWL we need to state in the outputs section which files are captured, but people assume every file is captured by default. The tutorial should cover this situation; it would be very helpful for people creating their own CWL workflows by themselves.
    • Is this already covered by the user guide?

Concept Maps

https://carpentries.github.io/instructor-training/05-memory/index.html

Other questions

What about teaching workflow thinking without teaching how to make CommandLineTool descriptions?

Still show the basics but don't spend time explaining how to make CLTs (link to the existing user guide)

Workflow composition would use only pre-made CLTs from public repos

Other lessons can be on how to adapt/extend a pre-made CLT

organizing the learning objectives

pre-BASH-to-CWL

  • understanding CLT Inputs & Outputs
  • CLT vs Workflow
  • naming (Workflow steps, Workflow Inputs & Outputs)

post-BASH-to-CWL

  • scatter
  • workflow is a dependency graph / parallelism
  • own script in workflow
  • visualize / graph a workflow
  • documenting a workflow
  • workflow development lifecycle (iterative, paper, ..)
  • customize a workflow (9+ levels)

https://docs.google.com/presentation/d/1aVdK8LHkgtESBunCQ-p7XmEl8NB9XbgDsH67X0_2HWg/edit?usp=sharing

2020-04-21 check-in meeting

Attendees: Toby, Fotis, Tom, Sehrish, Kersten, Michael

Review of work so far

https://common-workflow-lab.github.io/cwl-novice-tutorial/index.html

On 2020-03-26 a group did a day long virtual meeting.

Things were fuzzy at the beginning of the day: exact objectives, vision, target audience. We did get clarity on these, producing a list of learning objectives and learner personas.

Learners should have some command line experience, but not necessarily a lot. The learner profiles: https://common-workflow-lab.github.io/cwl-novice-tutorial/audience/

Feedback on those profiles via issues and pull requests is very welcome!

(After the March meeting Toby re-arranged the learning objectives into their current, easier to read form)

The learning objectives are at https://common-workflow-lab.github.io/cwl-novice-tutorial/#learning-objectives

Some could use more details. Again feedback via GitHub issues and edits are welcome!

At the March meeting, concept maps were made: https://docs.google.com/presentation/d/1aVdK8LHkgtESBunCQ-p7XmEl8NB9XbgDsH67X0_2HWg/edit?usp=sharing

Since the likely result is a relatively modular set of exercises, the concept maps can be helpful in determining the dependency tree of the exercises

Next steps

Write at least one exercise for each learning objective (each bullet point).

Writing the exercises will not be easy, but it makes writing the tutorial easier and is required.

It would be good to revisit the concept maps once some/most of the exercises have been written. Concept maps can also be used as an assessment tool by presenting an incomplete map and asking the learner to fill in the blanks, then analyzing the differences between the mental models of the instructor and the learners.

How to contribute exercises?

Toby+Michael will send out instructions next week. A video chat (2020-05-14 10:00-11:00 CEST) will guide contributors through the process. A recording will be made.

(Preview: each set of learning objectives will be moved to their own pseudo-lesson markdown file, then the exercises can be created/edited there)

Request: an overall checklist / progress tracker (not yet, will make issues with checkboxes: exercises, connections to concept maps, alpha-,beta-testing)

(the Carpentries Curriculum Development Handbook has a section on Designing Challenges, but it seems data analysis centric: https://carpentries.github.io/curriculum-development/designing-challenges.html#designing-challenges-1 ; Toby recommends http://teachtogether.tech/#s:exercises )

Fotis: Would it be nice to have a theme for the exercises, so that they become aligned? Like the "intro to Excel" Carpentries lesson and its narrative that connects the lessons and exercises.
(Yes to a theme before it is "finished", but it's not needed right now)

Toby: when we start writing the exercises, some themes will probably emerge. Writing an exercise that assesses what you want to assess is hard enough; swapping out the theme afterwards will not be so hard.

But I want to start today!

Feel free to comment on the learning objectives and start discussing/creating exercises for them! Markdown / plain text is fine, don't worry about formatting.

Fotis: Don't be afraid of duplicating effort; that isn't really possible here. Multiple exercises per learning objective are useful, as are different perspectives. He also starts on paper, not digitally.

Sehrish: What would an exercise look like? Just about writing CWL?

Toby: Check out http://teachtogether.tech/#s:exercises & http://teachtogether.tech/#s:models (and any other chapters that look interesting, like http://teachtogether.tech/#s:process) and the cognitive load & mental models parts of the Carpentries Curriculum Development Handbook
