Introduction to Workflows with CWL: Lesson Development Sprint

26-27 March 2020

Sign-In

Name / Location (Timezone) / GitHub username

  • Toby Hodges / Heidelberg, Germany (UTC+1) / tobyhodges
  • Fotis Psomopoulos / Thessaloniki, Greece (UTC+2) / fpsom
  • Beatriz Serrano Solano / Heidelberg, Germany (UTC+1) / beatrizserrano
  • Kersten Breuer / Heidelberg, Germany (UTC+1) / KerstenBreuer
  • Tom Tubbesing / Bielefeld, Germany (UTC+1) / ttubb
  • Christian Henke / Bielefeld, Germany (UTC+1) / maitai
  • Yi Sun / Heidelberg, Germany (UTC+1) / sunyi000
  • Hervé Ménager / Paris, France (UTC+1) / hmenager
  • Manabu Ishii / Tokyo, Japan (UTC+9) / manabuishii
  • Renato Alves / Heidelberg, Germany (UTC+1) / unode
  • Michael R. Crusoe / Berlin, Germany (UTC+1) / mr-c

Logistics

Schedule (most relevant for participants in Europe)

  • 08:00 UTC 2020-03-26: Start
  • 08:15: Introduction to Reverse Instructional Design (recorded)
  • 08:45: Discussion and planning
  • 09:00: Lesson development
  • 09:45: Break part 1 (individual)
  • 10:00: Break part 2 (group)
  • 10:15: Lesson development
  • 11:00: [Lunch] Break (individual)
  • 12:00: Report out & discussion (recorded)
  • 12:15: Lesson development
  • 13:30: Break part 1 (individual)
  • 13:45: Break part 2 (group)
  • 14:00: Lesson development
  • 14:30: Report out & discussion (recorded)
  • 15:00: Wrap-up time for participants in Europe
  • 16:00: End of day 1 for participants in Europe (+handover?)
  • 08:00 UTC 2020-03-27: Start of (optional!) day 2
  • 12:00: End of (optional!) day 2

Introduction to Curriculum Development

  • Slides at https://common-workflow-lab.github.io/cwl-novice-tutorial/files/lesson-dev-sprint-intro.html

  • Manabu's experience with learners in Japan: try to motivate learners based on their background

    • work with them on a project/tool relevant to them
    • try to have them writing their own CWL description by the end
    • at least one tool and one workflow by the end
  • Kersten: what computational level does our novice have?

    • could have three different entry points:
      • wet lab biologists, command line/Bash?
  • Hervé: experience I've had trying to get people into "workflow thinking": one group is people without much research computing experience, and I recommend those people use Galaxy/a graphical interface. The other group is very comfortable working in a terminal, writing shell scripts, etc. There are common patterns in the misconceptions and mistakes that people make when transitioning from this interactive/scripting style of writing protocols to working with workflows.

    • ways tasks are coordinated

Audience Definition Questions

(From The Carpentries Curriculum Development Handbook.)

  1. What is the expected educational level of your audience?

Research Software Engineers
Undergraduates
PhD Students
Postdocs
Senior Researchers (possibly not)

  • PhD students and post-docs probably have the most to gain
  • RSEs are the optimal target audience because robustness of pipelines is a primary concern for them
  • PIs/senior researchers - don't target them directly but don't actively exclude them
    • would it make sense to have additional material for these people - "Top Ten Things a PI needs to know about workflows"?
  2. What type of exposure do your audience members have to the technologies you plan to teach?
  • biologist without (much) command line experience wants to chain particular steps together - knows a few common tools and options, but hasn't written a script
  • bioinformatician who wants to scale their pipeline
  • is it easier for different people to get into workflows/CWL depending on their operating system?
    • not directly, but Linux users are typically more familiar with command line etc
  • an audience without command line experience often doesn't understand that tools/actions can run without the user directly triggering them (e.g. by double-clicking)
    • when I've had discussions about workflows with groups of mixed background/technical expertise, having a graphical interface has helped those without technical experience understand the concepts. But that's the end of their involvement - the actual workflow development is done by those with technical background.
    • are we aiming at "Introduction to Workflows" or "Introduction to Workflow Design"/"Introduction to Building Workflows?"

Practical - how to build a workflow
Theoretical layer - why use workflows?

Researchers submit a paper and reviewers ask for a reproducible workflow description

  • why workflows are a solution to what's expected in science these days - (short) sales pitch
  3. What types of tools do they already use?
  4. What are the pain points they are currently experiencing?
  5. What types of data does your target audience work with?
    • What are the commonalities in the datasets your target audience will encounter?

What are the different models that people can have of (research) computing?

  • remote execution: local computation not necessary
  • containerisation: software installation not necessary
  • developer with no biological knowledge
    • what prior knowledge of biology/[domain] are we expecting/assuming with the example workflows we choose?
  • issues regarding reproducibility - perception is that research computing is deterministic. The first time you try to (re-)implement a computational protocol from the description in a research paper you start to grasp that this is far from trivial
  • what are the circumstances under which someone with no computational research experience needs to learn about workflows but won't benefit from learning shell scripting etc first
    • e.g. Nanopore users wanting to make minor adjustments to existing workflows
      • a similar audience of Galaxy users probably exists
  • at which point are people forced to make the switch from shell scripts to defined workflows? when is the optimal time for them to learn?
    • one group: experience with other workflow engines and want to make their workflow interoperable - this is why they learn CWL specifically
    • other group has no idea about workflows - how do we convince them CWL is the best one to start with?

Assuming competent practitioner of command line/shell scripting: What types of tools do they already use?

  • BASH scripts to combine commands into an executable file
    • variables
    • pipes -> streaming
  • Maybe even Makefiles
  • Python/R scripts -> conditionals and logic
  • remote computing (ssh)
  • UI-based text editor

What are the pain points they are currently experiencing?

  • knowing when to write a structured pipeline/workflow
  • have been given legacy protocol from someone else
  • asked by a reviewer to provide workflow description/container/whatever

Learning Objectives

  • aiming to teach good workflow design

Lesson Structure

Other topics

The Carpentries is not so well known in Japan.
Manabu writes:

In Japan, people in the IT field use https://dotinstall.com/ where each lesson is a short recorded video (3-5 min).
In the bioinformatics field, there is http://togotv.dbcls.jp/en/

Tom's amazing spreadsheet

I went through all the CWL-related Questions (on biostars/discourse) from Jan. 2019 to now to get a feel for what users are struggling with. Also tried to categorize them while scanning through them (don't know how well that turned out though). Here is the list in case somebody wants to take a gander:
https://docs.google.com/spreadsheets/d/1LxIYysQLi4_fcDy-F5Q3XMQUKI10EYLQz0hitRIKY28/edit#gid=0

Breakout group 1

Profiles

Non Computational

  • no command line at all

CopyMaster

  • copies & pastes to command line /(maybe Jupyter notebook)
  • everything is manual
  • experience with application-specific macros (imaging, Excel)

Novice "hack in my homedir"

  • CopyMaster and ..
  • basic unix commands (cp, mv, mkdir)
  • write and execute simple Bash scripts (mostly by copy-and-pasting commands from external resources)

Intermediate homedir hacker

  • Novice and ..
  • pipelining with awk, sed, etc.
  • reusable scripts
  • homedir application installs
  • Make

HPC users, but not using workflows

  • can submit jobs

Workflow user, but not yet CWL

  • Used Nextflow, or snakemake

Pain points for each perspective:

(before CWL/workflows)

  • Non Computational

  • Novice homedir hacker

    • Installing applications
  • Intermediate homedir hacker

    • Keeping track of data, experiments, metadata.
    • Hard to scale things - need to do everything manually
    • Don't even know the structure of distributed computing
    • Don't know about environment variables and their impact (PATH etc)
    • Shared vs. local filesystem
    • How many things can affect reproducibility (random seeds, number of threads, compiler, ..)

knowing what things the workflow platform does take care of vs. what you have to take care of

  • format conversions (your responsibility)
  • data movement / tracking (the platform does this)

(after using CWL )

In general: must be more explicit. It is less intuitive.

Can use local applications, conda packages, or Docker-format software containers; this is nice but maybe confusing

  • Non Computational

    • still don't know the value of workflows
  • CopyMaster

    • file paths
  • Novice homedir hacker

    • baseCommand vs inputs vs arguments
    • controlling the contents of current working directory
  • Intermediate

    • how to insert custom scripts into a workflow
    • how and when to split up scripts
    • how to connect different I/O formats into a workflow
    • Making environment variables explicit, and understanding that they work differently in CWL even if you don't use Docker
  • HPC

    • Input files are read only by default
    • Intermediate file management / not having a unified filesystem
    • (Paths are special, not just a string)
  • Other workflow language/system user

    • Mapping between concepts

Types of data for each perspective:

  • Non Computational
    • Images
    • FASTQs
    • Spreadsheets
    • plots
  • CopyMaster
    • FASTQs
    • CSV / TSVs
  • Novice homedir hacker
    • reference files
    • scripts
  • Intermediate
    • configuration files
    • "big" data
  • HPC user
    • "bigger" data, and more files

Do we need to limit the tutorial to people with fundamental research computing knowledge (working in the shell/shell scripting)?
Y xxxxxxx
N x

Learner Profiles

Researcher who expects to publish protocol

  • who are they?
    • life sciences background
    • already learned essential computational research skills - attended a Software Carpentry workshop last year?
    • developing a new experimental protocol
    • writing a pipeline/protocol to analyse the data generated
      • this combines multiple command line/bioinformatics tools in a few shell scripts that run on their research group's local server
  • what problem are they having?
    • when they publish this research, they expect to be asked to include a reproducible description of the data analysis
    • the funding body requires them to publish details of their analyses in full
    • need to adapt existing script(s) to work in local HPC environment, as incoming experimental data requires upscaling of analysis
  • how will the tutorial help them?
    • after applying what they've learned in the tutorial, their analysis pipeline will be
      • more portable between compute environments
      • easy to share alongside publication of research findings (separate methods paper?)
    • after applying what they've learned in the tutorial, they will be able to provide provenance information for research findings on request, e.g. from a reviewer/funder/collaborator
      • more robust to changes in tool/dependency versions and adjustments to the protocol itself

RSE who inherited protocol and now needs to deploy/scale up

  • who are they?
    • bioinformatics background
    • [some details about their training here]
    • just joined a new lab and inherited several scripts from departing postdoc
    • leading development on a new tool
  • what problem are they having?
    • all of these things will need to be deployed/deployable in a cloud environment soon
    • since the postdoc left, several key dependencies have been updated and the pipeline currently doesn't run
      • it feels like every time they manage to fix these problems, another update is released and everything breaks all over again
    • the group leader wants to change the short read alignment tool used in a key step in the existing workflow
  • how will the tutorial help them?
    • after applying what they've learned in the tutorial, their analysis pipeline will be
      • more portable between compute environments
      • more robust to changes in tool/dependency versions
      • more maintainable
        • robust to adjustments to the protocol itself
        • quicker to make these adjustments

The Tom/Kersten Hybrid who wants to learn workflows

  • wanted to learn computational research skills
  • was assigned task of implementing pipeline for variation of ChIP-seq analysis
  • had never connected to a remote machine from the command line
  • supervisor was aware of some of the benefits of implementing this kind of workflow
    • pipeline would end up being run often and on large scale
    • why CWL? to ensure resulting workflow could be run anywhere
  • a lot of wet lab hours during Masters but wanted to also get experience with bioinformatics/data analysis
  • took on a Masters project to migrate existing sequence analysis pipeline to run on de.NBI Cloud
  • CWL: most general solution, most flexible once workflow was described

Learner Profiles PR (merged): https://github.com/common-workflow-lab/cwl-novice-tutorial/pull/5

Learning objectives

Prerequisites

Before taking this tutorial learners should have basic knowledge of the following concepts:

  • foo

Goals

After following one of these tutorials, learners will be able to:

  • Know that all output files must be explicitly captured and how to do so

    • How to capture the output files
    • need to specify the files which you want to capture from the tools
    • bulk capture the output for debugging purposes
      • identify which files are actually needed
    • output files in the specific directory or working directory
    • output files in the same directory which has input files
    • stdout & stdin
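
A minimal sketch of explicit output capture in a CommandLineTool (the tool and filenames are illustrative, not taken from the tutorial):

```yaml
# Hypothetical example: only files listed under `outputs` are kept;
# everything else in the working directory is discarded by the runner.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
stdout: line_count.txt          # capture standard output into a named file
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  counts:
    type: stdout                # the captured stdout file becomes an output
```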
  • Recognize when the same step is being run but the input files vary (or maybe a parameter varies, or both) and that this is the "scatter" pattern. Know how to implement this using "scatter"

    • What is a CWL scatter
    • difference between scattering and parallel execution
    • running the same program on each file
    • running the same program the same way except for one parameter
    • Advanced: multidimensional scatter
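
A sketch of the scatter pattern, assuming a hypothetical count_lines.cwl tool description with an input_file input and a counts output:

```yaml
# Illustrative workflow fragment: the same step runs once per input file.
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  fastq_files: File[]
steps:
  count_lines:
    run: count_lines.cwl        # hypothetical tool description
    scatter: input_file         # one invocation per element of fastq_files
    in:
      input_file: fastq_files
    out: [counts]
outputs:
  all_counts:
    type: File[]
    outputSource: count_lines/counts
```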
  • Be able to split a bash script into a CWL workflow

    • difference between a "control flow" (bash script) and a "data flow" (CWL - and others)
    • identify the inputs and outputs of the script
    • identify the tasks, i.e. the tools being run
    • identify the links, i.e. the data flowing in and out of these tools
    • (not sure about this one) identify and remove infrastructure-specific details (e.g. if tools are launched with SLURM commands in a bash script, or loaded with Docker or Conda)
  • Be able to explain the difference between a CWL tool description and a CWL workflow (description)

    • difference between a tool and the cwl-document that acts as a wrapper for that tool
    • tool wrapper document: describes input/output semantics of command line tool
    • workflow document: describes the input/output of a workflow and specifies the flow of data between tools
  • Know what a sub workflow is, how to make one, and when to use them. (KB, HM)

  • Be able to make understandable and valid names for inputs and outputs (not "input3")

    • always give use case-oriented names or names that describe the content
    • avoid naming them after the tool that produced them or the file format
    • so instead of:
      • fastq1
      • bam
      • bed
    • go for:
      • read1
      • aligned_reads
      • regions_of_interest
  • Describe all the requirements for running a tool: environment variables, and more

    • https://www.commonwl.org/v1.1/CommandLineTool.html#Runtime_environment
    • Assume the program (baseCommand) is in the system PATH
    • Aren't allowed to change the PATH
    • any other environment variables necessary for execution must be set explicitly
    • need a file next to (in the same directory as) another file? use secondaryFiles or InitialWorkDirRequirement
    • Don't hard code the number of threads, use $(runtime.cores)
    • Need network access? Not so great, but use NetworkAccess if you must
    • Referencing a file path? Use type: File not a string
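
A sketch pulling several of these points together (the tool name and secondary file suffix are assumptions for illustration only):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: my_aligner           # hypothetical; assumed to be on the PATH
requirements:
  EnvVarRequirement:
    envDef:
      ALIGNER_TMPDIR: $(runtime.tmpdir)   # set needed env vars explicitly
  ResourceRequirement:
    coresMin: 2
arguments: ["--threads", $(runtime.cores)]  # don't hard-code thread counts
inputs:
  reference:
    type: File                    # a File, not a plain string path
    secondaryFiles: [.fai]        # index file expected next to the reference
    inputBinding:
      position: 1
outputs: []
```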
  • Realize that a workflow is a dependency graph

  • Be able to include their own script as a step in a CWL workflow

    • make script executable and add to path
      • advantage: quick
      • downside:
        • the scripts are not shipped with the CWL document itself
        • not portable to other execution environments without installing the script first
    • pack the scripts into a docker container and make them executable there:
      • advantage: portable
      • downside:
        • only works for people that are using containers
    • use InitialWorkDirRequirement to list script content directly in the CWLToolWrapper
    • distribute as a separate package via pip / cran /
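
The InitialWorkDirRequirement option listed above, sketched with a made-up script body:

```yaml
# Ship the script inside the tool description so the CWL file is
# self-contained and portable.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [bash, count_reads.sh]
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: count_reads.sh
        entry: |
          #!/bin/bash
          # hypothetical script: count FASTQ records
          grep -c "^@" "$1"
inputs:
  reads:
    type: File
    inputBinding:
      position: 1
outputs:
  count:
    type: stdout
```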
  • Be able to graph/visualize their workflow, both by hand, and with an automated visualizer

    • use cwlviewer online
    • generate Graphviz diagram using cwltool
    • exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper
  • Work around a bad software container (one that requires being run as the root user inside the container, or expects certain directory paths) (FP, KB)

  • Be able to interpret CWL error messages to recognize and fix simple bugs in their workflow code

    • (this is difficult! will need to collect examples first)
  • Know how to use CWL v1.2 dynamic workflow conditionals (MRC, HM, TT)

    • Not yet released, expected in Q2 2020
  • Be able to document purpose, intent, and other factors within their workflow

    • "doc"
    • "label"
    • Workflow level, step level, sub workflow level, tool level, inputs and outputs everywhere.
    • Not required nor recommended to fill these out everywhere! Use as needed
    • Show example using the CWL viewer
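
A sketch of label/doc at the workflow and input level (all contents invented for illustration):

```yaml
cwlVersion: v1.0
class: Workflow
label: Read alignment demo
doc: |
  Aligns sequencing reads against a reference genome.
  Teaching example only.
inputs:
  reads:
    type: File
    label: Raw reads
    doc: Raw sequencing reads in FASTQ format
outputs: []
steps: {}
```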
  • Know that workflow development can be iterative, can involve sketching, prototypes; that it doesn't have to happen all at once

    • Whiteboard sketch
    • bash script
    • makefile
    • CWL
  • How CWL can help you give credit for all the tools you used
    See also https://github.com/common-workflow-language/cwl-utils/blob/master/cwl_utils/cite_extract.py

  • Run their workflow on a {Sun Grid Engine, LSF, Slurm,..} HPC system (FP, MRC, TT, MI)

    • explain the benefit and restrictions of HPC systems (high throughput, network restrictions, etc)
    • provide example options (e.g. Toil)
  • Convert a GNU Makefile to a CWL workflow (FP, MRC, TT, HM, KB, MI)

  • Be able to customize a workflow at any of the many levels

    • Change the input object
    • Change the default values at the workflow level
    • Change hard coded values at the workflow level
    • Change default value at the Workflow step level
    • Change hard coded values at the Workflow step level
    • Change default values in the CLT description
    • Change hard coded values in the CLT description
    • Change the container
    • Change the tool source itself
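
For example, a step-level override of a default value (all names are hypothetical):

```yaml
# Workflow fragment: the step-level default takes precedence over the
# default declared inside aligner.cwl for the same input.
steps:
  align:
    run: aligner.cwl            # hypothetical tool description
    in:
      reads: reads
      threads:
        default: 4              # step-level default value
    out: [alignments]
```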

Examples:

  • Locate controls for navigating data in OpenRefine
  • Find options to work with data through the OpenRefine dropdown menus
  • Split cells which contain multiple bits of data so that each piece of data is in its own cell

(from https://librarycarpentry.org/lc-open-refine/03-working-with-data/index.html)

  • Employ the grep command to search for information within files.
  • Print the results of a command to a file.
  • Construct command pipelines with two or more stages.

(from https://datacarpentry.org/shell-genomics/04-redirection/index.html)

More feedback from Manabu

Rabix Composer is interesting to people in Japan

  • It is shiny
  • attractive to non-command line users
  • (e.g. a medical doctor)
  • but has limitations like not supporting "stdout"
  • Not yet ready for daily use

VS Code live share + CWL language server
https://github.com/tom-tan/cwl-for-remote-container-template

CWL syntax (YAML-based) is quite different from other languages. After successfully writing one workflow, learners are more confident.

Also discussed the "zatsu" method (quick and dirty, but useful!)
https://github.com/tom-tan/zatsu-cwl-generator

Starting CWL with `Zatsu` method! - Qiita

  • At least for writing a CommandLineTool, the Zatsu method is a very useful starting point.
  • We will try to create a Zatsu method for CWL Workflows.

from Manabu:

  • Some people are confused about output files. In CWL we need to state in the outputs section which files are captured, but people assume every file is captured by default. The tutorial should cover this situation; it would be very helpful for people creating their own CWL workflows by themselves.
    • Is this already covered by the user guide?

Concept Maps

https://carpentries.github.io/instructor-training/05-memory/index.html

Other questions

What about teaching workflow thinking without teaching how to make CommandLineTool descriptions?

Still show the basics but don't spend time explaining how to make CLTs (link to the existing user guide)

Workflow composition would use only pre-made CLTs from public repos

Other lessons can be on how to adapt/extend a pre-made CLT

organizing the learning objectives

pre-BASH-to-CWL

  • understanding CLT Inputs & Outputs
  • CLT vs Workflow
  • naming (Workflow steps, Workflow Inputs & Outputs)

post-BASH-to-CWL

  • scatter
  • workflow is a dependency graph / parallelism
  • own script in workflow
  • visualize / graph a workflow
  • documenting a workflow
  • workflow development lifecycle (iterative, paper, ..)
  • customize a workflow (9+ levels)

https://docs.google.com/presentation/d/1aVdK8LHkgtESBunCQ-p7XmEl8NB9XbgDsH67X0_2HWg/edit?usp=sharing

2020-04-21 check-in meeting

Attendees: Toby, Fotis, Tom, Sehrish, Kersten, Michael

Review of work so far

https://common-workflow-lab.github.io/cwl-novice-tutorial/index.html

On 2020-03-26 a group did a day long virtual meeting.

Things were fuzzy at the beginning of the day: exact objectives, vision, target audience. We did get clarity on these, producing a list of learning objectives and learner personas.

Learners should have some command line experience, but not necessarily a lot. The learner profiles: https://common-workflow-lab.github.io/cwl-novice-tutorial/audience/

Feedback on those profiles via issues and pull requests is very welcome!

(After the March meeting Toby re-arranged the learning objectives into their current, easier to read form)

The learning objectives are at https://common-workflow-lab.github.io/cwl-novice-tutorial/#learning-objectives

Some could use more details. Again feedback via GitHub issues and edits are welcome!

At the March meeting, concept maps were made: https://docs.google.com/presentation/d/1aVdK8LHkgtESBunCQ-p7XmEl8NB9XbgDsH67X0_2HWg/edit?usp=sharing

Since the likely result is a relatively modular set of exercises, the concept maps can be helpful in determining the dependency tree of the exercises

Next steps

Write at least one exercise for each learning objective (each bullet point).

Writing the exercises will not be easy, but it makes writing the tutorial easier and is required.

It would be good to revisit the concept maps once some/most of the exercises have been written. Concept maps can also be used as an assessment tool by presenting an incomplete map and asking the learner to fill in the blanks, then analyzing the differences between the mental models of the instructor and the learners.

How to contribute exercises?

Toby+Michael will send out instructions next week. A video chat (2020-05-14 10:00-11:00 CEST) will guide contributors through the process. A recording will be made.

(Preview: each set of learning objectives will be moved to their own pseudo-lesson markdown file, then the exercises can be created/edited there)

Request: an overall checklist / progress tracker (not yet, will make issues with checkboxes: exercises, connections to concept maps, alpha-,beta-testing)

(the Carpentries Curriculum Development Handbook has a section on Designing Challenges, but it seems data analysis centric: https://carpentries.github.io/curriculum-development/designing-challenges.html#designing-challenges-1 ; Toby recommends http://teachtogether.tech/#s:exercises )

Fotis: Would it be nice to have a theme for the exercises, so that they become aligned? Like the "intro to Excel" Carpentries lesson and its narrative that connects the lessons and exercises.
(Yes to a theme before it is "finished", but it's not needed right now)

Toby: when we start writing the exercises, some themes will probably emerge. Writing an exercise that assesses what you want to assess is hard enough; swapping out the theme afterwards will not be so hard.

But I want to start today!

Feel free to comment on the learning objectives and start discussing/creating exercises for them! Markdown / plain text is fine, don't worry about formatting.

Fotis: Don't be afraid of duplicating effort; that isn't really possible here. Multiple exercises per learning objective are useful, as are different perspectives. He also starts on paper, not digitally.

Sehrish: What would an exercise look like? Just about writing CWL?

Toby: Check out http://teachtogether.tech/#s:exercises & http://teachtogether.tech/#s:models (and any other chapters that look interesting, like http://teachtogether.tech/#s:process) and the cognitive load & mental models parts of the Carpentries Curriculum Development Handbook
