# Group #2 CWL Hackathon

Fotis, Hervé, Robin, Xinhui

## CWL workflow descriptions

### Questions

- Key question (FIXME)

### Objectives

1. Explain the difference between a CWL tool description and a CWL workflow
2. Describe the relationship between a tool and its corresponding CWL document
3. Exercise good practices when naming inputs and outputs
4. Be able to make understandable and valid names for inputs and outputs (not ‘input3’)

CWL Workflows are about chaining different steps. A workflow is the orchestration of the individual steps.

## Definitions

> A CWL Tool description is a document that makes it possible to run a command line tool using a CWL runner (i.e. any type of workflow engine that is CWL compatible).

> A CWL Workflow description is a particular chaining of steps (running CWL Tools or CWL Workflows) that enables their execution with a CWL runner in an orchestrated way.

This is what you see in the header of a CWL Tool description (link: https://github.com/common-workflow-library/bio-cwl-tools/blob/release/bwa/BWA-Index.cwl):

```
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: "quay.io/biocontainers/bwa:0.7.17--ha92aebf_3"
  InlineJavascriptRequirement: {}

inputs:
  InputFile:
    type: File
    format: edam:format_1929 # FASTA
    inputBinding:
      position: 200

  IndexName:
    type: string
    inputBinding:
      prefix: "-p"
      valueFrom: $(self + ".bwt")

  #Optional arguments

  algoType:
    type:
      - "null"
      - type: enum
        symbols:
          - is
          - bwtsw
    inputBinding:
      prefix: "-a"

baseCommand: [bwa, index]

outputs:
  index:
    type: File
    outputBinding:
      glob: $(inputs.IndexName)

$namespaces:
  edam: http://edamontology.org/
$schemas:
  - http://edamontology.org/EDAM_1.18.owl
```

The key element is the `class: CommandLineTool` line near the top: it means that this document describes something that will be executed as a command line tool.
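
For contrast with the full BWA example, a minimal sketch of a tool description is given here. It assumes a hypothetical wrapper around the standard `echo` command (it is not taken from the bio-cwl-tools repository), and the input name `message` is chosen purely for illustration. It shows how little is needed around the `class` field for a document to count as a CWL Tool description.

```
cwlVersion: v1.0
class: CommandLineTool   # marks this document as a tool description
baseCommand: echo        # the executable that the CWL runner will call
inputs:
  message:               # hypothetical input name, for illustration only
    type: string
    inputBinding:
      position: 1        # passed as the first positional argument
outputs: []              # this sketch does not capture any output
```

Assuming the sketch is saved as `echo-tool.cwl` (a hypothetical file name), a CWL runner such as `cwltool` should be able to run it with `cwltool echo-tool.cwl --message "Hello"`.
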
This is the header of an example CWL Workflow description (link: https://github.com/bioexcel/biobb-wf-md-setup-protein-cwl/blob/master/protein_md.cwl):

```
cwlVersion: v1.0
class: Workflow
label: Example of setting up a simulation system
doc: |
  Common Workflow Language example that illustrate the process of setting up a
  simulation system containing a protein, step by step, using the BioExcel
  Building Blocks library (biobb). The particular example used is the Lysozyme
  protein (PDB code 1AKI).

inputs:
  step1_pdb_name: string
  step1_pdb_config: string
```

The key element here is (again) the class (`class: Workflow`): it implies that this is an orchestration that can be executed as a workflow by a CWL-compatible engine.

`Workflow` and `CommandLineTool` are the most commonly used classes in CWL. There are some additional classes, but they are very specific and their use is beyond the scope of this lesson.

**Exercise #0**

Shawn is looking at existing CWL Tools and Workflows to see how to write them. He knows that the `class` field specifies whether a CWL description is a workflow or a tool. Looking at the headers below, can you tell whether each one describes a tool or a workflow?

**A**: (from https://github.com/common-workflow-library/bio-cwl-tools/blob/release/GATK/GATK-ApplyBQSR.cwl)

```
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: "broadinstitute/gatk:4.1.1.0"
  InlineJavascriptRequirement: {}
```

**B**: (from https://github.com/h3abionet/h3agatk/blob/master/workflows/GATK/GATK-complete-WGS-Workflow-h3abionet.cwl)

```
#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.0

requirements:
  - class: StepInputExpressionRequirement
  - class: InlineJavascriptRequirement
```

**_TODO add more_**

_Solution_:

- A: Tool
- B: Workflow
- C: ???

**Exercise #1**

Jorge is designing a workflow A that will use the command line tools B, C and D. What would be the `class` listed in each of the corresponding files A, B, C and D?

_Solution_:

- B, C, and D: `class: CommandLineTool`
- A: `class: Workflow`

**Exercise #2**

Maria finds a workflow description that runs two successive workflows as steps. She wonders if this can actually run. Can you tell?

_Solution_: A workflow description may consist solely of steps that run other workflows, i.e. it does not need to call any specific tool directly. For this reason, the workflow that Maria found can run without any issues.

**Exercise #3**

Maria is reading a CWL file, but is unsure whether it contains a CWL Tool description. Can you see a Tool description in there?

_Solution_: Yes, it does contain a Tool description. CWL Workflows can call CWL Tool descriptions stored in separate files, but they can also have Tool descriptions embedded inside them. Embedding Tool descriptions inside workflows is valid CWL; however, it limits the reusability of the Tool descriptions, decreases readability, and makes debugging harder.

(_Note, to be removed: the following addresses the questions about the relationship between a tool and its corresponding CWL document_)

**Exercise #4**

Shawn wants to use `<a_tool>` in his workflow. He has found a CWL description for the tool, but it does not have all the options he expected. What may be the reason for this?

a) The CWL description is for an older version of `<a_tool>`.
b) The CWL description was never finished.
c) The author of the CWL description did not need those options when they wrote it.

_Solution_: All three options may be valid. It could be that the description is for an older version of `<a_tool>`, that the author did not finish writing the full CWL description, or that the author only needed some of the options. What is important to remember when using and creating CWL Tools is that they provide a clear interface for calling an executable from a CWL runner. A CWL Tool may not provide access to the full feature set of an executable; often it exposes only a subset of the options.
As a result, when looking for an existing CWL Tool for the software you want to run, you might find more than one. This is perfectly normal, and the choice between them is up to you and the requirements of your project. There are repositories that follow best practices for such descriptions (e.g. https://github.com/common-workflow-library/bio-cwl-tools), but you are free to use whichever description best fits your own needs.

As an example, both CWL Tool descriptions below correspond to the `bwa mem` tool. However, there are specific differences between them, in the types of parameters used and expected in each case as well as in the output files.

**Version 1**

```
#!/usr/bin/env cwl-runner

class: CommandLineTool
id: bwa-mem-0.7.8
label: bwa-mem-0.7.8
cwlVersion: v1.0

baseCommand: [ /tools/bwa-0.7.8/bwa, mem ]

inputs:
  reference:
    type: File
    label: FastA file for reference genome
  fastq1:
    type: File
    label: FastQ file from next-generation sequencers
  fastq2:
    type: File
    label: FastQ file from next-generation sequencers

outputs:
  sam:
    type: stdout

stdout: $(inputs.name).sam
stderr: $(inputs.name).sam.log
```

**Version 2**

```
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: "quay.io/biocontainers/bwa:0.7.17--ha92aebf_3"

inputs:
  InputFile:
    type: File[]
  Index:
    type: File
  Threads:
    type: int?
    inputBinding:
      prefix: "-t"
  MinSeedLen:
    type: int?
    inputBinding:
      prefix: "-k"
  BandWidth:
    type: int?
    inputBinding:
      prefix: "-w"
  VerboseLevel:
    type: int?
    inputBinding:
      prefix: "-v"

baseCommand: [bwa, mem]

stdout: unsorted_reads.sam

outputs:
  reads_stdout:
    type: stdout
```

**TODO: describe the overall structure of a workflow**

- steps
- dataflow between workflow inputs/outputs and step inputs/outputs

**TODO: exercise: ask the student to identify where the input of a given task comes from, where the output of a given task is fed into, and the same for workflow-level inputs/outputs**

(_Note, to be removed: the following addresses valid names and good practices for input and output names_)

Names for inputs and outputs, just like the names of variables in a program, should be unique and as self-documenting as possible, i.e. sensible and meaningful.

- Valid names: are any characters forbidden?

**Exercise 5**: Which of the following are valid choices for input and output names? And which ones are actually well designed as well?

- 2ndInput
- !WrongOne
- This is the correct input
- Output2
- BAMFile
- MetadataCSVFileInputForToolHiSat2AndRNASeqSundayEveningVersionDoNotModifyOrContactMe

_Solution_:

- 2ndInput: starts with a number
- !WrongOne: starts with a symbol
- This is the correct input: has spaces/gaps
- Output2: valid, but does not convey any useful information
- BAMFile: valid, and well designed
- MetadataCSVFileInputForToolHiSat2AndRNASeqSundayEveningVersionDoNotModifyOrContactMe: valid, but too much information

Regardless of naming conventions, always keep in mind that any input that is expected to be used, and any output that is expected to appear, needs to be explicitly defined in the corresponding CWL files (the tool and workflow description files).

#### Notes

Examples of exercises that assess "conceptual" understanding rather than specific skills/syntax:

1. [An exercise to assess understanding of the suitability of a static site generator for development of different types of website](https://carpentries-incubator.github.io/jekyll-pages-novice/introduction/index.html#exercise-the-perfect-tool-for-the-job)
2. [An exercise to encourage learners to think about how to design tests for their software, and to get them talking with their colleagues](https://swcarpentry.github.io/python-novice-inflammation/10-defensive/index.html#pre--and-post-conditions)
3.