<style>
.reveal {
font-size: 18px;
}
.reveal pre {
font-size: 24px;
}
.reveal section p {
text-align: left;
font-size: 18px;
line-height: 1.2em;
vertical-align: top;
}
.reveal section figcaption {
text-align: center;
font-size: 20px;
line-height: 1.2em;
vertical-align: top;
}
.reveal section h1 {
font-size: 26px;
vertical-align: top;
}
.reveal section h2 {
font-size: 24px;
line-height: 1.2em;
vertical-align: top;
}
.reveal section h3 {
font-size: 22px;
line-height: 1.2em;
vertical-align: top;
}
.reveal ul {
display: block;
}
.reveal ol {
display: block;
}
</style>

# Part 2: Streamlining Reproducible Data Analysis using Workflow Management Systems and Singularity Containers
Ivan E. Cao-Berg
Research Software Specialist
Pittsburgh Supercomputing Center
Carnegie Mellon University
---
## Before we begin
- :warning: Have an issue or question?
  - Feel free to ask during the presentation, in the chat, or on Slack
  - Send an email to the Help Desk at `support@psc.edu`
- :computer: What is the project charge ID?
  - `cis230059p`
- :computer: What is the reservation name?
  - `workshop`
- :computer: Where can I find the code and data?
  - The code and data are located in `/ocean/projects/cis230059p/shared`
  - The code can be found in this [repo](https://github.com/pscedu/workflow-examples)
- :computer: Where do I save my output?
  - You can save your output in `/ocean/projects/cis230059p/$(whoami)`.
- :computer: Where can I find the docs?
  - You can find the documentation [here](https://hackmd.io/@icaoberg/Ske8b00oh).
---
## Resources available during this experience
* 30 regular-memory compute nodes that can be accessed using SLURM from the partition named `RM-shared` and reservation `workshop`.
* If you do not wish to install software, you can use Open OnDemand to connect to Bridges 2 at `http://ondemand.bridges2.psc.edu`
* To connect to Bridges 2 use the official [documentation](https://www.psc.edu/resources/bridges-2/user-guide/#:~:text=Using%20your%20ssh%20client%2C%20connect,username%20and%20password%20when%20prompted).
---
## Introduction to Workflow Management Systems (WMS)
- **Workflow Management Systems (WMS)**
- Automate and manage complex computational workflows.
- **Key Features**
- Workflow Definition Language.
- Task Execution, Dependency Management.
- Parallelism, Distribution, Resource Management.
- Logging, Monitoring, Error Handling.
- Reproducibility, Integration with External Tools.
---
## Some Popular WMS
1. **[Apache Airflow](https://airflow.apache.org/).** Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It is commonly used for data pipeline automation, ETL (Extract, Transform, Load) processes, and workflow orchestration.
2. **[Apache NiFi](https://nifi.apache.org/).** Apache NiFi is an open-source data integration tool that provides an intuitive user interface for designing data flows. It can be used for data routing, transformation, and system mediation tasks.
3. **[Apache Oozie](https://oozie.apache.org/).** Apache Oozie is a workflow scheduler system designed for managing Hadoop jobs. It is commonly used in big data environments for coordinating and managing data processing workflows.
4. **Microsoft Azure Logic Apps.** Azure Logic Apps is a cloud-based workflow automation service provided by Microsoft Azure. It allows users to create and run workflows that integrate with various Azure services and external systems.
---
5. **[AWS Step Functions](https://aws.amazon.com/step-functions/).** AWS Step Functions is a serverless orchestration service offered by Amazon Web Services (AWS). It enables users to coordinate and manage workflows using AWS services and Lambda functions.
6. **IBM Workflow Automation.** IBM Workflow Automation is a platform for designing, executing, and monitoring workflows. It offers tools for business process automation and workflow management.
7. **Taverna.** Taverna is an open-source scientific workflow management system that is domain-agnostic. It is used in various scientific disciplines for designing and executing workflows.

These workflow management systems are versatile and can be applied to a wide range of use cases, including data integration, business process automation, cloud resource management, and more. The choice of a workflow management system often depends on the specific needs and requirements of the organization or project.
---
## Some domain-specific WMS
These systems offer a practical view of WMS benefits, automation, and implementation in the sciences.
1. **[Galaxy](https://usegalaxy.org/).** Galaxy is a web-based platform that provides a user-friendly interface for designing, running, and sharing bioinformatics workflows. It offers a wide range of tools and supports the analysis of various types of data.
2. **[Snakemake](https://snakemake.github.io/).** Snakemake is a workflow management system that uses a Python-based domain-specific language. It is known for its simplicity and flexibility, making it popular among bioinformaticians for defining and executing data analysis pipelines.
3. **[Nextflow](https://www.nextflow.io/).** Nextflow is a data-driven workflow management system that enables the creation of reproducible and scalable bioinformatics workflows. It uses a domain-specific language called DSL2, which is based on Groovy.
4. **[Cromwell](https://cromwell.readthedocs.io/en/stable/).** Cromwell is an open-source workflow management system developed by the Broad Institute. It is designed to work seamlessly with the Workflow Description Language (WDL) and is commonly used for large-scale genomics data analysis.
---
5. **[Toil](https://toil.readthedocs.io/en/latest/).** Toil is an open-source workflow engine developed by the University of California, Santa Cruz. It is designed for the execution of complex, scalable, and reproducible bioinformatics workflows.
6. **[Bioconductor](https://www.bioconductor.org/).** Bioconductor is a collection of R packages and tools specifically designed for the analysis and comprehension of high-throughput genomic data. While not a traditional workflow management system, it provides a framework for developing and executing bioinformatics workflows in R.
7. **[Common Workflow Language (CWL)](https://www.commonwl.org/).** CWL is not a specific workflow management system but a standardized way to describe and execute bioinformatics workflows. Several workflow engines, including Cromwell and Rabix/Benten, support CWL, making it a popular choice for interoperability.
8. **[WDL (Workflow Description Language)](https://github.com/openwdl/wdl).** WDL is a domain-specific language for describing bioinformatics workflows. It is used with Cromwell and other workflow engines that support WDL.

These workflow management systems are widely used in bioinformatics to automate and streamline the analysis of biological data, from genomics to proteomics and beyond. Researchers often choose a system based on their specific requirements, familiarity with the tools, and the nature of their data analysis tasks.
---
## CellOrganizer for Galaxy

**Figure** A Galaxy instance hosted on Bridges. Screenshot taken from the official CellOrganizer site.
---
## Apache Airflow

**Figure** HuBMAP uses Airflow for data ingestion. Screenshot taken from the official documentation.
---
## Other WMS
* Snakemake - Used in the Brain Image Library to build manifests from public data
* CWL - Used by HuBMAP for data ingestion and processing along with Apache Airflow
---
# Dive into Nextflow
* Nextflow is an open-source workflow management system designed for scientific and computational workflows.
* It simplifies the creation, execution, and sharing of workflows, addressing challenges related to reproducibility, portability, and scalability.
---
## Key Features
1. **Domain-Specific Language (DSL):** Nextflow provides a human-readable scripting language for defining complex workflows, allowing researchers to focus on scientific tasks.
2. **Portability:** Abstracts infrastructure details, making workflows portable across local machines, clusters, and cloud platforms.
3. **Reproducibility:** Encapsulates dependencies, ensuring consistent results across different execution environments.
4. **Parallel and Distributed Computing:** Supports scalable execution on clusters and parallel computing for handling large datasets.
---
5. **Containerization Support:** Integrates with Docker and Singularity, enabling the packaging of workflows with dependencies for consistency.
6. **Versioning:** Supports versioning of workflows and dependencies for tracking changes over time.
7. **Community and Collaboration:** Active community support and documentation facilitate collaboration, sharing best practices, and troubleshooting.
8. **Error Handling and Logging:** Provides robust error handling mechanisms and detailed logging for workflow debugging.
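---
## A minimal Nextflow example
The DSL described above can be sketched with a small, hypothetical pipeline. The process name, file paths, and glob pattern below are illustrative, not part of any real pipeline.

```
// Hypothetical single-step pipeline: counts lines in each input file.
nextflow.enable.dsl=2

process COUNT_LINES {
    input:
    path sample

    output:
    path 'counts.txt'

    script:
    """
    wc -l ${sample} > counts.txt
    """
}

workflow {
    // One task is launched per file matching the glob.
    samples = Channel.fromPath('data/*.txt')
    COUNT_LINES(samples)
}
```

Saving this as `main.nf` and running `nextflow run main.nf` would execute `COUNT_LINES` once per matching file, with Nextflow handling task isolation and caching.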
---
## Applications
Nextflow is widely used in various scientific domains, with a focus on **bioinformatics** and **genomics**. Its versatility and community support contribute to its popularity among researchers.
For more information, visit the [Nextflow GitHub repository](https://github.com/nextflow-io/nextflow).
---
## Popularity

---
## Pros and Cons
| **Pros** | **Cons** |
| ----------------------------- | ----------------------------------|
| 1. Portability | 1. **Learning Curve** |
| 2. Reproducibility | 2. Limited GUI |
| 3. Scalability | 3. Resource Overhead |
| 4. Containerization Support | 4. Dynamic Typing |
| 5. Versioning | 5. Dependency Management |
| 6. **Active Community** | 6. Lack of Native GUI |
| 7. Error Handling | 7. Limited Workflow Visualization |
| 8. DSL for Workflow Definition | |
---

**Figure** Nextflow has an active community of contributors but a steep learning curve.
---
# Install Nextflow on Bridges 2
```
change_primary_group cis230059p
cd /ocean/projects/cis230059p/$(whoami)
mkdir sdkman && ln -s $(pwd)/sdkman $HOME/.sdkman
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 17.0.6-amzn
```
---
# Install Nextflow on Bridges 2 (cont.)
```
if [ ! -d ~/bin ]; then mkdir ~/bin; fi
cd ~/bin && curl -s https://get.nextflow.io | bash
chmod +x ~/bin/nextflow
export PATH=~/bin:$PATH
```
---
# Install Nextflow on Bridges 2 (cont.)
```
nextflow -h
Usage: nextflow [options] COMMAND [arg...]
Options:
-C
Use the specified configuration file(s) overriding any defaults
-D
Set JVM properties
-bg
Execute nextflow in background
-c, -config
Add the specified file to configuration set
-config-ignore-includes
Disable the parsing of config includes
-d, -dockerize
Launch nextflow via Docker (experimental)
-h
Print this help
```
---
[nf-co.re](https://nf-co.re/)
**Figure** nf-core has a list of curated pipelines that work out of the box.
---
## [`nf-core/atacseq`](https://nf-co.re/atacseq/2.1.2) is a bioinformatics analysis pipeline used for ATAC-seq data.
* It uses Docker/Singularity containers making installation trivial and results highly reproducible.
* The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
* Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
---
[nf-co.re/atacseq/2.1.2](https://nf-co.re/atacseq/2.1.2)
---
## [`nf-core/atacseq`](https://nf-co.re/atacseq/2.1.2)
* Source code can be found [here](https://github.com/nf-core/atacseq).
* The workflow can be launched from [here](https://nf-co.re/atacseq/2.1.2).
---
# Before we submit the workflow
```
cat psc.config
singularity {
    enabled = true
}

process {
    executor = 'slurm'
    queue = 'RM'
}
```
**Figure**. A config file is needed to tell Nextflow which executor and partition to use.
---
# Before we submit the workflow (cont.)
```
cat script.sh
#!/bin/bash
#SBATCH -p RM-shared
export SDKMAN_DIR="$HOME/.sdkman"
[[ -s "$HOME/.sdkman/bin/sdkman-init.sh" ]] && source "$HOME/.sdkman/bin/sdkman-init.sh"
export PATH=~/bin:$PATH
export NXF_SINGULARITY_CACHEDIR=./containers
if [ ! -d ./containers ]; then mkdir ./containers; fi
nextflow run nf-core/atacseq -r 2.1.2 -profile test --outdir ./results -c psc.config
```
**Figure**. The last line is the basic run. Everything else is setting up the job for Bridges 2.
---
# And submit...
```
sbatch ./script.sh
Submitted batch job 19573905
```
---
## Waiting...
```
squeue -u icaoberg
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
19572572 RM nf-NFCOR icaoberg PD 0:00 1 (Resources)
19572573 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572574 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572575 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572605 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572604 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572602 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572600 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572599 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572598 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572594 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572593 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572592 RM nf-NFCOR icaoberg PD 0:00 1 (Priority)
19572480 RM-shared script.s icaoberg R 9:11 1 r021
```
---
## And what about the containers?
```
tree containers
containers
├── depot.galaxyproject.org-singularity-ataqv-1.3.1--py310ha155cf9_1.img
├── depot.galaxyproject.org-singularity-bedtools-2.30.0--hc088bd4_0.img
├── depot.galaxyproject.org-singularity-bwa-0.7.17--hed695b0_7.img
├── depot.galaxyproject.org-singularity-deeptools-3.5.1--py_0.img
├── depot.galaxyproject.org-singularity-fastqc-0.11.9--0.img
├── depot.galaxyproject.org-singularity-homer-4.11--pl526hc9558a2_3.img
├── depot.galaxyproject.org-singularity-khmer-3.0.0a3--py37haa7609a_2.img
├── depot.galaxyproject.org-singularity-macs2-2.2.7.1--py38h4a8c8d9_3.img
├── depot.galaxyproject.org-singularity-mulled-v2-0560a8046fc82aa4338588eca29ff18edab2c5aa-5687a7da26983502d0a8a9a6b05ed727c740ddc4-0.img
├── depot.galaxyproject.org-singularity-mulled-v2-57736af1eb98c01010848572c9fec9fff6ffaafd-402e865b8f6af2f3e58c6fc8d57127ff0144b2c7-0.img
├── depot.galaxyproject.org-singularity-mulled-v2-8186960447c5cb2faa697666dc1e6d919ad23f3e-3127fcae6b6bdaf8181e21a26ae61231030a9fcb-0.img
├── depot.galaxyproject.org-singularity-mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40-219b6c272b25e7e642ae3ff0bf0c5c81a5135ab4-0.img
├── depot.galaxyproject.org-singularity-perl-5.26.2.img
├── depot.galaxyproject.org-singularity-picard-3.0.0--hdfd78af_1.img
├── depot.galaxyproject.org-singularity-python-3.8.3.img
├── depot.galaxyproject.org-singularity-samtools-1.16.1--h6899075_1.img
├── depot.galaxyproject.org-singularity-samtools-1.17--h00cdaf9_0.img
├── depot.galaxyproject.org-singularity-trim-galore-0.6.7--hdfd78af_0.img
├── depot.galaxyproject.org-singularity-ubuntu-20.04.img
└── depot.galaxyproject.org-singularity-ucsc-bedgraphtobigwig-445--h954228d_0.img
```
---
## Exercise. [nf-co.re/bamtofastq](https://nf-co.re/bamtofastq/2.0.0)
Go to https://nf-co.re/bamtofastq/2.0.0 and explore the pipeline. Then go to the exercise document and fix the problems described there for this workflow.
---
## Dive into Snakemake
* Snakemake is an open-source workflow management system written in Python.
* It is designed to create and execute reproducible and scalable data analysis workflows.
* Snakemake utilizes a human-readable and expressive domain-specific language (DSL) to define workflows, making it accessible to both bioinformaticians and researchers in various domains.
---
## Key Features
1. **Declarative Workflow Definition:** Workflows are defined in a declarative manner, specifying input, output, and the steps to transform data.
2. **Rule-Based Workflow Execution:** Workflow steps are defined as rules, and Snakemake automatically determines the execution order based on dependencies.
3. **Parallel and Cluster Computing:** Snakemake supports parallel and cluster computing, enabling efficient execution of tasks across multiple cores or nodes.
4. **Conda Integration:** Seamless integration with Conda allows easy management of software dependencies, enhancing reproducibility.
5. **Logging and Error Handling:** Detailed logging and error handling mechanisms aid in debugging and troubleshooting workflows.
6. **Community Support:** With an active user community, Snakemake benefits from ongoing development, documentation, and community-driven support.
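---
## A minimal Snakemake example
The rule-based model described above can be sketched with a small, hypothetical Snakefile. The rule names and file paths are illustrative only.

```
# Hypothetical Snakefile: counts lines in a sample file.
# The target rule; Snakemake works backwards from its inputs.
rule all:
    input:
        "results/counts.txt"

# Produces the file that "all" requires.
rule count_lines:
    input:
        "data/sample.txt"
    output:
        "results/counts.txt"
    shell:
        "wc -l {input} > {output}"
```

With this Snakefile in the working directory, `snakemake --cores 1` would build `results/counts.txt`; `snakemake -n` shows the planned jobs as a dry run first.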
---
## Applications
Snakemake is widely used in bioinformatics, genomics, and data analysis workflows. Its flexibility and scalability make it suitable for various scientific domains.
For more information, visit the [Snakemake GitHub repository](https://github.com/snakemake/snakemake).
---
## Popularity

---
# Pros and Cons
| **Pros** | **Cons** |
| ---------------------------------------------------- | ------------------------------------------------------ |
| 1. Clear and readable syntax. | 1. Initial learning curve. |
| 2. Workflow steps defined declaratively. | 2. Requires Python. |
| 3. Efficient handling of dependencies. | 3. Lacks a graphical interface. |
| 4. Enhances reproducibility. | 4. Workflows can be verbose. |
| 5. Robust error handling mechanisms. | 5. Limited support for dynamic workflow modification. |
| 6. Allows incorporation of custom scripts and modules. | |
| 7. Benefits from ongoing development. | |
| 8. Suitable for various domains. | |
| 9. Simplifies dependency management. | |
---

**Figure** Snakemake is widely popular outside of the genomics realm.
---
[Snakemake workflow catalog](https://snakemake.github.io/snakemake-workflow-catalog/)
---
# Install Snakemake on Bridges 2
```
change_primary_group cis230059p
cd /ocean/projects/cis230059p/$(whoami)
module load anaconda3
pip install snakemake --user -q
```
---
# Install Snakemake on Bridges 2 (cont.)
```
snakemake --help
usage: snakemake [-h] [--dry-run] [--profile PROFILE]
[--workflow-profile WORKFLOW_PROFILE] [--cache [RULE ...]]
[--snakefile FILE] [--cores [N]] [--jobs [N]]
[--local-cores N] [--resources [NAME=INT ...]]
[--set-threads RULE=THREADS [RULE=THREADS ...]]
[--max-threads MAX_THREADS]
[--set-resources RULE:RESOURCE=VALUE [RULE:RESOURCE=VALUE ...]]
[--set-scatter NAME=SCATTERITEMS [NAME=SCATTERITEMS ...]]
[--set-resource-scopes RESOURCE=[global|local]
[RESOURCE=[global|local] ...]]
[--default-resources [NAME=INT ...]]
[--preemption-default PREEMPTION_DEFAULT]
[--preemptible-rules PREEMPTIBLE_RULES [PREEMPTIBLE_RULES ...]]
[--config [KEY=VALUE ...]] [--configfile FILE [FILE ...]]
[--envvars VARNAME [VARNAME ...]] [--directory DIR] [--touch]
```
---
## Exercises.
Click [here](https://hackmd.io/@icaoberg/SkeHG6Kxa).
---
## Dive into CWL
* The **Common Workflow Language (CWL)** is an open standard designed to address the challenges associated with describing and executing data analysis workflows.
* It aims to make workflows portable and scalable across various computing environments.
* Unlike Snakemake, which builds workflows on a tool-specific DSL, CWL places the emphasis on the language itself rather than on any single execution engine.
---
## Objectives
Some of the primary objectives of CWL include:
- **Describe workflows:** Provide a standardized way to describe data analysis workflows, ensuring clarity and consistency.
- **Portability:** Enable the portability of workflows across different computing environments, such as local machines, clusters, and cloud services.
- **Scalability:** Facilitate the scalability of workflows to handle varying computational demands.
---
## Key Features
1. **Platform Independence:** Workflows can be executed on diverse computing platforms without modification.
2. **Tool and Platform Agnosticism:** CWL does not dictate specific tools or platforms, focusing instead on the relationships between workflow steps.
3. **Reproducibility:** Ensure the reproducibility of analyses across different systems, enhancing the reliability of scientific research.
4. **Accessibility:** Designed to be human-readable and writable, making it accessible to researchers and bioinformaticians.
5. **Community-Driven:** Developed collaboratively with input from researchers, developers, and organizations, ensuring a broad perspective.
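---
## A minimal CWL example
The tool-agnostic style described above can be sketched with a small, hypothetical `CommandLineTool` description. The tool wraps `wc -l`; the input name and output file are illustrative only.

```
# Hypothetical CWL tool: counts lines in an input file.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  sample:
    type: File
    inputBinding:
      position: 1
stdout: counts.txt
outputs:
  counts:
    type: stdout
```

Saved as `count_lines.cwl`, this could be run with `cwl-runner count_lines.cwl --sample data/sample.txt`, and the same description would work with any CWL-compliant engine.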
---
## Applications
- **Bioinformatics:** Particularly in genomics and other areas where complex analyses involve multiple tools and data processing steps.
- **Scientific research:** In various scientific domains, CWL facilitates the coordination of diverse computational tasks such as data ingestion and processing.
---
## Benefits
- **Standardization:** Enables the creation of standardized workflows, enhancing collaboration and interoperability.
- **Reproducibility:** Ensures that analyses are reproducible across different computing environments.
- **Collaboration:** Facilitates collaboration by providing a common language for describing and sharing workflows. For example, in theory, a CWL workflow that works on Galaxy could be used in Airflow.
---
# Pros and Cons
| **Pros** | **Cons** |
| ----------------------------------- | ------------------------------------------ |
| **1. Standardization:** Facilitates the creation of standardized workflows, promoting consistency and interoperability. | **1. Learning Curve:** Users may face a learning curve to understand and implement CWL, particularly if they are new to workflow description languages. |
| **2. Reproducibility:** Ensures the reproducibility of analyses across different computing environments, enhancing the reliability of scientific research. | **2. Initial Setup:** Setting up CWL for a specific environment may require additional configuration and setup effort. |
| **3. Collaboration:** Facilitates collaboration by providing a common language for describing and sharing workflows, fostering teamwork. | **3. Limited Tool Support:** Not all tools may have native support for CWL, potentially limiting its applicability in certain scenarios. |
| **4. Platform Independence:** Allows workflows to be executed on various computing platforms without modification, enhancing flexibility. | **4. Adoption Challenges:** The adoption of CWL might face resistance in environments accustomed to different workflow management systems. |
| **5. Community Support:** Being community-driven, CWL benefits from diverse perspectives, continuous improvement, and a supportive community. | **5. Overhead:** In some cases, the detailed description required by CWL may introduce additional overhead compared to simpler workflow languages. |
---
[CWL repositories](https://www.commonwl.org/repos/)
---
# Installing cwl-runner on Bridges 2
```
change_primary_group cis230059p
cd /ocean/projects/cis230059p/$(whoami)
module load anaconda3
pip install cwl-runner --user -q
```
---
# Installing cwl-runner on Bridges 2 (cont.)
```
cwl-runner -h
usage: cwl-runner [-h] [--basedir BASEDIR] [--outdir OUTDIR]
[--log-dir LOG_DIR] [--parallel]
[--preserve-environment ENVVAR | --preserve-entire-environment]
[--rm-container | --leave-container]
[--cidfile-dir CIDFILE_DIR]
[--cidfile-prefix CIDFILE_PREFIX]
[--tmpdir-prefix TMPDIR_PREFIX]
[--tmp-outdir-prefix TMP_OUTDIR_PREFIX | --cachedir CACHEDIR]
[--rm-tmpdir | --leave-tmpdir]
[--move-outputs | --leave-outputs | --copy-outputs]
[--enable-pull | --disable-pull]
[--rdf-serializer RDF_SERIALIZER]
[--eval-timeout EVAL_TIMEOUT] [--provenance PROVENANCE]
[--enable-user-provenance] [--disable-user-provenance]
[--enable-host-provenance] [--disable-host-provenance]
```
---
## Exercises.
Click [here](https://hackmd.io/@icaoberg/SkeHG6Kxa).
---
## Take home lesson
* Spend time building well-defined Dockerfiles and Singularity recipes
* Post your containers to Sylabs, Docker Hub, or a similar registry
* Independently of the system or language used, make sure to spend some time designing your workflow (design is just as important as implementation)
* These are not the only systems that work on Bridges 2
---
# Conclusion
* Nextflow has curated and well-maintained workflows for those looking for out-of-the-box solutions.
* Snakemake is a great starter system for those familiar with Python.
* CWL is a standard, not a tool, and workflows written in CWL should be compatible across engines that support it, for example Galaxy and Airflow.
* While not covered in this workshop, know that workflows can scale down to a laptop or desktop and scale up to clusters as well.
---
# But WHY?
Going back to the FAIR principles:
* Publishing source code with data or making a repo public does not, by itself, fix the issue of reproducibility.
* Having a set of stable containers that a workflow can use to reproduce the results of a published paper is, in my opinion, going to become the norm.