# Abstract
In the era of big data, computational pipelines have become indispensable for efficiently processing and analyzing vast amounts of data. With the advent of high-performance computing systems like Bridges-2, researchers now have access to unprecedented computing power and resources. However, designing and executing data-driven computational pipelines on such systems can be challenging. This presentation explores the advantages and some use cases of three popular workflow management systems: NextFlow, Snakemake, and cwltool, all within the context of Bridges-2. These systems provide a streamlined approach to building scalable and reproducible computational pipelines for processing biological data. We will also discuss best practices for deploying these systems on Bridges-2, including resource management, job scheduling, and data management strategies, and address the challenges and potential solutions encountered when integrating these workflow management systems with Bridges-2’s unique features and constraints. By the end of this presentation, attendees will have a general understanding of NextFlow, Snakemake, and cwltool, and of how these frameworks can empower researchers to build robust and scalable data-driven computational pipelines on Bridges-2.
Speaker Biography: Ivan Cao-Berg is a research software specialist in the Biomedical Applications Group, tinkering with technology in science-related projects. At the moment, Ivan is involved in several projects, including HuBMAP, the Brain Image Library, and SenNet, and, on occasion, Bridges-2.
## What are workflows?
In computational workflows, individual tasks or steps are organized in a logical order, where the output of one task serves as the input for the subsequent task. This allows for the creation of **reproducible** processes that can be executed reliably and efficiently. Workflows can be designed to handle a wide range of tasks, including data processing, analysis, simulation, modeling, and decision-making.
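As a minimal sketch of such a chain, here is a hypothetical two-rule Snakemake workflow; the file names and shell commands are illustrative and not taken from the presentation:

```
# Hypothetical two-step pipeline: "summarize" consumes the file that
# "filter" produces, so the runner must execute "filter" first.
rule all:
    input:
        "results/summary.txt"

# Step 1: drop comment lines from the raw data
rule filter:
    input:
        "data/raw.csv"
    output:
        "results/filtered.csv"
    shell:
        "grep -v '^#' {input} > {output}"

# Step 2: count the remaining records
rule summarize:
    input:
        "results/filtered.csv"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```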
There are different types of computational workflows, including procedural workflows, data-driven workflows, and model-driven workflows:
* Procedural workflows: These workflows follow a predefined sequence of steps or procedures. Each step specifies the input requirements, the processing to be performed, and the output produced. Procedural workflows are often used in scientific simulations or data processing tasks.
* Data-driven workflows: These workflows focus on the flow and manipulation of data. They utilize data dependencies to determine the order in which tasks should be executed. Data-driven workflows are common in data analysis and data mining applications (see the sketch after this list).
* Model-driven workflows: These workflows incorporate mathematical or computational models as the core components. The tasks within the workflow involve creating, configuring, or executing these models. Model-driven workflows are often used in simulation studies or predictive modeling.
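Snakemake, benchmarked below, is data-driven in exactly this sense: it derives the execution order from input/output file dependencies rather than from the order in which rules appear in the Snakefile. Assuming the two-rule sketch above is saved as `Snakefile`, the inferred plan can be inspected before anything runs, using standard Snakemake options:

```
# Show which jobs would run, and why, without executing anything
snakemake --cores 1 --dry-run

# Print the inferred dependency graph in DOT format and render it with Graphviz
snakemake --cores 1 --dag | dot -Tpdf > dag.pdf
```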
## Snakemake
### fortune
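The Snakefile driving the benchmark below is not reproduced here; a minimal version consistent with the command being timed would be a single rule that shells out to `fortune` (the rule name is an assumption):

```
# Hypothetical Snakefile for the fortune example: one rule, no inputs or
# outputs, that simply prints a random fortune
rule fortune:
    shell:
        "fortune fortunes"
```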
#### Overhead
```
(base) ➜ fortune hyperfine --warmup 10 'fortune fortunes' 'snakemake --cores 1 --printshellcmds' -i --export-json fortune.json
Benchmark 1: fortune fortunes
  Time (mean ± σ):      62.3 ms ±   2.9 ms    [User: 35.3 ms, System: 17.5 ms]
  Range (min … max):    56.3 ms …  70.5 ms    47 runs

Benchmark 2: snakemake --cores 1 --printshellcmds
  Time (mean ± σ):     633.3 ms ±  37.3 ms    [User: 368.9 ms, System: 162.4 ms]
  Range (min … max):   579.7 ms … 726.6 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  fortune fortunes ran
   10.17 ± 0.77 times faster than snakemake --cores 1 --printshellcmds
```
The result: Snakemake adds roughly 570 ms of fixed overhead per invocation on this machine, making the bare `fortune` command about 10× faster than the equivalent single-rule workflow. For long-running pipeline steps this startup cost is negligible, but it can dominate when a workflow consists of many very short jobs.