# Tuning GPU parameters for Guppy performance

*Author:* [Miles Benton](https://sirselim.github.io/) ([GitHub](https://github.com/sirselim); [Twitter](https://twitter.com/miles_benton); [Gmail](miles.benton84@gmail.com))
*Created:* 2022-02-08 16:56:50
*Last modified:* 2022-02-11 20:24:06

###### tags: `Nanopore` `GPU` `notes` `documentation` `benchmarks` `Guppy` `optimisation` `parameter sweep`

----

Nvidia GPUs drastically accelerate the basecalling rate of Nanopore data. This is great, but not all GPUs are built equally, meaning that sometimes you'll need to tweak specific parameters to either increase performance or, on the other side of the coin, tune parameters down so that a lower spec'd card can work. Some examples of this:

* you have a card with more than 8GB of RAM (such as an RTX3080, RTX3090 or A5000) and you want to get more basecalling performance from the high accuracy (HAC) and super accuracy (SUP) models.
* you have a card with less than 8GB of RAM (such as a GTX1660, RTX2060 or RTX3050) and you want to be able to use Guppy for live basecalling or running the HAC/SUP models without running into CUDA memory issues.

For some time I've been meaning to put together something more structured around how I go about tuning Guppy parameters to get the most out of particular GPUs for the Nanopore basecalling work that we do. This document is an attempt at that.

## Preface

This "guide"/document assumes that you already have a GPU-enabled version of Guppy set up and running on your system. This includes current Nvidia GPU drivers and a CUDA installation (both required to run Guppy GPU). For the purpose of this exercise I am using the current latest release of Guppy (version 6.0.1), available via the [software page](https://community.nanoporetech.com/downloads) of the ONT Community forum.

The machine I am using to run this "tutorial" on doesn't run MinKNOW, i.e. it's not used for sequencing and/or live basecalling. As such it isn't running the Guppy basecall server, so I'm using the `guppy_basecaller` executable. If there is a basecall server present and running on a machine it is recommended to use `guppy_basecall_client` instead of `guppy_basecaller`. If you are unsure whether you are running the basecall server you can easily check with the `nvidia-smi` commandline tool:

```shell=
nvidia-smi
Wed Feb  9 11:09:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:18:00.0 Off |                  N/A |
| 46%   58C    P8    14W / 280W |    464MiB / 24215MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                 GPU Memory |
|        ID   ID                                                  Usage      |
|=============================================================================|
|    0   N/A  N/A      1684      G   /usr/lib/xorg/Xorg               289MiB |
|    0   N/A  N/A      7033      G   cinnamon                          27MiB |
|    0   N/A  N/A      7250      G   ...AAAAAAAAA= --shared-files      10MiB |
|    0   N/A  N/A     15803      G   /usr/lib/firefox/firefox         122MiB |
|    0   N/A  N/A     16255      G   /usr/lib/firefox/firefox           3MiB |
|    0   N/A  N/A     16480      G   /usr/lib/firefox/firefox           3MiB |
|    0   N/A  N/A     16491      G   /usr/lib/firefox/firefox           3MiB |
+-----------------------------------------------------------------------------+
```

If you have the server running you will see it in the output, along with the amount of GPU RAM that has been allocated to it.

### Monitoring GPU usage

To monitor GPU usage you have several options. The above mentioned `nvidia-smi` is a fairly good tool to do this. Better yet is `nvtop` ([link](https://github.com/Syllo/nvtop)), which is like `htop` in that it gives a nice graphical overview of GPU information. On Debian/Ubuntu systems it's as easy as `sudo apt install nvtop` to get it installed and running.

Example `nvtop` view:

![](https://i.imgur.com/RzIydJq.png)

Another alternative is `gpustat` ([link](https://github.com/wookayin/gpustat)), a great little tool that displays very minimal information in a nice way. I've chosen to use `gpustat` as part of my testing as I found it easy to integrate into a small bash script, which allowed me to automate the optimisation process (more on this later). To install `gpustat` you can just use pip as per below:

```shell=
pip install git+https://github.com/wookayin/gpustat.git@master
```

Example of running `gpustat`:

![](https://i.imgur.com/n6iqL0Q.png)

## Example / "benchmarking" data set

I have provided a small set of example data (fast5 format) that I have been using in testing, development and benchmarking of various projects. It is hosted via MEGA.co.nz, and can be downloaded manually or via the command line.

Link to the small subset of fast5 data for manual download: ([link](https://mega.nz/file/nAkFHAZR#hFc2ELBxNlXV8MfGaAuuP8nXfoEHBwvk1obnO-LkZTI))

### Install megatools

`megatools` is a cli program that allows terminal-based access to MEGA.co.nz hosted files/data. It's straightforward to install on Debian/Ubuntu systems:

```shell=
sudo apt update
sudo apt install megatools
```

### Download the example data

We can now use `megatools` to download the example fast5 data:

```shell=
megadl https://mega.nz/file/nAkFHAZR#hFc2ELBxNlXV8MfGaAuuP8nXfoEHBwvk1obnO-LkZTI
```

Once downloaded, extract the data and you're ready to go.

## Running Guppy

I am again making an assumption here that the user is familiar with Guppy, particularly on the commandline. If you are not, there are plenty of tutorials out there, including the documentation available via the ONT Community forums. I will include basic code for running a fairly standard post-sequencing basecalling run on the example set of fast5 files. I won't go into the details of most of the specific parameters unless they are being modified to alter basecalling performance.
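Throughout the runs below I keep half an eye on GPU memory and utilisation while Guppy is working, since that is ultimately what drives the tuning decisions. A rough sketch of how to do this from a second terminal (assuming `gpustat` is installed as above; the `watch` and `nvidia-smi` invocations here are just my own habits, not anything Guppy requires):

```shell=
# refresh gpustat every second while a basecalling run is going;
# -c shows the process name and -p the PID, handy for spotting Guppy
watch --color -n 1 gpustat -cp

# or log memory use and utilisation once a second with plain nvidia-smi
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 1
```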
## Linux workstation specifications

This isn't going to be super important, it's more a point of reference as to the system that is generating the results below. Please note that in this instance I'm running an Nvidia Titan RTX GPU, a card with 24GB of RAM. This means that I'm going to be concentrating on improving performance. If I get time I would like to add a later section on down tuning for lower spec'd GPUs; until I get around to that, I have added a comment about what we have done with the Jetson Xavier boards to get them basecalling nicely (this is in a section towards the end of this document).

Lenovo P920 ThinkStation

* CPU: 2x 12-core Intel Xeon CPUs (48 threads)
* MEMORY: 512GB RAM
* GPU: Nvidia Titan RTX 24GB RAM
* Storage: 2x 1TB SSD, 6x 4TB HDD

### Default settings

The below represents the baseline performance of the system running a Titan RTX GPU for each of the basecalling models (FAST, HAC, SUP). This is running every parameter in Guppy on default, no changes at all. The metric we are interested in is the number of samples per second (samples/s): the higher this number, the faster the rate of basecalling and the sooner your basecalling job will be finished.

#### FAST

```shell=
$ guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
    -i ../example_fast5/ \
    -s titanRTX_fastq_fast_6.0.1 \
    -x 'auto'
ONT Guppy basecalling software version 6.0.1+652ffd1
config file:        /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file:         /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path:         ../example_fast5
save path:          titanRTX_fastq_fast_6.0.1
chunk size:         2000
chunks per runner:  160
minimum qscore:     8
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 8
Found 10 fast5 files to process.
Init time: 715 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 34878 ms, Samples called: 1150014823, samples/s: 3.29725e+07
Finishing up any open output files.
Basecalling completed successfully.
```

The FAST model completes with a samples/s value of `3.29725e+07`.

#### HAC

```shell=
$ guppy_basecaller -c dna_r9.4.1_450bps_hac.cfg \
    -i ../example_fast5/ \
    -s titanRTX_fastq_hac_6.0.1 \
    -x 'auto'
ONT Guppy basecalling software version 6.0.1+652ffd1
config file:        /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ../example_fast5/
save path:          titanRTX_fastq_hac_6.0.1
chunk size:         2000
chunks per runner:  256
minimum qscore:     9
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 4
Found 10 fast5 files to process.
Init time: 1153 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 258948 ms, Samples called: 1150014823, samples/s: 4.4411e+06
Finishing up any open output files.
Basecalling completed successfully.
```

The HAC model completes with a samples/s value of `4.4411e+06`. I include a screenshot from nvtop to illustrate monitoring of GPU performance and RAM usage. You'll note that with HAC we're using ~8.5GB of the card's total 24GB. This will hopefully give you an idea as to why ONT recommend cards with 8+ GB of GPU RAM for analysis.
![](https://i.imgur.com/Yd61OTo.png)

#### SUP

```shell=
$ guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg \
    -i ../example_fast5/ \
    -s titanRTX_fastq_sup_6.0.1 \
    -x 'auto'
ONT Guppy basecalling software version 6.0.1+652ffd1
config file:        /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/dna_r9.4.1_450bps_sup.cfg
model file:         /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/template_r9.4.1_450bps_sup.jsn
input path:         ../example_fast5/
save path:          titanRTX_fastq_sup_6.0.1
chunk size:         2000
chunks per runner:  256
minimum qscore:     10
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 12
Found 10 fast5 files to process.
Init time: 2094 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 273577 ms, Samples called: 1150014823, samples/s: 4.20362e+06
Finishing up any open output files.
Basecalling completed successfully.
```

The SUP model completes with a samples/s value of `4.20362e+06`.

![](https://i.imgur.com/jTsJPCL.png)

It can be seen that SUP uses significantly more RAM.

### Parameter tuning for performance

OK, time to get into it. The basic approach we're going to use here is to essentially pick the one parameter (chunks_per_runner) that seems to make the most difference to basecalling rate. Before getting to this it's probably worth defining a few of the parameters. The most useful parameters to note are probably as follows (as documented by ONT in the Guppy manual):

* **Chunks per caller** (--chunks_per_caller): A soft limit on the number of chunks in each basecaller's chunk queue. When a read is sent to the basecaller, it is broken up into "chunks" of signal, and each chunk is basecalled in isolation. Once all the chunks for a read have been basecalled, they are combined to produce a full basecall. --chunks_per_caller sets a limit on how many chunks will be collected before they are dispatched for basecalling. On GPU platforms this is an important parameter to obtain good performance, as it directly influences how much computation can be done in parallel by a single basecaller.
* **Number of parallel callers** (--num_callers): Number of parallel basecallers to create. A thread will be spawned for each basecaller to use. Increasing this number will allow Guppy to make better use of multi-core CPU systems, but may impact overall system performance.
* **GPU device** (-x or --device): Specify a GPU device to use in order to accelerate basecalling. If this option is not selected, Guppy will default to CPU usage. You can specify one or more devices, as well as optionally limiting the amount of GPU memory used (to leave space for other tasks to run on GPUs). GPUs are counted from zero, and the memory limit can be specified as a percentage of total GPU memory or as a size in bytes.
* **Chunk size** (--chunk_size): Set the size of the chunks of data which are sent to the basecaller for analysis. Chunk size is specified in signal blocks, so the total chunk size in samples will be chunk_size * event_stride.
* **Max chunks per runner** (--chunks_per_runner): The maximum number of chunks which can be submitted to a single neural network runner before it starts computation. Increasing this figure will increase GPU basecalling performance when it is enabled.
* **Number of GPU runners per device** (--gpu_runners_per_device): The number of neural network runners to create per CUDA device.
Increasing this number may improve performance on GPUs with a large number of compute cores, but will increase GPU memory use. This option only affects GPU calling.

I have done a ***LOT*** of tweaking and, as mentioned above, from my experience it really comes down to the --chunks_per_runner parameter. Time for my very lay approach at understanding this... I think of it this way: this parameter dictates how much signal data (chunks) can be held in GPU memory, ready to be passed on to the GPU itself (the processor), where it will get basecalled. As data already sitting in GPU RAM gets to the processor faster, it makes sense that the more data you can provide in this manner, the faster your overall basecalling rate is going to be (obviously to a point). So the more RAM you have, the more data you can hold close to the processor, and the faster you can move it back and forth. It's probably more complicated than this, but hey, it works for me!

#### OK, let's get going!

So let's forget all of the above and just concentrate on --chunks_per_runner. The goal here is to see what our default basecalling rate is, and then see if we can improve on it (speed it up). I'm going to initially concentrate on the HAC model, mainly because it allows me to get results rather quickly. For "fun", and to demonstrate that sometimes parameter tuning does nothing, I also ran through the approach with the FAST model.

##### FAST

I haven't been able to find a set of parameter tweaks/optimisations that result in significant increases in the basecalling rate when using the FAST model (see table below). Most people using FAST are doing so in a live basecalling context, and as such the majority of the time there is no need to improve the basecalling rate - it becomes a case of the basecalling rate being limited by the speed of the sequencing. Therefore, moving forwards, this guide will concentrate on the other models, HAC and SUP.

| chunks_per_runner | basecalling rate (samples/s) |
|------------------:|------------------------------|
|               160 | 3.30911e+07                  |
|               256 | 3.30645e+07                  |
|               512 | 3.33125e+07                  |
|               768 | 3.36026e+07                  |
|              1024 | 3.3017e+07                   |

The above table demonstrates the influence of changing the number of chunks_per_runner on the overall basecalling rate. As can be seen, increasing this in the context of the FAST model provides little improvement in basecalling rate.

##### HAC

OK, let's try something more meaningful. If we have a card with more than 8GB of RAM it should be able to complete HAC runs at the default --chunks_per_runner setting of 256. We can monitor our RAM usage during this run and see where it is sitting. On the Titan RTX this hovered just above the 8GB mark, leaving a lot of headroom (another ~16GB). So the aim here is to increase the value of chunks_per_runner while keeping an eye on GPU resources. If you start to hit CUDA (or similar) errors, dial the value back a bit. The goal is to find a value that gives a good boost in basecalling rate but doesn't use all the GPU memory. For the Titan RTX I ran this through the following values: 256 (default), 512, 786, 1024, 1124. I noticed that I was starting to reach a point where a) I was getting towards max RAM capacity and b) I wasn't seeing any further positive impact on basecalling rate - in fact it was getting worse in some cases.
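Each value in the sweep boils down to a run of the same general form, just swapping the --chunks_per_runner value (the output directory name here is simply my own convention for keeping runs separate, not anything Guppy requires):

```shell=
# example sweep step: HAC model with chunks_per_runner bumped to 512,
# run while watching GPU RAM in another terminal (nvtop/gpustat)
guppy_basecaller -c dna_r9.4.1_450bps_hac.cfg \
    -i ../example_fast5/ \
    -s titanRTX_fastq_hac_6.0.1_cpr512 \
    -x 'auto' \
    --chunks_per_runner 512
```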
See the below table for all results:

| chunks_per_runner | HAC (samples/s) | GPU RAM used |
|------------------:|-----------------|--------------|
|               256 | 4.4411e+06      | 8.4GB        |
|               384 | 6.70352e+06     | 12.6GB       |
|               512 | 9.09916e+06     | 16.5GB       |
|               786 | 1.22646e+07     | 20.1GB       |
|              1024 | 1.42249e+07     | 19.3GB       |
|              1084 | 1.42788e+07     | 22.3GB       |
|              1124 | 1.4468e+07      | 21.1GB       |
|              1324 | 1.37997e+07     | 16.2GB       |
|              1500 | 1.35735e+07     | 19.3GB       |

I included a set of values around what I decided was the "optimal" (1024) just to see if there was anything further to be gained. One could argue for the slight increase observed at 1124, but it doesn't seem worth the extra RAM usage to me. Plus, the numbers fluctuate a little when you run iterations of the same settings. If I had time I would take the average of 3+ runs for each value, but for now this is enough to demonstrate the process. As it stands I'm more than happy with the increase over the default:

* default rate at 256 chunks_per_runner = **4.4411e+06**
* optimised rate at 1024 chunks_per_runner = **1.42249e+07**

That's more than a **3x increase** in basecalling speed, which is incredibly significant when running post-sequencing basecalling of large data sets. Here is the Guppy run log for the "optimal" setting, as well as a screenshot of nvtop taken during the run.

```shell=
$ guppy_basecaller -c dna_r9.4.1_450bps_hac.cfg \
    -i ../example_fast5/ \
    -s titanRTX_fastq_hac_6.0.1 \
    -x 'auto' \
    --chunks_per_runner 1024
ONT Guppy basecalling software version 6.0.1+652ffd1
config file:        /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /home/miles/Downloads/software/guppy/6.0.1/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ../example_fast5/
save path:          titanRTX_fastq_hac_6.0.1
chunk size:         2000
chunks per runner:  1024
minimum qscore:     9
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 4
Found 10 fast5 files to process.
Init time: 1274 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 81447 ms, Samples called: 1150014823, samples/s: 1.41198e+07
Finishing up any open output files.
Basecalling completed successfully.
```

![](https://i.imgur.com/pXsXK3D.png)

##### SUP

:::warning
**{UNDER CONSTRUCTION}**
I will update this section when I get a chance to perform all the runs using the SUP model. It takes a bit longer, even on such a small data set.
:::

## Automating the process?

As this optimisation process can get time consuming, I decided to put together a rough and ready bash script that allows me to iterate through a given list/string of chunks_per_runner values while also outputting the basecalling metrics as well as GPU usage. I have gotten it into a shape where I'm happy to release a minimal working version on GitHub; it can be found here ([link](https://github.com/sirselim/guppy_parameter_optimiser)). At the moment the basic approach is that a user provides the model to optimise (fast, hac, sup), a string of chunks_per_runner values (i.e. "160 256 512 786 1024"), a directory of fast5 files and an output location. The script then sequentially runs Guppy using the selected model and works through the string of values. For each iteration it logs the Guppy information as well as GPU usage information.
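To give an idea of what that looks like under the hood, here is a minimal sketch of such a sweep loop (this is *not* the actual `guppy_parameter_optimiser` code, just an illustration of the idea; the file and directory names are placeholders):

```shell=
#!/usr/bin/env bash
# minimal sketch of a chunks_per_runner sweep over the HAC model
# usage: ./sweep.sh "256 512 786 1024" /path/to/fast5 /path/to/output
values=$1
fast5_dir=$2
out_dir=$3
mkdir -p "$out_dir"

for cpr in $values; do
    run="${out_dir}/hac_cpr${cpr}"
    # sample GPU usage in the background every 10 seconds while Guppy runs
    ( while true; do gpustat >> "${run}_gpustat.log"; sleep 10; done ) &
    sampler=$!
    guppy_basecaller -c dna_r9.4.1_450bps_hac.cfg \
        -i "$fast5_dir" \
        -s "$run" \
        -x 'auto' \
        --chunks_per_runner "$cpr" 2>&1 | tee "${run}_guppy.log"
    kill "$sampler"
    # pull the basecalling rate out of the Guppy log for a quick summary
    grep "samples/s" "${run}_guppy.log"
done
```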
In the future I would like to add things such as:

* an option to iterate over each chunks_per_runner value *i* times and get an average basecalling rate
* options to switch between protocols (i.e. R9/R10 etc.)
* the ability for the user to modify other parameters as part of the "tool", rather than having to modify the script itself
* maybe another script that takes the log files and generates a json of the useful values
* (as above) a json file could then easily be used to generate some nice plots to visualise what is going on

An example quick plot showing the increase in performance (basecall rate) with increasing chunks per runner on the Clara AGX RTX6000:

![](https://i.imgur.com/BwCK5xB.png)

## What if I want to "down tune"

In the instance that you have a GPU with less than 8GB of RAM, you essentially just reverse the process from above. Start by reducing the number of chunks_per_runner to a low value, i.e. 64 or 128, and monitor whether basecalling proceeds or whether you get CUDA (out of memory) errors. Once you have this baseline value you can start the optimisation process, tweaking the number of chunks_per_runner upwards to a point where you feel comfortable with the basecalling rate and the amount of RAM being used.

I mentioned earlier in this document that as part of the Jetson "portable sequencing" project ([link](https://github.com/sirselim/jetson_nanopore_sequencing)) we had to work on optimisation for GPUs that are sort of under the recommended specs. I say "sort of" as it is a bit of a tricky situation. The Jetson boards are SoMs (system on module), meaning the board contains the CPU and GPU on the same module. As such they share the same RAM allocation, i.e. the Jetson Xavier NX has 8GB of RAM shared between CPU and GPU, while the Xavier AGX has either 16 or 32GB of RAM. This gets interesting with the NX and its 8GB of RAM. Long story short, we ended up using either 48 or 64 chunks_per_runner, which was a balance between basecalling performance (it can easily run live basecalling in FAST mode) and RAM usage. So at the end of the day it's all about balance.

:::info
NOTE: I'll add this note here to myself to try and find time to do a quick example of how I would go about tuning Guppy parameters for GPUs that don't meet the recommended requirements.
:::
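In the meantime, a rough starting point for a card under the recommended specs (assuming the FAST model and the example data from above; the output directory name is just a placeholder) would be something like the following, then nudge the value upwards while watching GPU RAM:

```shell=
# conservative starting point for a <8GB card (e.g. Jetson Xavier NX);
# 48-64 chunks_per_runner is roughly where we landed for live FAST basecalling
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
    -i ../example_fast5/ \
    -s lowmem_fastq_fast \
    -x 'auto' \
    --chunks_per_runner 64
```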