GPU price / performance comparisons for Nanopore basecalling

Author: Miles Benton (GitHub; Twitter)
Created: 2021-07-16 20:15:56
Last modified: 2022-01-19 13:54:40

tags: Nanopore GPU notes documentation benchmarks

For some time I've been wanting to put together my thoughts on the price/performance ratios of GPUs. I've been thinking that there must be a "sweet spot" for users who want to run a single MinION Mk1b and have access to things such as adaptive sampling and live basecalling with the FAST/HAC models. Bonus points if the card also has decent retrospective basecalling performance.

I've been very fortunate recently to be provided with a range of hardware that has allowed me to start exploring this. So I wanted to create some notes to feed the information back to the community, for anyone who is in the market for a GPU, or who just wants an idea of the kind of performance to expect from various GPU models.

I'm hoping this becomes a dynamic document that evolves with time. For now I want to report on a comparison I was able to perform using an external GPU enclosure (eGPU) paired with two Nvidia Ampere cards, an RTX3060 and an RTX3080Ti. These are two cards aimed at gaming, one at the 'lower' end (RTX3060) and the other very much at the higher end (RTX3080Ti). This is obviously reflected in the price, with the RTX3060 at ~$1000 NZD and the RTX3080Ti at ~$3000 NZD.

Note: I'm reporting in NZ dollars as that's where I'm based, but the trend should hold - or you can easily do your own conversion of my calculations.

I'll also say up front that going into this I believed the RTX3060 was going to be the best middle ground for "most" people's needs, and it has been my recommendation for the last few months. Spoiler: this little experiment confirms my thinking, and with the GPU market recovering and cards becoming more sensibly priced and available, it's a good option.

Jumping into it

The test bed

The test setup was an HP ZBook Fury 17 G7 laptop (nearly fully spec'd), which has a very decent internal GPU in the form of a Turing-based RTX4000 mobile. I've included this card in the mix as I think it's useful to have an understanding of the laptop's performance as well. This mobile GPU should sit right between a desktop RTX2070 and RTX2080 in performance - so it's no slouch. It actually provides another good justification for the RTX3060 and shows the huge performance gain in the generational leap from Turing to Ampere. But let's let the results speak to that.

The system

For completeness' sake I'll record the specs of the laptop used for this experiment. It was a new HP ZBook Fury 17 G7 Mobile Workstation, a very 'beefy'/powerful laptop in the scheme of things.

Linux OS

            .-/+oossssoo+/-.               miles@pop-os 
        `:+ssssssssssssssssss+:`           ------------ 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 18.04 x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: HP ZBook Fury 17 G7 Mobile Workstation 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.12.0-13.1-liquorix-amd64 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 12 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2080 (dpkg), 9 (flatpak) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.4 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3840x2160 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 3.38.4 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Pop 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Pop-dark [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Pop [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: tilix 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: Intel Xeon W-10885M (16) @ 2.400GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: NVIDIA 09:00.0 NVIDIA Corporation Device 2504 
      -+sssssssssssssssssyyyssss+-         GPU: NVIDIA Quadro RTX 4000 Mobile / Max-Q 
        `:+ssssssssssssssssss+:`           GPU: Intel Device 9bf6 
            .-/+oossssoo+/-.               Memory: 5471MiB / 64097MiB

GPU information

Here is the readout from nvidia-smi for the internal RTX4000 mobile and each external GPU that was tested.

RTX3060 installed in eGPU
$ nvidia-smi -L
GPU 0: NVIDIA Quadro RTX 4000 with Max-Q Design (UUID: GPU-284a50ce-2672-714a-2034-c484f69e9655)
GPU 1: NVIDIA GeForce RTX 3060 (UUID: GPU-1a433ac4-748a-44fd-bee2-e2109232cff2)

$ nvidia-smi 
Fri Jul 16 20:27:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Quadro R...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P8     7W /  N/A |   1409MiB /  7982MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   25C    P8    11W / 170W |   4194MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
RTX3080Ti installed in eGPU

[still to come (I forgot to grab the info when I had the RTX3080Ti set up!)]

The results

The testing was done with a very small set of fast5 files from an ultra-long run we did a month or so ago. Moving forward I will test on bigger data sets, but this establishes a baseline.

I also used an eGPU enclosure (the wonderful Akitio Node Titan - there will be a proper write-up on this as well), so if you install a GPU internally in a system you will see slightly better performance than I'm reporting. There is a degree of overhead and latency with an eGPU setup, but it's minimal now with Thunderbolt 3/4 bandwidth - still worth noting though.
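For reference, a baseline FAST-model run on the eGPU looked roughly like the following - the input/output folder names here are purely illustrative, and cuda:1 points at the eGPU, which shows up as GPU 1 in the nvidia-smi listing above. The samples/s figures reported below are taken from the summary Guppy prints at the end of a run.

# baseline FAST-model run on the eGPU (GPU 1 in the nvidia-smi output above)
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
  -i gputest_fast5 -s gputest_fastq/ --recursive \
  --device cuda:1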

If you're interested in lots of pictures then I posted both of my eGPU setups as Twitter threads:

  • RTX3080Ti Twitter thread: link
  • RTX3060 Twitter thread: link

Here are the numbers:

| GPU/CPU | FAST model+ | HAC model+ | SUP model+ |
| --- | --- | --- | --- |
| #Tesla V100 | 2.66337e+07 | 1.58095e+07 | 3.91847e+06 |
| #A100 | 3.40604e+07 | 2.68319e+07 | 6.58227e+06 |
| #Titan RTX (P920) | 3.17412e+07 | 1.47765e+07 | 4.29710e+06 |
| #RTX6000 (Clara AGX) | 2.01672e+07 | 1.36405e+07 | 3.42290e+06 |
| RTX4000 (mobile) | 2.88644e+07 | 4.81920e+06 | 1.36953e+06 |
| RTX3060 (eGPU) | 4.70238e+07 | 6.40374e+06 | 2.28163e+06 |
| RTX3080Ti (eGPU) | 5.71209e+07 | 1.18229e+07 | 4.52692e+06 |
| Jetson Xavier NX | 4.36631e+06 | - | - |
| Jetson Xavier AGX (16GB) | 8.49277e+06 | 1.57560e+06 | 4.40821e+05 |
| Xeon W-10885M (CPU) | 6.43747e+05 | DNF | DNF |

# this GPU is in a different machine, so results will be influenced by different components to some degree.
+ metric is samples/s - higher is faster
DNF - did not finish (I couldn't be bothered waiting hours/days for the CPU)

UPDATE: I have been benchmarking other cards and Nvidia Jetson boards that I have at hand. This information is now included in the above table. As yet I haven't had a chance to update the plots in the rest of this document.

{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "width": 580,
    "height": 250,
    "padding": 5,
    "description": "A simple bar chart with embedded data.",
    "title": "Performance of various GPUs/CPUs for Nanopore Guppy basecalling",
    "data": {
        "url": "https://raw.githubusercontent.com/sirselim/random_plotting_scripts/main/data/speed_perf_stats.json"
    },
    "width": {
        "step": 38
    },
    "mark": {
        "type": "bar",
        "tooltip": true
    },
    "encoding": {
        "column": {
            "field": "Model",
            "type": "ordinal",
            "spacing": 10
        },
        "x": {
            "field": "Method",
            "type": "ordinal",
            "axis": {
                "title": "",
                "labelAngle": 45
            }
        },
        "y": {
            "field": "samples per second",
            "type": "quantitative"
        },
        "color": {
            "field": "Method",
            "scale": {
                "range": [
                    "#675193",
                    "#ca8861",
                    "#c7c7c7",
                    "#ffbb00"
                ]
            }
        },
        "tooltip": [
            {
                "field": "Method",
                "title": "Hardware"
            },
            {
                "field": "samples per second",
                "title": "samples/s"
            }
        ]
    }
}

Note: for the CPU run above I used an Intel Xeon W-10885M, which has 8 cores and 16 threads (base clock: 2.4GHz, turbo: 5.3GHz). This is a mobile CPU but it's also no slouch (it's much higher spec'd than what ONT recommend). I believe the CPU in the GridION is an Intel i7 7700K; comparing the two, the Xeon tested here beats it comfortably (link).

When I ran the comparison I tried to give the CPU a fighting chance. I gave every thread to Guppy (all 16) - it did not help!

I used the below code to run the test:

# 2 callers x 8 threads per caller = all 16 threads of the Xeon W-10885M
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
  -i cputest_fast5 -s cputest_fastq/ --recursive \
  --num_callers 2 --cpu_threads_per_caller 8

To be clear about the above results: the slowest GPU (the mobile RTX4000) took 15 seconds, while the CPU took 2 minutes and 56 seconds - a speed-up of more than 11X (176 s / 15 s ≈ 11.7X) from the mobile GPU. Remember, this is the CPU running at its absolute fastest.

For both external GPUs I played around a little with optimising the basecalling parameters for the HAC and SUP models, and was able to get a decent chunk of extra performance over the default settings.
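As a rough sketch of what that tuning looked like, it amounts to adding flags like the following to the basecalling call. The values shown here are illustrative only, not the exact settings behind the numbers above - the sweet spot depends on the card, the model and the available GPU RAM, so treat them as a starting point.

# SUP-model run on the eGPU with manually adjusted runner/chunk settings (example values)
guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg \
  -i gputest_fast5 -s gputest_fastq/ --recursive \
  --device cuda:1 \
  --gpu_runners_per_device 4 --chunks_per_runner 256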

What's interesting is that the base RTX3060 has 12GB of GDDR6 RAM, while the RTX3080Ti also has 12GB, albeit faster GDDR6X. I believe that extra RAM on the RTX3060 really helps. The RTX3070 is obviously more powerful than the RTX3060 across the board, except when it comes to RAM. It would be really interesting to hear from someone with an RTX3070, which has 8GB of RAM, to see what sort of numbers they're pulling.

A performance / price ratio?

Now for my very crude metric of generating a performance/price ratio. All I've done is take the samples per second that Guppy reports and divide it by the price I could find the GPUs in stock for, which gives a samples-per-second-per-dollar metric. Crude but interesting, see below:

FAST (fast accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 28864400 | 8000 | 3608 |
| RTX3060 (eGPU) | 47023800 | 1060 | 44362 |
| RTX3080Ti (eGPU) | 57120900 | 3000 | 19040 |

HAC (high accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 4819200 | 8000 | 602 |
| RTX3060 (eGPU) | 6403740 | 1060 | 6041 |
| RTX3080Ti (eGPU) | 11822900 | 3000 | 3941 |

SUP (super high accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 1369530 | 8000 | 171 |
| RTX3060 (eGPU) | 2281630 | 1060 | 2152 |
| RTX3080Ti (eGPU) | 4526920 | 3000 | 1509 |
{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "width": 980,
    "height": 250,
    "padding": 5,
    "description": "A simple bar chart with embedded data.",
    "title": "Plotting price / performance of Nvidia GPU for Nanopore basecalling",
    "data": {
        "url": "https://raw.githubusercontent.com/sirselim/random_plotting_scripts/main/data/price_perf_stats.json"
    },
    "width": {
        "step": 55
    },
    "mark": {
        "type": "bar",
        "tooltip": true
    },
    "encoding": {
        "column": {
            "field": "basecalling_model",
            "type": "ordinal",
            "spacing": 10
        },
        "x": {
            "field": "GPU",
            "type": "ordinal",
            "axis": {
                "title": "",
                "labelAngle": 45
            }
        },
        "y": {
            "field": "samples/second/$",
            "type": "quantitative"
        },
        "color": {
            "field": "GPU",
            "scale": {
                "range": [
                    "#675193",
                    "#ca8861",
                    "#c7c7c7"
                ]
            }
        },
        "tooltip": [
            {
                "field": "GPU",
                "title": "Hardware"
            },
            {
                "field": "samples/second/$",
                "title": "samples/second/$"
            },
            {
                "field": "samples/s",
                "title": "samples/second"
            },
            {
                "field": "price ($ NZD)",
                "title": "Price ($NZD)"
            }
        ]
    }
}

Major caveat: this is a TINY sample of GPUs and will benefit from being filled out with more, but it was nice to see that my gut instincts were correct and the RTX3060 is providing nice bang for buck! I will be able to do this for a couple of other cards (Titan RTX, V100, A100), however they won't be very useful comparisons as they're such expensive cards. One idea I had was uploading a couple of the test data sets and then seeing if any kind community members would like to contribute numbers based on their GPUs.

What does this mean?

Obviously, if you want to run multiple MinIONs and do the absolute fastest basecalling, then the more expensive cards will provide this. Otherwise, something in the range of the RTX3060 is looking to be a great performer for Nanopore basecalling. You could actually buy three of them for the price of a single RTX3080Ti, but I wouldn't recommend that. Two RTX3060s, though, make an interesting prospect for ~$2000 NZD.

There is a LOT more I want to do to make this more robust and fleshed out, but for now I hope that this is at least interesting and maybe helpful to some.

An example of decision making considering some of the above

Note: please remember that this is all just my opinion. There are other factors that contribute to final decision making, e.g. if you are at an institute that is unable to install 'gaming' GPUs into its infrastructure, then that option is not on the table.

This is merely an attempt to help provide more information when it comes to making decisions about spending non-trivial amounts of money.

You'll also notice that I jumped into USD below; sorry, the bulk of this was copy-pasted from a reply of mine on the forum.

Here is an example based on a recent community discussion around trying to select a GPU. The suggested option was the Nvidia RTX A4000. This is a very decent GPU aimed at a more 'workstation' type setting - think professional CAD/3D etc. It's priced at around $1000 USD MSRP. Spec-wise it sits between an RTX3070 and an RTX3080, except that it has 16GB of RAM. Apart from the RAM, the RTX3080 is more powerful on all fronts (and most likely a better basecaller).

So my response went something like this:

If you are wanting to potentially run multiple MinION Mk1b's at once then a more powerful GPU will be useful. The RTX3080 is a fine card and will do a good job I imagine. As you mention, apart from the RAM the RTX3080 is better spec'd across the board, meaning it should be faster at basecalling than the A4000. The amount of RAM is really only going to come into play when running multiple instances, so I would say the RTX3080 is the better option of those two cards. Where it gets interesting is when you consider the price difference. The RTX3080 should be around the $700 USD mark, while the A4000 is approx $1000 USD. If you want to save money but still have power, the RTX3080 is great. If you are looking at spending towards that higher end and want as much GPU power bang-for-your-buck as possible, then the RTX3080Ti becomes a very interesting option at about $1200 USD. While it's $200 more, this card will stomp the A4000 by a large margin. It has nearly twice the number of CUDA cores, more RT cores, more advanced RAM and a wider memory bus. The only thing the A4000 has over the RTX3080Ti is 16GB vs 12GB of RAM - but that's probably not going to make much of a difference in 95% of situations.

Some may argue you could go up again to the RTX3090, at $1500 USD - but at that point the difference in performance for the extra $300 is probably only in the 2-8% range. The RTX3090 only has 256 more CUDA cores than the RTX3080Ti, with everything else essentially the same, except that it has double the RAM at 24GB. I've yet to be faced with a situation where I've wanted that much GPU RAM - you might be able to tweak parameters to use as much RAM as you want, but 99 times out of 100 you won't actually see better performance (at least in my experience).

At the end of the day, as active community member David Eccles so nicely put it, basecalling on a GPU is already going to be night and day compared with CPU calling.

This information may be useful: pulling some performance numbers from TechPowerUp gives an idea of the relative performance of the cards discussed (all relative to the A4000). The RTX3070 at ~$500 USD pulls slightly ahead of the A4000 for half the price, BUT the A4000 does have twice the amount of RAM. This also nicely highlights the $200 difference between the A4000 and the RTX3080Ti (for a ~50% increase in performance), versus the $300 difference between the RTX3080Ti and RTX3090 (for an increase of only ~2%).

This plot is based solely on relative GPU performance between the cards (not on Guppy basecalling at all). As I said above, it would be really great to get basecalling performance metrics for the RTX3070/RTX3080/RTX3090, as well as any other cards. That way we could factor in the price and do a broader comparison than what I've been able to do thus far.

UPDATE: more GPUs benchmarked

The table below lists results for all the GPUs we have tested so far. We used the same example set of ONT fast5 files and Guppy 5.0.16, and where possible tuned the chunks_per_runner parameter to get the most out of HAC and SUP calling for the GPU being tested. This hopefully gives a more "real world" example of what you can expect from these types of cards in terms of basecalling rate.

The colours represent how well a given GPU and basecalling model combination will keep up with live basecalling during a sequencing run:

  • green - easily keeps up in real-time
  • orange - will likely keep up with 80-90% of the run in real-time
  • red - won't get anywhere close, large lag in basecalling

From ONT community forum link:

“Keep up” is defined as 80% of the theoretical flow cell output.
e.g. MinION = 4000 kHz x 512 channels x 0.8 = 1.6 M samples/s = 160 kbases/s at 400 b/s

At 100% of theoretical output: MinION = 4000 samples/s per channel x 512 channels x 1.0 = 2,048,000 samples/s, i.e. ~2.05e+06 samples/s (the 1.6 M samples/s "keep up" figure above is 80% of this).

It should be noted that this is based on an ideal situation where a flow cell is sequencing at 100% of its capacity / theoretical output. In reality this is never going to happen, so it's probably safe to assume that a GPU that can manage a minimum of 1.6 M samples/s for a given basecalling model will be able to keep up 'live'.
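In other words, you can sanity check any of the samples/s values in these tables against that threshold. A throwaway sketch (the rate plugged in here is the RTX3060 SUP number from the results table above):

# does a given basecalling rate (samples/s) clear 80% of theoretical MinION output?
awk -v rate=2281630 'BEGIN {
  threshold = 4000 * 512 * 0.8          # = 1,638,400 samples/s
  if (rate >= threshold) print "should keep up with live basecalling"
  else                   print "will lag behind the sequencer"
}'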

* the metric reported is samples/second - where higher is faster basecalling
DNF - did not finish (I couldn’t be bothered waiting hours/days for the CPU)
