GPU price / performance comparisons for Nanopore basecalling

Author: Miles Benton (GitHub; Twitter)
Created: 2021-07-16 20:15:56
Last modified: 2022-01-19 13:54:40

tags: Nanopore GPU notes documentation benchmarks

For some time I've been wanting to put together my thoughts on the price/performance ratios of GPUs. I've been thinking that there must be a "sweet spot" for users who want to run a single MinION Mk1b and have access to things such as adaptive sampling and live basecalling with the FAST/HAC models. Bonus points if the card also has decent retrospective basecalling performance.

I've been very fortunate recently to be provided with a range of hardware that has allowed me to start exploring this. So I wanted to create some notes to feed the information back to the community, for anyone who is in the market for a GPU, or who just wants an idea of the kind of performance to expect from various GPU models.

I'm hoping this becomes a dynamic document that evolves with time. For now I want to report on a comparison I was able to perform using an external GPU enclosure (eGPU) paired with two Nvidia Ampere cards, an RTX3060 and an RTX3080Ti. These are two cards aimed at gaming, one at the 'lower' end (RTX3060) and the other very much at the higher end (RTX3080Ti). This is obviously reflected in the price, with the RTX3060 at ~$1000 NZD and the RTX3080Ti at ~$3000 NZD.

Note: I'm reporting in NZ dollars as that's where I'm based, but the trend should hold - or you can easily do your own conversion of my calculations.

I'll also say up front that going into this I believed the RTX3060 was going to be the best middle ground for "most" people's needs, and it has been my recommendation for the last few months. Spoiler: this little experiment confirms my thinking, and with the GPU market recovering and cards becoming more sensibly priced and available, it's a good option.

Jumping into it

The test bed

The test setup was an HP ZBook Fury 17 G7 laptop (nearly fully spec'd), which has a very decent internal GPU in the form of a Turing-based RTX4000 mobile. I've included this card in the mix as I think it's useful to have an understanding of the laptop's performance as well. This mobile GPU should sit right between a desktop RTX2070 and RTX2080 in performance - so it's no slouch. It actually provides another good justification for the RTX3060 and shows the huge performance gain in the generational leap from Turing to Ampere. But let's let the results speak to that.

The system

For completeness' sake I'll record the specs of the laptop used for this experiment. It was a new HP ZBook Fury 17 G7 Mobile Workstation, a very 'beefy'/powerful laptop in the scheme of things.

Linux OS

            .-/+oossssoo+/-.               miles@pop-os 
        `:+ssssssssssssssssss+:`           ------------ 
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 18.04 x86_64 
    .ossssssssssssssssssdMMMNysssso.       Host: HP ZBook Fury 17 G7 Mobile Workstation 
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.12.0-13.1-liquorix-amd64 
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 12 mins 
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 2080 (dpkg), 9 (flatpak) 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.4 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 3840x2160 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   DE: GNOME 3.38.4 
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   WM: Mutter 
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   WM Theme: Pop 
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Theme: Pop-dark [GTK2/3] 
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/    Icons: Pop [GTK2/3] 
  +sssssssssdmydMMMMMMMMddddyssssssss+     Terminal: tilix 
   /ssssssssssshdmNNNNmyNMMMMhssssss/      CPU: Intel Xeon W-10885M (16) @ 2.400GHz 
    .ossssssssssssssssssdMMMNysssso.       GPU: NVIDIA 09:00.0 NVIDIA Corporation Device 2504 
      -+sssssssssssssssssyyyssss+-         GPU: NVIDIA Quadro RTX 4000 Mobile / Max-Q 
        `:+ssssssssssssssssss+:`           GPU: Intel Device 9bf6 
            .-/+oossssoo+/-.               Memory: 5471MiB / 64097MiB

GPU information

Here is the readout from nvidia-smi for the internal RTX4000 mobile and each external GPU that was tested.

RTX3060 installed in eGPU
$ nvidia-smi -L
GPU 0: NVIDIA Quadro RTX 4000 with Max-Q Design (UUID: GPU-284a50ce-2672-714a-2034-c484f69e9655)
GPU 1: NVIDIA GeForce RTX 3060 (UUID: GPU-1a433ac4-748a-44fd-bee2-e2109232cff2)

$ nvidia-smi 
Fri Jul 16 20:27:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Quadro R...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P8     7W /  N/A |   1409MiB /  7982MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   25C    P8    11W / 170W |   4194MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
RTX3080Ti installed in eGPU

[still to come (I forgot to grab the info when I had the RTX3080Ti set up!)]

The results

The testing was done with a very small set of fast5 files from an ultra-long run we did a month or so ago. Moving forward I will test on bigger data sets, but this establishes a baseline.

I also used an eGPU enclosure (the wonderful Akitio Node Titan - there will be a proper write-up on this as well), so if you install a GPU internally in a system you will see slightly better performance than I'm reporting. There is a degree of overhead and latency with an eGPU setup, but it's minimal now with Thunderbolt 3/4 bandwidth - still worth noting though.
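For reference, a baseline FAST-model run on the eGPU looked roughly like the following - the input/output folder names here are purely illustrative, and cuda:1 points at the eGPU, which shows up as GPU 1 in the nvidia-smi listing above. The samples/s figures reported below are taken from the summary Guppy prints at the end of a run.

# baseline FAST-model run on the eGPU (GPU 1 in the nvidia-smi output above)
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
  -i gputest_fast5 -s gputest_fastq/ --recursive \
  --device cuda:1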

If you're interested in lots of pictures then I posted both of my eGPU setups as Twitter threads:

  • RTX3080Ti Twitter thread: link
  • RTX3060 Twitter thread: link

Here are the numbers:

| GPU/CPU | FAST model+ | HAC model+ | SUP model+ |
| --- | --- | --- | --- |
| #Tesla V100 | 2.66337e+07 | 1.58095e+07 | 3.91847e+06 |
| #A100 | 3.40604e+07 | 2.68319e+07 | 6.58227e+06 |
| #Titan RTX (P920) | 3.17412e+07 | 1.47765e+07 | 4.29710e+06 |
| #RTX6000 (Clara AGX) | 2.01672e+07 | 1.36405e+07 | 3.42290e+06 |
| RTX4000 (mobile) | 2.88644e+07 | 4.81920e+06 | 1.36953e+06 |
| RTX3060 (eGPU) | 4.70238e+07 | 6.40374e+06 | 2.28163e+06 |
| RTX3080Ti (eGPU) | 5.71209e+07 | 1.18229e+07 | 4.52692e+06 |
| Jetson Xavier NX | 4.36631e+06 | - | - |
| Jetson Xavier AGX (16GB) | 8.49277e+06 | 1.57560e+06 | 4.40821e+05 |
| Xeon W-10885M (CPU) | 6.43747e+05 | DNF | DNF |

# this GPU is in a different machine, so results will be influenced by different components to some degree.
+ metric is samples/s - higher is faster
DNF - did not finish (I couldn't be bothered waiting hours/days for the CPU)

UPDATE: I have been benchmarking other cards and Nvidia Jetson boards that I have at hand. This information is now included in the above table. As yet I haven't had a chance to update the plots in the rest of this document.

{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "width": 580,
    "height": 250,
    "padding": 5,
    "description": "A simple bar chart with embedded data.",
    "title": "Performance of various GPUs/CPUs for Nanopore Guppy basecalling",
    "data": {
        "url": "https://raw.githubusercontent.com/sirselim/random_plotting_scripts/main/data/speed_perf_stats.json"
    },
    "width": {
        "step": 38
    },
    "mark": {
        "type": "bar",
        "tooltip": true
    },
    "encoding": {
        "column": {
            "field": "Model",
            "type": "ordinal",
            "spacing": 10
        },
        "x": {
            "field": "Method",
            "type": "ordinal",
            "axis": {
                "title": "",
                "labelAngle": 45
            }
        },
        "y": {
            "field": "samples per second",
            "type": "quantitative"
        },
        "color": {
            "field": "Method",
            "scale": {
                "range": [
                    "#675193",
                    "#ca8861",
                    "#c7c7c7",
                    "#ffbb00"
                ]
            }
        },
        "tooltip": [
            {
                "field": "Method",
                "title": "Hardware"
            },
            {
                "field": "samples per second",
                "title": "samples/s"
            }
        ]
    }
}

Note: for the CPU run above I used an Intel Xeon W-10885M, which has 8 cores and 16 threads (base clock: 2.4GHz, turbo: 5.3GHz). This is a mobile CPU but it's also no slouch (it's much higher spec'd than what ONT recommend). I believe the CPU in the GridION is an Intel i7 7700K; comparing the two, the Xeon tested here beats it comfortably (link).

When I ran the comparison I tried to give the CPU a fighting chance. I gave every thread to Guppy (all 16) - it did not help!

I used the below code to run the test:

# 2 callers x 8 threads per caller = all 16 threads of the Xeon W-10885M
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg \
  -i cputest_fast5 -s cputest_fastq/ --recursive \
  --num_callers 2 --cpu_threads_per_caller 8

To be clear about the above results: the slowest GPU (the mobile RTX4000) took 15 seconds, while the CPU took 2 minutes and 56 seconds - a speed-up of more than 11X (176 s / 15 s ≈ 11.7X) from the mobile GPU. Remember, this is the CPU running at its absolute fastest.

For both external GPUs I played around a little with optimising the basecalling parameters for the HAC and SUP models, and was able to get a decent chunk of extra performance over the default settings.
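As a rough sketch of what that tuning looked like, it amounts to adding flags like the following to the basecalling call. The values shown here are illustrative only, not the exact settings behind the numbers above - the sweet spot depends on the card, the model and the available GPU RAM, so treat them as a starting point.

# SUP-model run on the eGPU with manually adjusted runner/chunk settings (example values)
guppy_basecaller -c dna_r9.4.1_450bps_sup.cfg \
  -i gputest_fast5 -s gputest_fastq/ --recursive \
  --device cuda:1 \
  --gpu_runners_per_device 4 --chunks_per_runner 256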

What's interesting is that the base RTX3060 has 12GB of GDDR6 RAM, while the RTX3080Ti also has 12GB, albeit faster GDDR6X. I believe that extra RAM on the RTX3060 really helps. The RTX3070 is obviously more powerful than the RTX3060 across the board, except when it comes to RAM. It would be really interesting to hear from someone with an RTX3070, which has 8GB of RAM, to see what sort of numbers they're pulling.

A performance / price ratio?

Now for my very crude metric of generating a performance/price ratio. All I've done is take the samples per second that Guppy reports and divide it by the price I could find the GPUs in stock for, which gives a samples-per-second-per-dollar metric. Crude but interesting, see below:

FAST (fast accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 28864400 | 8000 | 3608 |
| RTX3060 (eGPU) | 47023800 | 1060 | 44362 |
| RTX3080Ti (eGPU) | 57120900 | 3000 | 19040 |

HAC (high accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 4819200 | 8000 | 602 |
| RTX3060 (eGPU) | 6403740 | 1060 | 6041 |
| RTX3080Ti (eGPU) | 11822900 | 3000 | 3941 |

SUP (super high accuracy)

| GPU | samples/s | price ($ NZD) | samples/s/$ |
| --- | --- | --- | --- |
| RTX4000 (mobile) | 1369530 | 8000 | 171 |
| RTX3060 (eGPU) | 2281630 | 1060 | 2152 |
| RTX3080Ti (eGPU) | 4526920 | 3000 | 1509 |
{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "width": 980,
    "height": 250,
    "padding": 5,
    "description": "A simple bar chart with embedded data.",
    "title": "Plotting price / performance of Nvidia GPU for Nanopore basecalling",
    "data": {
        "url": "https://raw.githubusercontent.com/sirselim/random_plotting_scripts/main/data/price_perf_stats.json"
    },
    "width": {
        "step": 55
    },
    "mark": {
        "type": "bar",
        "tooltip": true
    },
    "encoding": {
        "column": {
            "field": "basecalling_model",
            "type": "ordinal",
            "spacing": 10
        },
        "x": {
            "field": "GPU",
            "type": "ordinal",
            "axis": {
                "title": "",
                "labelAngle": 45
            }
        },
        "y": {
            "field": "samples/second/$",
            "type": "quantitative"
        },
        "color": {
            "field": "GPU",
            "scale": {
                "range": [
                    "#675193",
                    "#ca8861",
                    "#c7c7c7"
                ]
            }
        },
        "tooltip": [
            {
                "field": "GPU",
                "title": "Hardware"
            },
            {
                "field": "samples/second/$",
                "title": "samples/second/$"
            },
            {
                "field": "samples/s",
                "title": "samples/second"
            },
            {
                "field": "price ($ NZD)",
                "title": "Price ($NZD)"
            }
        ]
    }
}

Major caveat: this is a TINY sample of GPUs and will benefit from being filled out with more, but it was nice to see that my gut instincts were correct and the RTX3060 is providing nice bang for buck! I will be able to do this for a couple of other cards (Titan RTX, V100, A100), however they won't be very useful comparisons as they're such expensive cards. One idea I had was uploading a couple of the test data sets and then seeing if any kind community members would like to contribute numbers based on their GPUs.

What does this mean?

Obviously, if you want to run multiple MinIONs and do the absolute fastest basecalling, then the more expensive cards will provide this. Otherwise, something in the range of the RTX3060 is looking to be a great performer for Nanopore basecalling. You could actually buy three of them for the price of a single RTX3080Ti, but I wouldn't recommend that. Two RTX3060s, though, make an interesting prospect for ~$2000 NZD.

There is a LOT more I want to do to make this more robust and fleshed out, but for now I hope that this is at least interesting and maybe helpful to some.

An example of decision making considering some of the above

Note: please remember that this is all just my opinion. There are other factors that contribute to final decision making, e.g. if you are at an institute that is unable to install 'gaming' GPUs into its infrastructure, then that option is not on the table.

This is merely an attempt to help provide more information when it comes to making decisions about spending non-trivial amounts of money.

You'll also notice that I jumped into USD below; sorry, the bulk of this was copy-pasted from a reply of mine on the forum.

Here is an example based on a recent community discussion around trying to select a GPU. The suggested option was the Nvidia RTX A4000. This is a very decent GPU aimed at a more 'workstation' type setting - think professional CAD/3D etc. It's priced at around $1000 USD MSRP. Spec-wise it sits between an RTX3070 and an RTX3080, except that it has 16GB of RAM. Apart from the RAM, the RTX3080 is more powerful on all fronts (and most likely a better basecaller).

So my response went something like this:

If you are wanting to potentially run multiple MinION Mk1b's at once then a more powerful GPU will be useful. The RTX3080 is a fine card and will do a good job I imagine. As you mention, apart from the RAM the RTX3080 is better spec'd across the board, meaning it should be faster at basecalling than the A4000. The amount of RAM is really only going to come into play when running multiple instances, so I would say the RTX3080 is the better option of those two cards. Where it gets interesting is when you consider the price difference. The RTX3080 should be around the $700 USD mark, while the A4000 is approx $1000 USD. If you want to save money but still have power, the RTX3080 is great. If you are looking at spending towards that higher end and want as much GPU power bang-for-your-buck as possible, then the RTX3080Ti becomes a very interesting option at about $1200 USD. While it's $200 more, this card will stomp the A4000 by a large margin. It has nearly twice the number of CUDA cores, more RT cores, more advanced RAM and a wider memory bus. The only thing the A4000 has over the RTX3080Ti is 16GB vs 12GB of RAM - but that's probably not going to make much of a difference in 95% of situations.

Some may argue you could go up again to the RTX3090, at $1500 USD - but at that point the difference in performance for the extra $300 is probably only in the 2-8% range. The RTX3090 only has 256 more CUDA cores than the RTX3080Ti, with everything else essentially the same, except that it has double the RAM at 24GB. I've yet to be faced with a situation where I've wanted that much GPU RAM - you might be able to tweak parameters to use as much RAM as you want, but 99 times out of 100 you won't actually see better performance (at least in my experience).

At the end of the day, as active community member David Eccles so nicely put it, basecalling on a GPU is already going to be night and day compared with CPU calling.

This information may be useful: pulling some performance numbers from TechPowerUp gives an idea of the relative performance of the cards discussed (all relative to the A4000). The RTX3070 at ~$500 USD pulls slightly ahead of the A4000 for half the price, BUT the A4000 does have twice the amount of RAM. This also nicely highlights the $200 difference between the A4000 and the RTX3080Ti (for a ~50% increase in performance), versus the $300 difference between the RTX3080Ti and RTX3090 (for an increase of only ~2%).

This plot is based solely on relative GPU performance between the cards (not on Guppy basecalling at all). As I said above, it would be really great to get basecalling performance metrics for the RTX3070/RTX3080/RTX3090, as well as any other cards. That way we could factor in the price and do a broader comparison than what I've been able to do thus far.

UPDATE: more GPUs benchmarked

The table below lists results for all the GPUs we have tested so far. We used the same example set of ONT fast5 files and Guppy 5.0.16, and where possible tuned the chunks_per_runner parameter to get the most out of HAC and SUP calling for the GPU being tested. This hopefully gives a more "real world" example of what you can expect from these types of cards in terms of basecalling rate.

The colours represent how well a given GPU and basecalling model combination will keep up with live basecalling during a sequencing run:

  • green - easily keeps up in real-time
  • orange - will likely keep up with 80-90% of the run in real-time
  • red - won't get anywhere close, large lag in basecalling

From ONT community forum link:

“Keep up” is defined as 80% of the theoretical flow cell output.
e.g. MinION = 4000 kHz x 512 channels x 0.8 = 1.6 M samples/s = 160 kbases/s at 400 b/s

At 100% of theoretical output: MinION = 4000 samples/s per channel x 512 channels x 1.0 = 2,048,000 samples/s, i.e. ~2.05e+06 samples/s (the 1.6 M samples/s "keep up" figure above is 80% of this).

It should be noted that this is based on an ideal situation where a flow cell is sequencing at 100% of its capacity / theoretical output. In reality this is never going to happen, so it's probably safe to assume that a GPU that can manage a minimum of 1.6 M samples/s for a given basecalling model will be able to keep up 'live'.
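In other words, you can sanity check any of the samples/s values in these tables against that threshold. A throwaway sketch (the rate plugged in here is the RTX3060 SUP number from the results table above):

# does a given basecalling rate (samples/s) clear 80% of theoretical MinION output?
awk -v rate=2281630 'BEGIN {
  threshold = 4000 * 512 * 0.8          # = 1,638,400 samples/s
  if (rate >= threshold) print "should keep up with live basecalling"
  else                   print "will lag behind the sequencer"
}'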

* the metric reported is samples/second - where higher is faster basecalling
DNF - did not finish (I couldn’t be bothered waiting hours/days for the CPU)
