# Accelerating CUDA C++ Applications with Multiple GPUs

![](https://i.imgur.com/r39QtPw.png)

* NVIDIA Courses: https://courses.nvidia.com/
* NVIDIA Courses event: courses.nvidia.com/dli-event
* Event code: GTC_ACC_MA48
* The courses.nvidia.com material for this course (i.e. the notebooks, cloud access, etc.) will be available for a year. If you don't manage to finish today, you can take the assessment at any point afterwards, and you can have multiple attempts :) Try to get there today though, so that if you get stuck you can get some advice from the team.
* Multiple communication libraries are available for multi-node, multi-GPU data exchange: MPI of course, but also NCCL and NVSHMEM.
* GPUDirect is a keyword to look up at GTC: GPUDirect RDMA (direct communication across nodes) and GPUDirect Storage (direct communication to storage) - see GTC presentation [S32039].
* The stride is the total number of threads in our grid. The total number of threads is something we choose, usually taking into account the particular device (e.g. the number of streaming multiprocessors on the device, 80 in today's course). At a minimum you probably want every physical core to be used once, so e.g. 80 blocks of 64 threads per block for this device (each SM has 64 cores).
* GPUs do not have caches in the way CPUs do; memory loads are done by the 32 threads of a warp together.
* `apt-get update && apt-get install htop`
* `gridDim.x` = number of blocks, `blockDim.x` = number of threads per block, `threadIdx.x` = index of the thread within its block, `blockIdx.x` = index of the block within the grid, `blockDim.x * gridDim.x` = total number of threads in the grid, `blockIdx.x * blockDim.x + threadIdx.x` = unique thread index in the whole grid.
* You could (and probably should) write your code to query the device to find out how many SMs it has, and use that rather than hard-coding "80" (a short sketch of this follows a few notes below).
* https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
* Jestine Paul: Do we need `-fopenmp` always with nvcc?
* Paul Graham: No, this particular example uses OpenMP for the CPU part, hence we need `-fopenmp`.
* `!nvidia-smi` in a notebook (the exclamation mark meaning run as a shell command), or just `nvidia-smi` in a terminal.
* If others are interested in executing C++ in a Jupyter session, CERN have developed a kernel: https://indico.cern.ch/event/574561/contributions/2324730/attachments/1348578/2034673/Oral-ROOT-Jupyter_CHEP_1016.pdf Another Jupyter kernel protocol: https://github.com/jupyter-xeus/xeus
* A little discussion on `__forceinline__` here: https://stackoverflow.com/questions/19897803/forceinline-effect-at-cuda-c-device-functions
* The device you are running on (a V100) has 80 SMs. We want some multiple of 80; 32 was chosen as the factor here, Christian can give more insight.
* Hi Medhy, the number of SMs of the GPU in this course is 80 (it's a V100), as indicated by Paul G. further up in this chat; 32 is an oversubscription factor, so each SM will handle 32 blocks of threads.
* Kernels are the actual codes we want to run; I think you mean cores. We can query the device to see the number of SMs. This particular device has 80 SMs with 64 cores per SM, so if we create a grid of 80 blocks with 64 threads per block, that will use every physical core once.
* You would use the CUDA API to query the number of SMs, as shown in the Fundamentals of CUDA C/C++ DLI course.
* However, GPUs usually run more optimally when oversubscribed (very different to CPUs); this oversubscription gives the scheduler on the GPU more opportunity to hide e.g. the latency of memory fetching.
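* For reference, here is a minimal CUDA C++ sketch (not taken from the course notebooks; the kernel name `saxpy` and the problem size are illustrative) of querying the SM count at runtime and launching a grid-stride kernel sized with the oversubscription factor of 32 mentioned above:

  ```cpp
  #include <cstdio>
  #include <cuda_runtime.h>

  // Grid-stride loop: each thread starts at its unique global index and
  // advances by the total number of threads in the grid.
  __global__ void saxpy(int n, float a, const float *x, float *y)
  {
      int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
      int stride = gridDim.x * blockDim.x;                 // total threads in grid
      for (int i = idx; i < n; i += stride)
          y[i] = a * x[i] + y[i];
  }

  int main()
  {
      int deviceId;
      cudaGetDevice(&deviceId);

      // Query the number of SMs instead of hard-coding 80.
      int numSMs;
      cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);

      const int N = 1 << 24;
      float *x, *y;
      cudaMallocManaged(&x, N * sizeof(float));
      cudaMallocManaged(&y, N * sizeof(float));
      for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      // Oversubscribe: 32 blocks per SM, 64 threads per block (the numbers
      // discussed in the chat above).
      saxpy<<<32 * numSMs, 64>>>(N, 2.0f, x, y);
      cudaDeviceSynchronize();

      printf("y[0] = %f (expected 4.0)\n", y[0]);
      cudaFree(x);
      cudaFree(y);
      return 0;
  }
  ```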
* The first time you run the code it actually does the encryption on the CPU (which takes a long time) and saves the data; on subsequent runs it loads the encrypted data from a file. So don't use this as a measure of the speed of the CPU: it is simply loading rather than actually calculating on later runs.
* In this tutorial we play with toy examples at all times. In practice, the GPU is only useful if you have at least millions of data items, and therefore there will always be much more data than cores. If you apply fine-grained parallelism to compute one data item per thread (using the owner-computes rule), then the best way to go is to use an intermediate block size (say 256) and determine the number of blocks as the division of the problem size by the block size (a sketch of this follows a few notes below).
* The specification of each GPU architecture generation sometimes has more SMs than the actual cards which are released; amongst other reasons, this is to ensure good yield in the manufacturing process.
* To be clear, each V100 SXM2 has 80 SMs, but the Volta architecture specification proposes 84 SMs - two different things.
* Only the GV100 GPU has 84 SMs, but this is the generic GPU for this generation. From that, commercial models are released depending on how much money you are willing to pay. The Tesla V100 model we use today provides 80 SMs, but if you are not that rich, V108, V106, V104, V102, ... give you a range of multiprocessor counts to choose from, all sharing the same SM design that Christian is talking about.
* V108, V106, V104, V102, ... are the internal names for the chips that TSMC manufactures for NVIDIA. There are a bunch of GeForce models, Titans, ... on the market using each of these chips (which is what ultimately determines how many SMs your GPU model has).
* Hint for vampires who like to run everything in dark mode: if you do run JupyterLab in dark mode, you need to switch themes to find the code for the URL. ![](https://i.imgur.com/wCgTSl8.png)
* nvprof + nvvp mixed tracing and detailed kernel analysis. Nsight Systems is the tracing tool and Nsight Compute is the kernel profiler. Nsight Systems is much faster and more scalable (it now uses Qt for the GUI, where nvvp was using Java).
* To profile a GPU you need a GPU; to look at the profile you don't, just the software.
* Just to encourage the audience: the market price for a Tesla V100 is around 10,000 €, and each of us today has 4 of them (a 40,000 € budget) to enjoy for 8 hours in a row. :-)
* `make profile` creates the profile; `./baseline` just runs the code. If you look at the Makefile it shows you the command it runs to create the profile, which you can of course also run manually.
* We didn't really use Makefiles on Tuesday, old-school command line! :)
* Usually the DLI gives you a T4 (Turing family) for running the CUDA labs; that is way cheaper to rent on AWS. If you want to purchase a GeForce model, to me the ones with the "60" suffix often offer a good performance/price ratio.
* "More affordable" is in the eye of the beholder: in terms of value per FLOP vs CPU, for many apps GPUs are very cost-effective, for example ;) But of course there are lots of options for accessing different GPUs from both cloud providers and resellers; it depends on what you are trying to do, e.g. whether double precision is important to you, memory requirements, time to solution, etc.
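* To illustrate the "one data item per thread" sizing mentioned a few notes up (as opposed to the grid-stride approach), here is a minimal sketch; the block size of 256 comes from that note, while the kernel `scale` and the problem size are illustrative assumptions:

  ```cpp
  #include <cuda_runtime.h>

  // One element per thread: guard against the last, partially filled block.
  __global__ void scale(int n, float a, float *x)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                 // threads past the end of the data do nothing
          x[i] = a * x[i];
  }

  int main()
  {
      const int N = 1000000;     // toy problem size
      float *x;
      cudaMallocManaged(&x, N * sizeof(float));
      for (int i = 0; i < N; ++i) x[i] = 1.0f;

      const int blockSize = 256;                              // intermediate block size
      const int numBlocks = (N + blockSize - 1) / blockSize;  // problem size / block size, rounded up
      scale<<<numBlocks, blockSize>>>(N, 3.0f, x);
      cudaDeviceSynchronize();

      cudaFree(x);
      return 0;
  }
  ```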
* **May I ask how much it costs per GPU hour? Is it the same definition as, and comparable to, a CPU hour?** Hi, you'd need to check with your cloud provider of choice. I believe it is comparable in terms of definition, but of course in one hour of GPU you can usually accomplish a lot more work than you would with just a CPU, assuming you have an accelerated application.
* ***Are both the CPU hour and GPU hour defined in terms of core-hours? It was a bit confusing to me, since a GPU is manycore hardware while a CPU has only a few cores.*** It's usually based on the whole device, although you can now get fractions of GPUs via the new A100 Multi-Instance GPU, for example. But I think most providers charge per GPU device - the A100 has 6912 cores, so per-core pricing would be crazy ... :)
* Shared memory is user-controlled L1 cache. It is sometimes referred to as "scratchpad" to distinguish it from the general concept of "shared memory architectures" in HPC.
* It is called "shared" because all threads within a block can share its contents. There is a shared memory on each GPU multiprocessor, which is evenly used by all blocks assigned to that multiprocessor. But as Christian said, this is out of the scope of this workshop and has nothing to do with streams.
* So, given that cudaMalloc will always execute in the default stream, what does that mean for best practice in terms of efficiently allocating memory when using multiple streams? Allocate everything at the beginning, or use a memory pool like RMM or the new async allocation API (cudaMallocAsync).
* Pinned memory is by definition non-swappable memory (a fixed location). Under the covers, when a memory copy occurs (managed by Linux), if the source operand lies in pageable memory it is first copied to a pinned memory buffer. So, to avoid this extra operation, it's better to start with data already located in pinned memory, saving one copy (a sketch putting this together follows a few notes below).
* In the notebook, in the section "Host-to-Device Memory Transfers in a Non-Default Stream", both cudaMallocs use data_cpu, is that right? It uses a stream called "stream" that you defined, and thus it is a non-default stream.
* Hint: if you overuse pinned memory, there will be less DRAM available for the other processes running at the same time, so those processes may suffer from "thrashing" (more time dedicated to swapping memory than to actual progress). It is up to you to prioritize your CUDA application, up to a certain limit... CUDA apps are always the most important apps, of the highest priority in my life, I don't understand the warning :)
* Is there a typical memcpy size that you don't want to go below, e.g. because of memory-transfer setup overhead? I would roughly say a few kilobytes, but that will depend on whether your graphics card uses GDDR or HBM memory (the high-end alternative). The GDDR# saga increases latency and also bandwidth as # increases, so in GDDR7 the threshold will be higher than in the previous GDDR6.
* For the V100 the maximum number of resident grids (and therefore concurrent kernels in streams) is 128, nothing to do with SMs.
* "Max number of resident grids per device" - https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability But I'd say the optimal choice of the number of streams is application-dependent; you will eventually get diminishing returns from adding more streams, for example.
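* Putting the pinned-memory and non-default-stream notes above together, a minimal sketch (the names `data_cpu`/`data_gpu` and the kernel `process` are illustrative, not the notebook's actual code):

  ```cpp
  #include <cstdint>
  #include <cuda_runtime.h>

  __global__ void process(uint64_t *data, uint64_t n)
  {
      uint64_t stride = gridDim.x * blockDim.x;
      for (uint64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
          data[i] += 1;
  }

  int main()
  {
      const uint64_t N = 1 << 24;

      uint64_t *data_cpu, *data_gpu;
      // Pinned (page-locked) host memory: cudaMemcpyAsync can overlap with compute
      // and the hidden pageable-to-pinned staging copy is avoided.
      cudaMallocHost(&data_cpu, N * sizeof(uint64_t));
      // cudaMalloc is not issued into your streams, so allocate up front rather
      // than inside the streamed part of your code.
      cudaMalloc(&data_gpu, N * sizeof(uint64_t));
      for (uint64_t i = 0; i < N; ++i) data_cpu[i] = i;

      cudaStream_t stream;                  // a non-default stream
      cudaStreamCreate(&stream);

      cudaMemcpyAsync(data_gpu, data_cpu, N * sizeof(uint64_t),
                      cudaMemcpyHostToDevice, stream);
      process<<<80 * 32, 64, 0, stream>>>(data_gpu, N);   // better: query the SM count
      cudaMemcpyAsync(data_cpu, data_gpu, N * sizeof(uint64_t),
                      cudaMemcpyDeviceToHost, stream);

      cudaStreamSynchronize(stream);        // wait for copy-compute-copy to finish

      cudaStreamDestroy(stream);
      cudaFree(data_gpu);
      cudaFreeHost(data_cpu);
      return 0;
  }
  ```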
* Someone may already have mentioned this, but there is a hardware limit to how many streams can actually be running on the device. You can logically have more streams than this, but they would be idling, waiting for resources to become available.
* As mentioned earlier, on the first run-through the CPU does the encryption and saves the encrypted data; on subsequent run-throughs it just loads it from file, to save you getting annoyed waiting for the CPU.
* **The Zen of Python** https://www.python.org/dev/peps/pep-0020/
* The number of streams also influences the chunk size; you should consider the latency of the copy operations in your theoretical scalability model.
* That's what I said when analyzing the single-GPU app (the copy takes more time than the compute, i.e. we cannot fully hide communication -> increase num_iters to change that). By staring at the single-GPU runtime one could have known that beforehand, which is why it is so important to profile your app: to spot opportunities for improvement and to exclude certain optimizations from the very beginning.
* Somebody already asked, but I could not find the answer, about: "CUDA error: invalid argument : helpers.cuh, line 17". Shall we care about it? It means your code throws an error (a wrong memcpy direction, maybe?). The line given is the one of the helper function, and the only way to fix that right now is to manually inline that code into your main method (maybe `__forceinline__` helps).
* Hi guys, for a bit of further reading on the topic of multi-GPU programming: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9139-multi-gpu-programming-models.pdf (a minimal example of the basic pattern follows below).
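* And, since this is what the course builds towards, a minimal multi-GPU sketch in the same spirit as the slides linked above: split the data into one chunk per device, select each device with `cudaSetDevice`, and issue copies and kernels into a per-device non-default stream (the kernel `process` and the chunking scheme are illustrative assumptions):

  ```cpp
  #include <algorithm>
  #include <cstdint>
  #include <vector>
  #include <cuda_runtime.h>

  __global__ void process(uint64_t *data, uint64_t n)
  {
      uint64_t stride = gridDim.x * blockDim.x;
      for (uint64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
          data[i] *= 2;
  }

  int main()
  {
      const uint64_t N = 1 << 26;
      uint64_t *data_cpu;
      cudaMallocHost(&data_cpu, N * sizeof(uint64_t));   // pinned host buffer
      for (uint64_t i = 0; i < N; ++i) data_cpu[i] = i;

      int numGpus;
      cudaGetDeviceCount(&numGpus);

      // One chunk, one device buffer and one stream per GPU.
      const uint64_t chunkSize = (N + numGpus - 1) / numGpus;   // ceiling division
      std::vector<uint64_t*>    data_gpu(numGpus);
      std::vector<cudaStream_t> streams(numGpus);

      for (int g = 0; g < numGpus; ++g) {
          cudaSetDevice(g);                 // subsequent calls target GPU g
          const uint64_t lower = g * chunkSize;
          const uint64_t width = std::min(chunkSize, N - lower);

          cudaMalloc(&data_gpu[g], width * sizeof(uint64_t));
          cudaStreamCreate(&streams[g]);

          cudaMemcpyAsync(data_gpu[g], data_cpu + lower, width * sizeof(uint64_t),
                          cudaMemcpyHostToDevice, streams[g]);
          process<<<80 * 32, 64, 0, streams[g]>>>(data_gpu[g], width);  // 80 SMs x 32, per the notes above
          cudaMemcpyAsync(data_cpu + lower, data_gpu[g], width * sizeof(uint64_t),
                          cudaMemcpyDeviceToHost, streams[g]);
      }

      // Wait for every device's stream, then clean up.
      for (int g = 0; g < numGpus; ++g) {
          cudaSetDevice(g);
          cudaStreamSynchronize(streams[g]);
          cudaStreamDestroy(streams[g]);
          cudaFree(data_gpu[g]);
      }
      cudaFreeHost(data_cpu);
      return 0;
  }
  ```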