## <center> Research 4 Innovation 2021 - INFN </center>
# <center>Open For Better Computing</center>
#### <center>Development and use case study of modern partitionable GPUs for graphical and computing applications under Linux KVM</center>
The GPU development and evolution of recent years has shown much stronger peak-performance growth than that of CPUs.
This strong growth forces the rapid adoption of new programming paradigms and toolsets.
Programming frameworks are often born, grow, and die within two or three GPU architecture generations.
Because of this ever-evolving landscape, ML and AI libraries constantly have to resettle on new paradigms, at a high development cost.
Similarly, developing software capable of fully exploiting such architectures is a challenge even for the most skilled professionals.
The performance growth, driven mainly by shrinking transistor sizes, has pushed hardware vendors to introduce hardware and low-level software functionalities that enable GPU partitioning "à la CPU".
Such solutions (e.g. AMD MxGPU, Nvidia vGPU) are currently available in payware environments such as VMware and Citrix, but not in Linux KVM: for KVM there are some proofs of concept, but nothing enterprise-ready or of commercial grade.
In addition, a PCI-SIG specification called Single Root I/O Virtualization (SR-IOV) provides similar functionality with a potentially vendor-independent approach, enabling GPU sharing ubiquitously.
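As a concrete illustration, SR-IOV surfaces under Linux through a standard kernel sysfs interface for creating virtual functions (VFs) on a capable device. The sketch below only *prints* the privileged commands it would run, since they require real SR-IOV hardware and root access; the PCI address and VF count are hypothetical examples.

```shell
# Sketch of SR-IOV virtual-function creation through the kernel's sysfs
# interface. The PCI address below is a hypothetical example; the function
# prints the privileged commands instead of executing them.
sriov_plan() {
  dev="/sys/bus/pci/devices/$1"
  # Query how many VFs the physical function supports at most
  echo "cat $dev/sriov_totalvfs"
  # Writing N creates N VFs; writing 0 tears them down again
  echo "echo $2 > $dev/sriov_numvfs"
}
sriov_plan 0000:3b:00.0 8
```

Once created, each VF appears as an ordinary PCI function that can be passed through to a guest.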
Although each available solution has its own specificities, the market direction is clear.
In fact, this paradigm brings multiple advantages to existing and new deployments:
* Computing kernels for GPGPUs are rarely able to fully exploit large GPUs with big amounts of memory, and their optimization is a non-trivial, time-consuming task. Executing multiple copies of a poorly optimized computing kernel on a partitioned GPU can improve overall efficiency, reducing the performance penalty.
* Concerning graphical applications (i.e. CAD, 3D tasks, rendering), the user density achievable with partitionable GPUs capable of grouping up to 32 sessions on a single board (AMD Radeon Pro V340) is unrivalled. Such functionalities are incredibly interesting for VDI, but also as a way to add graphics-accelerated front-end machines to an HPC cluster, allowing the cluster's users to rely on the cluster itself for data visualization (e.g. protein and drug 3D visualization, HEP event displays, neural network topology rendering) without having to rely on personal workstations.
* Finally, dynamic partitioning of GPUs allows one to apply the "what you need when you need it" paradigm, granting maximum efficiency for any task, thereby reducing the power wasted by idle components and allowing a VM to expand its hardware configuration when needed.
Developing a layer that presents a common interface on top of such technologies (and of similar technologies developed in the future) in the open-source Linux KVM environment is crucial, paving the way for the wide adoption of this approach in both research and industry.
Such a toolset will ease and democratize access to GPU partitioning technologies, enabling, for example, the following applications in Linux KVM:
* Batch GPGPU software execution with a "develop once, run everywhere" approach: develop your algorithm targeting the GPU partition that constitutes the greatest common denominator of the whole set of available devices.
* ML and AI applications will exploit two aspects. The first is the same "develop once, run everywhere" approach. The second is the capability of reconfiguring the GPU partitioning during the various steps of a workflow: large, powerful GPU instances during training and multiple smaller instances during inference, on the same dynamically repartitioned hardware.
* Virtual Desktop Infrastructures and hyperconverged virtual appliances: a higher density of users on maximally partitioned GPUs, or a high density of powerful workstations with larger partitions, on the same hardware, repurposable online.
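The "large partitions for training, small partitions for inference" workflow above can be sketched against the kernel's VFIO mediated-device (mdev) sysfs interface, which vendor drivers such as Nvidia vGPU build upon. The device path, profile names and UUIDs below are hypothetical examples, and the function only prints the privileged commands a repartitioning step would issue.

```shell
# Sketch: repartition a GPU by creating mediated devices (mdevs) of a given
# profile through the kernel's VFIO mdev sysfs interface. Paths, profile
# names and UUIDs are hypothetical; commands are printed, not executed.
repartition() {
  types="/sys/class/mdev_bus/$1/mdev_supported_types/$2"
  shift 2
  for uuid in "$@"; do
    # Each write to "create" instantiates one GPU partition of this profile
    echo "echo $uuid > $types/create"
  done
}
# Training phase: one large instance of a hypothetical "big" profile
repartition 0000:3b:00.0 vendor-big 11111111-2222-3333-4444-555555555555
# Inference phase: four small instances of a hypothetical "small" profile
repartition 0000:3b:00.0 vendor-small \
  aaaaaaa1-0000-0000-0000-000000000001 \
  aaaaaaa1-0000-0000-0000-000000000002 \
  aaaaaaa1-0000-0000-0000-000000000003 \
  aaaaaaa1-0000-0000-0000-000000000004
```

Tearing down existing mdevs and creating new ones with a different profile is what makes the same board reusable across workflow phases.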
The tool will be developed by INFN and exploited in many deployments, enabling the support for such technologies in its in-house KVM-based infrastructures.
This tool will operate as a plugin of the KVM-libvirt environment.
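As an illustration of the kind of glue such a plugin would generate, a GPU partition exposed as an SR-IOV virtual function can be handed to a guest through a standard libvirt `<hostdev>` element. The helper below emits such a fragment from a PCI address; the address itself is a hypothetical example.

```shell
# Emit a libvirt <hostdev> XML fragment for PCI passthrough of a GPU
# virtual function. The address format 0000:3b:00.2 (domain:bus:slot.func)
# is assumed; the specific address is a hypothetical example.
vf_hostdev_xml() {
  addr=$1
  domain=${addr%%:*}; rest=${addr#*:}
  bus=${rest%%:*};    rest=${rest#*:}
  slot=${rest%%.*};   func=${rest#*.}
  cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$func'/>
  </source>
</hostdev>
EOF
}
vf_hostdev_xml 0000:3b:00.2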
INFN will mainly tackle the tests concerning GPU computing with ForBC.
In order to enable the adoption of this tool beyond the scientific community, a start-up will provide commercial users with support plans and expert training, and will lead the tool's development community.
The start-up will focus its efforts on VDI applications exploiting the same technology.
## External collaborations
The following actors are taking part in the initiative:
* AMD will provide top-notch cloud GPUs equipped with MxGPU technology, together with remote technical support;
* Intel has expressed great interest in the initiative, and its involvement is under investigation;
* Nvidia has been contacted, and its team is evaluating the project and the company's involvement;
* Mellanox will provide the networking equipment necessary for the tests.
We firmly believe that aggregating as many actors as possible is the key to creating a truly hardware-agnostic software platform.
## About INFN
The Istituto Nazionale di Fisica Nucleare (Italian National Institute for Nuclear Physics) is an institution roughly 80 years old that has driven some of the most important scientific efforts around the globe since its very beginning.
Nowadays INFN is involved in many experiments regarding high-energy physics (mainly the LHC at CERN), astrophysics and cosmology (gravitational-wave observatories), biomedical research, and many other fields.
Last but not least, since IT is the fabric enabling, behind the scenes, the operations of all these scientific efforts, INFN is highly involved in IT research and in the administration of production IT services.
INFN has always trusted the power of technology transfer as a means to sustain self-developed leading-edge tools, which can thrive thanks to the much wider audience provided by users outside the scientific community.