# Exploring a oneAPI-based GPU-powered backend for scikit-learn

*Franck Charras*

## Context

We, the [soda-inria](https://team.inria.fr/soda/team-members) team, experimented with developing a scikit-learn computational backend, called [sklearn-numba-dpex](https://github.com/soda-inria/sklearn-numba-dpex), using the oneAPI-based numba_dpex JIT compiler, so that the backend can run on CPUs, integrated GPUs (iGPUs), and discrete GPUs alike. It accompanies the development of a [plugin system](https://github.com/scikit-learn/scikit-learn/issues/22438) in scikit-learn that opens scikit-learn estimators up to alternative backends.

The backend implements its kernels with the Intel-led, numba-like [numba_dpex](https://github.com/IntelPython/numba-dpex) JIT compiler, which translates Python functions into compiled code. numba_dpex is built on top of [dpnp](https://github.com/IntelPython/dpnp), a oneAPI-powered NumPy drop-in replacement, and [dpctl](https://github.com/IntelPython/dpctl), a lower-level library that exposes Python bindings for SYCL concepts such as queues and kernels, and in particular an [Array API](https://data-apis.org/array-api/latest/)-compatible array library, `dpctl.tensor`.
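To give a feel for this programming model, here is a minimal sketch of a numba_dpex kernel operating on `dpctl.tensor` arrays. The kernel-launch syntax has changed across numba_dpex releases; this follows the 0.2x-era API, and the array sizes and contents are arbitrary:

```python
import dpctl.tensor as dpt
import numba_dpex as dpex


@dpex.kernel
def add(a, b, out):
    # Each work item computes one element of the output.
    i = dpex.get_global_id(0)
    out[i] = a[i] + b[i]


n = 1024
# dpctl.tensor arrays live on the default SYCL device, typically
# a GPU when one is available, the CPU otherwise.
a = dpt.arange(n, dtype="float32")
b = dpt.ones(n, dtype="float32")
out = dpt.empty(n, dtype="float32")

# Dispatch n work items; the same kernel compiles for CPU, iGPU and GPU.
add[dpex.Range(n)](a, b, out)
```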
## Goals

The goal of our study is to evaluate the potential of the oneAPI-based software stack in terms of ease of installation, interoperability, and GPU performance for popular scikit-learn estimators.

GPUs are known to be the preferred hardware for deep-learning applications, but they have also proved relevant for a wide range of other algorithms: k-means, nearest neighbors search, gradient-boosted trees... CPU-based implementations can be outperformed, particularly when data is plentiful to the point where training an estimator becomes a bottleneck. We want to evaluate at which point GPUs really start to matter, and whether this is a concern for typical scikit-learn use cases, either because it unlocks working with very large datasets, or because it enables interactive productivity on datasets that require tens of seconds or minutes on CPU.

## Targeted hardware, platform and OS

The main targeted hardware for this project are CPUs, Intel iGPUs (integrated graphics chipsets that mostly come embedded in laptops), and Intel's latest discrete GPU series, namely the Flex and Max series for servers and the Arc series for consumer-grade computers. While we obviously don't expect performance on iGPUs to be comparable to discrete GPUs, it's particularly interesting to see how they compare to the CPUs they're embedded with. Many scikit-learn users might just want to run compute on their personal laptop, and any significant speed-up there is very valuable.

While the latest oneAPI releases do have CUDA and HIP support, numba_dpex is not yet able to compile and run kernels for NVIDIA GPUs and is not tested on AMD GPUs. Hence, choosing numba_dpex didn't let us extend the scope of our benchmarks to other vendors, although [preliminary investigations](https://github.com/IntelPython/dpctl/discussions/1124) have shown potential for extending compatibility.

We run the software on Linux distributions. We ran benchmarks on personal laptops, as well as in the cloud using the Intel developer edge cloud and the beta version of the new Intel Developer Cloud.

## Installation

Installing a working numba_dpex runtime consists of three steps:

- installing low-level drivers
- installing low-level runtime libraries
- installing high-level runtime libraries

For the first two steps, we used the [official instructions](https://dgpu-docs.intel.com/installation-guides/index.html) that are available for Ubuntu on Intel dev cloud servers. It _just works_. Note that it requires a sequence of somewhat unusual steps (like editing the `grub` configuration file), and it includes installing a specific version of the Linux kernel, which, while acceptable for a server set up for a single task, might be troublesome for general-purpose workstations. There are alternative installers that don't require pinning the kernel version, but we have found those generally hard to use.

The high-level runtime libraries, including the oneAPI-based runtimes and the Python libraries, can be installed with conda [following our guide](https://github.com/soda-inria/sklearn-numba-dpex#setup-a-conda-environment-for-numba-dpex-22), which again has some complexity. For instance, it requires vendor-specific channels instead of the more popular, community-maintained `conda-forge` channel. We also provide a [docker image](https://github.com/soda-inria/sklearn-numba-dpex#using-the-docker-image) that has proved stable and enables quickly starting ready-to-go environments. We've found that using Intel GPUs from within a docker container (using the `--device=/dev/dri` option) works well for all GPU architectures (tested with an iGPU and the Max series).

## Interoperability

The numba_dpex high-level libraries are not yet compatible with AMD and NVIDIA GPUs, but they already showcase their potential by offering seamless compatibility across CPUs, Intel iGPUs and Intel GPUs: the same code that we wrote and optimized for GPUs can be compiled and executed on CPU, and also works on iGPUs. This is great news, since it means:

- I can start writing code on my favorite terminal, be it a GPU-less computer, and start benchmarking on high-performance hardware only after having ensured that the code works (returns the expected outputs) on my daily-use machine.
- A first level of continuous integration can easily be set up on GitHub to run unit tests on CPU. This is much cheaper, more accessible and simpler than requiring cloud access to GPU-powered VMs, which is typically necessary to set up continuous integration for CUDA-based software projects. For sklearn_numba_dpex, the [automated unit-test pipeline](https://github.com/soda-inria/sklearn-numba-dpex/blob/main/.github/workflows/run_tests.yaml#L29) is started with pytest:

  ```
  pytest -v sklearn_numba_dpex/
  ```

  and since the default GitHub Actions runners do not provide access to GPUs, the tests will run on CPU. Running the same command in a local environment that provides a GPU or an iGPU will instead automatically run the tests on the GPU. `SYCL` has built-in environment variables that force a specific device (as long as it's available):

  ```
  # force all SYCL instructions to run on CPU
  SYCL_DEVICE_FILTER=cpu pytest -v sklearn_numba_dpex/
  ```

  if one wants to, for instance, reproduce the GPU-less environment of a CPU-only CI runner.
- Being able to leverage iGPUs unlocks performance improvements on ordinary personal laptops, which benefits an even wider part of the user base.

We've found that code that works on CPU translates to code that works on GPU/iGPU, and vice versa. Except for minor quantitative differences in device parameters, the oneAPI concepts that programmatically describe devices are abstracted in a way that translates into working instructions for both GPU and CPU architectures.

We note that while the same code will always work on devices of a different nature, algorithm optimization generally specializes for one type of device architecture, so that good performance on a given device type will not carry over to other types. For sklearn_numba_dpex we seek the best performance on GPUs; running the algorithms on CPU is really useful for testing, but the performance is inferior, so we don't include CPU runs of the sklearn_numba_dpex engine in the benchmarks. We rather compare against the scikit-learn and scikit-learn-intelex implementations, which are well optimized for CPU.

## Performance of k-means on Intel iGPUs, Flex and Max series

We implemented a production-ready `numba_dpex` engine for [k-means](https://github.com/soda-inria/sklearn-numba-dpex/blob/main/sklearn_numba_dpex/kmeans/engine.py#LL36C7-L36C19) and devised a [scikit-learn branch](https://github.com/scikit-learn/scikit-learn/tree/feature/engine-api) that exposes a plugin interface the engine can plug into. We can then ensure that our plugin implements the exact same specs as scikit-learn's vanilla k-means, with the same quality control, by [running scikit-learn's unit tests for k-means](https://github.com/soda-inria/sklearn-numba-dpex/blob/main/.github/workflows/run_tests.yaml#L43) with the engine activated.
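For reference, this is roughly how the engine is activated, following the sklearn-numba-dpex README at the time of writing; the plugin API is experimental and may change, and the dataset parameters here are arbitrary:

```python
import sklearn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, n_features=14, random_state=0)

# `engine_provider` is only understood by the experimental
# `feature/engine-api` scikit-learn branch, with sklearn-numba-dpex installed.
with sklearn.config_context(engine_provider="sklearn_numba_dpex"):
    kmeans = KMeans(n_clusters=127, n_init=1).fit(X)
```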
Both from a compute-throughput and a memory-usage point of view, a k-means implementation on GPU benefits from fusing pairwise distance computation and weight updates into the same set of dispatched parallel tasks, using computational tricks that are very specific to GPU architectures. This is typically not possible to reproduce with the high-level primitives available in array libraries such as NumPy or PyTorch. A low-level GPU implementation framework like numba_dpex is thus well suited for k-means, and we expect this example to showcase both the performance improvement one can get from a GPU implementation of k-means and the power of the numba_dpex framework.

### Methodology

#### Per-device tuning of performance parameters

GPU implementations expose performance parameters, such as the size of allocated shared memory, the width of contiguous memory read/write steps, the number of iterations per thread, the size of thread groups... We found that the performance on a given device can be very sensitive to variations of these parameters, and that the best set of parameters for one device can yield bad performance on another. The performance reported hereafter for each device was measured after tuning the performance parameters, by automatically grid-searching the parameter space on smallish inputs and keeping the combination that gave the best benchmark. In practice, our sklearn_numba_dpex repository currently uses a set of parameters that we found optimal for the Max series GPUs, but that can translate to worse performance on the other tested devices. To unlock the best performance for all devices and all algorithms, further work should be considered:

- either maintaining a catalogue of the best performance parameters for all supported devices,
- or decorating the JIT compiler with an _autotuner_ that runs a parameter search on the user's hardware right after compile time (see the sketch after this list).
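Such an autotuner could look roughly like the following sketch, where `kernel_factory` is a hypothetical helper that compiles a kernel for a given parameter combination; none of these names come from numba_dpex:

```python
import itertools
import time


def autotune(kernel_factory, param_grid, args, n_warmup=1, n_reps=5):
    """Return the parameter combination that runs fastest on this device."""
    best_params, best_time = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        kernel = kernel_factory(**params)  # hypothetical compile step
        for _ in range(n_warmup):
            kernel(*args)  # trigger JIT compilation, warm up caches
        tic = time.perf_counter()
        for _ in range(n_reps):
            kernel(*args)
        elapsed = (time.perf_counter() - tic) / n_reps
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params


# Hypothetical usage, with parameters like those mentioned above:
# autotune(make_kmeans_kernel,
#          {"work_group_size": [64, 128, 256], "items_per_thread": [1, 2, 4]},
#          args=(X, centroids))
```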
#### JIT compilation time

JIT compilation time is subtracted from the total walltime in the benchmark. In practice it is not negligible, but reasonably short: for k-means, expect a few seconds of compilation time on the first calls to the `fit` and `predict` methods; once compiled, the binaries are cached for the remainder of the session.

### Benchmark results

The following figure displays benchmark walltimes on CPU (with the scikit-learn-intelex and scikit-learn CPU-optimized implementations), iGPU, and discrete GPUs (using numba_dpex), for k=127 clusters on a dataset with a small number of dimensions (d=14), in line with typical k-means use cases, and 50 million samples. We measure performance over 100 k-means iterations. The benchmark can be reproduced using [our benchmark script](https://github.com/soda-inria/sklearn-numba-dpex#running-the-benchmarks).

![](https://hackmd.io/_uploads/SkiGuMFHh.png)

In this setting, discrete GPUs complete within seconds while CPU implementations complete within minutes, with compute up to 10 times faster on discrete GPUs. This is in line with, and even faster than, what the [rapids.ai benchmarks](https://rapids.ai/) suggest for CUDA devices. GPU-backed implementations will noticeably improve interactive productivity with k-means on datasets with similar characteristics, as long as the whole dataset fits into GPU memory (which could hold up to 5 times more data on discrete GPUs than used in our benchmark). Going beyond this dataset size would require carefully loading slices of the dataset back and forth between system memory and device memory, which calls for additional development.

iGPU performance is not as impressive, but given the cost and accessibility of an iGPU, it still offers a decent speed-up over the CPU it's embedded with, up to twice as fast, with the potential to reach a large user base. As expected, discrete GPU performance is leaps and bounds ahead, with the Max series GPU, the most performant GPU in Intel's cloud offering, yielding about a 20% speed-up over the Flex series.

### Conclusion and future work

We were really pleased with the performance we got for our k-means implementation and are now discussing going forward with merging the plugin API into scikit-learn so the plugin can be released to a large pool of users. We hope that the numba_dpex libraries and dependencies will become easier to install, with pip from the default pypi.org repository or with conda from the community-based `conda-forge` channel, so that users are not discouraged by the environment-management issues that can occur with the current installers.

Our next target is implementing a k-nearest neighbors estimator with performance optimized for GPU and good interoperability. Unlike k-means, this estimator does not benefit from fusing kernels, and we intend to explore using the high-level oneAPI-based primitives (`topk`/`partition` and `matmul`/`cdist`) that are exposed in PyTorch through the [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch). Those kernels are key building blocks of many algorithms, so we expect them to be optimized by specialized teams and to yield much more performance than a numba_dpex re-implementation. Moreover, using the PyTorch front-end natively will enable interoperability with NVIDIA and AMD devices through the native PyTorch binaries.
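To illustrate the idea, a brute-force k-NN query can be written entirely with those two primitives. This is a sketch of the approach, not the future implementation; the `xpu` device is registered by importing `intel_extension_for_pytorch` on supported builds:

```python
import torch

try:
    import intel_extension_for_pytorch  # noqa: F401  registers the "xpu" device
    device = "xpu"
except ImportError:
    device = "cpu"  # or "cuda" on NVIDIA hardware


def kneighbors(X, Q, k):
    """Brute-force k nearest neighbors of queries Q among points X."""
    X = torch.as_tensor(X, device=device)
    Q = torch.as_tensor(Q, device=device)
    distances = torch.cdist(Q, X)  # (n_queries, n_samples) euclidean distances
    # largest=False keeps the k smallest distances per query
    dist, indices = torch.topk(distances, k, largest=False)
    return dist, indices
```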
---

<span style="color:grey">

**BEGIN optionally, more context**

[SYCL](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_common_functions) is an open specification maintained by the Khronos Group, a consortium that also has to its credit pivotal specifications such as OpenCL, OpenGL, WebGL and Vulkan. SYCL is a programming model that aims at writing and dispatching data and tasks on heterogeneous hardware, including CPUs and GPUs from all manufacturers, in a simple and flexible way.

Practically speaking, interoperability with devices from all manufacturers is a work in progress. Intel's bleeding-edge `llvm` project embeds an open-source implementation of the SYCL specification whose first target has been Intel-branded hardware, but it also contains [proofs of concept of interoperability](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md#cuda-back-end-limitations) for all major manufacturers, including NVIDIA and AMD, with `CUDA` and `HIP`/`ROCm` backends. The proprietary `dpcpp` compiler shipped with Intel's oneAPI Base Toolkit builds on top of this project, and can also be extended with [plugins developed by Codeplay Software](https://www.intel.com/content/www/us/en/developer/articles/training/oneapi-2023-toolkits-and-codeplay-plug-in-support.html#gs.wcwdut) that extend compatibility to AMD and NVIDIA GPUs. (Alternative implementations are also under active development, such as [OpenSYCL](https://github.com/OpenSYCL/OpenSYCL).)

As opposed to the `CUDA` programming model, which specifically targets NVIDIA devices, developers can use SYCL-based programming front-ends to write their application without specializing the stack for a specific device type or manufacturer, and ensure it will run on any device, provided the low-level runtime drivers are installed on the user's machine. This appeals both to developers, who can easily target a wider range of users, and to users, who can seamlessly run the same software on whatever compute power they have access to.

Software accessibility is a strong principle in the design philosophy of the scikit-learn library. The scikit-learn project aims at developing and distributing user-friendly software that is easy to use and easy to install. Scikit-learn serves users who expect their software to be installable with a single command, and to then provide a ready-to-use data science environment with the best performance on any computer they have access to. On paper, SYCL-based backends have the potential to unlock performance improvements without adding user-side complexity, because they can detect and use any available device.

**END optionally, more context**

</span>
