

High-Performance Data Analytics with Python - Day 3

High-Performance Data Analytics with Python — Schedule: https://hackmd.io/@yonglei/python-hpda-2025-schedule

Schedule

| Time        | Contents                    | Instructor(s) |
|-------------|-----------------------------|---------------|
| 09:05-10:15 | Performance boosting        | YW            |
| 10:15-10:30 | Break                       |               |
| 10:30-11:55 | Dask for scalable analytics | AM            |
| 11:55-12:00 | Q/A & Summary               | YW            |


You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.

Questions, answers and information

6. Performance Boosting

|    | Step 0 | Step 1 | Step 2 | Step 3 | Step 4 |
|----|--------|--------|--------|--------|--------|
| YL | 200    | 139    | 65.6   | 58     | 14.3   |
| MC | 132    | 106    | 53.2   | 27.2   | 0.5    |
| KS | 177    | 133    | 41.5   | 38.3   | 0.5    |
| MK | 305    | 177    | 71.1   | 59     | 0.812  |
| Em | 88     | 79     | 35     | 28.7   | 0.356  |
| IV | 532    | 320    | 144    | 129    | 40.6   |
| TW | 175    | 159    | 41.6   | 41.5   | 0.6    |
| OS | 271    | 229    | 78.6   | 63.2   | 1.2    |
| OR | 262    | 229    | 74.3   | 62.6   | 0.87   |
  • Do you know if numpy.typing annotations are compatible with Cython? I.e., do they also result in a speedup?

    • Yes and no.
      Cython 3.0 supports standard Python annotations, but to get typed (fast) code you have to use Cython's own types, something like:
    ```python
    def sum3d(arr: cython.int[:, :, :]) -> cython.int:
        ...
    ```
    
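      In pure Python mode the same annotation style looks like the sketch below (an illustration, assuming Cython 3 is installed; the function name is made up). The function also runs as ordinary Python when it is not compiled, because the `cython` shadow module turns the types into plain annotations:

    ```python
    import cython

    def scale(x: cython.double, factor: cython.double) -> cython.double:
        # Compiled with Cython 3, these annotations generate typed C code;
        # uncompiled, they are ignored and the function runs as plain Python.
        return x * factor

    print(scale(2.0, 3.5))  # 7.0
    ```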
  • How persistent is the Numba cache? Will it be cleared upon, e.g., closing my project? Or is it only runtime-persistent, or something else?

    • By default the compiled code is stored only in memory, which is why it is called Just-In-Time and not Ahead-of-Time. So with Numba, the first execution in a session will be nearly as slow as pure Python code.
    • Found this interesting Numba link. Do you have any experience with this? Looks like a new feature.
      • Interesting. TIL! Thanks for sharing.
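    • On the cache question: Numba can also persist compiled code to disk by passing `cache=True` to `@numba.jit`, so later interpreter sessions can skip recompilation. A minimal sketch (assuming Numba and NumPy are installed; the function name is made up):
    ```python
    import numba
    import numpy as np

    @numba.jit(nopython=True, cache=True)  # cache=True writes compiled code to __pycache__
    def total(arr):
        s = 0.0
        for x in arr:
            s += x
        return s

    a = np.arange(1000, dtype=np.float64)
    print(total(a))  # 499500.0
    ```
      The first call in a fresh process still pays a small load cost, but the full compilation is skipped as long as the on-disk cache is valid.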
  • The last command, `%timeit apply_integrate_f_numba_dtype(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())`, gives me a TypeError: "No matching definition for argument type(s) array(float64, 1d, C), array(float64, 1d, C), array(int32, 1d, C)". How can I solve this? Do I need to install something in my VS Code?

    • It is not installation or VS Code related. Can you check `df['a'].to_numpy().shape`, `df['b'].to_numpy().shape`, and so on, as well as `df['a'].to_numpy().dtype` etc.? Once you have that, you need to either modify the inputs or modify the types in the `@numba.jit` decorator for that function.
    • It seems that when you run the code, one integer array is int32 but the declared type is int64.
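    • The mismatch can be reproduced and fixed with a cast (a sketch using only NumPy; the array values are made up). On platforms where `to_numpy()` yields int32 (e.g. Windows), cast to the integer type the jitted signature declares before calling it:
    ```python
    import numpy as np

    # An int32 array, as df['N'].to_numpy() can produce on some platforms:
    N = np.array([10, 20, 30], dtype=np.int32)
    print(N.dtype)    # int32

    # Cast to the dtype the @numba.jit signature expects:
    N64 = N.astype(np.int64)
    print(N64.dtype)  # int64
    ```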
  • Can you please show how to load Cython in VSCode?

    • To run Cython code in VS Code you need the Jupyter VS Code extension. Once you have that, with a conda or virtual environment containing Cython selected as the interpreter, Cython should work. Does that answer your question?

Break until XX:30

7. Dask for Scalable Analytics

I am getting a "Nanny failed to start" error when I try to run a Dask cluster/client.
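  • One common cause (an assumption here, since the full traceback is not shown) is worker-process spawning on the platform; a threads-only `LocalCluster` avoids starting nanny processes at all:

    ```python
    from dask.distributed import Client, LocalCluster

    # processes=False runs workers as threads inside this process,
    # so no nanny (worker supervisor) process needs to be spawned.
    cluster = LocalCluster(processes=False, n_workers=1)
    client = Client(cluster)
    print(client.status)  # "running"
    client.close()
    cluster.close()
    ```

    In scripts that do use worker processes, creating the `Client` inside an `if __name__ == "__main__":` guard is another common fix.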

  • A tree-like figure appears in the output; both `dask.visualize(sum_da)` and `sum_da.visualize()` produce the same task-graph figure. *(image not preserved)*

  • I can see the 4 chunks in a graph layer. *(image not preserved)*
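The chunking being discussed can be sketched as follows (the array shape and chunk sizes here are made up; assumes Dask is installed):

```python
import dask.array as da

# An 8x8 array split into four 4x4 chunks:
x = da.ones((8, 8), chunks=(4, 4))
print(x.numblocks)       # (2, 2) -> 4 chunks, visible as one graph layer

sum_da = x.sum()
# sum_da.visualize() would render the task graph (requires graphviz);
# compute() actually executes it:
print(sum_da.compute())  # 64.0
```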

Exercise set 1 (until xx:30)

https://enccs.github.io/hpda-python/dask#exercise-set-1

  • How to open Jupyter on LUMI with the correct environment? If I open it through the LUMI login webpage, it doesn't connect to pyhpda conda environment, and therefore doesn't have Dask installed.
    • Follow the instructions in the "Login to LUMI cluster via web-interface" section to set up the environment on LUMI.
    • Below are the details you should fill in when you launch a Jupyter notebook on LUMI:
    ```
    Project: project_465001310
    Partition: interactive
    Number of CPU cores: 2
    Time: 4:00:00
    Working directory: /projappl/project_465001310
    Python: Custom
    Path to python: /project/project_465001310/miniconda3/envs/pyhpda/bin/python
    ```
      Check "Enable system installed packages on venv creation" and "Enable packages under ~/.local/lib on venv start". Click the Launch button and wait a few minutes until your requested session is created. Then click the Connect to Jupyter button and select the kernel "Python 3 (venv)" for the created Jupyter notebook.
    

Exercise set 2 (until xx:55)

https://enccs.github.io/hpda-python/dask/#exercise-set-2


Always ask questions at the very bottom of this document, right above this.