

High-Performance Data Analytics with Python - Day 3

High-Performance Data Analytics with Python — Schedule: https://hackmd.io/@yonglei/python-hpda-2025-schedule

Schedule

| Time        | Contents                    | Instructor(s) |
|-------------|-----------------------------|---------------|
| 09:05-10:15 | Performance boosting        | YW            |
| 10:15-10:30 | Break                       |               |
| 10:30-11:55 | Dask for scalable analytics | AM            |
| 11:55-12:00 | Q/A & Summary               | YW            |


You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.

Questions, answers and information

6. Performance Boosting

|    | Step 0 | Step 1 | Step 2 | Step 3 | Step 4 |
|----|--------|--------|--------|--------|--------|
| YL | 200    | 139    | 65.6   | 58     | 14.3   |
| MC | 132    | 106    | 53.2   | 27.2   | 0.5    |
| KS | 177    | 133    | 41.5   | 38.3   | 0.5    |
| MK | 305    | 177    | 71.1   | 59     | 0.812  |
| Em | 88     | 79     | 35     | 28.7   | 0.356  |
| IV | 532    | 320    | 144    | 129    | 40.6   |
| TW | 175    | 159    | 41.6   | 41.5   | 0.6    |
| OS | 271    | 229    | 78.6   | 63.2   | 1.2    |
| OR | 262    | 229    | 74.3   | 62.6   | 0.87   |
  • Do you know if numpy.typing annotations are compatible with Cython? I.e., do they also result in a speedup?

    • Yes and no.
      Cython 3.0 supports standard Python annotations, but to get typed (fast) code you have to use Cython's own types, something like:
    ```python
    def sum3d(arr: cython.int[:, :, :]) -> cython.int:
        ...
    ```
    
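      In pure Python mode the same annotation style looks like the sketch below (an illustration, assuming Cython 3 is installed; the function name is made up). The function also runs as ordinary Python when it is not compiled, because the `cython` shadow module turns the types into plain annotations:

    ```python
    import cython

    def scale(x: cython.double, factor: cython.double) -> cython.double:
        # Compiled with Cython 3, these annotations generate typed C code;
        # uncompiled, they are ignored and the function runs as plain Python.
        return x * factor

    print(scale(2.0, 3.5))  # 7.0
    ```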
  • How persistent is the Numba cache? Will it be cleared upon, e.g., closing my project? Or is it only runtime-persistent, or something else?

    • By default the compiled code is stored only in memory, which is why it is called Just-In-Time and not Ahead-of-Time. So with Numba, the first execution in a session will be nearly as slow as pure Python code.
    • Found this interesting Numba link. Do you have any experience with this? Looks like a new feature.
      • Interesting. TIL! Thanks for sharing.
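    • On the cache question: Numba can also persist compiled code to disk by passing `cache=True` to `@numba.jit`, so later interpreter sessions can skip recompilation. A minimal sketch (assuming Numba and NumPy are installed; the function name is made up):
    ```python
    import numba
    import numpy as np

    @numba.jit(nopython=True, cache=True)  # cache=True writes compiled code to __pycache__
    def total(arr):
        s = 0.0
        for x in arr:
            s += x
        return s

    a = np.arange(1000, dtype=np.float64)
    print(total(a))  # 499500.0
    ```
      The first call in a fresh process still pays a small load cost, but the full compilation is skipped as long as the on-disk cache is valid.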
  • The last command, `%timeit apply_integrate_f_numba_dtype(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())`, gives me a TypeError: "No matching definition for argument type(s) array(float64, 1d, C), array(float64, 1d, C), array(int32, 1d, C)". How can I solve this? Do I need to install something in my VS Code?

    • It is not installation or VS Code related. Can you check `df['a'].to_numpy().shape`, `df['b'].to_numpy().shape`, and so on, as well as `df['a'].to_numpy().dtype` etc.? Once you have that, you need to either modify the inputs or modify the types in the `@numba.jit` decorator for that function.
    • It seems that when you run the code, one integer array is int32 but the declared type is int64.
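    • The mismatch can be reproduced and fixed with a cast (a sketch using only NumPy; the array values are made up). On platforms where `to_numpy()` yields int32 (e.g. Windows), cast to the integer type the jitted signature declares before calling it:
    ```python
    import numpy as np

    # An int32 array, as df['N'].to_numpy() can produce on some platforms:
    N = np.array([10, 20, 30], dtype=np.int32)
    print(N.dtype)    # int32

    # Cast to the dtype the @numba.jit signature expects:
    N64 = N.astype(np.int64)
    print(N64.dtype)  # int64
    ```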
  • Can you please show how to load Cython in VSCode?

    • To run Cython code in VS Code you need the Jupyter VS Code extension. Once you have that, with a conda or virtual environment containing Cython selected as the interpreter, Cython should work. Does that answer your question?

Break until XX:30

7. Dask for Scalable Analytics

I am getting a "Nanny failed to start" error when I try to run a Dask cluster/client.
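  • One common cause (an assumption here, since the full traceback is not shown) is worker-process spawning on the platform; a threads-only `LocalCluster` avoids starting nanny processes at all:

    ```python
    from dask.distributed import Client, LocalCluster

    # processes=False runs workers as threads inside this process,
    # so no nanny (worker supervisor) process needs to be spawned.
    cluster = LocalCluster(processes=False, n_workers=1)
    client = Client(cluster)
    print(client.status)  # "running"
    client.close()
    cluster.close()
    ```

    In scripts that do use worker processes, creating the `Client` inside an `if __name__ == "__main__":` guard is another common fix.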

  • A tree-like figure appears in the output; both `dask.visualize(sum_da)` and `sum_da.visualize()` produce the same task-graph figure. *(image not preserved)*

  • I can see the 4 chunks in a graph layer. *(image not preserved)*
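The chunking being discussed can be sketched as follows (the array shape and chunk sizes here are made up; assumes Dask is installed):

```python
import dask.array as da

# An 8x8 array split into four 4x4 chunks:
x = da.ones((8, 8), chunks=(4, 4))
print(x.numblocks)       # (2, 2) -> 4 chunks, visible as one graph layer

sum_da = x.sum()
# sum_da.visualize() would render the task graph (requires graphviz);
# compute() actually executes it:
print(sum_da.compute())  # 64.0
```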

Exercise set 1 (until xx:30)

https://enccs.github.io/hpda-python/dask#exercise-set-1

  • How to open Jupyter on LUMI with the correct environment? If I open it through the LUMI login webpage, it doesn't connect to pyhpda conda environment, and therefore doesn't have Dask installed.
    • Follow the instructions in the "Login to LUMI cluster via web-interface" section to set up the environment on LUMI.
    • Below are the details you should fill in when you launch a Jupyter notebook on LUMI:
    ```
    Project: project_465001310
    Partition: interactive
    Number of CPU cores: 2
    Time: 4:00:00
    Working directory: /projappl/project_465001310
    Python: Custom
    Path to python: /project/project_465001310/miniconda3/envs/pyhpda/bin/python
    ```
      Check "Enable system installed packages on venv creation" and "Enable packages under ~/.local/lib on venv start". Click the Launch button and wait a few minutes until your requested session is created. Then click the Connect to Jupyter button and select the kernel "Python 3 (venv)" for the created Jupyter notebook.
    

Exercise set 2 (until xx:55)

https://enccs.github.io/hpda-python/dask/#exercise-set-2


Always ask questions at the very bottom of this document, right above this.