
<p style="text-align: center"><b><font size=5 color=blue>High-Performance Data Analytics with Python - Day 1</font></b></p>
:::success
**High-Performance Data Analytics with Python — Schedule**: https://hackmd.io/@yonglei/python-hpda-2025-schedule
:::
## Schedule
| Time | Contents | Instructor(s) |
| :---------: | :------: | :-----------: |
| 09:00-09:15 | Welcome | YW |
| 09:15-09:30 | Motivation | YW |
| 09:30-10:20 | Scientific data | FF |
| 10:20-10:40 | Break | |
| 10:40-11:55 | Efficient array computing | FF |
| 11:55-12:00 | Q/A & Reflections | |
---
## Useful Links
:::warning
- [Lesson material](https://enccs.github.io/hpda-python/)
- [Setup programming environment on personal computer](https://enccs.github.io/hpda-python/setup/#local-installation)
- [Access to programming environment on LUMI](https://enccs.github.io/hpda-python/setup/#using-pyhpda-programming-environment)
:::
---
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
### 1. Motivation
:::info
**How large is the data you are working with?**
- KB:
- MB: o
- GB: o o o o o o
- TB: o o o o o o o
- Larger dataset (PB, EB, ZB, YB):
**Are you experiencing performance bottlenecks when you handle large dataset (>GB)?**
- Yes: o o o o o o o o o o
- No: o
- I don't know:
:::
### 2. Scientific Data
- What about image and video data? Does Python have relevant packages to handle them?
- Image: `pillow`, `opencv-python` and `scikit-image` are well-known libraries. Once loaded, an image is represented as a 3-D NumPy array, typically of shape (height, width, 3), where the 3 colour channels are red, green and blue (note that OpenCV uses BGR channel order). See the sketch below.
- Video: not sure; `opencv-python` (via `cv2.VideoCapture`) and `imageio` are commonly used for reading video frames.
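- A minimal sketch of loading an image into a NumPy array (assumes Pillow and NumPy are installed; the filename is hypothetical):
```python
import numpy as np
from PIL import Image

# Images load in channels-last layout: (height, width, channels)
img = np.asarray(Image.open("photo.jpg"))  # hypothetical file
print(img.shape)  # e.g. (1080, 1920, 3) for an RGB image
```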
- If I save data in common data formats (*e.g.*, pickle and JSON), what file sizes should I expect? Which data format leads to the smallest files?
- The `space efficiency` column in the table indicates the file size you can expect when you use the appropriate data format.
- At least as of Python 3.4, `pickle` beats `JSON` at strings, ints and floats regarding I/O speed. File size IIRC was quite similar, but since pickle is binary, you probably get smaller files if you store many floats. See the size-comparison sketch below.
- Also note that JSON is readable by virtually any piece of software, whereas pickle is specific to Python.
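- A rough size-comparison sketch (exact numbers depend on the data; the filenames are hypothetical):
```python
import json
import os
import pickle

data = {"values": [i / 3 for i in range(100_000)]}  # many floats

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)  # binary: 8-byte doubles plus small opcode overhead
with open("data.json", "w") as f:
    json.dump(data, f)    # text: roughly 18 characters per float

print("pickle:", os.path.getsize("data.pkl"), "bytes")
print("json:  ", os.path.getsize("data.json"), "bytes")
```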
- I need to store a large number of dictionaries and the solution I found was to use JSON. What should I use instead?
- UPDATE: the dictionary contains different types of data, which are themselves organized in dictionaries of scalars or vectors.
- If the values in the dictionaries are 1-dimensional (think of something which can be represented in an Excel spreadsheet), then Parquet or Feather is a fine format. Take a look at the `pandas.DataFrame.from_dict` function to load it into memory and the `pandas.DataFrame.to_parquet` / `to_feather` methods to convert those dictionaries (see the sketch after this list). We will see more of Pandas soon in this course.
- If the values are multi-dimensional, then HDF5/NetCDF or Zarr can store them efficiently. Take a look at `xarray.Dataset.from_dict`.
- For a mix of scalars and vectors, I would prefer NetCDF. Scalars can be saved as metadata or attributes, and vectors as the main `data_vars`. Example:
```python
import numpy as np
import xarray as xr

# Give the 1-D variable an explicit dimension name; the scalar goes into attrs
dataset = xr.Dataset(data_vars={'my_vector': ('x', np.arange(10))},
                     attrs={'my_scalar': 'ABCD'})
dataset.to_netcdf("ds1.nc")
```
- You can also consider whether you actually need to store the large data as dictionaries, or whether you can store the large values of the dictionary (if they are tables or arrays) in one of the formats we discussed and recreate the dictionary structure upon reading.
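- A minimal sketch of the Parquet route mentioned above (assumes `pyarrow` is installed; the column names are hypothetical):
```python
import pandas as pd

# A dictionary of 1-D values maps naturally onto a DataFrame
d = {"temperature": [21.5, 22.1, 19.8], "pressure": [1.01, 1.02, 0.99]}
df = pd.DataFrame.from_dict(d)

df.to_parquet("measurements.parquet")          # columnar, compressed on disk
df2 = pd.read_parquet("measurements.parquet")  # round-trip back into memory
```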
:::info
### **Break until XX:40**
:::
### 3. Efficient Array Computing
- I tried executing the first example (Xarray dataset) in two ways.
- A) Locally, by opening a new Jupyter notebook via VS Code using the pyhpda kernel (installed according to the tutorial provided before the workshop). Saving the dataset to disk with `ds1.to_netcdf("ds1.nc")` failed with `PermissionError: [Errno 13] Permission denied: '/ds1.nc'`.
- B) I created a new notebook in the JupyterLab web interface opened from the terminal. That works perfectly fine. Clearly JupyterLab is the way to go; however, I am curious to find out what went wrong locally on my computer.
- For context, I should note that I usually use Jupyter notebooks by setting them up through VS Code, either locally or on a remote machine. I don't have much experience with the JupyterLab interface. Also, I am fairly new to this area.
- Update: I am a Mac user. Thank you for the suggestions. I tried saving to `~/ds1.nc` but am still getting permission denied, which is surprising to me. Also, I have the pyhpda interpreter selected.
- Update no. 2: Problem solved. Rookie mistake 🙈
- Are you on a Mac? It seems like you're trying to save into `/`, which is a "system directory".
- :+1: to the above. Specifying the path like `ds1.to_netcdf("~/ds1.nc")`, where `~` is your home directory, could help.
- You can also use `./ds1.nc`, which saves the file to the current folder.
- Run `%pwd` in Jupyter to see what your current directory is, and `%cd /some/other/directory` to change it within Jupyter (see the sketch after this thread). The `%` symbol is optional, of course.
- If you are comfortable using Jupyter notebooks in VS Code, keep doing that. You can probably press "Command+Shift+P" and run `Python: Select Interpreter` to change the environment to `pyhpda`.
- ==Let us know if you have solved this problem; otherwise we can take time to look at your code in a breakout room later (after 11:30) when we do the hands-on exercises.==
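- A quick sketch of the directory magics mentioned above, run in a Jupyter cell (the path is hypothetical; `ds1` is the dataset from the question):
```python
%pwd                      # print the current working directory
%cd ~/some/project/dir    # hypothetical path: change the working directory
ds1.to_netcdf("ds1.nc")   # relative paths now resolve into that directory
```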
- How are lists implemented under the hood in Python? Is a certain amount of memory pre-allocated (like in Java), or is each element effectively a pair of a Python object address and a pointer to the next element? OK, now I see the answer in the picture being presented :smile_cat:
- Great that you got the right answer: CPython lists are contiguous arrays of pointers to Python objects, over-allocated as they grow; they are not linked lists.
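- A quick way to see the over-allocation (exact sizes are CPython implementation details and may vary between versions):
```python
import sys

lst = []
for i in range(10):
    lst.append(i)
    # getsizeof jumps in steps: capacity is pre-allocated in chunks
    print(len(lst), sys.getsizeof(lst))
```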
- Can type hinting help performance in any way? In Python, that is.
- Type hints are *not* read by the CPython interpreter, but some "compilers" can convert code with type hints into machine code, for example `mypyc`, `cython` and `pythran`. ~~But this part is out of scope for the current course.~~ See the illustration below.
- The lesson will cover `cython`, where you have type hinting :+1:
- `cython` will be covered on the 3rd day in the [Performance Boosting](https://enccs.github.io/hpda-python/performance-boosting/) episode.
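- A quick illustration that CPython ignores annotations at runtime, so hints alone cannot speed anything up:
```python
# The annotations say int, but CPython neither checks nor uses them
def add(a: int, b: int) -> int:
    return a + b

print(add("hello ", "world"))  # runs fine and prints "hello world"
```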
- If we don't know the exact size of an array (because some items will be skipped, for instance), is it still better to allocate the maximum size and then reduce it (or leave part of it empty)?
- Allocating large arrays has a cost, but nevertheless I am inclined to say yes (see the sketch after this list). However, there are alternatives:
- Adding a new row and column if and when needed.
- Using a sparse array format if there are a lot of missing values; see `scipy.sparse`.
- Most probably, the best solution is case-dependent.
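- A minimal sketch of the allocate-then-trim pattern (the filter condition is hypothetical):
```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random(1_000)

out = np.empty(raw.size)  # allocate the maximum possible size up front
n = 0
for x in raw:
    if x > 0.5:           # hypothetical "keep" condition
        out[n] = x
        n += 1
out = out[:n]             # one final trim instead of many slow appends
```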
- Is there any difference between using the `@` operator and `dot()` in NumPy?
- Good question. If you run `help(np.dot)` or `np.dot?` in Jupyter, you will see that `dot` also works, but `@` is preferred for matrix multiplication:
```
Docstring: dot(a, b, out=None)

Dot product of two arrays. Specifically,

- If both `a` and `b` are 1-D arrays, it is inner product of vectors
  (without complex conjugation).
- If both `a` and `b` are 2-D arrays, it is matrix multiplication,
  but using `matmul` or `a @ b` is preferred.
```
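- A small sketch confirming the two agree for 2-D arrays (`@` dispatches to `np.matmul`, which differs from `np.dot` for arrays with more than two dimensions):
```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)

# For 2-D arrays the two spellings give the same result
print(np.allclose(a @ b, np.dot(a, b)))  # True
```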
- How can we handle invalid data in a column that is expected to contain numbers but includes one or more character entries?
- Usually, one uses `np.nan` to mark invalid data, but marking such data with characters is also supported. However, it results in a slower NumPy array / Pandas series with data type `object`; it is slow because the data is not stored compactly in memory.
- One thing you can do is check `pandas.api.types.infer_dtype()` for each column, and if it is something very generic (like `object`), cast it with `.astype()`. Just be careful to check what your data becomes when you do it. If, e.g., you have a column of ints and there is one random string `'10'`, the cast will be successful; otherwise you have to do some manipulation, or do as above and mark the entries as `nan` (see the sketch below).
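- A minimal sketch using `pd.to_numeric`, a standard pandas function (the column values are hypothetical):
```python
import pandas as pd

s = pd.Series(["1.5", "2.0", "oops", "3.1"])  # one invalid entry
clean = pd.to_numeric(s, errors="coerce")     # "oops" becomes NaN
print(clean.dtype)                            # float64
```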
- Can the type of a Pandas column be an array?
- A Pandas column, or `Series` as it is called, is actually a thin wrapper over a 1-D NumPy array. Example:
```python
In [25]: df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4]})
In [26]: type(df.a)
Out[26]: pandas.core.series.Series
In [27]: type(df.a.values)
Out[27]: numpy.ndarray
```
- If you need to store an array in each cell, it may be easier to tidy it up (in the sense of tidy vs. wide data that we showed today); see the `explode` sketch after this list.
- If the question was about storing an N-dimensional array in a column, then using `xarray.Dataset` is the preferred approach.
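- A minimal sketch of tidying cells that contain lists, using `DataFrame.explode` (the column names are hypothetical):
```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "samples": [[0.1, 0.2], [0.3, 0.4, 0.5]]})
tidy = df.explode("samples", ignore_index=True)  # one row per list element
print(tidy)
```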
:::warning
**Reflections and quick feedback:**
One thing that you liked or found useful for your projects?
- I liked that you covered different file formats. I'm going to start using NetCDF with Xarray in my Python work. Currently using HDF5.
- I liked the introduction on data types, and being "forced" to think about the consequences of, for example, saving floating-point numbers as text.
- Arun: Useful: the detailed explanation of NumPy and Pandas. Possibly more useful: more practical examples comparing the performance of Python with Fortran/C (in terms of speed).
- Good point. We will consider updating the lesson materials to include a comparison of performance with Fortran and C/C++.
- I enjoyed your summary on file types. Also, I found the array operations and manipulations with the visuals very useful. Moreover, the data analysis workflow with the examples was great.
One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- It would have been nice to have more time for hands-on work.
- We will try to manage the time for each episode better next time.
- Thank you! The second episode was quite packed, so maybe we can think about how to compress it or remove some things to leave more time for hands-on work :)
- We will consider separating the 2nd episode into two parts.
- As both NumPy and Pandas are important tools for data analytics, we would like to cover general descriptions of these two packages.
- I agree with the previous point. I think the practical examples in the second section were chosen nicely to represent the issue at hand. However, I wasn't able to complete them within the time for the lecture.
:::
---
:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---