# Python for SciComp 2025 (Archive) 25-27/November 9:00 CET (10:00 EET) :::danger ## Archival of PFSC 2025 notes This is the archival of the notes document. The live document is here: https://notes.coderefinery.org/pythonscicomp2025 Archive for Day 3 is available at https://hackmd.io/@coderefinery/python2025archive2 ::: # Day 1 - 25/11/2024 ## Icebreaker questions *Let's test this **collaborative notes document** with some icebreakers :icecream:* ### Where (country/city) are you following from? - Finland, Helsinki +7 - Finland, Espoo +5 - Finland, Oulu +5 - Finland, Oulu - Finland, Joensuu +1+1 - Finland, Oulu - Finland, Oulu - Finland, Oulu - Iceland, Reykjavík - Denmark, Copenhagen - Finland, Helsinki - Italy - Finland, Oulu +1 - Finland, Oulu - Finland, Espoo - Finland, Espoo - Taiwan, Taoyuan - Finland, Espoo +1 - Finland, Kuopio +1 - Finland, Oulu - Finland, Oulu - Finland, Oulu - Sweden, Uppsala - Finland, Helsinki - Finland, Oulu +3 - Finland, HKI - Finland, Espoo - Finland, - Finland, Oulu - Finland , Helsinki - Finland, Oulu - OULU - Espoo, Finland Turkey,Konya - Finland, Espoo -Finland, Espoo - Germany - Iceland, - Lebanon, - Tunisia, Tunis - Finland, Espoo - Finland,Oulu - Finland, Jyväskylä - Finland, Helsinki - Sweden, Stockholm - Finland, University of Oulu - Finland, Oulu - Gliwice, Poland - Bologna, Italy - Brussels, Belgium - Finland/Tampere - Denmark/Lyngby - Sweden, Stockholm +1 - Iceland, Reykjavik - Spain, Valencia ### How much have you used Python before? - Beginner, but never really done anything "serious" with it +6 - Rather beginner, although did a few courses in advanced python. - Absolute beginner +2 - Basic interactive courses - Beginner, training courses +2 - Beginner +2 - Some, but not for a while. - Beginner, did some plots with it +5 - Basic done some basic programming - Basic+1 - Basic - Basics, many times tried, but not feeling comfortable to use it +1 - beginner - beginner - beginner - beginner - Intermediate - I use python quite often. It's my first language. - Basic - basic - Basic - basic - Never - I dream in python (they are nightmares) :laughing: - Beginner - Basics - Basics - Basics - Some basic trainings only -Basics - I have done a beginner course but we didn't use python itself, just learned coding - I use python quite often. It is my goto language - I usually use Python to process data +1 - basics - Used for many years - Can use it in a framework now - I use python quite often. It's my first language. - Beginner - I use C++, but sometimes Python. It's good for scripting just like Bash. - Studying Bioinformatics, so cute a lot ### How much have you used Jupyter before? - Beginner, I never remember the shortcuts +2 - I have had couple of courses using Jupyter - Couple of times +4 - Basic interaction +1 - Tried out, prefer Spider - Sometimes +1 - Never +9 - A fewBasic times in a Machine Learning course with Python at Aalto University - A little - sometime +1 - Sometimes - Few times - Few times - never +1 - once or twice - Always - never - never - Never - Nope - Never - Some coursework - Few times - Few times - never+1 - Seen it used by some - yes - Very little - I have used Jupyter notebook extensively. - Used it when I had to (too often). +1 - Few times - Never - Few times - Few times ### Where do you store your files? -laptop - laptop - University laptop, which I guess is centrally backed up...? +8 - Triton HPC cluster (I know it is not backed up...) 
- OneDrive +4 - OneDrive - OneDrive - Laptop, cloud - Laptop, cluster, time machines, and cloud - Laptop - Laptop, Github - Laptop - Hard drive - Work computer in my university account OneDrive - Laptop +2 - Onedrive - Onedrive - Onedrive - onedrive - onedrive - Depends, onedrive, mahti,triton, laptop - CSC Allas, backup drives, laptop +1 - laptop - Laptop, CSC Allas, Google Cloud, time machines - CSC Cloud + Backup on laptop - LAptop - laptop - Laptop+git - Onedrive - Onedrive - University server - OneDrive - Univ comp envs - laptop - git, oneDrive, GoogleDrive, own server - Github, OneDrive, GoogleDrive, Own Laptop - Onedrive, laptop - Laptop, Onedrive - Laptop, Git - Laptop, OneDrive ## More questions here and some questions submitted via registration *We have tried to answer via email to some questions sent during registration, if we missed your question, email us at scip@aalto.fi* - Will recording of the lecture be shared afterwards or not? - I would love to get info if there is AI and related course or training in the future. - At Aalto University we have some intro on responsible use of AI in research work - https://www.aalto.fi/en/services/ai-and-research-work-useful-learning-materials (the November 2025 recording is going online next month) - With CodeRefinery, we are planning a new episode "responsible coding practices with AI assistants" - "I will appreciate if I can be part of Zoom activities as well." - We will not provide a parallel zoom for exercises as it has not been popular in past editions. - How to utilize AI the most efficient way in Python coding? And are there some certain guidelines for prompting? - We will not cover this in this course, but in general it depends on the A I model you use. A good guide is this one https://simonwillison.net/2025/Mar/11/using-llms-for-code/ In general there are three techniques - You know what you what: be as detailed as possible since you know what you want "generate python code for a function that takes these inputs and returns these outputs. etc..." - You have examples (e.g. from tutorials) and want those adapted to your case (paste the examples as a few-shot-learning prompting technique) - You are exploring or brainstorming... just go with the flow and iterate many times - I would like to learn more about reducing code complexity in Python. - I highlighted this just because I also wish for this. Sometimes some tools/libraries make things more complex, but I would really like to hear what the instructors have to comment on this. - In my experience, the best way to reduce code complexity is to learn more about the packages you are using. Things like numpy/scipy/pandas have so many functions... When I am cleaning messy code, the main thing I end up doing is replacing screenfulls of messy code with one or two lines calling the proper functions. - I am using R studio a lot, only just started with python. Practical setting ups would be good to know, if possible to compare with R for better understanding." - This is an interesting comparison. We won't have time to compare them deeply, but personally I have to say that R is my go-to-choice when I want to reuse existing R packages without inventing new functions. Python is more for developing new things, e.g. some advanced pre-processing or more advanced libraries for machine learning or signal processing. For typical data-science use (load data, preprocess, run some statistics, produce tables or graphs) they are basically equivalent. - I wish to learn a more structured way to use Python. 
At the moment I search the Internet for any problems that I have, and most likely use the first solution that I can get to work. - I so know the feeling 🙂 I think that, beyond the actual programming language and implementation, it is important to write comments and pseudocode before any actual coding. We used to run a course called Software Design for Sci Comp https://scicomp.aalto.fi/training/scip/software-design-2022/ maybe we should run it again.. - I want deep learning examples, pytorch, etc - Unfortunately we won't cover those deeply. However we have been giving those workshops and CSC with the Practical Deep Learning course is a good way to start. Next run is in April 2026. - You may also check the ENCCS workshops on deep learning: - https://enccs.se/events/practical-deep-learning/ - https://enccs.se/tag/deep-learning/ - Many of the events are re-occuring. Also, the material, including recordings, is available for self-study as well. - I want to get introduced to the basic Python libraries for post-processing CSMP++ VTK files. In particular, I would like to learn NumPy, SciPy, pandas, and Matplotlib - Then you are in the right place! - Is anyone else hearing a beeping sound? - I can hear it too - I am not hearing it right now, it might have been our "streaming computer" - need to improve knowledge in deep and machine learning - Oulu. yes I can hear the beep sound - Nothing in the Zoom room. - We will not use zoom in the end, only the TwitchTV stream and this document - I am getting this error when trying to open the notepad in jupiter Error :Permission denied: Untitled.ipynb - Try to save it in a folder you have permission to write. - How can I do that ? now it works thank you - what is the magic command for undo in Jupyter-Lab? - There is no magic command in the sense of using the % symbol, but you can ctrl+z (undo) for undoing the last edit you did. You cannot undo the output of a cell that has already been run, but you can always run it again... unless you have a time machine :D ## Jupyter https://aaltoscicomp.github.io/python-for-scicomp/jupyter/# *Questions continued* ### Share your experience with us! What is your favorite IDE, be it with or without notebooks integration. Why? - VSCode, Sublime - VScode +4 - VScode (Msoft still has not ruined that product surprisingly) - Spider +1 - VScode - Vim - VScode, because i'm most familiar with it - Vscode, Intellij - Spyder - Linux :) +1 - VSCode - If you've been using Jupyter before, what for? - Data analysis like plotting. Was kind of useful for benchmarking ML models too before writing distinct .py scripts and running them for longer periods on a cluster +1 - Assignments, course projects - Assigments, Data analysis, Demos, Documentation - Teaching, explaining an analysis in detail - Assignments, course projects - Assignments, course projects - Assigment If you have been using Jupyter notebook / Lab for a while, which best practices would you share to beginners / others? - Fun fact: did you know you can create custom line magic? Have you ever tried it? Jupyter pros. What do you like about it? - .. - Good way of sharing code clearly, esp. for demos and showing datasets - "Literate programming" mixing text, equations, images and blocks of programming code - useful for interactive work Jupyter cons. What do you dislike about it? 
- It seems that there is no real possibility to debug - ability to run cells out of order +1 - Notebook format - Explanatory text mixed with Code means my brain either skips the explanantions or the Code boxes - Variables staying in the env even when the cells are removed - Poor version control ### Questions - I have only experience with R, so Python is completely new for me. I was also struggling with the installation and do not understand what Jupyter really is. Is it where I write my commands for Python? Or is it "just" a help tool? - If you are at a university listed in our pages, we can help you with the installation. Get in touch - I am not affiliated with a university since I finished my master studies in May this year. I am taking this course as a self-learning opportunity to understand Python more - Ok, if you are truly lost, get in touch anyway. :) Jupyter in a nutshell: it is an interface where you can write python code, and then when you "execute it", the code is passed to the "python interpreter" that processes the code and generates some output. Jupyter runs on your computer and uses the web-browser to show this interface. "Behind the scenes" python waits that jupyter gives it some tasks. - Thank you so much! I will get in touch if I am getting lost along the way :) I really like the "real-time" answers which help a lot! - An analoguous thing for R is [RMarkdown](https://rmarkdown.rstudio.com/). So if you have used RStudio and written RMarkdown notebooks Jupyter is a similar system for Python. It is a simple IDE and you can run your code that contains both the documentation (Markdown) and the code (Python). What Jupyter then does is that it takes the code and runs it in a Python interpreter (also called kernel in the case of Jupyter). Similarly like Rstudio runs the R code in the R interpreter. - Yes, I have used RStudio and The RMarkdown previously, and I am well-versed in that. What I find confusing is that with Python (my first time using it, really) gn seems a bit "all over the place" - with different windows and things...I liked the "neatness" of R that everything is in one place. But I think I am going to get used to it - hopefully :) I enjoy that you show what you are doing and I can watch which is the easisiest way for me to learn - There are other IDEs for Python like [Spyder](https://www.spyder-ide.org/) that are more reminiscent of Rstudio. But for this course we use Jupyter as that is a commonly used in many cases and good to learn even if you're not going to be using it for all of your work. - I understand, and that is totally fine by me. I also started a general "data science" course on LinkedIn Learning and they also used Jupyter there. Is Jupyter something I have to download as well or is it something I use "only" in my web browser? - I believe you mean Jupyter? I am not sure if Junyper is a python extension as well. But basically whatever libraries you need, you can install them within python with tools like conda or pip, and jupyter is one of them. So it is downloaded in the moment of installation of libraries. Compared with R/Rstudio, with many libraries (and interface) always already downloaded. - Yes, I realised that I downloaded not only Python, but libraries as well. But from R I know that I have to "load" these libraries into R - so I did not know how to accomplish that with Python? Or if this is necessary to do. - Indeed. 
You can think it like this: When you "conda activate" or "source activate" a python environment, you "load" the access to those libraries including jupyter executable and other executables. Then inside a python script you make only certain libraries visible to the commands on that script by using "import", very similar like you do with R. - And for "loading" the libraries, I need to use the Python programme, right? I tried several commands to unpack and/or load the libraries "conda" into Python (like I have learned to do in R and RMarkdown). But I always got a "syntax error" in Python - "conda" can also be used with R, if you have multiple versions of R studio /R language on your computer, so you first "activate" the one for the project and then inside the R script you import the libraries you need, but they will be limited to the versions of which R studio/R lang you activated. - - So, R studio/R is limited due to the version of R studio/R I have installed. Is Python limited by this as well or is it more 'free' and - It can be confusing at the beginning, but from reproducibility point of view, it's amazing :) - I had no idea that I can use libraries interchangably between programming languages. But my secondary judge during my master thesis (bioinformatician) told me that there are a lot of similarities and if you "know" one language, you are able to see the parallels and use them interchangably - which is amazing :) :+1: - Has marimo notebook ( a jupyter alternative been explored? ) - Yes! They are really nice. If you are at Aalto join our zulip chat, we have a nice discussion thread there. ### Exercise 1 https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-1 ### Exercise 2 https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-2 Feel free to share the solutions to this exercise! ### Exercise 3 https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-3 :::success # Exercises until xx:42 See list of exercises here above. Do what you can. :) I have completed (add "o" to the tally): - Exercise 1: oooooooooooooooooooooooo - Exercise 2: ooooooooooooooooooooooo0 - Exercise 3: ooooooo0 - Not trying: o ::: ::: success # Break until xx:00 Stretch your legs, have some water... ask more questions! :) ::: :::info For everyone in Finland: #### FYI: CSC services for research available for you Did you know that Jupyter can also be run and collaboratively used on CSC’s Noppe platform, in addition to being available on our national supercomputers? More info on [this flyer on CSC services for research](https://kannu.csc.fi/s/omTrE3A7rfSq926) ::: ## NumPy https://aaltoscicomp.github.io/python-for-scicomp/numpy/# ### Share your experience with us! - Have you used numpy before? - yes +4 - no +2 - yes, did an assignent using numpy - yes, but just basics - yes, but it was very basic to create plots - What do you most like about numpy? - A nice combo of the pros of python when it comes to readability and the speed of fast libraries running in the background - Relatively easy and intuitional to use in comparison to lists ### Questions - when you use lots of numbers, isn't something like jax usually a better solution (than numpy)? - [Jax implements the Numpy API](https://docs.jax.dev/en/latest/jax.numpy.html), so if you learn NumPy, you'll learn Jax as well. Same with [Tensorflow](https://www.tensorflow.org/guide/tf_numpy), [PyTorch](https://docs.pytorch.org/tutorials/beginner/examples_tensor/polynomial_tensor.html#pytorch-tensors) etc.. 
They all have their own implementation of (most of) NumPy's API. But efficiency depends on multiple factors like what you're calculating. For most cases NumPy is fast enough, but in deep learning etc. you might want to utilize GPUs and then you'll want to use other frameworks. But for all of these it is important to learn NumPy as that is the reference that it introduces all of the core concepts. - Any tricks to deal with mixing imports that use Numpy 1.x and Numpy 2.x? - I would recommend migrating to v2. Most of it can be automated with [Ruff](https://docs.astral.sh/ruff/) if you follow the [official migration guide](https://numpy.org/doc/stable/numpy_2_0_migration_guide.html) - If I have to pip install something no longer updated but required by something else? - Then you're in a pickle. If the code is abandoned it is basically locked down in time. If there are no alternatives to use you can usually fork the code and make the migration yourself, but if that is much work, you can force yourself to use the older version by installing that in your environment and having [version dependent imports](https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#writing-numpy-version-dependent-code) in your code to make it compatible with v2 as well. - Where did the list with the numbers come from? Is there a repository in which I have to load lists (or 'external' input like excel sheets, for example) into? (sorry, I am really relying on my knowledge of RStudio here, because it is the only coding experience I ever had) - if you were talking about the numbers in the random array, those are simply generated by the random number generator. Or, in case you are asking about the Arrays in the maths/vectorization part, they were just entered in the terminal manually. - If this is a question about loading external data, we will talk about this at a later stage (e.g. in pandas and xarray), which are tools that provide a lot of data loaders for many different data formats. - Isn´t it limiting in dtype to only be able to use one type of numbers? Do you have to convert, for example, decimal numbers into 'whole' numbers to be able to us dtype? - Usually you'll want to cast the array to the data type you'll want to use and that suits the data. If your data is decimal numbers, you'll want to use `np.float64` and if your data is whole numbers, you'll want to use `np.int64`. You can convert floating point values to whole numbers, but you'll of course lose the decimal expansion, so the conversion is not reversible. So for example, temperature values might be floating point numbers and experiment participant ids might be integer numbers. - It is good to know that I am loosing the decimal expansion when converting numbers using the commands you wrote down - thanks for the commands :). Since I am from a biomedical background, I prefer to work with the data I got from an experiment and not convert, for example, floating point numbers to integer numbers. - Indeed. Double precision floating points have [a lot of precision (around 15-16 decimal places)](https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation), so for any numeric data, you'll want to use them. In deep learning nowadays lots of stuff uses lower precision numbers for speedup, but in physical sciences double precision is still the king. - I had no idea that in deep learning lower precision is used(why?). But I understand that floating point data needs to be converted into integer data then. 
- GPU cards have a smaller amount of (v)RAM memory, so you need fewer bits per number if you want to store lots of numbers.
- The code also runs faster because most of the big calculations are done as a huge number of small matrix operations like "calculate A times B plus C", where A, B and C are small matrices. The "tensor cores" are optimized for these sorts of calculations and they can do twice as many calculations with half precision than with single precision, which is again twice as fast as double precision. Also, in the context of deep learning the precision (the actual value of the floating point number) usually does not matter as much as the range of the values (whether it is big or not). And smaller precision can also make the model less likely to overfit.
- That makes sense and I understand what you explained to me. For my scientific brain it is a bit hard to wrap my head around that precision is not so important or is necessary to reduce.
- In deep learning the precision (how complex a concept the model can learn) seems to come more from the amount of parameters available and the amount of connections, less from the precision of individual weights.
- Okay. Understandable. What is meant with "parameters" here? The individual numbers?
- The weights in the deep learning layers. The amount of numbers is nowadays in the hundreds of billions for LLMs, and in the billions for other model structures.
- So, do I understand correctly that the individual numbers are cast in "arrays" and/or matrices? And the amount of matrices or differently "filled" matrices 'creates' the multitude of parameters then?
- Usually a framework such as PyTorch creates a deep learning network layer from tensors, which are multidimensional arrays of numbers. Those tensors are big blocks (hypercubes?) of numbers. Then the whole network is just lots of these layers organized in a big calculation that starts with one layer (input layer) and ends up in another (output layer). Calculations are usually matrix multiplications, activation functions and various normalizations. There are lots of different layer types, but underneath it all everything is basically similar to numpy arrays, but run on GPUs. Values of the tensors that are modified during training are called the model parameters. How those tensors are organized and what layers etc. are included define the model structure.
- This sounds very complex and fascinating. I am new to deep learning as well. So, this information gives me new input I can build upon and research to educate myself further. So, these numpy arrays need to be set up manually? That sounds rather tedious and time consuming, depending on the amount of data I am working with, or am I wrong?
- Frameworks do it pretty much automatically once you tell them what sort of model you want to create and how you want to optimize it. I recommend checking [some PyTorch intro](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) for learning more.
- If frameworks do the work, am I not giving up a lot of control and "allowing" the framework to go and "run with the ball"? Can I even understand what these frameworks do? It sounds like a huge black box for me... I will check out PyTorch and further information regarding it - thanks!
- Lots of training is nowadays done in ["brain float" half precision (bfloat16)](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) or even int8 because then the GPUs can do more calculations with the same hardware. In inference, int4 or int8 are often used to quantize the model weights so that the same model can run on less memory.
- For more info on data types, check the [data type reference](https://numpy.org/doc/stable/user/basics.types.html#numerical-data-types).
- Thank you - that is very helpful!
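
A minimal sketch of the NumPy dtype points discussed above (the array values are made up for illustration):

```python
import numpy as np

temperatures = np.array([36.6, 37.2, 38.9])   # floats default to float64
print(temperatures.dtype)                     # float64

# Casting to a lower-precision float keeps decimals, but with fewer significant digits.
low_precision = temperatures.astype(np.float16)

# Casting to an integer type drops the decimal part entirely -- this is not reversible.
as_integers = temperatures.astype(np.int64)
print(as_integers)                            # [36 37 38]
```
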
- Is it possible to see the current variables anywhere, similar to the Matlab workspace?
- There is the `vars()` or `dir()` function which lists all variables currently available in the workspace, but it also contains quite a lot of things that you might not be interested in. Or you can have a look at a large range of extensions for jupyterlab (e.g. https://stackoverflow.com/questions/37625959/jupyter-notebook-equivalent-to-matlabs-workspace)
- Is there a reason why my jupyter crashes when I try to run a cell?
- If you run out of memory you can crash the cell. Check how many inputs you're creating.
- Also, is there any comment in the terminal you ran jupyterlab from? This might give you a hint as to what is happening.
- How would one go about it if one wanted to get the diagonal values of an NxN matrix, using a for loop for example?
- `np.diag(matrix)` would be the normal way to get the diagonal elements.

:::success
# Lunch break until xx:00
You can write more questions below if you have any, but it is important to have a break. :)
:::

- ...
- ...
- ...

## Pandas

https://aaltoscicomp.github.io/python-for-scicomp/pandas/

- Does pandas use indexes as well or are only the column names used?
- You can indeed use both: something like `df.iloc[3:5, 6:7]` will select rows and columns by index.
- If we group by survived and also fare, it seems that the people who survived had paid on average for a heftier ticket... a sociological comment perhaps? ; )
- The upper classes were situated on the upper decks and these people would have had an easier time getting to the lifeboats fast.
- Where does the data come from? Was it imported into Python via a certain function?
- It was these lines:

```
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url, index_col='Name')
```

  The function `pd.read_csv` can read directly from a file on the internet.
- Okay. If I want to use data I have as a result of an experiment in an excel sheet (because the machine gave the results like this), can I manually put the data via a "read" function into Python as well? (This might be similar to RStudio, but I am not sure)
- Yes! But you will have to use `pd.read_excel`. See here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
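
As a small illustration of the `pd.read_excel` answer above, a hedged sketch with made-up file, sheet and column names (reading .xlsx files also needs the `openpyxl` package installed):

```python
import pandas as pd

# Read one sheet of an Excel workbook into a DataFrame.
results = pd.read_excel("experiment_results.xlsx", sheet_name="plate1")

print(results.head())    # first rows, to check that the import looks right
print(results.dtypes)    # data type pandas inferred for each column
```
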
- How is pandas different from various databases like maybe some SQL ones? Just that it's a python package, or is it lighter, or are there other meaningful differences? Or do they serve different purposes altogether and I'm just misunderstanding this?
- The Pandas dataframe is similar to databases, but it's not really a database. It's a single table and is stored in memory, instead of a file.
- You can look at it as a really lightweight database system. I like to use it for the same things as one would use Excel.
- The same kind of things apply to databases as to pandas. SQL stores data in columnar format as well (each table has multiple columns with pre-defined data types). In fact Pandas has good interfaces when you're [working with SQL](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html). If you're familiar with SQL, [this document](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html) describes what operations in Pandas are analogous to ones in SQL.
- Why, to get the first 3 rows, do we need the syntax `first_row = titanic.iloc[0:3, :]`? Wouldn't you expect the indexing to be `[0:2, :]`?
- When using integer indexing, following Python conventions, `0:3` means row 0 up to *but not including* row 3, so rows 0, 1, 2.
- What would be the advantages of marimo over pandas? Haven't tried marimo yet and I'm wondering whether it's an easy conversion, having some experience with pandas.
- marimo seems to be an alternative to the Jupyter Notebook, not to the pandas python package.
- Is a Pandas dataframe like excel, or at least similar in a sense? But does it store the data (or "raw data") in the same way as excel?
- I like to think of Pandas as "excel for python". It stores the raw data as a collection of NumPy arrays: one array for each column.
- Good to know! What I don't quite get is why we need the NumPy arrays now... it sounds so very layered. Is it only my impression that it is so complex and layered or is it a fact when working with Python? It sounds/is similar to R (or it reminds me a lot of R), but how data is organised and stored sounds so much more complicated and a bit "messy".
- It is important to remember that in Pandas columns and rows are different from each other: columns contain data of a single variable (with a shared data type) and each row is one observation. In Excel columns and rows are interchangeable (though often people use the same way of organizing data, because rows are numbers instead of letters).
- Now I see the difference. But what is the advantage of organising the data like this? It sounds like I have to "know" beforehand what will be what: a single variable and observations. And do I need to "pre-sort" or "pre-prepare" my data into NumPy arrays previously to be able to "work" with my data and to visualise it into, for example, a pie chart and/or a histogram?
- Often you'll load your data from e.g. a CSV file where each column has data in a certain format and the CSV reader will automatically cast the column to a certain data type. Sometimes you'll want to change the data types manually as well (e.g. turn a timestamp written as a string into a datetime). We'll talk about this a bit tomorrow. The advantage is that when all columns are NumPy arrays, calculations are really fast.
- Is the "main goal" to have fast calculations? I was wondering about this, too, when I saw the use of the "timeit" function...
- And organizing into this tidy data format usually means asking something like: what am I calculating/averaging over? The thing you're averaging over is usually the observation. E.g. if you have measured information on different mountains, each column could be things like the mountain's name, location, height etc., while each row would correspond to one mountain. Then, for example, if you want to find the tallest mountain, you take the maximum of the height column. This calculation is fast because all of the heights are stored in a numpy array of floating point numbers. (There is a small sketch of this below.)
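
A minimal sketch of that mountain example — the names and heights here are just illustrative values:

```python
import pandas as pd

# Tidy data: each row is one mountain, each column is one variable.
mountains = pd.DataFrame(
    {
        "name": ["Halti", "Kebnekaise", "Galdhøpiggen"],
        "country": ["Finland", "Sweden", "Norway"],
        "height_m": [1324.0, 2096.0, 2469.0],
    }
)

# "height_m" is a single numeric NumPy array, so this is a fast column operation.
print(mountains["height_m"].max())
# Full row of the tallest mountain:
print(mountains.loc[mountains["height_m"].idxmax()])
```
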
- This I understand, but again: the point seems to be the speed... And if I want to do this calculation, don't I have to put my data (and in some cases also convert my data) into arrays previously? Which seems like a lot of work. But maybe this is the magic of Python... not to have a lot of work, but to have data neatly organised into arrays to do multiple things with it.
- The reason why the tidy data format is recommended (and used by other tools as well, like SQL, R tidyverse, Apache Arrow and Apache Parquet) is that it makes it easier to write good and fast code that operates on data. By having data in columns, functions such as sum and mean over the whole DataFrame make sense, because the framework knows that the data is organized in a tidy fashion and it should calculate them one column at a time. So if you want to do aggregations, plots etc., the framework knows how it should work on the data because the data is organized like this. If there were no organization, one would always have to check how the data is organized in every function call. So basically, if you organize your data in a tidy format (which usually is an important and time-consuming part) you can then use all of the tools provided by pandas and other frameworks, as your data is now in the format they use.
- I totally get it now. Because we had to use and import excel files and make functions (pie charts, histograms, etc.) based on that. And often R plotted them the wrong way and I had to change columns around in excel to get to the "right" visualisation. So, it makes sense to have "tidy data" so that the frameworks (and other programmes, or whatever does the visualisation) can work based on the "rightly" organised data. So, I figured that the most time-consuming part is to organise the data based on what needs to be done with it, right? Do I need to know what I want to "achieve" with my data while organising the data? Or do I organise my data always in a "similar" way so that I can use it for whatever I have in mind later?
- Usually you'll want to organize in a way that the relevant data for answering your question is in the tidy data format. That can involve merging data from multiple sources and dropping columns that are not relevant. If you're interested in the concept, see this paper: https://www.jstatsoft.org/article/view/v059i10/ It's written by the creator of the R tidyverse and it is cited widely.
- That makes sense, because usually there is a reason why the data needs to be visualised/analysed/etc. Thanks for the paper! I have a lot of reading and self-education to do - but I love not having to search the vastness of the internet and instead getting such good suggestions :)
- Do you have experience with Polars, i.e. is it really a faster drop-in replacement for Pandas?
- Yes, Polars is better for big data. But with Polars you'll really want to use the [expression syntax](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/) for getting pieces of the data instead of loading everything into a single dataframe. You'll want to select the relevant rows and columns and operate on those, or you won't see the performance benefits. (A small sketch follows below.)
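
A minimal sketch of that expression style, with made-up file and column names (recent Polars versions use `group_by`; older ones spell it `groupby`):

```python
import polars as pl

# Lazily scan the file instead of reading it all into memory.
result = (
    pl.scan_csv("measurements.csv")
    .filter(pl.col("temperature") > 0)                  # row filter as an expression
    .select(["station", "temperature"])                 # keep only the needed columns
    .group_by("station")
    .agg(pl.col("temperature").mean().alias("mean_temperature"))
    .collect()                                          # the query runs only here
)
print(result)
```
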
- There is also [Ibis](https://ibis-project.org/tutorials/coming-from/pandas), which can do big data stuff with various "backends" like Polars. When working with it you'll need to code in a way similar to Polars.

:::success
## Exercise until XX:43
https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-1
:::

:::success
## Exercise until XX:05
https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-2
:::

:::success
## Break after the exercise until xx:15
:::

## Xarray

https://aaltoscicomp.github.io/python-for-scicomp/xarray/

- From a performance point of view, is xarray better than "pure NumPy"?
- Under the hood xarray uses numpy to represent everything. It is a convenience library for keeping track of complex multidimensional datasets. It also makes working with such data easier, which avoids performance problems.
- Is an xarray one array that contains all observations of one variable (which would be the location here, right)?
- xarray stores multiple different arrays in one dataset. So location is just one array. But it allows you to make queries like "what is the temperature in this location". In numpy you would first need to find the index based on the coordinates and then search the temperature array with that index. In xarray you can use the selection functions to make the searches easier (see the small sketch at the end of this thread).
- Okay, I get most of it. But what I don't understand is the following: Is an xarray a matrix? Or is it a tool to annotate? If we organise data like we did in pandas, we had one variable with multiple observations. Is this structure the same or similar in xarray, or different?
- xarray is a library for working with multidimensional arrays. So if you have multidimensional data (e.g. temperature at locations) you can use xarray to house a full dataset of these data arrays in a single dataset.
- This I understand now. But how is the data organised? Is it "x" (for example, the temperature) and then all the complex observations and all the information as "y1" (for example, the location), "y2" (for example, the height), etc. which belong to "x"? Or is it just an accumulation of all the observations (temperature, location, height) in one place, a.k.a. the xarray?
- If in pandas each column was a numpy array with a name that described one variable, in xarray each array in the dataset is basically a named multidimensional array. These are then collected into one dataset, which is a collection of different arrays. Then you can do queries like "based on a value in array A, give me something from array B".
- For sparse data (only a few cells with a value), is there something better than xarray/numpy arrays?
- SciPy (sister package to NumPy) provides sparse matrix representations: https://docs.scipy.org/doc/scipy/reference/sparse.html
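
A minimal sketch of label-based selection in xarray — the dataset, coordinate names and values here are made up:

```python
import numpy as np
import xarray as xr

# Temperatures on a (time, location) grid, with named coordinates.
ds = xr.Dataset(
    {"temperature": (("time", "location"), 20 + 10 * np.random.rand(3, 2))},
    coords={"time": ["2025-01", "2025-02", "2025-03"], "location": ["Oslo", "Tromsø"]},
)

# Select by label instead of hunting for integer indices:
print(ds["temperature"].sel(location="Tromsø"))
# For numeric coordinates, .sel(..., method="nearest") also finds the closest match.
```
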
### subthread of pythia-datasets issues (aka "welcome to the joys of reproducibility")

- Should pythia_datasets already be in the environment? I don't seem to have it
- from pythia_datasets import command seems not working?
- Yes, I also have problems loading the pythia_dataset
- If you installed from our env file, it is there at the end: https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/software/environment.yml
- Advanced, if you are confident with the terminal:
  - open a terminal
  - source activate (or conda activate) the name of the environment, e.g. `source activate python-for-scicomp`
  - then run the command `conda list`
  - check if pythia-datasets is in the list of packages. If it is missing:
  - `conda install -c conda-forge pythia-datasets`
  - reactivate the env and start a new jupyter
- It's not there (in the list). When I try to install, it says "PackagesNotFoundError: The following packages are missing from the target environment: - pythia-datasets"
- I added the "-c" option to the conda install, try again :)
- thanks!
- for me, it is there in the list, but it won't run the env activation
- It might be that you have a "local" python installation that hides the paths of the environment (e.g. a past install with "pip install" without being inside an environment)
- It seems like I am having problems with dependencies, but I already meet the requirements for using xarray. I really don't know how to make it work. I even downloaded the dataset myself with wget.
- See the solution above using conda (assuming you installed the environment following our instructions at https://aaltoscicomp.github.io/python-for-scicomp/installation/)
- Thanks! That worked! Sorry for the inconvenience
- Great that you fixed it so quickly!!

**End of day 1**

## Feedback for Day 1 of Python for Scientific Computing

:::success
News for day 1 / preparation for day 2
Today, we covered jupyter-lab and the python libraries that make data science happen. Day 2 looks at plotting in Python and something a little bit more advanced: scripting and profiling. Scripting is especially useful when you need to move away from Jupyter, for example when you need to run your python code on remote systems such as High Performance Computing (HPC) clusters.
:::

### Today was (vote for all that apply):
too fast: ooooooooo
too slow:
right speed:
too slow sometimes, too fast other times: oooooooooo
too advanced: o
too basic: o
right level: oooo
I will use what I learned today: oooooooo
I would recommend today to others: oooooooooo
I would not recommend today to others:
It was a bit fast for me:

### One good thing about today:
- new packages were shown +1 +2
- excellent for newbies +2 +1
- Xarray was completely new to me and seemed useful +2
- A lot of information -1 +1
- Xarray
- I really enjoy the teaching method that includes conversation amongst the teachers. This enhances my understanding, compared to one teacher doing everything. +1
- A lot of new stuff +1
- Interesting content and topics +1
- even if I did not code myself, I learned so much - especially from people answering my questions in the notes file (this is awesome!)
- lovely people!! thanks +4
- Interesting and well structured +4

### One thing to improve for next time:
- have more time for exercises +5
- or drop the exercises completely and leave them for individual work after the lectures +2 -1
- slow down a bit +3 -1
- I would like to have a bit more on the way Python works (what is behind it); the code to sort a dataset can probably be found in Copilot. +1
- Some prep before we join the class, like going through the code, so that if we have any questions we can ask during the session +1
- Would be good to have some structure or outline in the sessions/lectures, like "1. 1st Topic, 2.
Discussion, 3. 2nd topic, ..." - it was difficult for me to follow some of the spekers +1 - The lectures were incoherent at times +2 - It would be nice if you could go through high level idea more, since we don't have much time to do the exercises. - If possible, have more time in general and divide the exercise timings unevenly (first task: 2min, 2nd task: 5min, 3rd task. 15min or similar) ### Any other feedback? General questions? - some of the tasks were very generic, while other had not enough time to go through. Also, quite a lot of time was spent to just chatting between hosts, I tend to lose attention to the lecture in this moments. - considering the background of the participants, and use their own scientific problems as examples +2 - especially the morning tasks were quite simple but they very quickly got quite challenging towards the afternoon +3 - https://www.twitch.tv/coderefinery get stucked for me in the morning. It took me a while to understand that I am missing the lessons xD Refreshing helped +3 - The afternoon session was difficult to follow. I wish I was able to follow it or have executed the code before today - For me it was not enough time to do the exercises. I would like to take my time reading documentation and understand in more detail what is going on. I would prefer uninterrupted lecture going in more detail into usecases and syntax of the code. - Seemed very overwhelming for someone that is not familiar with none of those Numpy, pandas, Xarray things, only very basics of python - A good overview of the packages, good speakers. Could be useful to have some more time on each topic though, perhaps reduce the lunch break if necessary? ## Thank you everyone for the useful feedback! :heart_eyes_cat: ----- # Day 2 ## Icebreakers ### I was here yesterday: yes: oooooooooooooooooooooooooooo no: o partly: o Yes Yes Yes yes ### How big ios the data you use with Python? no data: oo <1M: oo 1-100M: ooooo 100M-10G: oooooooo 10G - 1T: ooooo 1T - 100T: o more: o not sure: ooo Varies ### How do you make figures for your reports or articles? Which tools do you use? Which libraries? - R with ggplot2 +10 - matplotlib in python +5 - matplotlib, seaborn, bokeh, ... - Veusz, matplotlib - LaTeX, TikZ, gnuplot, MATLAB, Inkscape - Prism - R ComplexHeatmap - Matlab, tikz, matplotlib - Varies, depends on the type of analysis, Excel for customization and even PowerPoint for precision edits - Mostly Matlab - ggplot and mony others - [yEd](https://www.yworks.com/products/yed) - [excalidraw](https://excalidraw.com/) - R (ggplot2) + Inkscape for some editing - [geogebra](https://www.geogebra.org/) - seaborn/ggplot2 for figures, Canva for post-processing - Matlab - Matlab, inkscape, drawio - Vega-altair - spss Always label your axes https://xkcd.com/833/ ### What's the most chaotic data you have used? - binary data with mixed record-lengths and text,single & double precision data - Data with randomly varying datatypes, large files with specific lines that are actually useful - Diffusion MRI data (4D) mixed with demographic & clinical data (numeric, text) - fMRI - Poorly digitalized data, data format varies with year/period of time - Qualitative data from interviews - Clinical data, where most of things were written by clinicians without any template or strict format. In Finnish. 
- Massive matrix, geographic data from across the globe, with each location followed by sparse occurrences across the rest of the (thousands of) columns - untimed measurements and video materials to combine movements and sensor data and analyse all together ## Plotting with Vega Altair https://aaltoscicomp.github.io/python-for-scicomp/plotting-vega-altair/ - I tried executing the xOffset line but the histogram bar is still not stacked +2 - Did you do the date conversion and all of the data preprocessing steps beforehand? ``` # replace mm.yyyy to date format data_monthly["date"] = pd.to_datetime(list(data_monthly["date"]), format="%m.%Y") ``` - Without it the date is not temporal, it is nominative. - I had done so, but I guess it just took more time. Now it's working well for me. Thank you! - For me, I did that too, but it still plots the same. Tried restarting kernel and re-running the process but still unsuccessful +1 - Did you rerun all cells in order? Start a new notebook and run the code I pasted below.. still same issue? I also added a print of the Altair version. I am on 5.3.0 and it looks like the example we have in the materials. - Yes, and tried it again just now -- ah okay, probably version is the issue, mine is 5.0.1. Thank you! - 5.0.1 is from 2023 https://github.com/vega/altair/releases/tag/v5.0.1 you want to update your installation :) - thank you, I tried updating to 5.3.0 but the issue persists - Does the code below print the new version? You need to restart jupyter after the upgrade. - Yes! it worked after restarting jupyter :) - Reproducibility achievement unlocked :D ``` import pandas as pd import altair as alt print(alt.__version__) url_prefix = "https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/plotting/" data_tromso = pd.read_csv(url_prefix + "tromso-monthly.csv") data_oslo = pd.read_csv(url_prefix + "oslo-monthly.csv") data_monthly = pd.concat([data_tromso, data_oslo], axis=0) # let us print the combined result data_monthly # replace mm.yyyy to date format data_monthly["date"] = pd.to_datetime(list(data_monthly["date"]), format="%m.%Y") alt.Chart(data_monthly).mark_bar().encode( x="yearmonth(date):T", y="precipitation", color="name", xOffset="name", ) ``` - ModuleNotFoundError after importing altair. Whats the problem? Ok. I'll go back to check :) So, back in bussines, thanks! - It might be that you do not have Altair in your installation. If you installed with conda, you can open a terminal, actiate the conda environment and check with "conda list" if Altair is installed - If you run jupyter-lab from the same environment (i.e. the one from the instructions) it should be present - I have the same issue, it appeared in the conda list but still not working in jupyter - What OS version do you use ? - Sorry, im new here, dont know what OS is :D - Operating System (Windows/Linux/mac) - Windows - How are you starting jupyter? command line or some other launcher? - Miniforge Prompt - is the environment activated when starting jupyter lab ? - Yes - could you, in the miniforge prompt run python and try to import altair ? Just start python (`python`) and run `import altair` in the python prompt. - I got Type Error: - Could you copy&paste what's in your command window here ? - _TypedDictMeta.__new__() got an unexpected keyword argument 'closed' - What was the command you ran to get this? (copy the Full line please) - >>> import altair - And that error message is the only result you get? no track at all? 
- Well I got a loads of text but they all contain paths etc including my name which I reacon wasnt allowed to share here - Could it be that you have some pip installed packages in your user settings (installed via pip install --user)? That could cause an issue here. Those packages are normally located in - This was the first time loading all these things for this course so I havent been installed any "additional" things that wasnt instructed - Not sure if we can solve this at the moment if you are from Aalto you can come to our garage and we can have a look, but this would need a lot more info that's probably not sensible to past/hadle here. - unfortunatelly not from Aalto. Okay, thanks for trying anyway. - or in the terminal with miniforge: - `conda activate python-for-scicomp` - `conda list` - Altair should be in the list of installed packages - Or `conda list altair` - When would you prefer Altair over Matplotlib? - Depends on which you prefer. I (Simo) use Altair nowadays for everything, but there is no single best plotting library. I prefer the declarative style because I feel it gives me more control, but others prefer matplotlib and create great plots with it. - I personally look at galleries examples and use the one that looks closer to what I have in mind - https://altair-viz.github.io/gallery/index.html - https://matplotlib.org/stable/gallery/index.html - Overall: It's a question of preference/ease of use/familiarity. If it would take me way longer to produce a certain plot in one or the other, I would use the one where I'm getting the same result faster. If I need features that only one provides, use that. I think vega altair is faster if you have an existing dataframe, so most likely I would start with altair and switch back to matplotlib if I miss features. - Thank you. Do you mean that Matplotlib allows for more features than Altair? - Not necessarily. More that there might be some different features in matplotlib to altair. They are just different librarieries, with slightly different focus. - Should you format your data in Pandas first before using Vega-Altair? - In general yes. Do all pre-processing before you start plotting. - We'll talk about transformations a bit later, but in general you want to have the data in a tidy data format. * If the datasets have the same range in the x-axis but different number of bins. Can pandas and altair still handle it? - Yes, typically you can re-bin the data through the `maxbins` option. - Thanks - Why do we sometimes use pandas to load data before plotting with Altair, and other we can plot directly without pandas? - I am not sure what you mean. We need pandas to load and store the data in memory. It is also possible to plot mathematical graphs, without loading any data (e.g. a trajectory with two sines interacting https://altair-viz.github.io/user_guide/transform/calculate.html) :::success ## Exercise until xx:50 https://aaltoscicomp.github.io/python-for-scicomp/plotting-vega-altair/#exercise-using-visual-channels-to-re-arrange-plots I did (add "o") to the poll: Exercise 1.1: ooooooooooooooo Exercise 1.2: oooooooooooooo Exercise 1.3: ooooooooooooooo Exercise 1.4: ooooooooooooo Not doing exercises: o ::: ## Vega Altair continued - Could you share some resources about the coloring aspect of plots? Are some themes better than others? 
- I like this one https://clauswilke.com/dataviz/ and this chapter https://clauswilke.com/dataviz/color-pitfalls.html explores for example how color-impaired individuals might not be able to see your plots (about 4% of the population)
- [matplotlib has a great article](https://matplotlib.org/stable/users/explain/colors/colormaps.html) on its colormap choices. vega (which vega-altair uses underneath to do the plotting) has the same [color schemes](https://vega.github.io/vega/docs/schemes/) for the same reasons of uniform perception.
- Could you explain this a bit more? “Vega-Altair also provides many different data transformations that you can use to quickly modify plots.” Does it mean that I can use Altair for both data transformation and visualization? When and why would you prefer Altair over Pandas for data transformation?
- Yes, you can do data transformations with vega-altair, but usually those relate to what you want to visualize. For more fundamental transformations you'll want to use pandas. It is hard to say where the breakpoint is, but pandas is recommended when the data is changed in some way, while Altair is good for something that affects the plotting representation of the data (like time ranges). With pandas you can also verify that you're getting the exact results that you want.
- If I don't specify the theme, will a default theme be selected?
- Yes. As a general rule with Python, if you do not specify all input parameters for a function, the function will use defaults. If some parameters are mandatory, the function will fail without those.

:::success
## Exercise until xx:35
https://aaltoscicomp.github.io/python-for-scicomp/plotting-vega-altair/#exercise-adapting-a-gallery-example
:::

## Working with data

https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/

- Can panel data be considered nD array data? (Where n represents the time)
- What do you mean by panel data?
- Panel data is a dataset where the units (for example people) are observed many times (for example I observe different characteristics of many countries for 4 years: axis 0 = countries, axis 1 = characteristics, axis 2 = the years)
- In Pandas, and when plotting with altair, you would always encode such information in 2D as a table:

  | Country | Year | Height | Favorite food |
  | ------- | ---- | ------ | ------------- |
  | Finland | 2010 | 168cm  | Pulla         |
  | Finland | 2011 | 168cm  | Ruisleipa     |
  | Sweden  | 2010 | 143cm  | Lingonberry   |
  | Sweden  | 2011 | 143cm  | Pasta         |

- Perfect, thank you very much :-). Then, also if the characteristics are quantitative, it is always possible to rewrite it like this. For this reason the dataset is not nD.
- Data can always be written in this format. It is not always very efficient though, because you duplicate values a lot (in the table we duplicate the Country and Year over and over again). Then again, you probably do not want to plot millions of data points, so perhaps in those cases the data processing can be done on 4D numpy arrays, and then for plotting you put just the data you need in a Pandas DataFrame.
- Different ways of organizing the data are sometimes referred to as [long vs. wide formats](https://towardsdatascience.com/long-and-wide-formats-in-data-explained-e48d7c9a06cb/). [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/07_reshape_table_layout.html#long-to-wide-table-format) and [altair](https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data) work with both, but altair prefers the long format (similar to the table above). See the pandas tutorial on [pivot tables, melt and reshaping](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) for more information on how to convert tables around. Sometimes organizing the data in the way you want can require lots of fiddling around, especially if you have data encoded in the column names. (A small sketch follows below.)
- Perfect!
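
As an illustration of the wide-to-long reshaping mentioned above, a minimal sketch with made-up values:

```python
import pandas as pd

# "Wide" table: one row per country, one column per year.
wide = pd.DataFrame(
    {"Country": ["Finland", "Sweden"], "2010": [168, 143], "2011": [168, 143]}
)

# Reshape to the "long" format that altair prefers: one row per (country, year) observation.
long = wide.melt(id_vars="Country", var_name="Year", value_name="Height")
print(long)
```
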
- What's the Vega part of Vega-Altair? You don't do an "import vega"
- "Vega" is the visualization grammar: https://vega.github.io/vega/ It is a JSON format that can then be converted to HTML, SVG or PNG. Altair is the python package that generates the JSON, which is then consumed by vega to produce the actual plot. This all happens behind the scenes though.
- I need to process a very large (> 1B entries) dataset. On import I process textual data, and after that processing there are some entries with duplicate values on some field. All entries with duplicates on that field need to be removed. Storing the values as strings in a python set takes too much memory. So I have considered calculating a hash and storing it instead. My question is whether you can suggest a way to do that efficiently and how to calculate the hash collision probability? Or do you have a suggestion for deduplicating in another way? (I also tried processing the data with python followed by a bash sort, awk and uniq combo, but it is cumbersome.)
- Using a hash to replace a long string seems like a good idea to me! It's hard to be specific without knowing more about the data, but I have a feeling your use case may be good for a more advanced Pandas-like package such as Polars: https://pola.rs/ or Dask: https://docs.dask.org/en/stable/index.html
- I won't bet money on this, but dask might even be enough without any hashing :) (at least when loading 100 million strings I had no issues)
- If the strings repeat, you can also use the [pandas.Categorical](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html) datatype. It basically takes all of the strings, gives each one a number and then represents the strings as a column of integers. There is then a stored mapping between the strings and the numbers.
- You can also create this mapping table yourself if you want. Then you'll have two tables: one with the strings and a mapping number, and one with the rest of the data where strings are represented as numbers from the other table.
- [DuckDB](https://duckdb.org/docs/stable/clients/python/overview) is also a good tool for storing big data for analysis. It can store the data in e.g. parquet files, where strings are compressed. You can then query just the data that you want from the files. It acts like a database, but does not require any installation or database engine. Very useful for data analysis.
- Building a database with an index might be an approach: reading by entry, discarding duplicates on storing. Or, depending on the overall size, trying to use an HPC system for the wrangling (which might have sufficient memory). How big is the original data amount?
- To clarify, I only keep the value of that single field in memory (a python set) and the data is streamed (read, process, write if not a duplicate) in order to reduce the memory footprint. DuckDB is great, but it still crashes if the data size exceeds RAM (on Linux), unfortunately.
- At some point, temporarily running some database server (sql, mongo, or whatever you prefer) could be an approach.
- Usually at this point you'll want to query the data to get rid of unnecessary data when loading it, instead of loading it and then filtering it. With e.g. [duckdb expressions](https://duckdb.org/docs/stable/clients/python/expression) or [polars expressions](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions) you can filter the data for the relevant parts without loading the whole data into memory (or while [streaming the data](https://docs.pola.rs/user-guide/concepts/streaming/)). Of course this is complementary to converting the data into e.g. a categorical data format.
- I currently process such data on the Puhti large memory partition. A less memory-intensive method would be nice. I will look into your suggestions, thanks.
- Memory mapping (within dask, for example) can work really well, but if you are doing GPU stuff, then you need better tricks for faster I/O.
- I personally fear that without some form of database service, there isn't really a way to avoid having to do these checks yourself (and thus needing the memory). A db will check on inserting something into the db, and essentially does lookups, so it avoids storing duplicates by doing more checks.
- I have tried UNIQUE constraints (indexing) with SQLite; it does not seem to play nice on Lustre. Would something like PostgreSQL on Pukki be good (how about bandwidth)?
- DuckDB can do similar sql queries, but the data is stored in parquet instead of sqlite, which works better with HPC file systems.
- I had a DuckDB version of my code at one point where I stored the whole dataset (with a UNIQUE constraint). However, it always crashed when exceeding physical RAM (setting PRAGMA memory_limit did not help), hence I moved the data preprocessing into a separate script. Although I have not tried DuckDB in the last year.
- You might want to check what the underlying data format is. Some data formats (csv, json, sqlite) require data to be loaded into memory, while others can stream directly from the disk (parquet).
- Not entirely sure, but sqlite is essentially an "in mem" database. What I was thinking about was running something like a postgresql or mongo server (via containers), but admittedly I'm not sure how nicely those play with HPC file systems like Lustre. Btw, how big are those strings that you need to match?
- chemical compound SMILES (so a reduced character set; should be able to encode to a small space too, as an option), maybe about 40 ASCII chars each on average (this is the field for deduplication, the actual dataset has much more information)
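
A hedged sketch of the hash-digest idea discussed in this thread (the file names and field layout are made up). Back-of-the-envelope collision estimate: with n records and a d-bit digest, the chance of any collision is roughly n²/2^(d+1), so 10⁹ records with a 64-bit digest gives a few percent, while a 128-bit digest (`digest_size=16`) makes it negligible. A Python set of digests still carries per-object overhead, but far less than storing the full strings:

```python
import hashlib

seen = set()

def is_new(value: str) -> bool:
    """True the first time a value is seen; stores only an 8-byte digest per value."""
    digest = hashlib.blake2b(value.encode(), digest_size=8).digest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

# Streamed deduplication (assumes the key is the first whitespace-separated field).
with open("input.txt") as src, open("deduplicated.txt", "w") as dst:
    for line in src:
        if is_new(line.split()[0]):
            dst.write(line)
```
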
    - Usually at this point you'll want to query the data to get rid of unnecessary data when loading it, instead of loading it and then filtering it. With e.g. [duckdb expressions](https://duckdb.org/docs/stable/clients/python/expression) or [polars expressions](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions) you can filter the data for the relevant parts without loading the whole dataset into memory (or while [streaming the data](https://docs.pola.rs/user-guide/concepts/streaming/)); a minimal sketch follows at the end of this thread. Of course this is complementary to converting the data into e.g. a categorical data format.
    - I currently process such data on Puhti's large memory partition. A less memory-intensive method would be nice. I will look into your suggestions, thanks.
    - Memory mapping (within dask for example) can work really well, but if you are doing GPU stuff, then you need better tricks for faster I/O.
    - I personally fear that, without some form of database service, there isn't really a way to avoid having to do these checks yourself (and thus needing the memory). A database will check on inserting something into it, and essentially does lookups, so it avoids storing duplicates by doing more checks.
    - I have tried UNIQUE constraints (indexing) with SQLite, it does not seem to play nice on Lustre. Would something like PostgreSQL on Pukki be good (how about bandwidth)?
        - DuckDB can do similar SQL queries, but data is stored in parquet instead of sqlite, which works better with HPC file systems.
        - I had a DuckDB version of my code at one point where I stored the whole dataset (with a UNIQUE constraint). However it always crashed when exceeding physical RAM (setting PRAGMA memory_limit did not help), hence I moved the data preprocessing into a separate script. Although I have not tried DuckDB in the last year.
        - You might want to check what the underlying data format is. Some data formats (csv, json, sqlite) require data to be loaded into memory, while others can stream directly from disk (parquet).
    - Not entirely sure, but sqlite is essentially an "in mem" database. What I was thinking about was running something like a postgresql or mongo server (via containers), but admittedly I'm not sure how nice those play with HPC file systems like Lustre. Btw, how big are those strings that you need to match?
    - Chemical compound SMILES (so a reduced character set, should be able to encode to a small space too, as an option), maybe about 40 ASCII chars each on average (this is the field for deduplication, the actual dataset has much more information).
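    - Following up on the lazy filtering/streaming suggestion above, here is a minimal polars sketch (the file and column names are made up; memory use for the deduplication itself still depends on the number of unique keys):

      ```python
      # Minimal sketch with made-up file/column names: scan lazily, filter and
      # deduplicate, and only materialize the result at the end.
      import polars as pl

      lazy = pl.scan_parquet("compounds.parquet")        # nothing is read yet
      result = (
          lazy.filter(pl.col("value").is_not_null())     # filter pushed into the scan
              .unique(subset="smiles", keep="first")     # drop duplicate SMILES
              .collect()                                 # materialize only the result
      )
      print(result.shape)
      ```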
- A more general question: when working with lots of data files that all include a lot of lines, what are the tools that are good for that situation?
    - What is in the lines?
        - A mixture of strings and floats. Generally I would like to use the floats, but the strings are used to identify which lines are interesting.
    - So is it a CSV?
        - I don't think so, in my case it would be for example VASP output data (looks like a txt file with a LOT of different values and descriptions of what's been calculated and with what inputs). According to the internet it's a "text-based output file", if that helps.
    - For VASP you might be interested in https://www.vasp.at/py4vasp/latest/ or https://pyprocar.readthedocs.io/en/stable/index.html . Usually you'll want to operate on the [output formats](https://www.vasp.at/wiki/index.php/Output) that the programs provide and not work on stdout outputs.
        - Yes, I do work (mostly) on the OUTCARs. Will have to look into py4vasp, never tried it. Thanks!
    - Another useful Python package is ASE (https://ase-lib.org/index.html). It can read VASP output and load the data as part of an Atoms object, which contains all the info (or most of it, not sure if something is lost).
        - Yes, ASE is useful, I have used it a bit, but I will have to look more closely at the details. Had semi-forgotten it can be used for more than just the "basic" things, thanks!
        - Happy to help :) Just as a reminder, check the read function (https://ase-lib.org/ase/io/io.html#ase.io.read). There you can find exactly which VASP output files you can read with ASE (OUTCAR for sure), and what info gets loaded.
- ...

## Scripts

https://aaltoscicomp.github.io/python-for-scicomp/scripts/

:::success
## Exercise 1 until xx:15
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-1
:::

- Is the sys.argv[0] = script file?
    - Yes. When you run `python something.py`, the first thing after `python` is `something.py`, so it's the first argument.
    - So, the index of the first argument is 1? Seems weird to start indexing from 1 when I'm used to the first being index 0 :)
        - First index is the script name. But it can feel strange.
        - As mentioned on stream: the arguments `sys.argv` gives are the arguments to the Python interpreter, not to the script, and therefore `sys.argv[0]` is always the name of the script that was executed, since you call `python <script> arg1 arg2`.
- After updating sys.argv into the .py script, does it need to be 'updated' or run again from the terminal for the changes to apply? (Mine isn't working currently)
    - I'm not sure I understand the question. A script doesn't do anything until it is run from the terminal. So the new script, once saved, is ready to be run, but it is not run by default; that needs to be done manually.
    - So for the people with not a lot of experience with the terminal: how is that done? (Sorry if this is a dumb question)
        - No dumb question :) Once you have updated the script to use sys.argv, you want to open the terminal and run `python script_name arg1 arg2 arg3`. So for the example we had in the lecture, since we had `sys.argv[1], sys.argv[2], sys.argv[3]` in the script, we need to pass three things on the command line.
    - Right okay I did that, I wasn't sure if that updates the script or just passes the 'things'. So mine keeps updating another .png and doesn't create the new one.
        - Did you save the .py file after updating it to use sys?
        - Hmm no
        - Right, the .py file needs to be saved, after that it's ready to go (I also forget to save a lot of times ;) )
        - Yes okay thank you, that's what I was trying to ask, if it needs to be permanently changed. My wording could've been clearer :D but thanks for the help!
- Regarding the date script, could it be used to sort of timestamp files based on when the file was created? E.g., you have some datafile, you run some script on the data to do some calcs, create a new file with the computed values and then include a timestamp (or perhaps the inputs the data was computed with)
    - Sure. The code could be modified to do that. You can use [strftime](https://docs.python.org/3/library/datetime.html#datetime.datetime.strftime) or pandas [pandas.Series.dt.strftime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html) to convert timestamps into a specific time format. You could then choose the file name based on the timestamp.
    - One option could be to do something like `datetime.datetime.now()` and add it to the output file name (a minimal sketch follows below).
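    - A minimal sketch tying the two threads above together (the script and file names are made up): read an input file name from `sys.argv` and build a timestamped output file name with `strftime`.

      ```python
      # Minimal sketch with made-up names: take the input file from the command
      # line and build a timestamped output file name.
      import sys
      import datetime

      input_file = sys.argv[1]                                   # first argument after the script name
      stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")  # e.g. 20251126-143500
      output_file = f"result-{stamp}.png"

      print(f"Would read {input_file} and write {output_file}")
      ```

      You would run it as e.g. `python timestamp_example.py observations.csv` (both names are hypothetical).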
- It gives me the following error when trying to run it in the terminal. How to fix it?

  ```
  PS C:\Python excersice> pythom3 weather_observations.py
  pythom3 : The term 'pythom3' is not recognized as the name of
  a cmdlet, function, script file, or operable program. Check
  the spelling of the name, or if a path was included, verify
  that the path is correct and try again.
  At line:1 char:1
  + pythom3 weather_observations.py
  + ~~~~~~~
      + CategoryInfo          : ObjectNotFound: (pythom3:String) [], CommandNotFoundException
      + FullyQualifiedErrorId : CommandNotFoundException
  ```

    - Try `python3` instead of `pythom3`. The name of the interpreter seems to be wrong.
- Did anyone try the February data command line input? I get an empty figure. I think I followed the steps as faithfully as I could :/
    - Works for me. If you get a new picture but it is empty, then I suspect the problem might be with the dates, so I would try to check the following things:
        - start_date and end_date are not swapped
        - start_date is smaller (earlier) than end_date. This could happen if you e.g. use the default start date but pass a new, earlier end date.
    - [I have checked all that and I also plotted a larger time period, and it seems that from 01/2021 to 05/2021 there is little to no data in the csv file. Maybe I did something wrong when loading it? Is this normal? -> I downloaded the csv and it seems there are no entries for February 2021, I think that is why I was getting an empty graph.]
    - Interesting, if you could paste the exact line you ran in the terminal I can take a closer look. Might also be an outdated example in the teaching material.
- I haven't used Python before, so I was wondering why we used miniforge yesterday to be able to launch Jupyter. Is there a Jupyter application or website that we can access directly without needing miniforge? And why are we using this interface (Jupyter), are there any others that are recommended? In R, for example, I use RStudio.
    - There are several places where you can use external versions of Jupyter:
        - https://mybinder.org, runs on volunteer resources, does not save your work
        - If at Aalto, https://ondemand.triton.aalto.fi/. Jupyter is on the front page. Saves your work on Triton.
- Don't know if it was mentioned, but storing command line options in a config file also helps reproducibility; you don't have to try to remember what command line options you used to produce your data.
    - Good point. I think Thomas briefly mentioned it.

:::success
## Break until xx:10
:::

## Profiling

https://aaltoscicomp.github.io/python-for-scicomp/profiling/

- Is profiling a must-have tool in our arsenal if we are not that involved in optimising code? Meaning, if our work is methodology development, and code is more of a thing we just have to do rather than our sole purpose, would it be bad if we do not bother too much with developing this skillset? What is your recommendation for someone who is more interested in R&D rather than software development?
    - This is a very good question! Profiling in general is useful for two cases:
        - Your code is good but you want it faster
        - Your code simply stops working when trying to use it with bigger data
    - For the former, I would tend to agree that not every researcher needs to work on it, especially if the code in question is not performance critical. Regarding the latter, imagine your methodology/algorithm works for e.g. 10 datapoints but stops working for 1000.
      Then this might also indicate a fundamental algorithmic issue in the developed code, which I would classify under R&D. For this issue, profiling can be useful.
    - Thank you, this actually is helpful, because I feel a bit overwhelmed with all the things I have to learn and I am worried that I am not dedicating enough time to polishing code compared to polishing the physics or chemistry behind my methods. I will keep working on good coding principles, and I guess that when I see a problem in efficiency, I will pick up profiling then.
    - That's the right attitude! Also, this is where the professional figure of the RSE comes in. Something being necessary (e.g. code doesn't scale and cannot be run for a higher number of data points) still doesn't (shouldn't) imply that one person (the researcher) should fix it. It is very understandable to feel overwhelmed and to feel like you have enough on your plate with the physics/chemistry part; that's where an RSE (if available at your university) can help, taking the burden off you :)
    - I would also add that advanced profiling etc. is not absolutely necessary for doing research, but doing timings, logging and knowing what your program is doing is. Measuring can make it easier for you to know your program, and even advanced coders can easily make mistakes that affect the code runtime negatively. So if you're not using profiling, you probably need to figure out some other way to know what your program is doing. If your program is fast enough for you, you might not need profiling, but if it isn't, figuring out the problem can be hard without it.
- Magic commands don't work in my Jupyter notebook, so I cannot run `%%prun -s tottime -l 5`. If I run it like `%prun -s tottime -l 5` it does give me an output, but it is all 0s, as if it did not run the function.
    - Did you have all of the code in the cell you were running? Also, for the cell profiling you'll want to use the cell syntax `%%prun` with two percent signs.
    - Yes, I am quite sure it was all in one cell. If I put `%%` it just seems like it does not run it. Like there is no output.
    - Did you do the imports beforehand? Try:

      ```python
      %%prun -s tottime -l 5

      import math
      import random

      def calculate_pi(n_darts):
          hits = 0
          for n in range(n_darts):
              i = random.random()
              j = random.random()
              r = math.sqrt(i*i + j*j)
              if (r < 1):
                  hits += 1
          pi = 4 * hits / n_darts
          return pi

      calculate_pi(10_000_000)
      ```

    - Yes. Still nothing.
    - What operating system are you on?
    - I use Windows, but I have WSL. So actually everything should be running in bash/Linux.
    - Does `%%scalene` work?
    - Nope. If I do `%%timeit` this is the error I get: 'UsageError: Line magic function `%%timeit` not found'. If I do `%timeit` then it gives me an output. But for scalene, no output either if I put `%` or `%%`.
    - I restarted the kernel and now `%%prun -s tottime -l 5` is working, but not scalene.
    - Did scalene produce a `profile.html` file? There is a bug in the current installation (reported to the upstream developers) that might stop the output from appearing in the Jupyter view.
    - The error that I get is: UsageError: Cell magic `%%scalene` not found.
    - My mistake: run `%load_ext scalene` first in a cell to load the scalene Jupyter extension.
    - I think I got it now. Thanks for your help!
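    - If the notebook magics keep misbehaving, profiling also works from a plain script with the standard-library `cProfile` module. A minimal sketch reusing the `calculate_pi` example above (the smaller dart count is only to keep the run quick):

      ```python
      # Minimal sketch: profile the calculate_pi example without notebook magics,
      # using the standard-library cProfile module.
      import cProfile
      import math
      import random

      def calculate_pi(n_darts):
          hits = 0
          for n in range(n_darts):
              i = random.random()
              j = random.random()
              r = math.sqrt(i*i + j*j)
              if (r < 1):
                  hits += 1
          pi = 4 * hits / n_darts
          return pi

      # Run the function under the profiler and sort the report by total time
      cProfile.run("calculate_pi(1_000_000)", sort="tottime")
      ```

      You can also profile an existing script without modifying it via `python -m cProfile -s tottime yourscript.py` (script name hypothetical).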
- **INFO**: To try scalene, load the scalene extension with `%load_ext scalene` in another cell before running `%scalene`

## The Library Ecosystem

https://aaltoscicomp.github.io/python-for-scicomp/libraries/

- Are there any "general guidelines" on when spyder/VSCode/PyCharm is better/worse, or are they more based on personal preference?
    - Editor wars have been going on since time immemorial. So it is really a question of preference (vim gang unite).
    - Ok, thanks. PS: nano ftw

:::success
## What Libraries do you use in your work
:::

- [ibis](https://ibis-project.org/): dataframes from multiple different sources
- [polars](https://pola.rs/): a pandas-like dataframe library written in Rust for bigger data analysis
- NumPy, matplotlib
- [ASE](https://ase-lib.org/about.html): good when doing anything with atomic systems, from creating structures to keeping track of sizes and such
- numpy, pandas, scipy, matplotlib +2
- numpy, pandas, matplotlib
- numpy, scipy, matplotlib, pandas (openmm?)
- rdkit: anything chemistry
- scikit-learn
- pytorch
- mne (EEG signal analysis), numpy, scipy, pandas
- [DuckDB](https://duckdb.org/docs/stable/clients/python/overview): data library for bigger data analysis
- tensorflow
- psychopy, fmriprep, mne
- [signac](https://signac.readthedocs.io/en/latest/): data management, helps to set up a workflow involving many parameters/data (e.g. for a parameter search)
- flopy, matplotlib, geopandas, pyemu, shapely
- [mlflow](https://mlflow.org/docs/latest/): model tracking
- [hydra](https://hydra.cc/): configuration file and command line management
- [click](https://click.palletsprojects.com/en/stable/): library for creating easy command-line interfaces
- [pytorch-lightning](https://lightning.ai/docs/pytorch/stable/): library for creating pytorch models
- [PythonCall.jl](https://github.com/JuliaPy/PythonCall.jl): to call Julia from Python
- [SQLAlchemy](https://www.sqlalchemy.org/): SQL interface library
- [FastAPI](https://fastapi.tiangolo.com/): library for creating REST APIs
- ollama, vllm: libraries for hosting large language models
- [GeoPandas](https://geopandas.org/en/stable/index.html): library for dealing with geospatial data
- sys, argparse, numpy

**End of day 2**

---

## Feedback for Day 2 of Python for Scientific Computing

:::success
News for day 2 / preparation for day 3
Today we covered plotting, working with data, scripts, profiling and selected libraries. Day 3 looks at parallel programming and integrating C code into Python in the morning, and creating Python packages and managing dependencies in the afternoon.
:::

### Today was (vote for all that apply):
too fast: ooo
too slow:
right speed: ooooooooooooo
too slow sometimes, too fast other times: o
too advanced: ooo
too basic:
right level: oooo+5
I will use what I learned today: oooooo
I would recommend today to others: oooo+2
I would not recommend today to others:

### One good thing about today:
- I found the pace today better than yesterday; the breaks allowed more time for the tasks, or relaxing if needed +4
- The profiling tools will be useful
- I found the plotting session to be very useful (since I mostly use Python to plot my datasets). I think this is my favorite day so far! :)
- Vega-altair seemed easy enough, I was able to follow, and it seemed useful
- Really liked the profiling/library ecosystem parts!
- Great learning on plotting
- I felt more confident in today's lessons :) +1
- Today's pace was great. Keep it up.
- I had some other work to do as well, but great to have the materials and video later on - thanks!
### One thing to improve for next time:
- ... Some sessions were too fast. The plotting session was a nice pace :)
- Overall, maybe make the days an hour longer so there is more time for all exercises (i.e. the ones now left for self-study)
- Scripts session was too fast +1+o

### Any other feedback? General questions?
- Matplotlib would have been a great addition +2
- Perhaps one could suggest how the scripts can be altered to suit other needs (unsure if this is all that possible) +1
- I have some py-routines which read various output files and return numpy arrays. I want to store them in some type of library and call them from other py-scripts. Will this (including requirements on how to code them) be covered tomorrow, or do you have some pointers on this subject?
- ...

---