Python for Scientific Computing 2024 - Archive of chats part 2

Chat archive

This document https://hackmd.io/@coderefinery/python2024archive2 is the archival of the live chat
If you are looking for the live chat, go here: https://notes.coderefinery.org/python2024
Program: https://scicomp.aalto.fi/training/scip/python-for-scicomp-2024/
Materials: https://aaltoscicomp.github.io/python-for-scicomp/

Day 2 - 6/11/2024

Continued from https://hackmd.io/@coderefinery/python2024archive

Scripts

https://aaltoscicomp.github.io/python-for-scicomp/scripts/

**Have you used the command line before? **
Yes : oooooooooooooooooo
No : o
Yes, but still learning: oooooooooo
All the time: oooo
Sort of. A lot of use, but not very diverse commands. oo

Questions:

Can you define function call. Even though I have been working with Jupyter notebooks and ML, we did not really learn how to define a "function call"
- Function call is done with parentheses e.g. np.arange is a function from the numpy-module and np.arange(10) is a function call with argument of 10. You can define your own functions with def:
```
def my_function(x):
    return x
```

Exercise Scripts-1, we return at xx:15

https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-1
Try to get this notebook downloaded and run
We'll use this in the next exercise, too.

I am (vote for ones that apply):

done: oooooooo
not trying: o (not really using notebooks, just pycharm and scripting)
had problems: oo
wish for more time:

When I try to run the script, it kept mentioning no module named pandas, matplotlib… Is it recommended to build an environment in advance? Or did I miss something?
- The script should run within the environment you have been using for other parts of the workshop. So from a terminal, you first activate the environment and then the script will "see" all the modules.
- In general: Yes, when you want to use this on a cluster or somewhere else, generating an environment makes sense and is usggested.
Jupyter command jupyter-nbconvert not found.!!!
- did you run jupyter nbconvert or jupyter-nbconvert, if this is not working, you can also go the Export as route. File -> Save and Export Notebook as -> Executable Script. Only idsadvantage: it makes you download the file, so you have to upload it again.
  - i run jupyter nbconvert –to script weather_observations.ipynb
- Have you activated the environment for this course? There might be soNmeothing missing. It worked for me.
  - NO , my mistake
jupyter nbconvert --to script weather_observations.ipynb - invalid syntax
- I suppose you are running this on a terminal, did you activate the conda environment with the modules and dependecies?
- I have the same error (prompt: jupyter nbconvert –to script weather_observations.ipynb): Cell In[3], line 1
  jupyter nbconvert –to script weather_observations.ipynb
  ^SyntaxError: invalid syntax
  - What does the "prompt" of your terminal say?
    - That command should be run from within a terminal, but it seems it is now within a cell of a jupyter notebook, is it so?
    - yes, that is true. Should I use Miniforge prompt then?
      - Yes if that is what you installed and how you've been running jupyter.
      - You can also use the "Terminal" from Jupyterlab ("+" button to go to the launcher and then start Terminal, it opens in a tab.)
      - I opened (in Jupyter) a new launcher "+" > Other > Terminal, it created a py-script.
Is there a reason why nbconvert does not automatically include from IPython import get_ipython but it automatically includes get_ipython().run_cell_magic()
- Good question. I guess it doesn't want to break other things, that it could potentially break, when importing more things.
I'm wondering if there is a way to run the .py script so that the figure would open straight away? I use VS code and miniforge promt, and the figure appeared to my VS code folder, if that makes sense, even when I ran the script in the terminal.
- If you ran it in the integrated terminal for VSCode, this is expected behaviour. I have to run the .py file in the bash terminal (outside VSCode), and ensure there is a plt.show() statement in the file, and it runs fine.
  - Ah you are right, there was no plt.show(), so that's probably why. I ran the .py file in the miniforge prompt (so outside VS code)
- Yes, when it's not being run in the contained environment (Jupyter) you need to do a bit more to define where the outputs should go. It has advantages and disadvantages.
How do I execute the comand on the jupyter kernel? Not working with ctrl+enter or enter.
- The command jupyter nbconvert --to script your_notebook_name.ipynb should be run from a terminal. https://aaltoscicomp.github.io/python-for-scicomp/scripts/#save-as-python-script
  - I opened the terminal on jupyter like explained previously and then tried to execute the comand.
    - I see, and what error do you see in the terminal after pressing enter?
      - No error. nothing happens..+1
      - Hm, weird.
        
        You should see something like:
        
        [NbConvertApp] Converting notebook Untitled.ipynb to script [NbConvertApp] Writing 745 bytes to Untitled.py
        
        Try running "ls" (=lists content of the folder) in the terminal where you are, do you see the "ipynb" file?
        
        I can't run any comand
        
        I would restart jupyter-lab just in case (and the browser too if it still persists)
      - What operating system are you using?
        
        The problem was that I opened a console instead of the terminal. Sorted! Thanks
when running the script in the terminal, I'm getting an error "setHighDpiScaleFactorRoundingPolicy must be called before creating the QGuiApplication instance"
- "QGui" makes me think this is about popping up the graphics (what matplotlib is trying to do with the plotting part). I'm not sure what the solution is, but it might get fixed later on as we adjust the script.
  - Thank you. Unfortunately this was not solved; even in the optional exercise. I've been looking up this error in google and haven't solved it either
What is metavar in argparse?
- Whet it shows the auto-generated help that comes with --help, it would show --start=N instead of --start=START
Wait, is argparse positional args or something else..? I understand the sys.arg (for comparison)
- Argparse parses sys.argv in a automatic way, according to common conventions. It can separate positional and --option arguments and makes an object that holds everything it detects
- What do you mean by conversions? So does it not rely on positional at all?
  - (I mis-typed "conventions". Basically it takes --option things and removes them and the values, everything else is returned as positional arguments)
- What is the benefit of using this v. just sys.args?
What is the difference between arg and argparse?
- Is this answered above?
(Not the instructor) A nice trick with argparse is also to use vars(parser.parse_args()) so that it returns a dict, i.e. one can use args["name"] instead of args.name
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
I don't understand how to do this exercise… should I use jupyter notebook? Or Spyder? Or the command line (I don't know what is it)
- If you don't know the command line, I'd recommend reading the lesson and watching for now - we will try to explain later
- Basically, we had a notebook, and we wanted to be able to run it from the command line. We took the notebook, converted it to a .py file, and it can be edited (in JuptyerLab or any other editor) and we are now adapting it to run. We will be able to give command line options to change what it does, which is convenient.

Exercise, we return at xx:37

https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-2

Continue from the previous exercise, and explore what you would like.

Should we define a variable outside python with the url instead of: python weather_observations.py "https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv"
- That is definitely an option! I'll ask an instructor the intentions.
- I'd rather set a default value here in argparse, and allow users to set something here (if really necessary). But yes, you could also create an environment variable for this.
I got an error message saying "NameError: name 'fig' is not defined." How can I fix this?
- This means the Python variable fig isn't defined. In the notebook/script file it comes from fig, ax = plt.subplots(). Has something changed in it?
- NO, it came from "fig.savefig(output_file_name)"
  - Your script should have the "fig, ax…" as mentioned here just above, before reaching the point of "fig.savefig…"
  - I thought we were working on Ex.1
    - We were continuing from ex1. You can check that the file that you converted from the notebook ("weather_observations.py") has the definition of the variable "fig", before calling "fig.savefig".
    - I was trying the script above "discussion." I added the lines colored in yellow in .py file and ran the command just above "Discussion"
      - I see. If you add those lines to the weather_observations.py file, you just need to make sure that fig.savefig(output_file_name) comes after the block
      import matplotlib.pyplot as plt # start the figure. fig, ax = plt.subplots() ax.plot(weather['Local time'], weather['T']) # label the axes ax.set_xlabel("Date of observation") # etc etc...
What are the first and the second inputs in add argument? for example '-d' and '–date' in th example. What do they specify?
- On the command line it means that either -d or --date work. You can give multiple and both work. If you do --date=1/12/2020 or --date 1/12/2020, then args.date has that value. You can pass in options without modifying the script!
- So if i wanna add a third one it would stat with –-?
- No, having one or two dashes doesn't affect how many you can have total. Basically --date can have multiple letters but single-dashed ones can only have one letter in the option name.
- This may seem hard, and it does take some getting used to, but keep an eye out and you'll start seeing the patterns and you'll learn how to use it.
One should also remember that often a .py file works well as a configuration file if everything is done in Python
- Yes, you can load it and see options from it
- (but also consider security: if someone else gives it to you, it can run anything on your computer. Still useful when using yourself)
- You can also use code to define your config this way
I face: AttributeError: partially initialized module 'argparse' has no attribute 'ArgumentParser'
- Hm. My guess is this is something about the order of imports, and/or circular imports. Is this in this example? Are there any cilcular imports (A imports B imports A)
  I just imported pandas, matplotlib and argparse but it indeed says: AttributeError: partially initialized module 'argparse' has no attribute 'ArgumentParser' (most likely due to a circular import)
Isn't it easier to dump the required (long) arguments into a JSON that we can read into Python as parsed arguments? It would also enable very easy type/validity checking.
- It depends. Sometimes I like this json approach so that I have the full list of the argumnets stored. Sometimes the python script is something I want to reuse in various places just like a linux command.
- Absolutely. This is, somewhat, what the optionparser is doing.

Break, we resume at xx:00

Someone asked:

Would be nice to have in the future a small course about best practices when creating pipelines with bash/python/R:
- folder management
- Filename management
- architecture
- snakemake and other tools
  - We have a course like that and it is our main course: CodeRefinery.
  - You can have a look at last September run https://coderefinery.github.io/2024-09-10-workshop/ with videos and materials all openly available
  - The next run will most likely be around mid March 2025.
  - To be more precise the goal of CodeRefinery is not to create a pipeline per se, but to consider all the aspects of best practices related to computational reproducibility. The "pipeline" aspect is just one small part.
- Tuesday Tools and Techniques for HPC might also be relevant, if we do it again(?): https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/
can you explain more about this: 4 (or more) .py scripts/notebooks: For each of the 4 days take one code example from the course materials and make sure you can run it locally as a “.py” script or as a jupyter notebook. Modify it a bit according to what inspires you: adding more comments, testing the code with different inputs, expanding it with something related to your field of research. There is no right or wrong way of doing this, but please submit a python script/notebook that we are eventually able to run and test on our local computers.
- I should fix "4 days" (previous years) to "3 days" (this year), but basically the idea of the homework is that you submit some of the examples that you've been running / we've been demoing, with some changes so that you were able to run the scripts and test them.
  - get in touch with scip@aalto.fi and we can think of a homework that is suitable for your level
    - ok thank you

Profiling

https://aaltoscicomp.github.io/python-for-scicomp/profiling/

Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
+1+1+1+1+1
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
How is the cat SO cute
- Cats are cute by definition!
- CATS is is just the best cat
  def print_cat():
  cat_art = r"""
  /_/\
  ( o.o )
  
  ^ <
  """
  print(cat_art)

print_cat()

If you are using HPC you can just set -t to 60% and accept the complains of your peers.
- NO OPTIMIZATION ONLY MORE THREADS LOL
- THREAD SUPREMACY!
  - HPC admin is watching you :)
    - FULL SEND
What's profiling?
- We'll talk about this later, but in very short: Analysing where your code spends how much time, and how much memory is used at certain places
- That's the spirit LoLoL
- I see the point but as usually if I'm in a hurry I just put more resources at work, no time to optimization (unless it's for a publication method etc)
  - Which is (often) a valid approach in research. But if you want to get some code to be used by others, or be put into regular use, you have to start thinking about resource requirements, and thus optimisation. Also, depending on what you do, it might work for a small example, but even with large resources, might fail with slightly larger set, because of bad scaling, and then you need to optimize in order to allow you to run it at all.
    - very good points. thanks
      - e.g. we had a situation, where a user had code that worked, but took ages because of constant write/read operations. And that's something you can find with profiling, and get the computation time down from in that case) 2 days for each run to a few minutes per run, which made it much more useable…
  - Correct. It's a balance. At some point it becomse worth caring. I hope this lesson can show "giving a quick check isn't that hard" and you can easily make the most important improvements fast.
    - It's actually the most valuable lesson of my entire PhD that is now ending. If only I knew this before :(
.
Does adding profiling to code slow it down?
- Yes, a little bit. But profiling is done during development, not during production notmally.
- I see, I thought one might use profiling on a separate thread to track long-running code by sampling CPU/Memory
  - Which (to some extend) also slows it down (except if your code is de facto single-thread/cpu).
can you use Scalene if you have a python code block inside a bash script? like a code chunk inside *.sh?
- You usually call scalene with scalene python your_python_script.py or python -m scalene your_python_script.py so you can have it inside your bash script where you call Python.
is there something similar to this in R?
- Here's few intros to profiling in R: https://bookdown.org/rdpeng/rprogdatascience/profiling-r-code.html http://adv-r.had.co.nz/Profiling.html

Exercise: Profiling, until xx:30

https://aaltoscicomp.github.io/python-for-scicomp/profiling/#exercises

download the book (https://www.gutenberg.org/cache/epub/2600/pg2600.txt), rename it to book.txt
try to anticipate which function spends most time and which spends most memory
verify this with the scalene profiler (which creates a report that you read in your browser)

I am:

done: oooooooo
not trying:
had problems: oo
wishing for more time:
done but have no idea how to interpret the results shown by Scalene: oooo

which book?
- https://www.gutenberg.org/cache/epub/2600/pg2600.txt
- we chose this because this was one of the longer texts that are public domain
I'm confused, does the second function use zero memory?
- Almost. Since it reads line by line, there are only ever a few kb of memory being used and then discarded again.
when running scalene example.py I get "AttributeError: partially initialized module 'argparse' has no attribute 'ArgumentParser' (most likely due to a circular import)"
- Is this example.py from here or a previous lesson? - Here
- Hm… I dont' see argparse included anywhere here. I wonder if there is something weird about your Python installation that is making some problem. Let me think…
- How have you installed Python and where are you running?
- Yes it is weird. I have Python 3.12.7 and am using the courses conda-envitonment
- Do you think there could be previous Pythons installed, or packages installed with pip install --user? (if you don't know what this means I assume not…)
- it also says: scalene/scalene_profiler.py", line 24, in ``<module>`
  import argparse causes the problem
Argparse: I will try to use another python version. Does not solve the issue. Should I remove all previous python packages and only have one version?
- Maybe. My guess is it's other stuff installed, possibly pip install --user, that is mixing things up. I don't know exactly how but it's in the realm of possibility.
How long is this meant to run for? It's been minutes.
- did it open a browser page? it should finish in seconds.
I got this error when I first ran the profiler: "Scalene: The specified code did not run for long enough to profile.By default, Scalene only profiles code in the file executed and its subdirectories.
To track the time spent in all files, use the --profile-all option." It somehow worked on a following attempt…
- Hm, to me this means it ran too fast and thought "I'm not getting good statistics so I can't show results". I guess we should improve our example some…
The output I get looks very different from the output in the solution. And it only profiles the time, not the memory.
- I had to do scalene example.py --html > profile.html. Did that help? Or is your version different?
  - That didn't help. When I run scalene example.py I get profiling results for the time, but also the error message "NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.Run once as Administrator or root (i.e., prefixed with sudo) to enable per-process GPU accounting."
  - I got an output in the browser without scalene example.py --html > profile.html, but nothing when adding that bit, how can that be?
    - was a file profile.html created? That's what this command does.
      - For me, yes. But when I open it I only see a blank page.
      - same here, blank page
      - same here, blank page. works better when I run only scalene example.py
        
        same, but I get this message (same as someone above): "NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
        Run once as Administrator or root (i.e., prefixed with sudo) to enable per-process GPU accounting." And it does not look the same as the solution.
.
% of time = 100.0% ( 19.658s / 19.658s) Do worse than my laptop! I challenge you all
- :-)
- easy! > % of time = 100.0% ( 23.426s / 23.426s)
  - I kneel…
  - Pretty close % of time = 100.0% ( 15.651s / 15.651s)
  - Mine % of time = 100.0% ( 7.700s / 7.700s)
  - mine is % of the time = 100% ( 5.587s / 5.587s)
    - Oh you sweet summer child… What do you know of the winds of winter…
    - you are an outsider in our club
sorry. i am new to this. How to save the cell script as file.py?
- Try JupyterLab -> File -> New -> Python file, and then pasting the contents of your cell there.

I got this error:

  >scalene example.py
  Traceback (most recent call last):
    File "C:\Users\xxx\scalene_profiler.py", line 2099, in run_profiler 
      code = compile(
             ^^^^^^^^
    File "C:\Users\xxx\example.py", line 225 
      BOOK FIVE: 1806 - 07
                        ^
  SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
  NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
  If you have sudo privileges, you can run this command (Linux only) to enable per-process GPU accounting:
    python3 -m scalene.set_nvidia_gpu_modes
  NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
  If you have sudo privileges, you can run this command (Linux only) to enable per-process GPU accounting:
    python3 -m scalene.set_nvidia_gpu_modes

did you save the book as example.py?
- YES. I also correct the number 07 (I put 7 instead) It seems it worked because I didn't have errors, But I cannot visualize the results.
  - Save the book as book.txt and the code as example.py
    - Ok I did it wrong… I converted the content of the book.txt into a .py file whitout using a code… I copy and paste the content in jupyter and then I exported it in executable file… ops

The second function is slowing down due to the for loop!
Regarding "run the scalene profiler on the following code example and browse the generated HTML report to find out where most of the time is spent and where most of the memory is used" - how do I do that? Which line of code? I saved and renamed the textfile. I saved the code from example.py as python file within my Jupyter environment ("+" > Other > Python file). All is in the same folder. When I use the code "scalene example.py" within a cell, I get the error "invalid syntax". When I open a new terminal (in Jupyter) and run it from there, a new tab opens. Is that correct (it's not clear to me from the description)? The result looks different from yours.
- The command scalene example.py should be run from the Terminal. It should be a new tab separate from the main Jupyter ontebook. (If you paste the command line command to a Python notebook it probably says invalid syntax.) - Like said, I did that but it looks different (I assume you mean terminal in Jupyter)
Do we need to put our openAI API key to optimize the selected code lines?
- Is that mentioned somewhere here?
  - It's the sign of flash and lightning? Doable but don't know how, maybe need some extra setting or register?
- This is an optional feature for automatic optimization suggestions. You can optimize the code manually once you know the hotspots without AI assistance as well.
I am missing the memory data, I just have the CPU data on my report +1+1
- There is an open issue in scalene when running on Windows WSL with memory not showing up (https://github.com/plasma-umass/scalene/issues/714).
- I'm having a similar issue but I only have the Time, no CPU or memory
- Mine doesn't show results for function 1 :D
- I only have COPY and GPU, not CPU
- [–cpu] [–cpu-only] [–gpu] [–memory] options that may be added to the command line
  - even with –memory, I only see Time
  - used "scalene example.py –cpu –memory" and still I only see Time (same thing if I switched .py to the last slot with the arguments)
    - At this point, I would suggest writing a bug report. What OS are you on, what I could imagine is something like Windows-WSL not allowing scalene to access required measures (https://github.com/plasma-umass/scalene/issues/714)?
      - It's possible, I'm using Windows. Yep, similar problem to the issue raised there
- In some cases the memory consumption might be so small or so quick that it will be hard to measure. Scalene is a sampling profiler and it does not measure every quick variable creation because that would create unnecessary noise. For small memory uses Python garbage collector (the memory manager) can clear the memory usage before it is measured. For larger and longer memory usage (like numpy arrays) it is much more accurate.
Try downgrade Scalene from version 1.5.31.1 (2023.09.15) => 1.5.19 (2023.01.06) from a github discussion Nov 2023
- scalene –version
- Scalene version 1.5.41 (2024.05.03)
- mamba install scalene=1.15.19 not working but try older versions

Productivity tools

https://aaltoscicomp.github.io/python-for-scicomp/productivity/

I suppose most IDE's contain some linting(?) and formatting functionality?
- Yes! Many have it built in or it uses one of these external tools.
  - Is there any difference? I mean, for example what's the strengths of ruff function or does it have more functions?
Can you define what is autoformatting?
- It's doing things like making all the spaces consistent and matching some standard. It shouldn't change the effect of the code but changes how it looks.
When I use Spyder I neticed that it tells me if I imported a library that I didn't use. This is because it has integrated linters like Ruff?
- I don't know exactly what Spyder uses but yes, it could be. It might make its own thing but also quite likely using one of them. Edit: it seems like it uses pylint: https://docs.spyder-ide.org/current/panes/pylint.html
Are there such productivity tools usable with Jupyter lab
- Any language probably has such tools
- There are jupyterlab extension that fix the formatting of your code. I have not used them but a quick search gives for example https://github.com/mlshapiro/jupyterlab-flake8. There are other tools that we did not cover like VSCode/Codium which are code editors and they come with various plugins for these purposes.
How do the formatters use new lines? E.g.:
var1 = 1

var2 = 2

^ does this trigger something in the formatters?
- Yes, it can. Though I think for Python extra lines are "ok in some places" so it probably wouldn't change that.
  - Thank you. I've been wondering how pythonic the extra lines are.
Do you suggest that we pay for chatGPT to use the extra functions? Are the paid functions useful for coding?
- No. But depending on your institution you might already have the option to use those tools, and/or depending on your hardware, you could e.g. run a local model for the job.
- (Not the instructor) I've found that AI hallucinates a lot of code, depending on your use cases. To me, it's not worth the time.
  - The main hallucination I see is AI suggesting that things are possible with a specific library when they are not. +1
    However, I do use it for "simple" things, where it is just faster in writing the 4 lines, that are "standard" code.
  - Examples that are more present in the training set (e.g. hello world) tend to be reliable, specialized requests have hallucinations. The "speed" of AI is when I know what I want (load these 1000 files) and it is easy to code review. But let's not forget the environment: the average AI session uses 0.5L of water for cooling (compare to a google search which is in the order of milliliters).
    - I didn't know this! This is terrifying :( +1
    - That water calculation was done with GPT3.5; GPT4 most likely uses more. Reference and more links on these slides https://zenodo.org/records/14032261
The Ai generated code is useful when you quickly need a global "architecture" of sort to pinpoint with specific code from you. But you need to know workflows and algorithmical thinking
Question: How do you integrate the AI tool with your text editor?
- VScode and copilot https://code.visualstudio.com/docs/copilot/overview
- Cursor: https://www.cursor.com/
Hello, I had missed Day 1 so I am trying to go through the module now: for the miniforge, is this correct for the command?
- source /opt/homebrew/Caskroom/miniforge/base/bin/activate
- it's different than the installation instructions for Mac, which was: source ~/miniforge3/bin/activate, but it seems to work?
- Note: I did use homebrew to install miniforge

Feedback, day 2

News for day 2 / preparation for day 3

No special preparation needed for tomorrow
Tomorrow we will have:
- Libraries (how to reuse what exists, what to consider)
- Dependency management (what is conda and these environments? How to install what you need without messing everything up)
- Parallel programming in Python
- How to package Python code so that others can install it

Today was:

too fast:
too slow:
right speed: oooooooooooooo
too advanced:
too basic: o
right level: ooooooooo
needed more exercise time: ooooo
needed less exercise time:
I will use what I learned today: oooooooooooooooooooo
I would recommend today to others: ooooooooooooo
I would not recommend today to others:
too slow sometimes, too fast other times: ooo

One good thing about today:

A lot of practical topics without going into detail.
The pace was nice, concentration was fairly easy
Profile and Productivity are new to me, very useful, will use in future research! oo
Good overview of many very useful things oo
I was introduced to a lot of helpful seeming tools. Will look into them! ooo
The instructor today who explained vega-Altair is really good, he explains very well, I look forward to follow any channel if he has one +1+1
- YES exactly my thoughts
- Yes agreed, really good
The plotting session was very helpful to me: ooooooo
- Nice to learn a plotting tool in Python that is similar ggplot2 in R o
Scalene is really cool oo
- The problem was, that long running code was exactly what I would have written :P
A lot of new tools that I wasn't familiar with with regards to the profiling and productivity tools
I loved the plotting with Vega-Altair session, Profiling and also the productivity tool. Thank you for teaching how to convert jupyter into .py, I was wondering about it since months and finally i have the answer!
This is a very hands-on course that uses Jupyter, which is not very common. Jupyter is a great tool for so many reasons, so it’s valuable that you’re using and teaching it here, and that you’re introducing several ways to use it. The tutorial itself is very practical and therefore highly useful.

One thing to improve for next time:

More examples of using different file formats (e.g. Feather) in Python +1
In case you introduce new features (e.g. the terminal of Jupyter) can you do that at the beginning of the session? I had problems and basically the result was, that I could no more run the stuff and thus missed out quite a bit.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
More cat +1<3 +1 +1
It appears to me that there are sometimes small steps missing in the tutorial, which can lead to getting stuck
I would appreciate longer exercise time, as it’s easy to lose track if you get stuck

Any other feedback?

Could be interesting to replace the Exercise sessions with watching the instructors doing those - or any other example tasks. To learn the working practices. And the exercises can go to homework etc. YMMV.
Would be interesting to build on the scripting section and show how to combine them to build projects/packages +2
Great presentations on Plotting with Altair and Profiling +1
I love this hybrid course format using twich + board. Skeptical at the beginning but…WHOA I LOVE THAT +1+1+1+1+1+1
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Congratulations on putting together such a great course, I'm super impressed! +1+1+1
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Thanks!
Where we can find the videos of recorded session?
- Videos are on Twitch for 7 days and after that, there is a YouTube channel (https://www.youtube.com/@aaltoscientificcomputing3454, iirc) where the videos will be uploaded, once they have been post-processed.
  Thank youu!!
This could easily be a 1 week course or even a summer school
- Yes we know… and that's part of the problem, there is far more than we can teach. We can only hope to give an intro.
- What about offering a second course that builds on this one here? I guess it would be interesting to many people? It's so useful to have a course that is so closely aligned with everyday work.
  - We had this https://scicomp.aalto.fi/training/scip/ttt4hpc-2024/ kind of covering more advanced topics, especially for HPC systems
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
  - But if you want something more related to coding (e.g. examples from machine learning were asked in the past), we can think of a "python for sci comp part2". Here a survey to suggest these things: https://link.webropol.com/s/scipod
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →

Day 3 - 7/11/2024

Icebreakers

What's the most surprising thing you have learned so far in this course??

Scalene is a combined memory and speed profiler
the profiling although Scalene still fails to report memory, have to check version compatibility etc
Vega-Altair as an alternative to matplotlib ooooo
Xarray to handle multidimensional data o
Jupyter is actually a very handy tool for readability
Python is easier, than thought.

And the most surprising thing you have learned in your career?

The endless and various ways in which people can mishandle data, by accident or by negligence
Love this question, very difficult to answer though. Probably how you have to learn to struggle with feelings of insecurity about how much you don't know etc. oo
First you think, then you code
- I actually need to code in order to think how implement a solution
  - I see your point. Try to train your algorithmical thinking. It's a game changer
The amount of problems that experts don't have a good answer to. ooooooo
Document everything because I will forget everything ooo
- Unfortunately, I often forget to document
Earlier I tried to make things sophisticated, now I try to go for the simplest solution <-THIS

What are your ideal winter holidays?

Quality time with family and good food
Sleep +1
Advent of code puzzle solving +1
A lot of Xmas lighting
Fireplace and books +2
Skiing in fresh powder snow
Watch cheesy Christmas movies
Outdoorsy activities during the day, good food, friends, family and good books and movies in the evening +2

Now: Library ecosystem

https://aaltoscicomp.github.io/python-for-scicomp/libraries/

This is a discussion-based session, do comment and ask questions!

You can also list your favorite package from your field of research below!

NumPy oooo
Pandas oooo
gffutils, Seaborn, pyvenn (2 ~ 6 sets venn diagrams), UMAP
Biopython SeqIO, Enterez…
Keras and Pytorch o
os, sys oo
math
Crosspy
SciPy oooo
tidyverse ooo
hdbscan o
matplotlib oooo
scikit-learn o
RB: click, vega-altair, flit, scikit, tabulate, colorful, tqdm, jax, and standard libraries (e.g. https://docs.python.org/3/library/collections.html)
polars
MNE oo
ST: black (linter), accelerate, duckdb, sqlalchemy, fastapi, pydantic, click, dataclasses (base Python & pydantic), pandera, optuna, PyYAML, llama-index, LangChain
Is there a web site that we can go to in order to check if a publicly available library is no longer used or no longer maintained? Google allows us to find old web pages all the time and it's hard to determine if a library is no longer used by the community. (Ok! Thank you!)
- We'll talk a little bit about "how to decide if something should be used" including maintenance. Often we check the GitHub or Gitlab pages to see "is it still active?". A project with the last commits 10 years ago can be a bit questionable. With open source (well, everything) you sometimes you need to do a bit of digging.
  - As an Ubuntu user, my "go to site" is packages.ubuntu.com. I guess there isn't a central repository like this for Python (that marks packages that are outdated, etc.)?
  - pypi.org is a main one. You can see most recent releases and if it is being developed. (unlike Ubuntu there are no global releases, so you don't know which packages work together with each other). anaconda.org searches conda packages.
How do you know whether functions implemented in a library are correct/ reliable. I assume big packages like numpy are tested by many users and most likely errors get corrected rapidly, but when using a smaller packages, sometimes I really wonder how much testing has been done and if I can trust the results in everycase.
- Ideally also smaller librarys have their test cases documented at least, so if you look into the code, you should be able to see at least some of it. If it's on github, it's also a good idea to check the issue tracker, to see what issues have been found, how fast they have been fixed, etc.
- Oh yes, this is a good question. It needs to be asked for everything, compared with the risk of it going un-detected and affecting core results. If we are scientists then we can apply our critical analysis to it.
PyTorch + SciPy + PyTest
PyYAML
PyBind11
argparse o
collections.OrderedDict
requests, lxml, boto3
Do you trust more a Github repository or a library based on your experience? For a specific use-case
- Really depends. I would critically analyze both. It's easy to put a library on PyPI so it does'nt mean much (but if it's not there, maybe it's a sign it's not intended for reuse)
How often do you branch repositories from Github and modify them for your own purpose?
- I do it when I need, not too often. It can solve a current problem but in the long term it might be something I have to deal with later, when I need to re-sync with the library.
Basically you have to be a sculptor in this AI-code era, knowing what you're doing and what to look for etc
- yes! and since AI-tools tend to centralize data to few companies, I believe it remains important that we learn the basics to not lose control in future.
How do you keep code easy to reconfigure, so that if a company with a free tier goes rogue ?(
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Anaconda)
- Heh, good question. I guess you have to keep this in mind when you design it, make sure you don't 100% depend on one thing too much. If that's even possible. Maybe it is: don't critically depend on too many things, only the most important parts.
- decentralize. and also deposit your code to persistent services like zenodo.org. make your code citable and persistent so that it survives services coming and going. I remember the time when github did not exist and the services that we used back then are long gone and have been bought up and discontinued.

Dependency management

https://aaltoscicomp.github.io/python-for-scicomp/dependencies/
In this we learn how to install things in isolated environments, so each project can get its own libraries and not interfere with others.

Mamba Supremacy, conda is slow and fails to point out dependency issues
- but latest conda (since version 23.10.0) uses mamba as default resolver
What's the difference between Conda and Anaconda?
- Conda is the open source package management tool, Anaconda is a distribution of conda + various libraries (and graphical interfae tools such the Anaconda navigator)
Do installation debuggers exist? For example, if there are conflicts among package versions, is there any tool that can identify the source of the conflict?
- I haven't heard of that but it's actually a pretty good idea. I have tried to use pipdeptree and condatree which list all packages and their dependencies (or reverse dependencies) to examine the situation when needed
- installers will try to resolve versions and if they cannot, they typically stop even before installing anything. So what I wanted to say is that you will notice if there is a version problem before the installation completes.
- Check http://libraries.io/. For example here the dependency tree for the most recent Matplotlib https://libraries.io/conda/matplotlib/3.8.4/tree Not exactly a debugger but you can get the full view of dependencies and eventual conflicts
What is the difference between Anaconda Navigator, Anaconda prompt, miniforge, conda? What is Anaconda and Miniforge? Are they software ? Programs? Interfaces?
- Anaconda: a distribution that contains:
  - conda (the open source package manager)
  - various python libraries and tools (numpy, jupyterlab…) in the default conda channel -
  - Anaconda navigator (a graphical tool behind the terminal commands like conda install, jupyter-lab, etc)
  - Anaconda prompt: a command line interface where the prompt is able to "see" the conda tool, the installed environments, the libraries
- Conda: an open source tool to manage python package installations
- Miniforge: a light version installer of conda (other similar tools are miniconda and micromamba)
If Miniforge is free and uses free channels, I presume one could (mistakenly?) add Anaconda's channels to it, so that it is no longer free to use? i.e., free vs non-free comes down to just what channels are used?
- I think this analysis is correct. But at least it's not done as a default.
- I have heard (I think it was a reddit thread) that copmanies are blocking the anaconda channel so that these mistakes can be avoided (e.g. within the VPN of the company those URLs are not accessible)
  - It seems that if I add an Anaconda channel, I don't get a prompt from conda, mamba, etc. if I agree or not agree to the terms, etc. That seems unfortunate…the user should be warned…
Opinion on using a package management setup of mamba+poetry for dependency management and reproducibility?
- to decide which package management setup I choose, I ask myself: is this a script/code that I am developing? or is it a library that will be used by other scripts/libraries? if it's something that is not imported into something else, I often stop at environment.yml or requirements.txt. if I am developing a library, then I often need more (packaging lesson later). poetry is fine. personally, I like "flit" when I need to create a pip-installable package. later we will mention also other alternatives. One tool I want to explore more is https://pixi.sh/
- Containers are also an option for "freezing" a certain conda environment into a container image. More about this (in the context of HPC systems): https://coderefinery.github.io/hpc-containers/
- I have been exploring the use of Pixi as a drop-in replacement for mamba+poetry, as I prefer that stuff remain resolvable, and I don't go to dependency hell for packages from PyPI that have become obsolete at that point. It's exactly what I wanted it to be - uses conda-forge as the default channel, with a good interface as you'd expect from Poetry/Yarn. Thanks for mentioning flit, though, I was not aware of it.
So, just to make sure I understood correctly, we have different repositories, each containing different libraries, and to access a certain library from a repository I need to choose a channel? If within the same tool and repo some channels are open source and others aren't, why would I want to use the pay based ones?
- you would want to use the paid, because not every channel has every package, so you might have the package you need on the paid channel, for example –- or use a different tool ;)
Is Docker a similar concept to virtual environments? (Thanks everyone!)
- Hm, in a big picture sort of. It's a way to isolate what is needed to run one program. Docker basically packages a whole operating system. virtual environemnts are only the Python libraries.
- (Not an instructor) My feeling is that Docker isolates things "more" than Anaconda or Python environments. If you start up a Docker container, you have to do something more to peek inside. With Anaconda, etc., one big change is that the PATH environment variable is changed.
  - Correct! With docker or singularity/apptainer you not only freeze the python libraries, you also freeze the whole operating system, so if a tool only works on Linux Ubuntu 16, you can run that operating system within the container with the tool installed and anything else needed.
- We have a coderefinery episode that exactly covers this: https://coderefinery.github.io/reproducible-research/
What are the dependences?
- is this what you are asking? :A "dependency" is something else that's needed for something to run. For example, you might make code that needs numpy to run: it's a dependency. This lesson is about "how these can be tracked and installed so that you don't end up with lots of chaos"
  - So, if I understood well, if I'm using numpy, a dependency is a library that numpy needs to work, but instead of add manually all the numpy dependency, there are tools that install all the dependency forn numpy when I install it?
    - Numpy is the dependency in this case
      - So, when I install an environemnt I install also the dependency like numpy?
        
        Yeah, your environment file might look like this:

name: my-environment
channels:
  - conda-forge
dependencies:
  - numpy

If I build my own library, how do I make it so others can install it without simply sending them a file?
- Oh, this is the last session of the day! Packaging.
  - I've had one for 4 years and I just send it. lol
If I create a conda environment by conda env create --file environment.yml with a set of libraries listed under Dependencies in the environment.yml, will conda by default install the very newest version of each of the listed libraries?
- Correct (or it tries to find the latest of each library that is compatible with each other)
I don’t know when and why to use pip. What is the idea behind this? I feel I am mixing things when I use pip with conda. But sometimes I need to do both.
- Pip is made for Python-only things (but some things like scipy with C code have been hacked into it). It works OK for pure Python things. Sometimes stuff isn't in conda so yes, pip and conda need to be combined.
- As a guideline, use pip only if something is not available through conda, if you decided to use conda (some people prefer not to ), but be aware that this will most likely happen
- You can also install Pip packages to conda environments: https://aaltoscicomp.github.io/python-for-scicomp/dependencies/#additional-tips-and-tricks This can be helpful when you can get 90% of stuff from conda, but are missing the last 10% from pip.
What happens if you realise when working with your code that you need more packages that are not in your environment?
- You can create a new environment file, then create the new environemnt, and work with that and abandon the old one. This is better for reproducibility
- Or - if you are not too worried about reproducibility - after activating the conda environment, you can manually run "conda install" commands. This is ok if you are experimenting, but then it makes it challenging to track what was done manually when it's time to replicate the environment on another system that you use, or to give it to other colleagues.
  - (and then remember to add it to the environment file eventually…)
  - Simo mentioned updating the environment?
    - There are a few ways you can handle this. One is to simply conda install package-name on the fly and then do a conda env export > environment.yml to update the dependencies in your environment file with the newly installed package. Another option is to manually edit your environment file first and then do a conda env update --file environment.yml to install the new package.
- Usually just updating the environment specification by adding the new package and updating the environment works: https://aaltoscicomp.github.io/python-for-scicomp/dependencies/#adding-more-packages-to-existing-environments
What is the difference between importing them into my jupyter notebook and writing this package as a dependency in my environment?
- Importing in your jupyter notebook makes the functions from that package available in your current session. Any packages that you import into your jupyter notebook should be reflected in your environment file so that they are installed and available in your environment. 💡
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- Simo mentioned yml-syntax for the environment file. Can you point me to a tutorial/URL for information? You are obviously using a certain way (yml-syntax) to create this environment file. I would like to learn about the how and why (not only have an example).
  - Here is a link to today's tutorial: https://aaltoscicomp.github.io/python-for-scicomp/dependencies/#creating-python-environments Particularly focus on the sections on conda env create, update, and export. create creates an environment from a .yml file and export will create a .yml file from your current environment.
  - Thank you… it doesn't really answer my question. It's more about how does miniforge and anaconda read an environment file (aka HOW does it need to be written), how does the import of the packages work, how are the inner workings basically.
    - So is your question primarily about the syntax of the file?
      Image Not Showing Possible Reasons
      The image file may be corrupted
      The server hosting the image is unavailable
      The image path is incorrect
      The image format is not supported
      Learn More →
      - Maybe this would be helpful: Create an env file manually
      - This page shows all of the possible options you can add to environment: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
Is it better to be more precise in defining the dependencies in the environment file (python=3.9, bokeh=2.4.2 ) or rather "general":numpy, pandas)
- It depends :D Sometimes you need to fix certain packages (e.g. cudatoolkit) to exact version (to match GPU nvidia drivers in the case of cuda) so you can have a few packages with exact versions because it is necessary, and leave conda to figure out the rest of the versions. What is important is (IMO) that after creating the environment on day 1 of your new project, you stick to that until the end of the project. Changing versions in the middle might lead to unpleasant surprises (e.g. default options change with different versions and you stop getting the same results).
Is the rule "one virtual environment = 1 Project" real? To save up disk space I am scared of having to create multiple environments just because 1 library requires python 3.8 max while others python 3.11!
- It's not a real rule, but may be a good idea: disk space is pretty cheap and most stuff is small, and the time you might spend to fix problems can be really large. So one per project and "delete and recreate if anything goes wrong" can often be a good idea. But if you share among projects, there's nothing intrinsically wrong with that.
  - (Not an instructor) I think this is the default recommended way to deal with projects in Julia. The installer downloads everything globally, but uses only specified packages so multiple "environments"/packages effectively share a collective package space.
- Conda caches packages, which reduces disk usage a lot (it reuses the same package across multiple environments). You can also remove the environments if needed because you can re-create them from the environment.yml or requirements.txt.
Library and package are synonyms?
- For our purposes basically yes.
Can I create an environment yml from an existing environment with a tool or such (where I forgot about its packages and dependencies)?
- conda env export will make one based on what is installed, but it can be too specific to be useful (includes every exact version and build). conda env export --from-history will export only what you have requested. These are both fine ways to start off, but usually should get some manual editing to fix them up.
- See also: https://aaltoscicomp.github.io/python-for-scicomp/dependencies/#exporting-package-versions-from-an-existing-environment
What is a channel is simple words?
- In conda, a singel-person or single-organization way of distributing packages. So that I can release stuff without it interfering with everyone else. For example there is a pytorch channel that gives official pytorch packages.
can I reuse packages installed in a given environment1 into another environment2 or do I need to reinstall everything with each project? (Can be quite memory heavy to reinstall everything each time)
- Usually re-install. For most things the sizes aren't too big and the time saved with solving problems is worth it. (you don't really "reuse an installed package" - you would reuse the whole environment, there is some related question above. It caches downloads from PyPI/conda so it doesn't re-download)
  - Sounds good, thank you :)
- I think conda might be clever and reuse the installed files to save space (hard-linking the files). Does anyone know if this is true?
  - I need to verify, but the cache of downloads of packages is shared between environments, but the actual environments are independent, no linking.
    - Conda caches package downloads and uses hard-links when it installs packages so installing the same package to two different environment should not use any more disk space.
when would I want to use a * .yml file v. a docker file?
- Docker (requires root) and apptainer (no root needed) can have a little bit of "overhead". Furthermore usually once the image is created you can't really edit things inside (this is a simplification). So a simple rule could be that for "live" projects on a system where you can install, expreiment, develop, the yml environment is more flexible.
- For taking your developed pipeline into systems where you cannot install things (e.g. HPC systems like LUMI) then containers are the way to go.
- However nobody stops you to always containerize and work with containers
- Can you 'ship' or easily deliver the environment + the script to a colleague easily?
  - Yes with container you can "freeze" an environment and make sure you can give it to others. With the environment file, even if you use the same env file in the future, some of the libraries might have disappeared and so you cannot re-create the past environment anymore unless you are lucky enough to dig out old versions from various github repositories. So the yaml environment is reproducible in the short term (months) while a container image is reproducible "forever" (=until the container intepreter works)
  - So can you just specify the version of the libraries in the env file to avoid this (that it doesn't take the most recent)?
    - You can specify all detailed version of all libraries in the env file, but in a few years from now those versions might not exist anymore in the channels where you can find them today.
  - But it sounds like the container overall might be less 'flexible' but it's difficult to modify (for others later)?
    - I was simplifying earlier, it is possible in some cases to modify containers, but it gets complicated. Container is best used as a "freeze" of operating system + all other software needed.
  - Like I am thinking about shipping the image (but not the dockerfile for example)
    - Exactly. The dockerfile itself is not reproducible. The dockerfile that you use today to create a docker image might not work anymore in 2 years (we had this case in a CodeRefinery workshop a few years ago. :))
    - So there is really 'no best solution' in this case, it's just based on use case?
      - We always aim at the good enough solution :) thank you :)
      - Classic paper on the topic https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
      - oh awesome!
It's past a break time.
- Are we having time to answer the exercises from this section?
  - Unfortunately not.
Are the dependencies in a yml file installed in the specified order? Sometimes you need them in a specific order.
- Hm, I'm not sure. I know that for pip+requirements.txt it did matter and I could solve some problems by changing the order (because pip didn't resolve all dependencies at once). Since conda/mamba have completely accurate solvers it may not matter.
- Requirements might need to have installations in a specific order. For conda dependencies it does not matter. For conda channels order matter (highest priority channel is given first).
How safe is to use pip freeze to get old environment or are there any alternatives?
- It won't break your environment. At worst it might not be installable on a different operating system (if some versions only work for one) but that can be solved manually later. I'd say it's always better to make a "safety export" so you know what you had, for reproducibility.
I think this lesson should be moved to Day 1 somehow
- We've been talking about that: at least some, maybe not the whole thing, can happen before Jupyter so that we understand how to activate the environment and what it means. This is useful feedback!
  - Maybe put Xarray day 2?
    - I guess they put Xarray in day 1 because NumPy, Pandas, and Xarray go together – same theme. What we really need is more hours in a day.
      Image Not Showing Possible Reasons
      The image file may be corrupted
      The server hosting the image is unavailable
      The image path is incorrect
      The image format is not supported
      Learn More →
What is the connection between the download of the packages into your environment, the command pip install in the beginning of your code and then import (needed) library? I could also ask it in this way: Why do you need pip install in the beginning of your code and how is it connected with the environment?
- You have a python installation in your system. Let's say you don't have an environment. Pip install will install under the global python installatino of your system: very dangerous thing to do because it is tricky sometimes to understand why some old library keeps appearing
- You create an environment and activate it: whatever happens inside this activated environment (e.g. pip install xyz) will stay inside the environment, files will be added in subfolders of the environment, and you don't affect any other environment (or base python installation) of your computer.
- Import is when python runs and needs to tell "we will need this libraries in this script". Those libraries need to be visible in the environment where you are, otherwise python does not know where to find them.
- Pip install (or conda install) inside an active environemt is something you run once. Once things are installed you don't need to re-install them each time. You can also upgrade the version of installed libraries in the environment, but that's dangerous from reproducibility point of view, you might start getting different results.
- I think this is an interesting and a good way of looking at the thing. Maybe it could be answered that creating an environment is similar to installing the Python interpreter itself, while installing a package is the process of downloading the package and extracting it to the environment. Now when you call the Python interpreter and tell it to run your code, it will read the code and when it encounters an import statement it will look for a library with that name in the installation directory. At this point it will find the pip-installed package and it will take objects and definitions from file to memory. Of course it can then import another package or external library recursively until at the end you have everything loaded.
  - Correct! And the python interpreter is actually different in each environment. So you can have one environment that uses python 2.7 and another with python 3.10. Also the python interpreter is downloaded inside the environment subfolders. Environments can also contain non-python stuff like c++ libraries, cuda drivers, executable files. Once you activate it it's like "forget about the stuff I have on my system, in this command line we are now using all the python+libraries+software that come from this environment subfolder "
  - Thank you both for that detailed answer! What is the diffence between conda and pip install? I also dont recall writing pip install in the beginning of the exercise code. Why did it work after all? Can you define (explain) "Python interpreter"?
    - conda installs from conda channels, pip installs from pypi servers. Pip can also install from github repositories. So the physical place where the library comes from is different.
      - The libaries are also built differently. For any external code (C,C++,…) Pip packages contain that for each package. Conda-packages usually instead install the external code via another library so it can be re-used among multiple packages.
    - you can define python interpter inside the env file like any other dependcies. For example in the coderefinery workshop environment, we want a python interpreter that is version 3.10 or higher https://raw.githubusercontent.com/coderefinery/software/main/environment.yml
    - Python interpreter is the python-command. It will look at your .py-file or contents of your notebook cell and it will determine what to do with the code. Python is interpreted coding language so code won't be compiled beforehand, but instead when it is encountered. So Python interpreter == Python executable.
      - Awesome thank you so much! That was very helpful💡

Break time, we resume at xx:30

As always you can keep asking questions
After the break we go to Parallel
(Note: lunch starts and ends 10 minutes later)

Parallel

https://aaltoscicomp.github.io/python-for-scicomp/parallel/

uh oh, I tried running the pool code and now my jupyter code cells don't run
- Lower down on the page? You can try "restart kernel" from one of the menus. Sometimes stuff gets stuck, parellel programming has plenty of things that con go wrong unfortunately, which can end up taking lots of time to solve.
  - It worked!
multithreading. Is it about running, for example, "multiple" parallel -for loops- in different threads?
- "thread" is a very low-level operating system concept. It can be used different ways in Python, but yeah, it can possibly be used like you say.
- Threading is often used for e.g. web applications where the program can start multiple threads that e.g. wait for response from a web user. Multithreading is not usually used for doing more stuff, but for doing stuff while something else is waiting. Multiprocessing is more often used to do stuff with multiple CPUs.
  - Thanks for the clarification
is it normal that i got this value: time spent for inverting A is 1.69 s
- I guess you have a very fast computer for inverting matrices. This is fine (but might mean that the benefits of parallel may be less noticeable).
What is going on with the inverse calculation? Does it use multithreading implicitly or are you going to show it can be speeded up?
- Numpy will automatically use multithreading if it can.. So it's a small demo that you can get parallel automatically by using the right libraries. (The best parallel is that you don't notice, since nothing is going wrong!)
What is the meaning of the line "with Pool() as pool:"? Is like "I will use the function Pool() as a method notation? " Or why the () disappeared?
- It's called a "context manager" if you want to read more. It's basically the equivalent of pool = Pool(), but will make sure that the Pool is closed and de-allocated after you leave that code block, no matter what happens. Useful when you want to make sure there aren't tons of Pools open using lots of resources in the background.
If I well understood there are libraries that allow us to separate the calculations we are doing so that they are conducted in parallel on different processors (??) and then merge the results into a single final object/result.
- Correct: some problems can be split in parallel units (e.g. matrix multiplication). And sometimes the hardware can make a difference (e.g. matrix multiplication on a GPU - graphical processing unit - can be 100x faster than on CPUs). Not all python packages might be working also on GPUs, so sometimes this is confusing "why is my code not faster even though I have the best GPU in the market". Hopefully what I wrote is not confusing :)
- This "split to parallel units, run separately, combine" is very powerful but more importantly is easy to understand and implement.
  - These problems that can be solved in this way are sometimes called "Embarrassingly parallel". See a list of such problems: https://en.wikipedia.org/wiki/Embarrassingly_parallel
Is the %%timeit from the random package? or is that always available?c
- %%timeit is a "magic" from Jupyterlab, i.e it's not part of the python code, but it's a tool provided by Jupyter. If you use a different IDE or editor, you would have to use a different way to measure the time the code takes

Exercise Parallel-1, Multiprocessing (20min) AND lunch break (1hr), we return at xx:15 (12:15 CET, 13:15 EET)

https://aaltoscicomp.github.io/python-for-scicomp/parallel/#exercises-multiprocessing

Don't forget to take your lunch break! You can divide up time how you'd like
Try to do this exercise. If you can't, it's OK, the next ones don't depend on it.
You'll take this working Python code and split so it runs sample() with a multiprocessing.Pool, and combine the results.
There are unfortunately many things that can go wrong here. If everything breaks mysteriously, don't feel bad: it's a thing to improve later.
Q&A may be slow as instructors also on lunch break.

Triend to run pool in terminal… After the last line it kind of went crazy and threw an error

Process SpawnPoolWorker-1:
Process SpawnPoolWorker-2:
Process SpawnPoolWorker-3:
Traceback (most recent call last):
.
.
.

How did you try running? (yes, very many things can go wrong here…)
- Same as the lecturer did
- The demo in this section I guess: https://aaltoscicomp.github.io/python-for-scicomp/parallel/#multiprocessing
- What operating system are you on / you are running from the Python terminal?
  - mac / yes
  - the other lines worked fine, like just shown. After hitting enter after 'pool.map(square, [1, 2, 3, 4, 5, 6])' it just …
    - you could press enter again and it will show you the results
  - Anyone else: Has it worked for Mac for you? (going to lunch… will return later)
- Did you use multiprocessing or multiprocess? I got the same result as you using multiprocessing, but it worked if I changed to from multiprocess import Pool
  - multiprocessing; trying the other now
    - no module called multiprocess apparently. So neither works for me
      - it's not in the environment they give us; you have to install it using conda or pip first
        
        Which one? Or either?
        
        'conda install multiprocess' worked fine for me
        - Thanks! After installing multiprocess, this worked for me too. In jupyter as well (might help those in the thread below)
- Yes it seems that with Mac multiprocessing is not working and you need to import multiprocess (a fork of multiprocessing).
Could we get some written definitions of the key terms like processes and threads? And how they relate normally since GIL and multiprocessing seem to be a bit of an exception from the norm based on how those are introduced in the lesson documentation?
- That's a good point. We should expand the page and cite a good reference. For now maybe the focus is somewhat "intuitive" in the sense that many computations are running in parallel.
- In general, we convert your suggestions to "issues" in our materials, and then hopefully they get implenented :) Everyone is welcome to contribute!! https://github.com/AaltoSciComp/python-for-scicomp/issues
- But in a sentence, what is the difference between processes and threads from the instructors' perspective? Probably the instructor's intuitive definitions and distinction, even without citations, are better than mine.
  - I will highlight this to the instructors.
  - (Not an instructor, but I've been studying this in recent months) My understanding is that a set of threads communicate using the same memory (i.e., they exist within the same computer, sharing the same memory); processors can be on the same computer, but are usually different computers in a network (i.e., each have its own memory) so they communicate by sending and receiving messages to each other. If you have a cluster of 4 computers with 2 CPUs each, and 10 cores per CPU, you have 4 * 2 * 10 = 80 cores in total (which implies that you can support up to 80 threads, or sometimes even 160 threads). But you can "multithread" at most 20 threads within each physical computer (i.e., shared memory part using OpenMP). Between the 4 computers, you need to use distributed memory communication, which is MPI. …hopefully this is mostly correct…
    - Good explanation. One thing to add is that one needs to distinguish between cores on a chip, and on threads that can run on these cores (added this to paragraph above)
  - Processes (or ranks or tasks as they are called in context of MPI) execute (up to a point) independently of each other within the runtime that the operating system provides. A processes uses a chunk of memory. It cannot directly access the memory of other processes. For this explicit message passing is needed.
  - Threads are spawned within the execution of one process. The threads share memory. This is convenient, but also a potential pitfall if more than one thread try to read or write the same memory. The concept of a program being threadsafe.
%%timeit is for Ipython notebooks. If I want to run the code in Pycharm should I use import time and time_start = time.time() /…/ time_end = time.time()?
- Yes that's a reasonable solution. You could try to use timeit as a module, but that may be hard (I don't really do that). So I'd use what you have.
- (timeit is fancy because it runs multiple times and takes an average, but for our purposes it's not needed)
- I have created a Jupyter project in PyCharm for this course, and there %%timeit works
- (Not an instructor) I personally use time.perf_counter() for long-running functions, and time.perf_counter_ns() for short-running functions. Apparently time.time() lacks precision, and is susceptible to clock drift and system load. It can sometimes also cause significant overhead, which skews the timing results. Due to an unholy combo of clock sync and rounding errors, it may even be possible to get a subsequent time to be before a previous time.
.
Where is the cat?
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- Sitting by the window. I'm pretty sure it will get up and come by the end of the day.
I followed the solution for the exercise using multiprocessing but when I run results = pool.map(sample, [10**5] * 10), I run into error message AttributeError: Can't get attribute 'sample' on <module '__main__' (<class '_frozen_importlib.BuiltinImporter'>)>
- Are you running %%timeit or something like that in the cell or just map?
  - %%timeit then results = pool.map(sample, [10**5] * 10) (and before this I do import multiprocessing.pool and pool = multiprocessing.pool.Pool())
    - Have you run the cell that defines the sample-function before running this cell? It sounds like sample is not defined or timeit cannot import it. Can you try runnin without %%timeit? Does it work?
      - Restarted kernel now. Doing this order: 1) import random 2) def sample(n): ... 3) n, n_inside_circle = sample(10**6) 4) import multiprocessing.pool and pool = multiprocessing.pool.Pool() 5) results = pool.map(sample, [10**5]*10). Everything works fine until 5) Same error again
        
        Are you running %%timeit around the whole thing? Try running part 4 in a separate cell and putting %%timeit over that.
        
        I did not use %%timeit at all this time
        
        Was the sample-function unmodified?
        
        Yep. Using it as originally written
        
        What editor/OS are you using?
        
        Jupyter Notebook
        
        But it also fails running as a script (.py) from terminal. See below:
For the above thread:

import random
import multiprocessing.pool
pool = multiprocessing.pool.Pool()

def sample(n):
    """Make n trials of points in the square.  Return (n, number_in_circle)                                                                                                            
                                                                                                                                                                                       
    This is our basic function.  By design, it returns everything it\                                                                                                                  
    needs to compute the final answer: both n (even though it is an input                                                                                                              
    argument) and n_inside_circle.  To compute our final answer, all we                                                                                                                
    have to do is sum up the n:s and the n_inside_circle:s and do our                                                                                                                  
    computation"""
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n, n_inside_circle

results = pool.map(sample, [10**5]*10)

print(results)

This fails similarly in terminal as in Jupyter

Does it give lots of errors? It might be related to the multiprocessing vs multiprocess (a better version of multiprocessing library).
- Yep. I now tried to change multiprocessing to multiprocess (both places in top of code), but then it threw error message in File "/opt/anaconda3/lib/python3.12/site-packages/multiprocess/pool.py" and File "/opt/anaconda3/lib/python3.12/site-packages/multiprocess/queues.py" saying AttributeError: 'NoneType' object has no attribute 'dumps'

Try running the following:

Cell 1:

import random
import multiprocessing

Cell 2:

def sample(n):
    """Make n trials of points in the square.  Return (n, number_in_circle)

    This is our basic function.  By design, it returns everything it\
    needs to compute the final answer: both n (even though it is an input
    argument) and n_inside_circle.  To compute our final answer, all we
    have to do is sum up the n:s and the n_inside_circle:s and do our
    computation"""
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n, n_inside_circle

Cell 3:

pool = multiprocessing.Pool(2)

results = pool.map(sample, [10**5]*6)

print(results)

I could replicate the error if I ran everything in a single cell.
- Still same error. But in my .py script, I managed now to run without error message by changing pool.map to pool.imap. However print(results) gives <multiprocessing.pool.IMapIterator object at 0x103224c20>. Is that right?
- Actually wrapping if __name__ == "__main__": around the pool commands seemed to work. But it works only in .py script, not in Jupyter Notebook

I tried the solution and I restarted the kernel, but my kernel is still marked as busy at import random … I left it alone, but it is still marked as busy..
I got the following results
- %timeit pool.map(sample, [10**5] * 10): 37.8 ms ± 546 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- %timeit threadpool.map(sample, [10**5] * 10): 208 ms ± 9.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- %timeit map(sample, [10**5] * 10): 136 ns ± 2.01 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
- So, without parallelism, way faster??
- Or, is the built-in map doing something different (less) than the pooled maps?
- And why do the number of loops differ?
I could not finish the exercise because I could not run this code "%%timeitresults = pool.map(sample, [10**5] * 10)" after 1h it didn't finish
- Maybe try a smaller N just for the sake of seeing it finish (and you will see that the estimate of pi will be quite bad).
Does multiprocessing.pool work if only a single CPU is available?
- It will work but the processes will fight for the resources. Usual laptops have at least 4 CPUs
  - I know that modern devices have many CPUs, but it's a funny hypothetical question, though.
    - You could ask for an HPC node and by default you get only one CPU and then wonder why your code is so slow
- If it auto-detects 1 CPU, it would probably run only one at a time (with a bit of overhead for moving the data around). It's bad when it detects more CPUs than there are, since a CPU being used by two separate things is very slow
  - So, if I want to make an algorithm that is shared to other users, and that other user might be in a situation with a single CPU, should I try to detect the number of CPUs beforehand and implement functions accordingly?
How does the speedup get affected by the number of logical or physical cores?
- In theory, if it can be distributed perfect, then time is 1/(n_cpus). Plus a bit for transfering the data around. If it can be distributed perfectly.
- True, I was wondering if a laptop with 8 cores would behave similar (assuming similar frequency etc) would behave the same as a desktop CPU with 8 physical cores.
Is MPI.COMM_WORLD something like a global handler or for one specific process pool? Or is it something you need to declare/destroy within a scope (module/function)?
- I am no MPI expert, but I have vague memories that it is the global communicator between the processes started by the main mpi process. I will ask an expert :)
- MPI.COMM_WORLD is a pre-defined communication world that consists of all tasks. You can split the tasks to their own communication groups if you want by creating these groups, but this world is the most common one because it consists of every task. E.g. let's say you have big task, which is split into multiple subtasks and you want to do a reduce operation (e.g. get average) within that subtask group, you can use the specific communicator for that group.
If a library has been optimized with MPI (say, NumPy for example) to use any all CPU cores available, is it possible to restrict it to certain cores by force? Like if you're running a BLAS-heavy operation (such as a numpy.dot or numpy.linalg.matrix power) over large matrices/vectors and you'd rather it doesn't clog your resources.
- NumPy does not use MPI. It uses another programming paradigm called OpenMP (open multiprocessing) to do multithreading. But you can limit the number of processors by setting an environment variable OMP_NUM_THREADS=n to be the number of CPUs you want it to use. See this page for more information.
- I see. Thanks! That's quite a useful thing to know, because sometimes mv-products get too heavy and one doesn't really realize before the computer crashes
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- For more programmatic way of setting these numbers, you can try out threadpoolctl
If I run a parallel python on an HPC node and I have let's say 8 processes, should I request 9 CPUs so that one is for the "master" process?
- Usually this is not needed. If you're using MPI, the main process joins in on the calculations. If you're using multiprocessing the main process waits until the results are returned from the pool (e.g. when results come from pool.map()) so the CPU where the main piece of code was running is free.
- In some cases, e.g. deep learning data reading there might be benefit to having more processors so that data loading can happen in the background while the model is running. For PyTorch, see its multiprocessing best practices
- Also, if your system has 8 and you ask for more than that (i.e., 9), you might actually slow the program down since there will be some context switching (processes swapped in and out of the "running" state) that will add overhead. It's like having a desktop computer and opening too many windows (web browser, Zoom, word processor, etc.). At some point, your computer slows down.
  - Thanks!
    - Oh! And there are other bottlenecks in a program, like writing to a disk. If your computer has 100 cores, but all these cores write to one disk, then the single disk head will be overworked and become your new bottleneck. So many things to investigate when you start looking into parallel programming.
I have understood that GPUs are different than CPUs: can I use python multiprocessing on a GPU? or are ther other libraries for doing that?
- Good question. In general one needs dedicated libraries to run on GPUs. I do not know if the multiprocessing package supports GPU - will have a look
  - Multiprocessing itself can't use GPUs, but the individual processes that multiprocessing launches can use GPUs like a normal Python code would. So if you have GPU code that works on one GPU, you can use multiprocessing to run it on multiple GPUs. To use GPUs you need to use frameworks or libraries that can run operations e.g. vector additions, matrix multiplications on the GPU.
- Yeah it won't work out of the box. For example CuPy is a "numpy / scipy" written for GPU https://cupy.dev/
- Some of the popular frameworks for machine learning can be used also for numerical computing in general. I have not tried it out myself yet, but heard from a collague that one can achieve very fast calculations by using functions available in PyTorch.
  - Yes PyTorch is very useful for this. You can tell the library where you want to variable to be computed (e.g. a.to('cpu') makes sure computations on a will use cpus, and similarly for gpu a.to('cuda') will run the same function but on gpu ) (or better, a function of the same name, but the underlying implementation is different :))
  - Jax is another popular one for this. It is designed for deep learning in mind, but it implements the numpy API and can run calculations on GPUs, if needed.

Break until xx:06, then Packaging

CATS is now fluctuating

Packaging

https://aaltoscicomp.github.io/python-for-scicomp/packaging/

This cat really is the cutest
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
😍
- He looks soooo incredibly happy too! I bet he is spoiled out of his mind! (as he deserves)
- You're occupying his space!
Do you always need to use the period, e.g. "from .adding import add" ?
- The .packagename syntax is a position relative syntax so it imports relative to the current file. It is recommended when you're importing from the current project. This way another package called adding installed to the environment would not overwrite the .adding that you want to import.
Is it not necessary to add the file extension in this case?
- (misunderstood question, removed answer)
- was this about a .py extension?
  - Yes it was when importing the file was added as for instance adding not adding.py - but perhaps it does not matter?
Why would I want a .toml file v. let's say, a .json or .yaml file?
- This is the standard for pip packages. See this page for full pyproject.toml specification.

Exercise, we resume at xx:40

https://aaltoscicomp.github.io/python-for-scicomp/packaging/#exercise-1

Try to do what we just showed, but if it doesn't work, it's OK. There is a lot that can go wrong
The goal is to be able to "pip install", the rest is extra.

Why do we want a virtual environment, like this python3 -m venv venv? and why are we writing venv twice?
- First venv in python -m venv venv is the module venv that can create virtual environments. Second venv is the folder where the virtual environment will be created in. It can be anything really. E.g. python -m venv myenv.
- Also, we want a virtual environment so that we can test out the installation without messing up our current environment.

Getting a large error when running pip install –editable calculator

You can also try without --editable and see if it works. This means that when you modify the installed one directly uses the source, so that if you edit it it is directly used. This can be good for development but isn't needed for this example. (sometimes it doesn't work…)

Worked!

You need a relatively new pip version so that editable installs work with the project structure that we showed you now.

should I do python -m pip install –upgrade pip ?
You can try that.

I did it but it maintained the error, unfortunately. I'll stick with removing the –editable for now

What is the error btw?

  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... error
  error: subprocess-exited-with-error

  × Getting requirements to build editable did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      error: Multiple top-level modules discovered in a flat-layout: ['adding', 'integrating', 'su
btracting'].

      To avoid accidental inclusion of unwanted files or directories,
      setuptools will not proceed with this build.

      If you are trying to create a single distribution with multiple modules
      on purpose, you should not rely on automatic discovery.
      Instead, consider the following options:

      1. set up custom discovery (`find` directive with `include` or `exclude`)
      2. use a `src-layout`
      3. explicitly set `py_modules` or `packages` with a list of names

      To find more information, look for "package discovery" on setuptools docs.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

I understood my error! I had a wrong file setup and was calling 'calculator' instead of the whole project folder on the pip install –editable

Does pip install install my package in another location or is it just "making use" of the package that I just created? i.e., if I did a pip install and then went in and modified calculator/adding.py (for example), do I need to do a pip install to update it?
- This is what we were trying to show. pip install makes a copy of the code and pip install --editable makes something like a "symlink" if that makes sense.
  - Ah! Ok! I see now. Sorry I missed it earlier.

  """
  HelloCat! package.
  """

  from .greet import hi
  from .hungry import feed
  from .dismiss import bye

  __version__ = "0.1.0"
  """
      |\__/,|   (`\
    _.|o o  |_   ) )
  -(((---(((--------
  """

when create pyproject.toml file name = "calculator-myname", is the name here the "project-folder" which include "calculator" or just "calulator"?
I got the error "calculator' is not installable. Neither 'setup.py' nor 'pyproject.toml' found." Although I had pyproject.toml. And what is setup.py?
- Thanks! pyproject.toml was in the right place (project folder) but it works by pointing pip to that project folder instead
Maybe I am confused by your previous answer, but pip install --editable only works on the directory name, not the package name.
- --editable install is meant for local development, so you need to mention the path to the package.
Can anyone really push anything to the official PyPi? No verification?
- Usually, yes, if it is a project that you own / are a maintainer of.
python3 -m build gives me error "No module named build"
- Try python3 -m pip build
  - Build is not part of pip. https://build.pypa.io/en/stable/
  - True! Maybe the environment is missing setuptools
    - Build takes care of fetching setuptools in an isolated environment for building the package.
- Or python3 -m pip install build if you are missing it.
I tried to install my package ("calc-example") in a conda environment and it appears in the package list. However, when I try to import the package in Jupyter notebook, the autofill suggests only "__editable___calc_example_0_1_0_finder"
- Did you install it to the environment that runs the jupyter notebook?
- yes
- Can you import it with the name that you specified for the package (even though autofill does not show it)?
- Hey, the package name has "-", I guess that's an invalid character in a name..? Or is it..?
  - import calc-example -> SyntaxError: invalid syntax
- What is this "__editable___calc_example_0_1_0_finder"?
- Fixed: I just had mismatching name in .toml and the package folder
If I well understood, this means that you can create your own library and make it available for other user, right? And PyPI will create a page online with the readme file, and the author and other info about the creation of this library?
- Exactly!

Feedback, day 3

News for day 3

Info on certificates is on the webpage
We hope that this was a intro to various topics, but we know you will need to learn more to use it. Do go forward if you need to!
Join this Zoom link to talk to instructors https://aalto.zoom.us/j/69608324491