Python for SciComp (archive 2)

This is the part 2 of the archive of the course notes.
Archive part 1 at: https://hackmd.io/@coderefinery/python2021archive
The live notes are/were at http://scicomp.aalto.fi/python-notes
twitch stream @ https://www.twitch.tv/coderefinery (for best viewing hide chat there and switch to "theatre mode")
Lesson material: https://aaltoscicomp.github.io/python-for-scicomp/

Day 3

Icebreaker

What do you think of the course so far

has been useful and entertaining
Good overall introduction to different tools
Nice overview of Python concepts, which can take a bit of time to get on your own if you're used to another language.
very useful course and good combination of everything needed. But setting up the windows every day is a bit hard. Could the daily email have all links to that day too, not just for previous day? Meaning this HackMD, the twitch, course material and zoom.
- The links are the same every day, aren't they? I just lock my computer and keep going the next day. Had to refresh the twitch page to get the live stream going, but that was it.
- but I use the same computer for million other tasks and finding the windows among others is hard… (many webpages have poor title)
  - Sure, but you can still reuse the links from day 1, right?
  - Sure but the daily mail is on top of email pile
Very nice overview and good course concept.
This website in github is great resource for some basic concepts, I will certainly google less when writting the codes myself :).
Pandas indexing was very nice, not going too deep but gave confidence
nice overview of good, practical information that I can already reuse for my projects

Have you used someone else's code and regretted it? Why?

Rather the opposite: not used somebody else's code and regretted writing it myself: +1,+1
Modifying some else's code for my own purposes has usually been quite productive, and a good way to learn too
I Think taking somebody's code and using it is the resource I fall on often, but then I lack the total understanding on how it works (I took some python course in which only lets say one wa of writting the code was covered). In many ways you have opened in here as to how some structures work for me. so thank you.
Not code per say, but certain R packages have been a complete pain but I was too deep into the analysis to go back and switch
Only because it makes me lazy. If there is somme existing code I tend to use it even if I have to code something myself (which can hurt in the long run).

Scripts

https://aaltoscicomp.github.io/python-for-scicomp/scripts/

rkdarst does Thomas have some static or is that just my headphones?
when downloading that .py file I got an error that this file was blocked as it may harm my device.
- that's annoying. I guess the terminal method?
- Windows don't like .py. If using Edge, choose keep from the little box top right showing that the .py was blocked.
- wget https://aaltoscicomp.github.io/python-for-scicomp/_downloads/4b858dab9366f77b3641c99adece5fd2/weather_observations.ipynb
I don't have -> New -> Terminal in Jupyter. Using it via the browser. Can't convert .ipynb to .py Win10
- What OS are you on? On Windows: you can simply open Run -> cmd
  - cmd does not recognize jupyter, neither does Anaconda promt nor git bash. Whatever, I'll just copy the code to .py manually. Works well now.
there is coming some error with the convert to script.
[NbConvertApp] WARNING | pattern 'weather_observations.ipynb' matched no files
- Are you in the right directory? Normally you should be but there could be something odd.
  - yes, that was the issue probably, but anyway i copied stuff to spyder as i prefer this
However the file is in the catalogue.
- Could you post the error?
- Are you in the right working directory in the terminal? Mine opened in C:\
- I got the same error and I don't know how to locate the file using the terminal.
a note: linux command python may lead to python v.2x, while python3 is definitely python v.3x
My python doesnt have any file–> console. How do I open console?
- On Windows Start -> Run -> cmd or gitbash or similar
- On Linux: open a terminal (e.g. ubuntu: Activities -> terminal)
my extracted .py file give me an error at the get_ipython().run_line_magic coomand and so I was not able to get the output plot from yesterday's notebook. But the conversion worked well!
- It tries to be clever and convert %% into a function that import IPython. I didn't realize it did this, this is somewhat clever but I guess has problems when it can't import IPython.
I personally use ipython and run my scripts there as %run script_name.py, then I see all variables and have all functions loaded. I'm wondering, is there any advantage to run the code directly from a terminal command line like $ python3 script_name.py?
- I would assume, that running it without ipython is probably faster (since ipython overhead is not there)
- Interesting! Until today I did not know one could run scripts out of ipython. I always used and liked ipython as a better looking interactive python shell.
  - I found this way because I came from Octave/Matlab, where people generally run scripts from the shell. Ipython is great, and it is terminal-based, which is a great advantage in my case.
When I try to convert by running "jupyter nbconvert –to" and so on, I get: "jupyter: The term 'jupyter' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is corect and try again."
- Seems like for some reason jupyter is not on your path.

#!/usr/bin/env python
# coding: utf-8

# In[2]:


import pandas as pd

url = "https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv"
weather = pd.read_csv(url,comment='#')

# define the start and end time for the plot 
start_date=pd.to_datetime('01/06/2021',dayfirst=True)
end_date=pd.to_datetime('01/10/2021',dayfirst=True)

# The date format in the file is in a day-first format, which matplotlib does nto understand.
# so we need to convert it.
weather['Local time'] = pd.to_datetime(weather['Local time'],dayfirst=True)
# select the data
weather = weather[weather['Local time'].between(start_date,end_date)]


# Now, we have the data loaded, and adapted to our needs. So lets get plotting

# In[4]:


import matplotlib.pyplot as plt
# start the figure.
fig, ax = plt.subplots()
ax.plot(weather['Local time'], weather['T'])
# label the axes
ax.set_xlabel("Date of observation")
ax.set_ylabel("Temperature in Celsius")
ax.set_title("Temperature Observations")
# adjust the date labels, so that they look nicer
fig.autofmt_xdate()
# save the figure
fig.savefig('weather.png')


# In[ ]:

What is considered best practice: One file for each function or one script file with many functions? (Asking from R-centric package development where the first is recommended and all functions should be documented (Markdown, Roxygen2))
- often people put several functions into one file but they should be somehow related. often a project starts with one file, then grows, then is split into more files. once it becomes a lot of scrolling, it may be time to split up. but very rarely I see Python code with one function per file.
can you show how did we create the new .py file again?
- File → New → New text file → Right click on tab and rename it.
Is it correct that a script should start with a shebang line but the imported functions don't need that?
- That is correct. Actually, #! (shebang) is only needed when you run it without python on the command line - it tells unix "Python is the thing that runs this"
Should the weather_function be located in the same folder as the weather_observation script? If I will use the weather_function multiple times in different scripts in differerent folders, will it be better if I place it in a central location?
- for this exercise: yes. Python will first look in the same folder. But it can also be in a different place, but then we need to tell Python where to look. As you write, if you want to use it in multiple projects (and possibly multiple notebooks), you may want to have it somewhere else. Option 1 is to create a package and I think we will learn about that tomorrow. Option 2 is to set PYTHONPATH environment variable where you can tell Python where to look.
- mine doesn't find it, stored in the same dir under name weather_functions.py. Error at the import line, "no module named 'weather_functions'"
  - hmmm… (thinking what this might be, strange)
  - this was my own error, trying to do it from a notepad, not script. Probably won't work the same way there. It does work from a script.
If you want to show the power of script e.g. changing parameter. Would it be better just to do simple program e.g. print(x) and then use script to change x?
- Do I understand that you mean here one Python script modifying another Python script?
  - Yes, is it the goal of exercise? Maybe I misunderstand something. I read from the webpage "you do not need to manually configure your notebooks to be able to run with different parameters"
No module named 'weather_functions' -> where should I store it so that jupyter finds the file when I import it? And which data type should it use? ipynb?
- In our case, same directory as weather_observations.py. the "import path (PYTHONPATH)" concept lets you import from other places.
I have problem running the script "python3 weather_observations.py". It says "/Library/Developer/CommandLineTools/usr/bin/python3: can't open file 'weather_observations.py': [Errno 2] No such file or directory" in the terminal. I am at the folder where the weather_observation.py locates.
- interesting/curious … this is on macOS?
  - yes.
    - and when you do cat weather_observations.py it does print the content of the file to your terminal, right?
      - correct!
      - Yes. Well, It seems like the conversion failed. Now I reconvert the file by running "jupyter nbconvert –to script weather_observations.ipynb" again. Now "python weather_observations.py" works. But "No module named 'pandas'"
        
        ok, "good" :-) the import pandas problem indicates that where you run it, you are not in the Anaconda environment which seems to be automatically opened when you open jupyter. To make pandas visible there, you need to activate the conda environment. I think we have a lesson about how that works later.
        
        Great! So running codes in jupyterlab doesn't necessarily mean that the Anaconda environment is activated, right?
        
        right. in this case it is since jupyterlab itself comes from anaconda in this case. but one can install jupyterlab outside of anaconda or without using anaconda at all and then you may have just jupyterlab (plus all its dependencies) and nothing else.
        
        Is there a command line that allows me to check where I installed my jupyterlab?
        
        This is a good plan! One of the main obstacles for me is that python code never seems to be transferable, because the libraries are not available where I try to run it. Whether it's a different machine (ok, just install libraries) or just a different (virtual?) environment, or different user on the same computer.
        
        very common problem. we will discuss it. what we will recommend is to always have requirements.txt or environment.yml very close to the code and then those who want to run it can create a separate environment for each project. If I see a Python code without any requirements.txt or environment.yml, I often feel a bit lost because then I need to look into the source code and I may not know which versions this is supposed to work with.
How can one convert a python script file to an executable file?
- on Linux, macOS, or Windows?
- Windows and Linux
  - I can answer the Linux part (unsure about Windows, there it depends since there are so many ways). Anyway on Linux: on top of the script add #!/usr/bin/env python or #!/usr/bin/env python3 (this tells the system what language to use to interpret it) and then make it executable with chmod +x myscript.py.
    - Thanks!
This doesn't change the file into binary though, does it? Wouldn't it be faster to run binary code be than to run python code?
I "solved" the conversion issue by copying the code into Spyder and created a .py file there. Then I could run the script without a problem. Could you again explain how to run a Python script in Jupyter?
- You can use the "%run" magic. See https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-run
- Or you can open a terminal inside jupyterlab and run it there
- You could also pass the command to the terminal by writing !python weather_observations.py in a cell of your notebook
- Thank you!
I cannot see the difference in the two created png's after adding the new part. Or did we not change anything, just how the script pulled the function? Example, what info lied where?
- As far as I understand the result is expected to be the same but we pulled the function out and now it can be reused also in other contexts/projects.
  - Alright, then I understood it correctly. Great.

Scripts exercises 1-2 until xx:33

https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-1
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-2

Try to do these exercises. The first one is more important than the second: at least have the .py file (exercise 1), that is what is needed.

Remember that you can make a new Python file and copy and paste from the HackMD above.

I got the script file:
yes: oooooooooooooooooooo
not yet: oooo
not trying: ooooo

What part has been difficult:

Converting from .ipynb to .py +1
Fixing the parser, more details would be helpful

Command line arguments with sys.argv

#start_date=pd.to_datetime('01/06/2021',dayfirst=True)
#end_date=pd.to_datetime('01/10/2021',dayfirst=True)
start_date = pd.to_datetime(sys.argv[1],dayfirst=True)
end_date = pd.to_datetime(sys.argv[2],dayfirst=True)

output_file_name = sys.argv[3]
fig.savefig(output_file_name)

Can you say something about Argparse vs Click ?
- Argparse comes shipped with Python. Click requires installing the library (which is not a big deal with conda or virtualenv or packaging; more about that tomorrow/later). Click is great for composing command line interfaces (but most project are perhaps not in this situation). Personal preference: I like that with click I can define arguments close to where I am using them. Whereas with argparse we often define them all in the same place. For most projects, both are perfectly fine and it is up to personal preference. If you want to compose a couple of libraries where each have their own command line interface and they should compose to a combined interface, go for Click.
- In the end all modules have their advantages and disadvantages, and it's up to personal preference.

Mine ran without error but produced no output file. I had forgotten to import sys. But tried again now, and still no output file (and no error either)

There can be many reasons, we would need to look at the code.
Have only changed the sys.argv:

import pandas as pd
import sys
import weather_functions

url = "https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv"
weather = pd.read_csv(url,comment='#')

# define the start and end time for the plot 
start_date = pd.to_datetime(sys.argv[1],dayfirst=True)
end_date = pd.to_datetime(sys.argv[2],dayfirst=True)

# The date format in the file is in a day-first format, which matplotlib does nto understand.
# so we need to convert it.
weather = weather_functions.preprocess(weather, start_date, end_date)


# Now, we have the data loaded, and adapted to our needs. So lets get plotting
ax,fig = weather_functions.plot_data(weather['Local time'], weather['T'])

# save the figure
output_file_name = sys.argv[3]
fig.savefig(output_file_name)

what ws your call to the script (i.e. the python call)?
- python weather_functions.py 01/03/2021 31/05/2021 spring_in_tapiola.png
- Aha, why I am running functions I do not know. My mistake.
  - Good that you spotted it
  - Hurra. Works now obviously.

What about typer for command line argments?
- as mentioned further down, in the end all argument parsers have their pros and cons, if you need a specific feature go for that parser, there is no "best" one.

The course Richard mentioned is here https://scicomp.aalto.fi/training/scip/linux-shell-basics/

ipykernel_launcher.py: error: the following arguments are required: output how do I solve this?
- Where did this error occur? - after running the code given in the solution for Exercise 3

Exercise 3, argparse, until xx:55

https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-3

(Norway, Finland and Sweden zoom sessions if you need help is open, check email for link)

Work on adding argparse to your code. Or click, or something, if you want to.

Undefined name 'parser'. How to define parser?
- parser = argparse.ArgumentParser() in the example
  - This line seems to be missing in the example just above Exercise 3
    - Yes, it was mentioned, have a look at the solution, it's in there.

But when I've defined the argparser, I get an error if I run my script in Spyder: outfile = sys.argv[3] IndexError: list index out of range How can I keep executing the code from Spyder while developing further?

That means you still have sys.argv[3] in the code. Most likely at the output line.

So, I need to use parser.output instead? I get an error again: AttributeError: 'ArgumantParser' object has no attribute 'output'

You would need to use the name of the argument you defined in the Argumentparser. if you used the example code that would be parser.output

I do

could you paste your current code?

import sys
import argparse
import pandas as pd
import matplotlib.pyplot as plt
import weather_functions

parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', type=str, default="Out.png",
                    help="end time")
parser.add_argument('-s', '--start', type=str, default="1/1/2019",
                    help="start time")
parser.add_argument('-e', '--end', type=str, default="1/1/2021",
                    help="output filename")

parser.parse_args()


url = "https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv"
weather = pd.read_csv(url, comment='#')

# outfile = sys.argv[3]
outfile = parser.output

Ok, sorry… another bug in the example… rename all parser instances after parser.parse_args(), by args, and set `args = parser.parse_args()

If you run from Spyder, you don't have access to command line arguments (though maybe you can set them somehow, but you lose the power). The point of the arguments is you can leave spyder and run other ways.
- If you don't want to leave spyder, you could set some default values (using argparse). Then spyder works for testing, command line works for production.
  - Nice, thanks. This is useful for bugfixing or extending a script
Would you have few comments on GUI design as well? for example traits and traits UI.
What is the best way of debuging in Spyder, in the same way we do in matlab?

Break until xx:13

Then scipy/library ecosystems

Remember, if you registered in Sweden, Norway, or Finland, you have access to a Zoom session for one-on-one help.

Scipy

https://aaltoscicomp.github.io/python-for-scicomp/scipy/

Scipy exercises until xx:32

https://aaltoscicomp.github.io/python-for-scicomp/scipy/#numerical-integration

The point here is to explore using a separate library: reading the docs, etc. and practicing making some code from scratch. Please discuss/ask any questions.

This exercise is:
worthwhile: oooooooooooooooo
not worthwhile:
I'm not doing it: ooo
Playing with argparser instead: oo

I'm not able to get it running I've run pip install SciPy and then tried to import it and get teh following error
- the command would be pip install scipy
- got it thank you!!
I don't understand what the dense/sparse thing with matrix-vector product means and can't find anything about that in scipy.sparse.linalg support
- A sparse matrix has mostly zeros. there are efficient data formats for this, and scipy implements them.
- It allows you to do some things like linear algebra faster. But requires some care to implement.
- In this exercise, we see that SciPy uses the numpy array interface for sparse matrices. It's not exactly the same but as similar as can be and tries to be compatible.
  - This is the point of the exercise, good question!
  - thanks!
import statements: what's different and when should we use
- import library
- from library import something
- import library.something
- import library as shortname
- import library.something as shortname
So the format does differ the computing time. csc 6.1ms, csr 6.76ms, lil 53.8ms, dok 2950ms. All the time are actually slightly longer than the ones in the "solution" section. Any possible reason why it's slower on my laptop?
- I guess different computer hardware, maybe scipy has different optimizations, …
  - Ok, thanks. Just wanna know how to have a insanely fast computer, XD.

Library ecosystem

https://aaltoscicomp.github.io/python-for-scicomp/libraries/

Things I use:
- rasterstats
- rasterio
- geopandas +1
- bokeh
- xarray +1
- numpy
- scipy
I unfortunately rather often find issues with dependencies, where something depends on a specific version of a library, but another library depends on another version of the same. Especially when something won't install with conda, but does with pip, things get complicated
- indeed. it can get complicated.
- does the multitude of environments, package managers etc only mean python is very flexible or has it become a bit too messy? I'd like a bit more standardization
I have heard about Cython language (C++ with Python) that supposedly is faster than Python. Do you have any output on this? Would it be easy for a python user to learn cython?
- Cython can be used to couple Python with C++. Alternative is pybind11 (which I personally prefer).
- To couple Python and C, I would use the cffi package (C foreign-function interface)
  - Great example of how to call cpython and fortran with the mandelbrot set. For me it was pretty easy to just run if you use a jupyter-lab with fortran-magic installed :)
For interfacing with Julia: julia
Does not fit here, but your course is so great. Can you maybe offer a more advanced course as well?
- Best would probably be to mention specific topics you want to learn more about. That way it's easier to think about a specialised course for a topic.
  - Dask, xarray, but maybe also Docker, more parallel programming.
  - For example some of the following:
    Adding C-code within python-code, parallel programming, OpenMP and/or MPI,
    Tensorflow, GPU acceleration (CUDA)
- Maybe! Would you want to help plan it?
  - I could try but still in the process of learning the more advanced Python ecosystem.
I would like som pointers to general guidelines on how to build a "package" out of your project.
- yes, important topic which we will discuss tomorrow
- good topic and I think the way you approached it was fine. Real exercises might take too long, but this points the way
Worth to consider is python back compatibility, like print functions and new things happening that might feel cool but might also break things for those who have not the latest version of python
- Yes, this is one of the most important things to me!
I think this lesson is nice as it is, but could be good to have it earlier in the course, to make it clearer what importing things means and such.
- good point. for a long while I was quite a bit confused about the many ways to import a module/package.
In a project I'm involved with, ASE, which is really big, we do try to stay out of too many requirements to make basic part of code work but that some functionalities ask for dependencies.
- that sounds like a reasonable approach +1
I think the lesson should be there, but maybe mentioned more quickly and left to people to read on their own.
- Thanks, sounds good! I like the discussion though.
What are the best practices when starting to write a library? Is there a guide online?
- CodeRefinery may be a good starting point. Then there are various guides for different languages, e.g. https://packaging.python.org/
- Thank you! Is there any CodeRefinery link specifically one could check out?
- https://coderefinery.org (next workshops in spring, run a lot like this one by a lot of the same people. Spread the word!)
excatly, why not attaching the dependencies to a package rather than having all users try to resolve the environment all the times they try to install? Obviously, ignore machine-dependencies.
- That only really works for "stand-alone" products, that don't intend to be incorporated into other projects.
- A package would say what its dependencies are. but then you need a unique combination for your tool.
- Then there are recursive dependencies. Do you publish only direct dependencies or recursive dependencies also?
Will we include some instructions of how to use git in this course at some point? I think for big/small scientific projects it usually needs more than one people to contribute, thus some version management might be useful.
- This is the main purpose of CodeRefinery, we purposly leave it off here
Can I ask what happens in the breakout rooms in the breaks? I am just following on Twitch and there is a silence.
- Breakout rooms are for live support/help. I hear most are pretty quiet, I guess Twitch+HackMd is enough for most people. Right now there is no audio, it is break time
  - Thank you.
- Do you think Twitch+HackMD alone is a good system?
  - Yes, I think so. Your live support here on HackMD is excellent. Only the silence is a bid off for me as I am alone. But I guess, every participant is allowed also to answer questions to experience a bit more of an interaction with the others here on HackMD? It is nice that you select from time to time questions from here and discuss them in the live stream.
Up to now, everything is good. But for us who do not have any experience in developing a relatively bigger project (still for research), it would be good to have an online/offline course in which a project development is showcased. It could be so that, the project is divided into subtasks where subtasks are developed offline by the participants but the key points are reviewed during the course.

Break until xx:00

Then parallel python.

About this lesson:
- We should keep it: ooooooooo
- We should not cover it: o
  (but keep the material so that people can check it out)

Parallel

https://aaltoscicomp.github.io/python-for-scicomp/parallel/

About parallel programming:
I don't need it:
I need to use multiple CPUs on one computer: ooooooo
I need to distribute across a computer cluster: oooooooooo

I use parallel computing a lot, but mostly in embarrassingly parallel cases like a large number of completely independent scenarios
- And what tools are you using to run them? (out of curiosity)
- Currently MATLAB and its parfor, and R with mpi. Now trying to learn if Python could do it for me. I have done some OpenMP and MPI with C, too, but most of "real work" with those mentioned
  - Also parallelizing "outside" of Python with a batch system (like Slurm on high-performance computing systems) or a workflow management system (like snakemake) might be an approach here
    - Certainly! Have to use Slurm anyway on the CSC computers (Finland)
      - because sometimes it can be simpler if the code does not even know that it is run in parallel with many other instances of the same code. disk-I/O can be an issue so parallelizing "inside" can be a way around I/O limitations. but if the code reads and writes very little and very few times, then it can be simpler to parallelize "outside"
Which profiler would be closest to the one in MATLAB? That one is really intuitive
- I always use this one: https://docs.python.org/3/library/profile.html (cProfile)
If you compile you math library (openblas or fftw3) in parallel then those libraries will do the heavy lifting in parallell for you.
- +1+1
So multiprocessing call several instances of python and then combine it in the end and get around the problem with python and parallel?
- And if you pass in objects, those will be copied, and modifications on them will not be passed back, except if the objects are explicit return values.
My running of output is really slow. What could be the reason for that?
This code:
with Pool() as pool:
output = pool.map(square, [1, 2, 3, 4, 5, 6])
output
- If you run it again, is it faster? If so, took a while to import multiprocessing
  - it is still running, even after a restart of the kernel
  - It's very slow for me too (still running), restarting kernel doesn't help (using windows)
    - Can it have something to do with the OS? I'm on a windows, same problem +1 +1 +1
    - Same problem here with MacOS. This part of codes runs forever.
    - For me it helped doing "from multiprocessing.pool import ThreadPool as Pool" now it works very quick
    - ^That helped, now it's running!
  - I think it is because we are loading the total library: Should have been like this right:
    import multiprocessing.pool
    pool = multiprocessing.pool.Pool()

and in the exercise text it is 'import multiprocessing'. Like Richard talked about earlier. Loading the total scipy or only a subpart of the library.

I'm still stuck here, can't get a result from pool.map(square, data) Computer takes forever
So in multiprocessing will there be one data memory instance, shareded by the instances?
- From my understanding it's not shared, but they write to the same output.
- Does that mean that if you have big data set that the memory will grow for each python instance?
  - That's at least how I understood the process scheme.
  - Yes. But… presumably you start with a lot of data, and it gets distributed piece-by-piece so it stays under control.
I just want to say that the Twitch stream sometimes stop for me and did not realise you continue. Sorry.
- Is this a problem for others?
- Not for me. +1
- What's your location?
  - Let's say I am currently outside the Nordic countries, but I have a Nordic affiliation. Sorry then, must be me.
- I have a hypothesis we should reduce the Twitch bandwidth, we are using way too much for the amount of data we have. Does anyone have thoughts on this?
  - I selected currently 720p because streaming via my mobile phone as router :-)
A basic question: What would be the difference between installing a package from conda vs from Python?
- from python, you can't really install a package. You would have to manually copy it over to your package directry
- then what about conda vs pip?
  - Conda resolved dependencies, pip does not from what I understood
  - there are really two alternatives: PyPI (Python package index) which is often used with pip and virtual environments and Conda. both can pin and resolve versions. many libraries and packages are distributed on both PyPI and Conda. But some are only on the one or the other. PyPI is traditionally only for Python packages but now also contains mixed-language projects. Conda is designed to be able to distribute any language project. Long story short: it depends on the project and the community. Some communities are only on Conda and some projects have system library dependencies that are hard to distribute via PyPI so there it's better to use Conda.
  - Thanks! I have had issues installing some spatial packages, when using one or the other (pip vs conda) works for some packages and the other works for the others. And they don't seem to understand each other's dependencies, i.e. packages installed through one system aren't "accepted" (the later installed packages don't see them as installed)
    - yes exactly. some geospatial packages were much easier with conda for me than pypi.
- Can I ask: But pip is now in my conda env. Should this pip not resolve conflicts? I think it does.
  - hmm… I am actually unsure about version resolution between the pip-part and the non-pip-part of environment.yml. I would probably try to (per project) get either everything from Conda, or everything from PyPI. But I know that sometimes it is not possible.
can I run the mpi-enabled code (e.g., our darts example) within one or more Jupyter cell/s?
- yes. if the environment contains mpi4py which I think it does. but this can be run out of Jupyter.
- But I think that you need to start it from command line with mpiexec from the command line to start the different processes and get them communicating.
  - ah good point. I forgot about the launcher.
What is the difference between threads and processes?
- Good question! It about how operating systems work. You could roughly say "processes have different memory, threads share the same memory". A process can run a system command to start new threads or end them.
- Threads are often used for e.g. web applications where one thread can answer queries from users while, one accesses a backend etc. Applications where some of the threads wait for communication (IO/web requests) and while they're waiting, some other threads do other stuff.
- One might also think it this way: Threading is basically you multitasking at your work. While you wait for printer to print a page, you use your phone to check your email. You change your focus to another task. Multiprocessing is having multiple workers do different things: one person prints while other person looks at emails.
Is there a preferred mpi implementation for python? Would you recommend Open MPI, MPICH, other?
- I think Python (or better mpi4py) does not mind which one. It can talk to OpenMPI or IntelMPI or any other as far as I know. I often start with OpenMPI.
- It depends on the system. On a laptop, OpenMPI is basically always the best option. On a cluster, check the documentation for that cluster.
Is there a way to time the code in windows (%%timeit doesn't work)?
- one "pedestrian" way which I often use is to insert a timer before a function call and after. I use https://docs.python.org/3/library/time.html#time.perf_counter for this. Then I can move these around until I find the bottleneck. This does not run the code many times like timeit does and does not give me statistics but often it is all I need.
- you can also run timeit without the magic: https://docs.python.org/3/library/timeit.html
Can I ask why you did not consider dask? I find mpi4py difficult to handle, to collect all the results in the correct order. Have not reached yet dask.distributed, but it seems easier.
- This is just for demo purposes, for most of the time you probably should prefer something like dask if it works.
- Dask is very good if your're working with a large dataframe or large number of individual tasks that you need to complete (bag).
In the Pool() funtion what is the meaning if I put a number in it, for example Pool(10). Is the number of cores?
- Yes.
- Usually you would want one per actual core, usually auto-detected. On clusters, you might not have access to every core on the node if you didn't reserve them: in this case you need to make sure it uses the right number! Using too many is slow
  - So if I don't specify any number than the fucntion will use the maximum number of cores avaialble in the machine?
I get the error <DLL load failed while importing MPI: The specified module could not be found> when i try to import MPI from mpi4py
- I guess it needs to be installed, conda install mpi4py. I'm not sure about windows-specific considerations here.
  - but mpi4py might be there but the actual MPI might not be there in this case. mpi4py is not enough, it also needs some library that can do the MPI part.
  - Same here, what should we install then? in adition to Anaconda3
    - if you have access to a cluster which has MPI installed, I would test it there. otherwise on my laptop I installed openmpi but I would only do that if you consider to actual use it, not only for this exercise.
- OK, right. So it needs more. Did you install through pip or conda, conda can be more clever about pulling in these non-Python dependencies.
  - from conda i get that '' PackagesnotfoundError:the package is not available from the current channels.''
    - I think it's in Conda forge, so adding -c conda-forge to the install command should help
My jupyter just says "Busy" and does not move forward when I run %%timeit for pool.map(sample, [10**5] * 10)
- same x5
- Hourglass isn't moving at all
- Same also when removing %%timeit
  - Well, it is creating a pool multiple thousands of times and executing it (that's what timeit does to get statistics). So it can well take some time.
- I had a bit of a look around. It could be that the issue is that windows and linux use different ways to create subprocesses. In windows essentially previous things are rerun to obtain the memory state of the process up to that point. This might cause issues in jupyter.
Can I ask what is the difference between conda and conda forge?

Exercises, parallel, until xx:50 (← update, we did math wrong)

https://aaltoscicomp.github.io/python-for-scicomp/parallel/#exercises-multiprocessing
https://aaltoscicomp.github.io/python-for-scicomp/parallel/#exercises-mpi

You could try exercise one or two. Or both. I'd highly recommend trying exercise 1 at least, it is a common, simple way of distributing work.

If the pool is not working: from multiprocessing.pool import ThreadPool as pool - I guess the forking system doesn't work on windows somehow. Though ThreadPool doesn't avoid the Global Interperter Lock

Did it work on windows?:

yes:
yes with ThreadPool:ooo
no: ooo
not in multiprocessing: oo

For the magic function, my code only works if I type it as "%timeit" and put the highlighted line right after '%timeit' as in the same line.
- What about two %% to make it a cell magic (% processes only that same line)
  - Ah, I see. I split the cell and now it works as '%%'.
  - Timeit broke down somehow, whenever I try to run it in jupyter makes the notepad to execute anything "countless times". I.e. simply saying print(1+1) gives 2, but %%timeit print(1+1) (on their own lines) gives thousands of 2's
- Timeit actually, does run it countless times. Well, countable, but it runs it many times to get good statistics. Better to not print from a timeit.
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
I'm still stuck at pool.map(square, data) can't get a result. Computer takes forever. Can't seem to get the import right, maybe??
- from above, try from multiprocessing.pool import ThreadPool as Pool
- ImportError: cannot import name 'ThreadPool' from 'multiprocessing'
- I had wrong name, multiprocessing.pool, see above
- Ok, now it runs, thanks.
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
I have a line results = pool.map(sample, [10**5] * 10) working fine, but in the next cell, when I try to run n_sum = sum(x[0] for x in results) it returns with an error saying "NameError: name 'results' is not defined". It seems very weird.
- yes it is. Is there a %%timeit or something?
- Yes.
- So %%timeit runs in in a different environment and doesn't share the output variables
  - Great! It works after I comment the '%%timeit'.

Dask

For me the Count row is 1 Tasks and 1 Chunks

https://aaltoscicomp.github.io/python-for-scicomp/parallel/#dask-and-task-queues

To display (and to actually calculate) the value in Dask: result.compute()

Feedback, day 3

Online vs in person:
Twitch best:oo oooooooooooooooo
In-person best:oo
Zoom best: ooo

Twitch was good as people can easily watch the saved video after, and just watch the stream live as well without much hassle.
about Zoom: (you can get rid of the image of the speaker which is otherwise in the way in twitch)

Anyway: you need the HackMD for interaction and troubleshooting.

Never tried CodeRefinery in person and can't travel to Finland for this. +1+1
There are CodeRefinery workshops in other countries as well, including Sweden and Norway.

What other courses would you like:

More in-depth on parellel processing, merging C-code within python-code
Scripting +1+1
Data Analysis +1
Bash
ML libraries (e.g., Pytorch, Tensorflow)
- And how to structure your own library of scripts and functions.+1
asyncio please +1
Reseach pipeline: data-python-matplotlib-latex and git over all of it. Particularly, git for researchers will be interesting.+1
I/O processing, more parallel computing, how to handle 500GB files, how to split them and work on them still
how to create files with h5py, how to write to them, how to read them etc.
how to write scripts, functions etc
debugging

Name one good thing about today:

parallel stuff
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
got some understanding about the library structure
materials and organization +1
The whole parsing bit.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
argparse was great.
Making scripts was super useful. +1
I think you are amazing with your course. I learned a lot. I like that you are not afraid of doing it remotly.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Organisation and live support is amazing, have not seen something like this somewhere else. Teaching is also top. +7
Good overview of some important but complicated topics.
I also like that you are doing it as pair. Nice interaction within each team.
Scripting was very clear
It is a good overview of parallel stuff. I think it is required to at least have a certain practice knowledge about parallelism to follow suitably the course and exercises.

Name one thing we should change:

I totally lost the Script section ever after I couldn't solve the dependancy issue. It might be more helpful to move the dependancy part before the script part. Or at least providing some ealternative solutions to make sure even the ones who has dependancy issue can follow the course.
Definitely, it would be great to have more exercises to practice much more about scipy, multithreading, pool, MPI and Dash on Windows, that leverage better the time. Multiprocessing is still running and the Windows-based Jupyter notebook is busy yet. It would solve some exercises debugging the code and overviewing the code.
Demonstrate debuging when using multithreading/parallel processing +1
The first part on scripts was too slow, and frankly, it is more basic than getting started with python (for me at least), so it was strange to have it in this "middium-level" python course.
- I disagree. This was exactly what I needed to learn today. +1+1
  - this comments highlights the fact that most people that nedds/wants to use python would benefit from the basic understanding of what a scripting language does in the first place.
I feel that I got something from the earlier lessons, but not much from the parallel processing - I understand that it is quite difficult and complex subject, and I'm not sure how it could be improved +1
- I think parallelization is so complex that this kind of demo is about as far as you can get without a more involved course. But the demo is important to give you the idea that this is something you CAN do, but need to learn more
I would have liked to see the "preprocessing" scripts, or the functions explained better. Where did "T" come from, for example.
I would like a section on how to navigate among Jupyter, the command prompt and other modes (conda perhaps) for writing/executing code and how to link them. Basically an overview on how to structure your work using all these tools. We've covered a bit regarding Jupyter but it would be great with more of a birds view. (I got lost importing and converting the script code in the first session and the different ways to run it.)
Parallel computing: I didn't really get the point, especially since the code didn't run on my Windows laptop. Make a good demo and just show us next time! +1
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
SciPy: I would have liked to spend some more time with this today. +1
how the code profile should be looking like before going to parallel computing?
in twitch, your faces are pulted on top of the page and then you hide the example code behind them. in zoom, the people can themselves place the faces somewhere. however, it is nice to see you talking!
More infos on dask, xarray, scipy.
I kinda feel today's class is quite fast and covers many different topics.
Maybe more explanation about the differences between parallel processing implementations, slurm and local multithreading would be helpful before getting hands-on

I attend a weekly free Teams chat called the HPC Huddle with one of your former colleagues now in the uk
Please see #HPC Huddle
Great community open to anyone

thank you for the suggestion!

Day 4

Richard does not sound distorted in the Twitch (minor)
- Richard's sound does sound a bit distorted on Twitch, but we can survive
Yes I can here you
Hear Richard and others
Richard your sound is distorted a bit today as well
Richard's sound is better that Radovan's in Twitch. Radovan has echo?
- To me Radovan's sound is way better than Richards …
- Richard's sound is distorted, except from that it is fine
All sound good now.
Richard your sound is just getting worse +1

Icebreaker questions

When you use someone else's code, how do you do so?

copy-paste from Stackoverflow and adapt.
- +5
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- usually after searching for syntax or function name/argslist
based on examples they provide if possible.
copy-paste and sometimes through a joint Github repository
I get the source file on a USB-stick, and then I continue to develop it until it is my own code.
- a.k.a. plagiarism ;-)
  - Not as long as the original developer is aknowledged.
  - Doesn't this get into the weeds of licensing?
  - Not for codes written by a supervisor and handed over to their students.
Installing through pip +1
Forking from GitHub +1

What's your scientific background?
biology: ooooo
physics: ooooo
comp sci: oo
electronics: ooooooo
chemistry: oooooo
Mathematics: oo
languages: o
Structural biology:
Material science: o
Meteorology:o
Control engineering: o
Earth science: o

Dependency management

https://aaltoscicomp.github.io/python-for-scicomp/dependencies/

Please ask/comment here

Sabry is still a bit low?
- better? Yes!!
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
Do these exercises/examples work on windows? Looks like unix terminal code
- In theory they should - conda and virtual environments work there. But I wouldn't be surprised if there are quirks we don't know of!
- we'll help the best we can… I'll comment on possible differences
Right now: this is follow-along, general demo. We will get to details later.
does it mean we need to put documentation about all packages we used?
- by "freezing" your environment you indeed document all the versions of the libraries/packages you have
I found this
https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/
to be quite good

Exercise 1:

https://aaltoscicomp.github.io/python-for-scicomp/dependencies/#exercises-1

How do you install Python packages (libraries) that you use in your work? From PyPI using pip? From other places using pip? Using conda?
- PIP: oooooooooo0oo
- conda: ooooooooo
- apt-get/aptitude: ooooo
- brew
- pip and conda mix o
- Pipenv for packages and asdf for python versions.
- I use pip (and usually break everything because I forget to create an environment first)
- I like pipenv or poetry for taking care of my packages.
- mamba
- pip or conda, case-dependent
- create virtual environment using venv
- To be honest I google the package and use whatever it works. I do not know the difference of using one or another.
How do you track/record the dependencies? Do you write them into a file or README? Into requirements.txt or environment.yml?
- I don't +3 : ooo
- I have never done it +5
- combination of writing in README + environment.yml
- requirements.txt +1 +1
- With poetry or pipenv: a lock file +1
- .
- using Snakemake
- using pip freeze and keeping list of packagaes in the requirement.txt
If you track dependencies in a file, why do you do this?
- If not this is done you will have problems
- to make sure it's reproducible +1
Have you ever experienced that a project needed a different version of a Python library than the one on your computer? If yes, how did you solve it?
- Yes, I had no idea at that time
- Yes. I create environment for different analysis that need specific packages or versions. +1+1+1+1
- Yes, installing new versions in pip until it worked. +1
- Yes, intalling new versions and sometimes I need to downgrade the package to a previous version.
- Almost any project I have tried to run on my machine. Sometimes I need to incrementally change versions to find the right version that was not documented
- Not on python, but the same can happen e.g. in Java when building plugins for frameworks, and their used packages suddenly change and you have to update your code.
- Yes. Creating a new environment in conda and installing the version there or testing versions until it worked +1
- yes: needed two different packages that (since installations were based on different tools: one pip, the other, I can't remember) with incompatible dependencies. What a nightmare! Eventually, I found another package that defined a dependency tree that included the compatible versions of the two otherwise incompatible dependencies.
- Yes. I replaced in the occurrences of the uncompatible function in the code, which luckily were not so many. Still required some work.
- Generally create new environments before installing

continued

I have to say that sometimes deleting the $HOME/.local has fixed strange issues with wrong dependencies
- Yes! On our cluster in happens all the time, a pip install --user makes permanent changes which cause unexpected problems months later
- In one case even having a separate environment activated, still went to use packages in $HOME/.local +1
Anyone know here the name "wheel" comes from?
- I think refers to cheese wheel, some reference to Monty Python (which I don't know well) - the Python creator is a big fan of M.P.
- I had also seen it as a unix group. so they are unconnected?
Nice thing this "oneliner"! Run python commands with "python -c". useful!
- Can be useful in continuous integration pipelines such as GitHub Actions (when debugging automated testing)
Can you explain the main differences between pip and conda? In my understanding, they have access to different libraries (maybe not right use of word) of packages.
- pip installs only Python packages. Simple, but when you depend on other scientific libraries, it's not enough.
- conda installs packages (Python, R, C libraries, etc…). Useful for scientific stuff
- pip can install on the system itself, or in environments
- conda only works with environments
- conda works better for virtual environments where there are more dependencies for scientific area
My industry buddies that are all for poetry and think conda is "too big". Why do you conda is so much used in scientific communities? I mean to build something which is only python it seems a bit overkill?
- Well, I often need things that are not pure Python. And I want to use the same thing for all my projects.
- see the above answer for a partial answer: when the Python side is not enough. If pip packages need to compile something, it can matter what else is installed on the operating system (example: to install pip package X, one has to install debian package Y - not a good dependency)
  - But couldn't we use a docker container instead? And have the total app/package/set-up totally isolated? I would personally think this is better for the reproducibility.
    - But worse for usability. At least if anyone ever wants to use your code and not just run your program. I think the difference is whether you code for an end-user (at which point I would agree, a docker image works well) or if you code for usage by other coders, at which point a docker image is mostly useless.
      - Good point!
        
        With binder we could still have the usibility part. Behind that you have a docker container?
        
        I'm not sure what Binder is running in the background, but you don't need to create a container yourself to publish on Binder. We'll come back to this.
- yes, that's another option. different cases do different things. I often see conda in docker.
- so far, I never met a task that I cannot get done with apt-get, so from my point of view, conda is an overkill.
  - consider yourself lucky! This is a good case to be in (and perhaps shows good dependency management, keeping it simple)
    Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
What do I do if I need a certain package that is only included/ available through older versions of anaconda?
- all (or most?) old packages are there, in theory you can recreate an old environment with old versions.
  - How do I do that? Because the way I tried it didn't work (downgrading conda). It was an old and probably quite specific package that might not be maintained anymore(?)
- I would start with a new, blank environment, then slowly start adding what you know you need. the conda version itself shouldn't matter - that is only the manager
- but yeah, sometimes you have to try several things to get it.
  - conda install <package> didn't work unless I installed an older anaconda version.
- let's check error messages later. Things are difficult. But note: conda != anaconda. conda is package manager. anaconda is a distribution with certain fixed default versions (so downgrading certain things may be hard). So, you should use conda to make a new conda environment (not full anaconda) that starts with nothing installed. then add what you need.
  - Hm, ok, I'm probably missing some basics here…
is it ok to mix conda and pip? Do they "cooperate" or will they step on each other's toes?
- yes, you can install pip packages in conda environments (and it happens often). pip packages can also be in conda's environment.yml
what if I messed up my conda base env? what should I do? reinstalling the whole anaconda?
- I probably would (not because you necessarily have to, but deleting and re-creating ensures it is reproducible).
- Then try to do risky stuff in other named environments, so I can remove and reinstall just those.
  - but when I messed my conda base, how can I fix it?
- I would delete and recreate. and produce a good list of dependencies (environment.yml) so that you can get back to the state you need easily. Unfortunately this happens
  - how can I recreate my conda base after being messed up?
- I would delete the installation folder and reinstall.
When trying to create/use environment.yml I often need to add channels manually. Is there a way to store the channels as well as the package names, e.g., when exporting conda environment?
- I know environment.yml can store channel names. I don't currently know many details.
Assuming backward compatability: Is there a way to specify minimal versions for a package? +1
- packagename>1.5.5 for example.
- packagename<4.0.0
- and if you need to combine them for the same package… I would have to web search
Could you go over what I can do if a recent install broke my environment?
- potentially create a new environment? (essentially see the question above about messing up conda base environment)
- personally: remove and reinstall, see if it works then. If not, then try to figure out where the conflict is (not fun)
- also I recommend to not install into base environment but create a new environment for each project. then it's always safe to delete and recreate and less fear of breaking things
- what about rolling back using conda install –revision ?
How exactly do you reset your conda base environment? (I.e. what do you have to delete)
- One option is to run conda install --revisions 1, which should bring you back to the original base.
We are not doing any of the exercises?
- The time for the exercises was 60 minutes total, for a 60 minute lesson. They could be done in a longer course, or as homework.

Break until xx:10

then binder

Binder

https://aaltoscicomp.github.io/python-for-scicomp/binder/

This is mostly a demo-

Radovan has slight echo in twitch. Was better before break.

does anyone else hear it?
minimal. Probably due to the room he is sitting in +1
has his microphone changed? he had to restart zoom

https://aaltoscicomp.github.io/python-for-scicomp/binder/#exercise-1

Data might be missing as well
- I might have 200+ GB of data. Not easy to share.
- But how about sensitive data?..
  - At least provide a way to obtain the data (even if its just giving an email address to contact)
  - It could be possible to provide a training data set of reduced size that doesn't contain any sensitive data.
Code without documentation is not accessible
No examples to lead the user to get fomiliar with the code, and the user wouldn't know if the code is working fine or not.
An example dataset is useful to provide together with the code.
Code might need non-python packages like netcdf
it might not run on other peoples computer
Tests to show what the code is supposed to do.
What operating system the code was run on.
I think code sharing is feasible in many ways bit the reproducibility of the implemented and computing environment with the specific data used could be different and take more time even with the source code. Here, if I understand, the binder tool can provide us such advantage among the most importants.
.

What problems can you anticipate 2-5 years from now?

Dependencies have newer versions
Do we all have ARM computers?
- Translation environments?
Different computer architecture?
Jupyter might not be available any more since there are now "better tools".
Binder is no longer a service…
.

Binder demo

Currently: Making a new notebook, adding an earlier visualization example there.
https://aaltoscicomp.github.io/python-for-scicomp/binder/#binder-exercise-demo

The repository is at https://github.com/bast/python-demo-2021

About downloading a file:
- Right-click on Raw and select "Save as"/"Save Link as" (better to clone the repo, but if you just want the one file)
  - Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
    (Sorry, did not mean -1…)
Does binder take .yml files too?
- yes. And similar files from other languages like R and Julia
- list: https://mybinder.readthedocs.io/en/latest/config_files.html
What if the repo is not public? Do you have to pay for Binder then?
- Binder doesn't do any authentication (right now), so can't be used. Binder is completely non-commercial, but the tools are open. Perhaps you could put together your own thing.
  - How could we do this with a private repo? I'm what would be the "way to go" then?
- The underlying tool is repo2docker, which in theory can get you most of the way there. Though not on the web.
- I think they might be thinking of things like private repos. I'm not sure of the current state.

The direct launch URL: https://mybinder.org/v2/gh/bast/python-demo-2021/HEAD

Who is running binder?
- The software is open source. The development team overlaps with Jupyter and the service is supported by several cloud provider + the Turing institute (and probably others).
Will other large program written in C++ be running on Binder?
- In theory you can run anything there, if you can set the right files to configure it. They have CPU/memory limits, so it can't be too big I guess.
What is the difference to Google Colab?
- Binder takes a git repository, so the setup is different. Colab just pubishes your notebooks directly.
  - Which means Binder is more flexible about dependencies, but Colab is simpler.
- They are quite similar services. Colab has GPUs, so it can be better for some tasks.
Why do I get /home/joyvan if i run !pwd in the binder?
- joyvan is a default username in jupyter's docker images (person of the planet Jupyter?). That's the user's home directory, the git repo is cloned there.
- Jovyan is the word for person from Jupyter or related to Jupyter
Is Zenodo better than github if I want to share my code since Zenodo has a perminant doi adress?
- They have different uses. GitHub is about continuous development, Zenodo permanent. Often times papers and so on will want a snapshot on Zenodo or similar
what happens if you now delete your github project? Does Zenodo keep the release?
- Zenodo keeps the release.
Can you talk about other tools equivalent to Binder?
Code changes in the repository are updated in real-time in Binder?
- You would need to restart Binder
How will Binder work with files other than Jupyter notebooks. Like plain .py files? Will it work for R files?
- R-notebooks are supported similarly to Python notebooks.
  - Also Julia.
- You can start a terminal in Jupyter lab, so in principle you can run anything.
How can we know the Binder instance computation features and boundaries to execute more complex implementations?
- looking…
- This and next section gives current limits: https://mybinder.readthedocs.io/en/latest/about/about.html?highlight=limit#how-much-memory-am-i-given-when-using-binder
- This isn't resource limits but other usage guidelines: https://mybinder.readthedocs.io/en/latest/about/user-guidelines.html
- https://mybinder.readthedocs.io/en/latest/about/about.html answers questions about CPU, time, and memory limits. Thanks!

Break until xx:00

Then another demo: packaging software

Packaging

https://aaltoscicomp.github.io/python-for-scicomp/packaging/
This is a demo

Audio fine.

Do you have any experience/opinion on nbdev?
- Interesting. I haven't used it yet. What is your experience with it?
- It really let's you develop libraries without much knowledge. The auto-generated documentation is very pretty, too. I have had some troubles with version control across differet OS (different stuff added in the meta section of the notebooks). Also, notebooks are easy to get messy, one needs to be careful to separate exploration from developed stuff
  - nice to know! thanks for the tip!
Why are Dockers, Kubernetes rather in demand than Zenodo? I have seen this even in academia.
- I think t-hese solve a different problem.
  - Zenodo is for publishing a version of your code. It does nothing to help with dependencies. It just gives you a a DOI.
  - Docker, Kubernetes, Singularity and so on record your environment and reproduce it. They solve the dependency problem. You can then publish the containers, if you want.
  - So you might want to publish a Docker container on Zenodo :)
    - Yes, thank you and then I show I can do both :-)
in packaging, what about dependency issue?
- also here we need to list dependencies. many packages list their dependencies inside setup.py. this is also what we will do here in a moment. as a side remark: it can be confusing and is not easy to know where to list dependencies (setup.py, requirements.txt, environment.yml) since in Python fortunately or unfortunately there are so many ways of doing this. I regard requirements.txt and environment.yml as best places to document dependencies for people. setup.py is the place to document dependencies for packaging. then there are many more ways (poetry, pipenv, flit, the one can read the other …)
Can I install the package in a seperate dir (not a central location in Python)?
- Kind of. You can install it into a virtual environment.
- The --target option does this, in a way. pip install --target=path_to_folder package_name installs it to path_to_folder. But Python will not find it by default. To import it, you need to add the folder to the PYTHONPATH environment variable.
Yesterday, I had exactly this challenge. In my requirements.txt were several git urls. But what is pip install -e git+git:
- Did you mean to have the -e? It is usually used for local folders. It allows you to keep editing the package, so that any changes are reflected when you import.
- The git+git:// part tells pip that you want to download a git repository. (Github stores git repositories)
  - I had in the requirements.txt a line starting with-e git+git:. And for some reason pip install -r requirements.txt did not work, so I looked into the file and tried to do install all by hand into my env.
    - The -e might be a problem
    - Yes, I think -e does not make much sense together with a Git repo. It is useful when you want to install a local folder and still keep editing it and you want everything to refresh just after saving changes.
    - Thank you. I had never seen this before, I was clueless. I still did not understand so much the -e option.
      - Basically, you use -e when you have the package on your hard drive. If you don't add -e, pip will make a copy of the package.
What are the difference, sometimes you typed "pip install", sometimes you typed "conda install"? Are they different tools for downloading packages?
- pip install fetches packages from PyPI (or URL/GitHub like in the example before), conda fetches packages from Conda (often conda-forge). pip is often used with virtual environments, conda is conda. both solve the similar problem but have been created with different motivations. to some extent you can also mix them (you can pip install packages from PyPI to a conda environment).
- but in short: yes. two different ways of fetching and managing dependencies.
What IDE do you recommend to use on a windows machine, with a purpuse of creating and running python scripts?
- actually same question for Linux: any recommendations?
- I used spyder
  - I use spyder also, it looks similar like matlab.
- PyCharm is really good for debugging, I have been using it on Mac and Linux for more than 5 years
- coming from Java, I still use eclipse, also for python. Might not be the best option though
- Visual Studio Code. It supports Jupyter notebooks, has nice plugins. It is also available on all platforms afaik.

Panel discussion

What questions would you like the insturctors to debate?

IDE:
- Spyder
- tmuxinator, vim and jupyterlab Simo
- tmux + vim
- PyCharm
- Eclipse + PyDev
- Visual Studio
- Emacs
what an efficient way to start learning a new language for doing a specific reserach project?
- Do a smaller project in the language. Or just start with the actual one, but choose a manageable task.
- how could we recognize how deep we should go in learning the libs and tools?
What OS do you use?
- Linux: oooooooooooo
- Win10: oooooooo
- MacOS: ooooo
Why do you choose the OS you are using now over the other options? Any reasons or benefits?
- I started using Linux because of my curiocity, but now I do quite a lot of administration of various servers, and it just so nice to do from a linux desktop/laptop; another reason is specific linux software used in space-physics research.
  - some of the software I'm using simply runs better on Linux as Linux has a lot of libraries build in or they are simpler to install, e.g MPI
- Linux is so common in the field of high-performance computing and scientific computing that using it will make using other libraries / systems much easier. In some fields Windows is really popular so there's nothing wrong with using it, but it can make building programs or using programs by other people harder, as you need to adapt their documentation to Windows. Simo
- When I started, it was very hard to do programming in any other system than Linux. Especially for high performance computing.
- Great to know it, and what is the situation of MacOS?
is it worth havingsome sort of "linux basics" course/mentoring?
- Yes. I use Win10 at work (company policy), but I still prefer bash for command line stuff (I use git bash)
  - as I got, there is something coming (mainly about the linux bash)
When should I start migrating to Julia? XD +1
- I'd say that if you're going to write your own algorithms (like a fast solver etc.), it might be good idea to use Julia instead of writing C extensions Simo
  - So is Julia a 'replacement' of C or of Python?
    - More a replacement to Python, but I'd say it's easier to write C-style vectorized code on it. So it's a bit like writing numpy-style stuff but with the whole language being numpy. However, Python has a much more mature libary ecosystem, so it's easier to use existing solvers. Simo
    - I woudl actualyl say it replaces C rather than Python. It makes writing fast code easier. Python is great for combining existing libraries. Jarno
      - That is true. Simo
      - replacing C in the numerical Python ecosystem?
  - So is Julia EVER going to replace or wipe out python, or at least soon-ish?
When you start a new project with Python or during the development of a new project, what's the usual steps/workflow/checklist to make the whole project more clear and managable?
- I usually start by creating a git repository and an environment.yml-file for the project. After every dependency install I update the requirements. Simo
- attend a CodeRefinery workshop!
  - Do they have any online workshopssss or only in person?
Totaly off-topic: But has Sabry built a cool studio for live demos in his home?
- I should show you my rkdarst's studio too
  - that'd be nice
    - It would be nice to see both.
- Yes I have, Richard has more monitors than me though. I did not have chance to show the light board today (show off)
  - That is a miss, I can tell you that. Next time, Sabry. Or maybe during the after-party.
Is anybody considering switching to a Apple M1 processor? +1
C or Rust?
- new project starting from zero? Rust
- but we don't need to rewrite everything either. C will be around for a long time
What about the copyright issue? Here you talked a lot about sharing the code in an easy-to-use way. Is there a risk of someone take your code and make it their own without referring to your work?
- That is always an "issue" when makin open source software, but in the end, if you have a license file, they are breaking it, and make themselves liable to lawsuits. And, by having it in a git repo, you have a track record of when your code was there. Thomas
- I notice on Github, there are many different kinds of licenses one can choose. Which one(s) would be more generally recommended for a scienfitic project?
  - Most of them are ok. gpl is pretty viral, lgpl is more relaxed, the MIT licenses are normally fine as well.Thomas
- https://choosealicense.com/
What do you comment if you compare Python and R? What kind of problem is more suitable for python respective of R?
- Python is better when doing general stuff that might be related to your scientific calculation. So for example, using web server as a API to your model. Calculations on array data (physics etc.) is usually easier in Python. That said, R has a huge ecosystem when it comes to statistics and machine learning. Simo
Just a remark: Would have been cool to see some females in the debate! +1 +1 +1 +1
- I thought the same, where are the women?
  - I know :( we were suppose to have one female instructor but in the end it was not possible
- Next year.
- thank you for the comment. this is really a problem which we need to change.
- Is there only one female instructor in the community?
  - definitely not. there are so many excellent instructors and programmers. we need to find a better way to reach them and need to create a more welcoming environment to join as instructor. there is a lot to do and improve.
I think: keep focus on Python and advertise CodeRefinery workshops on the side. It is not possible to go through too much detail anyhow unless more days. Next big CR workshop in March 2022.Diana
A company told me recently, open source software is worth noting. Is this true? I guess Python somehow earns money there? May it be the future (especially in research)?
- You can't directly make money with the code. However, there is plenty of examples where open source projects are spported by large companies and can pay people to code for them. (e.g. wine got several people paid for by blizzard at some point). So calling it worthless is (imo) wrong. Also, you can still make money with open source software, in the form of support.Thomas
- some companies make an amazing amount of money with open source so this is not true in my opinion. also many companies use a lot of open source code without donating back to the community.
- Most of the open source companies produce a product (such as Anaconda) that is available open source, but support & enterprise add-ons are sold separately. For big companies, paying for the service is worth the added support. Simo
- Also it's always the question how you consider worth. Must not necessarily be money, especially in public research, open source software is of much more worth than close software, since it actually advances the community/society instead of putting up pay-walls in front of every single development.Thomas
- Hm, this company makes money with it. If I don't pay them, I don't get access to their code to communicate with their device, hence their statement probably "open source software is worth nothing"
  - It might be that the source code is open, but the license related to the hardware is not. But if you cannot access the source code, it is not open source.Simo
  - About "open source software is worth nothing": How much are open-source developers really worth?-article There's a lot of articles such as this and in general the worth of open source software is recognized.Simo

Please join our community to make things even better at the next workshops

Aalto Scientific Computing: https://scicomp.aalto.fi/help/community/ (daily zoom https://scicomp.aalto.fi/help/garage/)
CodeRefinery https://coderefinery.org/get-involved/ (commnunity calls https://coderefinery.org/about/community-call/)
CodeRefinery chat: https://coderefinery.zulipchat.com
- Help section: https://coderefinery.zulipchat.com/#narrow/stream/141114-help
Nordic-RSE weekly coffee break is a good point to start https://nordic-rse.org/communities/coffeebreak/#weekly-virtual-coffee-break

Feedback day 4

One good thing about the course

I really like the panel discussion session today. It's very good to know the professional/experienced opinions on the questions coming up by us beginners :)
This was awesome for getting an idea about what's out there. I found especially useful the introdcution to running scripts and providing arguments through command line, and the part about packacking. +1
A really good session as a whole. Got a good understanding what one should think when doing these kind of projects. Definitely a low lewel approach, liked really when you said that using existing code is really ok.
Really useful and well structured course about Python for Scientific Computing. I have learned several tools, ways, and tricks from your experiences, material, and comments. Truly grateful! +1 +1
Encouraging atmosphere
argparse and packaging.
The last instructor, who was also doing pandas, was awesome
HackMD is an awesome platform for Q&A. I have never joined any other courses using this before, and I feel the other tools are not as effecient as this one. +1+1
- I start to think of using it in my future online teaching, so kinda curious about how do you manage to repond so promptly? How many people would be sufficient to make it work well?
  - Good thing with zoom-teaching is that one person can usually write here while others are speaking. Richard probably knows best how many instructors there should be Simo

The best thing was to finally see an entire workflow and hear the comments of practioners. I've read a vast amount about coding and SciComp generally but I've not come across this type of forum before even tho I attend many online events. It is far easier to visualise a role in the field after seeing how many of you there are and how many different opinions you have! +1

What could we improve?

I found the dependency management section a bit hard to follow (having not much experience with it). A simple type-along example might have been better for this +1 +1
Today was too much demo and few exercises, the demos were nice but it made my ADHD kick in +1 +1
Today's session was a bit hard to follow, since there's no hands-on exercises and follow-along examples that we can practice and get a flavour of the topic. +1
^ Agreed. Maybe if this could be 5 day session there could be more time for the exercises we did.
Managing better the time for introduction, explanations, examples, exercises, debugging and comments. Perhaps following a Nash Equilibrium ;) to distribute better the activities and the course time.
Link to other HPC Communities and perhaps we can even have in-person sessions. I agree that pair-coding / shadowing is extremely helpful.
Go to the next level and maybe advance with the content, i.e. more advanced topics?
- We did this last year, we might re-do it in February https://aaltoscicomp.github.io/data-analysis-workflows-course/
  - Is this really the next level? How about more in depth parallel programming, a SLURM course or ? (I don't know).
    - we have that too :D https://www.youtube.com/playlist?list=PLZLVmS9rf3nN_tMPgqoUQac9bTjZw8JYc
      - although I am not sure how to "sort" the levels
        
        I need to check your website better. Sorry.
        
        the main driver for doing more is interested people like you, so just ask for more and we are happy to provide more!
maybe there could be one project that we work on throughout the course such that everyone has a working result of something in the end (which uses all the concepts at some point). This would make the lessons more connected and more hands-on
- Sounds like a follow-up hackathon. :)
- There is a disadvantage with that: Currently, instructors can work on their own part, a example working through all courses would require a lot more coordination.

Please join our community to make things even better at the next workshops

Aalto Scientific Computing: https://scicomp.aalto.fi/help/community/ (daily zoom https://scicomp.aalto.fi/help/garage/)
CodeRefinery https://coderefinery.org/get-involved/ (commnunity calls https://coderefinery.org/about/community-call/)
CodeRefinery chat: https://coderefinery.zulipchat.com
- Help section: https://coderefinery.zulipchat.com/#narrow/stream/141114-help
Nordic-RSE weekly coffee break is a good point to start https://nordic-rse.org/communities/coffeebreak/#weekly-virtual-coffee-break

Python for SciComp (archive 2)

Day 3

Icebreaker

Scripts

Scripts exercises 1-2 until xx:33

Command line arguments with sys.argv

Exercise 3, argparse, until xx:55

Break until xx:13

Scipy

Scipy exercises until xx:32

Library ecosystem

Break until xx:00

Parallel

you can also run timeit without the magic: https://docs.python.org/3/library/timeit.html

Exercises, parallel, until xx:50 (← update, we did math wrong)

Dask

Feedback, day 3

Day 4

Icebreaker questions

Dependency management

Exercise 1:

continued

Break until xx:10

Binder

Why is it possibly not enough to share “just” your code?

What problems can you anticipate 2-5 years from now?

Binder demo

Break until xx:00

Packaging

Panel discussion

Please join our community to make things even better at the next workshops

Feedback day 4

One good thing about the course

What could we improve?

Please join our community to make things even better at the next workshops

Read more

2025 Sep/Oct CodeRefinery workshop

CodeRefinery at PDC summer school 2025

CodeRefinery project meeting at NeIC AHM 2025

CodeRefinery meeting notes