Python for SciComp 2021 (archive)

This is the archive of the course notes.
The live notes are at http://scicomp.aalto.fi/python-notes
twitch stream @ https://www.twitch.tv/coderefinery (for best viewing hide chat there and switch to "theatre mode")
Lesson material: https://aaltoscicomp.github.io/python-for-scicomp/

Icebreaker questions

Where are you from and what is your background?

Sweden (Malmo University), Department for Material science and applied mathematics
Finland(University of Oulu), Water Engineering
Finland (Aalto), neuroimaging + computational science. I am a staff scientist
Finland (HY), datascience/cognitive science/ PhD
Finland (AALTO), Computer Science, Senior University Lecturer
Norway (Tromso), computational chemistry
nFinland (Aalto), Chemical Engineering, Bioproduct Technology
Finland Tampere, Computational Physics
Finland (Aalto), ELEC - Control Engineering, Master's thesis
Sweden (Uppsala), Biology
Finland (Oulu), Chemistry
Finland (HY), cognitive science PhD student
Norway (Ås), postdoc Immunology/Virology. Very beginner with bioinformatics
Norway (Oslo), Micro technology
Norway (Trondheim), Numerical mathematics and CFD.
Norway - Data science
Norway (Trondheim), PhD Student in systems neuroscience
Russia (HSE), PhD, Mathematics / Statistics
Sweden (Uppsala), Evolutionary Biology MSc student
Sweden (Chalmers), PhD student Arctic climate physics
Finland (Aalto), neuroimaging + complex systems
Sweden (Lund), PhD student in Engineering Sweden (Uppsala) Computational physics-
Sweden (Uppsala) Computational physics
Norway (NORCE) , researcher in NORCE E E E E E E E E E climate
Finland, University of Oulu, Postdoc,beginner to python
Finland (Tampere University)
Finland (Aalto University), Electrical engineering/MEMS design, Doctoral student
Finland (Tampere University)
Finland (Tampere University)
Finland (OULU university)
Finland (Tampere University)
Finland (Tampere University)
Finland (Tampere University)
Finland (Tampere University)
AAlto, postdoc
aalto postdoc
Finland (Espoo), Ninjalabo, Unemployed Embedded AI researcher
Finland (TUNI), Photonics/Physics, Postdoc
Norway (Trondheim) Industrial Ecology
Sweden (Karlstad) PhD student in Applied Mathematics
Norway (Oslo) Materials Science
Finland, Computer Science
Norway (Trondheim), Acoustics
Norway (Oslo), government data analyst
Finland, Aalto ELEC, Doctoral Student
Finland (Aalto), Data Science / Statistics
Finland (Tampere) Electrical and Electronics Engineer, PHD student
Sweden (KTH), MSc Sustainable Energy Engineering
Finland (Aalto), chemistry and highway engineering
Finland (UniOulu), Environmental health/economics
Norway
SWeden (Lund), Evolutionary biology, Researcher
Finland (Oulu), Environmental microbiology, postdoc
Sweden (KTH) Sustainable Energy Engineering
Finland (Aalto), water and environmental engineering
Finland (Aalto), Computer Science, University Lecturer
Finland (Aalto), ComNet, PostDoc researcher
Norway (Oslo), Engineering, computational geomechanics
Sweden, Linkoping, Biochemistry
Sweden, Chalmers University, Biomedical Engineering
Sweden (Stockholm), Karolinska Institutet, Postdoc in Neurobiology (PhD in Method Development)
Finland (Aalto) , Computational Chemistry
Finland(University of Oulu), Computer Science, researcher
Norway, engineer, computational physics, Oslo
Tampere University, Finland
Finland(Oulu), Control engineering, researcher
Norway(Trondheim), Climate change
Sweden (Stockholm), Biomedical Engineering
Sweden (Uppsala), Evolutionary Biology
Sweden, Uppsala university, Earth Sciences Department
Finland (Oulu), Space physics
Sweden, Stockholm University, bioinformatics
Sweden(Uppsala), Evolutionary biology
Finland (Aalto), Neural engineering
Sweden(Uppsala), Astronomy
Norway (Oslo) - Geosciences
Sweden (Lund), bioinformatics
Norway, Oslo, Engineer, beginner
Mourad Finland UOulu
Sweden, Stockholm University, Hydrogeology
Finland (Aalto), Electrical Engineering
Sweden(Chalmers), Geophysics
Finland (Aalto), Chemical Engineering,
Sweden(Karolinska Institutet), Psychiatric Epidemiology
Italy (EBML Monterotondo), Neuroscience
India, previously worked in Sweden, Computational Chemistry
Finland(Aalto),civil engineering
Finland (University of Oulu), Ionospheric physics research group
Sweden (Chalmers), Railway mechanics
Norway(UiO), Chemistry
Finland (UEF), neurosurgery
Norway(Oslo), SINTEF Community
Norway (Bergen), NORCE
Finland (Aalto University), ELEC, PhD researcher
Finland (Aalto Univeristy)
Finland (Aalto), Information and System Management
Norway (Oslo), Researcher, Climate science
Norway (University of Oslo), Chemistry Department
Finland (Oulu), applied mathematics
Sweden (KTH), Theoretical Chemistry & Biology
Finland (Aalto), Project specialist, Communication and Networking Department
Finland (Aalto), Department of Mechanical Engineering
Sweden (Linköping), Chemistry-IFM

What's your Python experience:

None: oooo
basics: oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
objects: ooooooo
science: ooooooooooooooooNooooo
big projects: ooooo

basic scripting skills and use of scientific libraries
basic scripting skills and use of scientific libraries. Terrible with for loops.
like to learn how do to "for loop" jobs (typically on 2D matrix type data) with faster build-in numpy methods.
Basic computations. I'm in transition from Octave/Matlab to Python.
Do most R than anything else. A little bit of Jupyter oooo
Same here - R is my main scripting language. Every time I try to start
with Python, something has changed a lot, and setting things up seem
complicated. Now I'll try for real! ooo
Basic of python scientific libraries
python main language for analysis, trying to understand OOP libraries better
norway
Basic, some Pandas, Pyomo, Numpy
Basic
Various projects and teaching
Basic (doing self-study), for-while loops, mostly on spyder
Used R mostly, one course work with base python (not numpy, pandas, scipy)

Questions during introduction

Is this being recorded?

video will be available for 14 days on Twitch and my understanding is that we will put recordings on youtube later
the twitch video feed seems a bit out of focus sometimes… Perhaps a larger font could help.
- It might be that twitch adjusts your video quality. Maybe just set it to 720p or 1080p, works fine for me
content will be archived
I am in aalto. I have the anaconda 3 installed but I cannot start the navigator. I can prompt only jupyter notebook or spider. is the notebook enough?
- Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
  As long as you can run Python through your notebook, a CLI and a notePAD/text editor you are fine.
I am using an Aalto laptop. I do not have anaconda in my laptop. How do I use jupyter in this case?
- Please see installation instructions at https://aaltoscicomp.github.io/python-for-scicomp/installation/
- Most likely something like "jupyter notebook" should work, but it won't start jupyterlab. We recommended the Anaconda installation for simplifying this.
could you make the view on the twitch landscape?
- we have chosen portrait so that learners can use the other half of their possibly only one screen to type along and try these things out. this is more difficult on landscape. but i also understand that for some participants landscape might be preferrable, especially if you have multiple screens.
- on twitch I recommend to hide chat there and switch to theatre mode (icon on bottom right of window)
- Just a comment: vertical twitch works just amazingly nice at the second vertical screen!

Jupyter

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/

Don't hesitate to ask questions here and we will answer.

Does Google Colab offer the same as Jupiter?
- Technically yes, check the figures at this link for a great comparison https://buggyprogrammer.com/jupyter-vs-colab/ From personal experience, colab gets expensive if you need to manage large datasets and lots of computations
What is the bash icon? I can't see it in my jupyter-lab.
- It should look like a terminal…?
- But I don't have it, see:
- Image Not Showing Possible Reasons
  The image file may be corrupted
  The server hosting the image is unavailable
  The image path is incorrect
  The image format is not supported
  Learn More →
- The terminal is I think what you are looking for
Is there any advantage of Jupyter-notebook over a well-commented regular code?
- Here on the pitfalls of Jupyter https://scicomp.aalto.fi/scicomp/jupyter-pitfalls/ (written by Richard)
- Personal opinion: well commented regular code is better, but I see the advantage of jupyter for quick sketches and interactive display of variables / plots.
- Well-commented "regular" code may be totally fine for those who are OK with opening source code or even have access to it. But notebooks can be nice to communicate a workflow to somebody who may not be primarily interested with source code and it looks more like a story/notebook to them. Notebooks are particularly nice for data visualizations and linear workflows. Less great for non-linear workflows.
%%bash does not work on my Win10. My Jupyter does not color %%bash, so I guess that's a sign that it doesn't understand what I'm trying to do.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- thanks for feedback. we should probably avoid this part in future since it is not working on all systems
%%bash worked on my installation in Windows 10 using anaconda3
%%bash worked on WSL2 with Ubuntu and Jupyter running as service
%%bash worked on an Aalto win10 laptop (git for windows installed)
jupyter-lab: access is denied
(my Aalto computer, anaconda prompt, base environment)
- start Anaconda cmd.exe prompt
  run python -m jupyterlab
  Does this work? It worked for someone's Aalto compter last week.
- Works!
- Great! It's annoying how this default installation is broken somehow…
- In several ways. I have never been able to install a completely functional environment. Mostly works, but some packages won't install, won't work etc
Is there any online platform to run Python, I am using an Aalto laptop which I am not the administrator for, I cannot install jupyter
- Try https://jupyter.cs.aalto.fi/
- Alternatively use http://vdi.aalto.fi

Exercise session, end at xx:45

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-1
https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-2

Play around with Jupyter some, we are making sure that everyone is on the same page.

A generic question: What is Github and why was it founded?
- A versioning tool/repository. You can store your code in a versioned manner on github to be able to go back to earlier versions, check new additions etc. Also makes collaboration easier, since people can more easily simultaneously code
- Or in other words it was created as a web frontend/storage space for Git repositories. Git is a tool to manage code versions. And it's not only for "code", also good for data, documents, manuscripts.
%%timeit make my code extrmly slow
- As far as I understand it does not run it slower but it runs it many times to get statistics so it feels slower
- Yes, you see it is running the code thousansd of times to get good statistics.
  - I see. this makes sense
Can you explain how the comma in variable-defining works for the exercise 2?
- you mean the a, b = 0, 1 part?
- Yes, and in the calculation afterwards
  - aha yes. in Python you can assign to multiple variables at the same time. In this case it assigns a tuple (0, 1) to the tuple (a, b). You can even do this in Python: a, b = b, a for swapping values without a helper variable.
- And in this case a is the output, b is a "helper"-variable to help a become the next number?
  - In the a, b = b, a case, a takes the value of b and b takes the value of a, no helper variable used and needed. One could imagine the values on the right being deallocated at the creation time of the variables on the left. In some other languages you may need to do this: c = a; a = b; b = c but in Python this is not necessary.
  - Great, thanks I got it now!
  - This is what I wrote having a background on C.
Richard and Jarno's mics are very different volume level on. Is that possible to adjust abit:)?
- better now?
  - Yes, great! Thank you:D
Can we timeit without magics?
- yes you can also use this outside of notebooks: https://docs.python.org/3/library/timeit.html
Where can we see excercises as they are being solved?
- In some lessons we will solve them on-stream afterwards, but not this time unfortunately.
- Some (most) exercises have a "solution" box which is just after the exercise and hidden by default (so you peak only when you want to). For example Exercise 2 in the jupyter page: https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-2
Is there a shortcut for running all code in a notebook?
- I am unsure. It seems one can define a shortcut for this. Some pages mention that you can select all cells with CTRL-A and then run them with SHIFT+ENTER (https://stackoverflow.com/a/48778486)
  :
- Run → Run all cells. Or Kernel → Restart and run all cells.
  - Yes this is what I use. I recommend to always do this before saving the notebook or sharing it with others (because this will be the first thing the other person will do with the notebook). To avoid problems introduced by running cells out of order.
Can someone explain meaning of this: 14.3 ms ± 65.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) ?
- It ran the code 100 times to get statistics on the time. mean=14.3milliseconds, standard deviation is 65.7 microseconds
- So cool that you can do that in a notebook now:D
- Cool thanks, got it now! :)
Is there timeit function for Python interface directly?
- Yes, see link above. The magic is bascially a wrapper around that.
I use https://jupyter.cs.aalto.fi/, are the codes always run by aalto server?
- Yes, it runs on CS-managed servers, Richard manages it.
How do I edit a markdown cell?
- double-click?
  - Ah, yes, thanks.
is there a way to see the "workspace", like in matlab?
- You can print all variables with %who, or for example all integers with %who int. Not quite the same but similar.
- I use ipython, and there two excellent commands: %who, as was said above, and also %whos with more info in the output. They are similar to ones in Octave/Matlab.
A Windows applicable option for the %%timeit -things would be useful.
- It should work on Windows, does it not? What is the error message?
- My bad, works! Earlier used magic and bash commands so I made a wrong assumption.
magic %% doesn't work for me.
- what error do you see?

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-ec2d92ce1d6b> in <module>
----> 1 get_ipython().run_cell_magic('timeit', '', 'for i in range(len(a)):\n    b[i]=a[i]**2\n')

~\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2397             with self.builtin_trap:
   2398                 args = (magic_arg_s, cell)
-> 2399                 result = fn(*args, **kwargs)
   2400             return result
   2401 

~\Anaconda3\lib\site-packages\decorator.py in fun(*args, **kw)
    229             if not kwsyntax:
    230                 args, kw = fix(args, kw, sig)
--> 231             return caller(func, *(extras + args), **kw)
    232     fun.__name__ = func.__name__
    233     fun.__doc__ = func.__doc__

~\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in timeit(self, line, cell, local_ns)
   1167             for index in range(0, 10):
   1168                 number = 10 ** index
-> 1169                 time_number = timer.timeit(number)
   1170                 if time_number >= 0.2:
   1171                     break

~\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in timeit(self, number)
    167         gc.disable()
    168         try:
--> 169             timing = self.inner(it, self.timer)
    170         finally:
    171             if gcold:

<magic-timeit> in inner(_it, _timer)

NameError: name 'a' is not defined''

You seem to not have initialized/set your a variable. You try to square a but haven't defined a earlier, probably.

Numpy

https://aaltoscicomp.github.io/python-for-scicomp/numpy/

My jupyter says that numpy does not have an attribute "arrange", even after importing.
- one r
- Yes, just saw.. Thanks.
- Short for "array range"
Appending new elements to numpy listlike-objects in a loop seems to take much longer than to regular python lists. Is there some fast way to do it, or should one use regular python(multidimensional)-lists and generate a numpy equivalent at the end of the loop?
- Numpy arrays are exactly that, they are arrays, not lists. So each addition is completely compying the array to a new array of a larger size, so as Richard said: if you need to append, use lists and as you suggest buld the array last.
A general question: are there any rival python libraries for numpy?
- numba is kind of a rival. But it's by the same developers :)
  - is it a sort of numpy-fork?
    - No. It's trying to be a more general purpose library for fast computing.
      - Thanks!
  - The most important part of Numba is compiling python functions. It makes python loops as fast as numpy loops.
- There are other tensor libraries. For example PyTorch or Keras. They could be seen as GPU-centric competitors.
- NumPy is being reworked as a standardized API, so in the future, there can be many implementations that all have the same API and are thus drop-in replacements for NumPy. CuPy is one such replacement that uses CUDA to compute everything on the GPU.
%%bash worked but %%timeit is giving me a usageerror. ("Cell magic %%timit not found.")
- its as you write in the question. you forgot an 'e' in your jupyter script (%%timeit instead of %%timit)
why numpy calculate x^2 faster?
- Numpy is calling a program written in C and compiled. Compiled programs are generally faster than Python.
- Specifically it uses a library called BLAS, which has been developed from the 1970's and has been improving ever since. It is blazingly fast. Intel even has its own version specifically optimized for their CPUs.
how to get the information about row and col?
- Row and column numbers would be .shape
A note: I've noticed that Numpy is way slower at AMD processors comparing to Intel. Has anybody tested numpy with the recent Apple ARM M1 processors?
A basic question, is there any trick to differentiate when to use () or [] in coding using python?
- not really a trick. but [] indicates an indexing, while () is a function call. i.e. if you have an object that can be indexed (e.g. lists, dictionaries etc) you can use [] to get an element (either a stored value from a dictionary, or an index from a list/array) e.g. myarray[2], while with () you would call a function (e.g.) myarray.reshape(2,3).
Does numpy support linear algebra operations?
- I'd say most. you can e.g. use matrix multiplication etc.
- See numpy.linalg
does it matter choosing between float64 and float32
- it depends on how many significant digits the computation needs for floating point operations. If it is enough to only have 6-7 significant digits, then float32 may be enough.
  - that is great explanation thanks
If you dont declare a datatype, what does python (numpy?) use by default?
- depends a bit on the value I think
- It tries to automatically infer a proper type if you make an array out of existing data. Otherwise it defaults to float64.
Can you convert an array of floats to an array of integers?
- yes, using astype=int
- like d=d.astype('int')
  - or d = d.astype(int) (seems to be the preferred way these days?)
I tried summing numbers from 0 to n first using a for loop and then using np.arange(n) and sum the array. If I check for relatively small n, I get the same sum with the two methods, while if I go to e.g n = 100000 np.sum() seems to give wrong answer… why? is there a max length on np.arrays?
- Yes! Numpy operates on int64, while python native integers are of arbitrary size (bignum)
- it's still a bit odd,since (at least on my machine I get the correct value on numpy arange with 100000
- Double check the data type of the array x.dtype. int64 can hold pretty big numbers.
  - it is int32
  - try explicitly telling numpy to use int64: x = np.zeros(100_000, dtype=np.int64)
    - thanks, now it works correctly! But is there a way I can set int64 to be default? Not sure why it isn't?
      - It is supposed to be the default. Unless they changed it? Are you running on a 64-bit system?
      - I think so, but how do I check that?
        
        Sorry, no idea. Perhaps Google can give an answer.
        
        Thanks, I'm on 64 bit system. but anyway, probably a good routine to always specify dtype then.
Do np.array([0,1,2]) and list(range(3)) give different output?
- the first makes a NumPy array, the second makes a native Python list. They are different things that look different when you print them. The numbers should be the same though.
  - There are numpy functions that only work with numpy array?
    - Yes. Most functions will also work on lists though, where the first thing the function does is make an array out of the list. But there may be functions that really need arrays.
Will this course cover code documentation (use of Sphinx for example)?
- I am unsure but I think not but we have a great lesson on this here: https://coderefinery.github.io/documentation/
- Thanks!
Do we nee to submit the exercise somewhere? If yes, where?
- No, work by yourself.
Is there a good way to reconsile dimensionality issues such as summing size (3,) and (3,1) results in (3,3) instead of treating both as column vectors?
- This is managed through "broadcasting" rules, which can be tricky. When I always do it try to be explicit by having arrays always have the same number of dimensions. This is achieved by adding dimensions on the fly by indexing with None, like this: x[:, None]. Then you know broadcasting will only happen along dimensions of length 1, which is less surprising.

( please ask at bottom)

Numpy exercises until xx:32

Numpy exercises 1 and 2, if you get done there are advanced exercises at the bottom, or check out some of the documentation or read HackMD

Links to exercises (there are also solutions):

Note that if you registered, you can join your Zoom for exercise help/discussion/etc. (Sweden/Norway/Finland). Link by email.

I do not get automatic syntax suggestions in jupyter (e.g., if I type pri and press tab nothing happens). Any possible reason(s) for this behavior?
- Is it not giving anything at all? It might just take a bit of time before the options show up.
- What happens if you type pr and hit TAB? Does it show you a list of keywords that start with "pr"?
  - - so that part seems to work, good. still curious why the tab completion of pri doesn't
    - this fixed it: %config Completer.use_jedi = False
      - Then this link might potentially help as well: https://stackoverflow.com/questions/33665039/tab-completion-does-not-work-in-jupyter-notebook-but-fine-in-ipython-terminal
so although we read the whole row and one column, python will give the result as row vector?
- Note that NumPy can also do 1-dimensional arrays, wich are neither row of column vectors. Row/column vectors are 2-dimensional arrays (n, 1) or (1, n). A 1-dimensional array is just (n,). When you select a single row, you get a 1-dimensional array back. This is different from MATLAB where things are always at least 2-dimensional, so you always need to know whether you have a row or column vector.
- There is no concept of row or column vectors. These are n-dimenional arrays. Some operations reduce dimensionality. The concept is more powerful that matlab's 2D-focused approach
Error got when trying astype('int'):
```
- ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-d90a92f4af50> in <module>
----> 1 b.dtype('int')

TypeError: 'numpy.dtype[float64]' object is not callable

This is the error I got when I tried doing .astype('int')
```
- You wrote b.dtype('int') instead of b.astype('int')
- I tried both, didn't work. I also tried without quotes, it returns arrays with all zeros
  - What is your b variable?, i.e. what happens if you just call b
  - Okay, it worked now (without quotes) after I redefined b.
    - Be careful though, without quotes it only works if the type is defined (like int is). float64 for example always needs quotes or has to be called by np.float64
  - Thanks! :)

Exercises until xx:55

https://aaltoscicomp.github.io/python-for-scicomp/numpy/#exercises-3
https://aaltoscicomp.github.io/python-for-scicomp/numpy/#exercises-4
plus advanced exercises

Note that if you registered, you can join your Zoom for exercise help/discussion/etc. (Sweden/Norway/Finland). Link by email.

np.add(a,b,c) cannot create the variable c if it doesn't exist already, right?
- yes
- but add takes two inputs, not three. Or do I misunderstand something?
  - c is the output argument - store in existing array rather than allocating new memory. c must exist and be of the right shape/dtype/etc.
  - but that would be c = np.add(a, b) or not?
    - c = np.add(a, b) creates a new array called c. np.add(a,b,c) uses an existing array called c.
      - maybe clearer to do then: np.add(a, b, out=c)
        
        That would be clearer, yes
        
        Image Not Showing Possible Reasons
        The image file may be corrupted
        The server hosting the image is unavailable
        The image path is incorrect
        The image format is not supported
        Learn More →
        
        the np.add(a,b,c) option might be more efficient since it doesn't need to reserve memory.
So if we do not copy the float array into a new variable, how do we add the int and float arrays and save into int array? a being float array and b being int array, np.add(a,b,b) does not work
- Thats because int + float will be cast to float. So you need to stoe it into a float array (i.e. ǹp.add(a,b,a)). casting to int would likely loose information, which numpy tries to avoid.
- Great, i thought so. So: Create an array of dtype='float', and an array of dtype='int'. Try to use the int array is the output argument of the first two arrays => means it won't work
Anything wrong with a+a when a is an numpy array? Is it better to do np.add(a,a)?
- These are two ways of writing the same thing. a+a will call the np.add function.
- (I know this is not the question but) in this case I would find it easier to read/understand to multiply/scale by 2 or 2.0 instead of adding to itself.
  - What I meant is that I would find it clearer to read b = 2.0*a instead of reading b = np.add(a, a) or np.add(a, a, b) +1

Pandas

(this is included under day 2, some feedback is under day 1)

Feedback day 1

intro, jupyter, numpy, pandas part 1

Where are you from:
Finland: ooooooooooooooo000000000000000000000000000ooo
Norway: ooooo0ooo0oooooooo
Sweden: oooooooo0ooooooooo
Europe: ooo
Other: o

Was today:
too slow: oo
about right: ooooooooooooooo0ooooooooo0ooo0oooooooooooooooooooooooooooooooooo
too fast: oooooooooooooooooooooooooooooooooooooooooooooooo
boring: o

Name one good thing about today:

Great instructos :)
NIce overview of what can be done
Informative references are introduced :)
I liked the step by step examples made it easy to follow and see the progress from basic to bit more advance
Basics were really covered
Good presentations
Good colaboration between teacher
very good session
Great lectures!
Discussion between two lecturers make this easier to follow because the speed is slower than when you have only one lecturer
Well explained and got a good idea about what can be done
pandas was super useful, so easy to import a csv file and get insights from it with just a few rows of code
excellent organization and use of right tools (twitch + hackmd)
The pandas section was great
Parallel zoom session available to get help and share screen if needed
Very nicely organised!
Introduction to Pandas was pretty awesome!
Well organized, answered all questions clearly, good overview, pace was slow but that was helpful to keep from getting overwhelmed from all of the info
collaborative teaching
good way to dip our toes
like the conversational way you did the instructions a lot! +1
definetely need to use pandas more, super powerful
i am enjoying the session. it was hard, but all things should be hard in the beginning
Good overview, well organised, good instructors. I like the way questions can easily be asked, answered and also reread later
Nice, calm speach from all presenters.
nicely scheduled
The use of real data in Pandas helps to show how powerful the library is.
Nice overview. Of course, if anything goes wrong with the exercises, there is no time for debugging. Anyway, this can be done after the lessons since materials will be avaiable.
Good depth of lecture content. Not too detailed but still a good overview that opens up to further exploration.
The coverage of tidy data. Often people get introduce to pandas but not the "tidy data" concept and then have trouble using it because they run out of memory or try to implement things manually that pandas groupby etc handles easily. I think a lot of learners benefit immensly from best practices like tidy data.
Great practice exercises and timing

Name one thing that should be improved next time:

Font size in the Twitch could be increased a bit
- agreed with the increased font size
  - Image Not Showing Possible Reasons
    The image file may be corrupted
    The server hosting the image is unavailable
    The image path is incorrect
    The image format is not supported
    Learn More →
10 min too fast for exercises. 15 min barely keep up (with basics knowledge). Anyway, homework to do.
If we could have a big picture of the whole packages and tools, it would be better. How the all packages and libraries are related? Numpy to Pandas to …
Why is the instruction file not a jupyter notebook? (i.e. copying between the open notebook and the material takes extra time)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- This could also be shared as a notebook for students
Leave the code what you wrote in live sessions visible for a bit longer (if they are not in the material)
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
more time to do exercises
numpy: I'd like to get more motivation on why it's important to know these linear algebra operations in Python
thans
Thanks!
I would like to have more time to do exercises.
very nice.
I totally enjoyed it. Things I found confusing were quicly explained and the examples, while simple, were very enlightening
Interesting format of dialog between teachers +1!
I hope you get more in depth with pandas tomorrow. Which is better to use between DF[] and DF.loc[]/DF.iloc[].
the time for the exercises was sometimes a bit tight…
I enjoyed everything
Going through the exercises would have been goodh <<<<–– I agree
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
There can be more exercises with solutions that can be studied offline
Please explain in more detail why the 'titanic["foo"] = "bar"' was selected as the data to be added?
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
I think the format of this training is "inefficient". And in a VERY good way! Not trying to cover everything gives more depth and also time for the students to grasp the important things
Some more time to work on the exercises would have been useful
%% was not recognized in windows I guess. So it would have been nice if we had another option.
I enjoyed the course. i know a bit about python but this was pretty cool to see different way to sort issues, as i always just go with the first solution that works :D i always struggle with doing the math on dfs so i hope we cover various ways on processing data. and are we going to do df.plot or also matplotlib?
Numpy could have been a bit slower in my opinion.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
In the first section the two teachers approach was sometimes confusing especially when both were talking at the same time - also some things seemed very improvised / stressed (maybe cover less material in a more extensive way or allocate more time?)
It was a bit confusing to have a course page and another page (do not even know what this is called). Why not have it all in one place/tab?
Maybe explain the paedagogic approach taken a bit better. I am unsure about the difference between lots of written content but then you go and pick and choose some subjects to present live. Is the idea to follow the course or read the course material?

Pandas

https://aaltoscicomp.github.io/python-for-scicomp/pandas/

can you show how to groupby more than one variable?
- To be honest, I never remember these things! I always have to check the documentation.
  - you can simply add a list of headers to the by argument (e.g. by=['Survived','Sex']), but you'll get 4 plots, so in the call you also have to adjust the layout.
Can you scroll up (how did you plot the diagram?
- can you check the lesson page (link above) for it?
- https://aaltoscicomp.github.io/python-for-scicomp/pandas/
Is it possible to have the y-axis range on both histograms be identical?
- we'll learn a lot about plotting and adjusting them tomorrow.
- You can use sharey=True
set_index kind of moved the "Name" field to the first column. It sound like that being the index is a sort of property of the column, is it?
- yes, It's a property of the DataFrame: basically choosing one "special column", so to say.
Why not titanic ["Heikkinen, Miss. Laina",:]? I know it won't work, but why?
The loc and iloc are somewhat unintuitive.
- as opposed to .loc/.iloc?
- The .loc/.iloc have the philosophy "explicit is better than implicit". The [...] interface did some magic kind of stuff: if it was integer, it would sometimes be selecd based on position, sometimes based on label, which had possibility for confusing. So it's a trade-off "simple but confusing" or "verbose and direct"
- This explains a lot, thanks! I somehow thought the loc and iloc were indexing, but they actually produce output, not indices
  - I actually still don't get it. Why, after changing the index to the Names column, can't we use the Name to retrieve the row?
when i try the command: titanic.at["Heikkinen, Miss. Laina", "Age"] I get the error: ValueError: At based indexing on an integer index can only have integer indexers
- did you run the titanic = titanic.set_index("Name"). If not, the index is a number by default.
- Ok, I see that I run it but forget to set it to "titanic", thanks for the clarification.
What if you want to use multiple filters? e.g. All passengers above the age 50 who are also male? Can you do this in one selector?
- yes, using &
- titanic[(selector 1) & (selector 2)] - I often forget the () and that makes a confusing error!
  - I had that same issue, forgetting the brackets.
  - same problem. Why are they needed? Does & precede the other syntax?
    - Yes, boolean operators are commonly ranked very high in operator precedence, so you would mess with your conditions.

(ask below at bottom)

Exercise until xx:47

https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-1

Play with loading data, read a bit more from the docs and see what you can find.

I never remember when I need () or []. is There a way to get Jupyter to suggest which one I should use?
- I don't know how to get auto-suggestiosn
- "()" is function call. "[]" is "mapping protocal"-slicing/extracting. But yeah, sometimes we all do it wrong, but it will be come more intuitive over time.
- The problem is, that an object can have fields that are "sliceable", and at the same time it can have functions. But in theory it should be possible to determine whether [] or () can be used. Worst case, try it, the error will normally either say "XYZ is not callable" (then you need [], or "invalid syntax" (most likely at least, then you try to use a function and would need ()).
- Yeah, I agree, it is confusing. It will get better, don't worry!
Is there a "preferred" way to specify both rows and columns? For example, titanic[1:10, "Age"] won't do anything, but titanic[1:10].Age does. And then, to use the values in calculations, you actually have to say titanic[1:10].Age.values
- I would readthe docs on the slicing to see recommended, I would cehck best practices now!
- using .loc / .iloc might make this easier. titanic.iloc[1:10]["Age"] might be what I use.
  - Seconded on titanic.iloc[1:10]["Age"] format. loc won't allow using lists of column names, but this formatting does.
  - .iloc works. This would be a major pain with a compiled language, but with some trial and error, pandas does seem powerful!
- titanic.Age[:10].mean()
- Ok, need to read more. There seems to be a long list of dot-separated columns, functions etc. which is pretty confusing +1
Why does titanic.groupby(titanic["Age"] > titanic["Age"].mean())["Survived"].mean() give different results than titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean() and titanic[titanic["Age"] <= titanic["Age"].mean()]["Survived"].mean() ?
- I'm reading this to try to parse it, unfortunately I need more time to understand it. Can you summarize? Is it > vs <=?
  – The first was my solution. The second and third were the model solution from the website. (Replacing < with <= doesn't change the result in the third.)
- The first is a very odd selection. You select by those who are older than the mean age vs those who are not. By taking survived from there, you select the survived column, which are 0 and 1, and you calculate the mean "survival rate" of those two age groups (i.e. the younger and older than average groups) (I think). The second gives you the mean survival rate of the above average aged (which is the same as one of the values from the first call). But I agree the third should probably be the same as the other value, which it isn't
  – I found the solution: Apparently groupby doesn't discard NaN's by default, but places them in some group. So here the problem seems to be that the survival value from those individuals whose age is unknown is placed in the false category. By choosing only those passengers whose age is known, the result will be the same as in the model solution:
  titanic[~titanic["Age"].isna()].groupby(titanic["Age"] > titanic["Age"].mean())["Survived"].mean()
Comment: I found that I had to update both pandas and matplotlib in order to be able to get plotting to work after importing pandas… AND I had to add %matplotlib inline, My system is linux and mint
- Thanks for pointing it out. Do you know which versions of pandas and matpltlib did not work?
- matplotlib 3.4.1 -> 3.4.3
- pandas 1.2.4 -> 1.3.4

Day 2 icebreaker

What's your favorite language besides Python?

R : ooooooooooooooooooooo
C : oooooooo
C++ : ooooo
matlab : ooooooooooooooooooooooo
julia : oo
rust : oo
Fortran: oooo
Java : oooo
JavaScript : o
COBOL :

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

bash: : oooo
Perl : oo
Haskell :
Amiga E: o
Go: o
Markdown:o
NCL: o
None of the above, I only know Python: oooo

Advent of code
- you can also solve puzzles from past years, heaps of fun
- I'm defenitly going for this!
  Image Not Showing Possible Reasons
  - The image file may be corrupted
  - The server hosting the image is unavailable
  - The image path is incorrect
  - The image format is not supported
  Learn More →
- It's also fun comparing/discussing solutions with others. Great way to learn/improve a language.
Is the sound skipping for anyone else?
- not for me
- Had to reload twitch to get the stream going. Sound is good. Even level. (Oslo)
- good quality here (Helsinki, home connection with DNA), sound levels a
- Uhm… is there any sound at all?
- yes, but you have to unmute. Twitch is muted by default
  - oh… I guess I still have to wake up
    - nah, happens to me as well every single time.
  - restarting solved it thanks!
levels are ok.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
hello cat

cool thanks for the answer! much appreciated

More Pandas

We are somewhere here: https://aaltoscicomp.github.io/python-for-scicomp/pandas/

Can you input a list (or similar) to extract several rows based on the list content?
- I know you can with columns, likely rows too. can you try and report?
- yes (as demoed now)
Does index always start with 0 in Python?
- Yes. Though… something could use a different convention.
I would really love to have a way to get the column names from pressing the tab key (for example) when writing e.g., titanic["S{press_tab}"] or any other way than having to "hard-code" them or remember them. Is there any way to get this?
- If you know the columns, they are fields of the dataframe. i.e. titanic.A{press_tab} would give you titanic.Age, which is the column.
- and then I guess you could add the [""]
  - What doesn't work is selecting multiple at once (but you can always use "dir(titanic)" to look at the fields, or "titanic.head()" to get the column names)
Can pandas make a N/dimensional dataframe https://stackoverflow.com/questions/36760414/how-to-create-pandas-dataframes-with-more-than-2-dimensions
- It used to have a "Panel" data structure but I heard they were removing it as not fitting the model. There is xarray that provides labeled multi-dimensional structures, which I think would be the current recommendationon
- Marijn: basically you don't. If you have more than 2 dimensions, use NumPy. Remember that each row is a datapoint and each column is a variable/property of that data point. If your data does not adhere to this scheme, you probably want to use something else than Pandas.
Left speaker is much louder than right - ok not much
- Is this true for others?
  - Jarno is now a bit louder, but not much.
titanic.loc['Lam, Mr. Ali',"Age"] -> how does the interpreter know which column the name is in?
- the name column has been made the special "index". And this is part of the magic of pandas, it's provides these convenient semantics
- Age = Column, "Lam, Mr. Ali" = row, just wondering whether you want to know how it knows the row or the column.
  - If you use .loc, you first specify row and then column. df.loc['Row name', 'Column name']. If you only specify a single thing, that will always be interpreted as a row name: df.loc['Row name'].
- but do I need to define it first? I'm getting KeyError
  - yes, you need to run titanic = titanic.set_index('Name')
  - thx a bunch!
Is there some logic in using single and double quotes? Both seem to work
- in Python they are completely equivalent. Use whatever makes sense (or " to quote ', etc.)
- Python used to use single quotes for strings, but double quotes were added since it's what most languages use and well, "older" people tend to stick to their single quotes
  - The current system has existed since at least python 2.2
i tried to take the runners from material (both the origival and then change it) but it gives an KeyError "The following 'value_vars' are not present in the DataFtitanicrame: [400, 800, 1200, 1500]"
- same
- the example does not need the names of these columns. The documentation (Shift-Tab) explains it pretty well.
- there is a typo in the material I believe. The melt function should be given 'runners' instead of 'df' for first argument +1
Could you please (again) explain the difference between "pd.merge(df1, df2, …) and df.merge()".
- Practically the same.
can you explain the 'inplace' option for merging? are we creating a copy here: runners.merge(age, on="Runner")?
in R (Tidyverse) we have left_join, right_join, full_join and such. How does one do this in Pandas?
- Various options on merge, how=, and so on.
Can we have copy of the scripts after every lesson?
- please
  - They are on the course material normally https://aaltoscicomp.github.io/python-for-scicomp/pandas/
- thanks for the response, but there are times when they do scripts that arent written in the original material. And they are very good
  - ahh ok. We'll ask if the jupyter notebooks can be provided
    -thanks so much o
In the line "titanic["Child"] = titanic["Age"] < 12", the excuating order is from right to left? I mean, "<" goes first and "=" goes second.
- Yes, assignment is commonly executed last. its essentially the same as a = 2*3

Can you show how you generated your runners dataframe?

runners = pd.DataFrame([
         {'Runner': 'Runner 1', 400: 64, 800: 128, 1200: 192, 1500: 240},
         {'Runner': 'Runner 2', 400: 80, 800: 160, 1200: 240, 1500: 300},
         {'Runner': 'Runner 3', 400: 96, 800: 192, 1200: 288, 1500: 360},
     ])
     
runners = pd.melt(runners, id_vars="Runner",
         value_vars=[400, 800, 1200, 1500],
         var_name="distance",
         value_name="time"
     )

Thank you
- As mentioned above there is a typo in the script (df instead of runners) when melting the dataframe.
- Ah! that I missed

Can we add more information to the groupby table. Now we calculate just the mean. can we add the sd.dev etc? How?
- Yes, it is possible. I always search groupby docs and you can find how to do it
- For example, grouped.agg([np.sum, np.mean, np.std])

Until 10:47

https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-2

Note that if you registered, you can join your Zoom for exercise help/discussion/etc. (Sweden/Norway/Finland). Link by email.

Is it possible to select a selection of rows, e.g. 10:40 and 60:100 at the same time? If yes how to do it in a single line?
- This seems to be able to do it (this is a good question, I am not entirely sure): https://stackoverflow.com/questions/47965716/how-to-select-multi-range-of-rows-in-pandas-dataframe
  - Thanks for the link. I tried it once like that, but always got and error that the slice is not possible. Would be great if somebody how know could help.
Is there a way to access a function documention in jupyter similar to the way you access R documentation with F1 in R studio?
- Yes, is it ? on the end. Something like that.
Problem: type(titanic[titanic.SibSp == 8]) gives DataFrame, but titanic[titanic.SibSp == 8].Name or titanic[titanic.SibSp == 8]["Name"] returns "'DataFrame' object has no attribute 'Name'" even though printing titanic[titanic.SibSp == 8] does show the column "Name"… what am I doing wrong?
- I get the same error. Is this because the Name column is now the row index?
  - Could be… with columns different from "Name" it works as expected. Try "Sex" or "Age"
  - Thanks, then it works. But what if I would only like to extract the row indices (here: Names) for the rows that match my condition?
  - The provided example solution for this does not work. I ask that the instructors address this before we move forward.
  - Actually, this is very bad… if you make one column the index, that column stops working as the others, apparently.
- I wonder what is the opinion of Pandas' experts on this behaviour? Any suggestion?
  - My (non-expert) understanding: titanic[titanic.SibSp == 8].info() shows no column named "Name" so, yes, there is no column "Name". The same applies to titanic.info(). So, yes, the index column is NOT a column at all!
    - Indeed. The index is not a column. It is available through df.index.
- Sorry about the exercise solution being wrong. We'll address it asap.
- Is it possible to change the index "column" into a column or make a reguler column become index?
Q: Is Pandas inherently 1D? We have seen examples of only such datasets. Also the "tidy" dataset sort of implies that the things on grids are not a thing in Pandas. Is it true?
- It did but it's not recommended anymore, see the note at the top of https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Panel.html
  - Thank you for the above answer, but what did it? Could you rephrase the answer, please?
- pandas Panels supported multi-dimesnional data. Now it says to use either multi-dimensional indexes or xarray.
Is it possible to use a .txt file with Pandas or should I convert it to .csv somehow?
- The .txt can have arbitrary contents, so pandas itself cannot guess what is inside if it is not well formed. If it is a complex structure, perhaps you need to first read it in with a custom python routine, and then pass on the data to pandas.
- pd.read_csv('file.txt', sep=' ')
  - Thank you! :)
In exercise 2, what does the actual array mean when you write titanic.SibSp.unique() ?
- array([1, 0, 3, 4, 2, 5, 8])
- is this the series index of what exactly?
  - No.
  - It's a Pandas Series object (or perhaps a numpy array?), containing the unique values in that column.
- Ah! So what defines a "unique()"?
  - unique([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]) = [1, 2, 3]
- ok so it selects the unique values only
Is there a way to undo the "set_index" thing? See above about the effect of having set "Name" as the index of the Titanic DataFrame.
- there is a method .reset_index()

maxFamily = titanic["SibSp"].unique().max()
titanic[titanic["SibSp"] == maxFamily].reset_index()["Name"]

Break until xx:02

Hi, will we have access to recordings of the lessons?
- Yes! They are available on Twitch for a while and will also be made available soon on YouTube. The lessons from yesterday are already there: https://www.youtube.com/playlist?list=PLZLVmS9rf3nOS7bHNmbcDoyTnMYaz_TJW
  - thanks!
  - links from course webpage

Visualization

https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/

what if we remove '*' from matplotlib.inline?
- what do you mean?
- what if I don't use %matplotlib.inline
  - it may not be needed anymore. in my notebook I don't need it but I hear that in some notebooks it is needed to see the plot appearing in the notebook.
In the example: why do you save fig and not ax?
- In general, saving only ax will only give you the individual "subfigure", so if you have multiple axes you want to commonly save the whole figure
- Can you even save a single axes?
  - You are right, you can't. You can only save the figure, at least from what I can see.
can we set DPI when saving the figure?
- yes, there are ways to adjust size and (effective) DPI
I am always confused between fig and ax. Both things allow labels. When do you use what, please?
- ax is the x/y grid something is drawn on
- fig is the complete canvas, that can contain multiple axes
  - e.g. if you have multiple axes on a figure (i.e. multiple subplots), you can have an overall title of the whole figure and individual titles for each axes/sbuplot in the figure
- There is matplotlib.pyplot.xlabel and there is Axes.set_xlabel. Is the first one for 1 picture or the overall fig, and the second command for the subplots?
- Thank you!
Are there industry standard names for what to import the packages as? e.g. should matlab always be imported as plt?
- I've never seen any other name for it.
- It is just convention that many people use. It can be good to keep that since then others reading your code will be familiar and not surprised.
- matplotlib.pyplot is imported as plt by convention/habit. plt is a useful search term which will filter out matplotlib.pyplot relevant help
  - But its the pyplot package from mattplotlib thats imported as plt.
    - True
  - Yes, I always wondered what else is there in matplotlib.

Exercise until xx:30

https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/#exercise-matplotlib

This may be simple for some people, but it gets us started. Explore and experiment, most people can probably complete it.

Say I feel there is a way to set the labels to each series, but I do not know/remember how to do it. What is the best way to find out how to set the labels (or do anything else)? I tried ax? and ax.scatter?, but it's too long and not specific about labels. I've tried to add , l after the color spcification of a series, and then pres tab, but it does nto suggest "label". So, how am I supposed to find out that I should write "label='blabla'". This is about how to find documentation…
- In jupyter, you can add a ? to an element, to get documentation info. or ?? to get the source code (if available). e.g. fig? or fig??
- I'd google: "How to label an axis in matplotlib"
  - yes, google, but I'm after a more direct way. For example, in other languages the IDE gives this kind of information while I write the code. this is important because it is bound to the actual version of the library I'm using. Different versions might have slightly different functionality and Googling the doc of that specific version is just a nightmare.
    - How about the inspect (Ctrl-i) in the Spyder editor? I find that helpful
    - Also, in any language, you will need to know a few things about how to get to something. In general, you could use dir() on a variable/package to show all possible fields/functions in python. d
      - In this case what should be the argument of dir()?
        
        dir(ax), but it will list a lot of things.
        
        yes… a bit too many…
        
        but thats the same as in any language if you have an object with lots of fields/functions.
        
        true, but this should be a rather simple thing to find out.
        
        Well, you can do ax.{tab} in jupyter and it will give you options. or ax.s{tab} for everything starting with s. In general I would assume somethig to set the labels to be under ax.set… so I'd tab with that and have a look if there is an option.
        
        from doing this I see ax.set_label (which is not the solution) and ax.label_outer (which is not the solution, either). So, not helpful.
Bottom line: is Google the only way to answer the question efficiently?
- Well, you can directly go to the matplotlib tutorials/documentation and look there, but I wouldn't know of any language where you can get this information without looking at the documentation/definition of the relevant functions/objects.
  - Eclipse IDE does this. Tabbing in the arguments list gives you the possible options with their respective documentation. Please note, the point is not skipping the documentation (not at all!). The point is quickly find the right bit of documentation.
    - But that only works on a typed language, not on an untyped one. Because without actual execution you don't commonly know what you have there.
    - yes, indeed. That's why I ask for alternatives that could work here.
      - Maybe spyder could help a bit, but I'm not sure. In general, python is very much forward and backward compatible, so interfaces don't change, so the versioning issue is not that big of an issue commonly.
      - agreed, though I find this useful also for dinfing out how to do things, e.g., answer the qestion "how do you set the labels of a series?"
my 50cents would be to use vscode with wsl if you are after IDE functionalities
Could you give some opinions on best practices? E.g. is it better to do fig, ax = plt.subplots() or plt.figure() and calling next statements like plt.(...)?
- just found it in the docs
when making the data2_y_scaled, how does it know that the values are y-values? I am referring to the suggested solution which actually multiplies the y with 2, for y in the list.
- we can call this any way we like. but they get plotted as y-values because we pass them to ax.scatter as y=....
  - Okay, so I could have called it x*2 for x in …? I am just confused because we called it y, and I do not believe it should know that this will be y just yet.
why I couldnot use data2_y_scaled = 2*data2_y
- this is because data2_y is a Python list and not a Numpy array that we have seen yesterday. With a Numpy array we could have done that. And perhaps we should have used Numpy arrays in this example. I was unsure what would be less confusing when preparing this example.
- it means we have to use a for loop to scale and put the results in anothe variable
Is it possible to define a customized color palette?
- yes. https://matplotlib.org/stable/tutorials/colors/colormaps.html Though it takes a few lines
- thanks!
is it possible or smart to include all data sets in one scatter call?
- In particular so that you can add different colors?
  - once it's more colors than perhaps 5, I find it hard to "read" the plot ("overplotting") and then it might be better for understanding to split plots up into separate plots or to use something different than a scatterplot. But it is possible. Generating plots is a bit of an iterative process where we may need to change to a different plot kind once we realize that it's not understandable anymore.
why we dont put plt.view()
- The %matplotlib inline magic command is doing that for us. When the last line returns a plot object, Jupyter will display it.

Matplotlib's two interfaces

Styling and customization

Exercises until xx:00, then break until xx:10

don't forget to take your break!
Customization-1, Customization-2, and Customization-3
https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/#exercises-styling-and-customization

Take the example, search the web some, and try to do what it says there. Mostly exploration: you probably won't have enough time to do everything, but take your pick of what is interesting.

Coming up: Data Formats at ~xx:20

Is there a difference between changing the scale of the x axis using the suggested solution and "plt.xscale("log")" in terms of these two commands giving different outcomes in a different setting (e.g. plotting a gallery of plots)?
- It might make a difference if you have multiple plots and you want the first to have log axis and the second linear. I am not 100% sure but I recommend to test this out. Modifying only one particular ax has less side effects on other plots which might be in the same notebook later.
- Ok, thanks a lot. Then it seems like a safer option to change the axis rather than using plt.
  - yes, most of the time plt works fine and is simpler but the "ax" one is a bit more explicit and safer. I now use the latter.
when I changed the font size, the plot didnt appear automatically, why?
- you needed to re-run the cell, right?
- I put it in different cell, do I have to rerun from the beginning?
  - you can either run cells individually. but it is good practice to run all cells before saving or sharing notebook
what is the difference between using "size" and "fontsize" here? I tried both and it does the same thing, I think.
- They are the same, search for "fontsize" here https://matplotlib.org/stable/api/text_api.html#matplotlib.text.Text
Is there a way to have all tick parameters in one dict?
- Can you please give more details about this? Is this about seeing all the possible settings/names to know how they are called so they can be adapted?
- So it would be nice to just have 1 dict object for all of my tick parameters so that I could just run one instance of ax.tic_parms(dictionary) or something like that instead of as in the example solution one line for each tick parameter
  - I see. Thanks. Looking for a more elegant solution than the one provided …
    - Browsing documentation I don't see examples of setting all in one go but what I would do now is to define my own function where you pass in ax and a dictionary, and then inside the function you set the key-value pairs one by one.
Seaborn have a way of setting color scale to coloblind suitable
- great! Can you please share a link to an example?
- https://seaborn.pydata.org/generated/seaborn.color_palette.html
  - thanks
- Ex.
  import seaborn as sns
  sns.color_palette("colorblind")
I'm trying to run the seaborn example. do I have to install seaborn first? It doesn't produce an output when i run it?
- It should be part of anaconda. Does it give error or just no output at all?
  - no output at all
    - that's "good" - this means the problem is not installation
    - ok, looking … (might be something like missing magic, wasn't needed in my notebook but might be needed …)
    - does placing %matplotlib inline on top solve it? (seaborn uses matplotlib underneath)
      - no, still no output. but I think it is something with my settings. no cells produce output now … hmm
        
        in this case I would try to Run -> Restart Kernel and Run All Cells
        
        now i get errors where everything ran fine previously
        
        what error (example) are you seeing?
        
        for the seaborn example:

AttributeError                            Traceback (most recent call last)
<ipython-input-9-5ee3a1b86e89> in <module>
      4 import seaborn as sns
      5 
----> 6 sns.set_theme()
      7 
      8 # Create a random dataset across several variables

AttributeError: module 'seaborn' has no attribute 'set_theme'

OK, looks like it's a different seaborn version. I would them remove this line. This is only to set or change the overall theme. Does it run without? You can also try to import seaborn; print(seaborn.__version__). In my case it gives: 0.11.1.
- 0.9.0
- i get error on the palette now: colormap light:g is not recognized; worked when changing colormap
- OK sorry for the troubles. Looks like my example does not work with older Seaborn. On Thursday we will talk more about how we can avoid this trouble. This is very typical: "works on my computer" and we need a more robust way to share our work so that it also works on other computers.

Data formats

https://aaltoscicomp.github.io/python-for-scicomp/data-formats/

what would be good for collecting data that has timestamps? (e.g. what and when someone did something in a test)
- that depends a bit on what you want to do with it later on.
  - I want to connect it with data collected with other means (e.g. eye-tracking glasses…)
    - it really depends, on whether you later on need to just add more info to one subject, or if you also want to compute the same piece of information from multiple subjects. E. if it is mainly about data collection, you might want a dictionary of {"ID" : {"Age" : 12, "eye_track_data" : ...}. But if you need to compare in between subjects, you might want to use a pandas dataframe, where you have e.g. 1 Column for each type of data (or, if multiple entries for each type are present, the column can be a list that contains all individual entries, or again a Dataframe with moe info on the individual Data).
      (Thaks!)
      But overall: I would try to adjust my way of storing data to how the data is going to be used in the end, and think about optimizing data access based on that. Potentially (if size of the data is not prohibiting) even storing it in multiple synchronized copies, that are fit for specific tasks, but that only if you are sure that you will work (i.e. load) with that data a lot…
It is possible to store as well arrays in a cell of a pandas dataframe.
- yes, you can store numpy arrays in a pandas dataframe. they don't even need to be of the same size
Why is the shape of a 1D numpy array (4,) and not (4)? Why the "extra" comma?
- The comma tells Python that what you wrote is a tuple, and not just parenthesis. (4) is the same as 4.
  - And (4,) is different from (,4), I guess?
    - Yes. (, 4) generates a syntax error
  - Is (2,3,) ok?
    - Yes. You can try it in your notebook
    - (2, 3,) is a tuple of length 2 (not 3) and identical to (2,3)
semantics vs structure (discuss at end)
what does the !head mean on the demo? Is it magic showing of the head of the file?
- "head" is a unix command that means show first lines. ! means "run shell command". Mac and linux at least
  - Do we need the magic installed on or os to make that work?
    - It could be that head only works on specific Operating systems. possibly not on a windows system, that doesn't have bash commands installed. But then, the command itself is not that important and only there to show what is in the file.
    - Good point! Tip for running linux on windows is WSL
Sometimes data (outcome of some software let's say COMSOL) has weird sepration. It has tab sepration and space too. How to you handle that kind of data with multiple seprators?
- You can give pandas multiple separators when reading e.g. csv files. So if the COMSOL format is kind of a spread-sheet format (i.e. you have a fixed number of columns), then you could use pd.read_csv('fileName', sep='\t '). Maybe have a look at the read_csv (documentation page)[https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html], since there are a lot more options you can set (skip comment lines, define headers etc pp)
could you use windows functions
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- The instructors are not on Windows this time, but we should find operating system independent options.
- There are Python libraries that do similar things. Something like open("filename").readline()[0]
- I'm not sure what windosw equivalent is, other than openining it in an editor. Probably is something
- if you are on powershell (depends on the jupyter configuration I guess), you could use type file.csv -Head 3, but I can't check that since I'm on linux myself.
What is the difference between the % and ! in jupyterlab? Something on the operator side?
- ! will call system commands. % will call jupyter magics (built-in command in jupyter)
Another data format I have used is pickle, wonder if that's any good +1
- if it's python pickle, than you are essentially storing binary data, which is good for loading data and should be reasonable efficient. But maybe Simo can write something more about this later on.
- - Pickle is mentioned here
netcdf is great in that it's kind of well-defined, but typically the libraries are a bit annoying in that you have to build the structure very explicitly. I.e. you can't just say "save this", but always specify names, dimensions etc. I wonder if it would be possible to create a library that would make reasonable assumptions from the data itself?
- thanks! so its like "explicit is better than implicit" or something?
  - yes, very explicit. But that also means that it's very annoying for "just saving your work", always quite a lot of work. But then, that's not a problem (rather a benefit) when you carefully craft something for a repository.
    - xarray has a an easy interface for netcdf
How to judge if a format is expected to take-off and be alive long in the future, or is going to desapear in few years?
- Impossible :) But commonly, if a format gets adopted by enough people there is a chance that it will stay around.
- If it's open source,

Feedback

Today was:

too fast: ooo
about right: ooooooooooooooooooooooooooooooooo
too slow: oooo

I wish for:

more exercises : oooooo
more time for exercises : oooooo
more discussion : oooooooo

longer workshop duration: ooooooo
slightly longer workshop duration: ooooo
about right length: ooooooooo
shorter workshop duration: oooO

more in depth on topics: oooooooooooooooooo
about right depth of topics: ooo
less in depth on topics: o

recovery points to get back on track:o

If you have a multiple people watching one stream, how many are/were watching:

2:ooo
3:
4-5:
6-10:oo
10-20:
other, specify:

One good thing about today

Excellent walk-through of data storage format
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
illustration examples were good. Starting from a working example, it was close to what I'd expect to do while working with actual data. I'm kind of "examples learner"
.idea of using the existing galleries of libraries such as matplotlib and seaborn
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
.matplotlib was really good. Specially all the sources for style and configuration. Great! :+2:
matplotlib was really useful. This was exactly what I wanted to learn from this course.
data format examples were good, I haven't really thought about how well the precision is preserved
pandas introduction was great!
Sound improved. Matplotlib useful
Matplotlib was great!
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Sound
Really liked Matlibplot section. very useful. Lot of tips!
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
very interesting to hear about different datatypes before collecting the data
Interewsting discussion about data formats
Pandas & Matplotlib explanation and excercises
Matpoltlib and data formats
Going through the examples provided in the materials was useful, .e.g., that of Matlibplot examples and data formats
practical reminders of how to approach using Matlibplot (e.g., take an example and play around)
matplotlib exercises

One thing that we could improve

. scroll slower. now your bottom page is something inrelevant (the course material), and the command you wrote will be very fast hidden under your faces +1
let your code examples be on the screen longer oooo
the last session could have been moved before as it was theoretical and maybe better to hear something like that in the beginning when you are more focused +1
a bit slower overall; scrolling and talking slighlty too fast
Focus more about datatype with more examples.
use more general commands (i.e. Windows/Linux issues)
Explain how to find answers/solutions by ourselves rather than relying on copy-paste.
The beginning with pandas was repetition from yesterday, I didn't understand why. Also the two instructors doing pandas were slightly talking at the same time. Rehearse your script so this doesn't happen.
Stream the JupytherLab instead the lecture notes, so we can copy the examples done during the explanation, if not too short to follow the examples during the explanation. Also a bit further explanation on how to use NetCDF4 and load those files.
Start with an intro - "Good morning" and this is what we will do in next 3 hours would help as I want to know the journey through material is clear (and had to look at notes instead).
If a longer workshop is considered, then I think your style of selecting topics works well. Maybe even splitting the topics. But in general, your style of independent lectures works well, avoid anything that refers to previously seen examples etc.
sometimes the talking head at the top hid the code used for printing something, and I got stuck because I had forgotten to look what is there +1 +1
In the beginning, there was a bit too much jumping back and forth in the code with false starts, which was a bit difficult to follow at times
Yesterday had more time for excercises. Yesterday was a better mix.
Some, if not most, of the last section was reading what was already written there out loud.
That never-ending exercise session sounds good, but probably can't be sustained for long. But it certainly would be nice to have something like this hackmd continuing and have some experts just miraculously appear to answer any difficult questions…
- At least for Aalto we do have a daily support zoom-call where people can come and ask questions. Not sure about other universities

DAY 3 & 4

Please visit this link https://hackmd.io/@coderefinery/python2021archive_pt2

Python for SciComp 2021 (archive)

Icebreaker questions

Where are you from and what is your background?

What's your Python experience:

Questions during introduction

Jupyter

Exercise session, end at xx:45

Numpy

Numpy exercises until xx:32

Exercises until xx:55

Pandas

Feedback day 1

Name one good thing about today:

Name one thing that should be improved next time:

Pandas

Exercise until xx:47

Day 2 icebreaker

What's your favorite language besides Python?

More Pandas

Until 10:47

Break until xx:02

Visualization

Exercise until xx:30

Matplotlib's two interfaces

Styling and customization

Exercises until xx:00, then break until xx:10

Data formats

Feedback

Today was:

I wish for:

If you have a multiple people watching one stream, how many are/were watching:

One good thing about today

One thing that we could improve

DAY 3 & 4

Read more

2025 Sep/Oct CodeRefinery workshop

CodeRefinery at PDC summer school 2025

CodeRefinery project meeting at NeIC AHM 2025

CodeRefinery meeting notes