
Python for SciComp 2023 (archive)

Please do not edit above this

Updates

02/11/2023 - New schedule at https://scicomp.aalto.fi/training/scip/python-for-scicomp-2023/#schedule
15/10/2023 - Lots of partners added
27/09/2023 - Registrations are open. Please tell your colleagues!

Test these collaborative notes

  • This is a question or is it?
    • I think it is
    • I have a different opinion
      • why?
  • This is more of a comment
  • Add your question here

Questions from registration

  • Will you record the course?
  • I would like to ask whether this course will be held in person or virtually?
    • The course is online only. We have heard that some participants have organized in-person rooms for watching the stream together. Get in touch with your local peers!
  • Would it be possible to watch the class via Zoom or is it only possible to connect via Twitch?
    • We try to avoid Zoom for two reasons:
      i. we can reach as many learners as possible;
      ii. we can record the stream without needing to make sure that no participant's voice or face ends up in the recording.
      You can interact with us via this document; if you need more help (e.g. with installations), try to connect with local peers/local support and follow the course together. You can also ask your local organization to organize a Zoom to do exercises together.

What would you like to learn? Summary from the registration form

This is a summary with the help of Llama2 :)

Most mentioned learning goals:

  • Refresh python basics
  • Move from Matlab to python
  • Pandas, Numpy, and how to handle large datasets (e.g. timeseries)
  • Matplotlib
  • Scipy
  • Discipline specific tools for bioinformatics/AI/MachineLearning/Image processing/RNAsequencing/Omics/Physics/Statistics/Cython
    • Unfortunately we will not cover these due to the wide audience, but hopefully you get some ideas on how to get started!
  • Parallel python, GPU programming with python, OpenMPI
  • How to write sustainable and reusable code
  • Best practices in code writing
  • Dependency management
  • Advanced plotting (e.g. 3D) and interactive visualisations
  • Script writing and automation
  • Exporting and sharing of results
  • Data conversion


Icebreaker questions

Where (country/) are you following from?

  • Finland, Espoo
  • Finland, Oulu +1
  • From my office in Finland :) +1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1
  • Germany +1 +1 +1 +1
  • Japan +1+1
  • Palestine
  • Norway +1+1+1+1+1
  • Finland, home +4+1+1+1+1+1+1+1+1+1
  • Finland, office +3
  • Sweden! +3+1+1+1+1
  • Netherlands, Finland +3+3+1
  • Sweden
  • Helsinki
  • India, home +1
  • University of Helsinki
  • Helsinki, office +1
  • Poland, Wroclaw, home :)
  • Office, Viikki
  • Sweden, Stockholm
  • Finland
  • Japan
  • Ethiopia
  • Belgium
  • Finland, Oulu +2+1+1
  • Finland, Oulu at university
  • Finland, Oulu at university
  • Finland, at home +1
  • the Netherlands+1+1
  • Latvia
  • Finland, Helsinki
  • Helsinki, Finland
  • Finland, office
  • Sweden
  • Finland
  • office at aalto +1
  • Sweden
  • United arab emirates
  • Italy
  • Spain +1
  • Finland, Tampere
  • University of Helsinki
  • Mexico, San Luis Potosí
  • Finland, home office
  • Iran
  • Finland
  • Sweden
  • Espoo, Finland
  • France
  • Aalto offices
  • Iran, But now in Espoo, Finland
  • Finland
  • Sweden, Umea
  • China
  • Finland

How much have you used Python before?

  • Somewhat in work projects and a little bit off work. Had some basic courses as well. +1

  • A bit in my own projects+1

  • A fair amount +4+1+1

  • Quite a bit +1+1+1+1+1

  • Not much, just the basic syntax +1+1+1+1+1+2+1+1+1+1+1+1

  • 20 years, still learning and having fun

  • Introductory material online +1 +1

  • just once

  • A bit, but not much through Jupyter +1

  • Not much, starting to get into it +1

  • A few years+1+1+1+1+1+1+1

  • A bit of knowledge +1 +1 +1+1+1

  • In my work for about 10 years

  • Almost none+1

  • For half a year at this point +1

  • a little bit +1+1+1

  • Nothing = 0 min

  • 2 university courses on python basics

  • written a bioinformatic pipeline

  • A few years +1

  • Just started using it since 4 months +1 +1

  • Use it almost always

  • Just for some basic calculation

  • A few months

  • A few years

  • in my current and previous project +1+1+1

  • A year and a half, almost daily for data processing

  • Just started two weeks ago

  • I took an introduction to Python for data analysis.

  • Took a couple courses at uni. I have used it for some work/personal projects as well

  • Just a little bit, for API, and some other learnings with SQL

  • A couple of university courses

How much have you used Jupyter before?

  • I'm familiar with it. +1

  • Almost nothing+1+1+1+1+1+1+1+1

  • Quite a bit (JupyterHub, JupyterLab, Jupyter Book, Jupyter Notebooks) +2

  • I use it weekly +1+1+1+1+1+1+1+1+1

  • All the time+1+1+1

  • Not much

  • Very few times on university courses +1

  • just once +1+1

  • Not too much+1+1

  • Never +1+1+1+1

  • Every now and then +8+1 +1

  • I've already used it sometimes +1+1

  • Not at all, except setting it up for this course

  • I have used it as part of some courses

  • I used Jupyter for data analysis

  • Quite frequently

  • a little+1+1

  • not at all+1+1+1+1+1

  • not much

  • Little bit

  • used it for a basic course

  • I've only used the jupyter extension on vscode

  • I've used it from the start

  • not much

  • not too much, I usually write scripts as text

  • Jupyter is my main tool

  • Almost never

  • I use it once a while

Do you have any favourite JupyterLab extensions and why?

  • No, not that experienced with Jupyter +1
  • jupyterlab-code-formatter so that I don't have to think about formatting
  • no, I just use the plain jupyterlab and maybe Git extensions if that counts
  • no+1+1+1+1+1+1+1+1+1
  • I do not even know what those are +1+1+1+1+1+1
  • I don't know what it is, but I use Jupyter notebooks quite frequently.+1+1+1
  • I don't know about those either, but I use Spyder
  • No
  • No
  • No
  • No
  • No
  • I don't know what those are
  • no

Ask more questions here and test this notes document

    1. Will this course give an introduction to object-oriented programming in Python?
    • Is this a question or did this belong to some of the ice-breakers above?
    1. Can I use the cloud-based JupyterHub?
    • for this course you surely can.
    1. What is your opinion about Julia?
    • it's a really nice language with a growing ecosystem of libraries. nice alternative to Matlab or Python
    • I like the strong type system and good tooling that it provides
    • I normally use R, haven't tried Julia yet, is it worth it?
      • a deciding factor could also be what language your collaborators and colleagues know and use so that you don't end up working in isolation. but if this is for a single-person project, it can be fun to experiment with a new language.
    • I have used it for one course for optimization, I don't have a strong opinion about Julia
    1. Can I use an IDE like PyCharm instead of Jupyter Lab?
    • If you have access to JetBrains products you could install DataSpell and open Jupyter notebooks there :) (Good to know, but it's quite expensive haha)
    • In principle, you could also use IDLE (which is included with Python)
    1. I use DataSpell to open my Jupyter notebooks and I want to install SciPy to use it. However, when I type the install line into the Python command prompt it does not recognise any line, not even the ones to install pip. What could be the problem here?
    • I am not familiar with DataSpell, so it might be related to that? If you have miniconda or anaconda installed you could test jupyter notebook outside other IDEs to make sure the problem is not related to some system-wide installation issue.

Introduction

    1. Would there be any insight into Numba?
      • Numba will be mentioned but we won't do exercises on it
    1. Hello, I am new. Where should I start?!
    1. Hi, I just joined. Which data set are we working with, and where do we download it?
    • Just listen to the stream, everything will be mentioned on there. Up until now it was mainly general introduction.
    1. How about Cython?
    • as far as I can see only mentioned and discussed
    • Yeah, so many different things we can't teach. Ask us about them in the panel discussion.

Jupyter

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#jupyter

  • University of Helsinki (Turso) users can try the Jupyterlab at: https://hub.cs.helsinki.fi

    • Ignore the messages that: "JupyterLab build is suggested: @elyra/r-editor-extension needs to be included in build". Just click Cancel.
  • Installation instructions: https://aaltoscicomp.github.io/python-for-scicomp/installation/

  • Note: we assume you have Anaconda and JupyterLab set up and can start them quickly. We've got a little bit of intro here, but if it's completely new you may need to play with it a bit more after the course.

    1. My bullets don't become bullet points in Jupyter
    • Can you copy/paste what you have written?
      ​​​​​​   paste your code here
      
    • Have you switched the cell to Markdown mode?
      • Thanks I figured it out now
      • Needed to add a space ;)
    1. I am working in a cloud-based JupyterHub and I cannot open a notebook (.ipynb) tab. I have a terminal and a Python file ending with .py, and I lack the upper bar with "Code" etc. Is there something I can do about this?
    • Which provider are you using? This might be something specific to the service you are using, so likely nothing can be done about this.
    • I activated it through Anaconda Cloud (https://anaconda.cloud/)
    • I managed to find the corresponding window, for me it was called spicy-tutorials-2023
    1. The %%bash command, does it work on Windows? Is there an alternative for Windows?
    • ..no "Couldn't find program: 'bash'"
    • I used %%cmd and it works for Windows
    1. Is it OK to code in Windows?
    • Yes, although as seen in #11 there might be a few commands that don't work due to them being unix commands.
    1. How do you capture the output of a %%bash command into a Python variable?
    • sorry that this does not answer your question but personally I would avoid bash magics and rather use a python command for whatever you tried to do with bash magic and capturing it. it will be more portable to other operating systems.
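    • For a portable sketch in plain Python (no bash magic, nothing assumed about your setup): the standard-library subprocess module captures a command's output into a variable, and socket.gethostname() covers the hostname example directly.

```python
import socket
import subprocess
import sys

# The hostname example without any shell at all:
host = socket.gethostname()

# Generic pattern: run a command and capture its stdout into a variable.
# We run a trivial Python one-liner so this works on any operating system.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True, text=True, check=True,
)
output = result.stdout.strip()
print(host, output)
```

This avoids the %%bash/%%cmd portability problem entirely, since the same code runs on Windows, Linux, and macOS.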
    1. The command

        %%bash
        hostname

    doesn't work

    • Used %%cmd
      • the example might unfortunately have been operating system specific
    1. Does %%bash start a new session? Or does it continue on the same session jupyter was started in?
    • I get: Couldn't find program: 'bash'
      • Are you launching in Linux?
    • I have windows laptop and I used anaconda navigator
      • Then unfortunately you don't have bash. I wonder if there is a different one.
    1. Are the results of the cells saved in any variable(s)? I mean something that is returned as a value (e.g. 1+1)
    • If in the cell you assign the result of the operation to a variable (left hand side) then yes. E.g. a=1+1 versus 1+1
      • But what if it's just a piece of code returning a value? Can I use e.g. Out[5] like in a Mathematica notebook?
        • Yes, IPython keeps past cell results in the Out dictionary (Out[5], or _ for the most recent result), much like Mathematica. You can also see all variables you have in your running notebook with the magic %whos
    1. I missed a part (at home with my 9-month-old son); what does bash do/mean?
    • It's the command-line interface in Linux, basically like cmd in Windows.
      • thanks
    1. Where are the exercises?
    • see below
    1. I have used Jupyter before and noticed that every time I open the notebook again, I have to rerun everything. Is that usually the case?
    • If you close the kernel, then yes. The results of the notebook are only stored in dynamic memory, and the printed results, I think, are saved when you save the document. But they are not values in memory; you cannot use them.
      • note also that running everything is what the next person will do who opens your notebook so it is a good practice to make sure you can run the whole notebook from top to bottom before saving/sharing it. I have seen notebooks where the person needs to run cells in a very particular order otherwise it fails and that is something to avoid.
    • Unfortunately it's not completely consistent, as it does store some outputs (at least iirc), but the state of the variables is not stored
    1. I used Jupyter a year ago and was trying to plot a graph using some code from other examples. At first it did not work for me because it needed some inputs to read for doing precalculation, but I finally found a solution. It took a long time to solve, though.
    • did you solve it by loading the data from the web? or did you end up storing the data close to the notebook and read it from disk?
      • I found a similar example and used its methods for my calculation.
    1. By the way, Jupyter does not only work with Python – it can also work with e.g. Java, R, Julia, Matlab, Octave, Scheme and some other languages (e.g. Wolfram engine, but you need a custom implementation).
    • or R kernels
    1. I've tried to use the solution code of Jupyter 2.4 (%%timeit). It seems like this command dramatically increases execution time.
    • %%timeit actually runs the code many times to get an average (runs it over and over until it gets a few seconds of total time, then divides by number of runs). So it appears slower but the time given is slightly more accurate.
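    • Outside of notebook magics, the same repeated-timing idea is available from the standard timeit module; a minimal sketch:

```python
import timeit

# timeit runs the statement `number` times and returns the *total* seconds,
# so divide by `number` for the per-run average (this is what %%timeit reports).
number = 10_000
total = timeit.timeit("sum(range(100))", number=number)
per_run = total / number
print(f"{per_run * 1e6:.3f} microseconds per run")
```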
    1. Can we use jupyter.cs.aalto.fi if we are from Aalto?
    • yes! Should mostly work, it has most of the same things installed.
    • if you have a Triton account, you could also use jupyter.triton.aalto.fi (should also mostly work)
  • (ask new questions at bottom)

Exercises until xx:45

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-1
Try Exercises 1, 2, 3 - whatever you have time for.
It is OK if you do not complete all of them, the rest of the course does not depend on this.

I am:

  • done: oooooooooo
  • not trying: o
  • I don't know how to change my folder
  • I need more time: ooooooo
  • done: oo
  • it's :50
  • no, I cannot hear you speaking. Have we not continued or is something wrong with just my audio?
    • It is continuing right now, try to reload the stream
      • Thanks that helped :)

Exercise 3 - please answer here

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-3

  1. Pitfalls:
  • I have used Jupyter Notebook with a genetics-specific Python interface to Hadoop parallel computing (Hail.is), but it can have issues where parts take a loooong time to compute. Having a script that runs and then deletes the cluster when it's finished works a lot better. Also ran into the modularity issue.

  • I started a long test code in Jupyter. In the end I needed to move it to a module, but that was too hard at the time and I never went back to it.

  • You cannot see what variables and functions you have defined, which does sometimes complicate development. This does depend on how you write code, though.

    • %who magic will list all defined variables
    • Or use the Jupyter extension in VS Code; there you have a tab near the terminal tab that shows you all the variables
  • We use Jupyter for experiment execution (manipulating sources, in-situ data processing etc.). In Jupyter, wrong cell execution order usually causes wrong behaviour that's hard to catch. Code style solves this problem, but not fully.

  1. Success stories:
  • Jupyter is fantastic for short debugging, on the other hand
  • Test out small things in the notebook while moving the code that works into an editor, for reuse / importing back into the notebook as a library.
  • I have done several courseworks with Jupyter, which helps make the code easier for others to understand.
  • Good for data visualisation on the fly, and short distance from development and debugging to seeing the output

If you have a specific question you can ask it here.
Solutions are also available on the course page (just click on the solutions section to open).

    1. Can I copy code into here, if I have a question?
    • Yes. put it in these type of comments:
      ​​​​​​Text formatted as code
      ​​​​​​Can contain multiple lines
      
      • Thanks
    1. I cannot figure out what the difference is between this:
      ​​​​​​%%timeit
      ​​​​​​n, n2 = 0, 1
      ​​​​​​for x in range(100):
      ​​​​​​    n3 = n + n2
      ​​​​​​    n = n2
      ​​​​​​    n2 = n3
      ​​​​​​#print(n3)
      

    and this is:

    ​​​​%%timeit
    ​​​​a, b = 0, 1
    ​​​​for i in range(100):
    ​​​​#print(a)
    ​​​​    a, b = b, a+b
    

    Running the first one leads to: UsageError: Line magic function %%timeit not found. At least on my system.

    • "UsageError" makes me wonder: is this using a Python kernel in Jupyter?
    • Yes it should be running in a python env
    • Are these two different cells? Or is this all in one cell? timeit only works at the beginning of a cell (IIRC), since it times the whole cell; putting it somewhere else in a cell will not work. At least for me: if I run the code in its own cell it works; if I combine them I get exactly the error you report.
      • Thanks for the help. The problem was that I had a commented line of code (e.g. #) before timeit. The error is resolved. :)
        • ah yes, the first line has to be the magic!
    1. What did people get for a Fibonacci code? Here's my example:
    ​​​​def fibonnacci(max_n):
    ​​​​    numbers = []
    ​​​​    for i in range(max_n):
    ​​​​        if i < 2:
    ​​​​            numbers.append(i)
    ​​​​        else:
    ​​​​            new_num = numbers[i-1] + numbers[i-2]
    ​​​​            numbers.append(new_num)
    
    ​​​​    return(numbers)
    
    • nice readable solution!
      • however, here we are appending to a list (which occasionally reallocates and copies the underlying array) even though the size is known from the start.
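    • A sketch of the preallocation idea (a hypothetical variant of the solution above): since max_n is known, a NumPy array can be allocated once instead of growing a list.

```python
import numpy as np

def fibonacci(max_n):
    # Allocate the output once instead of appending repeatedly.
    numbers = np.zeros(max_n, dtype=np.int64)
    if max_n > 1:
        numbers[1] = 1
    for i in range(2, max_n):
        numbers[i] = numbers[i - 1] + numbers[i - 2]
    return numbers

print(fibonacci(10).tolist())  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```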
    1. Yet another recursive solution, possibly not efficient but short:
    ​​​​def fibo(n):
    ​​​​    return n if n < 2 else fibo(n - 1) + fibo(n - 2)
    ​​​​    
    ​​​​[fibo(i) for i in range(10)]
    
    • Also: This code will re-calculate all prior numbers multiple times, so it's good if you want to have a specific fibonacci number, but if you want the numbers up to that point, you will have to store them "on the way"
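    • One common fix for the re-computation is caching with functools.lru_cache, so each fibo(k) is evaluated only once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibo(n):
    # Each distinct n is computed once; repeated calls hit the cache.
    return n if n < 2 else fibo(n - 1) + fibo(n - 2)

print([fibo(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The cache also stores the numbers "on the way", so asking for the first n numbers no longer repeats work.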
    1. I don't know how to change my directory in JupyterLab
    • Upper left side, there should be a folder symbol.
    • Or are you trying to move "up" and that fails?
      • Yes, I was trying to move up or navigate, in Windows
      • You can't go "up" relative to the first folder you started in.
      • In this case you may need to close the jupyterlab and reopen it further "up" or where you wanted to be instead
        • Does this mean that I can't save my notebook to any other folder. I mean folder up
          • Correct. You can save it where you opened it or "below" (unless you specified --notebook-dir as in the answer below). So in this case I would save it, close JupyterLab, move the file where you want it, and re-open JupyterLab there. It is a security measure that JupyterLab can't access all of your hard drive unless you ask it to.
            • Thanks!
      • jupyter-lab --notebook-dir=PATH will give some other place as the root directory.
    1. Here's a fibonacci solution using generators:
    ​​def fibo(n):
    ​​  x1 = 0
    ​​  x2 = 1
    ​​  for i in range(n):
    ​​      tmp = x1
    ​​      x1 = x2
    ​​      x2 = tmp + x1
    ​​      yield x1
    
    ​​print(list(fibo(10)))
    
  • (ask new questions at bottom)

Break until xx:02

Numpy

https://aaltoscicomp.github.io/python-for-scicomp/numpy/

    1. how long is it gonna take today?
    • Two more hours from now (12 CET).
    1. What is a tuple?
    • (1, 2, 'a', 'b') <- example of Python tuple: the parentheses things.
    • Tuples can't be modified after they are made (while lists can be). Both can have different types of Python objects, not just numbers.
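    • A quick illustration of the difference:

```python
# Lists are mutable; tuples are not.
a_list = [1, 2, "a"]
a_list[0] = 99            # fine

a_tuple = (1, 2, "a")
try:
    a_tuple[0] = 99       # tuples reject item assignment
except TypeError as err:
    print("TypeError:", err)
```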
    1. Did lesson restart already?
    • Yes. Can you hear the stream? Sometimes I need to re-load the Twitch webpage.
    1. How do I open NumPy?
    • import numpy or import numpy as np loads the library. We recommend you follow along from the same JupyterLab you had before.
    1. Would using for i, a_elem in enumerate(a) instead of for i in range(len(a)) be slower or faster?
    • You should test! If you also need the elements, enumerate might be faster (and it's also more readable). But we have a particular teaching purpose for doing it this way.
      • a_elem is here a variable to go through elements of a, correct?
        • yes
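    • For reference, the two loop styles visit the same (index, element) pairs; enumerate just avoids the extra a[i] lookup:

```python
a = [10, 20, 30]

pairs_range = [(i, a[i]) for i in range(len(a))]
pairs_enumerate = [(i, elem) for i, elem in enumerate(a)]

print(pairs_range == pairs_enumerate)  # True
```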
    1. I can't find np.arrange; it says this attribute doesn't exist!
    • one r: arange
      • Thanks!
    1. import numby doesn't work; I get an error:
    • b to p: numpy
    • numpy does need to be installed, but if you use default Anaconda then it's there: https://aaltoscicomp.github.io/python-for-scicomp/installation/
      • I open jupyterlab through anaconda
      • Does import numpy as np still not work? Make sure it's spelled correctly.
      • I get:

          ModuleNotFoundError                       Traceback (most recent call last)
          Cell In[33], line 1
          ----> 1 import numpy as np

          ModuleNotFoundError: No module named 'numpy'

        • that means numpy is not installed. You can run `conda install numpy` or `pip install numpy`
          • works now, thanks!
    1. where are you? I could not see when you run np?
    1. It's going a bit quick, or maybe I shouldn't try the code together with them?
    • Yeah, it is fast. Try typing some, but once it gets too fast then wait for the exercises to keep following along. It's hard to be both fast and slow enough at the same time.
    1. Why can't we use np.b.shape?
    • np.b isn't anything: b is the object.
    • The functional form is np.shape(b)
    1. Why does Python start with 0? For a while I found this better than Matlab starting with 1. But I think Julia starts also with 1. Why do the big four languages not follow the same start index?
    • languages with C origin/inspiration often start with 0. Fortran, Julia, Matlab and R start with 1. there are pros and cons to both (some things are easier in the one, other things easier in the other) and the one camp often asks why the other camp has chosen differently.
    • the choice is often also related to what language the language creator "grew up with" and found a more reasonable choice
    • Julia probably chose index-1 to have an easier starting point for Matlab users
    • could it also be "languages which try to look like math does" and "languages that were designed closer to how hardware works?"
      • that's a great viewpoint which I never considered!
    • The reason behind 0-indexing is usually because the arrays are represented as pointers in memory and the index is basically a shift from the beginning of the array by "i times the size of the data type of the array".
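    • NumPy actually exposes this pointer arithmetic: element i of a 1-D array sits i * itemsize bytes from the start of the buffer, which is why counting from 0 falls out naturally. A small sketch (the numbers assume a 64-bit integer dtype):

```python
import numpy as np

a = np.arange(10, dtype=np.int64)
# .strides is the byte step between consecutive elements along each axis:
# element i lives at byte offset i * a.itemsize from the buffer start.
print(a.itemsize, a.strides)  # 8 (8,)
```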
    1. import numpy as np does not work, it does not do anything, not even returning errors
    • If it works you won't get any output.
      • so where is the numpy window?
        • I don't fully understand the question. You should have the Jupyter notebook open and this is Python, you are interacting with that Python.
        • numpy should not make a window. When you run this line, you import a library into Python, meaning that you can now use the functions defined in that library.

Exercise until xx:32

https://aaltoscicomp.github.io/python-for-scicomp/numpy/#exercises-1
And then you can try other exercises after that, too.

I am:

  • done: ooo

  • not trying:o

  • need more time: oooo

(to vote above, add an "o" symbol to the line you agree with)

    1. Storing a np file does not work, it gives a FileNotFoundError
    • what command do you try to run?
    • np.save('file.npy',a) and then x =np.load('file.np')
      • File names look different. You have stored is as file.npy, but you try to load file.np.
        • Thanks :)
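    • For reference, a save/load round trip where the filenames match (written to a temporary directory so it runs anywhere):

```python
import os
import tempfile

import numpy as np

a = np.arange(10)
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "file.npy")
    np.save(path, a)   # writes file.npy
    x = np.load(path)  # load back the *same* name

print((a == x).all())  # True
```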
    1. Can I ask a question about the exercise? "Can you adjust one to do the same as the other?" What does it mean, please? I get two lists [0..9]
    • I think it means "can you make both give the same output". You need to read the documentation on them and adjust the arguments some.
      • I see! Thank you
    1. With the linspace function I get an array with periods at each element: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]) - is this correct?
    • yes! So it makes the type "float" (floating-point number), as opposed to "int" (integer).
      • Thanks. I tried to write "np.linspace((0,9,10), 'int')", but this is not the right way it seems.. Nevermind, "astype" worked :)
        • you need to use keyword argument (e.g. dtype=), and also the type itself rather than 'string'.
        • also, np.linspace expects start, stop, num as the first three arguments, not as a tuple in a first argument. You can check the documentation here.
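    • Putting the answers together, one sketch of making the two calls agree:

```python
import numpy as np

a = np.arange(10)                      # integers 0..9
b = np.linspace(0, 9, 10)              # ten floats from 0.0 to 9.0
c = np.linspace(0, 9, 10, dtype=int)   # dtype as a keyword, not a string
d = np.linspace(0, 9, 10).astype(int)  # or convert afterwards

print(a.dtype, b.dtype, c.dtype)
```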
    1. When executing np.save('file.npy', a) I'm receiving the error "Read-Only file system: 'fily.npy'"
    • On what system are you running?
      • MacOS and using VSCode
    • Hm. I guess VSCode started the Jupyter/Python with a "working directory" where it isn't writeable. You could give a whole path to where it can save: np.save('/Users/YOURNAME/...') - can a mac expert comment?
      • Ahh this would make sense, I'll double-check and can get back to you. Thank you!
    • import os ; os.getcwd() confirms what Python sees as the current working directory.
      • Realised my error: I had opened the file through the command line so I had yet to save it (as well as adding the cwd). Thanks so much for the help!
    1. Why do I get zero when I write b=a**2?
    • What do you mean get zero? Does it print zero or print nothing?
      • the value is zero
        • what do you get for a? it looks like a is not initialized as expected
      • I think the problem was with %%timeit. When I remove it, it works.
        import numpy as np
        a = np.arange(10000)
        b = np.zeros(10000)

Array maths and vectorization

Continuing after exercises: https://aaltoscicomp.github.io/python-for-scicomp/numpy/#array-maths-and-vectorization

    1. Can't we go through the exercises?
    • We sometimes try to but often don't have time. We try to have solutions - if you'd really like this, make a note and we'll see, maybe we'll make time.
    • Sometimes we tried and then we got the feedback that this was too repetitive and could we not rather stop going through the exercise again :-)
    1. The first axis is the row, and the second axis corresponds to the column in a np.array?
    • I think so but I always check almost every time! (one of my main uses for jupyter/IPython).
      • Thanks for admitting this!
      • I have to check this every time
    1. when you do a[idx], why does it return a 1-dimensional array instead of a 4-dimensional array?
    • If you only give one index, you will get the element at the underlying element array.
      • But in this case idx is a 4x4 boolean array?
        • Oops, sorry. Which exercise do you refer to, or rather which line?
          • I'll paste the code, one second:

            a = np.arange(16).reshape(4,4)
            idx = (a > 0)
            a[idx]

    • The resulting array has 15 elements (every element except the first, which is 0, is greater than 0). This cannot be cast to the same shape as the previous array, so it is flattened.
      • Thanks, good to know! I might expect the value not passing the boolean to be replaced with NaN or missing or something, it's good to know that it flattens it if the resulting values don't fit the dimensions of the array.
        • For more information, see the boolean array indexing docs. When the indexing array has the same shape as the original, the result is flattened. If the indexing array is smaller (e.g. a length-4 1-D array), it selects along the corresponding axis. For example:

            idx2 = (a.sum(axis=1) < 10)  # is the sum of the row less than 10? indexing array has shape (4,)
            a[idx2]                      # returns only the first row, shape (1, 4)

    1. Can a[idx] work with NaN values? For example, to filter out NaN values?
    • Yes
    • It might, but I think it's not that advanced. The pandas library (next lesson) is what does most of the missing value handling.
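    • A sketch of the NaN-filtering idea in plain NumPy (note the 1-D result, for the flattening reason discussed above):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# NaN never compares equal to anything, so build the mask with np.isnan.
mask = ~np.isnan(x)
clean = x[mask]   # boolean indexing drops the NaNs (result is 1-D)

print(clean.tolist())  # [1.0, 3.0, 5.0]
```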
  • In

        c = np.ones([3,3])
        c[1:3,1:3] = 2

    while the index runs from [0..2], I have to specify the last index as 3 to change the bottom-right part of the matrix?

    1. Could you please explain a bit about exercise 3, about what is happening in memory? Both variables point to the same memory, but b is pointing to a. If a changes, b should change as well. And it's the other way around, too!?
    • The advanced numpy lesson goes into this a little bit more. It's like: a and b are different names in Python, and have different metadata. But the raw memory buffer holding the values is the same. This is efficient for big numerical analysis (it lets you use Python for bigger things than you could otherwise), but can sometimes be surprising
    • https://aaltoscicomp.github.io/python-for-scicomp/numpy-advanced/
    1. I get "AttributeError: module 'numpy' has no attribute 'mulitply'" when trying to write "c = np.mulitply(a,b)"
    • I think it's misspelled a bit: multiply
      • ah yeah..
    1. I still don't understand the logic of a[1:3,1:3], middle 2x2 array? why only 2x2
    • left index is inclusive, right is exclusive, like [1, 3). You can think of it as for (i = 1; i < 3; i++)
    • Rows 1 and 2 and columns 1 and 2 (counting from zero).
    • I think the problem (the unintuitive part here) is that the 1:3 syntax is essentially a range syntax, so the last element is not included. But while this is already somewhat unintuitive when written as range(1,3) (which is 1,2) writing it as 1:3 (which again is [1,2]) is extremely counter intuitive
    1. I still do not understand what np.dot does, the logic behind resulting values is lost for me.
    • It means "dot product", and for 2D it's matrix multiplication. But I'd like a mathematician to explain a bit more about why it's called this!
      • For every (i,j) element of the result, take the ith row of the first matrix and the jth column of the second matrix and form their scalar (dot) product.
      • A =
        \begin{bmatrix} a & b \\ c & d \end{bmatrix}
        B =
        \begin{bmatrix} e & f \\ g & h \end{bmatrix}
        C = A \times B =
        \begin{bmatrix} (ae + bg) & (af + bh) \\ (ce + dg) & (cf + dh) \end{bmatrix}
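    • A tiny sketch checking this formula with NumPy (2×2 matrices with made-up values):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# For 2D arrays np.dot is matrix multiplication:
# C[i, j] = sum over k of A[i, k] * B[k, j]
C = np.dot(A, B)   # equivalent to A @ B
print(C)           # [[19 22]
                   #  [43 50]]
```

    Row 0 of A dotted with column 0 of B gives 1·5 + 2·7 = 19, matching the (ae + bg) term above.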
    1. I still don't understand the logic of a[1:3,1:3], middle 2x2 array? why only 2x2
    • Rows 1 and 2 and columns 1 and 2 (counting from zero).
      BUT why do we include 3 in the syntax?
    • For consistency. For example, a[:3] will give you 3 elements. Also, that's the way it's done in C.
      • Isn't essentially that what happens is range(1,3) or range(3) is being used?
        • Yeah, it is. It's also in question 51 above (which I think is a copy of this one).
    • this is by definition. It is counterintuitive if used to another convention.
    • I think the problem (the unintuitive part here) is that the 1:3 syntax is essentially a range syntax, so the last element is not included. But while this is already somewhat unintuitive when written as range(1,3) (which is 1,2) writing it as 1:3 (which again is [1,2]) is extremely counter intuitive
      • just looked it up a bit: The main reason for range not to include end is probably since people like to write for x in range(len(a)): which wouldn't work if end would be included..
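    • Since this came up twice, a small sketch of the half-open convention (example array made up here):

```python
import numpy as np

a = np.arange(16).reshape(4, 4)   # rows and columns numbered 0..3

# start is inclusive, stop is exclusive -- 1:3 means indices 1 and 2,
# exactly like range(1, 3)
middle = a[1:3, 1:3]
print(middle.shape)   # (2, 2)
print(middle)         # [[ 5  6]
                      #  [ 9 10]]
```

    One consequence: stop minus start gives the length of the slice (3 − 1 = 2), which is why a[:3] has exactly 3 elements.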
    1. should i use this np.copy straight after creating a?
    • Can you clarify where this is? (Exercise 3.) So, after creating a, should I use np.copy to avoid a changing?
    • a = np.eye(4)
      b = a[:,0]
      b[0] = 5
    • Oh yeah. So, you would do a = np.copy(a) (or b = np.copy(b)) if you needed them not to refer to the same memory area.
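    • Putting that together, a minimal sketch of view vs. copy using the exercise's arrays:

```python
import numpy as np

a = np.eye(4)
b = a[:, 0]           # a *view*: b shares memory with a
b[0] = 5
print(a[0, 0])        # 5.0 -- modifying b also modified a

a = np.eye(4)
b = np.copy(a[:, 0])  # an independent copy
b[0] = 5
print(a[0, 0])        # 1.0 -- a is untouched this time
```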

Continuing: https://aaltoscicomp.github.io/python-for-scicomp/numpy/#types-of-operations

    1. I was not even able to finish exc 2, but maybe I'm slow?
    • don't worry, it is a lot of material! As long as you tried some, you can carry on later.
      • I wish it was a bit slower to get more hands on opportunity when we have you here
        • Yeah, I know: it's hard, no matter what we do some people say we should go faster and some say we should go slower. Especially the first day is sort of fast since otherwise it's too slow for many people.
    • you are definitely not too slow. material contains more than can be covered but that is a choice so that learners can read up in more detail on the topics interesting to them and everybody is interested in different aspects.

The University of Helsinki Turso cluster is experiencing some file system issues at the moment.
It looks like if you had an open Jupyterhub session at https://hub.cs.helsinki.fi those have been terminated.
We are investigating and repairing the problem. This only affects University of Helsinki users.
Update xx:08 : Issue resolved for now, please re-connect at: https://hub.cs.helsinki.fi if your session was terminated

Pandas

https://aaltoscicomp.github.io/python-for-scicomp/pandas/

    1. I heard recently about Polars, what do you think about it in comparison to Pandas?
    • Polars seems to be more efficient when data size gets very big
    • https://www.pola.rs/
      • Looks very interesting. Typically pandas alternatives utilize similar data structures and programming styles as pandas, but they can just deal with more data faster. Dask is another alternative we're going to be talking about later. However, to use these efficiently one usually needs to learn the way that pandas is supposed to be used (e.g. df.select(...).filter(...).apply(...).agg(...)-type of coding). So it is good to start with pandas and then use more advanced tools for the execution if pandas is not fast enough. Code is basically interchangeable if it has been written in pandas style.
  • 57.Can you paste the url here?

    1. "SSLCertVerificationError" when I try to load the file. Do I have to install something more? "URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>"
    • Are you running this from a node on a computer cluster? (e.g. Aalto Triton or HY Turso) or actually it seems to be a mac osx issue. If your case is the latter, then this could help (assuming you have installed following our guidelines)
      • Open terminal
      • source activate python-for-scicomp
      • pip install --upgrade certifi
    • is this on jupyter.cs.aalto.fi?
    • I'm on macOS, yes, but not on a cluster. I did upgrade certifi but it did not solve the problem. It seems complicated to solve, based on Stack Overflow
    1. When importing pandas, is numpy imported with it? Since the data of the column consists of numpy data?
    • yes, pandas depends on numpy and will load it where necessary. But numpy will not directly be available in your interpreter/console as long as you don't import it there. A lot of packages depend on others (we will go into package management/dependency topics later, I think on day 3)
  • 60.I'm getting a HUGE error message titled "URLError"

    • this sounds similar to #58. Feels like some SSL certificate authority is not properly set up.
    • I doubt this because I don't see any mention of certificate failure. My error has this at the end: `<urlopen error unknown url type: https>`
      • See question 58 if you are on a mac
        • I'm using Windows OS
    1. what happened in this line: titanic[titanic["Age"] > 70]
    • We selected all the passengers older than 70, i.e. the entries in the titanic dataset where the variable age is more than 70. We do that by Boolean indexing: we select the whole column and compare it to 70, receiving a list of Booleans [True, False, False, False, True, True, ...]. When we pass that list of Booleans to the indexing operator, we receive a dataset containing only the rows with a True value.
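    • As a sketch with a tiny made-up frame (standing in for the real Titanic data):

```python
import pandas as pd

# Invented mini-dataset for illustration only
titanic = pd.DataFrame({"Name": ["A", "B", "C", "D"],
                        "Age":  [22, 71, 35, 80]})

mask = titanic["Age"] > 70     # boolean Series: [False, True, False, True]
older = titanic[mask]          # keeps only the rows where the mask is True
print(older["Name"].tolist())  # ['B', 'D']
```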
    1. What is the difference between at and loc methods, please?
    • Ok, it seems I use at for a single value, and loc for multiple rows/columns.
    • There are also other differences like return datatype, and different behaviour for set. Some links can be found here.
    1. R works very similarly, but is it completely independent from python, numpy and pandas?
    • yes, although the R dataframe might have inspired the creation of pandas
    1. I have always wondered, is there a difference in using titanic.Age and titanic['Age'] ?
    • I can imagine you'd have problems with the first one if the column has a space in the name (like "Age, years" or "Distance, m")
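    • A small sketch of the difference (made-up frame; the second column name is invented to show the limitation):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 35],
                   "Age, years": [22, 35]})

# For a plain column name, both notations return the same Series:
print(df.Age.equals(df["Age"]))   # True

# But attribute access cannot express a name with a space or comma in it:
print(df["Age, years"].tolist())  # [22, 35]
```

    Attribute access also breaks if the column name collides with a DataFrame method (e.g. a column called count), so bracket notation is the safer habit.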
    1. The error originated from having requested value 1600 instead of existing value 1500 in value_vars +1
    1. 1600 is wrong; use 1500
    1. How to define more complex conditions such as for instance,
    ​​​​cond = titanic.Age >= 0 and titanic.Age <= 5
    ​​​​titanic[cond].groupby("Survived")["Age"].count()
    

    Here cond errs

    • whats the error?
    ​​​​ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    
    • sometimes I use np.logical_and(cond1, cond2) to combine, not sure if this is the cleanest way of doing it

    • I assume the problem is that the and operator doesn't properly work here, and tries to interpret the series resulting from titanic.Age <= 5 as two values (name index and truth value).

      • Yes, seems like and does something odd here. if you instead use & it seems to work
        • yeah, I guess because & would translate to np.logical_and for these boolean indices
        • & is bitwise, so each element is ANDed with the respective one, while and only works on individual boolean values (which a Series isn't)
        • For this error, add parentheses: (condition) & (condition). This is because of precedence: it's trying to do titanic.Age >= (0 & titanic.Age) <= 5
    • Working solution:

    ​​​​cond = (titanic.Age >= 0) & (titanic.Age <= 5)
    ​​​​titanic[cond].groupby("Survived")["Age"].count()
    
    • Thank you!
    • Nice line!
      • And parenthesis are important not only to get it working but also it increases the readability

Exercises Pandas-1 and Pandas-2 until xx:52

https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-1
https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-2

We know this went fast - you'll have to read and review what we just covered. Do this and play with the exercises some - after this is just wrap-up and you continue as homework.

    1. "titanic.iloc[0:9]" slices out the first 9 nine passengers, I expected the first 10 since I thought it's 0 indexed? Or wait, is the column names also counted as a row?
    • It is like in math notation with i being an integer: 0 <= i < 9, so you get i = 0, 1, 2, ..., 7, 8 (9 passengers)
      • Could I ask what you mean when you say "like in math", like in algebra, or in a python operation?
        • sorry for the confusion, I meant like the mathematical statement "i is a list of all integers larger than or equal to zero and lower than 9". Matlab instead would be "i is a list of all integers larger than or equal to zero and lower than or equal to 9"
    • indexing in pandas or numpy with the X:Y syntax is the same as putting a range command in there. i.e. iloc[0:9] is the same as iloc[range(0,9)]. And range excludes the end value.
      • Ok. I usually write range(9), but this could also be written as range(0,9)?
        • yes 0 is the implicit start value of range. See also #53
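    • A quick sketch of the equivalence (10-row made-up frame):

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcdefghij")})  # 10 rows at positions 0..9

first_nine = df.iloc[0:9]   # positions 0..8 -- the stop index is excluded
print(len(first_nine))      # 9

# Same result via an explicit range:
same = df.iloc[list(range(0, 9))]
print(same.equals(first_nine))   # True
```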
    1. Can lecture notes be written in the Finnish language for Aalto University study credit?
    • We have someone who can evaluate it, so probably OK (I'll check to confirm)
    • We have neural machine translation also :) happy to get them in Finnish

Feedback, day 1

News for day 2

  • Tomorrow we continue pandas (quickly) and go to more advanced topics - if today was easy or boring, tomorrow gets more interesting (and exercises go slower).
  • Catch up with today in order to do tomorrow.
  • Twitch has videos immediately, YouTube later today

Today was:

(vote with a "o" symbol):

  • too fast: oooo
  • just right: ooooooooooooooo
  • too slow:
  • too easy: oooooo
  • right level: oo
  • too advanced: oo
  • I would recommend this course to others: ooooooooooo
  • Exercises were good: ooooooooooooooooo
  • I would recommend today to others: ooooooooooooooo
  • I wouldn't recommend today:
  • The time allocated and the course materials could be a little better matched to each other.
  • for the first day it was ok: ooooooooooooooo

Do you see value in the video production:

  • yes, during the course: oooooooooooo
  • yes, for future reference: oooooooooo
  • yes, for taking the course without watching live: oooo
  • a bit:
  • not really, I wouldn't use:
  • yes, I will go back to YT and rewatch: ooo
  • yes for future reference it is beneficial

One good thing about today:

  • I liked the explanation about the matrix slicing
  • It was very informative even though I thought there was nothing I don't know about Numpy
  • Very useful
  • Quick glimpse of all the subjects and how they relate to each other, free to focus more on personal interest afterwards +1o+1
  • This was nice for someone that has quite basic knowledge of python for data analysis oo
  • The exercises in combination with the explanations were very useful! o
  • The preparation information was very useful. I should have had more than one screen in use
  • I think I've got a few keys to get started on my own on np and pd.
  • Great platform setup, and nice structure of the day, right amount of breaks+lecture+exercise ooo
  • Contents well structured. o

One thing to be improved for next time:

  • I always think there is a lot of material and this stresses me out.
  • Pandas section was not really clear. I didn't use it before, so it was super hard to follow
  • Related to the comment above, indeed, I think that if someone would be starting in Pandas this lecture was quite fast :) oo
  • it's still unclear what JupyterLab is vs. Jupyter Notebook
    • Jupyter Notebook has a simpler interface. In JupyterLab one may have several notebooks open; it has more features and one may install extensions as well.
  • Maybe too many pages with info and links for the course, not always easy to find the right page
  • More time for the exercises, and a bit more info for Numpy and Pandas exercises in the descriptions (don't want to open the solution)
  • Learning paths are different for people with different background/requirements. I would suggest having some recommended topics list for people in Physics / Maths / Chemistry / Biology

Any other comments:

  • Is there a chance the user could get more positive feedback? To start with confidence into the Python/scientific computing journey?
    • Definitely! It is unfortunate that python errors are often obscure compared to other scripting languages.
      • Still much better than for C/C++ :) Especially with some of the latest suggestions.
  • In the data formats section could you please also discuss the Matlab data format? We use it mostly, because the non-Python part of our group uses Matlab

Day 2

I was here yesterday:

  • yes: oooooooooooooooooooooooooooooooooooo
  • no: o
  • partly: ooooooo

How big is the data you use with Python:

  • no data:
  • <1M: ooo
  • 1-100M: ooo
  • 100M-10G: ooooooooooooooooo
  • 10G - 1T: oooo
  • 1T - 100T:oo
  • more:
  • yes

You can ask questions for us to discuss before we start:

  • Test this document if you didn't already! Click the "pencil" icon on top. This is an example question

    • This is an answer, your question didn't have a question mark?
      • It was more of a comment than a question
    • Actually
  • another test

  • more..

  • What special should I know from Tuesday session (I missed that)?

    • We did a lot of basic things: Jupyter and Numpy, which was a basis for a lot of the other lessons. Pandas got more involved and much deeper, but was a high-level overview (we'll continue a bit today). If you missed yesterday, it's not that bad for today.
  • In part of the course instructions we have: "Python is strongly and dynamically typed. Strong here means, roughly, that it's not possible to circumvent the type system." Is there any language that behaves the opposite way? JavaScript, PHP?

    • We discussed this a bit on stream
    • One can also give type hints in Python. This is common for bigger programs. There are also type checkers that check whether types are static throughout the program.
  • Isn't C notorious for being weakly typed? Basically it's easy to convert one type of pointer into another extremely easily.

  • Good audio

Pandas (continued)

Exercises until xx:25

Pandas-3: https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-3

I am:

  • trying:ooooo
  • done: o
  • wish for more time: ooo
  • not trying:
    1. Very clear, could you say something short about differences between pandas and R's tidyverse?
    • Tidyverse is similar in that you have tools for doing plotting, column management, filtering etc. in one collection of packages. Pandas is just one package that does similar things. The idea is the same: to have data in a table-like format and then use functions from within the package that always return a new DataFrame (or Tibble in Tidyverse). This makes it possible to chain operations like df.select(...).groupby(...).agg(...). Similar to tbl %>% select(...) %>% mutate(...) etc. in Tidyverse.
    • From a "philosophical" point of view, Tidyverse adheres to the "tidy data" principle, which means that datasets should be organised so that each measured variable is a column and each observation is a row (e.g. each row is a subject and each column is an answer to a questionnaire item). Pandas can work with any tabular data without strictly adhering to this convention.
      • Each column in Pandas has a single type, so it is highly recommended to use the tidy data format as the tools have been designed for it. However, one can create e.g. pivot tables and multidimensional relational tables in pandas where a column is a table etc. However, these are very hard to visualize as they are multidimensional.
    • Also there is some coordination in data formats: arrow/feather. We'll talk about these more later today.
    • For a comparison of R and pandas, see this page from Pandas's documentation.
    1. My Jupyter notebook says there's no kernel, so it won't run code. How can I fix this?
    • The simplest option is to restart Jupyter
    • You could try options of: Kernel→restart Kernel and/or Kernel→Change Kernel. Let's hope one of these works.
      • restarting worked! I tried restarting only the kernel before and that didn't solve it.
    1. when I type nobel.head my table looks a lot less nice than the one on Jarno's screen. Is there a way to make it look nicer/clearer?
    • try nobel.head() <- the parentheses, to make it call the function. Is this it?
      • perfect, that worked!
    1. Has anyone else got the youngest Nobel prize winner to be 3 years old?
    • We're checking why this might be. Did anyone else get this?
    • I just checked, it's for an agency, not a person (Office of the United Nations High Commissioner)
      • ah :-)
    1. Count laureate by country and get percentage,
    ​​​​nobel['bornCountry'].value_counts(normalize=True) * 100.0
    
    1. When creating the pivot table from subset, the line "subset.loc[:, 'number'] = 1" gives an error 'ValueError: cannot set a frame with no defined index and a scalar'. Why could this be?
    • For setting, use dataframe.at instead of dataframe.loc
    • I'm getting the same error with at..
    • OK, maybe the dataframe subset does not have an index. How did you create it?
    • I used the model code but replacing the countries with the country identifiers (e.g. 'FI')
      • Did you convert the view to a dataset with subset.copy()? I could replicate the error if I did not do it. The reason why this happens is that before you do a copy the object is just a "view" to the dataset. If you would try to add a new column into the view pandas does not know what to do with rows that do not belong to this view. Once you do a copy pandas creates a new dataframe with only the rows that belong to the view.
      • Yup I also used subset = subset.copy() before subset.at[:, 'number'] = 1
      • For me the following works:
      ​​​​​​​​countries = nobel['bornCountryCode'].sample(4)
      ​​​​​​​​subset = nobel.loc[nobel['bornCountryCode'].isin(countries)]
      ​​​​​​​​subset = subset.copy()
      ​​​​​​​​subset.loc[:, 'number'] = 1
      
      • Maybe the selection wasn't a dataframe?
      • Now I see the issue, I had the first two lines in a different cell than the last two. I thought the subset dataframe would be saved at the earlier point. But when copying all that to one it works
        • Maybe there was some problem that caused the subset to be bad and because the code refers to itself (subset = subset.copy()), the problem kept persisting. If you run the previous cell again, it should recreate the subset and it should work. When encountering errors I cannot solve I usually just restart the kernel and run all cells so that all objects are recreated from scratch.
        • Alright, thanks for the help!
    1. my histogram doesn't work but I get this error
    ​​​Cell In[33], line 1
    ​​​----> 1 nobel.hist(column='age', bins=25, figsize=(8,10), rwidth=0.9)
    
    • I'm not sure, I can't see where the import error is.
    • Is this the whole error message? Can you paste more?
    1. Related to the tidy data concept, do you think it’s bad practice to have nested dictionaries as pandas data frame elements? Would it be better to open the dictionaries and increase the size of the dataframe?
    • The idea behind tidyverse is that yeah, they should probably be opened. When it's in dicts, it's hard to use all these tools and you end up doing things manually. For most cases, better to use a bit more memory + spend less time on processing. (pandas has ways to save memory in these, too!)
    • Sometimes you might have something like a big dataframe of data that you run a models against (e.g. different models with different parameters). You might want to have a small dataframe of models and their parameters that refer to that big data, but you most likely do not want to combine the tables as that would duplicate the big data for each parameter value.

Data visualization

https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/

    1. When should I use "ax" to set labels and such? Just using "plt." lets me do this.
    • We'll cover this today (the two main interfaces)
  • 10.When trying to import:

    ​​​Traceback (most recent call last)
    ​​​Cell In[15], line 1
    ​​​----> 1 import matplotlib.pyplot as plt
    ​​​ModuleNotFoundError: No module named 'matplotlib'
    
    • How have you installed Python/JupyterLab? matplotlib probably isn't installed in this Python. We've recommended the Anaconda distribution since it has everything and is simple to manage.
      • I opened it through Anaconda. Should I always pip install matplotlib although I use Anaconda Navigator?
        • It's probably better to conda install matplotlib. I think there is a way to install packages with the Navigator, but I don't have it installed so I can't check
          • Ok, thanks!
    1. Is Matplotlib and the other libraries listed built upon pandas?
    • I think not exactly built upon but developed very closely together. pandas can call matplotlib automatically (dataframe.hist())
    • So, I can use Matplotlib without pandas?
      • Yes. (matplotlib does use numpy under the hood)
    1. What does matplotlib.use("Agg") stand for?
    • "anti-grain graphics": it basically prepares matplotlib to make pictures in that format.
    • Matplotlib has various different rendering backends. "Agg" is one of the simplest engines and it creates graphics in a rasterized (e.g. .png) format. For publications you might want to use a "better" rendering engine that creates images in vector graphics form, which can look better in print. For more info, see this page
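    • A minimal sketch of headless (no-display) use; the file name here is made up:

```python
import matplotlib
matplotlib.use("Agg")        # select the raster backend before importing pyplot
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Saving to a file works fine without any display attached:
path = os.path.join(tempfile.gettempdir(), "plot.png")
fig.savefig(path)
plt.close(fig)
print(os.path.exists(path))  # True
```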
    1. Why not just import matplotlib instead of matplotlib.pyplot, lower amount of libraries?
    • Under the hood it's probably doing most of the same. It's the way they organized it. (it used to be that importing pyplot would do some automatic setup)
    • any way to use matplotlib without pyplot (or pylab)?
      • Using with pyplot is the most common way. Like the previous answer said, it sets up various things that make it more usable. Using the base module would require you to set up these base things.
  • 14.Can we ignore the pyplot in "import matplotlib.pyplot as plt" and just write it as import matplotlib as plt?

    • Then you would have to write plt.pyplot every time instead of just plt (and note that import matplotlib alone does not automatically import the pyplot submodule).
    1. Is it better to use the ax interface, or is it better to use the matplotlib.pyplot interface (also why)?
    1. What is this command doing: fig, ax = plt.subplots()
    • It creates a new figure object and axes for that plot. You can think of the figure as the window or canvas that you're drawing into and the axes as the x-y axes you want to use for the plotting. In some cases you want to do multiple subplots and plt.subplots can do that as well. However, for historical reasons it turned out that this function is now the easiest way to initialize even a single figure and axes. By storing the axes into variable you can make certain that you're telling matplotlib to work with specific axes.
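    • A minimal sketch of that pattern:

```python
import matplotlib
matplotlib.use("Agg")            # headless, so this also runs on a cluster
import matplotlib.pyplot as plt

fig, ax = plt.subplots()         # one figure (the canvas) + one axes
ax.plot([1, 2, 3], [1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("y")

# The same call can hand back a whole grid of axes:
fig2, axs = plt.subplots(2, 2)   # axs is a 2x2 array of axes objects
print(axs.shape)                 # (2, 2)
```

    By keeping ax in a variable you always address one specific set of axes, instead of whatever pyplot currently considers "active".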
    1. Could you please comment on the choice of colors / colormaps?
    • below there is a block "why these colors?" - these colors have been optimized for color vision deficiencies
      • See this page for more information. Many of the colormaps chosen by matplotlib are "perceptually uniform" so that most humans perceive the change in color in linear relation to the change in the data. This is important as otherwise your plots might show artefacts and forms that aren't there but we just perceive them there because of the colormap.
    1. How do I 'un-do' the matplotlib.use('Agg')? :)
    • You can replace it with a different setting
    • or close the kernel, remove the line and rerun +1 probably easiest
      • "restart kernel and run all cells" 👍
    1. Could you please explain the matplotlib.use command for clusters?
    • matplotlib by default, when pylab is imported, prepares to show the pictures on a screen. Clusters don't have that, so we have to prepare for a "headless" configuration.
    • See question 12. as well.
    1. Often we have to visualize 2D data (measurements on a grid, density etc.). Could you please tell a bit more about this as well?
    1. We are doing only exercise 1 right?
    • yes. see info box below. there is also a solution in there if you get stuck
    1. if I am using plt.scatter(x=data_x, y=data_y, c="#E69F00"), then I don't need to declare fig, ax = plt.subplots() ax.scatter(x=data_x, y=data_y, c="#E69F00"). just use one of these 2 methods
    • that is right, this relates to question 15
  • 23: If I use "ax.legend" before setting the labels, I get the message: "Text(0.5, 1.0, 'some title')" before my plot. If I use it after setting the labels, I get the message: "<matplotlib.legend.Legend at 0x7f>" before my plot. Why does this happen?

    • Try adding the brackets, legend is a function: ax.legend()
      • the same thing happens
        • but in both cases you also get a plot, right? Yes.
        • what you see there are return objects from the last function call in a cell
  • 24 Can we please have an example of how to use fig, ax in functions and how it isolates?

    • Looking for a good example
    • It might be useful for setting up subplots, esp multiple contour plots with the same colorbar.
    1. Is it possible to store a figure object into a file, so we can plot it later? (Context: One can save data into figures in a cluster but visualize/customize them later on a workstation)
    • I would rather describe it in code rather than in file since it should not take any time to create the figure object (but it might take some time to re-generate the data)
    • Yeah, I'd rather save (raw data) + (code for plotting figure). we'll se data formats for raw data in the next lesson.
    1. Maybe you could say something sometimes, just for knowing voice still works. Yes everything works!
    • thanks for reminder
    1. It would have been nice to know that we have the break now also. When is the next break? Do we have another break after this as well? Thank you!!
    • oops yes, we restart xx:15 - currently break. it is really easy to miss it.
    1. I tried ax.scatter(data_x, data2_y, label = 'set2', c = 'cyan') ax.scatter(data_x, data2_y_scaled, label = 'set2 (scaled)', c = 'green') but it doesn't plot anything in the end, only shows this message: <matplotlib.collections.PathCollection at 0x4631228>
    • hmm looking what that means
    • try to add fig.show() at the end
    • alternatively try adding %matplotlib inline on top of the cell (this was needed in older notebooks)
    • Tried adding the two above. Gives an error <ipython-input-24-98a05c63be98>:4: UserWarning: Matplotlib is currently using module://matplotlib_inline.backend_inline, which is a non-GUI backend, so cannot show the figure. fig.show()"
      • are you running it not on your computer?
      • I am running it on from Jupyter webpage as I had problems yesterday with Anaconda installation
      • But now I discovered that plotting 3 sets on the same plot works if I put all commands in one cell. No idea why though.
        • Hmmm interesting. We should look in more detail but now I need to listen to co-instructor :-)
        • I could replicate this. If you have a plot in one cell jupyter will try to plot it. In the next cell you can get empty axes. So put all commands for one plot in one cell.
        • OK, thanks.
    1. When I am running the scatter i get a warning "matplotlib.collections.PathCollection at 0x1ea5eb5a760", is this indicative that i need to correct something?
      • it seems this is also question 28 (looking there)
        • I wasn't able to plot at first with that warning only, this was because the ax object was already used. Defining the figure as ax2 instead let me plot it, but still with the warning. I think question 28 might not be able to plot for the same reason
          • trying to solve it in question 28

Break until xx:15

    1. Dear instructors, could you please write down your names here? I would like to follow you on Linkedin if you have one.
      • Radovan Bast (but my linkedin is probably outdated; "bast" on github)
      • Johan Hellsvik
    1. Is Pandas plotting resorting to matplotlib?
    • Yes.

Exercise: Customization-1

(first as demo, then time to do it yourself)
Just watch now

    1. Are Excel files also OK?
    • Pandas does have support for reading/writing Excel files
      • More about this in the next session.
    1. "Purchasing Power Parity" dollars
    • thanks!
    1. Does matplotlib support a simple hex rgba string input for setting the color of a plot?
    • Yes (and probably converts to that internally)
    • So does that imply any speedup if we supply such an input?
      • Probably far too minor to be worth considering.
    • Using default colormaps has lots of advantages when it comes to legibility. See also answer 17. for more info.
    1. Can we use SQL-like syntax for querying a Pandas dataframe?
    • You can convert SQL commands to pandas commands by following instructions here.
    • If you want to run SQL-like interface, I would recommend DuckDB. It also supports running SQL queries on Pandas DataFrames. However, if you want to do SQL-like operations I would stay in DuckDB as long as possible (doing queries, groupings and joins) and convert the end result into Pandas DataFrame once you have the data you need.
    • SQL syntax may often be quite readable (fairly close to natural language) in contrast with Pandas groupings etc
      • It can also be efficient as the expression can be evaluated as a whole instead of doing it in parts. Dask does similar things as SQL where it converts multiple expressions (groupings etc.) into a single expression that will be run over the data.
    1. Can you maybe explain imshow and vmax, vmin, in particular, when a log scale is involved?
    • imshow shows the data as an image. So a 2D-array of size (M,N) would be shown as a (M,N) pixel image. vmax and vmin set the data range that the colormap should cover. So everything in data larger than vmax would be shown as the maximum of the colormap. Same for vmin. If these are not given, they will be set to the maximum and minimum of the data range.
    • What is the difference between image and plot?
      • plot needs two 1-dimensional arrays (x, y) so that it can plot lines etc. between each point pair (x[i], y[i]). imshow takes a full 2-dimensional array a and plots a 2D grid where each pixel corresponds to the value a[i, j]. If you give plot something like ax.plot(x, a), it will plot multiple line plots instead of an image.
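    • A short sketch of vmin/vmax with imshow (data invented here):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(100).reshape(10, 10)   # values 0..99 on a 10x10 grid

fig, ax = plt.subplots()
# Pin the colormap to the range 20..80: values below 20 all get the lowest
# color, values above 80 all get the highest.
im = ax.imshow(data, vmin=20, vmax=80)
fig.colorbar(im, ax=ax)
print(im.get_clim())
```

    For a log scale one would pass norm=matplotlib.colors.LogNorm(vmin=..., vmax=...) instead of the plain vmin/vmax arguments.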
    1. Does the course include a section on interactive plots?
    1. I get this error:
          URLError: <urlopen error unknown url type: https>  
    • Hm I thought I saw this recently somewhere. What computer are you operating on? (are you on jupyter.cs.aalto.fi?)
    • No, will try that
    • Yes, yesterday I had the same :(
    • HP Elitebook Windows OS
    • Are you running through full anaconda or something? One idea is that Python isn't built with the SSL support it needs to use HTTPS urls: https://stackoverflow.com/q/28376506 - suggests things like conda installing openssl. But I'd think a bit before trying things on that page.
    • It's working from jupyter.cs.aalto.fi. Will try to resolve issues with anaconda installation on my laptop later. Thank you.
    1. How should I rewrite this "gapminder_data = pd.read_csv(url).query("year == 2007")" if I want to do it separately?
    • first: gapminder_data = pd.read_csv(url)
    • then: gapminder_data = gapminder_data[gapminder_data["year"] == 2007]
      • So there is no way to use query separately?
        • oh there must be. give me a moment
          ​​​​​​​​​​​​​​​​​​gapminder_data = pd.read_csv(url)
          ​​​​​​​​​​​​​​​​​​gapminder_data = gapminder_data.query("year == 2007")
          
    1. Setting fontsize to 100 kills my kernel every time, what am I doing wrong?
    • testing here
      • I can set it to 100 without crashing (but it also then looks gigantic and unreadable)
    • If you look at the console Jupyter log it might give a hint why (if you can find it). Somehow I'm not surprised this might be possible
      Haha, thanks for trying! I can't see the log. Setting only the x axis to 100 goes fine, but not the y axis
    1. How to use .set_major_formatter()? Generally how to get help for using a function or package in python? Is there a command for getting help?
      • help(name_of_object)
      • trying it out
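To illustrate both points at once (the data and the formatting function are made up; FuncFormatter is one of several formatter classes in matplotlib.ticker):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()
ax.plot([0, 1_000_000, 2_000_000], [1, 2, 3])

# FuncFormatter calls your function with (tick_value, tick_position)
# and uses the returned string as the tick label
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: f"{x / 1e6:.1f}M"))

# For help on any object:
# help(ax.xaxis.set_major_formatter)
# In Jupyter you can also append a question mark:
# ax.xaxis.set_major_formatter?
```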
    1. (pandas): In the course instruction page, we have:
      Unlike a NumPy array, a dataframe can combine multiple data types, such as numbers and text, but the data in each column is of the same type. So we say a column is of type int64 or of type object.
    • Could you please explain it a bit?
    • Yes! Basically, under the hood, every column in the DataFrame is its own independent numpy array. So that's why they can have different dtypes: each column itself is efficiently stored, but different. They are nicely linked in the dataframe for doing things across rows (somewhat) quickly too.
    • Thanks!
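A quick sketch of this (toy dataframe, column names ours):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3], "s": ["a", "b", "c"]})

# Each column is backed by its own numpy array with its own dtype
print(df.dtypes)                  # numeric column vs object (text) column
print(type(df["n"].to_numpy()))   # the underlying storage is a numpy ndarray
```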
    1. where is it importing the dataset from, and what does tips = sns.load_dataset("tips") mean?
    • sns is what they traditionally import seaborn as (inside joke it seems: https://stackoverflow.com/q/41499857)
    • I think seaborn includes some sample datasets for testing, distributed with the code. load_dataset fetches one of those by name.
    1. What was that shortcut you used to split a code cell?
    • Ctrl+Shift+Minus it seems (from the Edit menu)
      • yes, it is one of the very few shortcuts I use and remember in jupyter. it splits the cell at the position where I am
    1. When I run the same code mine plot has background color but not yours, why?
    • which code is this? was this exercise customizing-1? or 2?
    1. I got this error in Jupyter Notebook:
       C:\Users\eeeee\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead (at: if pd.api.types.is_categorical_dtype(vector))
       C:\Users\eeeee\anaconda3\Lib\site-packages\seaborn\categorical.py:641: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. (at: grouped_vals = vals.groupby(grouper))
    • These are warnings, so you can ignore them for now. Some package is using old functionality that is about to be removed.
    1. After importing seaborn, it still gives me an AttributeError: module 'seaborn' has no attribute 'set_theme'. Am I missing anything? Also, for the layout argument in plt.subplots it returns TypeError: __init__() got an unexpected keyword argument 'layout'
    • then I am suspecting you have a different version of seaborn? you can try: import seaborn as sns and print(sns.__version__). I get 0.13.0
    • Yes, my version is 0.10.1
      • updating to later matplotlib and seaborn versions will probably fix these problems. sorry that the examples we use are not backwards-compatible enough.
    1. can you please explain the sns example briefly (once more, sorry)?
    • don't apologize, this was very brief. What I wanted to show is that it can use dataframes directly and map visual channels to data columns. my next step would have been to replace their dataframe with my dataframe and to replace their column names with my column names. then I would adjust colors and then tweak more. but please ask more.
    • I always had problem with violinplot and your short example was quite nice. Could you please send the link to your script here?? Found it here:(https://seaborn.pydata.org/examples/grouped_violinplots.html). Thanks for bringing it up.
      • we will update the material to link to a recent gallery example (they changed it recently and we did not notice)
    1. I again have a problem with installing seaborn. I tried conda install seaborn and got: invalid syntax. The same error with pip install seaborn.
    • conda install and pip install are shell commands, not Python, so running them inside a Python cell gives a SyntaxError. Run them in a terminal instead, or in a notebook cell prefix them with an exclamation mark: !pip install seaborn
    1. Another example with previous data:
import seaborn as sns
import pandas as pd

url = (
    "https://raw.githubusercontent.com/plotly/datasets/master/gapminder_with_codes.csv"
)
gapminder_data = pd.read_csv(url).query("year == 2007")

sns.set_theme(style="dark")

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(data=gapminder_data, x="continent", y="lifeExp",
               split=True, inner="quart", fill=False)
    - very nice! this is exactly how it works.
    1. I'm having a bit of an issue launching JupyterLab from Anaconda Navigator. However I can launch it by first launching Jupyter Notebook and then changing the last part of the adress to lab. The error message: "Error executing Jupyter command 'lab': [WinError 5] Access is denied". Maybe a user rights issue, that I need to be admin? Maybe there is an alternative way to launch it that would work?
    • Not a teacher on the course but I was able to bypass this problem by creating a separate environment instead of using jupyter from the base. :)
    • If you are using a terminal to start this, and if you are on Windows, you might need to type jupyter-lab without a space. This assumes that you have jupyter-lab in the environment you are in. If you are unsure, you can type conda list | grep jupyter to see which jupyter packages you have in your environment. If it complains "conda is a command that is not recognised", first type "source activate base", then run "conda env list" and see whether the list of environments has other options besides "base".
    1. How to plot a linear regression line using e.g. example from Exercise Matplotlib?
    • great question. looking for an example
      • other libraries can do this directly. in matplotlib you need to take a "detour" and compute a regression using numpy or scipy and then use the resulting function in matplotlib. example: https://stackoverflow.com/a/6148315 - the upside is that you have then full flexibility how you define the fit function, it does not have to be a polynomial
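A minimal sketch of that detour (made-up data; np.polyfit for the fit is one choice, scipy.stats.linregress would also work):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs without a display
import matplotlib.pyplot as plt

# Synthetic data: a line with slope 2, intercept 1, plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Fit a degree-1 polynomial (a straight line); returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x, slope * x + intercept, color="red",
        label=f"fit: {slope:.2f}x + {intercept:.2f}")
ax.legend()
```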

Break until xx:11

Data formats

https://aaltoscicomp.github.io/python-for-scicomp/data-formats/

  • is audio good?
    yes!

    1. Why is Excel not good for "human readable"? What are the criteria here
    • in excel it is, but the file on disk isn't.
      • By human readable it is meant that the file can be opened in a simple text editor and you can see what is in the data. Excel is a binary format where data is stored in a specific way that is not human readable. Of course, programs make it readable for us.
      • also related to question and discussion in next question. excel format is also tied to a particular non-open-source program which could be a problem if we want to make sure we can still open the data in 10 years to double check that Nature paper.
    1. Why do you think HDF5 is not human readable? I open it in silx and bingo, all is there. Also, why can I not store arbitrary data types in HDF5? Which one is not possible? So far, I could store data (np.ndarrays, strings) in it.
    • because it requires software that understands it to make it understandable for humans. if I open it in a text editor it will look like a random character set.
      • But if I follow this logic, a simple text editor, maybe like SubEthaEdit, is also a program, right? Who says a simple text editor is the standard?
        • just to be clear: HDF5 is a great format. but I need to know about what it is to select a tool that can open it. in some other formats I can open it "naively" with almost any program and from its contents I will see what this is about.
          • Ok, but could I not argue then those programms are just missing the functionality to open hdf5? For example, h5web just opens a hdf5 file and it does not even matter, what the structure is inside.
            • what is nice about hdf5 is that it is not tied to a particular program. this makes it a good format. some formats are very tied to very specific and often commercial programs which makes data then less reusable.
              - Thank you. I did not know this.
    • About arbitrary data: you can store arbitrary data in HDF5 if you convert your data into binary format and send it there, but then you need to have a library that converts your object into binary and back. The HDF standard has certain supported data types that it natively supports and can convert into / out of.
    • If you only need to open things in a program, then "human-readability" doesn't matter. This is for the case of: someone gets something and wants to open in a text editor to verify what's going on. When you get to big data, this doesn't matter so much.
    1. JSON is widely used for imagery, but seems pretty bad from this table, is there an alternative that you recommend?
    • A typical alternative to JSON is YAML, but JSON is not bad: it is very standard and reasonably human-readable
    • It's not very efficient, but that just goes to show that standardization and ease of use are more important most of the time. Multi-TB telescope data (for example) needs something more efficient.
    • There are use cases where JSON looks much better. Answers from REST APIs, for example.
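For a feel of why JSON fits those use cases (toy record, names invented): nested dicts and lists map directly to JSON text, and the result round-trips losslessly:

```python
import json

# A hypothetical REST-style record
record = {"name": "gapminder", "year": 2007, "tags": ["demo", "csv"]}

text = json.dumps(record, indent=2)  # serialize to a human-readable string
print(text)

restored = json.loads(text)          # parse it back into Python objects
```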
    1. Is base64 coding/decoding a good generic choice?
    • sorry, in which context was this mentioned? asking because it depends where (I was distracted typing)
    • no specific context, just wondering if base64 fits somehow in data science in general
      • I would save data in csv if it is tabular and not too large. I would go to other formats and other encodings only if it does not fit into a CSV file.
      • good to know, CSV as a start point, thx!
        • plotting CSV files is easy (often just one line), and it is easy to open and read in any programming language (Python, R, MATLAB, ...). Excel can read and write CSV files.
    • Base64 is an encoding for a string/bytes data. You would still ask: how do you convert those data to those bytes? That's where the other formats need to come in anyway.
    • Base64 can be useful if your data has e.g. unicode characters that can be hard to interpret.
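To make the "encoding, not a format" point concrete (toy example with standard-library base64): base64 only turns arbitrary bytes into safe ASCII text and back; how the bytes are produced in the first place is up to some other format:

```python
import base64

payload = "héllo wörld".encode("utf-8")              # arbitrary bytes
encoded = base64.b64encode(payload).decode("ascii")  # ASCII-safe string
decoded = base64.b64decode(encoded)                  # original bytes again
print(encoded)
```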
    1. How to display the tree structure/view in Jupyter?
    • was it the one on the left? if yes, then click on the "folder" symbol on top left, just below the jupyter logo and this will open or close the file tree browser.
    1. What is the difference between Jupyter Notebook and JupyterLab?
    • jupyter notebook (JN) was there before jupyterlab (JL). jupyterlab has all the functionality of the JN but it also offers more: editor, tabs, easier navigation, plugins, command prompt. this means that any notebook we create can be opened in either JN or JL. but in practice I prefer JL since it just offers more tooling when creating notebooks.

Productivity tools

https://aaltoscicomp.github.io/python-for-scicomp/productivity/

    1. Why does the linter show only syntax errors at first? Is it because of interpreter/compiler-style operation? Can it be configured to show all types of errors/warnings at once?
  • see below

    1. Why did linter not show all errors to begin with?
    • if there's a syntax error, everything may be invalid and it can't look deeper. the parser to look at variable names can't run (or might give very wrong results).
    1. Can you run pylint in Jupyter Notebook, not JupyterLab?
  • "pylint: command not found": this came on JupyterLab.

    • I guess it's not installed in your setup. Are you running through Anaconda?
    • Reply: No? jupyter.cs.aalto.fi >> Notebook > Python 3
    1. How did you copy and paste the lint_example.py into the new JupyterLab?
    • It says above: File -> Open from URL gives a method.
      • Thank you!!
    1. If my code calls functions etc from modules, does linter analyse also those?
    • It looks into those modules some, to make sure you are using it right, but probably doesn't score them (unless you tell it to score a whole project)
      • Linters will usually look for simple imports, but it cannot usually tell if an object has a attribute or not because it does not create them.
    1. So I found the error that the linter could not spot, but it is unclear what the error actually was. ax.plot(df['x'], df['pred']) could run; this could not: ax.plot(df[['x']], df[['pred']]). What's going on?
    • df['x'] (single brackets) gives a 1-dimensional Series, while df[['x']] (double brackets) gives a 2-dimensional DataFrame with a single column. matplotlib expects 1-dimensional x and y here, so the extra dimension trips it up. You could try print(df[['x']]) to see what matplotlib receives.
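To see the difference concretely (toy dataframe, with the column names from the question):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "pred": [2.0, 4.1, 5.9]})

series = df["x"]     # single brackets: 1-D Series
frame = df[["x"]]    # double brackets: 2-D DataFrame with one column
print(series.shape, frame.shape)
```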
    1. Could you please show (once more) how you opened a Terminal within the Jupyter-notebook?
    • the "+" symbol on the tab bar of the jupyter lab opens the launcher and there you can select whether you want a python notebook or an editor or a terminal - did you find it?
      • Yes, but it opens it in a new tab. How could one have it in the downside of notebook?
        • you can drag it down below the other
          • Ahaaa! Done! Thanks!
    1. I am using File > New > Terminal but the Terminal window does not open in JupyterLab! Do you have any suggestion for solving the problem?
    • does it say "Unhandled error"?
      • No, It does not show any action/changes
        • what if you go to new launcher and then click there? any error then?
          • I will try it now
          • Yes it gives the "Unhandled error" in New launcher
             • I see the same in my JupyterLab. Trying to find out why
              • it tries to open /bin/bash which might not be available on your operating system. I believe it can be configured to open a different terminal (powershell or git bash)
                • Where the different terminal (powershell or git bash) can be found! I can only see the Terminal option
                  • I am unsure how to configure this. Checking
    1. After you copied and pasted it into JupyterLab, how do you run code_style_example.py?
    1. Is it possible to install black and flake8 via Anaconda Navigator? For example for flake8, I find 3 different ones - flake8-import-order, flake8-polyfill, and pytest-flake8
    • Most of these are extra addons for flake8. flake8 is the base package. Here's an example list of flake8 plugins. Black is just black. Both are available in both main- and conda-forge-channels.
    1. Can the linters be used directly on jupyter notebook files?
    • I use jupyterlab-code-formatter and black directly in jupyterlab notebook (it creates a symbol that I can click)
    1. I remember working with C++ in the Qt IDE. I know it's not Python, but there was a feature that would reformat the code like Black does, at the IDE level, without running the formatter in the terminal. Isn't there something similar for Python?
    • yes you can configure almost any editor to run black in the background. I configured one key on my keyboard to run it in my vim editor. Some people let it run automatically when saving files (but that can be problematic when working with other people who use different settings since then just saving the file after a tiny change can make it looks like you changed everything).
      • THANK YOU!
    1. Do linters detect type mismatches?
    • mypy is a tool to read type annotations and detect type mismatches
    • it works :) just an example,

        # $ cat bad.py

        def f(n: int) -> str:
            return { 1 : None + 1 }

        f(None)

    and for mypy bad.py we get

        bad.py:3: error: Incompatible return value type (got "dict[int, int]", expected "str")  [return-value]
        bad.py:3: error: Unsupported operand types for + ("None" and "int")  [operator]
        bad.py:5: error: Argument 1 to "f" has incompatible type "None"; expected "int"  [arg-type]
        Found 3 errors in 1 file (checked 1 source file)

    1. Can a linter understand (and possibly flag) different conventions such as indentations using spaces/tabs?
    • yes, it can also automatically convert tabs to spaces (depending on preferences and configurations)
    1. Do you know if someone has created a linter specifically for their own preferences?
    • Most likely yes, but usually linters are written to match some community standard. They can usually be configured to match your own preferences.
    1. Is there any way to autoformat the python code with "Black" using "PyCharm"? This is the tool where I usually work in Python.
    • I am very confident that it is possible (but I am not a PyCharm user so I don't know exactly how to do it)
    1. Comment on this: I read: "Coding is easy these days due to languages like Python, but software architecture is way more difficult." (I guess apart from CS degrees, this is not taught.) What is your take on this statement?
    • one difficulty with Python is that there is not just one way to do something but often many. and it can be hard to know which one to choose (how to package, how to do dependencies, how to deal with types, how to distribute, how to organize a project)

Feedback, day 2

News for day 3

  • We get to more advanced tools, which are much more likely to be new to many people (scripts, the library ecosystem, dependency management, binder). Many of these are critical to advanced work!
  • For preparation: if you haven't yet, we use the command line more. Reviewing the shell crash course would be useful if this is new to you: https://scicomp.aalto.fi/scicomp/shell/

Today was:

  • too fast: o
  • just right: oooooooooooooo
  • too slow:
  • too easy: oo
  • right level: ooo
  • too advanced:
  • I would recommend this course to others: oooooooooo
  • Exercises were good: ooooo
  • I would recommend today to others: ooo
  • I wouldn't recommend today:
  • Interesting topics today: oooo
  • Thank you so much!!! oo

One good thing about today:

  • Topics choices, speakers o
  • Topics covered both data analysis and programming-related skills and information.
  • Getting at least one piece of new knowledge related to Python
  • I am really enjoying the advice given for using the code formatters, that would be cool with the other sections too.
  • Really loved the session on productivity tools o

One thing to be improved for next time:

  • Make sure we can do productivity tools from Jupyter. o
  • it'd be really helpful if there was a bit more talk about integrating productivity/formatting tools with version control. o
    • Yes, it would be good! Unfortunately we aren't that much about version control here, so it would be too advanced to some. We'll try it in the future, or adding to the codeRefinery courses: https://coderefinery.org/

Any other comments:

Day 3 and 4

Continue to https://hackmd.io/@coderefinery/python2023archive2