Python for Scientific Computing 2024 - Archive of chats part 1
Test this notes document
- This is a question
- one more question
- …?
- …
- edit me…
- …
- aadf
-
aadf?
- are the comments working?
- I guess they are… but was that a comment?
- how far can we go?
- Is there a point when the indentation stops working? :D
- Nope! LOL
- see?
- weeeee
- 👻
- The limit might be around 42.
-
AADF =
- aaa
Some questions received so far
- Is it going to be recorded?
- How can I get 1 ECTS credit from this course?
- I got a message at 7am that the stream is live?
- Enrico: apologies, email automations went wrong. :( We actually start just 10 minutes before 9am Stockholm/Rome (10am Helsinki/Athens)
- If I use a MacBook, is it recommended to spread the 3 windows (Twitch, notes, Jupyter) over 3 screens that I can slide between? (If I put everything on 1 screen, the Jupyter window gets very small.)
- If you have more physical screens, it is good to spread them, I think. Mac "virtual" screens: I don't find them handy personally.
- I noticed you are using pandas in this course. What's the situation with polars currently?
Day 1 - 5/11/2024
Icebreaker questions
Where (country/…) are you following from?
- Finland, Espoo +16, Helsinki +15
- Portugal, Oeiras +1
- Uganda
- Germany, Potsdam
- Norway, Oslo +8
- Sweden, Uppsala +2
- Sweden, Gothenburg
- Sweden, Lund
- Norway, Tromsø+5
- Sweden, Stockholm+1
- Sweden, Linköping
- Sweden, Norrköping
- Germany, Hamburg
- Norway, Svalbard, Longyearbyen
- Ecuador,Guayaquil
- Netherlands, Amsterdam +4
- Bangladesh
- Norway, Kristiansand
- Norway, Trondheim
- Norway, Bergen +1
- Poland, Kraków
- Spain, Barcelona
- Lebanon
- Iran, shiraz +3
- Australia, brisbane
- Finland, Oulu +1
- UK, Manchester +1
- Sweden, Gothenburg
- UK, Horsham
- Italy/connected from Denmark
- Spain
- Norway, Bergen +1
- UK, Surrey
- Brazil, Sao Paulo
- Turkiye
- France, Cannes
- China
- India
- Iran
- Denmark, Copenhagen +1
- Ethiopia
- Norway, Tønsberg
- Iceland
- Birmingham, UK
- israel
How much have you used Python before?
- Beginner and want to learn for my work +7
- it looked so cool, I am excited!
- Too much but never enough +2
- every day at work +1
- I just finished my edx Python course!
- A bit, but not seriously +2
- I have used it but…
- A bit, but not enough +2
- Not that much! +6+1
- Just back at Uni
- Since 2015, active research software engineering
- A lot before, but…
- Actively from 2018-2023, but nearly completely out of touch since then.
- Using continuously but not in an organized way +2
- Some, but not as much as I'd like +2
- Quite often, but I wish to make more use of the capabilities of parallel programming
- A bit, during courses and tutorial
- am
- I am a beginner +1
- I do machine learning, forecasting, data learning.
- A little bit
- Beginner (Australia, Brisbane)
- beginner
- I learned the first time a couple years ago, since then have used it a bit, mostly editing others' scripts for my own work or consulting chatgpt, so would like to learn to be more independent with it.
- Never
- Learnt in undegrad
- Master's in Bioinformatics
- Expert
- Intermediate +1 +2
- Just finished textbook
- quite a lot, but I want to learn more +1
- Not much, very basic like level zero
- For a few projects here and there +2
How much have you used Jupyter before?
- Comfortable
- Never+1+1+1+1+1+1+1+1
- beginner +8
- Never! +1 +1 +1 +1 +1 +1 +1
- A bit +6
- …
- A little bit +7
- Never
- Am familiar with it but haven't used +1 +1
- Jupyter is my primary method of testing python implementations of stuff I'm trying to build +3
- A little, but not very much
- never +1
- very briefly
- A little
- briefly
- never
- Briefly
- Quite a bit +2
- A bit, but I prefer Emacs
- Never
- i use vs code +1
- I use Pycharm
- Once briefly for a course
- I've used Google Colab, which is basically the same thing
Where do you store your files?
- laptop +7
- HDD
- Personal drive +2
- My personal laptop. x = sum(2+1+1+1+1+1+1+1+1+1+1+1+1+1)
- Onedrive +5
- My neighbours laptop, as they have cloned my repo
- Company drives, CSC
- .
- University +2 +1
- A git repo, and a local backup on my laptop +2
- HPC+1
- My laptop
- HPC + personal computer +1 +1 +1 +1
- git
- my laptop
- github, laptop
- local PC+1
- Work PC +1
- my pc +1
- Hard Disks
Ask more questions here and test this notes document
- How can I test that my python environment works correctly?
- Try opening a new Python notebook and importing a few packages:
import numpy; import matplotlib; import pandas; import xarray
- When you say that we need to have Python opened, you mean the Jupyter Notebook, right?
- Do you share the recording?
- I could not install as per your guidance, but I have Jupyter Notebook. Is that enough for this course?

- This did not work at my Miniforge prompt: `mamba env create -n python-for-scicomp -f https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/software/environment.yml`
- wget the file first, as below.
- Alternative solution: `mamba env create -f https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/software/environment.yml`
- I tried the above, it did not work. See below.
- It seems to have worked now; below is the prompt I got…
- I saved the file manually and ran `mamba env create -f environment.yml`; that did not work either.
- Please say hi to the cat! :cat:
- What do you recommend for building python environments: mamba or conda?
- They are the same… but different :) Mamba is a c++ rewrite of conda, so it is MUCH faster. There are also some differences in the licenses.
- Can they co-exist or I need to choose one? (I already have many environments in conda)
- They can co-exist but it's a bit inconvenient to switch back and forth in daily use–still, I do it. :+1:
- You can install mamba in addition to conda; all conda environments are mamba environments as well. Then you can seamlessly switch to mamba. :+1:
- We'll talk at length about mamba, conda and environment building on the third day (dependencies)
- Note that if you use "mamba create… " to create an environment, you might get trouble using "conda activate…" to open the same environment. If you want to stick to conda, you should also use "conda create.."
- If you do that from Miniforge, all `conda` commands should reroute to mamba, if I'm not mistaken.
- Possible. However, when I did "mamba create …python-for-scicomp", then "conda activate python-for-scicomp" resulted in error message that conda could not recognize the environment. It worked after I did "conda create…" instead
- I'm struggling to import numpy and pandas in VS Code, any ideas?
- Which interactive environment do you recommend? Jupyter or Spyder?
- Whatever works for you. Jupyter is very interactive and general (other people can run the same notebooks) while Spyder is more of a complete IDE with extra features. If you like the extra features, Spyder is a good tool.
- They are different, Spyder might work better for some (e.g. you can see variable contents, execute raw python files or parts of them, have all the plots in one place etc). You can also use Spyder with Jupyter notebooks, there is an extension for that. Spyder is an IDE (supports autocompletion, syntax check), Jupyter is not (therefore no autocompletion). You can technically use Jupyter as a frontend for many other kernels.
- Remember to set your Twitch player resolution to "source", otherwise it will start blurring the video to save bandwidth.
- If you do not need to edit this document, click on the "eye" icon (view only), this makes it less slow to interact with.
Introduction
https://scicomp.aalto.fi/training/scip/python-for-scicomp/intro/
- This is a question
- This is an answer
- Here, have another one!
- ..
Jupyter
https://aaltoscicomp.github.io/python-for-scicomp/jupyter/
- Ask questions here
- What is it that I am activating here?
- How do you create your environment?
- We describe this in the installation instructions - unfortunately we don't have time to go into detail now.
- We'll talk about environment creation on the third day. We'll talk about tools like Miniforge, conda and mamba there.
- What is the difference between Jupyter and IPython?
- IPython is an interactive Python "interface", while Jupyter uses that to create "Jupyter notebooks". Jupyter notebooks can be saved as files; they have cells with code and code results. If you use IPython, you do it in your console. If you use Jupyter, you do it in a browser.
- I understand. So Jupyter notebooks can keep a copy of the output to persist (like some graph/plot), but IPython requires you to re-run everything each time to get to the point where you need that output later. This leads into the question that you need to re-run the notebook/IPython code anyway to go further along while prototyping, so there doesn't seem to be too much difference except the interface…
- IPython saves the output in Out[] lists, Jupyter does it this way as well. It's just that Jupyter is a GUI for IPython.
- If you want to save intermediate results, we'll talk about this in the second day. Jupyter will just save output strings into cell output, it won't save the Python objects and the state of calculations to the notebook. You can think of a notebook as an way of writing interactive code that shows outputs alongside the code.
- Thanks! That helps a lot! :grin:
- How did you show the content of the environment?
- Could it be possible to have an Anaconda and a Miniforge environment on one system?
- Yes. You can install different Python distributions. You'll just have to make certain you have only one activated at a time. We'll talk about this more on the third day.
- Question: Jupyter code won't show output?
- Maybe this is like below.
- There is a play button at the top.
- Yes, I used the play button too. I think when the presenter changed the cell to code/markdown, that's where something went wrong.
- Is there a magic for MATLAB? Is it OK to mix different languages in the same code?
- I don't think so. If you want timing information for MATLAB code, maybe use the `tic` and `toc` statements.
- Question: How to print in Jupyter? The Enter key only expands the cell.
- By default cells print the last value of the cell. You can have a single variable name on the last line to make sure that is printed.
- The `print` function also works and appears as output.
- Ctrl+Enter executes a cell.
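- A minimal sketch of the difference (with `x` as a stand-in variable; run it in a notebook cell):
```python
x = 40 + 2
print(x)   # print() always writes to the cell output
x          # in a notebook, the value of the last expression is also displayed
```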
- Question: I get an error: Couldn't find program: 'bash'
- This might happen on Windows, because Windows does not have the `bash` terminal installed. Don't worry about it, this example is one of many. +1+1
- The environment.yml file took me 10 minutes and it is still running. Is that normal, or should I do something else? (Via Anaconda.)
- If you're using Anaconda it should have most of the packages installed already, so you can probably use it. We'll talk about more efficient installation methods on the third day.
- Where can I find the PDF file of it?
- Where do I see the exercises?
- Any tips on version control while using Jupyter?
- There is a Git extension for Jupyter and you can also use the git commands in the cells, e.g. `!git ...`
- There are also many packages (such as nb-clean) that sanitize notebook output so that only code changes are committed.
- What does "kernel" mean in the Jupyter context?
- A "kernel" is a thing that takes code input, executes it, and gives some output. So a kernel is what implements the Python, R, Matlab, etc. interactions. :+1:
- Sorry, at what time will the stream be live again?
- xx:50 (depending on your time zone)
- Will we solve the exercises together?
- No, but you may check the solution if you wish. Please give it a try first, though.
- The solution is not explanatory enough, as it does not go through all the steps.
- Question: How do I reopen Jupyter?
- Activate the environment if needed and type `jupyter-lab` (opens up in the default browser window) or `jupyter-lab --no-browser`.
- Question: What does "loop" mean in the timeit output? Is it chosen at random?
- A: It is the number of iterations that `timeit` ran. I think it is determined based on how long the code runs, i.e. faster code will do more iterations (by default).
- I don't understand the %%timeit function for the Fibonacci code: it says range(10), but why is it doing 1000000 loops? +2
- I think (correct me if I am wrong) that timeit just loops over the code multiple times to obtain an average of the time it takes to run, so it does not matter what range you choose.
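- A small sketch using the standard-library `timeit` module, which is what the `%%timeit` magic builds on; the snippet and loop count here are only illustrative:
```python
import timeit

# The statement is run many times and the total time is divided by the number of runs.
code = """
a, b = 0, 1
for _ in range(10):
    a, b = b, a + b
"""
runs = 100_000
total = timeit.timeit(code, number=runs)
print(f"{total / runs:.3e} s per loop")   # average time per execution of the snippet
```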
- Q: Is timeit usually used for testing the robustness of the code, rather than its output?
- A: It is for testing how fast the code runs (so that you can try to make it faster).
- Thanks!
- Question: In Jupyter Notebook there is an option to select different kernels. What should be selected and what is the difference between the options?
- CondaValueError: could not parse 'name: python-for-scicomp' in: C:/Users/xxxx/Work Folders/Downloads/environment.yml.txt. I am unable to import the environment.
- The file seems to be stored in the wrong file format. It shows `environment.yml.txt` when it should be `environment.yml`.
Exercise until xx:50
- The exercise blocks on this page: https://aaltoscicomp.github.io/python-for-scicomp/jupyter/
- Try to make sure that you can run some basic code and activate the environment: then you can do the rest of the course. The Jupyter part isn't the most important part.
- If you haven't done the installation, then this will be hard: in that case, take the time to do the installation. It's OK.
Poll: how is it going? Add an "o" to the lines you agree with:
- Jupyter works for me: oo+19
- I ran some exercises: ooooo+9
- I can't activate the environment: o+4
- I don't understand how to download the environment file (it's not explained anywhere). I tried to right-click the text "this environment file", saved it and tried to import the yml-file, but got an error: o
- I lost 1h trying to set up everything, hopefully I will be able to catch up after classes
- I did not manage to make a markdown cell.
- You can create one by choosing a cell and changing the type from "Code" to "Markdown" in the editor bar.
- How to print the Fibonacci numbers?
- Check the solution tab. You can add `print(a)` to print the number (if `a` is a variable for a Fibonacci number). Just adding `a` to the end of the cell will also print the variable at the end.
- I had problems with %%timeit. I have the following code. Then calling it ran into an error:
      IOPub data rate exceeded.
      The Jupyter server will temporarily stop sending output
      to the client in order to avoid crashing it.
      To change this limit, set the config variable
      --ServerApp.iopub_data_rate_limit.
      Current values:
      ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
      ServerApp.rate_limit_window=3.0 (secs)
- I had the same issue. If you just run the function without printing, it times it without printing anything.
- The problem is indeed that when you call it with `print(...)`, Jupyter tries to print the output, while `%%timeit` wants to measure the runtime of the function. This would create thousands of outputs, and Jupyter realizes that this would create problems and stops the execution. So run the function with `%%timeit` in a cell without `print(...)` to measure its execution time across multiple executions, and in another cell with `print(...)` to see the output for a single call of the function.
- Ah, yes. It seems to work well when removing `print(...)`. Thank you.
- Does %%timeit work for parallel processes?
- It uses Python's `timeit` module and can do anything it does.
- You could, but usually you use timeit to measure execution of small snippets (something that you would then run in parallel). For profiling parallel programs we have a separate lesson coming up: https://aaltoscicomp.github.io/python-for-scicomp/profiling/
- One general issue I want to mention with Jupyter magics: when you need or want to convert the code into regular Python scripts, these magics are at best ignored or, depending on the conversion, end up as part of the script. So I'd strongly recommend not to just throw them in, but really only use them on a temporary basis (e.g. when you want to figure out how long each part takes, for potential improvements).
Jupyter (continued)
- My understanding is that in order to be awarded the 1 credit point we need to send the exercise file; however, Outlook is not too fond of sending code files. What is the preferred format for sending exercises? tar?
- I think you can send *.ipynb files, just not *.py files
- Can you please do the exercises?
- Unfortunately we don't have time. We might have videos from past years, but also, you can find many videos of the basics of Jupyter.
- The solution for Exercise 2 gives the following error:
      TypeError                                 Traceback (most recent call last)
      Cell In[33], line 2
            1 a, b = 0, 1
      ----> 2 for i in range(10):
            3     print(a)
            4     a, b = b, a+b
      TypeError: 'int' object is not iterable
  The following code works to get the first 10 numbers:
      a, b = 0, 1
      for i in [0,1,2,3,4,5,6,7,8,9]:
          print(a)
          a, b = b, a+b
  This error usually means that the name `range` was accidentally reassigned to an integer earlier in the notebook; restarting the kernel (or running `del range`) should make `range(10)` work again.
NumPy
https://aaltoscicomp.github.io/python-for-scicomp/numpy/
- How do you know all the np. functions?
- Do you mean, how do you know what's available?
- NumPy has great documentation https://numpy.org/doc/stable/user/index.html#user and an API reference https://numpy.org/doc/stable/reference/index.html
- After using it a while, you'll start remembering the various functions, but of course at the start you'll have to check the documentation. There are functions for all kinds of different cases (mathematical operators, indexing etc.).
- For numpy (and really everything), I am always looking up the functions; sometimes I know the general idea but hardly ever remember exactly.
- What do the dots mean here in the array([[0., 0., 0.],[0., 0., 0.]])?
- You can try without the dots as well. In that case Python will create integers instead of floating point numbers. `0` is an integer, while `0.` is a floating point number. When NumPy creates the array it will check if there is a consistent data type in the lists. If all are of the same type, it will create an array with that type.
- As an example: `np.array([[0., 0., 0.],[0., 0., 0.]]).dtype` returns `dtype('float64')`, while `np.array([[0, 0, 0],[0, 0, 0]]).dtype` returns `dtype('int64')`.
- Will you also look at how to make an empty numpy string array with given dimensions (e.g. filled with '')? Or an empty None array with given dimensions?
- When we are at this level of science, there can't be "empty" arrays: they are of a certain data type and the memory location has to have something in it (you can't fit a Python None or string '' into a floating point memory location). You'd usually make a zero-array.
- What I have been trying to look for is a way to create an array with given dimensions which I want to fill with string values afterwards through a for loop.
- We'll talk about pandas after the numpy session. It supports string operations in a much more usable way than NumPy. You can do string arrays with NumPy (the datatype needs to be `object`), but it's much more work.
- Are numpy .npy files intended to only save one variable, or is it possible to add attributes, units etc. and other variables such as with netcdf file formats?
- `.npy` saves just a single array. We'll talk about xarray later, which supports adding various attributes, and we'll also talk about other data formats tomorrow. You can use numpy.savez to save multiple arrays in one file, but there are even better formats than this that are designed for your use case.
- What is the purpose of saving a variable in a numpy file? When is it better to save it in an Excel file or a numpy file?
- We'll talk about working with data tomorrow, but saving a variable as a numpy file can provide you with an intermediate result that you can use to continue where your code left off. There are better data formats for long-term storage that we'll touch on later.
- I heard earlier that numpy arrays can only have one data type. Is there a way around this if one needs to mix data types within the array?
- Numpy basically shows raw computer memory arrays. A uniform memory location can only hold a uniform data type, so you would have to design your algorithm around what the memory can hold.
- There are numpy "object arrays" where the array holds Python objects, but that doesn't have all the benefits of fast operations.
- I'm not one of the presenters, but why would you need to have mixed data types in an array?
- I was considering a situation with a three dimensional array which will contain names as well for instance
- I am one of the presenters: check out our xarray lesson coming up soon! This will likely be what you need. NumPy arrays can only hold a single data type. If you want multiple data types, you need multiple arrays. But now you need something to neatly tie them together, which is where xarray comes in.
- For np.arange(10), when I type d.dtype it's giving me the dtype as bool, not int64. Where am I going wrong?
- This is interesting. Hm… what lines are you running exactly?
- np.arange(10).dtype gives me int64; previously I was typing np.arange(10) and then d.dtype, and that gave me bool.
- Is it possible to change the range of the random numbers from 0 to 1, to 10 to 12?
- You can multiply the result by 2 and add 10, or use `np.random.uniform(10, 12)` (see the documentation).
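- A small sketch of both approaches (the array size is arbitrary):
```python
import numpy as np

rng = np.random.default_rng()
a = rng.uniform(10, 12, size=5)   # samples directly from the interval [10, 12)
b = 10 + 2 * rng.random(5)        # same idea: scale [0, 1) by 2 and shift by 10
```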
- The exercises are great, but it's going a bit fast: I don't have enough time to do them before we're already moving on.
- I understand; it's unfortunate but we have a lot of constraints. All the material is there to keep trying once we are done (we can even answer questions overnight).
- Our goal is to have enough time to get your hands a bit wet; we don't expect everyone to finish everything.
- Yes, understandable, thank you :)
Numpy Questions:
- Is it the case that inline operations are generally faster than the np. functions? E.g., I am seeing `a + b` going consistently faster than `np.add(a,b)`, and likewise with multiplication.
- Inline operations can sometimes be faster because they utilize operator overloading and universal functions (ufuncs) that are defined for the arrays themselves. This reduces the startup time needed for setting up the function call, so using them is a good idea. NumPy module functions, e.g. `np.add`, can have some additional arguments that are useful in some cases.
- In the first example about mathematical operations, I noticed that square brackets are used both for defining matrices and, later, for indexing. How do you distinguish between these two uses of square brackets?
- When creating lists in Python (for defining matrices), the brackets start the expression: `[a]` would be a list with the variable `a` inside of it. When indexing, brackets come after the variable: `a[0]` would be an array index reference. This is similar syntax to Python's own lists and dictionaries, but numpy allows for much more complex indexing such as ranges `a[0:10]`.
- Also, spaces matter in Python syntax!
- Indeed: `[a]` is the same as `[ a ]`, and `a[0]` is not the same as `a [0]` (which is not a defined syntax) (correction: it is defined, but it is really cursed syntax). Python has a recommended style guide, PEP 8, on this, which recommends avoiding extraneous spaces (`[a]` over `[ a ]`, `a[0]` over `a[ 0 ]`).
- `a [0]` seems to work for me…?
- Well, that is cursed. You're right. I would never use that in practice.
- How do you tell the difference between rows and columns in an array? For example, which of these is row 1 and which is column 1: `a[0]` versus `a[:, 0]`?
- Rows are the first dimension and columns are the second one. If you do not specify all dimension indexes, the missing ones are filled with `:` to the right. So `a[0]` is the same as `a[0, :]`, which can be read as "first row, all columns". `a[:, 0]` reads as "all rows, first column". For higher-dimensional data the same applies, so if you have three-dimensional data `a[0]` is the same as `a[0, :, :]`.
- This is organized this way because iteration is fastest over the last dimension (row-major ordering). Having rows first lets one write loops like the sketch below and have them be efficient (of course numpy functions are written like this in C, which is much more efficient). For more info, see this doc and the advanced numpy lesson.
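- A sketch of the kind of loop meant above (plain Python loops, shown only for illustration; in practice you would use vectorized NumPy operations instead):
```python
import numpy as np

a = np.arange(12).reshape(3, 4)
total = 0.0
for row in a:            # iterate over the first dimension (rows)
    for value in row:    # the last dimension is contiguous in memory (row-major)
        total += value
```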
- How do you define ident as shown in the lecture, np.multiply(a, ident)? I think I missed that part of the lecture.
- `ident = np.eye(4)`. It could be any array of a matching shape, but I chose the identity matrix because the results of matrix multiplication vs element-wise multiplication are clearer.
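- For example (any 4x4 array would do in place of `a`):
```python
import numpy as np

a = np.arange(16).reshape(4, 4)
ident = np.eye(4)
elementwise = a * ident   # element-wise product: keeps only the diagonal of a
matmul = a @ ident        # matrix product with the identity: gives back a unchanged
```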
- How does one know whether an array is a single row or a single column? I can't make a single row or a single column.
- Numpy actually doesn't usually use the concept of rows and columns in the way that matrix multiplication does. There are 1D and 2D arrays, and functions can interpret whether something should be treated as a row or a column.
- There are "arrays" (of any dimension, including zero!), not matrices. (There is a separate matrix type that always has two dimensions.)
- In the "Advanced NumPy" section, you have a typo in the first sentence under "Copy versus view", i.e. "we way it created a "view"". I'm trying to figure out what it's supposed to say… Maybe you can correct it when you have the chance?
- In the Exercise 4 section, the second (advanced) exercise, I am struggling to understand what the task is: "Create an array of dtype='float', and an array of dtype='int'. Try to use the int array is the output argument of the first two arrays"
- I have the same question, what does this mean: "Try to use the int array is the output argument of the first two arrays"? I think maybe what they mean is "Try adding the float array to itself while using the int array as the output argument." (See the sketch below.)
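- A guess at what the exercise is after (the array contents are arbitrary); with NumPy's default casting rules, writing a float result into an int output array is refused:
```python
import numpy as np

f = np.ones(3, dtype='float')
i = np.zeros(3, dtype='int')
np.add(f, f, out=i)   # raises a casting error: a float64 result cannot be stored in an int64 array
```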
- Could you unpack this syntax ar[::-1], please? Why are there no commas?
- What this syntax basically means is "for the first dimension, start at the end, go to the beginning, one step at a time backwards". So it is the same as `a[a.shape[0]-1::-1]` (start from the end, go to the beginning, one step at a time), but it is a more compact notation. Commas are used to separate different array dimensions, while colons are used to describe ranges. So `ar[:]` is the same as `ar[::]`, which is the same as `ar[0:ar.shape[0]:1]`, but nobody wants to write that out every time. `ar[:,:]` would mean `ar[0:ar.shape[0]:1, 0:ar.shape[1]:1]`.
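- A tiny illustration:
```python
import numpy as np

ar = np.arange(5)        # array([0, 1, 2, 3, 4])
ar[::-1]                 # array([4, 3, 2, 1, 0]): whole range, step -1
ar[ar.shape[0]-1::-1]    # the same thing, with the start index written out explicitly
```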
Exercise until xx:00 (then quick wrap-up and then lunch)
Now: Lunch, we resume at xx:00 (13:00 EET, 12:00 CET)
- Take time to eat - don't worry about the course, you have time to explore it later, material will stay available.
- You can keep answering questions, but we will answer them when we get back.
- Installation instructions for the course environment: https://aaltoscicomp.github.io/python-for-scicomp/installation/
- This doesn't work. I followed the installation instruction for the course environment. It's not explained how to download the environment file. I tried to right-click on the hyperlink "this environment file", saved it, tried to import into Anaconda and got an error. So I can't continue and follow along any exercises: o
- Try to do Ctrl + S on the thing that opens when you click on the hyperlink, or just copy it to a notepad and then save it as .yml
- clicking on the hyperlink just opens a tab with text. So that doesn't work. Ctrl + S saves the installation homepage. I try now to copy it into notepad and saving it as a yml-file, then importing it into Anaconda. Result: I am getting the same error message: CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/conda-forge/win-64/repodata.json
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/win-64'
- I don't understand what is happening here, my Anaconda navigator worked totally fine with another environment that I set up earlier. I googled the problem and other people seem to have the same problem and reinstalling didn't help (and couldn't resolve the problem either). I read the suggestion below but I get the error message: 'conda' is not recognized as an internal or external command. Please help me and adjust the installation information: https://aaltoscicomp.github.io/python-for-scicomp/installation/
- Seems like maybe you don't have conda installed? You could try to install in into your anaconda, or maybe install miniforge and use that instead of anaconda, since I think it has conda already, if I'm not mistaken…
- I have conda and conda-env etc. installed (it's in my base (root) folder). Was that what you meant? Can that person that wrote the command below here, copy that command into this editing document again? I wonder if there was a hyphen missing.
- I had the same problem, but what helped was saving the environment file as a .yml file (after first copying it to Notepad) and then creating the environment in the terminal: navigate to the folder where the .yml file is saved and create the environment using that file.
- To the person who is having issues with installation, if you are from a partner university (basically any university in the nordics) please get in touch with scip@aalto.fi so we could find a time slot to solve your installation issues. Unfortunately there are 400+ registered participants and we cannot help everyone with one-to-one support.
Pandas
https://aaltoscicomp.github.io/python-for-scicomp/pandas/
- Is it the same as scipy, which is built on top of numpy with more functions?
- The approaches of scipy and pandas are very different. The idea of pandas is to make it possible to work with tables, like you would in a database, or in R for example, so pandas offers lots of functions that allow that kind of approach, whereas scipy is a broader, more math/computation focused library.
- How is it different from tables in Matlab, if one would compare them?
- From an "interface" perspective they are similar. But the way Matlab stores the data under the "table" objects is different. I was trying to find a reference that I remembered, but could not find it (yet).
- How did you get the Bash there? :)
- My experience is that pandas can be quite slow. Is there a way to get the functionality of pandas with the speed of numpy arrays?
- There are alternatives to pandas, we have listed a few of them at the bottom of the page. Luckily their syntax is very similar, so it is often easy to move from Pandas to Polars or Dask.
- titanic.describe() did not show anything…? describe is not in blue.
- Did you load pandas? `import pandas as pd` Yes.
- Did you download the dataset? Yes, titanic.info() worked.
- What happens when you run the cell? Sorry, I found that the cell was set to "Raw". Now it works. Thanks.
- Could you discuss the usage of indexes in DataFrames, both when manipulating the DataFrame in memory and when writing to / reading from disk as a .csv file?
- Notation issue? Some strings are in '' and some are in "".
- In Python they are always the same; I'm not very consistent in my typing.
- The only time it matters (I think) is if you have quotes in a string, for example, `x = "y'all"`
- Installation issues below have been archived to https://hackmd.io/@coderefinery/python2024archive. If you are from a partner organisation get in touch and we are happy to provide direct help.
- In the data some of the names contain `"` signs, which makes it harder to deal with them. Any workaround? (For example: Johnston, Miss Catherine Helen "Carrie")
- `titanic.loc['Johnston, Miss Catherine Helen "Carrie"', "Age"]` should work.
- Indeed, thanks.
- When talking about views of the data, does Pandas need to have the whole data in memory when creating a 'view'?
- Pandas has the data in memory, but the view is not consuming any additional memory. Views show the same data, but limit / reduce the data based on conditions that are specified.
- There are also alternatives to Pandas and alternative ways of working in Pandas that do not load all of the data in memory at one time.
- We have not called the pd anywhere here. I am baffled.
- The object we are operating on is a pandas object, so this object has all these functions we use.
- `import pandas as pd` <- the functionality comes from here. `titanic = pandas.read_csv()` makes a "dataframe" object, and from here on it knows it's pandas.
- N.b. the above could just as well be `titanic = pd.read_csv()`, which is where we then use the `pd`.
- Referring to your reply above, did we define this 'titanic' as a pandas object?
- Implicitly, yes. We called a pandas function (`pd.read_csv()` or `pandas.read_csv()`, since pd is just an alias for pandas) to create the object, and the function returns a pandas dataframe object. Since Python doesn't require the programmer to explicitly define the type of an object, we don't say so explicitly, but we assume that the function returns the value that is stated in the documentation ;)
- OK, thank you. Now, is the function 'read_csv' something known as a normal/standard function, but when associated with 'pd' it becomes a pandas object? Is that correct? And if it is associated with any other type of… thing (I don't know), does it become a different type of object? In other words, does this 'function' take the form of the associated object?
- I'm not using Jupyter, but instead am running Python on the command line (i.e., the interpreter). Everything is fine so far, but in the first section when the histogram was drawn (i.e., `titanic.hist(column='Age', ...)`), nothing was shown. I only got `array([...)`. Is it possible to make a histogram pop up, or do I just need to use Jupyter? (Reply: Thank you! "import" worked! And I'm now switching to ipython!)
- This might not work from the command line. It might work with `import matplotlib.pyplot as plt; plt.show()`
- The most comfortable way to make the figure pop up is to use `ipython` instead of just `python` to start a terminal. `ipython` is an enhanced version of the Python terminal that supports this sort of stuff. If you are inside an IDE (like VS Code or PyCharm) then you can probably configure it to open an IPython terminal instead of a regular one.
- A pedagogical / philosophical question. For a beginner learning Python, which one would support the learning process better: starting with pure Python types or numpy/pandas? I'm not a beginner myself, I'm just interested in your opinions.
- It depends on your goals: to learn Python itself, then basic Python types. To learn Python for computing purposes, numpy/pandas. (Really you need a bit of both… it depends on how you would like to approach it, more theoretical or more practical.) A scientist working without numpy/pandas is wasting lots of time. But some people need to know the details.
- Thank you, I haven't been able to articulate that kind of point before. I have colleagues that emphasize the practicalities but can't build complex pipelines, so my personal opinion is that knowing Python more broadly is really useful.
- With wasting time, do you mean the time lost in running the programs?
- Mainly re-implementing things that already exist: the "not invented here" syndrome. Also numpy/pandas/etc have lots of C code so will be much faster to run if you use them correctly.
- Thank you, that's good food for thought. Sometimes the vectorized operations do not seem as transparent as more explicit syntaxes (think for loop) but I understand that people are able to internalize these functions differently.
- titanic.loc[::10,"Age"].mean() works, but using titanic.loc[0:10,"Age"].mean() throws an error. I don't understand why exactly?
- Oh, this is a good question! `::10` means "every tenth row". `0:10` means "the row with name 0 to the row with name 10" - but these aren't row names! The row names are the actual passenger names, as we see in the index. So it doesn't work. `.iloc` uses integer row numbers, and `titanic.iloc[0:10]` would work.
- Is it not possible to do this subsetting in one operation then? Do we need to do iloc and then loc on columns? (See the sketch below.)
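- One way to do it in a single expression (a sketch; both forms below mix positional row slicing with the Age column of the lesson's titanic data):
```python
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url, index_col="Name")   # as in the lesson

titanic["Age"].iloc[0:10].mean()                           # label on the column, position on the rows
titanic.iloc[0:10, titanic.columns.get_loc("Age")].mean()  # purely positional on both axes
```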
- It seems to me like pandas would be a better option for mixed-datatype datasets like titanic, and numpy for data with the same datatype throughout.
- I think yes (one of the instructors)! Well, it depends on the details. If you have columns but they are the same datatype, the columns are still actually different and Pandas has benefits. If you have an actual array (say, 2D map of temperatures across the earth), then arrays are better. (so it's not so much "same datatype" but "axes are the same").
- I somewhat disagree, because even if you have same datatypes you still need to identify different columns. So if in Pandas you have T and P as independent columns, and you calculate mean, you get mean of T and mean of P. If you have them in one array in numpy, you will just get a number that does not mean anything as it has been calculated across multiple different types of variables. If the data is a 2D-array of same data (e.g. data in different positions), its a different case.
- Use Pandas if you have "tabular data": data that fits well in a table, or a spreadsheet. Such as the Titanic dataset. When your data is not tabular, for example temperature readings over time across the globe (temperature x latitude x longitude = 3D data), then Pandas is not the right tool and you want to use xarray or just plain numpy arrays.
- How to understand the fact that the survival rate both above and below the average age seems to be higher than the total survival rate? Code error, or?
- "The women and the children first!" This is half a joke and half a guess, but I am guessing that young men (near mean age) would have stayed on the ship while trying to rescue children and older folks.
- Children would skew under average, but why is the save effect over? Did they save that many old ladies? Weird!
- Could be interesting to redo the analysis separately by sex
- Trying, but it is proving challenging. My first time trying pandas
- Actually, see the discussion below. Probably because of missing data, i.e. `titanic.Survived.mean()` is different from `(titanic[titanic["Age"]>0]).Survived.mean()`.
- Women survival rate 0.7420382165605095
- male survival rate 0.18890814558058924
- Super interesting
- Why does Python use methods and functions rather than making everything a function? And am I the only person who finds this distinction incredibly confusing? lol
- This is a good, philosophical question. The main thing is that then the object (that has the method) can determine what the method does - which is good for really complicated code.
- Python uses a style of programming called "object oriented". If you are used to that style, it makes perfect sense, but if not, I understand it can be confusing.
- Are R, Matlab, and Python (with NumPy and Pandas) all pretty much the same in functionality? Do you recommend sticking with one? What are the respective advantages over one another? That Matlab is not free is one thing. What about speed? Adding a question here - does knowing one help with learning the others?
- Generally yes. But Python can do more these days than Matlab, at the cost of being slightly more complicated to understand and write code in. And I do recommend sticking with one for now. Speed is essentially the same (as they all use the same underlying BLAS library to do the actual computations, see the advanced NumPy lesson)
- Mostly, it will depend on what other people around you are using. If most of the analysis tools in your field are using R and you are the only one using python, you might be a very good python programmer, but you can't build on your colleagues work, so you waste a lot of time reinventing the wheel. +1
- I encountered something weird computing the survival rate over and under the mean age. First, I found that `titanic["Survived"].mean()` returns 0.3838 as expected. So 38 % of the passengers survived. However, when doing `(titanic[titanic["Age"]<titanic["Age"].mean()])["Survived"].mean()`, I got 0.40625, and with `(titanic[titanic["Age"]>titanic["Age"].mean()])["Survived"].mean()` I got 0.40606060606060607. So the survival rate suddenly became above 40 % both for under and over the mean age. How does that make sense, when the total survival rate was 38 %? Did I do something wrong?
- Could it be because some people don't have numbers in the "Age" category? When taking the whole dataframe, the survival rate takes these into account, but when you filter by age they are removed.
- I think you are right. Now I tried `(titanic[titanic["Age"]>0])["Survived"].mean()`, and it returns 0.4061624649859944. So more of the people who did not survive lack age info.
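- One way to check the missing-age explanation explicitly (a sketch, using the same titanic data):
```python
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url, index_col="Name")

titanic["Survived"].mean()                           # all passengers, including those with unknown age
titanic["Age"].isna().sum()                          # number of passengers with no age recorded
titanic.dropna(subset=["Age"])["Survived"].mean()    # survival rate among passengers with a known age
```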
Pandas exercises set 1 AND break, we resume at xx:05
- I am a little confused about how to index a pandas DataFrame. For example, in R I could say titanic["Age" > mean(titanic$Age),]. How do I index a pandas DataFrame like this? (titanic["Age"]<titanic["Age"].mean()) returns a list of bools…
- `titanic[titanic.Age < titanic.Age.mean()]`
- Or, more verbose but meaning the same: `titanic.loc[titanic["Age"] < titanic["Age"].mean()]`
- I still cannot run the program and my titanic.csv is not defined properly. I tried to follow the instructions. I imported pandas as pd and also defined the titanic from the url, but it does not work.
- Could you clarify what you mean by "defined the titanic from the url", please? Yes, I ran the commands url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv" and titanic = pd.read_csv(url, index_col='Name') <- this looks correct to me, did you get any error messages? (Feel free to ignore this and focus on the rest of the lesson.) No, I didn't get any error messages.
- What happens if you now run just `titanic` or `titanic.describe()`?
- Sorry, did anyone get how to calculate the pace?
- runners['pace'] = runners['time'] / runners['distance']
- Cheers!
- Should it be runners['distance'] / runners['time'] instead?
- I think pace should be time per distance; in this example, seconds per meter. So this calculation is correct. `distance/time` would be speed.
- Nice observation about the units (seconds/meter)! Of course this example is just a toy model.
- Sorry, I wasn't commenting on the units, only the formula. Edited for clarity.
- Apparently pace is 1/speed.
- What is it supposed to be anyway? Is it just a mistake in the equation, or is pace something new?
- Is it good practice to have nested Pandas objects? (e.g. to include the distance information in the 'runners' example above)
- Could you please explain at the end of the workshop what we should do to get the credits?
- Just curious: do you know how this groupby is implemented in pandas (loop, map, ref, …)? Is it more efficient than a C loop?
- It's mainly a Python loop, so it is mostly there for ease of use. Of course, running analysis (e.g. `mean()`) on each sub-dataframe is fast because it uses numpy functionality to calculate the values.
Short break until xx:30 (exercises/break)
- You can take a break or look at the pandas things a little bit more.
- Sending my love for the cat Ɛ><3<3<3 +1+1
- How long is this session going to be? I thought today's lessons would take 4 hours?
- We end in 33 minutes (at xx:00). There were two hours before lunch and two hours after.
xarray
https://aaltoscicomp.github.io/python-for-scicomp/xarray/
- That's new to me. Looking forward to it +1+1
- ds = xr.open_dataset(filepath) gives me a ValueError: "found the following matches with the input file in xarray's IO backends: ['netcdf4', 'h5netcdf']. But their dependencies may not be installed"
- Did the previous line `filepath = DATASETS.fetch('NARR_19930313_0000.nc')` work OK? (I'm no expert here…)
- Yes, it did not give any errors.
- Weird. For me it shows a message "downloading dataset" (the first time I run it only) and then filepath becomes a local filepath… I don't know.
- I had the "Downloading" message from `filepath = DATASETS.fetch('NARR_19930313_0000.nc')`.
- I'll try to figure it out..
- Oh I see! It requires an extra dependency: "netcdf4" or "h5netcdf", which aren't automatically installed. I'm trying `!mamba install netcdf4` to install it.
- How to use the Python shell in Jupyter?
- In the launcher view of JupyterLab there is "Console", which will give this linear prompt view.
- Is it OK to still use Jupyter for this?
- There is too much fiddling around. Take a deep breath guys :) <- no
- Can it make a 3D plot instead of a heatmap?
- I mainly use netcdf4 instead of xarray at the moment. What are the advantages of xarray over netcdf4?
- xarray is a way of accessing and working with data that is stored in the netcdf4 format. So xarray supports netcdf4 as an input format, but it provides additional tools for working with the data.
- For me, accessing the data stored in the netcdf by label names is the main advantage. Also, doing group operations, selections, and filtering by label names instead of using indices.
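- A small self-contained sketch of label-based access (synthetic data, not the lesson's NARR file):
```python
import numpy as np
import xarray as xr

temp = xr.DataArray(
    np.random.rand(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [60.0, 61.0], "lon": [24.0, 25.0, 26.0]},
    name="temperature",
)
temp.sel(lat=60.0)     # select by coordinate label instead of integer position
temp.mean(dim="lon")   # reduce over a named dimension
```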
- What is geopotential height?
- My instinct tells me this may be relevant for spatial transcriptomics?
- What are NetCDF and HDF5?
- CATS just hit the broadcaster's computer lock button - did it change the stream?
Wrap up of day 1
News for day 1 / preparation for day 2
- Day 1 covered broad basics: it showed you some things but you need to continue learning yourself
- For day 2 it's most important to have Jupyter working so you can do the exercises - if you had problems, now you can work on solving it before tomorrow
- Ask for help from a colleague if you need it.
Today was:
- too fast: oooooooooooooo (in the end)
- too slow: oo
- right speed: oooooooooooooo
- too advanced: oooooo
- too basic: ooo
- right level: ooooooooooo
- needed more exercise time: ooooooooooooooooo
- needed less exercise time:
- I will use what I learned today: ooooooooooo
- I would recommend today to others: oooooooooo
- I would not recommend today to others: oo
- too slow sometimes, too fast other times: oooo
- not enough cat :cat: o+1+1+(++1)+1000+1o+999999999999<3<3 +1ooo
- too slow at first and too fast at last oooo+1
- there was a little too much expectation for the people to know how to put the code together +1 +1
- was good but I missed a lot of parts because I had problems with importing pandas and Xarray, so I couldn't do some things
One good thing about today:
- A good overview of the best practice data modules
- Found a couple of new tricks (i.e. mamba instead of conda, pd.melt, pd.at, advanced numpy) +1o
- the structure of the course
- numpy and pandas was really very great, but the day was fast and intense enough that I had no energy left really to concentrate on xarray, which was completely new to me and went by super fast (also maybe less motivation to learn that since I haven't needed it at all so far in my work)+1
- Xarray part is too fast +1+1 +1+1+1
- Good overview and a lot of material to fall back to +1+1
- was really nice, I already used both pandas and numpy, but I discovered something new and useful. Especially Xarray, is very interesting, I never hear about that. Thank you +1
- Is there any course concerned how to integrate c/c++ with python?
- It was nice to learn about the pandas and numpy tools and how they work. The examples and exercises were very clear and straightforward.
- great that everything is well-documented, so I could catch up when I had to leave for other things
- useful overview for me who uses python a lot but is self-taught so maybe have gaps
- great show of what Python can be used for
- The reading/course material is reaaalllly good, it's like a survival manual. kudos!
One thing to improve for next time:
- Include xarray dependencies in environment.yml
- Include workflows from data prep to processing
- Yes, more of a story to follow maybe, with an overarching task and introducing each library's approach(es) step by step. +1
- Segregate the participants into some groups based on existing proficiency level and then design the modules accordingly.
- Give sufficient time well ahead of the training schedule so that all the installation issues are sorted out and all participants are ready on the training day.
- Cat management :3
- I felt like the Xarray part, which was the tool I was least familiar with, went too fast
- Provide a little more help with writing the code and the logic for the code, as it's quite different to some other systems
- I think it would have helped if the Intro could give a brief explanation of what to expect for the rest of the day and even how NumPy, Pandas, and Xarray are linked. Like, if you have a set of uniform data type, you should use NumPy. If you have a 2D table with mixed types, you should use Pandas, etc. We realise it now at the end of Day 1, but I think teasing us in the beginning would have been nice!
- Seems Day 3 has "Dependency management", including conda. Given the problems some people had with conda, perhaps you could consider pulling that section out and adding it to the Intro section for Day 1, to make sure everyone is on the same page.
Any other feedback?
- yes, one should have a right expectations, but the course material should consider being more pedagogical. The xarray was just jumping in with no proper background and went on too fast. Interesting package though. Thanks. +1
- Also, the installation required for xarray was not given earlier but only when we started. Took some time to solve…
- Thanks!
- It was very engaging day with good practical sessions.
- Maybe, have some more of the use cases, especially for Xarray? Currently, it is not obvious for me if I should use Xarray if I work with 3x3 tensors for electrodynamical modelling and nonlinear optics.+1 (addressed above)
- I really enjoyed the day, but I should definitely have dedicated more time towards setup beforehand. I was essentially using the breaks for setup. Trying to setup xarray lesson as we speak
- ..
- I zoned out at one point. Partly it was too fast to type at the same time as the lecturer.
- I feel like, to really understand this day, I should attend tutorials about NumPy, Pandas and Xarray beforehand. numpy and pandas worked well, but 'conda install netcdf4 pythia-datasets -c conda-forge' did not work (Anaconda)
- You need to do that in Terminal after "conda activate whatever_you_called_the_environment_for_today"
- Very fun first day! The day was a bit too crammed and it was slightly hard to stay focussed. Also, not enough :cat:.
- MORE CAT +1+1+1+1+1+1 :cat:
- Tomorrow more cat or we riot! +1+1+1+1+1 :cat:
- if there is no cat tomorrow we don't attend :cat:
- It is very nice that everything is accessible for the future, with a very nice structure!
- Overall very good course, nice overview of the different tools one has access to in python. Although some of the steps might have been more thorough, I understand the choices in the interest of time :) :cat:
- The Twitch stream is wasteful of screen real estate: the broadcast picture is a 900x500 rectangle while the content is a mostly vertical portrait page. It results in a window with big black bands.
Any other questions:
- How can we download these sessions?
- As in download the video? We will upload the video on YouTube within a week. The materials will stay available (and usually get renewed every year)
- Thanks, that first day has been really engaging so far!+2
- MAMBA > conda +1+1
- Gracias!
- Takk +1
- Kiitos!
- What is the cat's name? :cat:
- CATS (is the online name)
Day 2 - 6/11/2024
Some polls to warm up your fingers. Mark your choice with an "o"
I was here yesterday:
yes: ooooooooooooooooooooooooooooooooo
no: oo
partly: ooooooo
How big is the data you use with Python:
no data: oooooo
<1M: ooo
1-100M: ooooooooo
100M-10G: ooooooooooooo
10G - 1T: oooooo
1T - 100T: oo
more: oo
What's the most chaotic data you have used?
- gene annotation (gff3 files) +1
- tetranucleotide frequencies vectors
- a diffusing quantum particle on a 14k-sized hexagonal lattice - in files of 1K values each
- microbiome data: 20 samples, 20,000 features, mostly zeros, proportions (not counts)
- data with missing values and not ideal date format
- Excel sheets from experimental collaborators, with for instance exclamation marks to highlight "interesting" data
- GO semantic similarity
- daatcousa ftrom feren
- acoustic and satellite data; data from ocean and atmospheric simulations (mainly because of size)
- Food recipes data from food.com for NLP
- Excel files with thousands of sheets all full of untidy data +1
- One time I received data from a collaborator in a PDF where the number of rows and the alignment of the rows didn't match. Had to throw it all away.
- GBs of metabarcoding data from another facility; took ages to figure out samples, controls, etc.
You can ask questions for us to discuss before we start:
- is there some course material/videos relative to xarray use in ML/DL to handle-preprocess data?..
- ML/DL is machine learning/deep learning, right?
- yes
- I will ask around. I have personally used another framework (Dask) for these purposes
- are you covering matplotlib today?
- today I will cover Vega-Altair which is an alternative to matplotlib but I will point out material on Matplotlib and exercises
- There is the matplotlib lesson from last year, the YouTube 2023 playlist is linked in the materials
- will the cat participate again? :cat:
- I hope so… likely coming at the end as feeding time approaches. I will turn on my camera if it comes.
- …
- YES, almost always you think you can use a library, script from a github. BUT then you discover they used python2.7. Or an outdated version of something else… in the end, you just DIY - agreed
- Oh yes.
- AI is good in getting a quick conversion from python2 to python3… but then good luck with the code review process to verify that it's correct :)
- HELL YES! More often than not, existing toolchains just don't exist and need to be remade - maintaining gigabytes' worth of repos for a single project
- Quite often we have to make a lot of new stuff, since science is new. But can it be built on existing things or is all the groundwork re-made also?
- I do have to build on top of existing code, but often something or the other is incompatible (looking you, GCC and CUDA!) so I need to fork/edit the source and build everything - which is the perfect superposition of being useless and frustrating +1
Working with Data
https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/
- You mention `npy`. What about `npz`? For simple numpy array storage, I find `npz` preferable, as it can be used to save multiple numpy arrays?
- I think it's pretty similar, but compressed. I'd always use the compressed version (unless for some reason I know the time to compress/decompress is too much for your particular needs - like for fast processing)
- Ok. So `npz` would possibly be green instead of yellow on "Space efficiency"?
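- A small sketch of saving several arrays into one `.npz` file (the file name and arrays are made up):
```python
import numpy as np

a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)
np.savez_compressed("arrays.npz", a=a, b=b)   # one compressed archive holding both arrays

loaded = np.load("arrays.npz")
loaded["a"]   # arrays are retrieved by the keyword names used when saving
```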
- How about SQL?
- Oh, would be nice to add. But which one: sqlite? Or another database engine? I think that sqlite (and duckdb, which is similar) are good for certain types of data storage and processing. If the data is relational like that.
- yes, I would also start with sqlite
- DuckDB is also great for situations where you need to read the data, not write data.
- What is the difference between HDF5 and NetCDF4, especially when using them for geospatial data storage? Can they be used interchangeably?
- I'll try to raise this to the teachers
- Not a teacher, but I think NetCDF4 is a special case of the more general format HDF5. There are other special cases of HDF5 out there, as well.
- HDF5 is a hierarchical data format that can store datasets, attributes etc. in a folder-like structure. netCDF4 uses HDF5 to store its data, but is a defined structure on where the data should be stored. So e.g. coordinates are always with a dataset with a specific name.
- I'm just wondering about "tidy" data. Does data have to be "melted" in order to be tidy? For example, I understand the point raised about the `runners` data set, but was yesterday's `titanic` data "tidy"?
- Not a teacher, but I would argue that "It depends". That is, it depends on the situation, when you need the data wider or longer (and how much wider or longer) and the different formats could be said to be "tidy".
- The way I think about tidy data (maybe it's helpful): rows are observations and if my table is "tidy", then it makes it easy for me to add new observations to it, without every time extend the rows. If I need to change rows every time I get new data, I might also need to adapt scripts.
- I think this is a helpful view
- thanks! adding to it: if the data layout is tidy, then the next phd student or next postdoc or the group lead can add new measurements/observations to the data and they don't have to adapt (and understand) scripts that process the data
- I would think that the titanic dataset was tidy (each row is an "observation", columns are properties of observations)
- (I asked the question – Thank you all!) I guess what makes the `runners` data set untidy is that each runner's time is stored in a separate column, instead of a separate row. With the titanic data set, each passenger is a row?
- It is probably best to think of each "observation" as a row. In the titanic dataset, each observation happens to be a subject. In the runners dataset, each observation is a race result (which there are multiple of for each subject)
- In the Titanic-dataset each column was a variable so we could answer questions like "group by survival status and calculate average age". In the runners-dataset each column's title was a variable as well. So even though the data was in a table-like format we could not do something like "group by running distance and calculate average running time for that distance" because the data was not in a tidy format. Not certain if this explanation helps.
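- (Added sketch) Going from a wide "runners-style" table to a tidy/long one is what pandas melt does; the column names below are made up:

```python
import pandas as pd

wide = pd.DataFrame({
    "runner": ["anna", "bert"],
    "400m": [70.1, 75.3],
    "800m": [150.2, 160.9],
})

# one row per observation: (runner, distance, time)
tidy = wide.melt(id_vars="runner", var_name="distance", value_name="time")
print(tidy)

# now questions like "mean time per distance" become a simple groupby
print(tidy.groupby("distance")["time"].mean())
```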
-
I am often working with very large datasets and a big bottleneck is just loading data (netcdf files >30GB). If I stored variables needed for the simulations in intermediate data formats can I get a big speed up?
- Oh yes, exactly! Especially binary formats that need less interpreting can be big speedups. It's one of the main benefits of binary formats.
- Which data formats need much less interpreting than netcdf? :)
- The optimal would be a data format with explicit information on where the parts are stored, if you only need parts of the data; the appropriate libraries can then help reduce load times. Alternatively, splitting the data into parts can also help with load times, as long as the parts don't become too small (and depending on what kind of file system they are stored on). Essentially: if you have to load a 30 GB file to get 20 MB of data, then the way you store the data is likely not what you want.
- Tools like xarray by default load the dataset in a lazy way. This means that data is not loaded into memory unless it is used. See this doc for more information: https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf This can be used to load the data description from the disk and then accessing only a small piece of the data that you're interested in and then loading that piece from the disk for analysis.
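- (Added sketch) roughly what the lazy-loading pattern mentioned above looks like; file, variable, and coordinate names are made up:

```python
import xarray as xr

# open_dataset only reads metadata; the actual arrays stay on disk
ds = xr.open_dataset("big_simulation.nc")

# select the small piece you need, still lazily
piece = ds["temperature"].sel(time="2023-01", lat=slice(50, 60))

# only now is that piece actually read into memory
piece = piece.load()
```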
-
Could you explain the difference between .from_dict and .json_normalize? How deeply can the latter unpack a JSON?
- I haven't used .json_normalize, but it seems to work more recursively and tries to dig deeper, making more rows for nested dicts. .from_dict is much stricter about its input. I think.
- :-) thx
- This is exactly what it does.
- with json_normalize you can specify paths for the "tidying", i.e. what the "path" for each observation is (if you go to the docs, it shows an example of this, where the path is ["counties"]; see the sketch below)
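- (Added sketch) roughly what the record_path / meta arguments look like; the data here is made up, loosely following the pandas docs example:

```python
import pandas as pd

data = [
    {"state": "FL",
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000}]},
]

# one row per county, keeping "state" from the outer level
df = pd.json_normalize(data, record_path=["counties"], meta=["state"])
print(df)
```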
-
So, according to the table of data formats, we can already guess that parquet is the preferable choice for storing dataframes created with pandas or xarray or similar?
- At least for pandas, yes. It's basically made for this kind of case.
- Data in xarray is often stored as netCDF-4 files. For more info on formats supported (and recommended) by xarray, see this page. Parquet is a great format for storing tidy data.
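- (Added sketch) reading/writing parquet from pandas is a one-liner; it needs pyarrow or fastparquet installed, and the file name below is made up:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

df.to_parquet("data.parquet")           # binary, column-oriented, compressed
df_back = pd.read_parquet("data.parquet")

# you can also read only the columns you need
ids = pd.read_parquet("data.parquet", columns=["id"])
```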
-
I didn't follow how to convert a json file structure to pandas, so can you please write it here?
- You can try pd.json_normalize(my_loaded_json) if you have the json already loaded into Python (as lists, dicts etc.) using the json module, or pd.read_json(json_file) if you want to load it from a saved file.
-
To view JSON:
-
Isn't pprint a built-in in Python?
- Yes, it is in the standard library. pprint is "pretty print" and gives better formatting than plain print.
-
type(countries_json) tells me countries_json is a list. A list of dictionaries?
- Probably yes (I haven't looked but this is pretty typical)
- Yes. The json.loads converts the text representation of the data to Python objects.
-
For very large datasets (json), what would you recommend instead of loading the entire dataset in memory?
- Does anyone know if there are JSON loaders that can load partial data?
- I have used Dask for this purpose. There might be better options of course.
- If you know you need only part of the data, maybe one of the binary formats (that allow selective loading of the data) is a good idea. When the data gets big, it's worth taking the time to be efficient.
- Yeah, but I'm reusing large data that is provided in json format.
- Converting the data to another format would probably solve a lot of the problems, but the polars package (a pandas alternative), for example, supports scanning json files and taking only parts of them: polars.scan_json
- Thanks!
- I'll also add that doing reading and conversion to intermediate format once is usually better than reading the data every time.
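- (Added sketch) One common pattern with Dask, mentioned above, for JSON that does not fit in memory. This assumes newline-delimited JSON files and made-up field names:

```python
import json
import dask.bag as db

# each line is parsed lazily and in parallel, without loading whole files
records = db.read_text("big_data/*.json").map(json.loads)

# filter/transform before materializing anything
small = records.filter(lambda r: r.get("year") == 2023).pluck("value")

print(small.take(5))   # compute only a small sample
```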
-
I have received a dataset with extra information in a .pkl format. It is associated with ID numbers given in netCDF datasets. Once I read in the pkl file using the command 'pickle.load(file)', I get the data in the pickle file in a dictionary, where the keys are the ID values. However, when then doing the matching with the ID values in the netCDF file, not all IDs identified in the netCDF are found as keys in the pickle file. What can this be due to? Do I read in the pickle file a wrong way? Are there other ways to do this? Thanks!
- Possibly some values for ids in the netCDF were not calculated and thus not stored. Other than that, I would ask my collaborators to possibly get some alternative way to get this (meta?)data for the netCDF data.
- Usually one can store attributes and metadata to netCDF datasets so one does not need to have external metadata files.
- Yes thanks, but in this case I have received both files from another person, and the netCDF includes 3D information (space and time), whereas the pickle file includes more information related to each object in the netCDF file (where each object has an ID). The issue is that when I create a list of IDs at a certain timestep and then search for these IDs in the keys of the dictionary retrieved from the pkl file, not all IDs are found, which is weird.
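- (Added sketch, not a teacher) one quick way to see which IDs are missing and how many; the file name is made up and you need to fill in the netCDF IDs yourself:

```python
import pickle

with open("extra_info.pkl", "rb") as f:
    extra = pickle.load(f)     # dict keyed by ID

netcdf_ids = []                # fill with the IDs read from the netCDF file

missing = set(netcdf_ids) - set(extra.keys())
print(f"{len(missing)} of {len(set(netcdf_ids))} IDs have no entry in the pickle file")
print(sorted(missing)[:10])    # inspect a few; also check for type mismatches (e.g. str vs int keys)
```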
-
Can you provide the link here for the installation on the screen?
-
What is the benefit of using a Jupyter notebook over say, RStudio's RMarkdown files?
- Jupyter notebooks can work with both R or Python kernels (not at the same time though). So you can use jupyter for both.
Break until xx:55
- Then plotting episode
- You can keep asking questions above.
Plotting with Vega-Altair
https://aaltoscicomp.github.io/python-for-scicomp/plotting-vega-altair/
if you try to print data_monthly, do you get something there (i.e. if you are using jupyter, put just the name data_monthly in a new cell under the code cell and execute it; what does that print)?
- Yes, looks like a normal dataframe. Also type(data_monthly) gives pandas.core.frame.DataFrame.
- How did you install the Python environment?
- As instructed, conda env create -f environment.yml. I also removed and reinstalled it, with the same results. No problems yesterday.
- Any advice to check?
- Maybe you could try alt.renderers.enable("browser"). It worked for the Spyder use case.
- Thanks, still nothing renders.
- How are you running? through JupyterLab or something else?
- Yes, JupyterLab.
- can you check the Altair version? e.g. with print(alt.__version__) in a code cell. It should be 5.4.1 if you installed recently with our environment for example.
- Yes, this command prints 5.4.1.
- Which operating system and which browser are you using?
- Ubuntu 22, Firefox.
- I am unsure if this is advanced, but are you able to open the "developer console" from firefox? It tells you if the page (jupyter) has some errors (e.g. related to rendering, network, etc).
-
Nothing strange as far as I can see quickly
- OK, thx! I thought type error might be a clue.
-
Continued from above:
- You could also try another browser if you have one
- I am booting up an Ubuntu 22 to test :)
- Thx! Going to grab lunch but will check back here.
- I am not able to replicate your error with Ubuntu 22 and Firefox. Maybe restart firefox and jupyter-lab
- Restarting jupyter-lab AND firefox seems to have solved it!
- You can also try to see which renderer you have enabled with "print(alt.renderers)"
- output:
RendererRegistry(active='default', registered=['browser', 'colab', 'default', 'html', 'json', 'jupyter', 'jupyterlab', 'kaggle', 'mimetype', 'nteract', 'png', 'svg', 'vegafusion-mime', 'zeppelin'])
- And then set a renderer like jupyter: alt.renderers.enable('jupyter')
- and then rerun the whole cell with the code
-
Was "name" a pre-existing columns in these 2 dfs? or was it created when concatenated?
- Seems like it was already there (which was quite convenient for us)
- So then, if you concatenated 2 dataframes without a pre-existing variable such as name, would the concatenated dataframe give some sort of indication of where the data comes from, or would all the data simply be merged without any origin indicator?
- In this case concatenation joins rows together so it takes rows from one dataframe and joins them with rows from another dataframe. Of course these dataframes should have same columns (variables) or there will be a lot of missing values. One can also join two dataframes based on a shared column or index to add more columns (see pandas.merge)
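- (Added sketch) if there is no such column, pd.concat can add an origin indicator itself via the keys argument; the labels below are made up:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# keys= labels each block, so the origin is kept in the index
combined = pd.concat([df_a, df_b], keys=["experiment_a", "experiment_b"],
                     names=["origin", None])
print(combined)

# or turn the origin into an ordinary column
print(combined.reset_index(level="origin"))
```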
-
I get error "URLError: <urlopen error [Errno 23] Host is unreachable>" while doing this plotting. How do I fix this?
-
Can we easily also change the orientation of the x-axis labels, e.g. having the labels "Oct 2022", "Feb 2023" etc. tilted 45 degrees?
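- (Not answered live; added sketch, not a teacher) in Vega-Altair the label angle is an axis option, roughly like this, with made-up data:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2022-10-01", periods=6, freq="MS"),
                   "value": [3, 5, 2, 8, 6, 4]})

chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("yearmonth(date):T", axis=alt.Axis(labelAngle=-45)),  # tilt labels 45 degrees
    y="value:Q",
)
chart
```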
-
Can you define what is a visual channel? +1
-
If I understood correctly, the main difference between matplotlib and Vega-Altair is that Vega-Altair works directly with pandas dataframes? Are there other advantages?
- One benefit (if you feel it's a benefit) is the coding style. Vega-Altair is much more declarative, so modifications to the graph can be easier to do. Matplotlib allows for very nuanced modifications to the graphs as well, but you'll need to modify the different figure objects a lot, so it can get complicated and hard to reproduce. Vega-Altair uses a grammar called Vega-Lite that defines the graph as JSON, so all elements are well defined as code.
-
Question: What does yearmonth(date):T do?
- yearmonth() is an internal function that says "render the label as year-month"
- :T says "this is temporal data", which gives hints about how the plot should be formatted
-
Do you reckon I can highlight pieces of code in Jupyter notebook (inside a code cell, not markdown)?
- I don't quite understand this - what does this mean?
- For instance, if I want to highlight a "y" in "yOffset" inside a code cell
- what would be your goal with this kind of highlight?
- call attention to it when someone is reading it; I guess I could just use markdown
- yes, this is exactly what the markdown cells are for. Code cells need to be sort of standardized to what the programming language expects. Explanation for readers goes into the markdown cells (which can be in between the code cells, if code needs to be explained in place)
- makes sense, thank you
- I like to use comments to highlight important code bits within the code. Important for when you move away from Jupyter and continue with "code only".
-
why don't the graph appear when I try and plot? (on spyder)
- Any error message?
- "If you see this message, it means the renderer has not been properly enabled for the frontend that you are using.""
- We instructors aren't sure how to run it in VSCode/spyder right now. Does anyone know?
- I see some "spyder" mentions here, not sure if it's useful: https://altair-viz.github.io/user_guide/display_frontends.html
- You might want to try alt.renderers.enable("browser") (from the previous web page)
- Yes this worked: it opened a web page to show the plot, thanks!
- For VSCode, alt.renderers.enable('mimetype') could work
Exercise, we resume at xx:40
Vote: how's it going?
- I am done: oooooo
- I am not trying:
- I had problems:
- I wish I had more time:+1
-
Does Vega-Altair only work with pandas data?
- RB: I am not sure. I have only tried it with Pandas. Researching …
- and after more research: it needs tabular data but it does not have to be pandas.
-
Does vega-altair do interactive plots in Jupyter?
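- (Not answered live; added sketch, not a teacher) at minimum, .interactive() gives pan/zoom in the Jupyter output, and tooltip= adds hover information; data below is made up:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})

chart = alt.Chart(df).mark_point().encode(
    x="x:Q",
    y="y:Q",
    tooltip=["x", "y"],     # shown on hover
).interactive()             # enables pan and zoom
chart
```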
-
When I change the axis of the plot to make it horizontal, the bars are overlapping even though the xOffset is defined.
- you also need to change from xOffset to yOffset. did this help?
-
Where does the name Vega-Altair come from?
- https://altair-viz.github.io/getting_started/overview.html "The project is named after the brightest star in the constellation Aquila. From Earth’s sky Altair appears close to Vega, the star from which our parent project drew its name."
- it used to be called Altair. I think Vega is a backend of Altair and can also be used separately.
-
Yep, I agree, it feels a lot like ggplot2 :+1:
-
I still did not get anything to render, see above, please help. REF1
- Thanks for the highlight, a helper is there typing :)
-
I don't see plots; the code otherwise runs fine in Spyder or PyCharm.
- See the comment above about Spyder: you might want to try alt.renderers.enable("browser")
- So we need to tell Spyder where to render https://altair-viz.github.io/user_guide/display_frontends.html
- Check that you have Altair 5.3 or above. (you should have 5.4.1 if you used the installation environment we provided)
-
Will the data (URL) that we use here in the exercise be available in the future so we can work with the exercises again later?
- Yes, it should stay around indefinitely (until we have some need to re-arrange, that should be somewhat rare) :+1:
-
As a person trying to go from R to Python, this feels very appealing to me. I'd definitely see myself using this library to generate graphs :+1:
-
In which situations should we not use it?
-
Why do I have to call "alt" every time I want to change the size and color?
- do you mean the alt.Color() type of things?
- yes, exactly. I thought it had already been mentioned
- In Python conventions one doesn't often import everything (if you did from altair import * then you wouldn't need alt., but Python people have found that this can cause problems when different libraries conflict)
- I think the question meant why the format changed from color="max temperature" to color=alt.Color(...), and this is a way to customize visual channels. More info/examples: https://altair-viz.github.io/user_guide/encodings/channel_options.html
- It's not really a Python thing but it was the choice of the library creator to allow customization this way.
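- (Added sketch) the two forms side by side; customization only becomes possible with the explicit alt.Color object. Column names below are made up:

```python
import altair as alt
import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "max temperature": [2.0, 3.5, 7.1]})

# shorthand: the library picks the color scale for you
chart1 = alt.Chart(df).mark_bar().encode(
    x="month:N", y="max temperature:Q", color="max temperature:Q")

# explicit channel object: same mapping, but now it can be customized
chart2 = alt.Chart(df).mark_bar().encode(
    x="month:N",
    y="max temperature:Q",
    color=alt.Color("max temperature:Q",
                    scale=alt.Scale(scheme="viridis"),
                    title="Max temperature"),
)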
-
Is there any equivalent of R shiny app in python?
- Plotly? https://plotly.com/python/
- Plotly is interactive but not an application framework like Shiny, which can plot new figures dynamically. With plotly I guess we can only do mouse-over or zoom in/out.
-
Is Vega-Altair plotting working well for producing a series of 2D plots (e.g. temperature as a function of height and latitude) in a for loop, with the purpose of making a video?
- I don't see why not. You might need to save plots as individual frames and use something like ffmpeg to join them into a video. But some people have created animations at least: https://github.com/vega/altair/issues/1268
- When creating many images for a video I would also move out of a Jupyter notebook and use scripts which might make parallelization easier.
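- (Added sketch) saving one PNG per loop iteration, to be joined afterwards with ffmpeg. Saving to PNG needs an extra package (vl-convert-python or similar) installed, and the data/column names are made up:

```python
import altair as alt
import pandas as pd
import numpy as np

for i, t in enumerate(np.linspace(0, 1, 50)):
    df = pd.DataFrame({"height": np.arange(100),
                       "temperature": np.sin(np.arange(100) / 10 + t * 6.28)})
    chart = alt.Chart(df).mark_line().encode(x="height:Q", y="temperature:Q")
    chart.save(f"frame_{i:03d}.png")   # then e.g. ffmpeg -i frame_%03d.png movie.mp4
```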
-
Can Vega Altair plots also take inputs from the user in the interactive mode? As in, if I want to add or delete some data interactively, instead of changing the data file
-
Can Vega Altair plots be used to simulate an algorithm too? Something on the lines of seeing the predicted value based on some trained neural network data input
- I think this is out of scope for a plotting library. But you can of course plot outputs from e.g. scikit-learn or pytorch on a vega-altair plot.
- okay, thanks
-
- Not a teacher: Someone mentioned relational databases like sqlite here (that note is gone). I checked and it should be possible to use the sqlite3 module, or the Jupyter kernel developed by Mariana from QuantStack (https://blog.jupyter.org/a-jupyter-kernel-for-sqlite-9549c5dcf551). I have no experience with that myself, but maybe some of you have?
We are on lunch break, we return at xx:00 (13 EET, 12 CET)
- Then Scripts, Profiling, and productivity tools: various things that you want to use along with Python
- You can keep asking questions and we'll answer when we can.