# Python for SciComp 2022 (ARCHIVE)

###### tags: `Training`

:::danger
## Infos and important links

- This is the archived document:
- The live document is at: https://notes.coderefinery.org/python2022
- Program: https://scicomp.aalto.fi/training/scip/python-for-scicomp-2022/
- Materials: https://aaltoscicomp.github.io/python-for-scicomp/
- Prerequisites: installation instructions at https://aaltoscicomp.github.io/python-for-scicomp/installation/
- Suggestion: if you have just one screen (e.g. a laptop), we recommend arranging your windows like this:

```bash
╔═══════════╗ ╔═════════════╗
║ WEB       ║ ║ PYTHON      ║
║ BROWSER   ║ ║ WINDOW      ║
║ WINDOW    ║ ╚═════════════╝
║ WITH      ║ ╔═════════════╗
║ TWITCH    ║ ║ BROWSER     ║
║ STREAM    ║ ║ W/Q&A       ║
╚═══════════╝ ╚═════════════╝
```

* **Do not put names or identifying information on this page**
:::

*Please do not edit above this*

# Questions from registrations:

- Which Python tutorial do you recommend doing prior to the workshop to learn the basics needed to follow the course?
    - Please check the first link in the prerequisites https://aaltoscicomp.github.io/python-for-scicomp/#prerequisites
- If recordings of the lectures were available, that would be much appreciated.
    - Recordings are available immediately on https://www.twitch.tv/coderefinery for 7 days, and after a few days they will also be on our YouTube channel
- I have Anaconda installed on my PC but I'm not sure whether the various libraries required for this course are installed. Also, I'm a beginner with a little knowledge of the syntax, so will it be too difficult to follow this course?
    - Check a quick tutorial on Python syntax, first link in the prerequisites https://aaltoscicomp.github.io/python-for-scicomp/#prerequisites
    - Anaconda will include everything; it is important that you are able to go through the steps in the verification -> https://aaltoscicomp.github.io/python-for-scicomp/installation/#verification-of-python-and-jupyterlab
- Is it possible to use Jupyter Notebook for this?
    - Yes! Please test that it works by following our verification steps -> https://aaltoscicomp.github.io/python-for-scicomp/installation/#verification-of-python-and-jupyterlab
- I have been going through these prerequisites and trying to install everything, but it does not seem to want to work.
    - Try to uninstall and reinstall Anaconda. If you are affiliated with a university, try asking your local IT support; they are usually very happy to help. If you are in the Nordics, you should have received a link for a help session with installation issues.
- Some things I am curious about and would like to know/learn:
    - What is your favourite Python IDE? I have tried Jupyter notebooks, Spyder, PyCharm and VS Code. I like that you can browse the variable values (DataFrames etc.) and easily get help on how to use the functions and packages; for this Spyder and PyCharm are good.
        - All these options are good. It's a matter of taste. As long as your IDE can have a text editor next to an [IPython/Jupyter console](https://ipython.org/), you're golden! Spyder does this out of the box, but all other IDEs can be configured this way.
    - Can you easily create Jupyter notebooks from `.py` scripts and `.py` scripts from Jupyter notebooks?
        - You can export a Jupyter notebook to a `.py` script (look in the File menu). There are ways to convert `.py` scripts to notebooks (e.g. [nbsphinx](https://nbsphinx.readthedocs.io/en/0.8.9/)), but these are not as easy as clicking a button.
    - Where do the figures end up in Spyder and PyCharm? Interactive visualisations in Jupyter notebooks?
        - This depends on how you configure them. Spyder and PyCharm by default place screenshots of the figures in a separate "plots" panel. This can be controlled by selecting a matplotlib backend in the settings. (I recommend taking the time to figure this out.) Try for example the "Qt" backend to have each figure in a window of its own with nice interactive controls. As an alternative to diving into the settings menu, you can execute the command `%matplotlib qt` in an IPython console (or Jupyter notebook) to switch the matplotlib backend. (See the sketch below.)
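- For the backend question above, a minimal sketch of switching backends from a notebook (this assumes a Qt binding such as PyQt5 is installed in your environment):

```python
# Run in an IPython console or Jupyter cell:
%matplotlib qt            # switch backend: each figure opens in its own Qt window

import matplotlib.pyplot as plt

plt.plot([1, 2, 3])
plt.show()
```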
# Day 1 archive

I can hear you! Yes!

## Icebreakers

Where are you?

* Finland: oooooooooo
* Helsinki: ooooooooooooooooooooooooooo
* Kuopio: ooo
* Jyväskylä: oo
* Poland, Olsztyn: oo
* Dublin: o
* Sweden: oooooooooo
* Norway: oooooooooooo
* Austria: o
* Svalbard: o
* Netherlands: ooo
* Lappeenranta: o
* Iceland
* Joensuu: o
* Espoo, Otaniemi o :)
* India: o
* Espoo, Niittykumpu: o
* Portugal: o
* Espoo rulez: oooooo
* Netherlands: oo
* Italy: oo
* Greece: o
* Tampere: ooo
* Aalto Jengi o
* Spain, Tarragona o
* Cyprus
* Pietari: o
* Turkiye
* Denmark
* Oulu

How much have you used Python before?

* Never used Jupyter. I've used Python for research work and art projects. o
* Just in another workshop, never applied it :)
* Not much, I have done a few basic courses
* I have used it a lot for spatial analysis and data manipulation. I use the VS Code interactive mode rather than Jupyter .ipynb files.
* I use it nearly every day. I use Spyder: ooo
* Taken a few courses, but not very fluent in data handling
* I have used it for basic visualisation tasks.
* Course tasks and research work (ML and deep learning and general algorithms)
* Started yesterday o
* Occasionally, learned while doing.
*
* A number of courses, but haven't gone too deep. Mainly adjusting given code: o
* I have used Jupyter and Python a lot, mainly for behavioral analysis. I will use it more for neural activity analysis. I'm hoping to learn some more advanced techniques.
* Not very much: oo
* ~ 4 years o
* Used it during an introduction to data science course on Coursera and sometimes for personal experiments
* All the time :D
* Several times a while ago
* A few courses. Data handling needs more work
* --
* Less than others
* I write weird code on it. Love Raspberry Pi!
* I used it in my master's thesis
* Not much, I am a Matlab user
* 3-4 months
* Basic

How much have you used Jupyter before?
- Never: oo
- One time: oo
- Beginner: ooooooooooo
- Intermediate: xxoooooooooo
- Experienced: ooo
- Python quite a lot, Jupyter not so much oo
- Some experience teaching basic Python (like Carpentries workshops) with Jupyter
- Not much, but lots of Matlab and R; looking forward to transferring to Python
- Less than a year
- Used enough of Python and Jupyter

* I'm registered for this but cannot type at the line; I need to go in and "edit" the whole page to type...? Any suggestions?
    - Yes, you need to click on the pencil button (= edit) to post your question. Yep, edit the whole page! ok thanks :) You get used to the chaos :)

---

Have you managed the prerequisites:

* Good to go ooooo
* stuff installed: oooooooooooox
* [stuff not installed: o]
* read "how to attend": oo
* sort of
* I need further help from the IT admin at my university: o
* yes, I did it.
* yes
* yes, I did it.
* yes, hopefully everything was installed correctly!
* yes
* By requirements, if you mean jupyter-notebook, YES
* I didn't use a lot of Python
* yes
* Don't you think the questions will get a bit lost here?
    * We will archive the questions.
* yes, I have used it before.

- Could you both adjust the volume of your mics to similar levels? Diana is too loud compared to the guy.
* All this simultaneous commenting is giving me a bit of motion sickness :)
    * I suggest you choose View only :eye: when you don't need to edit. The page does not refresh as often = less movement of paragraphs.
* Are you going to record the lectures?
    * Yes. They are published on Twitch immediately after this streaming ends, and later they will be archived on YouTube.

Volume is great on both! ooooo
Diana is louder and clearer ooooo

- I'm having some issues with kernels in Jupyter Notebook. Is it fine to use .ipynb files in VS Code to follow along?
    - Yes.
    - Great, thanks!

## Intro

https://scicomp.aalto.fi/training/scip/python-for-scicomp/intro/

- question
    - answer
    - answer 2
        - reply
- So the Zoom room is only for people in the Nordics, right?
    - I'm in the Nordics and signed up but have no Zoom invite. I must check my signup for this...
    - I believe it's for people in the Nordics who are part of a university or association that is collaborating with the hub. Not sure though.
- Hm... lost the voice and connection!

## Jupyter

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/

- What is the difference between Jupyter Notebook and JupyterLab?
    - JupyterLab is basically a more modern re-implementation of the Jupyter Notebook interface. "Notebook" can also refer to the file format, which is the same between them.
- Does anyone know good JupyterLab shortcut cheat sheets?
    - I guess we need to make some...
    - Here's one incomplete list: https://coderefinery.github.io/jupyter/interface/#keyboard-shortcuts
- I did not manage to make it work. I don't see my different kernels in Jupyter Notebook. Can I still work through VS Code with .ipynb files?
    - If you can run a notebook or Python any other way, you can attend the course.
    - Yes, Python works fine (3.10.6) and the .ipynb extension makes it look like a notebook.
    - It asks for an Aalto account to log in.
        - Are you running it from your computer or from an Aalto Jupyter website?
        - Are you trying to access an Aalto JupyterHub? This is just one option for Aalto people; most people should install their own.
        - I am from Tampere on a uni computer.
            - Linux or Windows?
            - Windows
            - Can you check if Anaconda is installed? Start -> type anaconda. Depending on your university policy you might be able to install it. If you don't have time, you could try Google Colab, but not everything might work: https://colab.research.google.com/ (it's Google's own JupyterHub; it has some standard packages like numpy and pandas)
            - ok
            - I have Python installed as well; does it work with plain Python?
                - Yes it does
- Are there different markdowns?
    - Different flavors: same basic syntax, slightly different advanced extensions
    - There is also reStructuredText, which does the same job but has different syntax
- I have experience with Jupyter notebooks but am new to JupyterLab. Does JupyterLab have nbextensions? Namely, I would like to produce a table of contents (TOC) in JupyterLab.
    - I've seen this before somehow...
    - JupyterLab does have extensions, but they are different from classic notebook interface extensions.
    - The simplest solution is a folder with only the files you want, ordered by adding numbers in front (01_first_section.ipynb, ...)
    - But [jupyterbook](https://jupyterbook.org/en/stable/intro.html) might be the better solution
- How did you hide the left panel with all the folders?
    - Click on the folder icon on the far left side
    - Keyboard shortcut: Ctrl-b (on Mac: Cmd-b)
    - Thank you :)
- %%bash does not work (Windows indeed); %sh does not work either +1
    - Bash doesn't work on my Mac either (UsageError: %%bash is a cell magic, but the cell body is empty.)
        - Add the rest of the cell and I think it would
        - Oh, my bad... jumped the gun there :) It works!
    - If this doesn't work, it's OK - not needed for the rest of the course
    - Couldn't find program: 'bash' - what could be a reason for this?
- How did you display the list of magic commands?
    - [`%lsmagic`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-lsmagic)
- How do I make a title in Markdown cells?
    - You should use the "#" character in front of your title. The more you put, the smaller the text.
    - See [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) for an overview of the markdown commands supported.
    - Thank you!
    - For a general markdown guide, I recommend [this cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).
- What actually are the magic commands?
    - These are commands that control the Python kernel itself, as opposed to lines of Python code that need to be executed.
    - And they actually aren't part of Jupyter or Python. They are part of the "IPython kernel", the thing that interfaces between the two. (A small sketch follows further below.)
- What is a Jupyter "kernel"?
    - Jupyter can work with many different programming languages. Python is the most well known, but it can also do R, Julia, and many more. To support multiple languages, Jupyter has the concept of "kernels". A kernel is a program that takes whatever is in a Jupyter cell, executes it, and sends the result back to Jupyter.
- When doing exercise 2: UsageError: Line magic function `%%timeit` not found. But when I type "%lsmagic" I get different options shown.
    - Is that at the top of the cell and nothing else before it?
    - It was when running it at the bottom... Now it times it when I put it first. Thanks!
- General question about Jupyter: I guess it is great for sharing code (and data visualization, machine learning, etc.) but when it comes to having an interface with some other machines (like robots through ROS) I would assume it is not the best choice. Am I right?
    - Jupyter might be useful when you're creating the code (e.g. simulating a robot's actions), but yes: if you are not going to be connecting to the robot via the network, you probably won't want to run Jupyter on it.

(questions continue below at bottom)

### Exercises until xx:45

:::success
https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-1

Try Exercises 1, 2, 3 from that page - whatever you have time for; nothing else depends on this. Try to get something working, though, so that you can do the rest of the course.

I am:
- done: oooooooooooooo
- not trying: o
- need more time: ooo
:::

- The disk icon does not work when trying to save the notebook (macOS, Anaconda, Safari combination). Any idea what could help?
    - That's strange. Are you in a directory where you have write permissions?
    - Yes, my home directory.
    - You may go to the top menu: File -> Save notebook. Still an issue?
    - Fixed: the issue was that pressing the disk icon does not prompt for a new name, it just silently saves "Untitled"...
- I started from the command window in a wrong directory.
  How can I stop the command line saying LabApp and go to the correct directory and start Jupyter there? And how did you save the Jupyter notebook? I have a Mac. OK, thanks. But how did you save the Jupyter notebook in the browser?
    - Easiest option, depending on the operating system: open the correct folder in the file manager, right click and choose "open terminal here" or any similar option. On Linux or Mac, you can navigate with the command `cd`. Just `cd` takes you to your home folder, `cd folder_name` takes you to that folder, and `cd ..` takes you to the parent folder.
- Question: I didn't understand the magic cells. Can you explain more?
    - Magic commands do something other than Python. They can be convenient shortcuts that don't really follow Python syntax, or they can control Jupyter itself.
    - It can be a bash command or script, for example, or even a conda command, for example `%conda env list` or `%conda install package`. (See the sketch further below.)
- The `%%timeit` at the beginning of a cell turned a simple `for i in range(10)` into an infinite loop.

```python
%%timeit
for i in range(10):
    print(i)
```

    - Try two %% for timeit at the top of the cell
        - Yup, I had that
    - Also note that `%%timeit` will execute the cell multiple times (sometimes even thousands of times) in order to get an accurate measurement.
        - This explains it! So it just doesn't match with `print()` commands :D
    - Use `%%time` instead. It's not the same, but similar.
        - `%%time` will run it only once and report the time it took
        - Yes. But it's a good alternative if `%%timeit` doesn't work.
        - That is what I meant :)
    - I ran into the same problem, if that is of some comfort. Thanks for the tips ;)
- How did you save your Jupyter file in the browser? It did not pop up the window for naming it??? -- but I saved it in a wrong directory first. mv?
    - Use the "Save as" option in the menu. Clicking the disk icon will silently save it under the current name (Untitled). Alternatively, you can double click the notebook name in the tab bar to rename it.
- When I put the %%timeit command above my Fibonacci code, the Fibonacci code runs in an infinite loop and crashes with the error "IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it." Without the %%timeit my code runs well and only once. The same thing happens with my code for question 4.
    - Does the Fibonacci code do any printing? If so, try removing that part.
    - I only print at the end, outside my loop. But with %%timeit it prints thousands of times.
    - This is my code:

```
%%timeit
Fibonacci = [0, 1]
for i in range(8):
    Fibonacci.append(Fibonacci[i] + Fibonacci[i+1])
print(Fibonacci)
```

    - I will check on my own system. In the meantime try the command below.
    - No problem on my system. But the print command gets run thousands of times, so maybe you should remove it.
        - Why is it run thousands of times? It is outside the loop.
            - It is outside the loop, but still inside the cell: `%%timeit` re-runs the whole cell (including the final print) many times to get reliable statistics.
    - You can increase the data limit with `jupyter-lab --NotebookApp.iopub_data_rate_limit=1.0e10` (this means you need to run Jupyter from the command line)
- Got UsageError: Line magic function `%%timeit` not found. What could the issue be?
    - Make sure that `%%timeit` is the very first thing in the cell. Nothing before (not even spaces or newlines).
- I lost connection to the stream despite refreshing the browser multiple times. Anyone else? +1
    - Richard's computer died, but it's back now.
    - Didn't lose connection here. Not sure what happened.
    - Is it ongoing or is the connection still broken? I don't see or hear anything.
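- For the magic-cell questions above, a minimal sketch of the two kinds of magics. Line magics (single `%`) act on one line:

```python
%lsmagic                  # list all available magic commands
%time sum(range(10_000))  # time a single statement
```

  Cell magics (double `%%`) act on the whole cell and must be the very first line of the cell:

```python
%%timeit
total = 0
for i in range(1000):
    total += i
```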
- I saved the Jupyter notebook, but I don't see it in the Finder? pwd shows I am in the right place. ?
- Can we use magics in Spyder as well?
    - If it uses the IPython kernel, then yes. I'm not sure if it does or is separate (or is based on it).
    - Yes, you run the code in the IPython kernel. Great to know these magics.
- What is the advantage of shutting down kernels?
    - They will still run in the background unless you shut them off.
- What's your opinion on nbdev? Does anyone use it in their day-to-day development at Aalto or in the CodeRefinery community?

### Exercise 3 - please answer here

https://aaltoscicomp.github.io/python-for-scicomp/jupyter/#exercises-3

1. Pitfalls:
    - Running code out of sequence in Jupyter and getting inconsistent results.
    - timeit does not exist
    - As it's a REPL kind of environment, I end up repeating code instead of using functions/classes.
    - Too many different notebooks for different parts of the project, and poor naming to keep track of which part is where.
    - Needless repetition in many notebooks when working in groups on a project
    - I find more debugging functionality in regular IDEs like PyCharm - maybe I am just more familiar with those.
2. Success stories:
    - It's very helpful when working with data exploration - having markdown and graphing capabilities in-place helps.
3. Good development strategies:
    - Learning debugging early on
    - When the main development/experimenting is done, consider moving the code or parts of the code into Python scripts that you then run -> make the `main()` as simple as possible.

- Is Richard back? I cannot hear anything
    - It is a break now
    - Sorry, I got disconnected from the stream so I didn't get the info.
- Where can I find the answers and explanations? I am completely lost after the timeit part. OK, I thought we would go through the exercise together??? Where are the solutions located?
    - If you have a specific question you can ask it here. Solutions are also available on the course page (just click the solutions section open). done

* done:
* Question: why do we see different times when we run the code several times?
    * Computers have many things going on in them - so code gets interrupted by different amounts depending on when it's run. That's why things like timeit run the code many times and take an average.
* <span style="color:blue">Why does the answer for Ex1 in the Numpy-Advanced section have a normal distribution? I ran it 20 times and I got almost the same answers. I guess it is because of the random number generation method in Numpy. It has a limited range maybe?</span>
- Do we have a lunch break :)?
    - Unfortunately not; we can try to make another break longer, but with multiple timezones in a morning it's hard to find a time that works. Some snacks?
    - Personally, I'm eating by the laptop :smile:
- The %%timeit ran my code 7 times; is 7 the default? If so, why?
    - `%%timeit` runs shorter code more times. In the example before, my code ran 10 000 times. In a very short run, random things can make a huge difference. For example, the first time the code runs it is slower because the code itself gets loaded into memory.
    - `%%timeit` has flags that you can use to change its default behaviour. By default it tries to get good enough statistics with a minimal number of runs. See [this page](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) for the flags.
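- A small illustration of those `%%timeit` flags (`-n` = loops per run, `-r` = number of runs), useful when the auto-calibration would otherwise run a printing or slow cell thousands of times:

```python
%%timeit -n 10 -r 3
# Runs the cell 10 times per run, for 3 runs, instead of auto-calibrating.
x = sum(range(100_000))
```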
* Possibly a dumb question, but why should I prefer Jupyter over a simple text editor? (I've been using Geany for years, though for relatively simple code.)
    - Jupyter has upsides and downsides. The short answer is you should use what you are familiar with and what works best for the thing you're working on.
    - Jupyter is good for quickly changing something and rerunning only that part. So it's good for early development.
    - Jupyter is interactive, and you would prefer it when you want precisely that.
    - Jupyter is also good for sharing a story with the code (like sharing with a paper or a public repository)
    - Jupyter is bad for reproducibility: you can execute the code out of order. You can run a cell, then change it and end up with memory not reflecting the code on the screen.
    - Jupyter is bad for writing a package. It's not meant for that. Packages are collections of .py files, which you would edit in a text editor.
* Sorry for sticking to that once again: could one control the `%%timeit` run count so that, like `%%time`, it is run just once?
    - Try `%%timeit -n 1 -r 1`

## Break until xx:04

## Advanced Numpy

https://aaltoscicomp.github.io/python-for-scicomp/numpy-advanced/

Tip: Use theater mode for Twitch. You may view the main window only in this case.

* Is copy in numpy behaving similarly to pandas? (I'm always confused between copy and view.)
    - In the pandas section we'll probably talk about views more, but views are not copies. View = go through the data in some specific way (for example, get me the data at every even index); copy = create a copy of the data where you take data out (for example, every even index). Views are typically faster than copies.
    - Pandas columns are typically numpy arrays, so when you're doing views/copies in pandas, you're usually doing numpy views/copies in the background.
- How can Python use Numpy through C, if I don't have C installed?
    - Numpy is a compiled library. It's compiled before it's distributed to you, like a lot of scientific Python software.
- Where can I find solutions for Ex 1-3, the Fibonacci number? I thought we would go through them together?
    - Solutions are below the exercises, if you open the solutions drop-down box. Unfortunately, due to technical issues we did not have time to go through this exercise.

## Exercise until xx:24

:::success
I am:
- done: oooooooooooxooooo
- need more time: o
- not trying: oooo
:::

* What is the time to beat?
    * 1.4 s
    * I got ~1.4 s for the C code on my laptop.
* What time did you get?
    * 1.39 s
    * 780 ms ± 21 ms
    * 693 ms ± 2.12 ms per loop
    * 486 ms
    * 1.16 s
    * 821 ms ± 52.4 ms per loop
    * 1.34 s
    * 1.05 s
    * 1.02 s ± 24.9 ms
    * 1.36 s ± 97.3 ms
    * 680 ms
    * 872 ms ± 9.62 ms
    * 1.02 s ± 13.3 ms
    * 1.39 s ± 62.2 ms per loop
    * 883 ms ± 19.3 ms per loop
    * 3.48 ms ± 59.5 µs per loop
    * 678 ms ± 16.9 ms per loop
    * 65.6 ms ± 199 µs per loop
    * 880 ms ± 10.8 ms per loop
    * 1.45 s ± 50.3 ms per loop
    * 1.94 s ± 0.225 s per loop
    * 1.13 s ± 31.1
    * 1.36 s ± 468
    * 745 ms
    * 891 ms ± 75.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    * 680 ms
    * 682 ms
    * 1.17 s
    * 623 ms ± 3.79 ms per loop
    * 1.18 s ± 23.1 ms
    * 584 ms ± 107 ms per loop (when running the same as the tutor showed)
* How can I get the time? It didn't work with `%timeit`.
    - You need two percent signs (`%%timeit`) to make it time the whole cell.
    - [x] works
    - It still doesn't give me the result: ![](https://notes.coderefinery.org/uploads/73a16861-45f2-4241-a959-25cbc9372afc.png)
    - 25.5 s ❌
    - For me it just started running the code... had to stop it from generating numbers
    - Have you tried `%%time` instead?
    - I did, the answer looks different though: CPU times: total: 1.77 s, Wall time: 1.78 s
- I'm having problems writing the code. Any suggestions? The time is quite limited to go through the previous basics of numpy.
    - You can create an array of random numbers using np.random.rand(d1, d2, d3, ...)
    - Do not worry. I have the same issues :D
    - Thanks ^^ I made a test run with the suggestion above and 3 numbers in the parentheses. Got 3.9 s and no loops as in the above stated answers.
    - Nice!
- I can't get it to work. It just crashes Jupyter's kernel.
    - This might happen if you're not using numpy functions or if you have limited RAM available.
    - I am trying to figure out if there is any sort of memory limitation.
        - What's your machine + operating system? I have seen these issues with WSL (the Linux virtual machine inside Windows).
        - MacBook Pro 2018 (2.3 GHz, 8 GB RAM), running macOS Catalina.
        - OK, I had the same; it should work. Maybe Chrome or Safari are taking all the resources :)
        - I only have close to 1 GB of RAM free. Anaconda Navigator and Firefox take almost 3 GB.
        - OK, I am actually not even able to import numpy, so the problem might be there.
        - Finally managed to fix it. All it took was updating Anaconda. xD
- When I run a cell of code with %%timeit at the top, it does not save the variable I'm executing. Is that correct?
    - Yes.
- Why would sum(np.random.rand(100000000)) take practically forever while np.random.rand(100000000).sum() takes 2.4 seconds?
    - The first converts every number to a Python object, then sums them. The second tells numpy to sum the raw data using its compiled C code.
    - numpy.sum() is as fast for me as rand(...).sum()
        - The difference is the built-in sum() vs np.sum().
        - I find these two make no difference to execution time on my machine
- Is there any difference between np.random.rand and np.random.random?
    - I think rand specifically returns a random float between 0 and 1
    - https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
    - https://numpy.org/doc/stable/reference/random/generated/numpy.random.random.html
    - Trust only the docs :)
    - (The main practical difference: `rand` takes the dimensions as separate arguments, while `random` takes a shape tuple.)
- I have read somewhere that one of the things that makes Numpy fast is vectorization. What does it mean and what does it imply?
    - Basically, it can tell the computer "do this one operation on all the data at the same time".
    - Instead of writing a loop saying "do this operation on this element, now do the same operation on the next element, ...", you use a NumPy function that does the operation on all elements. While the end result is the same, the second is much faster. This is because CPUs have dedicated instructions for performing operations on large amounts of data, and NumPy will use these. (See the sketch after this list.)
- I tried with %%timeit, but I still do not see any time outputs; I only see [*] in front of the commands. Is it still running on my computer?
    - It is most likely running, but if the code is not written fully using numpy, it might convert the numbers into Python objects, which takes a long time.
    - I just copied the three lines, for the random integer sum by numpy.
    - Maybe the kernel crashed
    - OK, maybe yes. Thanks.
- Do you count the 'import numpy' time when calling that code from the CLI?
    - Usually imports are not included, as that can vary based on what your SSD / hard drive is doing. Hard drives are always slower than your RAM, so this would overshadow the difference in the code implementation.
- I get the error that 'numpy.random._generator.Generator' object has no attribute 'default' when trying "np.random.default_rng(seed=0)"
    - Check your numpy version: `numpy.__version__`
    - 1.23.4
    - OK, that is the recent one; then one needs to check the whole code.
    - I copied it from the Advanced NumPy page. Never mind, now it works. Thank you :)
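- To make the vectorization answer above concrete, a minimal comparison sketch:

```python
import numpy as np

a = np.random.rand(1_000_000)

# Loop version: Python executes one addition at a time (slow)
total = 0.0
for x in a:
    total += x

# Vectorized version: a single call; the loop runs in compiled C (fast)
total = a.sum()
```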
- Does it only change the column headings into rows?
    - Are rows and columns even a thing in numpy... hm... let's see...
    - I see!
    - Yes, makes sense now

## Exercise until xx:50

:::success
Numpy-Advanced-2 https://aaltoscicomp.github.io/python-for-scicomp/numpy-advanced/#exercise-2

Try to write this `ravel` function. The next lesson doesn't depend on this, so don't worry if time is a bit short.
:::

- I admit I had to do it with paper and pencil first :)
    - As most people should do...
    - Good strategy!
- Am I missing something or is this trivial?
    - To me it became trivial after I drew it
    - A one-liner does it.
    - Yes, pretty trivial. Our goal is to make you think for the next part (and to give you time to digest).
    - I thought we were supposed to use numpy internal functions, but it seems that is not needed.
    - There is [`np.ravel_multi_index`](https://numpy.org/doc/stable/reference/generated/numpy.ravel_multi_index.html) that does what you are asked to implement during the exercise. But the point is indeed to do it yourself, as a way to understand the `.strides` property of arrays and why it exists.
- Why should the function take n_cols? It is not needed to calculate the appropriate index, or am I missing something?
    - n_rows is not needed in this case, but n_cols is needed. And there could be row-major vs column-major; in that case the other variable would be needed, I guess. Both might also be needed to check that we don't go out of bounds.
    - Telling people the function doesn't need n_cols would be a hint to the solution!
        - Asking, not telling; I assumed I was in the wrong :)
        - Haha, I just meant maybe that's why it's in the question, not that you shouldn't have asked
    - Also, using negative indexes in Python starts the counting from the end of the matrix, so if you wanted to implement that, you would need n_rows too
- It told me "NameError: name 'ravel' is not defined" - what to do now?
    - You need to define ravel as a separate function (`def ravel(...)`)
    - OK, thank you. But it still doesn't work :D
    - Now it is an invalid syntax
- !! You're not sharing your screen! +1
    - Sorry, my fault
- Is there a way to make numpy print the _ thousand separators?
    - Use the `:_` formatting option: ```print(f'{ravel(3_465, 18_923, n_rows=10_000, n_cols=20_000):_}')```
- What is the use of transpose and reshape at the same time in real data, other than as an example for demonstrating the speed?
    - You wouldn't often use strides just like this. But you need to know how it works inside, and then many other numpy functions make sense, and then you can write much faster code. This is why we are teaching it.
    - Transpose happens often in linear algebra and machine learning. Reshaping is also a common trick in machine learning.
    - I work with volumetric time series (x,y,z,t) and reshape is used daily
    - I suppose going from daily measurements of the number of cups of coffee to weekdays ranked by the number of cups. A bit of a contrived example, but something similar could happen. :smile:
    - Often, this happens by accident. You transpose a matrix, then select only a couple of rows, do some other stuff to it, then try to reshape and *surprise!* it takes ages! All because of that transpose in the beginning you already forgot about.
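- A small sketch of the behaviour described above: transposing only swaps the strides (a view), while reshaping a transposed array forces numpy to copy:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = a.T          # a view: no data copied, only the strides change
b[0, 0] = 99     # writing through the view also changes a
print(a[0, 0])   # 99

c = a.T.reshape(-1)  # cannot be expressed with strides, so numpy copies
c[0] = -1            # the copy is independent of a
print(a[0, 0])       # still 99
```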
- How to free up memory in Python? Can you/should you do it like in C?
    - Usually you don't manage memory in Python. Numpy creates some special cases where you should think about it.
    - Each object in Python has a "reference count" that tells how many times this object is used in the code. Once the reference count goes to zero, the garbage collector can free the memory. If you want to remove the object from your local function or scope, you can use `del object_name` to manually decrease the reference count.
    - For example:

```python
a = np.random.rand(10)
b = np.transpose(a)
del b
print(a)
print(b)
```

    - Printing `b` will give an error, but the array is not removed, as it is still referenced by `a`.
- Since the base array and its derived view share the same memory, any changes to the data in a view also affect the data in the base array. What if I want the derived view array changed while keeping the base array the same? Do I need to create the derived array in an explicit way every time? And if I do this, does numpy need to copy the data instead of constructing a view, am I correct?

```
a = np.zeros((5, 5))
b = a[:2, :2]
```

    - You would need a copy of at least the part you change (unless you do something really weird with strides. Please don't :smile:)
    - So I would create a view and then copy: `b = a[:i, :j].copy()`
    - Then this syntax is quite different from Python lists, and this difference could sometimes cause bugs if the developers are not careful.
    - Yes, the Numpy slicing syntax is different from Python. It causes bugs in my code as well, and I've been using it for 10 years. You just need to get used to it.

## Break until xx:14

- Do we need to do Exercise 3?
    - Not needed for anything else, but of course you can try! 👍
    - Try the "create an infinite matrix" part of Exercise 3. It's fun!

## Pandas

https://aaltoscicomp.github.io/python-for-scicomp/pandas/

- I often hear about databases in courses like SQL. What are the differences in the contexts between SQL and pandas?
    - Both are used for data analysis. Pandas has options for accessing databases like SQL. Databases are used for storing data and doing queries on top of them (getting some subset of data). Pandas is then used to do analysis on the data. Pandas has its own data format (DataFrame), but it is also a collection of tools for working with the data.
    - Very rough simplification: if the typical data structure for pandas is a DataFrame = a table with rows x columns (e.g. each row is a person and columns are properties like name, surname, email), SQL is a collection of dataframes that can link to each other.
    - Comparison between pandas and SQL: https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
- Silly question, but is there ever a difference between 'single' and "double" quotations in Python, or a reason to use one or the other?
    - I don't think there is a difference, but if you use ' you can insert " within it without ending the string, and vice versa.
    - [Relevant Python style standard](https://peps.python.org/pep-0008/#string-quotes)
    - TL;DR: no difference and no recommendation: just pick one and stick with it within a project.
- What are the major improvements/benefits if I use pandas instead of a regular Python list?
    - DataFrames can contain different types of data (numbers, strings, datetime objects). (See the sketch after this exchange.)
    - Indexing and extraction options come to my mind first. With lists, you need to manually check things; with pandas, a lot of the work is done by the library, and your code becomes a lot easier.
    - Columns in data frames are stored as numpy arrays. This will give you the massive speedups that numpy gives compared to regular Python.
    - Each row is also a pandas Series, right?
        - No. Columns are Series objects. If you choose a row, you basically ask it to pick the same index (think about the ravelled numpy arrays) from each column. Operations are fast across columns, slow across rows. Each column can have a different data type.
        - If you select a row, it makes a new Series out of it.
        - If the columns have different data types, the row-Series will have the data type `object`. This is slow, so one should avoid this if possible.
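- To illustrate the pandas-vs-list answers above: each column keeps its own dtype and is backed by a numpy array (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Grace"],                              # strings
    "born": pd.to_datetime(["1815-12-10", "1906-12-09"]),  # datetimes
    "score": [9.5, 9.9],                                   # floats
})
print(df.dtypes)               # one dtype per column
print(df["score"].to_numpy())  # the underlying numpy array
```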
- Is this a hash index or a tree index? Can we do startsWith optimally?
    - You can do the indexing in various ways. Usually you'll want to use boolean indexing based on some values.
- It looks like the second line doesn't work. Probably because the keyword "Name" is not a column in this dataframe.
    - Oh, good point! Because we made "Name" the index.
- KeyError: 'Lam, Mr. Ali' - I receive this error even though I could print it with: titanic[titanic["Name"]=='Lam, Mr. Ali']
    - Ah, because the "Name" column was set as the index! (i.e. the labels on the rows instead of 0,1,2,3,...), so there's no "Name" column any more. But this works: `titanic.loc['Lam, Mr. Ali']`
    - The example on the page could be fixed

## Exercise until xx:55

:::success
Pandas-1: https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-1

Try these exercises; if you want, go through the rest of the things in the lesson we didn't cover. Or try later exercises. We continue with pandas tomorrow.
:::

- Is there any performance difference between ```titanic.iloc[:10]["Age"]``` and ```titanic["Age"].iloc[:10]```?
    - For a larger index entry in iloc?
    - I would prefer the second method. The reason is that you're not using the rest of the columns. If your `iloc` call is very complicated, you might need to do a copy of the underlying numpy arrays: they cannot be represented by simple strides. In the second case you explicitly choose column `Age` and then index only that column. The benchmark below seems to validate this intuition.
    - Interesting question; I see a difference:

```
%%timeit
titanic.iloc[:10]["Age"]
# 21.9 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
titanic["Age"].iloc[:10]
# 13.6 µs ± 640 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

- titanic["Age"].iloc[:10].mean()
    - 28.11111111111111
    - Is this right?
    - Yes
    - (22+38+26+35+35+54+2+27+14)/10 = 25.3
    - There is one NA in the first 10 rows. How is it treated?
        - The NA is ignored, so it is: (22+38+26+35+35+54+2+27+14)/9 = 28.11111111
- How do I combine two boolean masks? (For example, if I wanted to find the survived passengers aged 22. Simply adding 'and' in there throws errors.)
    - The easiest way to do this is probably using the [`.query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method: `titanic.query('Survived == 1 and Age == 22')`
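- You can also combine the boolean Series directly with the bitwise operators `&` (and), `|` (or) and `~` (not); plain `and`/`or` fail because they try to reduce a whole Series to a single True/False. Each comparison needs its own parentheses:

```python
mask = (titanic["Survived"] == 1) & (titanic["Age"] == 22)
titanic[mask]
```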
- I see a difference between the answer I came up with and the one provided; is it because of the *NaNs*, or is it a mistake on my side?

  My solution:

```
avg_age = titanic["Age"].mean()
titanic.groupby(titanic["Age"] <= avg_age)["Survived"].mean()

Output:
Age
False    0.366864
True     0.406250
Name: Survived, dtype: float64
```

  Provided:

```
titanic[titanic["Age"] > titanic["Age"].mean()]["Survived"].mean(), titanic[titanic["Age"] < titanic["Age"].mean()]["Survived"].mean()

Output:
(0.40606060606060607, 0.40625)
```

    - I think it is the NaNs: `Age <= avg_age` is False for missing ages, so those rows end up in the False group of the groupby, while the provided solution's comparisons drop them from both groups.
- This does not work: `titanic.dropna(titanic.iloc[0:10]["Age"])` nor does this: `pandas.DataFrame.mean(titanic.iloc[0:10]["Age"])`
    - Once you have the ["Age"] at the end, you choose one column of the data and the object is no longer a DataFrame.
    - You can do: `titanic.iloc[0:10]["Age"].dropna()`
    - And `titanic.iloc[0:10]["Age"].mean()`
    - If you want it to remain a DataFrame, you need to do titanic.iloc[0:10][["Age"]]
    - The double brackets will keep it a DataFrame (you choose a collection of columns, in this case only one).
    - You can then run similar things: titanic.iloc[0:10][["Age"]].dropna().mean()
    - Using the methods (like .mean) is better than using pandas.DataFrame.mean, because the methods are defined for both Series and DataFrames, whereas pandas.DataFrame.mean only works for DataFrames.
    - Sorry, I tried my best, but it did not work... what is wrong? pandas.DataFrame.mean(titanic.iloc[0:10][[“Age”]].dropna())
        - Note the curly quotes around “Age”; they need to be plain straight quotes: "Age".
- Pandas tricks:
    - titanic.rename(columns={'PassengerId': 'n'}) # renames the key

## Continuing after exercises: Tidy data

- Seems similar to pivoting/unpivoting?
    - Yes. Pandas has [functions](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html) for that as well.

## Feedback, day 1

Tomorrow, we continue with Pandas - going into the basics of how to use it, rather than theory. Take the time to review if you need to, and read the [tidy data paper](http://vita.had.co.nz/papers/tidy-data.pdf). Then, visualization (matplotlib) and data formats. See you tomorrow!

Today was:
- too fast: oooooooooo
- too slow: o
- just right: ooooxoooooooooooooooooooo
- too simple: oooo
- too advanced: o
- was worth attending: ooooooxooooooooooooooooooooo
- I will recommend to others: oooooooooooooooooo

One good thing about today (any lessons):
- numpy arrays under the hood +1 ooooo
- jupyter is crystal clear. o
- Simple and great explanation of Pandas +1 Extremely great session on Pandas, loved it xD +1
- very comprehensive
- Numpy advanced: really nice. o
- <mark>**iloc**</mark>

One thing to be improved for next time:
- recall some basic concepts from the Numpy lesson (if enough time available) +3
- maybe reformat the NumPy session to be closer to that on Pandas (information capacity, preparation, presentation quality, etc.)
- Also add to the introduction email that we should have a look at the basic Numpy part. +2 -1
- Maybe give a heads up to prepare in advance the basics of NumPy and Pandas. +1 +1 Also, a little bit more detail about Markdown and magic commands would be appreciated.
- A heads up on what to go through for the next lessons would be great.
- more advanced pandas
- you could mention some other tools that are used in machine learning approaches.

Other comments:
- Spend a word or two about the extensions in Jupyter Notebook vs in JupyterLab, and their use (i.e. TOC) +1
- I missed this: (new lesson for this year, please browse the basic numpy lesson material here yourself as a prerequisite), which really spoiled the second hour for me
- Could you reshare the link as to where we could get credits for this?
  :)
    - https://scicomp.aalto.fi/training/scip/python-for-scicomp-2022/ - please make sure you are registered!
- A general description of some basics for the beginners might be good

---

# Day 2 Archive

- How to save individual variables and the whole workspace in Python, similar to saving all variables in `image.RData` and individual variables in `variable.Rds` in `R`?
    - Python's [`pickle`](https://docs.python.org/3/library/pickle.html) module lets you save arbitrary things to a file:

```
import pickle
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1, 2, 3])

# Saving
with open('my_vars.pkl', 'wb') as f:
    pickle.dump(dict(a=a, b=b), f)

# Loading
with open('my_vars.pkl', 'rb') as f:
    my_vars = pickle.load(f)
```

- If I have created a conda/mamba environment in a local folder (`conda create --prefix`), how do I perform `conda/mamba clean` on the environment?
    - ...

### Icebreaker, day 2

Loud and clear
Is this for the twitch recordings or youtube? Twitch is good
The twitch stream is perfect for uploading

Do you use the recorded videos?:
- no: xxxXxx
- yes, the same day: oooooooooo
- yes, other days in the course: ooooo (someone deleted over half of the votes here)
- yes, after the course: ooooooooooooooo
- Sometimes the amount of new info is overwhelming, so I follow for as long as I can, but when I get overwhelmed I take a break and start off again using the video later on +3: ooo
- all of the above: oooo

## Pandas continued

https://aaltoscicomp.github.io/python-for-scicomp/pandas/

here: https://aaltoscicomp.github.io/python-for-scicomp/pandas/#working-with-dataframes

- FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. sub1.append([sub2, sub3]) # same as above
    - How does this affect us? Should we use the concat function?
- Hard to follow at the moment... not really sure what is happening. Too much skipping on the screen.
    - thank you

Runners and ages:

```python
runners = pd.DataFrame([
    {'Runner': 'Runner 1', 400: 64, 800: 128, 1200: 192, 1500: 240},
    {'Runner': 'Runner 2', 400: 80, 800: 160, 1200: 240, 1500: 300},
    {'Runner': 'Runner 3', 400: 96, 800: 192, 1200: 288, 1500: 360},
])

age = pd.DataFrame([
    {"Runner": "Runner 4", "Age": 18},
    {"Runner": "Runner 2", "Age": 21},
    {"Runner": "Runner 1", "Age": 23},
    {"Runner": "Runner 3", "Age": 19},
])
```

- This merging of runners works also on the "melted" version that we checked out yesterday. Then the columns are all words.
- But if I run `runners` after the merge, it does not show the merged dataframe. Is it so because it is not saved?
    - Yes, `runners.merge()` **does not** save it anywhere unless you do so explicitly, as in `new_variable = runners.merge()`
    - Of course you can overwrite `runners` with `runners = runners.merge()`
    - Thanks!
- Are database and DataFrame here the same thing?
    - No, they are not the same, although both are used in data analyses.
    - OK, thanks. I heard you talking about DataFrames using the word database, and that confused me. I might have misheard as well.
    - Yup, good to clarify! I think of Pandas DataFrames as Excel spreadsheets; databases (which I don't have so much experience with) can represent a bit more complex data structures, with relations and such.
- I am getting an error on running runners
    - Can you paste the error here?
```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/_j/ldfb6z6j7bz2bpwx8hdpr6340000gp/T/ipykernel_12593/385962062.py in <cell line: 1>()
----> 1 runners

NameError: name 'runners' is not defined
```

    - Try running the command from yesterday where we defined the `runners` dataframe (https://aaltoscicomp.github.io/python-for-scicomp/pandas/#tidy-data)
    - Of course, so silly of me. Thank you
- I have noticed that when merging the Runners and Age DataFrames, some of the entries in Age are missing in the returned DF. How can we make sure that we do not lose data and that a default value is inserted in the missing fields (for instance NA)?
    - The data for "Runner 4" is missing because it does not exist in the original dataframe. The order matters: `runners.merge(age)` is different from `age.merge(runners)`.
    - I will check how to add default values.
    - Since order matters, how do we know which of the two DFs has more entries? (Though this is problematic because both DFs can have missing values.)
        - The idea of merge is to add the column to the first dataframe using data from the other. You should choose the order depending on the result you want.
        - Both original dataframes are still there, so data is not actually lost.
        - See the "how" parameter in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html. With `how="outer"` you keep the rows of both dataframes, and missing fields are filled with NaN. (See the sketch below.)
        - Right, thanks :)

### Exercise 2 until xx:30

:::success
https://aaltoscicomp.github.io/python-for-scicomp/pandas/#exercises-2

Try this exercise and review what is before and after it. There is way more information than you can do, but we'll see what we can manage.

* You need to re-run some cells from the day before in order to get those values of the `titanic` data frame.
* 5 more minutes added for time.

Note: If you are from FI/NO/SE and need live help, remember the zoom link you have in your welcome email.
:::

- Am I doing something wrong? This should give the names, and it works if I add something other than Name to the [], showing name and ticket for example: titanic[titanic["SibSp"] == 8]["Name"] The error is: self._check_indexing_error(key)
    - Try refreshing the page, we fixed some bugs in it last night
    - Still gives the same error
    - `titanic[titanic["SibSp"] == 8].index` - we made the "Name" column the index when we read it. There was a mistake in the notes before.
    - Can I get them in table format? With that it gives out an array.
- I'm having trouble interpreting the results from the "Women and children first" example. My computer prints the following: female False 0.758865 True 0.593750. Wouldn't this mean that 76% of adult females survived and only 59% of female children survived?
    - Oh yes, you are correct, I said it wrong! I guess children had more trouble surviving in the lifeboats?
    - That makes sense to me. Thanks!
- I am sorry, I'm slow. Why doesn't this repeat row 2 and row 4? sub1, sub2, sub3 = df[:2], df[2:4], df[4:]? Is row 0 = header?
    - The Python convention is that array[x:y] "includes x, does not include y". In this case pandas is following that.
    - 0 is not the header but the first row (counting from 0!)
    - [:2] means all lines from the first (0) to the second (1)
- I'm getting an error in `largeFamNames = titanic[titanic["SibSp"]==8]["Name"]`. `largeFamNames = titanic[titanic["SibSp"]==8]` works fine.
    - Can you paste the error message?
    - It should be `titanic[titanic["SibSp"] == 8].index` because "Name" is used as the index. See the solution of exercise 2.
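- For the merge question above, a minimal sketch using the runners/age frames defined earlier:

```python
# how="outer" keeps the rows of both frames; unmatched fields become NaN.
merged = runners.merge(age, on="Runner", how="outer")
print(merged)  # "Runner 4" is kept, with NaN in the time columns
```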
- Actually "wget https://api.nobelprize.org/v1/laureate.csv" works well from the command line or bash magic
    - :+1:
- The URL thing does not work. Gives a network error.
    - Yes, they blocked direct downloads. Try to download it in a new web browser window.
    - Do you mean using a different window or a totally different browser?
        - Same browser, just a new window.
    - I can't find the "File" and "Open from URL" options for opening the csv file link...
    - Did not work. Hopefully this will not be a problem in the next step.
        - The file is not used in other sections, you should be fine.

## Visualization

https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/

- It would be nice to learn the lambda functions too +3
    - `def f(y): return y+y` - that is the same as `x = lambda y: y+y`, except a lambda can be passed directly as a function argument; you don't need the `x = `.
- When you say automatic plotting vs manual, manual is matplotlib, yes?
    - Manual is anything needing your attention. Automatic is running one command/script or notebook and everything comes out in final form. Either could be matplotlib, seaborn, or whatever.
- Any reason to use %matplotlib inline instead of just plt.show()?
    - It looks like %matplotlib inline is no longer needed for newer versions of Jupyter. See the answer below.
    - It looks to me that even without `%matplotlib inline` everything works (i.e. the figure is displayed in the notebook).
    - Hm, I wonder if things are more automatic now. We should check, thanks!
    - Indeed, that is [the case](https://github.com/ipython/matplotlib-inline#usage): "Note that in current versions of JupyterLab and Jupyter Notebook, the explicit use of the %matplotlib inline directive is not needed anymore, though other third-party clients may still require it."
- Can you still quickly discuss the difference between matplotlib and pandas? I saw there are some plotting options in pandas as well, or?
    - The pandas options are just an automatic wrapper that calls matplotlib functions on the dataframes.
- Why is the subplots command needed if we only plot one figure?
    - One can work without using `subplots`, but then one usually needs to do `fig = plt.figure(); ax = fig.gca()` etc., which is much more work.
    - You can just use plt.scatter without using any .figure or .subplots
        - You could do that, but you might end up using axes from a previous plot. If you use the `subplots` function, you create new axes, so that you know your scatter plot ends up in the correct figure.
- So, related to the same example:
    - What does the line `fig, ax = plt.subplots()` do?
    - Why is it ax.scatter? plt.scatter also works. What is the difference?
        - For all I know, plt.scatter will create new axes, so you cannot e.g. add additional plots this way.
            - No. If you run multiple plt.scatter commands directly after one another, they end up in the same plot.
        - We'll mention this below in "Matplotlib has two different interfaces".
        - If you are doing one thing at a time, the plt interface is convenient. If you are heavily scripting and making things modular (you can pass the axes objects around) or making things concurrently, the object-oriented interface is useful. (See the sketch after this exchange.)
- For those having PTSD from using Matlab: is there an alternative to matplotlib that is similar to R's ggplot?
    - Yes, there are some suggestions at the top of the page; some mention similarity to ggplot.
    - Right, my bad... I had HackMD open instead of the course page.
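- A minimal sketch contrasting the two matplotlib interfaces discussed above:

```python
import matplotlib.pyplot as plt

# State-based pyplot interface: commands act on an implicit "current" figure
plt.scatter([1, 2, 3], [4, 5, 6])
plt.title("pyplot interface")
plt.show()

# Object-oriented interface: explicit Figure/Axes objects you can pass around
fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [4, 5, 6])
ax.set_title("object-oriented interface")
fig.savefig("scatter.png")
```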
- Where can we get the hex strings for other colors to be used in matplotlib?
    - Converters are available everywhere, e.g. https://www.rgbtohex.net/
    - Search "HTML color codes"
    - Thank you, this is perfect
- I see the next part of the lesson covers something that I always struggled with... Hopefully I will not have doubts anymore after today!
    - Which part?
    - The very next: "Matplotlib has two different interfaces"
- A small note: the second link about color-blind people is broken (*Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."*) (at *https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/*)
    - One can also check [matplotlib's guide on how to choose colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html). The default colormaps have been optimized so that they do not create accidental patterns due to brightness / perceptibility changes.

### Break until xx:01, Exercise until xx:16

:::success
https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/#exercise-matplotlib

* Work on the plot above and try to make it a little bit more fancy, based on what it says there. Explore the interfaces some - note that the exercise combines a bit of Pandas/numpy too!
* **Don't forget to take your break!**
:::

- Is there a way to get the scatterplot from the provided code outside of Jupyter? I'm trying fig.show(), but it says the object has no attribute "show". Does it need a complete rewrite of the code?
    - You can save the figure using `plt.savefig("plot_name.png")`
    - Oh, yes. plt.show() also works. I was using the wrong object. Thanks!
    - Right. This is to me the most confusing part of Matplotlib.
    - I'm not sure I understand the difference still.
        - Try to keep what I said in mind, and observe examples; I think that may be the best way to understand. One uses `plt.` and one calls methods on `fig.`, `ax.`, etc.
- One yellow dot has disappeared on the chart - shadowed by the blue! 8.81 vs 8.77
    - 11 dots of each color should be displayed, and there are only 10 yellow ones.
    - See the question above!
- If the "parts of a figure" link doesn't work, refresh the page (just fixed)
- Matplotlib gallery: https://matplotlib.org/stable/gallery/index.html
- I have a question that comes rather late in the course: yesterday we saw Numpy as a powerful and fast tool to compute data. Yesterday and today we saw Pandas as a powerful tool for having tidy data. Here come the questions:
    - When to choose one over the other?
        - When data is a homogeneous 2D array of numbers and you work on the whole array at once: numpy. When it's more like "observations and types of observations", especially with different data types: pandas.
        - Does this also apply to n-D data (e.g. a time series of air velocity fields in the atmosphere - 6D)?
            - Yes. Well, numpy is designed for n-D data. Pandas not so much (tidy data can't be n-D, right?), but for labeled n-D data, see this package: xarray. Pandas used to have a "Panel" that was 3D, but it's deprecated; there are other ways to make 3D data more tidy in Pandas (if it should be tidy), or use xarray if it's really an array.
            - Ok, thanks :)
            - NetCDF manages to make tidy data from n-dim data, but that is another topic.
    - Can Pandas be as fast as Numpy?
        - Pandas does use numpy under the hood for each index/series/column. So if you use pandas right, it's the same as numpy. But usually the data types are different, so it's a bit slower - that's because of your data, though.
        - Good to know, thanks :)
- I have a basic question about fig, ax = plt.subplots(). So, fig and ax are objects, but if I run fig I see an empty figure frame, and for ax a line of text. How did matplotlib know these are two different things? Are fig and ax some magic words? I use handles in Matlab, but I never tried to really understand what they were doing.
    - `plt.subplots()` returns a tuple with two different objects (which get assigned to those two names). The two things returned are different: a `matplotlib.figure.Figure` and a `matplotlib.axes.Axes`. The names are not magic; it's ordinary tuple unpacking. (See the sketch below.)
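- A quick sketch of what that tuple unpacking looks like:

```python
import matplotlib.pyplot as plt

result = plt.subplots()  # an ordinary (Figure, Axes) tuple
print(type(result))      # <class 'tuple'>
fig, ax = result         # plain tuple unpacking; the names carry no meaning
print(type(fig))         # a matplotlib Figure
print(type(ax))          # a matplotlib Axes
```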
### Exercise: Styling and customization until xx:50

:::success
https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/#exercises-styling-and-customization

* There are three options here; do whichever interests you.
:::

- I am doing exercise 1 and it errored: module 'pandas' has no attribute 'read_csv'. What should I do now? I have copied all the scripts.
    - That's really weird. Exactly what line did you run? How did you import pandas? Is there any chance that it is very old?
    - Did you copy-paste completely, or forget `as pd` or something like that?
    - This is what I have used: 'data = pd.read_csv(url)'
    - How about `import pandas as pd`?
    - This line works well
    - `print(pd.__version__)`?
    - 'module 'pandas' has no attribute '__version__''
    - Did you make a file called `pandas.py`? If so, it's importing that instead. Try renaming that file.
- I am doing exercise 3, using this example https://matplotlib.org/stable/gallery/lines_bars_and_markers/line_demo_dash_control.html#sphx-glr-gallery-lines-bars-and-markers-line-demo-dash-control-py. When I copy the code exactly as it is in the example, I get the error: 'Line2D' object has no property 'gapcolor', even though it is listed in the arguments in the documentation.
    - I'm also noticing this. Let's look at versions:
        - `import matplotlib`
        - `print(matplotlib.__version__)`
    - `gapcolor` is at least in 3.6.2, but I have 3.5.3, which may be the reason for this error.
- Ex 1: how do I make the x label not scientific? Tried this but it doesn't work: `ax.ticklabel_format(style='plain')` :+1:
    - Try this! Works for me...

```
import matplotlib.ticker as mticker
ax.xaxis.set_major_formatter(mticker.ScalarFormatter())
```

- Is it possible to change/update the size of x/y labels *after* they have been created?
    - Yes, it should be. I can't find the link quickly.
- You can also convert matplotlib plots to TikZ plots for LaTeX with [tikzplotlib](https://github.com/texworld/tikzplotlib)
- Do we always need to write all the lines of the script for the graphs, or is it possible to just add a new line with a new command to modify a graph that has already been generated?
    - Yes, you can modify existing graphs. You'll need to figure out how, though.
    - OK, thanks. And can you visualize the graph without running it all again?
- Quite often I need to plot several graphs in one figure, and sometimes one of the figures is a conceptual one, which I prepare in 3DMax or somewhere else. Can I import a png and use it as a subfigure?
    - https://towardsdatascience.com/how-to-add-an-image-to-a-matplotlib-plot-in-python-76098becaf53
- I copied a violin plot code for matplotlib. There is something wrong with the labels, but I don't see what is wrong in there. Could you help? ax.set_xticks([y + 1 for y in range(len(all_data))], labels=['x1', 'x2', 'x3', 'x4']) gives an error: TypeError: got an unexpected keyword argument 'labels'. If I comment it out, the code works. ?
    - The function `ax.set_xticks` does not take an argument called `labels`. I guess it's not meant to set labels at all.
    - This could again be a version issue:
## Break until xx:10

## Data formats with Pandas and Numpy

https://aaltoscicomp.github.io/python-for-scicomp/data-formats/

- Isn't pickle discouraged from a security perspective?
    - Yes: you don't want to load pickles that others send you. But if you are only using it as temporary storage for yourself, it's OK and can be convenient.
- I see that csv looks superior to json in terms of human readability, being tidy, and so on. Which one would you suggest to store, for example, mobile application data?
    - can you say more about the data?
        - let's say user data, meaning the answers to a survey
    - If there are X entries per user and all fit into one table: sure, no problem, you can put that into a csv. Problems commonly start when you have:
        - multiple different surveys filled in by the same participant, which can have overlapping entries
        - multiple datapoints for one participant
            - Wow, that explains a lot. But couldn't you just keep the user ID and create another column recording the different surveys with numbers?
                - Yes, but imagine there is a non-predefined number of something (like a psychological experiment, where you need to repeat until you get it right, and for each entry you want to store what the participant said). This means you can't have predefined columns, and you need to fill in additional (empty) columns for all participants who had fewer tries than the max number of tries. It essentially becomes quite messy in a csv, while e.g. in JSON you would just put in an array for this value, and those arrays can be of different lengths.
                - Then I would make it "long", to be tidy: one row is one test - (person_id, time, question, answer). You append, and then reshape/process to a wide format later.
                    - but that essentially replicates the person_id for each row (and potentially more info, or you again need a hierarchy of csvs where additional information on the person is stored elsewhere)
                - Or you may want to store some more metadata on different levels (e.g. you have information from multiple surveys, and you want to store the metadata for the surveys - which can be many - along with participant information in the survey). Then you either need a "hierarchy" of csv files, or you will get huge amounts of replicated data per entry, since you need to store all meta-information in each row.
                    - I will save these answers and study them, thank you :)
    - One more comment here: yes, a lot of researchers like csv, because it is "easily" transferable/readable. But imo it is only suitable for relatively simple data, and anyway any "textual" format might lose information for e.g. floating point numbers (both csv and json have this problem).
    - csv has a few disadvantages: as soon as you have multi-layered data, csv either will contain multiple copies of some parts of the data (e.g. if you have multiple entries that refer to the same source, you have to replicate all IDs), or you will need loads of files, or you end up with unconnectable data. Otherwise, zipped csv or zipped json are OK for "simple" data. And JSON (at least to me) is quite OK readability-wise - there are viewers for JSON which make reading it quite easy. A small sketch of both options follows at the end of this thread.
        - I kind of understand what you explained wisely.
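- To make the csv-vs-json trade-off above concrete, here is a small sketch storing the same (invented) survey answers both ways - note how the JSON arrays can have different lengths per person, while the tidy csv repeats `person_id` on every row (file names are hypothetical):

  ```
  import json
  import pandas as pd

  # Tidy "long" csv: one row per answer; person_id is repeated
  long_df = pd.DataFrame({
      "person_id": [1, 1, 2],
      "question":  ["q1", "q2", "q1"],
      "answer":    ["yes", "no", "maybe"],
  })
  long_df.to_csv("survey_long.csv", index=False)

  # Nested JSON: one object per person; variable-length answer arrays
  nested = [
      {"person_id": 1, "answers": ["yes", "no"]},
      {"person_id": 2, "answers": ["maybe"]},
  ]
  with open("survey.json", "w") as f:
      json.dump(nested, f, indent=2)
  ```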
- **Comment**: JSONs can also be imported as e.g. dicts, which can come in handy :+1:
- I've previously run into trouble when pickling pandas dataframes on one machine and trying to read them on another. The read function was not found, or something... Is the problem in the pandas version or the pickle version, perhaps? I was hoping they would have been better backwards compatible.
    - Pandas has its own functions for pickling and reading pickles:
        - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
        - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html
    - These should be called automatically when you pickle the dataframe. Maybe it's a version issue.
        - So the issue is in the *pandas* version, I guess
            - yes
- Any comments about the protobuf (Protocol Buffers) file format for storing large datasets?

### Exercise: pickle (until xx:33)

:::success
https://aaltoscicomp.github.io/python-for-scicomp/data-formats/#exercise-1

* Save and restore some object
* Quick exercise, nothing else depends on this.
:::

### Exercise: CSVs (until xx:47)

:::success
https://aaltoscicomp.github.io/python-for-scicomp/data-formats/#exercise-2
:::

- Running the following from the CSV section yields "True" for me?
  `dataset.compare(dataset_csv)`
  `np.all(data_array == data_array_csv)`
    - That might happen - it depends on what random numbers you get. The pandas csv writer tries to represent the floating point numbers to full precision, but there can be edge cases where it does not match. We should update the text to reflect that. Do check the CSV, though: it might use a lot of decimals.
- Very basic question: if I type `dataset` in JupyterLab, the `dataset` is shown as an output. However, being very big, the middle rows are truncated and not shown (instead I see just `...`). Can I enable scrolling through the whole dataset?
- I got an error: `AttributeError: 'DataFrame' object has no attribute 'compare'`
    - The "compare" function was introduced in Pandas 1.1, which is already quite old. Can you try `print(pandas.__version__)`?
        - right, I have 1.0.5. How do I upgrade it?
    - Are you using Conda?
        - Yes, I am in Anaconda Navigator
    - I have not used Navigator. *Anyone else?* But if you can run the conda command in the environment, it's `conda update pandas`.
        - ok, I'll try this! Thanks :)
    - With `pip` it's `pip install --upgrade pandas`, but if something else depends on Pandas, you need to upgrade that as well.
- How can we make a simple file with a) modules to load, and b) parameters (having a 'metadata' approach in mind)? Bonus question: can you get a list of the modules loaded in a script? (Such a list could be useful to create pyenvs, to make sure there are no version conflicts.) ML is exactly the reason behind my question. :)
    - It can get tricky if you need to save the ML model and the software used in the same file format.
    - [MLflow](https://mlflow.org/) is one alternative, but the industry has not yet settled on one format.
    - Original poster here again... I looked into it, and basically one could make a file containing the list of modules (for instance in a CSV file) and dynamically import the libraries using `__import__()` - see https://docs.python.org/3/library/functions.html#import__ . This will import each module using the original module name: `__import__("pandas")` is the same as `import pandas`. This is strongly discouraged, and one can mess things up by doing this. Bottom line: it's better to just include `import` statements in the code.
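- On the bonus question above (listing the modules a script has loaded): a rough standard-library sketch. Note that `importlib.metadata` looks up *distribution* names, which do not always match module names, so treat this as an approximation:

  ```
  import sys
  from importlib import metadata

  def loaded_module_versions():
      """Map top-level imported modules to installed versions (best effort)."""
      versions = {}
      for name in list(sys.modules):
          if "." in name:       # skip submodules like pandas.core
              continue
          try:
              versions[name] = metadata.version(name)
          except metadata.PackageNotFoundError:
              pass              # stdlib modules have no installed distribution
      return versions

  import pandas  # example import so something shows up
  print(loaded_module_versions())
  ```

  For reproducible environments, `pip freeze > requirements.txt` or exporting a conda environment is usually the more robust route.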
- Could you share a resource to study JSON files with pandas?
    - This seems interesting - it has info on both pandas usage and a bit of the JSON format itself: https://pythonbasics.org/pandas-json/
- Is there a way to store the state of the Jupyter notebook?
    - Maybe dill https://pypi.org/project/dill/ is an option?
- ..

## Feedback, day 2

Today was:

too fast:
too slow: ooöoo
just right: oooooooOooo
too advanced:
too basic: oo
worth attending: ooooooooooooooooo
not worth attending:
I would recommend to others: oooooooooöoo

One good thing about today (for each lesson):

- matplotlib and its two 'flavours' (object-oriented or not) +3
- great work by all speakers + I know I will reuse the materials +2
- great material. +2
- motivation for thinking about data formats - this is actually exactly where I am with my thesis data right now
- .

One thing to be improved for next time:

- The data formats talk is quite slow... it even ran out of time!
- The data format part seems like something that is more efficient to google when you need it
- I would like to do more hands-on practice. +1
- Type, don't copy/paste - then you're forced to explain more +1
- all lessons ran out of time exactly when the interesting parts were starting +1
- Less of the absolute introductory bits. Time was spent on import statements in the first lesson; this is too basic.
- I think a bit more time on the differences between the matplotlib interfaces and on making nice-looking plots would be good

Other comments:

- The level of this course is very variable: "introductory level" is too slow for those who have experience, and the advanced stuff is too much for beginners. Would it be possible to have this course separated into two courses? The first would be an introductory course ("these things you can do with Python, this is how to do them") and the advanced version for people who already have experience. The advanced version could focus on explaining best practices, e.g. stuff you cannot find when googling for a script, what the most efficient method is, etc.
    - This is an old question for us, and yes, it's a big issue. We used to give more dedicated advanced courses, but people can be advanced in different ways, and we ended up having to make most courses basic anyway (or not help most people). In other words, it's hard to say what is basic and what is advanced in interdisciplinary environments.
    - Our current strategy is to try to be basic, but have some advanced stuff scattered around, and let people know that they may interact differently with different parts.
    - We're using a split strategy for our "intro to the cluster" course now, though: basic course in June, and then advanced workflows in February.
- A good column to add to the table comparing different data formats might be how suitable each format is for storing metadata alongside the actual data
- .
- .

# Day 3 & 4

Continue to part 2 -> https://hackmd.io/@coderefinery/python2022archive2

---

:::info
**This is the end of the document, please always write at the bottom, ABOVE THIS LINE ^^**

HackMD can feel slow if more than 100 participants are editing at the same time: if you do not need to write, please switch to "view mode" by clicking the eye icon at the top left.
:::