# Python for SciComp 2022 (ARCHIVE part 2)

###### tags: `Training`

:::danger
## Infos and important links

- Previous versions of this document are archived here: https://hackmd.io/@coderefinery/python2022archive
:::

*Please do not edit above this*

## Day 1 & 2

If you are looking for the Q&As from days 1 and 2, they have been archived at https://hackmd.io/@coderefinery/python2022archive.

# Day 3

## Icebreaker

### Have you used someone else's code and regretted it? Why?

- Yes, badly commented code ended up doing something it was not meant to do, and the comments that did exist were not very professional. Got me into trouble :D
- Yes, the code made a lot of assumptions about the data that were completely wrong, so I ended up rewriting the whole thing :)
- I have used other people's code. It was well cited, so I think it's fine.
- No, I don't regret easily. :)
- I have regretted the opposite: rewriting my own version of something that was already well implemented :D ooooo
- The typical "don't reinvent the wheel"
- I've used modules (to be more specific — packages in R) that do not work correctly, so I needed to fix them. Didn't regret it, but it was something I wasn't expecting to do.
- If you understand the code, then not really. If not, then it might backfire :DD

### What's it like in a livestream course?

- This setup could also work in a lecture hall; it would be great to have a HackMD open while the instructor is teaching
  - :+1:
- I like it better than a live lecture course. It is more flexible and I prefer coding in my little cave, but you have still made it nice and interactive :) +3
- The fact that you guys are working in pairs and play "teacher/student" is amazing! It really helps keeping the right tempo +4
- I love type-along -- this is cool +1
- Sometimes hard to follow: if you drop out or don't understand something, it might be hard to get back on track. It was easier since the topics have been familiar. I think this course, or this way of learning, is not the best format if you are new to the topic.
- I like livestreaming as you can have multiple levels in the course depending on your background. Like this document, where you can also ask more advanced questions. +1 (livestreaming is so much more flexible. Lecture halls are easily dominated by one or a few students - whose questions others might find not interesting/important)
- It's probably not as good as a lecture hall in terms of engagement, but the fact that we can ask live questions in the HedgeDoc that you guys address (!) makes it way better than just a standard online course
  - Yes, in the past we had these workshops in-person only, and we were using coloured post-its so that people could "raise their hands" and get a helper to check their code/errors/etc. We kind of have the same for those in the Nordics with the learner's Zoom, where helper and learner can go to a breakout room, share screens and discuss. Unfortunately we do not have enough resources to run a Zoom for the whole world.
- I'm not living in Finland, so it's great that I can still follow this course :D
  - The course is given from Finland, a non-Scandinavian country, but I see your point ;)
  - Suomi mainittu. Torille! ("Finland mentioned, off to the market square!")
  - Sorry for the mix-up :(

## What are the bad sides of livestream?
- Getting sucked into coding and noticing that you have not been following for the last 15 minutes and have no idea what is happening +3
- Not bothering to try the exercises, getting numb
  - Peer learning helps here, if you follow along with a friend/colleague in a physical or a virtual space
  - Speaking of peer learning in virtual space, I do not recall taking a moment during exercises and using Zoom breakout rooms in previous lessons. Did I miss anything?
    - If you are from the Nordics, you have a link to the learner's Zoom in the welcome email. If your inbox exploded, just email us at scip _at_ aalto.fi and I can resend you the link. The learner's Zoom is not very active: 200+ people got the link and currently 2 are there :) So you didn't miss much, and some live discussions about Pandas have been documented in the HedgeDoc.
    - Appreciated. I meant to say that I didn't notice whether exercise solving in Zoom breakout rooms was explicitly announced during the lessons so far.

### Other

- Related to yesterday's topic: is there a way to pipe command-line tool text (.csv) output to e.g. Parquet format? Many bioinformatics tools output large result files in .csv format and they fill up space.
  - I could not quickly find one, but it should be easy to write in Python using Pandas :smile: (see the sketch at the end of this section)
  - Actually, check out https://pypi.org/project/csvcli/. The convert command sounds promising.
  - The tool mentioned above looks great, but if your code produces CSVs where there's some metadata in comments, you might want to write your own script that stores that metadata in the Parquet file as well. Apache Arrow has [its own CSV readers](https://arrow.apache.org/docs/python/csv.html#) and [exhaustive documentation](https://arrow.apache.org/docs/python/parquet.html) on Parquet writing.
- Good balance :oo
- Richard could be a bit louder
- If you were giving the lecture in person, you could ask questions such as "does it make sense?" or similar, getting direct feedback regardless of whether people sit with their computers or not. Now it's hard to answer you directly while you are speaking, for example.
  - We have the learner's Zoom for this (if you are from the Nordics)
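Following up on the CSV-to-Parquet question above: a minimal sketch of the pandas-based converter suggested there. The script name `csv2parquet.py` is just illustrative, and `DataFrame.to_parquet()` needs `pyarrow` (or `fastparquet`) installed:

```python
import sys
import pandas as pd

# Usage: python csv2parquet.py input.csv output.parquet
# Reads a CSV (assuming '#' marks comment/metadata lines, as in many tools)
# and writes it out as Parquet. Requires pyarrow or fastparquet.
df = pd.read_csv(sys.argv[1], comment="#")
df.to_parquet(sys.argv[2])
```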
## Scripts

https://aaltoscicomp.github.io/python-for-scicomp/scripts/

- How does this question (i.e., why scripts?) relate to using Spyder instead of a notebook?
  - Scripts need to be written with a text editor; Spyder is just an example.
  - It is possible to write the code in Jupyter, of course, but in the end the code needs to go in a `.py` file.
  - Does that mean that any `.py` file is called a "script"?
    - In the context of this talk, yes, but in general "scripts" are pieces of code you can run from the command line. Of course, if you write a Python library it will be full of `.py` files that won't be run from the command line. In that case one would just call them "Python source code".
- I guess this is more a question about Jupyter, but to me the best part of scripts is that the version control is human-readable. Is there a way to get human-readable version control for Jupyter/on GitHub?
  - "nbdime" at least lets you diff and merge notebooks. I don't find it perfect, but it's better than nothing.
  - There is also [nb-clean](https://github.com/srstevenson/nb-clean), which cleans up notebook outputs/metadata if you do not want to store them in git.
- Does "remove magic before converting to script" mean that one needs to manually re-write all the corresponding code? Do you have an example of that? (I was not using magics too much before..)
  - Magic commands are not Python, so you do need to achieve the same thing in a different way in a Python script.
  - Many magic commands control the notebook interface and can be removed from the script without any effect.
- How can I access the weather_observations.ipynb through VSCode? Somehow using pandas to download https://aaltoscicomp.github.io/python-for-scicomp/_downloads/4b858dab9366f77b3641c99adece5fd2/weather_observations.ipynb ?
- Should the script do something? I ran it in the terminal, but nothing happened :)
  - I think it should generate a plot (check your directory).
  - Where is this in Jupyter?
    - It should be in the same folder as the weather_observations.py file.
    - Thank you :) there it was

### Exercise 1 until xx:16

:::success
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-1

- Get the weather_observations.py downloaded and opened in JupyterLab
- Get it exported as a `.py` file and see if you can run it.
- If you run it, it saves an image file but doesn't make other output.
:::

- Nothing happens when I run `python3 weather_observations.py`. I checked that python3 is installed, it is.
  - Do you get any error message? Are you running from the Jupyter environment?
    - No error, I'm running from a bash terminal.
  - And do you need `.py` added to it?
    - I did put that, thanks, updated my original question.
  - Have a look in the folder you ran it in: there should be a new image file that's generated by the script. "Nothing happens" is expected.
    - Yes!! You're right :) I just ran `cat weather_observations.py` and I see that it's supposed to generate an image; this image was in fact generated. Thanks!
- I don't have this menu bar "Export Notebook as" in my Jupyter notebook. What's wrong?
  - Is it called "Save and export notebook as"?
    - I don't have it.
  - Is it JupyterLab or the older Jupyter Notebook?
  - Maybe you can find it under some other name... or you can use the `nbconvert` option.
- Running from VSCode, got this error: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2`
  - The "C error" comes from pandas' C parser engine; it looks like pandas is being asked to parse the notebook file itself. Does the file name end in ".py"?
    - It ends with .ipynb
  - If it's .ipynb you need to convert it to `.py` first. Try the `nbconvert` command.
    - I think I'm lacking the understanding of how to do this. Should I download the weather_observations.ipynb into the folder where I'm working, then run the nbconvert command in my terminal, followed by running "python weather_observations.py"? I thought this could be done in VSCode just by writing it out (download, convert, run), but am I wrong?
  - You can run the ".py" file in VSCode, but not the ".ipynb" file.
    - Alright, now I downloaded it, used the terminal to convert, opened up the ".py" file in VSCode and it ran. I got a "weather.png" as output :) Thanks!
- I ran it but nothing happened? No error either.
  - See above. Have a look in your folder: there is a new file.
- I can't select New from File (sorry for the potentially stupid question): ![](https://notes.coderefinery.org/uploads/924afe3c-1a4a-46a6-b2ad-080b57d095c3.png)
  - I think you need to be in the file browser interface. This looks like Jupyter Notebook, not JupyterLab.
  - What's the difference between Jupyter Notebook and JupyterLab?
    - They are different interfaces for looking at Jupyter notebooks. JupyterLab has some additional features.
- How can I run JupyterLab? Can I use it with Anaconda?
  - From the command line: `jupyter lab`. Yes, it comes with Anaconda.
    If not installed already, `conda install jupyterlab` will do it.
    - What about pip?
      - `pip install jupyterlab` works there too.
- Jupyter command `jupyter-nbconvert` not found. ???
  - Hm... try without the `-` in it: `jupyter nbconvert`.
    - It was without a dash.
  - I did this and it worked: `$ jupyter nbconvert weather_observations.ipynb --to python`
    - Great.
- Any preferences or experiences working with VSCode vs JupyterLab in this type of setting?
  - VSCode is lighter in my experience. However, the best is to move away from notebooks when the project gets "serious". Here is a list of pitfalls of Jupyter notebooks: https://scicomp.aalto.fi/scicomp/jupyter-pitfalls/ The main pitfall is the fact that a human interacts with the code, which makes it prone to errors and irreproducibility of the results.
- Are the only outputs the .py file and a weather PNG picture?
  - Yes, I got the answer.
- I have an error: `ParserError: Unknown string format: -f` +1
  - Is this in nbconvert? Can you paste the command here?
    - I added these as in the example (coming from sys): `start_date = pd.to_datetime(sys.argv[1], dayfirst=True)` and `end_date = pd.to_datetime(sys.argv[2], dayfirst=True)`
    - I added the sys import at the beginning and added everything that was in the example.
  - What command line do you run?
    - I was working in Jupyter, as in the example. Did it all the same way :)
    - I tried both in Jupyter and in the bash CLI, got the error in both places.
  - Can you paste the whole error message? Still trying to figure out what's going on...
    - It is two pages long :D I would spam this page.
    - `TypeError: Unrecognized value type: <class 'str'>`
    - `ParserError: Unknown string format: -f`
  - The error is from the datetime parser, not the Python parser! What is `sys.argv[1]`, exactly? It should be the second word in the command line.
    - `sys.argv[1]` outputs `-f`, is that correct?
  - In Jupyter, the code or the error?
  - I'm not sure how argv would work in Jupyter. It's a Python list; maybe run `print(sys.argv)` and paste the output here.
  - On the command line, `sys.argv` is a list of the words in your command. So what command do you run on the command line?
    - `sys.argv : The term 'sys.argv' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.`
  - Where did you run this? Did you `import sys`?
    - In the command line; yes, I did.
  - In Jupyter, can you run `print(sys.argv)`?
    - `['C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py', '-f', 'C:\\Users\\m_siv\\AppData\\Roaming\\jupyter\\runtime\\kernel-7c778a24-e699-48b4-b25a-cc54e4e3f93f.json']`
  - OK, the problem is that `sys.argv` does not work in Jupyter. It only really works on the command line.
    - OK, good to know :)
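To make `sys.argv` concrete, here is a tiny demo script (the name `show_args.py` is just illustrative) that prints its own arguments. Running `python show_args.py 01/03/2021 31/05/2021` shows that index 0 is the script path and the real arguments start at index 1:

```python
import sys

# Print every command-line argument with its index.
# sys.argv[0] is the script path itself; user arguments start at index 1.
for i, arg in enumerate(sys.argv):
    print(i, arg)
```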
- Hmm, it doesn't work for me even from the command line.
  - Can you paste the command you run here?
    - `print(sys.argv)`
  - I mean the command you run on the command line. (`print(sys.argv)` is Python code; it should not work on the command line.)
    - Right, sorry, `python3 weather_observations.py`
  - You need to add a date to the end. In the above, `sys.argv` is `["weather_observations.py"]` and `sys.argv[1]` is not defined.
    - Oh right, okay, I ran `python3 weather_observations.py 01/03/2021 31/05/2021 spring_in_tapiola.png` but it still gives me a parse error: `dateutil.parser._parser.ParserError: Unknown string format: sys.argv[1]`
  - Try formatting it like 2022-02-17 (although most standard formats should work)
    - Tried it, doesn't work :(
  - Does it print a line number? What is that line in the code?
  - Maybe you have quotes around `sys.argv[1]`? Then Python thinks it's a literal string.
    - OMG yes. I was putting quotes around `sys.argv[1]`. Thanks so much!
- For me it works when I run it from my Anaconda Prompt (miniconda). However, when I try to run it from the terminal in Jupyter Notebook, it gives the error that the module 'pandas' is not found. How does that work?
  - The two have different environments, which have different Python packages installed. This is surprising to me, though. Maybe running `conda activate base` in the Jupyter terminal will help.
    - I now run the same environment in the JupyterLab terminal, but then I get the same problem. Running 'python3' instead of 'python' in front results in "Python not found".
- Getting this error: `IndexError: list index out of range` when doing the argv[1], 2 and 3.
  - Did you call the script with four (script name + three) arguments? If the script receives fewer arguments than it tries to access, you'll run into this error.
    - Ah, I tried running the file by itself, but I did not call it from the terminal. Now it works.
- I have this error in the Jupyter notebook: `ParserError: Unknown string format: -f` +1
  - See if it is similar to the discussion a few questions above ^^
- Can we go a little bit slower? It is hard to follow along while doing the steps myself.
- I think Richard's comment was kind of relevant: what if you don't remember the order of the parameters? Should you always open the file itself and check it, or is there an easier way?
  - Use a parsing library and it will auto-generate `--help`.
    - Is this true only if you comment your code correctly, or is it fully automated?
  - If you use a parsing library (like argparse) it essentially does that job for you, which is why using these libraries is so useful.
  - Yeah, a parsing library generates `--help` automatically. If you don't give it help text, it just gives you the argument names, but that's enough to figure it out.
  - Or often there are conventions: input before output, config file before input, etc.
    - To me, command-line inputs have precedence over any other input (like config files), but if we are talking about the order of inputs, like `python call.py configFile other inputs`, I would agree that the config file should probably go first.
- In the next exercise we're supposed to make the input file and output file arguments. Does this mean the input file *URL* as an argument?
  - Basically, it should be a string variable that the read_csv function tries to open. In the case of pandas, pandas can open URLs as well. If the string points to a local file, read_csv would read that.
  - The input file is still a string which will be used, but you could also use a local file with the same format. Since it uses the pandas `read_csv` function, both URLs and local files work.
  - In short: yes, the argument is a URL (but read_csv works with both URLs and files!)
- I have the same problem as before. I use Jupyter in my browser. I installed JupyterLab too, but I still have the same problem that I don't have the "Save and Export..." option. What should I do?
  - Do you run jupyter-lab on your own machine from the coderefinery environment?
    - Exactly. That's it. Thank you.
- Comment: an attempt to start a new terminal in Anaconda JupyterLab ends with an `Unhandled error` message. So I switched to a Cygwin Python environment and proceeded there.
  - Good! Well, not good about the error message, but good that you got something working.
- What does 'metavar' stand for?
  - A variable that stands for a variable: in this case, the help text says `--name NAME`, and `NAME` is the metavariable that takes the place of what the user will type.

## Exercise (until xx:48)

:::success
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-2

- Try to get a basic argparse setup working with your weather_observations.py file by modifying it
:::

- How would you save the commands you have used to run scripts?
  - Assuming that your terminal is bash, you can write bash scripts. There are other options for these "pipelines". Here are some teaching materials from our CodeRefinery workshop: https://coderefinery.github.io/reproducible-research/workflow-management/#solution-3-script
  - That is admittedly an issue sometimes, but yes, writing scripts, or having a log file for your commands, is important.
  - Storing the arguments or a configuration file next to your end results (metadata with the data) can help you document the process. In some cases, you can also store this information with the data itself as comments, or as metadata, for example as JSON.
- `% python weather_observations.py --help`
  ```
  Traceback (most recent call last):
    File "weather_observations.py", line 15, in <module>
      weather = pd.read_csv(args.input, comment='#')
  NameError: name 'args' is not defined
  ```
  What went wrong?
  - Is `args = parser.parse_args()` missing?
    - I don't understand what this answer is trying to tell me -- now I got it.
- I have a problem with --help. It isn't generated automatically. The error is: `dateutil.parser._parser.ParserError: Unknown string format: --help present at position 0`
  - Is it still using `sys.argv` for the `to_datetime` part?
  - I think you have `sys.argv[1]` in the `to_datetime` function call.
  - Did you give additional arguments? I.e. how did you call it / could you paste your code here?
    - Fixed. Thanks! There was some confusion with both sys.argv and argparse being present in the code. Got rid of sys and everything worked. ...that's what I get if I don't keep the code clean.
- Why is there a `comment="#"` in the `read_csv()` in the solution?
  - Some CSV files have header lines indicated by `#`, and pandas allows you to specify a comment character that tells the parser to ignore everything after this character on a line.
- If I try to load any of the URLs from the command line, it just freezes and does not return my prompt. If I download the files locally, everything works fine. Does anyone else experience this problem? I can't use any of the URLs in my code or arguments.
- What is the problem here:
  ```
  $ python3 weather_observations.py
  qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
  This application failed to start because no Qt platform plugin could be initialized.
  Reinstalling the application may fix this problem.

  Available platform plugins are: eglfs, minimal, minimalegl, offscreen, vnc, webgl, xcb.

  Aborted
  ```
  - It expects to find a graphics library, which does not work here. Reinstalling Matplotlib might help.
  - `matplotlib.use("Agg")`?
- In a functional programming approach, should argument parsing be put into its own function called from main(), or should it form part of main(), or is there no real best practice around this?
  - I don't know if there is a best practice, but given that transferring multiple arguments without some kind of struct/object can easily become cumbersome, I would probably put it into main.
    - Thanks for the input!
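For orientation, a minimal sketch of the argparse pattern discussed above, applied to a weather script. The argument names and defaults here are illustrative, not the official exercise solution; `--help` is generated automatically from these declarations:

```python
import argparse
import pandas as pd

def parse_arguments():
    """Declare the command-line interface; argparse builds --help from it."""
    parser = argparse.ArgumentParser(description="Plot weather observations.")
    parser.add_argument("input", help="CSV file name or URL to read")
    parser.add_argument("-o", "--output", default="weather.png",
                        metavar="FILE", help="output image file name")
    parser.add_argument("-s", "--start", default=None,
                        help="start date, e.g. 2021-03-01")
    return parser.parse_args()

def main():
    args = parse_arguments()
    weather = pd.read_csv(args.input, comment="#")
    if args.start is not None:
        start = pd.to_datetime(args.start)
        print("Filtering from:", start)  # date filtering would go here
    print(f"Read {len(weather)} rows; plot would be saved to {args.output}")

if __name__ == "__main__":
    main()
```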
- To the person I helped in the Zoom, here is the "crash course" on the Command Line Interface (CLI): https://scicomp.aalto.fi/scicomp/shell/ Although it is about the Linux shell, and the Windows PowerShell is not the same, some commands are shared, as we saw. I personally use the Git Bash shell on Windows and run my python+conda from there. Here is a tutorial on how to install and use Git Bash: https://coderefinery.github.io/installation/shell-and-git/
  - Thank you!
- Small advertisement, as I saw that we have already linked CodeRefinery materials a few times: the next CodeRefinery workshop is on March 21-23 and 28-30, 2023. Registrations are not open yet; git + shell pipelines + Python reproducibility will be covered. https://coderefinery.org/
- Can an argument be both positional and use the `--` notation? Or do we have to change the code, because the output file is used both as a positional and an optional argument in the example inputs?
  - An argument can be either positional XOR optional: you need to decide on one of the two.
    - Ok, so there is no Python script that could execute all of the example inputs without errors.
  - You could write a parser that can do this, but I would not recommend it :smile:
    - I just assumed that we were expected to write a script that can execute all of the example inputs in one go.
  - Which input is not parsable by the example solution?
    - Ahh, found it. The last one is wrong... need to correct that.
    - I haven't tried the example solution. The output file is used both as optional and positional.
  - You can add an optional positional argument. So in principle you can accept either a positional or a `--` argument.

## Break until xx:02

## Library ecosystem

https://aaltoscicomp.github.io/python-for-scicomp/libraries/

Questions:

What other Python packages do you use? Write name + short description:

- simpy - simulation package
- typing - package for adding typing
- scikit-learn - package for ML +1
- tensorflow - package for ML
- os (for general functionality), subprocess (for running R subprocesses), glob (for listing files/folders), datetime, dateutil (working with dates), smtplib and mimetypes (for sending emails), pygbif (field-specific package)
- nilearn (neuroimaging)
- rpy2 (calling R from Python)
- networkx - for modelling networks
- csv - loads and writes CSV files
- uproot - reads ROOT files for particle physics
- re - regular expressions

Has anyone else ever used your work?

- I spent a summer writing a piece of proof-of-concept software for a research group. The next summer, the next summer worker used my code assuming it was optimized and ready for real-world use...
- I am terrified to ever find some of my code in places where I don't expect it. I have left code behind at the end of projects and have no idea what people did with it. I hope they realised that it is not optimised or bug-free in any way.
  - Always start your code with a comment to warn future users (and your future self) :D
- Related to what Simo was saying: beware of "typosquatting". You might make a typo when you do `pip install numpi`, for example, and that package could be malware. There is a list of known typosquatting pip packages somewhere, I can find it.
  - Wow, that's cruel!
  - One example is the "requests" package that will be discussed now with APIs: last August there were three typosquats (which I will not write out) that installed ransomware (it encrypts your whole computer and asks for bitcoins to unlock it).

## Web APIs

https://aaltoscicomp.github.io/python-for-scicomp/web-apis/

- Is Google Firebase an API?
  - It looks like a collection of services, all of which probably have an API. Not all may be "HTTP server APIs", but many probably are.
- I ran the cat facts demo on my computer just now and it worked all right.
- Is getting data through web APIs safe? Or are there security issues that we should keep in mind?
  - Most security issues would be on the server side. But you should never trust the data you get in (treat it as data, not as code to be executed).
  - You should also use HTTPS when possible, if the provider allows for it.
- How about if we want to enter more than one parameter?
  - `?param1=val1&param2=val2` ... but we will see a more natural way to do it with `get(..., params={'param1': 'val1', ...})`
- Can we make the JSON requests look like dataframes easily?
  - You'll need to convert that JSON to the data frame. Since JSON can be "anything", you need the right interpretation... `read_json` linked below might be able to do it easily.
  - https://pandas.pydata.org/docs/reference/api/pandas.read_json.html

### Exercises until xx:55

:::success
https://aaltoscicomp.github.io/python-for-scicomp/web-apis/#exercises-1
https://aaltoscicomp.github.io/python-for-scicomp/web-apis/#exercises-2

The third exercise is optional.
:::

- I know most of you are aware of this, but please remember that web scraping comes with ethical and legal issues. If something is publicly available on the internet, that does not mean that it can be used for research or other purposes. So always check with the owners of the website information if you are planning to do some serious scraping.
- What would be a typical use case of web APIs in research? I've never come across them.
  - https://developer.twitter.com/en/products/twitter-api/academic-research
  - https://www.nature.com/articles/s41597-021-00974-z
  - https://materialsproject.org/api
  - If you need data for research, you might have to collect it yourself.
  - Download data from Zenodo: https://developers.zenodo.org/
- How can I see which options there are for 'activity' when choosing which activity to put in the parameter field?
  - Do you mean in the [activity API](https://www.boredapi.com/documentation)? There the activity is not an input parameter; it is part of the response.
  - See the end of the page for the possible parameters.
    - Thank you!
- Advertisement :) today at the end of this course I am actually hosting a Zoom discussion on Responsible Conduct of Research (ethics + legal issues): https://www.aalto.fi/en/events/responsible-conduct-of-research-questionable-research-practices-and-possible-cures-qa
- If the website does not have information about an API, how do I know whether a Python API package can get data from it?
  - In the simplest scenario you could fetch the webpage HTML code and extract information from there (see Exercise 3: https://aaltoscicomp.github.io/python-for-scicomp/web-apis/#exercises-3)
    - Thank you!
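A small sketch of the `params` idea from the thread above, using the Bored API from the exercise. The `participants` parameter and the `activity` response key are taken from that API's documentation; treat the details as illustrative and adjust to whatever the API you actually use accepts:

```python
import requests

# requests builds the query string (?participants=2) from the params dict.
response = requests.get("https://www.boredapi.com/api/activity",
                        params={"participants": 2})
data = response.json()  # decode the JSON body into a Python dict
print(data["activity"])
```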
- Clarification on JSON security & arbitrary code execution: JSON decoders in Python are secure from arbitrary code execution as long as the data is not something you interpret as a Python object (e.g. pickle), and the decoder won't do that automatically. Usually, you'll want to validate that the JSON is in a correct schema. For different schema validation libraries, see [this page](https://json-schema.org/implementations.html#validator-python) [name=Simo]
- Do you have an introductory course for programming in Python? I use Matlab but want to know more about Python and also Julia.
  - Our course page has a chapter on [Python basics](https://coderefinery.github.io/data-visualization-python/python-basics/). At the end of the page there is also a link to other great tutorials.
- Where can I find the schedule for the course?
  - There is a rough schedule here: https://scicomp.aalto.fi/training/scip/python-for-scicomp-2022/

## Break until xx:13

## Parallel

https://aaltoscicomp.github.io/python-for-scicomp/parallel/

Questions:

- Perhaps this is tangential, but do you have any tips for tools/methods to profile/time the code?
  - The simplest I know is the command-line utility `time`.
  - An interesting new profiling tool we've been testing is [scalene](https://github.com/plasma-umass/scalene). It does line-by-line CPU, RAM and GPU profiling and tells you how much time your code spends in Python vs. native code (C/C++/Fortran).
- Embarrassingly parallel approaches often lead to reading and writing a large number of small files (also err/out files when running on a cluster), and this is not good for the file system. Any suggestions for improving this?
  - You can either use databases such as SQLite to collect the different results (if the results are very small), or you can group the results into a bigger dataframe/file. Another option is to store the results temporarily on a memory disk that appears as a file system but is in reality stored in RAM. After you have done multiple simulations, you can collect the data together into a bigger dataset (map-reduce, using the RAM disk as intermediate storage).
  - Can you write simultaneously from different jobs to the same database? So the system waits until the database is writable again?
    - Yes. Other workers will wait if you're using a good library. For example, [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#sqlite) supports this.
    - Usually the database is locked when some worker is writing. Databases are usually not good if you have a constant stream of data, but if you have a result every few hours or so, there won't be a problem.
- Are there any downsides to using multiprocess over multiprocessing? (or whichever was the fork)
  - It adds a new requirement (multiprocessing is part of Python), but other than that, not really. There are other multiprocessing frameworks, such as [joblib](https://github.com/joblib/joblib), as well. You should use the type of library that best suits your problem/coding style.
- Should you somehow specify the number of cores (e.g. max-1), or will multiprocessing itself take care not to grab all the resources and crash?
  - If you specify too many cores, you will just get congestion when the different processes compete for the available CPUs. You can limit the number by creating a smaller Pool (e.g. `Pool(2)`). By default, multiprocessing uses all CPUs. See the [multiprocessing.Pool](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) documentation for more info.
  - On your own computer: auto-detection might be OK (but you might want to do the max-1 thing).
  - On a cluster, multiprocessing probably won't know how many CPUs you have requested, so you *need* to make sure it is correct.
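A tiny sketch of the "all cores minus one" idea mentioned above (a judgment call rather than an official recommendation); `os.cpu_count()` is the standard way to ask how many CPUs the machine reports:

```python
import os
from multiprocessing import Pool

# Leave one CPU free for the rest of the system; use at least one process.
n_workers = max(1, (os.cpu_count() or 1) - 1)

def square(x):
    return x * x

if __name__ == "__main__":  # guard required for multiprocessing on Windows/macOS
    with Pool(n_workers) as pool:
        print(pool.map(square, range(10)))
```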
- Is my code automatically going to use all CPUs on my machine? Or maybe some bits of it?
  - Almost certainly not.
  - Typically, the Python interpreter will just read your Python file and execute the statements on each line (it interprets them), in sequential fashion. The interpreter cannot assume that some part can be done with multiple processes. If you use multiprocessing, the interpreter knows that this part can be done with multiple processes; otherwise, it cannot assume it.
- `UsageError: Line magic function `%%timeit` not found.` Please help
  - It needs to be at the top of a Jupyter cell (we need to improve this example, it doesn't work as it is!)
  - I am lost... It does not print out the time it took..?

### High Performance Computing (HPC)

Python computing on Nordic clusters

#### Courses for SNIC clusters (Sweden)

- UPPMAX/HPC2N: [HPC python](https://uppmax.github.io/HPC-python/)
- PDC: [Introduction to MPI](https://pdc-support.github.io/introduction-to-mpi/)

#### SNIC site tutorials/documentation

- UPPMAX: [Python user guide](https://www.uppmax.uu.se/support/user-guides/python-user-guide/)
- HPC2N: [Installing Python packages](https://www.hpc2n.umu.se/resources/software/user_installed/python)
- PDC:
  - [working with python virtual environments](https://www.kth.se/blogs/pdc/2020/11/working-with-python-virtual-environments/)
  - [how to use python](https://www.pdc.kth.se/software/software/python/cpe21.09/3.8.8/index_using.html)
  - [parallel programming in python mpi4py — part 1](https://www.kth.se/blogs/pdc/2019/08/parallel-programming-in-python-mpi4py-part-1/)
  - [parallel programming in python mpi4py — part 2](https://www.kth.se/blogs/pdc/2019/11/parallel-programming-in-python-mpi4py-part-2/)
- Lunarc: [Python at LUNARC](https://lunarc-documentation.readthedocs.io/en/latest/Python/)
- C3SE: [Python at C3SE](https://www.c3se.chalmers.se/documentation/applications/python/)
- NSC: [Python at NSC](https://www.nsc.liu.se/software/python/)

#### Aalto site tutorials/documentation for parallelization with Python (and other options) on HPC

- Start from here: https://scicomp.aalto.fi/triton/
- Most of the materials also work on other clusters if they use `slurm`.

### Exercise until xx:45

:::success
https://aaltoscicomp.github.io/python-for-scicomp/parallel/#exercises-multiprocessing
:::

- Note that `%%timeit` has to be on the first line of the cell to work!
- I can't install multiprocess; `sudo pip install multiprocess` just freezes my command line.
  - If you are in the conda environment, don't use `sudo`. Not sure if that's the problem, but it may help some.
- Can't get the Pool function to work. I always get `ValueError: too many values to unpack (expected 2)`.
  - Same here. Are we supposed to use
    ```python
    from multiprocess import Pool
    pool = Pool(1)
    results = pool.map_async(sample, [10**5] * 10)
    ```
    Thank you.
  - I think I found the reason: it returns tuples for all of the input numbers, so 2x the number of inputs.
- I really can't get the multiprocess(ing) exercise to work. With the multiprocessing package the kernel just goes into an endless loop or something. With multiprocess I get an error that "random" is not defined, even though it's imported in the function definition cell.
  - And you did run that function cell?
    - Yes, I re-ran it after restarting the kernel (when it jammed). Is it maybe that the import is done only on one CPU? Should I put the function definition in the same cell as the Pool stuff?
- How do I run the sample within Pool? I tried `pool.map(sample, [10**6])`, but that does not seem to work.
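  - A hedged sketch of one way the pieces fit together, assuming the lesson's `sample(n)` returns an `(n, n_inside_circle)` tuple. `pool.map` then returns a *list* of such tuples, which is why unpacking it directly into two variables fails:
    ```python
    from multiprocessing import Pool
    import random

    def sample(n):
        """n Monte Carlo trials; return (n, number of points inside circle)."""
        n_inside_circle = 0
        for _ in range(n):
            x, y = random.random(), random.random()
            if x**2 + y**2 < 1.0:
                n_inside_circle += 1
        return n, n_inside_circle

    if __name__ == "__main__":
        with Pool() as pool:
            # a list of 10 (n, n_inside_circle) tuples, one per task
            results = pool.map(sample, [10**5] * 10)
        # Unpack the list of tuples before summing:
        n_total = sum(n for n, _ in results)
        inside_total = sum(inside for _, inside in results)
        print("pi is approximately", 4 * inside_total / n_total)
    ```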
- Quick reminder: code optimization is usually a better first step than implementing multiprocessing. For example, one could use the following function and get a 10x speedup without any multiprocessing:
  ```python
  import numpy as np

  def sample_numpy(n):
      """Make n trials of points in the square. Return (n, number_in_circle)

      This is our basic function. By design, it returns everything it
      needs to compute the final answer: both n (even though it is an
      input argument) and n_inside_circle. To compute our final answer,
      all we have to do is sum up the n:s and the n_inside_circle:s and
      do our computation."""
      x = np.random.rand(n)
      y = np.random.rand(n)
      n_inside_circle = np.sum(x*x + y*y < 1.0)
      return n, n_inside_circle
  ```
- Talking about week-long courses on MPI: I would be interested in a longer workshop/course on Dask! :)
  - We'll keep that in mind!

## Feedback, day 3

Today was:
- too fast:
- too slow:
- just right: ooooooooooooo
- too simple:
- too advanced: oo
- worth attending: oooooooooooooo
- will recommend to others: oooooooooooo

One good thing about today:
- great materials about parallel! x
- so much interesting stuff! xxx
- good amount of time for the exercises +1
- I know how to make my code run faster! Thanks! Will have a try next week +1
- Richard is doing great ;) ooo
- The web API part was really useful for using external data sources
- I loved the difference between Jupyter and scripts; I've taken a lot of Python courses and no one has ever given an overview of that. I realized I need to be using scripts for some of my code that I use as an overview of my experiment (i.e. updated daily, I generate a summary figure each day)
- Everything!

One thing to be improved next time:
- too much interesting stuff!
- more exercises
- today's topics are a bit abstract for me ooo
- the Scripts lesson was a bit too fast-paced, at least for me :/
- the API stuff was really interesting; some of the other more abstract topics were hard to pay attention to, but the materials will definitely be a good resource.
- today was the most packed day of the course so far, with many dense topics whose surfaces were barely scratched. Maybe some of today's topics should be spread a little more across different days

Other comments:
- you're doing an awesome job!! Thank you! +4
- so many useful resources! +4
- Thank you all so much :)
- It is really good to have many different instructors! It makes it easier to listen. +5
- K8s and Docker in the future could be quite interesting
- High-performance data analysis in Python course: May 18-19 (will be published later on enccs.se/events)
- Will parallel be continued tomorrow? The parallel session today was a nice introduction, but I would need quite a bit more depth.
  - No, we finished it now.
- Will you do a similar course on R?
  - In the past we did a data analysis course with R and Python. The course page is here: https://aaltoscicomp.github.io/data-analysis-workflows-course/ We haven't done that for a few years, though.
- Report links to upcoming courses in a separate md document in the GitHub folder for this course (i.e. "resources.md") and/or via email newsletter, please.
- For some reason working with args has always been confusing for me, but it was so well explained here that I don't know why I found it confusing before. The same goes for working with APIs. Thank you!
- Where can we find this "chat" that has been mentioned a couple of times? Does that refer to the Mastodon thing?
  - [Here](https://coderefinery.zulipchat.com/login/). It uses Zulip as the chat platform.
- I have a question about JupyterLab. Could I (and is it recommended to) work with JupyterLab on a remote computer cluster? I found some instructions online, but the port forwarding keeps failing. Thanks!
  - Some clusters have this already set up. It depends on where you are running. On Aalto, for example, you can just log in (https://scicomp.aalto.fi/triton/apps/jupyter/). Check with your local admins to be sure.
  - Otherwise, I can think of several possible issues, but in principle it is possible to run your own Jupyter server on a compute node. Again, check with local admins first.

# Day 4

### Icebreaker

What is your favorite fruit:

- dates
- figs
- banana +1+1
- watermelon
- apples +1
- guava
- passion fruit
- tangerine
- mango, the yellow type from India/Pakistan!
- blue- and strawberries
- raspberries

How do you install other Python packages?

- pip install --user: ooooo
  - Don't do this if you use conda environments; if in an environment, use pip install without --user +1
  - Installing dependencies with pip in an already established conda environment might break dependencies. A better solution is noting dependencies in a configuration file (environment.yml), including the pip install statements if necessary.
- virtual environments: o
- I make environments:
  - from PyCharm, in a virtual environment
- conda install: ooooooo
- mamba install -c conda-forge
- I follow the creator's recommendations: ooo
- conda or pip... why does conda sometimes not work when pip does?
  - Partially it might be because some packages have not been distributed on conda but only on pip.
    - I had the issue even when it was clearly stated that one or the other could/should work.
- People managing HPC prefer Python virtualenv over conda because conda automatically manages your .bashrc or .bash_profile, which typically apply on login nodes and not on computing nodes.
  - Depends on how your conda is set up. You can run conda without modifying your bashrc.
- Honestly, usually a mess of all of the above rolled into a Docker/Singularity container
- sudo pip install
- Is one of the installation methods safer than the others? E.g. less chance of malware?
  - Packages in different sources have different levels of security. Anyone can upload to PyPI (pip), but the Anaconda source is curated. Others are in between.
    - Thank you :)
- Yesterday's HackMD archive only has the notes up to the scripts exercises and then it jumps to feedback. Where's the rest of it?
  - https://hackmd.io/@coderefinery/python2022archive2
    - It doesn't have anything on the web API or parallel processing parts of yesterday's lessons.
  - Hmm... any idea what happened? It goes from Icebreaker to Feedback, so I assumed everything in between would be there.
  - I don't know, but I am glad we have automatic versioning :)
  - Everything is now fine for yesterday's archive -> https://hackmd.io/@coderefinery/python2022archive2

## Dependency management

https://aaltoscicomp.github.io/python-for-scicomp/dependencies/

- Did you add the link to the webpage you are now showing?
  - https://aaltoscicomp.github.io/python-for-scicomp/dependencies/
- Is Anaconda Navigator a package distributor? And is it free for academic but not for commercial use? If yes, how do they detect the purpose of usage?
  - Anaconda Navigator is some sort of frontend to conda: many things you can do through conda you can do graphically there, start applications inside it, etc.
  - I don't think they try to detect the type of use.
- Is there a way to automatically update all Python packages to the most current version?
  - If in a conda environment: `conda update --all`. Pip has no such option (trying `--all` gives "No such option: --all"), so with pip you have to name the packages explicitly.
  - You would probably only want to do this in an environment. In that case, you can also remove the environment and re-create it, which is what most of us would do instead. We will see this later.
- A bit off topic, but how can I change the font color of the Jupyter terminal?
  - Assuming this is a bash shell, one can set preferences for colors, aliases and other things. I usually google it because I never remember the exact lines. The second answer here is close to my setup, for example: https://unix.stackexchange.com/questions/148/colorizing-your-terminal-and-shell-environment
- If all libraries are being updated all the time, what is the equivalent of a "long term support" version for a set of libraries in Python? Is the main Anaconda distribution something like this?
  - I'd say the LTS concept doesn't directly apply to most Python packages.
  - Big things like numpy/pandas: you would expect those to stay compatible within a major version (1.x.y), so staying with a major version is similar to "LTS". Major versions tend to last a while anyway.
  - Smaller, more ad-hoc packages: it depends on the package. Maybe they aren't supported at all, even.
    - Thanks! But is it so that Anaconda versions are the equivalent of Matlab R202x, so that at least those base packages are stable?
  - I guess so, yeah. At least they won't be upgraded without reason, but LTS is also about security updates, not just staying at the same version. To stay at the same version, use requirements.txt/environment.yml with locked versions.
    - Thanks!
- If you have an isolated environment with some base packages installed and later figure out that you need another package: would you just use `conda install package` (assuming you use conda), or would you create a new environment and delete the old one?
  - I personally prefer making a new environment, so that if I need to port it somewhere else, I have the environment file. If you don't care much about the portability and reproducibility of the environment, then it is fine to do conda install or pip install interactively within the active environment.
- Are there commands to add and remove packages from the active conda environment that would update all dependencies accordingly? And once you have everything required, can you produce a .yml file that defines the environment? In my experience things break this way, so I have added the packages to my .yml file and made the environment again from scratch. But this is time-consuming if one just wants to quickly try a package. I would use just a single environment file.
  - Wouldn't it make more sense to edit the environment file and use version control, so that you don't end up with multiple environment files?
  - I often make an initial environment.yml. If I notice something missing, I `conda install` or whatever to test versions/dependencies out quickly. Then I add my final decision to environment.yml and, if I have any doubt that it will work, re-create the environment from scratch to verify. I have to make sure that I don't forget to add it!
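For reference, a minimal sketch of what such an environment.yml might look like; the project name, package list and pins here are purely illustrative:

```yaml
# environment.yml - only the direct dependencies, with versions pinned
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy=1.23
  - pandas=1.5
  - matplotlib
  - pip
  - pip:
      - some-pip-only-package  # hypothetical package only available on PyPI
```

You would create and activate it with `conda env create -f environment.yml` and `conda activate my-project`; `conda env export > environment.lock.yml` then records the exact versions of *everything*, which plays the role of the frozen "pip freeze" file mentioned later for full reproducibility.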
- The R (Bioconductor) and Python packages for bioinformatics analysis usually have problems, and I need to specify a certain older version of a package for it to work. It is sometimes difficult to find an older version of a package that works. Even some specific C/C++ compilers are required.
  - That can happen often. FYI: conda-forge provides packages for [gcc](https://anaconda.org/conda-forge/gcc) as well, if you need that for some ``install.packages`` in R.
  - I had problems with `conda-forge`. There are issues with `repodata` for conda-forge packages, see https://github.com/conda/conda/issues/8051#issuecomment-451164275
- When I record packages and environments, it does not only show "numpy, matplotlib", etc. but around 100 other packages (probably ones that numpy and pandas use themselves). Do I need to record all these 100 other packages with their versions to ensure full reproducibility? Or is recording numpy, matplotlib, etc. with their versions enough?
  - This is a great question that I am unsure about myself. If I run `conda env create env.yaml` today, will I get the exact same conda environment in one year when I rerun that command? My guess is: no.
  - In a perfect world, I would have two files: a requirements file with *only* what I need directly (I normally install from this), and a separate requirements file (`pip freeze`) with frozen version numbers of *everything*, in case I ever need to go back exactly.
  - Probably by pinning the version of the top-level package you also constrain the versions of its dependencies.
- Would you advise using conda or mamba?
  - Check out Poetry ;) (and pyenv)
    - https://xkcd.com/1987/
    - xD
  - Personally, I use (and recommend) Mamba for everything, but I'm using Linux, so the user experience might be different on other systems [name=Simo]
- Example: if you import all of these packages in your Jupyter notebook, is there any downside other than the manual work? And again, can you print them somehow from inside the notebook?
  - Can you clarify the question some?
    - I am kind of having trouble understanding what the biggest benefit of environment.yml is. If you import the packages inside a Jupyter notebook, isn't it obvious that these packages are used and need to be installed in order to run the notebook?
  - The Jupyter notebook itself is part of an environment (the base environment). So by making a new environment you can have notebooks that only see the different packages available in the new environment. To me this has been very useful for re-running old notebooks found online with old versions of packages (e.g. when a notebook was shared with a paper from a few years ago).
    - Okay, but the notebook does not actually show the versions of the packages; it basically just starts with import commands. What you mean is that these packages might be old, and there must be an environment file to keep track of the versions?
  - I can clarify: there is no "environment package", because the Python installation *is* the environment. If you have one installation of Python, and you did not create any other environment, then you have one environment: the base environment, and all your packages have a certain version. Now one day you realize that you need to go back in time (or forward in time), but you do not want to corrupt your Python installation; then you create a new environment, which is like a new installation of Python that is not going to affect the "base" one you already have. So you can have multiple installations of Python+libraries that are independent from each other.
    - Very clear now! One more question, if I may: how does the notebook pick up the environment.yml and install the required versions?
  - It is better to install what you need outside the notebook, and then start the notebook after activating the environment.
    See the somewhat related question below, "How to activate a conda environment in JupyterLab?"
    - Okay cool, I am very confident now.
      - Excellent! :)
- If I try to install a new package into a new environment, it says "The environment is inconsistent, please check the package plan carefully". What does this mean and how do I solve it?
  - Is this with conda?
    - Yes, conda.
  - So, I guess the packages have some sort of conflict or something weird going on: e.g. A depends on numpy<1.0 and B depends on numpy>=1.0. Or maybe it's something about the order they were installed in.
  - I would try removing and re-creating the environment to see if it can be solved from scratch.
  - And if that doesn't work... time for some investigation.
    - But it was a new environment with only one package installed. Does that mean I have a problem with something else?
  - I guess... I'm really not sure, time for some investigation! Maybe get some friends/support and take a look? There is a reason there is something called "dependency hell"...
- Does a conda environment still assume some packages are installed on the computer, like C/C++ compilers? How much sense does it make to add e.g. compilers to the environment? I have a university-administered computer, so I cannot control every installation. Or are containers (Docker/Singularity) a solution to this?
  - Conda can generally install (almost) everything needed. For example, even compilers are in conda, which makes it a good choice for these kinds of university-managed computers.
- Worth adding here that if one messes up an environment in conda by adding or upgrading packages/dependencies, conda can 'roll back' the environment to its former state: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#restoring-an-environment
- Something weird happened to me: I installed the requirements with conda through the [environment.yml](https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/software/environment.yml) provided by this course, and when I run `python -c "import numpy as np; print(np.__file__)"`, Python claims that there is no numpy. But if I check the installed packages through `conda list -n python-for-scicomp | grep numpy`, it returns numpy version 1.23.1. Why do you think Python does not find numpy?
  - Is the environment activated? And/or are you sure it's running the Python in the environment? `print(sys.executable)` might give a hint.
    - Appreciate it. The environment is activated as shown by the terminal prompt, but `print(sys.executable)` returns `/usr/bin/python3.9`, which is not consistent with `which python`: `/home/<user>/miniconda3/envs/python-for-scicomp/bin/python`
  - Hm... `type python` (this considers shell aliases if you use the bash shell).
    - `type python` returns `python is aliased to python3.9`, which is 3.9.15 according to `python -V`. My base Python version is 3.9.5 in `~/miniconda3/bin/python`.
- With regards to environment exports: it can be good to have the "exact molecular makeup of the cake" if you're making a publication and you want others to be able to exactly replicate your results.
  - +1 (I see your point; in this metaphor: when adjusting the molecular structure, you start at the beginning, since otherwise it might be hard to get there again)
- How to activate a conda environment in JupyterLab?
  - The simplest is to activate the environment, then start JupyterLab. You don't *have* to, though: you can instead install another Jupyter *kernel* that points to that environment.
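    A sketch of that kernel route; the environment name `my-project` is a placeholder, and the install command comes from the `ipykernel` package:
    ```bash
    # inside the activated environment (here called my-project):
    conda install ipykernel
    python -m ipykernel install --user --name my-project --display-name "Python (my-project)"
    # restart JupyterLab; "Python (my-project)" now appears as a kernel choice
    ```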
  - Do I need to have JupyterLab in the environment? E.g. right now JupyterLab is in the Anaconda environment?
    - Yes. That is usually the simplest way of doing it. Conda will reuse already-installed packages, so having jupyterlab in all environments usually does not create any problems.
    - In this case, yes. It's not ideal but simple.
- Conda environments often fill up a lot of storage space. Good practices to prevent this?
  - Use `conda clean -a` once in a while. This will remove unnecessary files.
  - A big problem is the number of files (not always the space itself), which could be limited on a cluster, for instance, but probably also on your computer.
  - CSC (the national supercomputing center) has written [a tool](https://github.com/CSCfi/hpc-container-wrapper) for containerizing conda environments into Singularity containers. We haven't tested it on other systems, but it is interesting.
- A note from our daily work in supporting users: every day there is a request for help about conda and environments, independently of the level of expertise of the person requesting. So my take-home message for you: if you want to learn one thing from today, just try to learn and use conda. :)
  - And don't be shy to ask for help sooner rather than later!
- "Further reading" for those who want to go deeper: one can use containers (something like a virtual machine) to wrap the conda environment so that all the system-level dependencies (OS version, gcc, libraries, etc.) are "frozen" together with the environment. This is not the place to talk about this, but here are some links you can explore on your own:
  - https://scicomp.aalto.fi/triton/usage/singularity/
  - https://github.com/bast/singularity-conda
  - https://coderefinery.github.io/reproducible-research/environments/
  - https://carpentries-incubator.github.io/docker-introduction/
  - Maybe one day we should run a course like this all about containers.
- Also something I forgot to mention: [conda-pack](https://conda.github.io/conda-pack/). It can store conda environments in tar files so that they can be transported to other systems or even packaged and shared. [name=Simo]

## Break until xx:12

## Binder

https://aaltoscicomp.github.io/python-for-scicomp/binder/

### Exercise 1: until xx:24

:::success
https://aaltoscicomp.github.io/python-for-scicomp/binder/#exercise-1

Note: this is the last group exercise in the course.
:::

Why is it possibly not enough to share "just" your code? What problems can you anticipate 2-5 years from now?:

- Suppose one of your dependencies' dependencies has been updated since the paper was published and causes different results when someone runs the code? That's why just sharing the code might not be good enough, and you might want to replicate the original environment exactly.
- "Dependency hell", as stated above XD
- The license of the code matters.
- It might help to make it easier to interact with your code. I can imagine that some people would like to see how it works but might not want to install dependencies, look in the code for what does what, etc. A Binder makes it easier to interact.
- Not everyone interested may be comfortable cloning a git repo and taking it from there.
- Code that was shared on GitHub may not be compatible with newer versions of Python and its packages.
- What if running your research analysis requires huge computational resources and storage space? Is Binder reasonable in this case?
- Things can change: the code would not be updated, and also the version of Python can change.
- In some cases (e.g. when using Matlab), other people might not have a license for the tools, so they cannot replicate the results.
- For compiled languages, such as Fortran and C, it can happen that the toolchains used when developing the code are not available at a later point. For example, the code requires the GNU compilers version 7.x.y, but the computers (laptop, server, or supercomputer) where you work only have versions 9.x.y and later.
- Does one need to constantly host (and pay for) a big machine, in the case of heavy code, to make it runnable for all users, even those without computational power? Someone will have to pay for the servers in the end.

**Questions:**

- Can you only use Jupyter notebooks with Binder? Or could you also share scripts/.py files?
  - I haven't tried. It opens an interface like JupyterLab; you can probably run a terminal.
  - It appears you can run RStudio in it: https://github.com/binder-examples/r. Probably you can do many things with enough work.
- A quick mention about Singularity: the ecosystem has gone through a similar split as Anaconda/conda. The open-source version is [apptainer](https://apptainer.org/), maintained by the Linux Foundation, while [singularity](https://sylabs.io/singularity/) is a commercial product (with a community edition) made by Sylabs.
  - Currently they are pretty much interoperable, but in the future... no one knows.
- Does everything have to be in the same GitHub repo? Can you use Binder if you have your data in a separate repository?
  - You could have one repo start Binder, and it gets the data from the other one when it runs (or from anywhere else...).
- **A quick note**: in container environments it is quite common to see people using pip instead of conda. This is because the container will usually provide the system libraries in a reproducible fashion, and pip has less overhead (less solving).
- If you have tens of packages installed, how do you automatically list them to create a requirements.txt?
  - `pip freeze > requirements.txt` will create a `requirements.txt` based on the installed packages. The same caveats apply as with conda environments: it will list all packages, not only the ones you explicitly asked to be installed.
- Can you use Binder for Bash code / R code / other non-Python things?
  - In theory yes; you need to look at the configuration so the right things get installed - and maybe configure it so it automatically works.
  - [Here](https://mybinder.readthedocs.io/en/latest/examples/sample_repos.html) is a big list of example repositories. R, Julia etc. are all supported.
- So Binder runs everything in your notebooks on its own resources? Are there any limits to the resources? (I have some pretty computationally intensive code, so I'm wondering where the limits are.)
  - Resource limits aren't tiny, but they aren't high. For daily computational work, something else would be better (also partly because everything there gets lost when the time runs out!).
    - Cool that there's a free service like that out there regardless!
  - Google Colaboratory is a similar option, with collaboration as the main motivation: https://colab.research.google.com/
  - You probably won't want to use Binder for something computationally complex. But you can use it to demo your results to collaborators very easily.
- How does Binder pay for its servers?
  - Mainly they've been donated resources by different cloud providers. See [this page](https://mybinder.readthedocs.io/en/latest/about/federation.html) about the BinderHub Federation.
- Security and Binder: from their page https://mybinder.readthedocs.io/en/latest/about/user-guidelines.html: "You shouldn't do anything on mybinder.org that you wouldn't mind sharing with the world!" So in practice, if you are using data/code that is not supposed to be fully open (fully open as in CC0, i.e. anyone can do whatever they want with it), do not use Binder, as they cannot guarantee where your data/code might end up.
    - Just saw in their user guidelines that users get 1-2 GB of RAM.
- Are we going to go through parallel programming?
    - Parallel programming was yesterday, but we will most likely touch the subject in the panel discussion. If you have any questions related to that, do not hesitate to ask.
    - Okay, I was in yesterday's lecture; it might have been passed over quickly then. I will watch yesterday's lecture on YouTube.

## Packaging

https://aaltoscicomp.github.io/python-for-scicomp/packaging/

- Break? :(
    - Break-break?
    - Break (after a few minutes, until xx:08)

Questions:

- Can you also install this package with conda instead of pip?
    - You can install pip packages in conda environments as well.
    - But the Anaconda documentation says it's better not to use pip in a conda environment.
        - This is because pip-installed packages might break conda-installed packages. If you're creating the package yourself, you're most likely declaring dependencies that the conda-installed versions already satisfy: pip will find those packages, e.g. ``numpy``, and consider the requirement satisfied. If you want to be certain that pip does not try to install anything, you can use `pip install --no-deps`, which makes pip skip installing dependencies altogether.
        - Also, if you're developing a package, it is better to do the development in a dedicated environment. Otherwise you might miss some dependency that just happens to be satisfied by an already-installed package. This will also protect you from the possibility of a pip installation breaking something.
    - If I get it correctly, installing local packages through pip in a conda environment is exactly what happened during the demo (and might have triggered conflicts).
        - Might be. Demo effect in action. Checking file locations via `package.__file__` and using `conda list` in conda environments will usually reveal problems (see the sketch at the end of this section).
        - The above is the same strategy that was suggested for the problem of not being able to call numpy even though it was installed in the `python-for-scicomp` environment (one of the questions in the pip section, see above if interested).
- **Comment**: you may need to run `pip install --upgrade setuptools` before trying to pip install your package.
    - Also: "``./``" may be needed for pointing to a path in the current directory.
- **Fun fact**: PyPI seems to be pronounced "pie-pee-eye" because the "PI" stands for "Package Index".
    - Interesting. The more you know :rainbow:
    - There is also PyPy (an alternative Python implementation), so it's good that they are pronounced differently :smile:
- `packages = "calculator"` in pyproject.toml?
    - Or `calculator-<yourname>`.
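As a follow-up to the `package.__file__` tip in the mixed conda/pip discussion above, here is a minimal sketch (using numpy as an example) for checking which environment a package is actually loaded from; a path outside your active environment's prefix is a warning sign.

```python
# Minimal sketch: check which environment a package is loaded from.
# If numpy's path does not live under the active environment's prefix,
# you are probably picking it up from somewhere unexpected.
import sys
import numpy

print("Active environment prefix:", sys.prefix)
print("numpy loaded from:        ", numpy.__file__)
```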
## Panel discussion

- Do you recommend any good Python books / other materials ;) to keep learning?
    - Robust Python
    - I used https://www.w3schools.com/python/ to get into Python, but it is just an introduction.
    - Python Testing with pytest
    - I sometimes show my code to my colleagues and they tell me it is the worst code they have ever seen, but then they give me ideas on how to do the specific task better. So if possible, talk to your colleagues.
- How to master functions? Creating them, knowing the existing ones. It is okay to understand the existing functions from APIs, but generating my own sophisticated functions is giving me a hard time.
    - http://cicero.xyz/v3/remark/0.14.0/github.com/coderefinery/modular-code-development/master/talk.md/#1
    - https://www.youtube.com/watch?v=x0FoTBZcn2U&ab_channel=CodeRefinery
- I have a hard time seeing the benefits of using a Jupyter notebook over e.g. VS Code's interactive mode. There seem to be a lot of struggles with Jupyter. Any comments?
    - I think everyone finds their own best way of working interactively, but the final goal should be to not be interactive at all, to make sure the processing can be run without interaction to (re)produce the same results.
    - Seems to me like it's another argument for using VS Code (or similar) over Jupyter. Essentially you can choose whether to run a file in interactive mode or as a "simple" file from top to bottom. But to each their own preference, I guess :)
- Is there a language that is "more friendly" when starting with machine learning?
    - Machine learning is not really "friendly" because of the underlying complexity, but there are libraries (PyTorch/TensorFlow) that hide a lot of it and offer a lot of functionality.
    - I guess what I meant was: is it better to use Python or C++?
        - I would claim Python is definitely "friendlier" than C++ when it comes to coding. And a huge question here is: do you want to USE existing ML models/modelling approaches, or develop new things? For the former, C++ is probably not any better than Python; for the latter, it could potentially be faster, but for science (as in proof of concept), the speed doesn't make a lot of difference.
- Can you recommend any resources (books, websites, ...) for general best practices in coding? +1
    - CodeRefinery :smile:
        - Already followed CodeRefinery :wink:
- Yesterday it was said that Python was perceived as "relatively new" around 10 years ago and is now very ubiquitous. In your opinion, is Julia coming up to take Python's throne? Is it worth the time to start learning Julia now already? +1
    - It will take very long for Python to be dethroned, but learning Julia is fun and could well be worth your time.
    - If you search the internet for "should I learn Julia", you can find different takes on the question from different perspectives, different disciplines, etc.
    - If you have time, learning another language will probably help you with Python anyway.
- If I thought that this course was a bit too basic for me personally, where would you recommend I go next to learn more? Any suggestions for an "intermediate"-level course?
    - If HPC and high-performance data analysis are your thing, there will be an ENCCS workshop teaching [this material](https://enccs.github.io/HPDA-Python/) in May next year.
    - I would personally make it "goal-oriented": if your goal is to be an expert on a certain topic, go through courses or tutorials for that topic, e.g. TensorFlow, fastai, etc. That way you learn by doing rather than learning for its own sake.
    - Then at some point you need to learn what is not Python-specific but is more about software engineering: version control + software testing + code modularity + continuous integration. The CodeRefinery workshop teaches that. These are all skills that are very valuable outside academia!
- How do you deploy an ML model, for online use or something else? Do you have some examples showing it live, with code?
    - One option is [onnx](https://onnx.ai/). It is a standard for sharing machine learning models (a minimal export sketch follows below). Usually frameworks have their own ways of sharing or storing models; however, most of them are not interoperable with other frameworks.
    - [Hugging Face](https://huggingface.co/) is a common place to upload your models and has good instructions.
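To make the ONNX suggestion above a bit more concrete, here is a minimal, hedged sketch assuming PyTorch is installed; the tiny linear model is just a stand-in for a real trained model.

```python
# Minimal sketch: export a (stand-in) trained PyTorch model to ONNX,
# so it can be served by ONNX-compatible runtimes.
import torch

model = torch.nn.Linear(4, 2)    # placeholder for your trained model
model.eval()                     # switch to inference mode before export
dummy_input = torch.randn(1, 4)  # example input with the expected shape

torch.onnx.export(model, dummy_input, "model.onnx")
```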
- Would you predict that Matlab will die out at some point?
    - Not as long as all the generation-X Matlab people are around at universities and companies (disclaimer: I am a generation-X Matlab user :)).
    - Some systems/companies are still dependent on Matlab, so it won't die out quickly. For pure data science and machine learning, my feeling is that Matlab is already lagging behind Python/R.
    - Me too. I used Matlab in all my courses and in my postdoc research. Are new Matlab generations being created at universities now, or is the population going to go extinct?
        - At least at Aalto, many teachers have switched to Python/R/Julia. However, Matlab has amazing self-learning materials at https://matlabacademy.mathworks.com/, which can be attractive to the younger generation. But since most of us will not be in academia forever, it is better to check what is currently asked for in the job market; last time we checked, the Matlab/Python job posting ratio was about 6:100. I can do some LinkedIn queries :)
            - LinkedIn jobs in Finland with Matlab as keyword: 89
            - LinkedIn jobs in Finland with Python as keyword: 1353
            - Same queries for the San Francisco / Bay Area: Python jobs = 33030, Matlab jobs = 2154, i.e. the same Matlab/Python ratio of about 0.065.
    - Matlab has its benefits but also the huge disadvantage of being proprietary (which, at least for science, is a real drawback); still, I don't expect it to go extinct any time soon.
- Best practice for time management: how do you draw the line between coding and personal life (if any)? Especially for those with a few years of experience, it is challenging to find time to improve your coding skills and make progress on your projects at the same time, assuming you have your priorities straight on which tools you should develop. +1
    - A lot is learned on the go, i.e. while coding with the tools you should develop.
    - The current discussion between Thomas and Simo about which tools to develop, transferability, and knowing when the benefits justify moving to more complexity (internalizing) is relevant in this regard.
    - If you are a researcher, you hopefully have a lot of freedom to try new tools during existing projects, even if it makes you a bit less efficient.
- If one were interested in doing what you do at CodeRefinery (teaching, say), what would you suggest?
    - For CodeRefinery:
        - Check out our [Zulip chat](https://coderefinery.zulipchat.com/)
        - Sign up to a [future course](https://coderefinery.org/workshops/upcoming/) as a helper
        - Check out our [materials on GitHub](https://github.com/coderefinery) and help us develop the courses
    - For Aalto Scientific Computing (we support researchers with software development and computing):
        - Most of what we do is documented at https://scicomp.aalto.fi/.
        - Some of these pages might help you make the case to your university that they need similar support.
    - Also, you can join [Nordic RSE](https://nordic-rse.org/) to solve the [Advent of Code](https://adventofcode.com/) puzzles on [Zulip](https://coderefinery.zulipchat.com/#narrow/stream/305975-Advent-of-Code).
- Course link for the "Future events" section: [High-performance data analysis with Python](https://enccs.github.io/HPDA-Python/) (I think)
## Great resources to learn more (from https://coderefinery.github.io/data-visualization-python/python-basics/)

- [Real Python Tutorials](https://realpython.com/) (great for beginners)
- [The Python Tutorial](https://docs.python.org/3/tutorial/index.html) (great for beginners)
- [The Hitchhiker's Guide to Python!](https://docs.python-guide.org/) (intermediate level)

## Feedback, day 4

:::success
Afterparty zoom, right now: https://aalto.zoom.us/j/69608324491
:::

Today was:
too fast:
too slow:
just right: ooooooo
too advanced:
too basic:

worth attending: oooooo
not worth attending:

I would recommend to others: oooooo

One good thing about today (for each lesson):
- Demos had the right pacing for viewing or typing along +1
- Binder is extremely useful
- It's so important to teach about dependency management, packages/libraries, environments, and container images, and yet this is the first course that actually covered this in a systematic way! I've had to pick this up ad hoc over the past few years, and it's been rough. So that was super useful.
- Topics that were alien to me are finally clearly understood
- The best thing in general was that the senior programmers did their best to deliver the complicated subjects very professionally to programmers of all levels.

One thing to be improved for next time:
- A bit more hands-on practice.
- More time to get through all the intended material.
- I don't know if it was just me, but I found today really hard to follow because of the lack of hands-on practice.
- This is a very minor comment, but I think it would be better if there were a clearer "leader" for each session, and if the co-presenter were a bit quieter. Sometimes the co-presenter said things that interrupted the main presenter; even small things like "yes, aha" and so on can be distracting when the presentation gets abruptly interrupted.

Other comments:
- Best course I've taken all year, thanks so much everyone! +2
- Thank you for providing this course! You showed some really nice tips and know-how (e.g., advanced numpy). Much appreciated that you organise this! +3
- HUGE THANKS for this course, team! +3
- Not enough Monty Python references ;)
- Thank you for the course! It was really useful :)
- Huge huge thanks!
- Engaging with a community (see the CodeRefinery Zulip) helps you grow and consider cases beyond those you encounter in your own work.
- In addition to the interesting and structured content: excellent planning of the educational process, and excellent psychological handling of anxiety. You are the best))
- The paired lecturing worked really nicely.

---

:::info
**This is the end of the document, please always write at the bottom, ABOVE THIS LINE ^^**

HackMD/HedgeDoc can feel slow if more than 100 participants are editing at the same time: if you do not need to write, please switch to "view mode" by clicking the eye icon at the top left.
:::