# Python for SciComp 2023 (archive part 2)
:::danger
This is the archived version of https://notes.coderefinery.org/python2023
For day 1 & 2 visit https://hackmd.io/@coderefinery/python2023archive
:::
# Day 3
## Icebreaking questions for day 3
How much have you used a command line interface
- what's that?: o
- a few copy-paste things: oooo
- for some work: oooooooooo
- quite a bit: ooo
- sometimes for getting few things work in my research work
How do you run the same thing over and over, but with slightly different parameters? Like, same code with different data:
- functions?.
- loops.
- with a for loop over your parameters.
- class
- .
- .
- running functions in a loop...?
- Use command-line arguments as input to the script and GNU Parallel/array jobs and run as many jobs as I need
(add stuff you'd like us to discuss):
- 1. Could you please talk about Error handling / decorators? What is the best way for clean code?
- This is a good question, but a very hard one to answer. Typically, we learn to write clean code by reading a lot of blog posts, examples, etc. that demo various coding practices. One can look into various style guides (e.g. [google style guide](https://google.github.io/styleguide/pyguide.html) or [numpy style guide](https://numpydoc.readthedocs.io/en/latest/format.html)). Most commonly, using existing packages and following their style and structure makes code look better, as you do not fight the conventions made by those packages. We'll talk about this in the library ecosystem lesson.
- 2. I have apparently two separate installs of Jupyter/Anaconda. Is this bad? Will they crash into each other?
- During dependency management we'll talk about this in general. Usually multiple installations do not conflict, unless you try to activate or use both of them at the same time.
- Ok, thanks
- 3. I also have a lot of dependency problems in my windows. How can I solve them?
- This is a complicated question, but we'll talk about dependency management later today. Hopefully it helps!
- 4.
## Julia for scientific computing
- Speaker: Luca Ferranti
- [Julialang](https://julialang.org/), runs like C, reads like python. A language developed by researchers for researchers.
- Thriving community:
- [Julia Users Helsinki](https://julia-users-helsinki.github.io/) 13th December: Intro to julia
-
### Ask questions from Luca
- 5. How much work would you say it is to pick up Julia for an experienced Python user?
- I think it's a pretty smooth transition, syntax-wise at least. The main thing will be to get out of the object-oriented mentality and into the multiple dispatch thing, that takes a bit, but when it clicks there's no coming back :) (Luca).
- 6. Q: I've been advised to try Julia multiple times but due to time constraint I always had to use R for the statistical analysis part of my work... Could you give any hints on what to pick from Julia that would overcome R?
- I can share a few nice resources (Luca):
- Learn Julia via epidemic modeling: https://www.youtube.com/watch?v=k0fr7XjH1_Y workshop from JuliaCon 2020
- Julia Documentation: https://docs.julialang.org/en/v1/
- Julia Academy is pretty nice for beginners: https://juliaacademy.com/
- In general, the Julia Youtube channel with all the talks from JuliaCon has a ton of very cool resources: https://www.youtube.com/@TheJuliaLanguage . If someone did something cool with Julia, it's there.
- 7. Q: How would you compare the library situation between Julia and Python?
- For scientific applications, Julia has a very strong ecosystem. For example, I claim [SciML](https://sciml.ai/) has the strongest ecosystem for differential equations solvers and scientific machine learning (data-driven solution of differential equations, etc.).
- For optimization, [JuMP](https://jump.dev/) is very cool. It is a domain-specific language which allows to write optimization problems in a very mathy way.
- This has a nice list of organizations in the ecosystem: https://julialang.org/community/organizations/
- 8. Getting inaccurate decimal value with
```
julia> 0.1*3
0.30000000000000004
```
How to avoid the infinitesimal 4 ?
- Floating point numbers do not have infinite precision. After around 15 to 17 significant decimal digits, double precision runs out and two nearby numbers can no longer be distinguished. For example, see [the wikipedia article on floating point arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic). The same happens in Python if you run `0.1*3`.
- Indeed, this is an IEEE floating point standard thing, not a Julia thing. The same happens under the hood in MATLAB and Python; they may print fewer decimals, but that's just tricking you. That being said, if you need extra precision, you can use e.g. `BigFloat` in Julia, which is an arbitrary-precision floating point type.
- Thanks for the wiki ref, it mentions Logarithmic Number Systems as a way to mitigate floating-point arithmetic inaccuracy,
```
julia> using LogarithmicNumbers
julia> float(ULogarithmic(0.1) * 3)
0.3000000000000001
```
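For comparison, a minimal Python sketch of the same effect and three common ways to deal with it, using only the standard library. Which option fits depends on the use case; the variable names are just for illustration.

```python
import math
from decimal import Decimal
from fractions import Fraction

# Binary floating point cannot represent 0.1 exactly, so a tiny error shows up.
product = 0.1 * 3
print(product)            # 0.30000000000000004
print(product == 0.3)     # False

# Option 1: compare with a tolerance instead of exact equality.
print(math.isclose(product, 0.3))  # True

# Option 2: decimal arithmetic (construct from strings, not from floats).
decimal_product = Decimal("0.1") * 3
print(decimal_product)    # 0.3

# Option 3: exact rational arithmetic.
fraction_product = Fraction(1, 10) * 3
print(fraction_product)   # 3/10
```

For most numerical work the tolerance-based comparison is enough; `Decimal` and `Fraction` are much slower than plain floats.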
- 9. When I try to launch the terminal from the new launcher it tells: Launcher Error
Unhandled error
- Maybe the same as question 7 below.
- 10.
## Scripts
https://aaltoscicomp.github.io/python-for-scicomp/scripts/
- 6. does Coding mean making a script?
- Coding is used to refer to both scripting and programming.
- In addition to what Thomas said, people also differentiate between scripting and programming, depending on whether we need to *compile the code* to execute it (ask if you want to know more about compiling).
- Scripting often refers to "writing my code in a way so that it can be executed as a stand-alone program" (e.g. without running a jupyter server)
- 7. I get bash:fork:retry:Resource temporarily unavailable
- Is it enough to try again? - No, I get the same message
- Is this running on your local machine? Or are you using some online service? (e.g. jupyter on cloud) I'm using JupyterLab on cloud
- Is this at Aalto? No
- There might be some firewall blockers. You could try downloading the notebook locally and upload it to your cloud jupyter lab. I presume you are using HY jupyter?
- - This worked, thanks! I'll have to check my firewall settings
- What cloud service? - JupyterLab on Firefox. That's what you run locally, but what is JupyterLab running on?
- Not sure I understand correctly, but I'm running it through anaconda? What is the address in the URL bar? (if it's not private)
- nb.anaconda.cloud/jupyterhub/user/*lots of numbers and letters*.ipynb
- My hypothesis: the JupyterLab service you are running doesn't allow shell or other non-Python access. These things won't work in that case. Yeah, so anaconda cloud probably doesn't provide shell access.
- 8. I like your discussion today, relaxing atmosphere.
:cat:
- 9. Really interesting, I was always doing my draft coding in notebooks and then re-structuring everything in a py script to run it in a server. BUT, before exporting the notebook, shouldn't you make sure your code in the notebook is structured?
- You can also structure afterwards. Depends on your personal choice. The main thing would always do before exporting is removing any "magics" commands, as they can mess up things a lot.
- Most of us instructors would probably write directly into a script and/or copy to a script ourselves. We demonstrated it this way since it provides the simplest transition (and... there are still problems).
- Videos are blocking code +1
- sorry about that! I'll pay more attention.
- 10. When I do the "python weather_observations.py", it gives as result: "No module named 'pandas'". What could be the problem? +1
- Are you running from the JupyterLab terminal? I guess it's not finding the Anaconda environment you are using.
- did you start your jupyterlab instance from the coderefinery environment?
- True, that was the problem. Now after activating the environment it works. Thank you!
- 11. what's the difference between command prompt and powershell?
- only what "internal commands" are available. Powershell has some more built-in tools, but otherwise it's "just another terminal".
- Yes: same general idea, but some operating system commands may be different.
- 12. Is it mandatory to create a new environment?
- Base environment of Anaconda comes with most packages used in this course (+1, should be everything). If the imports clear, you should be good to go.
- I won't spoil the fun, this is the topic of the dependency lesson in 1h :) [spoiler: it is best to always create a dedicated environment for a project]
- 13. Download ... and weather data file. What weather data file?
- It's downloaded in the script
- 14. I think it worked but I don't see any difference, should I ? :P I mean at least it doesn't throw any errors
- Then it works :)
- 15. I have got an error "ValueError: Multi-dimensional indexing (e.g. `obj[:, None]`) is no longer supported. Convert to a numpy array before... What does it mean?ok
- I am unsure. If you want us to help you need to paste more code to see what is run
- 16. I'm using jupyter from vscode and it doesn't seem to work, at least from the vscode terminal, any tips?
- The steps are a bit different with VScode to export a notebook to script. https://code.visualstudio.com/docs/python/jupyter-support-py
- VSCode might also have different interactions with the operating system. Are you running vscode locally or in a cloud?
- locally
:::info
### Exercise Scripts-1 until xx:17
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-1
Try to do it, but if it doesn't work, don't worry: rest and see the rest. You can try more during the second exercise.
:::
- 17. Today, the interface of Jupyter has changed, I see only folders, and cannot open a terminal. I also cannot perform the steps from ex. 1 like export etc. At Helsinki University...
- Maybe try "File > New launcher" ? I am not able to test the HelUni jupyter
- - I have only Tabs: Files, Running and Clusters, nothing else.
- And I suppose you have tried to close everything and restart?
- File -> New Launcher -> Terminal ?
- Helsinki University users can join my Zoom room for live help: https://helsinki.zoom.us/j/8318379042 -Juhana K
-
- 18. Why am I able to use jupyter in the terminal (Powershell) via JupyterLab but not in Windows OS Powershell?
- When you launch the terminal from JupyterLab it most likely "activates" the python environment which means that it loads environment variables that allow the terminal to find Python. When you launch a normal PowerShell these variables are not loaded. This is good as you do not want the normal PowerShell to use a wrong Python by mistake (e.g. some system program would suddenly use your Python installation).
- Thank you. Is there a way to call python and jupyter in Powershell, or CMD?
- If you have installed Python as a part of Anaconda installation you should get something like "anaconda prompt" or such in the start menu. I would not try to get it working on the system terminal as it can then break other software.
- Ok. Is this easier in Linux (can it be run from the CLI there)?
- In Linux you can usually set the `PATH` variable to point to the `bin` directory in your Anaconda installation directory. Or you can run `conda init`, which will make Anaconda always present in your terminal. We usually discourage it though, as again, it can cause other programs to use your Anaconda, which can have unforeseen consequences.
- So you can do something like `activate_conda(){ export PATH=~/conda/bin:$PATH ; }` in your `.bashrc`, which is a configuration file that will be loaded in the terminal, and then you can run `activate_conda` to make the conda available. Here the installation would be in `~/conda`.
- 19. Is it necessary to have the `#!/usr/bin/env python` at the top? Does it not cause trouble when we are in a different Anaconda environment?
- That line is called a shebang and it is a syntax to tell the terminal what should be used to interpret the script. Indeed, it won't check that you are in the right environment. If you type `which python` in the terminal, you see the full path of your Python interpreter.
- I see what you mean, but typically, the path of the python executable changes depending on the name of the environment, so I wondered if it was a good idea to include it as a directive.
- The line basically says "if this program is executed from a command line ask program /usr/bin/env where is python and run rest of the script with that". The program "/usr/bin/env" will find the correct python if you have an environment loaded. But if you call the script with `/path/to/my/python script.py` the shebang is not run as you have specified which python to use from the command line.
- I see, that clears it up a lot. Thanks!
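A small sketch to see this in practice (the script content is only an illustration): saved as an executable file and run with different environments activated, it reports which interpreter the shebang lookup picked.

```python
#!/usr/bin/env python
"""Print which Python interpreter is running this script."""
import shutil
import sys

# The interpreter that is actually executing this file.
running_interpreter = sys.executable

# What a PATH lookup finds for "python"; this is what /usr/bin/env would pick.
# It can be None if there is no "python" command on PATH.
python_on_path = shutil.which("python")

print("Running with:  ", running_interpreter)
print("python on PATH:", python_on_path)
```

If the script is started as `/path/to/my/python script.py`, the first line is just a comment and the two printed paths may differ.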
- 20. I assume sys.argv[] is always a string list: YES
- Yes it is
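Because every element is a string, numbers have to be converted explicitly. A minimal sketch, with a hypothetical helper `parse_count` and a simulated argument list instead of a real command line:

```python
import sys


def parse_count(argv):
    """Return the first command-line argument as an integer (default 1)."""
    # argv[0] is the script name; the real arguments start at index 1.
    if len(argv) < 2:
        return 1
    # Every element of argv is a str, so convert explicitly.
    return int(argv[1])


# Simulated command line: python script.py 5
example_argv = ["script.py", "5"]
print(type(example_argv[1]))           # <class 'str'>
print(parse_count(example_argv) + 1)   # 6

# In a real script you would call: parse_count(sys.argv)
```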
:::info
### Exercise Scripts-2 until xx:45
https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-2
- Work on this as well as you can. If it doesn't work, that's OK.
- Either do what we demonstrated, or try to use argparse.
- If you have more time, look at the next section and exercise
I am:
* done: oooooooo
* not trying:
* stuck: oo
:::
- 21. datetime wasn't found
- is that the full error message?
- no, the error message is longer. It fails to download numpy I think.
- Is it being run in the Anaconda environment?
- yes, I might have issues with no kernel available? Could that be the cause?
- yes it might.
- when I restart JupyterLab I can not select a kernel, or sometimes it lets me select one but that one then immediately stops working?
- where are you running your jupyter from?
- anaconda on the cloud. when I try to reinstall a kernel I get permission denied error
- 22. Questions to staff: If JupyterLab is started through *Anaconda Navigator* directly (not command line and then jupyterlab), is $PATH set to the active conda environment?
- It is. You can open a terminal within jupyterlab and type the command `echo $PATH` to confirm this
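The same check can be done from Python itself, in a notebook cell or a script. This is only a sketch: the helper name is made up, and `CONDA_DEFAULT_ENV` is an environment variable that conda sets on activation, so it is missing outside conda environments.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the Python that is currently running."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate`; None outside conda environments.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # The first PATH entries decide which `python` a terminal finds.
        "path_head": os.environ.get("PATH", "").split(os.pathsep)[:3],
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Comparing the output from a notebook cell with the output from the terminal shows quickly whether both use the same environment.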
- 23. After modifying the script, do you need to run this again: jupyter nbconvert --to script weather_observations.ipynb ?
- No, if you are modifying the script (`.py`) file directly.
- 24. Do I have the right impression that using args is handy when calculations are performed on an external computer (for example on a computing cluster), but if you are doing local calculations (for example data analysis with Spyder), it is perhaps more convenient to use classes and functions?
- Command line arguments and classes/functions are not really mutually exclusive. Essentially, command line arguments are good for non-interactive use, where your code is a stand-alone script or program that performs a certain function. If you have this functionality within a larger context, then yes, providing it as a class method or other function makes sense. One could argue that something that takes command line arguments is a self-contained program.
- Personal preference: jupyter or spyder (or Rstudio in R!) are good for prototyping interactively so you can look at the values in the variables. But when the prototyping is over I convert everything to the "final" script and run it non interactively from terminal (or cluster)
- Arguments are great when some parts of the code stay the same (e.g. your analysis code) and some input changes (e.g. parameters, input files, output files, options for more verbose printing). This is especially useful on clusters, where you usually want to run the same code multiple times without changing the code between runs. Classes and functions are great for organizing the program and, as mentioned before, the two are not mutually exclusive.
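A sketch of how the two can coexist: the analysis lives in a plain function, and a thin `main` wraps it with argparse. The names `scale_values` and `main` are hypothetical, and the command line is simulated with an explicit list so the example also runs in a notebook.

```python
import argparse


def scale_values(values, factor):
    """Analysis code: usable from a notebook, Spyder, or another module."""
    return [value * factor for value in values]


def main(argv=None):
    """Thin command-line wrapper around scale_values."""
    parser = argparse.ArgumentParser(description="Scale numbers by a factor.")
    parser.add_argument("values", type=float, nargs="+", help="Numbers to scale")
    parser.add_argument("--factor", type=float, default=2.0, help="Scale factor")
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    args = parser.parse_args(argv)
    result = scale_values(args.values, args.factor)
    print(result)
    return result


# Interactive use: call the function directly.
print(scale_values([1.0, 2.0], 10))     # [10.0, 20.0]

# Script-style use, simulated here with an explicit argument list.
# In a real script this would be `main()` under `if __name__ == "__main__":`.
main(["1", "2", "--factor", "3"])       # prints [3.0, 6.0]
```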
- 25. I am still dealing with pandas not found problem! it seems that it is installed but it can not be used.
- You might not be able to see the installed environment. Are you running jupyterlab on your local machine?
- yes, it says localhost:8889
- if you start a new notebook and run a cell with `import pandas as pd` does it complain?
- it works!
- good! So most likely the other script did not import pandas...? Just a guess (assuming it is on the same instance of jupyterlab, same environment)
- Are you starting the terminal through Jupyter? We hope that means it's using that same Python environment.
- yes. and trying to put everything in the same path... but I assume path is different from environment.
- You can try to solve the environment issue by running `conda activate base` and then `conda install pandas`. This should resolve the issue.
- I am trying these again but as I said panda seems to be there.
- 26. When adding arguments to the argparse object, how is the numbering of the arguments set? If you add an "input", then on the next line add an "output", will the first argument written in the command line call always be the "input"? And would it then be opposite if you in your script first defined "output", then below "input"?
- Good question! It's the order in which you call `add_argument`. The arguments without `-` are filled in the order you ran `add_argument`.
- Ok got it
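A minimal sketch of the ordering rule, with made-up file names and a simulated command line:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Positional arguments are filled in the order they are added here.
    parser.add_argument("input", help="Input file")
    parser.add_argument("output", help="Output file")
    # Optional arguments (with dashes) can appear anywhere on the command line.
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(["data.csv", "plot.png"])
print(args.input, args.output)                 # data.csv plot.png

# The optional flag does not disturb the positional order.
args_verbose = build_parser().parse_args(["-v", "data.csv", "plot.png"])
print(args_verbose.input, args_verbose.output, args_verbose.verbose)
```

Swapping the two `add_argument` lines for `input` and `output` would swap which command-line value ends up in which attribute.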
- 27. Question related to the course: In order to get credits, I understand you need to submit at least four Python scripts / notebooks based on an example code. Is it OK to have scripts that are basically generalized versions of the exercises? (for example, with more customizability etc.)
- Yes this is ok. Make them useful for your learning, that's the most important thing.
- 28. Is it possible to test code with argparse in JupyterLab / a notebook?
- I am unsure if it is the best way, I add an "if" so that if args are missing, then I set them up.
- Set them up in a cell above in the notebook?
- You can spoof the argument parsing in the following way:
```python
import argparse
# Create parser
parser = argparse.ArgumentParser()
# One positional and one optional argument
parser.add_argument('name', type=str, metavar="N",
                    help="The name of the subject")
parser.add_argument('-d', '--date', type=str, default="01/01/2000",
                    help="Birth date of the subject")
# Give some example arguments
args = parser.parse_args(["2", "-d", "01/12/2012"])
print(args.name + " was born on " + args.date)
```
- Of course, you cannot give arguments to the jupyter notebook itself.
- Thanks
- 29. So what was the point of adding --output optional argument?
- If you check the solution https://aaltoscicomp.github.io/python-for-scicomp/scripts/#exercises-2 output is not an optional argument (I presume we are talking about that). The point is that most likely input and output will be different for all the many input files you need to process.
The lessons linked at the bottom, Reproducible Research and Modular Code Development, describe how a command line interface fits in the bigger picture.
- 30. "python weather_observations.py https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv --output temperature_cairo.png"
did not work with the example solution. When I added `--` (two dashes) in front of "output" in the code, it did work.
example: parser.add_argument("output", type=str, help="Output plot file")
- Oh yeah, that example solution has `output` as a positional argument, not a `--` argument. Seems like the last example line is wrong... I will fix.
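To illustrate the difference, here is a sketch with two hypothetical parsers, one with a positional `output` and one with `--output` (the default `plot.png` is an assumption). The command line from the question is passed as a list, with a shortened input name.

```python
import argparse


def positional_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", type=str, help="Input data file")
    parser.add_argument("output", type=str, help="Output plot file")
    return parser


def optional_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", type=str, help="Input data file")
    parser.add_argument("--output", type=str, default="plot.png",
                        help="Output plot file")
    return parser


command_line = ["weather_cairo.csv", "--output", "temperature_cairo.png"]

# With "--output" defined, this command line parses fine.
args = optional_parser().parse_args(command_line)
print(args.output)      # temperature_cairo.png

# With a positional "output", the same command line is rejected:
# argparse prints a usage message and raises SystemExit.
try:
    positional_parser().parse_args(command_line)
    rejected = False
except SystemExit:
    rejected = True
print(rejected)         # True
```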
- 31. I find it a bit odd that when I start a terminal from the JupyterLab launcher, it starts in the base environment, not in the environment created for this course from which I start JupyterLab. So I still have to run conda activate in the terminal. Maybe the behaviour is intentional, but it is confusing.
- Hm. Good questions... it does seem a bit weird. We'll have to investigate (most of us don't launch jupyter/the terminal this way)
- 32. Is the sys.argv way of passing arguments to the script equivalent to the argparse way? Is there any differences I should be aware of? (I normaly use sys.argv)
- argparse uses sys.argv under the hood. For practical purposes, they are the same and use whichever you like the most.
- 33. How can we use parsers (or other ways) to communicate between different scripts written in different languages? For example, I would like to write my code in C++ for performance reasons and then call it from Python with different initial parameters. Is parsers the way to go or are there better options? Right now I am using pybind, but I am also not calling the function from the terminal at the moment.
- There are different ways: one code calling another through a command line interface, or C++ plus an interface to call it directly from Python (basically, make a Python module out of it: this is called "extending Python").
- So, what's better? Depends on the case. extending python is usually more advanced and good when it's very closely connected. If C++ is a separate program it's controllable by other systems too... maybe worth it, maybe not.
- Pybind is good for this, if you have the main python script with pybind inside, you can transform that into an executable script if this is useful for your case. I think parsers are a good way to go if your goal is to run them in remote systems with terminal interface like SLURM clusters (Triton, Turso, CSC, Lumi etc...) especially if your case will be the "trivially parallel" i.e. the same code is run independently for different inputs and you use the python parsing script to pass these many inputs.
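A sketch of the "call it as a command line program" route using `subprocess`. To keep the example runnable without a compiler, a tiny Python one-liner stands in for the compiled C++ executable; `./my_simulation` is a hypothetical name for the real binary.

```python
import subprocess
import sys

# Stand-in for a compiled C++ executable: a tiny program that squares its
# first command-line argument. For a real binary, replace this list with
# something like ["./my_simulation"] (hypothetical name).
external_program = [sys.executable, "-c",
                    "import sys; print(float(sys.argv[1]) ** 2)"]


def run_external(parameter):
    """Run the external program with one parameter and return its output."""
    completed = subprocess.run(
        external_program + [str(parameter)],
        capture_output=True, text=True, check=True,
    )
    return float(completed.stdout)


# Same program, different initial parameters.
results = {p: run_external(p) for p in [1.0, 2.0, 3.0]}
print(results)   # {1.0: 1.0, 2.0: 4.0, 3.0: 9.0}
```

Starting a process per call has overhead, so for many small calls a binding such as pybind is usually the better fit; for long-running, independent runs the command line route is simple and works well with cluster schedulers.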
## Break until xx:02
## Library ecosystem
https://aaltoscicomp.github.io/python-for-scicomp/libraries/
This is a high-level overview talk before going to "Dependencies".
- 34. Very general question. If you use more than one machine, do you want to install anaconda on each one, or only on one and use the cloud version on the other?
- Most of us would install "miniconda" on our computers and then make our own custom environments there with what we need. This is the topic of "Dependencies" that comes next.
- There are many different methods and all have their uses.
- 35. Where can we find bad-written (sick) Python codes? I am eager to clean them.
- To learn from them (what not to do) or improve them? For examples of not-so-good ones, there are some from the instructors at the bottom of the Libraries lesson page. Too old to be fixed.
- From the Libraries lesson, you mean those in the Exercise (where?)
- You could ask a large language model something like "for learning purposes, generate some badly written but working python code that I can fix according to the pep8 style". I tested it with Llama2 it gave some trivial example... not sure if that's what you are after. :)
- This is what I am looking for. Some badly written or old codes that can be rewritten to develop my knowledge on clean coding.
- 36. I initially started working with anaconda and got to like it, but then I was introduced into using Python virtual environments in each projects that I would start. Is there a clear difference to consider one of the two approaches? (venv or miniconda)
- Anaconda is based on `conda`, which allows conda environments, which are basically like virtual environments (but conda has more compiled code, so it's more self-contained)
- I would definitely try to start using environments, since it makes your code more reproducible and moveable. But you can use conda environments through Anaconda and get this!
- conda allows you to install all sorts of things that might be needed (system libraries, etc). I think miniconda + an environment definition file is the best starting point for a new project. More details in the CodeRefinery lesson on reproducibility https://coderefinery.github.io/reproducible-research/dependencies/ (even better: "micromamba")
## Dependency management
https://aaltoscicomp.github.io/python-for-scicomp/dependencies/
- 37. matplotlib just crashed
- I can't tell anything from the error message below. I'd say do some trial and error to see how you can get it to work.
```
RuntimeError Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\IPython\core\formatters.py:340, in BaseFormatter.__call__(self, obj)
338 pass
339 else:
--> 340 return printer(obj)
341 # Finally look for special method names
342 method = get_real_method(obj, self.print_method)
File ~\anaconda3\Lib\site-packages\IPython\core\pylabtools.py:152, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
149 from matplotlib.backend_bases import FigureCanvasBase
150 FigureCanvasBase(fig)
--> 152 fig.canvas.print_figure(bytes_io, **kw)
153 data = bytes_io.getvalue()
154 if fmt == 'svg':
File ~\anaconda3\Lib\site-packages\matplotlib\backend_bases.py:2158, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)
2155 # we do this instead of `self.figure.draw_without_rendering`
2156 # so that we can inject the orientation
2157 with getattr(renderer, "_draw_disabled", nullcontext)():
-> 2158 self.figure.draw(renderer)
2159 if bbox_inches:
2160 if bbox_inches == "tight":
File ~\anaconda3\Lib\site-packages\matplotlib\artist.py:95, in _finalize_rasterization.<locals>.draw_wrapper(artist, renderer, *args, **kwargs)
93 @wraps(draw)
94 def draw_wrapper(artist, renderer, *args, **kwargs):
---> 95 result = draw(artist, renderer, *args, **kwargs)
96 if renderer._rasterizing:
97 renderer.stop_rasterizing()
File ~\anaconda3\Lib\site-packages\matplotlib\artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File ~\anaconda3\Lib\site-packages\matplotlib\figure.py:3154, in Figure.draw(self, renderer)
3151 # ValueError can occur when resizing a window.
3153 self.patch.draw(renderer)
-> 3154 mimage._draw_list_compositing_images(
3155 renderer, self, artists, self.suppressComposite)
3157 for sfig in self.subfigs:
3158 sfig.draw(renderer)
File ~\anaconda3\Lib\site-packages\matplotlib\image.py:132, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
130 if not_composite or not has_images:
131 for a in artists:
--> 132 a.draw(renderer)
133 else:
134 # Composite any adjacent images together
135 image_group = []
File ~\anaconda3\Lib\site-packages\matplotlib\artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File ~\anaconda3\Lib\site-packages\matplotlib\axes\_base.py:3070, in _AxesBase.draw(self, renderer)
3067 if artists_rasterized:
3068 _draw_rasterized(self.figure, artists_rasterized, renderer)
-> 3070 mimage._draw_list_compositing_images(
3071 renderer, self, artists, self.figure.suppressComposite)
3073 renderer.close_group('axes')
3074 self.stale = False
File ~\anaconda3\Lib\site-packages\matplotlib\image.py:132, in _draw_list_compositing_images(renderer, parent, artists, suppress_composite)
130 if not_composite or not has_images:
131 for a in artists:
--> 132 a.draw(renderer)
133 else:
134 # Composite any adjacent images together
135 image_group = []
File ~\anaconda3\Lib\site-packages\matplotlib\artist.py:72, in allow_rasterization.<locals>.draw_wrapper(artist, renderer)
69 if artist.get_agg_filter() is not None:
70 renderer.start_filter()
---> 72 return draw(artist, renderer)
73 finally:
74 if artist.get_agg_filter() is not None:
File ~\anaconda3\Lib\site-packages\matplotlib\axis.py:1388, in Axis.draw(self, renderer, *args, **kwargs)
1385 renderer.open_group(__name__, gid=self.get_gid())
1387 ticks_to_draw = self._update_ticks()
-> 1388 tlb1, tlb2 = self._get_ticklabel_bboxes(ticks_to_draw, renderer)
1390 for tick in ticks_to_draw:
1391 tick.draw(renderer)
File ~\anaconda3\Lib\site-packages\matplotlib\axis.py:1315, in Axis._get_ticklabel_bboxes(self, ticks, renderer)
1313 if renderer is None:
1314 renderer = self.figure._get_renderer()
-> 1315 return ([tick.label1.get_window_extent(renderer)
1316 for tick in ticks if tick.label1.get_visible()],
1317 [tick.label2.get_window_extent(renderer)
1318 for tick in ticks if tick.label2.get_visible()])
File ~\anaconda3\Lib\site-packages\matplotlib\axis.py:1315, in <listcomp>(.0)
1313 if renderer is None:
1314 renderer = self.figure._get_renderer()
-> 1315 return ([tick.label1.get_window_extent(renderer)
1316 for tick in ticks if tick.label1.get_visible()],
1317 [tick.label2.get_window_extent(renderer)
1318 for tick in ticks if tick.label2.get_visible()]
```
- Looks like matplotlib tries to open a window. Did you set the backend to "Agg" or did you run `%matplotlib inline` before?
- 38. In order to pin the versions in those files (environment.yml for example), do you have to do it manually? I mean check the versions of the packages and then type them into the file.
- You can `pip list` or `conda list` to list versions. `conda env export` or `pip freeze` exports in a format that's usable directly
- But usually I'll do it manually to make sure I have just what I need and no more, and know what's in it.
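If you want to look up versions from within Python (e.g. in a notebook), the standard library can list installed distributions. This sketch only approximates what `pip freeze` prints and ignores special cases such as editable installs; the function name is made up.

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' lines for installed distributions."""
    lines = set()
    for dist in distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip entries without readable metadata
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


for line in pinned_requirements():
    print(line)
```

From that list you can copy just the packages you actually use into your `requirements.txt` or `environment.yml`.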
- 39. I used `conda update --all` and the error still exists
- What was the original problem? matplotlib stopped working and gave the above errors
- My guess it's some bug in code or how it's used. I would probably play around and try to figure out the source of the problem
- `conda update --all` basically tries to recreate the whole environment with newer packages, and that can be a hard problem to solve. I would recommend creating a new environment with the packages you need. If you have installed some packages into the `base` environment that broke it, you basically have to re-install Anaconda to fix this. Using separate environments is a better way because you can remove them without messing up the `base` environment.
- 40. Where should the environment file "environment.yml" be stored?
- Usually in the folder where you create the environment, but it doesn't matter because it is just a "formula" telling what you need. You will use it once. (or well.. ideally once :))
- I often use it many times: when the environment gets messed up, I delete and re-create it.
- When I have created any replication file, I just put it in the project folder. Then the practitioner can directly clone the repo and install all the requirements when starting to touch the project in the terminal.
- 41. How to go back to default anaconda environment after activating like that (create environment with conda and yaml file)?
- `conda activate base` or `source activate base` - re-activates the base. Or you can always stop the shell and restart.
- Okay, so it is only activated within the shell?
- `conda env list` is also useful to check which env you have and the one with the star is the active one.
- 42. I just started and for me it has already installed several, now verifying, maybe new start? Executing now... done now
- Did it work?
- 43. If I create an environment from a .yml file, will it install the packages from the same sources as in the original one?
- Good question! In theory a certain conda channel might stop supporting some old versions which might still be available on another channel, so you would not get exactly the same environment... but I think this is an unlikely scenario. For long term preservation you can use containers, so that the environment is not just the conda environment but also the full operating system. Then the container image can be preserved and (hopefully) reused many years ahead.
- It would install what's defined there, using what's configured. Usually it would be the same sources but in some cases it could be different... Versions may be different if not closely specified.
- 44. How to install `mamba` in Ubuntu?
- Usually it would be downloaded separately.
- FYI: [mamba's documentation](https://mamba.readthedocs.io/en/latest/)
- You can download miniforge installer and run it. It is similar to miniconda's and it contains mamba: https://github.com/conda-forge/miniforge
- Again, I would not recommend running `conda init` to have it activated all the time, whether on Linux, Windows or Mac.
- You can also install as a package from `conda-forge`-channel.
- 45. If I have an environment is there a command that saves the environment dependencies etc. afterward? To a .yml file.
- You can export the environment: `conda env export`
- `conda env export --from-history` is smarter: it only lists what you actually requested.
- 46. When I run `pip install -r requirement.txt`, I obtain
```
ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\User\\anaconda3\\Lib\\site-packages\\~umpy\\.libs\\libopenblas.EL2C6PLE4ZYW3ECEVIV3OXXGRN2NRFM2.gfortran-win_amd64.dll'
Consider using the `--user` option or check the permissions.
```
- What do I do ?
- Pip is trying to install new packages on a folder that is protected. I would first create a new conda environment, activate it, and then run the pip install command.
- 47. If I have an environment activated with some packages installed, how do I check what versions I have in that environment? (Linux) Can I dump my environment into a file somehow? Meaning I get a list of all dependencies in that environment.
- `conda list` shows the installed packages with their versions; `conda env export` or `conda env export --from-history` prints the environment as YAML, which you can redirect into a file.
- Can I do the same with virtualenv?
- `pip list` or `pip freeze`
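- The same information is also available from inside Python. A minimal sketch using the standard-library `importlib.metadata` (the helper name `installed_packages` is made up here); it lists whatever the running interpreter can see, in a `pip freeze`-like form:

```python
from importlib.metadata import distributions


def installed_packages():
    """Return {name: version} for every distribution visible to this interpreter."""
    packages = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        # Skip entries with broken metadata that carry no name
        if name:
            packages[name] = dist.version
    return packages


# Print the first few entries as "name==version" lines
pins = [
    f"{name}=={version}"
    for name, version in sorted(installed_packages().items(), key=lambda kv: kv[0].lower())
]
print("\n".join(pins[:10]))
```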
- 48. I'm having troubles activating the created conda environment. Running conda activate python310-env doesn't give any error reports but doesn't activate the env. Running conda init gives an error " No action taken. Operation failed." (and some "needs sudo" mentions in front of folder paths)
- You might want to try `source activate NAMEOFENV`
- thanks, that fails also:
```console
source : The term 'source' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ source activate python310-env
+ ~~~~~~
    + CategoryInfo : ObjectNotFound: (source:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
```
- Could it be that you are on Windows? `source` is a Linux shell command. On Windows, if conda is in the path, `conda activate` *should* work. Alternatively there is an executable called *activate* in the environment folder (but I might misremember now)
- Ah okay, yes I'm running the Jupyterlab in windows (through anaconda navigator)
- If you just type `conda` does it say anything?
- yes, a standard usage info. "usage: ..."
- if you type `conda env list` does it show the environment you created?
- yes, but the star is in front of the base env so I guess it's the one that's active
- Now try typing `activate NAMEOFENVIRONMENT`. In general I would rather create the environment outside JupyterLab, then activate the environment, then start JupyterLab. See for example here: https://aaltoscicomp.github.io/python-for-scicomp/installation/ and search for the box "Generic instructions with miniconda and an environment file".
- okay, thanks! `activate NAMEOFENVIRONMENT` didn't work either, but I'll try to figure it out.
- 49. When adding a new library to an environment, do we need to manually update the config file (or refresh the environment?) every time it is added? If so, is there an easy way to automate the addition of libraries in the environment?
- I usually recommend first adding the package to the environment file and then running `conda env update --file environment.yml`. This makes conda aware that you want the other packages as well and makes certain that conda uses the channels you have specified. If the installation fails (because of conflicts with what is already installed), I usually recreate the environment. Conda caches the downloaded packages, so the second installation should be faster.
**Computers can be set up in many ways, and this can cause a lot of the problems above. For example, a university-managed computer might not let you create environments in the folder where Anaconda is installed, since the setup is not designed for scientific work where you manage your own software.**
## Break until xx:23
- 50. is it possible to change the default user when starting a jupyter notebook session? I'm trying to reinstall a clean version of anaconda vs. many old mistakes, but now it's uninstalled it still takes me back to my old non-functioning account.
- User... what do you mean by user and account?
- when I start a new notebook there's a user ID in the url. This stays the same no matter if I'm logged in, have installed conda or which machine I'm on
- is the URL starting with "localhost..."? -no
- what is it starting with? nd.anaconda.cloud/jupyter/user/...
- I think it would be enough to create a new account at that service (https://anaconda.cloud) so that you can start from scratch
- That's what I tried, but the option to create a new account is not showing up as before.
- I am unsure, I think you need to ask to the support at anaconda.cloud. In theory you should be able to signup with a new email address that was not used before
- Thanks for trying, I'll contact their support then!
- OK. I seem to have been able to fix it after some googling. I'll write it shortly here in case it helps someone else. On top of just uninstalling Anaconda, you also have to dig through your hard drive and delete all conda and Jupyter related files that the uninstall doesn't remove. They include some standard configurations that otherwise keep sending you back to the same URLs.
## Binder
https://aaltoscicomp.github.io/python-for-scicomp/binder/
:::info
### Binder-1 discussion question:
Why is it possibly not enough to share “just” your code? What problems can you anticipate 2-5 years from now? Write some thoughts below:
- library versions change
- Version of dependencies
- Obsolescence of dependencies
- Versions might change
- We do not know what packages we need to install
- without documentation it can be hard to use the code properly
- It is usually difficult to connect different systems that constantly change
- path to access data
- is it understandable to other people?
- analyzing the code is more difficult without docs
- Some dependent software /tool is not available at all
:::
**The following lesson is a demo since it requires git, GitHub and Zenodo accounts**
**Demo repository**: https://github.com/tpfau/BinderDemo/
- 51. So, would Binder be suitable for studies when the same project folder has the raw data files used in the study?
- Yes, again try not to have too big data, but if it's small (max a few MB) sure.
- As long as the data is not personal data or private data it should be fine. Binder is also not the best place for heavy calculations.
- 52. Is this method of sharing data good when the data is around 100 GB+?
- Zenodo has a default limit of 50 GB, but one can start a conversation with them depending on the case.
- If you are in Finland, CSC can also help you with this type of large data sharing.
- Have a look at the registry of data repositories https://www.re3data.org/ if there is a field specific repository that would allow large storage space for your case. For example the European Genome-phenome archive has a submission limit of 10TB https://ega-archive.org/
- 53. Is it possible to have an environment within an environment? Because it takes a long time to load the whole environment. The user could maybe load whatever library is necessary.
- Short answer is no. The issue is that the environment has many tiny files. There are ways of solving this by "squashing" all the tiny files into a single one and making the terminal believe that there are still many tiny files. It is maybe a bit too long to explain here, but here is a link: https://docs.csc.fi/computing/containers/tykky/ I know there was another example but I cannot find it.
## Feedback, day 3
:::info
### News for day 4
- There is one lesson which uses the command line, but there is also plenty to be done within JupyterLab only.
- The topics are definitely more advanced, so something for everyone!
- We also have an exciting panel discussion where you can "ask us anything."
:::
### Today was (multi-answer):
- too fast: o
- just right: oooooooooooo
- too slow: o
- too easy: o
- right level: ooo
- too advanced: oo
- I would recommend this course to others: oooooooo
- Exercises were good: ooo
- Good interactive session to learn about libraries and dependency management. Python version control.
- I would recommend today to others: ooooooo
- I wouldn't recommend today:o
Comments (will be moved below):
- Some difficulties for Windows (no conda from normal command prompt, etc.)
- Cool, but indeed all this workflow looks easier with Linux, in my point of view xD
- Once I got stuck I couldn't follow along that much anymore. +1
- the topic today was not interactive
- I think, the course is more advanced and needs more time to learn.
- I like the double-act (two persons dialogue) style very much. One person is a tutor and the other a bridging person between participants and the tutor. The latter asks questions from a beginners point of view. It works very well (maybe if the second person picks up questions in user chats and subtly ask the useful questions). Thanks!
### One good thing about today:
- Extremely interesting topic on scripts, arguments and config files (YAML). Looking forward to interesting topic on organizing larger Python projects tomorrow! oo
- I really enjoyed the exercises on topics mentioned above!
- Totally new for me: Binder, mamba, command line conversion, argparse
- Mamba performance boost compared to conda.
- ...
### One thing to be improved for next time:
- Some errors in example code of exercises though. o
- A more detailed explanation about how Zenodo works and why it is useful.
- My view in Twitch is fuzzy -> yes for me too, very uncomfortable
- How bad? Is it that we need to zoom in more/adjust our resolution, or do you think it's a Twitch encoding problem?
- Now looks great, but sometimes fuzzy. Not a big deal, because https://aaltoscicomp.github.io/python-for-scicomp provides clear view.
- In my case I restarted Firefox and it became sharp, but after a short time it became fuzzy again, so I have to strain my eyes to read the code. If it's only me, maybe it has something to do with too much RAM being used by Firefox due to jobs in several windows? Now it is clear again!!! I didn't change anything...
- In Twitch you can change the resolution behind the "cog" icon. By default the resolution is automatically chosen based on your internet speed, but it can go lower, if the internet is slow.
- I actually can and it works!! Thanks a lot! Useful for tomorrow.
### Any other comments:
- Thank you for today's lesson. Enjoyed working on exercises. I could learn a lot of useful and new things (e.g.,dependency management, binder, github, etc.).
- Today I really got to feel as a pure beginner haha I had no clue about Binder.
- I first thought I would have liked to get today's session earlier (like before starting with numpy...), but now I think it is at the perfect time! It was explained very well - Thanks! (Would like to get more info about all the stuff from today, but obviously that would go beyond the scope)
- Thank you for this lesson. There was lots of information on automation techniques in data science. ...
- I haven't used conda before and it was really good to have this overview, since I had heard about it so many times and had no clue why it is important and how it works.
- I like this course, gives a good overview of "good coding practises" but of course requires weeks of hands on practise to learn o
- Thank you very much for today's lessons. I got many insights from your discussions. Thank you for the interactive session with more than one instructor.
- Did we cover "SciPy" and "Web APIs with Python"? Were they skipped?
- SciPy was skipped, Web APIs is tomorrow (we need to update the order...)
# Day 4
## Icebreakers
What's the most surprising thing you have learned so far in this course?:
- Yesterday the way you used Binder blew my mind haha
- That array indexing [1:3] means only two components o
- Python library ecosystem management (Yesterday) was a good review of MUST and MUST NOT
- Never seen linters before, also the way to keep the environments clean and reconstructable with the simple files
- Why people use Mamba and how much faster it was for even simple packages
- Set up Python environment using "Environment file"
- Not necessarily surprising but always wondered and wanted to know how to run code from command line :D .
- The use of config files
- I'm still learning new topics after my IoT education
- .
- .
... most surprising thing in your career?
- That engineers that can not code are project managers!
- How little most scientists know about data analysis.
- How many times "the wheel" gets reinvented
- How my project was way too simple for any computer scientist, but way too hard for me (initially). There's a really large knowledge gap.
- How useful coding is, I didn't think I would ever do it (much at least)
- When I learned programming language first time, there were not so many people who were interested in programming languages, but now it is popular and I found it is very useful at work!
- How R markdown changed the way I make notes. Jupyter is quite similar of course.
- .
- .
## Parallel Programming
https://aaltoscicomp.github.io/python-for-scicomp/parallel/
Screen is blurry?:
- yes:
- no: ooo
- slightly: o
(I'm not sure I can make it better...)
- 1. (from twitch chat) do you guys have a python method for converting hex codes into binary strings and then calculating z-scores from the numbers of 0s and 1s in the string? I have a statistical analysis I want to perform on such data and doing the conversion by hand is WAY too time consuming. I want to test an RNG by performing a chi square goodness of fit test on its data, but it's all binary data coded as hex strings
- So the actual data are 0s and 1s and you need to do a chisquare test on their frequencies? There are various ways of converting hex to binary, then you convert the binary string to a vector of ones and zeros and then you can run your chi2 frequencies test (if I understood correctly your goal). https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
- Hex to binary data: `codecs.decode('61fea2', 'hex')` (the hex string needs an even number of digits)
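```python
import numpy as np
from scipy.stats import chisquare

hex_data = "abcd"
# Each hex digit encodes 4 bits; zero-pad so leading zero bits are not dropped
n_bits = 4 * len(hex_data)
bits = format(int(hex_data, 16), f"0{n_bits}b")
vec = [int(bit) for bit in bits]
# Count zeros and ones (minlength=2 also covers all-zero or all-one input)
freq = np.bincount(vec, minlength=2)
# A fair generator is expected to give half zeros and half ones
expected_freq = np.array([len(vec) / 2, len(vec) / 2])
chi2, p_value = chisquare(freq, f_exp=expected_freq)
```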
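- The question also asked about z-scores. A minimal sketch with only the standard library, assuming the null hypothesis of independent fair bits (so the number of ones has mean n/2 and standard deviation sqrt(n)/2); the function names are made up for this example:

```python
import math


def bit_counts(hex_data):
    """Return (zeros, ones) for a hex string, keeping leading zero bits."""
    n_bits = 4 * len(hex_data)
    bits = format(int(hex_data, 16), f"0{n_bits}b")
    ones = bits.count("1")
    return n_bits - ones, ones


def ones_z_score(hex_data):
    """z-score of the number of ones under the fair-bit assumption."""
    zeros, ones = bit_counts(hex_data)
    n = zeros + ones
    return (ones - n / 2) / (math.sqrt(n) / 2)


print(bit_counts("abcd"))    # (6, 10)
print(ones_z_score("abcd"))  # 1.0
```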
- 2. Peer-Programming aims to increase the quality of code
- yes! There should be more working together.
- 3. What's the difference between parallel programming and parallel algorithms?
- Not sure if there is an official definition, but to me a parallel algorithm would make use of parallel programming (i.e. a parallel algorithm runs parts of the code in parallel).
- 4. Sharing resources
- 5. You're mentioning processors specifically for threading, does Python actually handle cores differently than processors?
- I don't think so, since this is mainly OS handled. +1
- In most SciComp contexts "processors" and "cores" are used the same way.
- 6. Is numpy broadcasting some form of embarrassingly parallel processing?
- *embarrassingly* parallel no: that's specifically a type where you run the same code multiple times and the runs are completely separate.
- Numpy and other libraries can do other types of parallelism inside of them without you being aware (that's still very easy for you!)
- 7. Would a deep/automated learning algorithm be a good example of what needs to use message passing?
- Yes, although "needs" is always a bit tricky, since it's more a "it wouldn't run fast enough on a single thread/process". Essentially anything that gets heavy enough so that it cannot be run on a single machine will need some form of communication (like huge physics simulations, that just need loads and loads of CPUs to do a parallel computation, but you don't have enough on one machine). Or large systems of independent agents that do communicate to achieve something.
- Deep learning codes usually use message passing when one runs data parallel learning (different batches of data are trained on different machines). They also usually use multiprocessing to load data with multiple workers so that GPUs have enough data to work on.
- 8. One at a time or Many at a time? Which one pool is doing?
- If I get your question right: it depends on the number of tasks and the number of available CPU cores. The pool runs many tasks simultaneously, but each worker handles only one of them at a time. Essentially, the number you pass to `Pool` sets the number of workers (for `multiprocessing.Pool` these are separate processes rather than threads). If you start more workers than you have cores, the OS has to schedule them in turn, so having more workers than physical compute units easily leads to overhead and slows down the calculation, as each physical processing unit can only do one calculation at a time.
- Thanks! Then, the pool gives one worker to an available core. It waits for the computation to finish. Then, the next one is forwarded.
- I think (I'm not entirely sure, but that's my understanding) that `Pool` essentially creates worker processes and asks the OS to execute them. Then, when it gets something to do, it assigns operations to these workers. Each worker requests resources from the OS to perform its assigned calculations. So if there are fewer workers than physical compute units (or an equal number), all workers run simultaneously (maybe some will have to wait at some point if the OS also has other things to do), but if there are more, the OS hands out the physical compute units in turn and asks workers to suspend/wait every now and then to let others work. And `Pool` will wait for each computation it passed to a worker to complete and then hand a new one to that worker (I think; it could also be that it pre-assigns the computations, and there are benefits to both approaches. The latter requires less overhead, but can be extremely inefficient when the times required for each computation differ significantly).
- Thanks!
- Ok... need to correct myself, I can't find how exactly it does load balancing... Some sources say pre-assignment, some say assignment on completion...
- 9. What Python libraries are recommended for performance benchmarking so that we can compare a non-parallel and a parallel code?
- If you have a specific function you want to benchmark, the [timeit](https://docs.python.org/3/library/timeit.html) module from the Python standard library is very good for timing the execution. If you need to benchmark the whole program, [scalene](https://github.com/plasma-umass/scalene) is a good profiler that will give you line-level information on your code's performance.
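- A minimal sketch of timing one function with `timeit` outside a notebook (inside Jupyter the `%timeit` magic does the same job). The function `serial_sum_of_squares` and the repeat counts are made up for the example; a parallel variant would be timed the same way and the two numbers compared.

```python
import timeit


def serial_sum_of_squares(n):
    """Stand-in for the non-parallel code you want to benchmark."""
    return sum(i * i for i in range(n))


NUMBER = 20   # calls per measurement
REPEAT = 3    # number of measurements

# Each entry is the total time in seconds for NUMBER calls
times = timeit.repeat(lambda: serial_sum_of_squares(10_000), number=NUMBER, repeat=REPEAT)

# The minimum is usually the least disturbed by other activity on the machine
best_per_call = min(times) / NUMBER
print(f"best of {REPEAT}: {best_per_call * 1e3:.3f} ms per call")
```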
:::info
### Exercise 1-2, until xx:35
https://aaltoscicomp.github.io/python-for-scicomp/parallel/#exercises-multiprocessing
- done:
- need more time: o
- not doing it:
-
:::
Discussion of exercise Parallel-2:
- Timing without pooling (`n, n_inside_circle = sample(10**6)`)
```
160 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
timing for pool of 10 (`results = pool.map(sample, [10**5] * 10)`)
```
35.3 ms ± 513 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
timing for pool of 1 (`results = pool.map(sample, [10**5] * 1)`)
```
18 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
- Could better performance for pool of 1 vs 10 for this example be related to multithreading overhead?
- Most likely yes.
- Wondering why pool of 1 faster than without pooling?
- If you use the exact code from above: The one without pooling works on something an order of magnitude larger (`10**6` vs `10**5`) ...
- :)
- `results = pool.map(sample, [10**6] * 1)`
```
153 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
- Close enough to without pooling.
Continued:
- 10. What does pool.close() do? and why is it important?
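- `pool.close()` tells the pool that no more tasks will be submitted; the additional worker processes exit once the outstanding work is done. Using the `with ...` syntax avoids calling it by hand, since the pool is shut down automatically when the block ends, so it is usually easier to use.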
- 11. Getting AttributeError: `Can't pickle local object 'inner.<locals>.sample'`
- `Pool` pickles objects (serializes Python objects into bytes) to communicate between the processes. If you pass something that can't be pickled, you get this error. Here the message points at `sample` being defined inside another function (`inner`): local functions can't be pickled. Try defining `sample` at the top level of the module instead.
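- The pickling limitation can be reproduced without any pool. A small sketch (the helper `can_pickle` is made up for this example; the exact exception type raised for local functions depends on the Python version, so both are caught):

```python
import math
import pickle


def inner():
    # `sample` only exists inside `inner`, so it has no importable name
    def sample(n):
        return n

    return sample


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


print(can_pickle(math.sqrt))  # True: module-level function, pickled by reference
print(can_pickle(inner()))    # False: local function
```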
(aside: screen should be more clear now, is it?)
- 12. Can you explain how to install OpenMPI?
- That's a bit beyond this course. I'd look at the OpenMPI installation instructions (or if you use Linux, it's probably in the package manager). We can get to all kinds of optimization if you want...
- What if I want to run an algorithm on 1000's of photos? :D
- Probably you'd do that embarrassingly parallel or with multiprocessing: 1000 different inputs can be split up and done separately.
- 13. Do you have some guidelines for deciding/knowing in what way I should parallelize my code
- Think of the smallest unit of processing that does not depend on anything else (= the embarrassingly parallel case). E.g. you need to process N files and get one output per file, and they are independent.
- Pasta takes 8 minutes to cook, even if you have 8 pots and 8 stoves it won't speed up your ~~code~~ cooking :) -> Video "what is parallel computing? an analogy with cooking" https://www.youtube.com/watch?v=N78NQqma-8k&list=PLZLVmS9rf3nMKR2jMglaN4su3ojWtWMVw&index=4
- Easiest/simplest way possible :).
- 14. Both multiprocessing and multiprocess are still running, is it normal?
- This might be due to jupyter locking the GIL. We'll check the example.
- For me the following two cells worked:
```python
import random

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n, n_inside_circle
```
```python
%%timeit
import multiprocessing.pool
pool = multiprocessing.pool.Pool()
results = pool.map(sample, [10**5] * 10)
n_sum = sum(x[0] for x in results)
n_inside_circle_sum = sum(x[1] for x in results)
pi = 4.0 * (n_inside_circle_sum / n_sum)
pi
```
- 15. Anybody else having issues with the multiprocessing version of the first exercise taking forever? Even when copying the example code exactly, the computation just never finishes for me. Any ideas what the issue could be?
- We'll check. But there's another good point here: parallel programming has a whole other class of bugs that can happen. It's an art to do this debugging.
- I also got my code lagged with that computation. Pool from multiprocessing wasn't working properly.
- Me too. 16 min and counting
- It probably won't finish. If you tell us which OS and Python version you're running (e.g. run `import sys; print(sys.version)`), we can try to debug the example for future lessons.
- OK, is it so dependent on the setup??
3.9.7 (default, Sep 16 2021, 08:50:36) [Clang 10.0.0 ]
- 16. I cannot even run this code :/ It gave an AttributeError:
```
from multiprocessing import Pool

with Pool() as pool:
    pool.map(square, [1, 2, 3, 4, 5, 6])

AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
```
- Did you run this cell before running that:
```python
def square(x):
    return x*x
```
- You might want to try restarting the kernel as well.
- Sure, I tried all possible solutions. I just wanted to do this example (which you did in the lecture) because I got the same AttributeError in the exercises. That's why I tried the basic example.
```
Process SpawnPoolWorker-2:
Traceback (most recent call last):
  File "/…/miniconda3/envs/python-for-scicomp/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
```
- Do you have any files in the folder where you're running that could affect the imports? E.g. files named `multiprocessing.py` or `run.py` or `square.py`
- No. It is the only file in that folder.
- Which OS are you running? Windows, Linux or Mac? This might be an OS dependent thing.
- Mac. I'll try it on the ML nodes also. Let's see :) Thanks a lot :)
- I found the solution here (*"For some reason Pool does not always work with objects not defined in an imported module. So you have to write your function into a different file and import the module"*): https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror
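- A sketch of that workaround. The worker function lives in its own file (here a hypothetical `square_module.py`; it is written to a temporary folder only to keep the example self-contained, normally you would create the file by hand next to your notebook or script) and is imported before being passed to the pool:

```python
import importlib
import sys
import tempfile
from multiprocessing import Pool
from pathlib import Path

# Hypothetical module name and content; any importable file works
MODULE_NAME = "square_module"
MODULE_SOURCE = "def square(x):\n    return x * x\n"


def write_worker_module(folder):
    """Write the worker function to its own file and import it from there."""
    path = Path(folder) / f"{MODULE_NAME}.py"
    path.write_text(MODULE_SOURCE)
    if str(folder) not in sys.path:
        sys.path.insert(0, str(folder))
    importlib.invalidate_caches()
    return importlib.import_module(MODULE_NAME)


def main():
    with tempfile.TemporaryDirectory() as folder:
        worker = write_worker_module(folder)
        # The function is sent to the workers by reference (module + name),
        # so each worker process can import it instead of looking in __main__
        with Pool(processes=2) as pool:
            print(pool.map(worker.square, [1, 2, 3, 4, 5, 6]))


if __name__ == "__main__":
    main()  # prints [1, 4, 9, 16, 25, 36]
```

- The `if __name__ == "__main__":` guard matters on macOS and Windows, where new processes are started fresh and re-import the main script.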
List of useful parallel Python libraries:
- joblib : Easier maps with multiprocessing
- jax : For faster numpy and autograd
- dask : Useful for large datasets
- ray : For running big machine learning setups
- scalene: Good for getting runtime profiling information
:::info
### Break until XX:11
:::
## Packaging
https://aaltoscicomp.github.io/python-for-scicomp/packaging/
- 17. Can you do this packaging in JupyterLab?
- Yes you can. You can create new files and folders in JupyterLab. If you create the same structure it should work. For running the packaging commands, you'll need to launch a terminal though. Cell magics might work as well.
- 18. What is the problem with the requirements.txt that we learned to create on the previous days?
- requirements.txt tells pip which packages should be installed into an environment. Here we're creating a completely new package. If you create your own pip package, you need to tell pip which dependencies to install along with it when someone installs your package (e.g. from their requirements.txt). pyproject.toml has its own syntax for listing dependencies.
:::info
### Exercise 1 until 11:40
https://aaltoscicomp.github.io/python-for-scicomp/packaging/#exercises-1
For more details about documentation have a look here (also styling for READMEs):
- https://coderefinery.github.io/documentation/
For licensing and more information can be found here:
- https://coderefinery.github.io/social-coding/
- https://coderefinery.github.io/social-coding/software-licensing/
Automated testing: (since we mentioned it but will not cover it)
- https://coderefinery.github.io/testing/
:::
- 19. What kind of information should "README.md" and "LICENSE" have? Could you please give some examples?
- The README usually gives brief information about what the package is and how it's used. For example it appears here:
- https://pypi.org/project/numpy/
- https://github.com/numpy/numpy (same readme above)
- There is also a guide by PyPI about the formatting: https://packaging.python.org/en/latest/guides/making-a-pypi-friendly-readme/
- Most package descriptions you find on pypi.org are README files.
- The license you choose is up to you, but you should use one of the existing and well known license texts
- https://coderefinery.github.io/social-coding/software-licensing/
- also have a look at the links in the Exercise info box.
- Often the contents of the LICENSE file are provided by the hosting platform, e.g. [GitHub](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository), when you create the repository. Which license you want to use is up to you, but the bulk text is usually just copied from the license provider's page.
- GitHub recommendations on what a Readme should have https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes
**Reminder: if you're interested in trying the example yourself remember to use the test PyPI, not the official PyPI**
:::info
### Exercises 1 until xx:23, also continue to exercise 2 if you have time.
https://aaltoscicomp.github.io/python-for-scicomp/web-apis/#exercises-1
The solution is below the text of the exercises.
:::
- 20. Are there any best practices or general guidelines for working with Web APIs in Python?
- Coding-wise, the `requests` library is good
- when it comes to ethics/legal one should consider if they have the rights to download and reuse remote data for research purposes (often researchers have this right)
- 21. For copy-pasting into my own IDE it would be a bit easier if the `display` command were not there. Is it from some library? I used `print` instead.
- Ah, it's IPython's kernel. Yeah, we should remove those and have it just be printing by being the last item in the cell.
- 22. If you want save larger amounts of data, I guess the proper way of doing that would be to save as JSON to e.g. SQL server or MySQL? Any comments on this kind of integration?
- [DuckDB](https://duckdb.org/) is a good SQL alternative for data science tasks. It allows you to use SQL syntax, but it runs locally from Python without a server and stores data in an efficient format.
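- A minimal sketch of the "JSON into a local SQL database" idea. Note that it uses `sqlite3` from the standard library rather than DuckDB, only so that it runs without extra installs; the records and the table name `responses` are made up.

```python
import json
import sqlite3

# Made-up records standing in for JSON returned by a web API
records = [
    {"id": 1, "name": "alpha", "tags": ["a", "b"]},
    {"id": 2, "name": "beta", "tags": []},
]

# An in-memory database; pass a file name instead to keep the data on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (id INTEGER PRIMARY KEY, payload TEXT)")

# Store each record as a JSON string next to its id
conn.executemany(
    "INSERT INTO responses (id, payload) VALUES (?, ?)",
    [(record["id"], json.dumps(record)) for record in records],
)
conn.commit()

# Read the rows back and decode the JSON again
rows = conn.execute("SELECT payload FROM responses ORDER BY id").fetchall()
restored = [json.loads(payload) for (payload,) in rows]
conn.close()
print(restored)
```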
- 23.
## Panel discussion
Ask us anything about Python, Scientific Computing, or related things and we can answer
- 24. Would you use Jupyter for training an image analysis algorithm? Or is that a bad idea
- Jupyter is good for interactive workflow (e.g. prototyping), but when you want to scale up it is better to switch to a non-interactive mode (which is also more reproducible since jupyter has always the risk of manually running cells in non-sequential order)
- 25. I am reading and studying what I learned during last a few days. Am I allowed to contact you and ask questions in the case I could not find a determined answer to them?
- If at Aalto University: yes! come to our garage session or ask in SciComp chat: https://scicomp.aalto.fi/help/
- If not, I'd recommend finding similar people at your institution to work with
- Maybe talk among your colleagues first and get an internal community going.
- Thanks!
- 26. This is an amazing lesson on Python for Scientific Computing, but what is its continuation? I hope you will have an introduction to machine learning and an introduction to deep learning?
- Hm, we're sort of intending to prepare for those. There are many other courses online that go into that part, so we haven't seen a need to repeat.
- If there should be one in this format, we'd need help to implement it (and determine the need): please get in touch.
- 27. What benefits (and disadvantages) is there to using Python for scientific calculations instead of e.g. Matlab
- Matlab is good for scientific programming, but it is a commercial product and thus running Matlab code is usually limited to people who have Matlab access. But it is a good program for problems that it supports.
- Matlab IDE contains good support and help, BUT for Numpy there are more examples available in Stackoverflow, etc.
- Consider also the "side effect": by learning python for your scientific work, you learn a skill that is very valuable outside academia (Matlab is much less requested than python)
- 28. Parallel programming was a bit fast and I couldn't manage to follow it all, partly because it was not straightforward to set up and/or didn't work in my notebook. For example, I needed a bit more explanation of the different approaches with case studies. Maybe good to have a dedicated workshop next time.
- The main point is that many libraries are already parallelized efficiently, so see if you can use them.
- 29. What is the best way to export plots if you need them as vector graphics?
- Matplotlib can export vector graphics: `savefig` writes SVG or PDF when the file name ends in `.svg` or `.pdf`.
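- A small sketch of a vector export, assuming Matplotlib is installed. The data, labels and file names are made up; the non-interactive `Agg` backend is selected only so that the example also runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Made-up data for the example
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# The file extension selects the output format
out_dir = Path(tempfile.mkdtemp())
svg_path = out_dir / "example_plot.svg"
pdf_path = out_dir / "example_plot.pdf"
fig.savefig(svg_path)
fig.savefig(pdf_path)
plt.close(fig)
print(svg_path, pdf_path)
```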
- 30. How does it go with certificate? If we do what was asked for the certificate by next week, would you please give us the certificate quickly afterwards.
- We said something like: "Quick track: results by 15.nov, certificate by end of november."
- Thanks.
- 31. Do you have experience with the polars library for data analysis? If yes, how does it compare to pandas?
- 32. Are you planning to run any python course on AI and ML?
- See #26
- 33. Example situation: During my master's I attended a course on Data Intensive Programming intended to teach us functional programming with Scala Spark. There was a debate between students because some students preferred to continue using PySpark since Python was more widely used. Now my question is, do people use Python because of how well documented and widely used it is? I know R, and plotting with R seems easier to me, but when I collaborate, most of the time people prefer Python because R is less documented, even though it is quite powerful.
- Yes it is because it is very popular. Python also has wider usage beyond data science, so it is a skill that you can reuse if you change career.
- 34. How to create a web application by Python would be nice as a new course (to transform a code to a practical demo on the web)
- There are many different frameworks and plenty of tutorials and comparisons. I'd recommend finding some of these existing tutorials.
- 35. What packages and libraries do you recommend for doing biological data analysis using Python, e.g. genome sequencing data?
- Not an expert, but I think [bioconda](https://bioconda.github.io/) is a big collection of packages related to these subjects.
- Good rule of thumb: if the git (GitHub, GitLab, etc.) page of the library has not been updated for a few years, maybe find a different one. Python is an evolving language, and an unmaintained library brings more problems than solutions.
- Thanks for this.
- 36. Is Python a good choice for API(JSON)-SQL integration or are there better alternatives?
- It is a good tool for that, yes. For SQL integration, you can use [SQLAlchemy](https://www.sqlalchemy.org/) in conjunction with Pandas to do SQL queries to databases and get the results as a DataFrame. See for example [read_sql](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)-function in Pandas.
- 37. Is a container different from a virtual environment? If so, what is recommended for which situation?
- You could say the container is like the whole operating system packaged, not just the Python parts. They both have their uses.
- Ok, thanks! :)
- Would be good also to know if it is the same workflow to archive (Docker) container as environment for the integration with Binder, GitHub, Zenodo
- 38. Any advice what areas to focus on for me as a junior IoT developer aiming in the data areas? Programming language?
- In general, focus on automation, pipelines, remove the human from the pipeline from data collection down to final results.
- 39. How about Julia?
- I like julia :)
- Luca Ferranti gave a very quick intro to it yesterday (https://hackmd.io/@coderefinery/python2023archive2)
- Yes, and I will join their get-together in Helsinki next month.
- 40. Any opinion about Bokeh or other interactive data visualization libraries? Or any alternatives to Matplotlib?
- Matplotlib is pretty basic, but it is the base of many other things, which is why we teach it. Definitely worth exploring other things. I don't have much knowledge of Bokeh.
- I like Bokeh's interactivity. It is good to think also about the learning curve for many functionalities when doing more complicated things.
- There is also [Datashader](https://datashader.org/index.html) which is interesting as well.
- [Plotly](https://plotly.com/python/) is a good library as well.
- 41. How can you get one of those conversations with you about your specific project? Asking for a friend
- https://scicomp.aalto.fi/help/ (Aalto University)
- 42. Do you think Python will ever become "outdated"/replaced by a different language? Heard that some people say R will not be used in the near future anymore.
- Possible. There are other languages that are popular in several areas. The main reason I think Python will not be replaced any time soon is that it has reached a kind of "critical mass", at least in science, has acknowledged its problems (in terms of speed), and uses interfaces to other languages to address them.
- As an R user (many statistical libraries are basically only available in R), I will make sure I can run R until the end of my days :) (Like)
- as another R user, I've heard that (R becoming outdated) from programmers already many years ago. Scientists beg to differ :)
- We _almost_ made it through the course without Python vs. R...
- 43. How important is learning about algorithms? Is learning a new language more important, or algorithms?
- Very important to know some basic concepts, since it affects how you use other libraries. Depending on what you do, you may not need to go very deep.
- 44. Aside from python, which programming language do you think is best? ;-)
- all and none :smile:
- I would use many other languages in special situations (C, C++, Julia, Rust, R, Javascript, html, css, Godot, C#, ...), but Python is the one that connects them all.
- “One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them.” :)
- 45. A comment, not a question: if you want to learn more Python (or any other language) it is good to have a problem and a deadline... then you should try Advent of Code 2023! https://adventofcode.com/ It is an advent calendar with a programming task to solve every day to open the next "door" of the calendar, until xmas day. If you are from the Nordics, get in touch, since we will run a Zulip (like Slack) where we discuss these together. https://coderefinery.org/ Zulip chat: https://coderefinery.zulipchat.com/
- 46. What are the future prospects of Julia?
- I think it is growing very rapidly and it already has good products built on top of it. It is a definite contender especially against the "extend Python with C"-paradigm.
- 47. What is your professional opinion about this: From my studies I got to learn Python, R, Scala and C++. Do you think it's worth it to train yourself in your free time with other languages? Or get more expertise with the languages/tools you already know about?
- I would say that if you feel like another language would serve you to do something you cannot do in your current languages learning them might be a good idea. For example, if you want to work in web development, learning e.g. TypeScript might be a good idea. However, if your current skillset works for you it might be better to learn deeper information about the languages you already know.
- 48. I am just starting my PhD in Health Science and will have to do statistics obviously, but likely also some kind of machine learning. Should I teach myself R and Python, or would Python be sufficient for both purposes?
- Python is probably sufficient BUT: having knowledge of R is definitely a good idea, because while there is plenty of analysis software in Python, there are also often tools only available in R. So if you want to be able to use the latest tools/methods, you will likely have to use multiple languages, or integrate one language into the other.
- Mix and merge. Use a bit of both. Your goal should be to reuse existing tools as much as possible, so if that tool is only in R, then you will need R
- Just join and follow R/Pharma; then you will get what you need, or things related to your subject (health science).
- 49. Maybe good to have a workshop for Python with databases (SQL etc.)! If you know other nice tutorials about it, that would also be appreciated +1
- If we get enough requests, we can do it. You can request future courses via this form: https://link.webropol.com/s/scipod
- 50. What are good Rust use cases in the context of Scientific Computing?
- At some point some existing libraries will most likely incorporate Rust more, but the main benefits of Rust, like increased security, are usually not at the forefront of scientific computing. Someone linked [Polars](https://www.pola.rs/) before, which is an interesting take on a Pandas-like DataFrame library written in Rust. In the future, things like this might be more prevalent.
- 51. Object Oriented Programming for Python would be nice to learn
- Regarding question #26, why it would be good if you would host a machine learning/deep learning course: You are excellent at explaining and making topics really approachable. :)
- xx. Thank you to all! You planted seeds for the future. They will grow and wish you the best and health every minute they remember November 7-10 2023.
- Thank you!
- Fantastic course. Thank you! +2
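
As a follow-up to question 36 above (API/JSON to SQL integration), here is a minimal sketch using only the standard library (`json` and `sqlite3`) rather than the SQLAlchemy and Pandas route mentioned in the answer. The payload, the table name `readings`, and the helper names `load_json_into_sql` and `cities_below` are made up for illustration; a real payload would come from an HTTP response.

```python
import json
import sqlite3

# Hypothetical API payload (an assumption for this sketch); in practice this
# string would be the body of an HTTP response.
API_RESPONSE = (
    '[{"id": 1, "city": "Espoo", "temp_c": 3.5},'
    ' {"id": 2, "city": "Oslo", "temp_c": -1.0}]'
)


def load_json_into_sql(payload, conn):
    """Parse a JSON list of records and insert them into a SQL table."""
    records = json.loads(payload)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER PRIMARY KEY, city TEXT, temp_c REAL)"
    )
    # Named placeholders let sqlite3 read values straight from each dict.
    conn.executemany(
        "INSERT INTO readings (id, city, temp_c) VALUES (:id, :city, :temp_c)",
        records,
    )
    conn.commit()
    return len(records)


def cities_below(conn, threshold):
    """Return the cities whose temperature is below the threshold, sorted."""
    rows = conn.execute(
        "SELECT city FROM readings WHERE temp_c < ? ORDER BY city",
        (threshold,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    print(load_json_into_sql(API_RESPONSE, connection))
    print(cities_below(connection, 0.0))
```

For larger projects, the SQLAlchemy plus `pandas.read_sql` combination from the answer is probably the more convenient choice, since the query results arrive directly as a DataFrame.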
## Feedback, day 4 (and all days)
:::info
### News
- All our material will be available for followup
- You can request future courses via this form: https://link.webropol.com/s/scipod
- Zoom afterparty, link here once stream ends: https://aalto.zoom.us/j/64497040108?pwd=TTVmaE1LR1Vwd3lxLzhHOEIzTS9qZz09
:::
### Today was (multi-answer):
- too fast: o
- just right: ooooo
- too slow:
- too easy:
- right level: o
- too advanced: o
- I would recommend this course to others: ooooooooo
- Exercises were good: oo
- I would recommend today to others: oo
- I wouldn't recommend today:
- all well
### One good thing about today:
- pypackages
- Building your own package doesn't look that hard
- It was fun to test APIs
- ...
### One thing to be improved for next time:
- ...
- ...
### Any other comments:
- Could you share the link to the email list subscription
- CodeRefinery https://coderefinery.org/ (see the subscribe box)
- Aalto Scientific Computing https://scicomp.aalto.fi/training/scicomp-announcements-maillist/
- Please keep the material available "forever". This is very good collection of useful info for getting some knowledge of these subjects. I'm sharing link to the others too in my group.
- Webarchive will save us all https://web.archive.org/web/20230000000000*/https://aaltoscicomp.github.io/python-for-scicomp/ (please remember to donate to webarchive)
- Could there maybe be a downloadable PDF with all the materials from the course? In the past I have screenshotted everything into a document, but that is not very efficient. :D
- I think sphinx makes a pdf out of the whole website, I will check
- Materials are also available here: https://github.com/AaltoSciComp/python-for-scicomp/
- I'll add PDF and EPUB download links; [Sphinx](https://www.sphinx-doc.org/) can make them.
- Sorry, is it possible to get follow-up help here? I have a problem with packaging on PyPI. Everything works until I try to import the installed package. The package is online and I `pip install` it in another virtual environment, but I cannot import it, probably because of a naming-convention problem: the package name includes a hyphen (-).
- I happen to be looking now, so go ahead. No promises for the future... this will be archived soon.
- Ah. So, `-` isn't valid in a Python identifier (including module names), but is valid in a package name. Usually people have `-` in the package name (on PyPI) but `_` in the module names (what you import). Make sure all the Python files have only `_`. By online, is it something I can look at?
- Hello there! I modified my package name to use only `_` (underscore), but the `twine upload -r testpypi dist/*` command automatically generates a name with `-` (hyphen).
- I got message in terminal:
Uploading bad_ugly_fantastic-0.1.0-py3-none-any.whl
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 kB • 00:00 • ?
Uploading bad_ugly_fantastic-0.1.0.tar.gz
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.9/4.9 kB • 00:00 • ?
View at:
https://test.pypi.org/project/[hidden]/0.1.0/
- Here it seems the name is with a hyphen.
- yeah, it's translating, that isn't a problem.
- In the package, it's called `calculator`. Did you know the PyPI name doesn't have to be the same as the module name? So, `_` and `-` shouldn't matter at all. How do you try to import it? `import calculator`?
- In PyPI website: pip install -i https://test.pypi.org/simple/ [removed]==0.1.0
- It looks like it's called `[removed]`, not `[removed]`. Despite the PyPI name, it should be `import calculator`
- yeah?? with import calculator, I got ImportError: cannot import name 'integrate' from 'scipy' (unknown location)
- OK, we saw this problem in the course, too. It tried to import SciPy from Test-PyPI, and it's not there and gives some other random thing. Try `pip install --upgrade scipy` (<- tell it to upgrade).
- Requirement already satisfied: scipy in. After that, I import calculator, and get the same error: ImportError: cannot import name 'integrate' from 'scipy' (unknown location)
- It worked!!
- `import calculator as cal`, `a = cal.add(1, 4)`, `print(a)`
- So, what should I do in terms of naming the package? It is very confusing! (why I cannot use "[name]" which I named in toml file or what is named on PyPI website?). What would be the best practice? (maybe best to avoid hyphen etc, but apart from that?)
- Hm. Well, this was a weird case since we were all making something called "calculator". Usually, I would look at PyPI and find an available name before starting, and call the Python module that. If needed, I would use `_` in the module name, and then personally I use `-` in the PyPI name. Otherwise, try to make them the same.
- OK. I like the tutorial, because I didn't know how to do packaging. I will play around and try to find the best workflow for this. I think this is a repeating process if I continue developing, so it is important to establish a good practice. Thanks! You can summarise or remove unnecessary notes for this question. It's a bit messy now :-)
- it's OK, we'll clean up when archiving. Enjoy your journey in Python! +1 bye
- My best recommendation is: look at how other projects you use are set up and learn from that. You don't need to do exactly what they do, but it's a good starting point.
- Maybe the folder name of the package should be exactly the same as the name in the toml file? Indeed, when I make sure those names are the same, I don't have a problem at all (even scipy is installed when doing pip install for the package, and the import is just fine too). The only drawback is that I got a red error about other libraries already installed: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed." So, best to be careful.
- It's OK if they aren't, but less confusing if they are the same. Most common packages do have the same name, but some don't (PyYAML vs `import yaml`, scikit-learn vs `import sklearn`). But yes, better if they are the same!
- In my IDE: [name] I also realised that if I do pip list, I see many packages with hyphenated names (as well as my package: [name]), and I also cannot import them. I tried to use a workaround like this:
import importlib
mylib_module = importlib.import_module
- What does `import calculator` say? `[name]` is *not* the name that can be imported. `name = "[name]"` in pyproject.toml is only the PyPI name, not the Python module name (which is in the source).
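
To make the distinction in the thread above concrete, here is a small sketch using only the standard library. It shows that PyPI treats `-`, `_`, and `.` in project names as equivalent (the normalization rule from PEP 503), while an importable module name must be a valid Python identifier. The function names and the example name `my-example-package` are hypothetical and not taken from the participant's project.

```python
import keyword
import re


def normalize_distribution_name(name):
    """Normalize a PyPI project name following the PEP 503 rule.

    Runs of '-', '_' and '.' collapse to a single '-', and the result is
    lowercased, so 'My_Example.Package' and 'my-example-package' are the
    same project as far as PyPI and pip are concerned.
    """
    return re.sub(r"[-_.]+", "-", name).lower()


def is_importable_name(name):
    """Return True if every dotted part could appear in an import statement."""
    return all(
        part.isidentifier() and not keyword.iskeyword(part)
        for part in name.split(".")
    )


if __name__ == "__main__":
    # Hypothetical names, chosen only for illustration.
    for candidate in ["my-example-package", "my_example_package", "calculator"]:
        print(
            candidate,
            "-> PyPI name:", normalize_distribution_name(candidate),
            "| importable:", is_importable_name(candidate),
        )
```

This is why `pip list` can show a hyphenated name for a package whose module is imported with underscores, or under a completely different name such as `calculator`.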
## Upcoming January 2024
Linux Shell Scripting course https://scicomp.aalto.fi/training/scip/shell-scripting-2024/
(similar format, focus on BASH scripting language)