# Data analysis workflows in R and Python - OLD
## Day 1
## Icebreaker
## What do you expect to learn from this course?
*Write your answer below, one line per answer. When you are done with editing you can switch to view mode (eye icon on top left of this window).*
- I expect to learn about data analysis workflows :-)
- Perform arithmetic, analysis and visualization with my dataset
- I've used R in the past, so I'm looking to get better
- To be better and more efficient at data analysis and visualization
- How to do data analysis without excel
- I want to learn to do data analysis also with excel :)
- How to do post processing and data visualization in one step
- I would like to develop an analysis pipeline and create a Shiny app for visualization
- I expect to learn best practices of doing data science(e.g. workflow, tools, clean code)
- Statistical studies on long term solar data
- Best practice on data analysis.
- Automating and expediting various data manipulation steps
- What is so special about R that everyone talks about it? :)
- I want to learn data analysis for econometrics
- nice! are there special libraries you use in your field?
- YES, appelpy
- Learn to do consistent and efficient data analysis, grasp new problems fast
- I expect to learn how to handle my data more efficiently. +1
- I work in a multi-disciplinary field, and people from different fields have different concepts of 'data analysis' and 'data workflows'. For example, medical doctors use quite different tools than engineers. At this course, I expect to learn what are the default 'data analysis concepts' among Finnish engineers and data analysts.
- Some good practices, pitfalls and where to look information.
- Get flavors of data analysis.
- Become better with complex programs, be more structured.
- I am looking for a smooth way to transition my work from RStudio to Python
### Other questions here
*any question you might have, write here at the bottom (above the break line)*
- Do we really need to know how to use Jupyter? I read it's a requirement and I have no idea
- [name=enrico] Course exercises and materials are saved as Jupyter notebooks. If you are able to start them and use the "play" button, that might be sufficient for today. Here's a good resource that can get you started: https://coderefinery.github.io/jupyter/
- Will you teach R simultaneously with Python in each session or will there be separate sessions for each?
- [name=enrico] The idea is that you follow the language that is most familiar to yourself / better suited for your needs.
- Is there any advantage of using R over Python or vice versa in doing data science?
- [name=enrico] My opinion here, from what I saw, this is mostly dependent on packages/libraries that are already available.
- Could you speculate about the future of scientific computing? Is Python going to remain the main tool? Many people have been advertising Julia recently.
- [name=enrico] My opinion here: the availability of existing libraries/packages/tutorials is often more important than performance. So my bet is on python :)
- Is there a possibility to get guidance for program development? I am talking about quite a simple program which would basically just automate a couple of sets of Excel functions.
- If you are at Aalto, join one of our Sci Comp garages, we help with these kinds of issues. If the task requires some more work, we have a new RSE program starting: https://scicomp.aalto.fi/rse/
- Am I able to join even if I am from Uni Helsinki?
- RSE program is only for Aalto users for now, let's chat later in private as there are good people that can help at UH ([name=enrico]).
- Do I get credits for this course and how?
- Yes if you are a student at Aalto, see info here https://scicomp.aalto.fi/training/scip/data-analysis/
- Otherwise we can make a certificate; it's up to you to see if your university accepts it...
- Where can we find the exercises?
- In the repository you cloned for installation. See at the bottom of https://aaltoscicomp.github.io/data-analysis-workflows-course/installation/#testing-your-installation
- Will we commit results or just keep our work for ourselves? Shall we have our own branches maybe?
- The repository was just for sharing and keeping track of future changes. We don't expect people to know `git` (beyond the few commands needed for installation). If you want to suggest changes for future versions of this course, you can open a GitHub issue / pull request from your own fork.
- Thanks! I just noticed that Simo/Richard had more files than me on the python_exercises folder and I was wondering.
- Is there a way to integrate R and Python in the same analysis workflow?
- That is possible, but can be complicated. On Wednesday we'll talk about data formats that can be used to transfer data between the languages.
- How is course assessment carried out for Aalto students to get that 1 credit?
- We'll provide a special project for that later on, and once you have completed it, we'll mark the credit for you.
## Questions on Chapter 1: Understanding data analysis workflows
*questions related to Chapter 1 here*
- Do you have more information about how to improve in modeling? Faster fitting algorithms in relation to Python or R, for example?
- I'm not sure if we'll get into this, but
- Will we talk about how to make decisions on data? E.g. how to analyse data in a more automatic way? Something that goes beyond basic if/switch statements.
- This is an interesting question. We might check on that later on.
- During installation something went wrong for me. It's been trying to solve the environment for ages already.
```
% conda env create environment.yml
/Users/n/opt/anaconda3/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Collecting package metadata (repodata.json): done
Solving environment:
```
- OS? Mac
- Aalto? no, personal laptop
- Anaconda from homebrew? That might be the problem, it's not via Aalto.
- Sorry, I was messing with your typing :) Yes, I would try the official Anaconda installer. ([name=enrico])
- It's the official installer, but not from Aalto.
- ok, problem solved! just needed more patience:)
- Official installer is not with homebrew https://docs.anaconda.com/anaconda/install/mac-os/ (as far as I know), but glad you managed!
- Where will the lecture recordings be available?
- On twitch https://www.twitch.tv/coderefinery for 14 days, then on youtube.
- Twitch has been muted I guess.
- Should be fine now... I guess?
- Yes :)
- One question related to Jupyter: would it be possible to run R there?
- Yes, you can install an R kernel there and it works just fine
- R kernel is installed in the course environment, just start the R exercise notebook and Jupyter will use R to run the notebooks
- Sorry, I don't see the exercises in my Jupyter
- You may need to navigate to the right directory, or run the git clone. The installation instructions should cover this
- I must have missed that part. Can you point me to the right instructions? The link, I mean
- No probs! Here the link: https://aaltoscicomp.github.io/data-analysis-workflows-course/installation/ Are you a linux/mac user?
- Mac, thanks for the link
- How do you get the R symbol beside the Python one, as Simo has it? I only see Python in my Jupyter
- Did you create the environment from the course environment.yml?
- I thought so. I'll check again
- Did you activate your environment or do you have the jupyter launched from the base environment? R is only in the `dataanalysis`-environment
- DATA folder is empty for me
- Please run the `download_datasets.ipynb`-notebook from the repository folder
- Thank you, it works
- I can help with installation if anyone needs it. We can use my *.yml file for specific versions of packages on Linux/Mac.
## Exercise 1:
Exercises last until xx:10
*Python exercises here: https://github.com/AaltoSciComp/data-analysis-workflows-course/tree/master/python_exercises
R exercises here: https://github.com/AaltoSciComp/data-analysis-workflows-course/tree/master/r_exercises*
- Cell 8 seems to be plotting the test_data classification twice. What would I need to do to plot the train_data classification as well?
- How can we see the breakout rooms on Zoom?
- This is a new feature from Zoom v5.3.0. It seems that not everyone uses this version, so Simo is doing the exercises live for everyone. If you need other one-to-one help, ping me somewhere :) ([name=enrico]).
- I downloaded the notebook to my own Mac. I have the notebook, but the data or file is missing (FileNotFoundError)
- Are you able to open the download notebook and run it? (I guess yes since you get the error). Are you able to see the URLs in that notebook? (e.g. http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data)
- It seems that I don't have data or reading data is not working ... '../data/iris.data'
- Have you run the `download_datasets.ipynb` notebook?
- actually not ... I'll do it -- now it works. Thanks!
- Seems Twitch is down?
- Looks like it.
- It will be back in a minute
- It's back
- Can we say that the first cell contains code that is general for datasets and not specific to this one?
- From another participant: I thought so too. It was more for loading python modules.
## Questions on Chapter 1: Understanding data analysis workflows - modularity
*questions related to Chapter 1 here below*
- Why were the accuracy scores different in R and Python? (this was from the exercise 1)
- The models are using different hyperparameters. Both models use CART-algorithm to determine the trees, but the tree depths and other hyperparameters might be different.
- Do you use special software, tools for outlining this analysis workflow? Or you plan it on paper?
- E.g. tools for DAG (directed acyclic graph) pipelines: https://snakemake.readthedocs.io/en/stable/
- TF: https://www.tensorflow.org/tensorboard/graphs
- Scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- When is it a method, when is it a function?
- A method is usually a function of a class; a function is just a standalone function. In my talk I used "module" to refer to functions or classes that do some part of the pipeline. [name=simo]
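- A minimal sketch of the difference (hypothetical `normalize` example, not from the course material):

```python
# A function stands alone; a method is a function defined inside a
# class and receives the instance as its first argument (self).

def normalize(values):
    """Standalone function: scale a list of numbers to [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

class Dataset:
    def __init__(self, values):
        self.values = values

    def normalize(self):
        """Method: the same logic, operating on the instance's data."""
        low, high = min(self.values), max(self.values)
        return [(v - low) / (high - low) for v in self.values]

print(normalize([1, 2, 3]))            # function call
print(Dataset([1, 2, 3]).normalize())  # method call
```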
- In scientific computing we often do one thing only in one project - how do I know when to go for a function over just writing a script (ps. Hi Enrico! ([name=enrico] _o/) )
- Funny but also serious answer is at https://xkcd.com/1205/, if you plan to reuse it more than once AND if the time to make it reusable is worth it, then make it into a function. In general, having it as a function pays back even if you don't want to re-use it because it will make your code modular and you can debug a single "box" at a time. ([name=enrico]).
- What does interface mean?
- An interface is something that allows two different parts of the code to communicate. Saying "this function works with string input" means that you have defined a function that has an interface with all functions that produce string output. Creating an interface is basically writing out the rules for what one function can produce as its output and what another function can take as its input.
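- A small sketch of that idea (hypothetical functions): the interface here is simply "a list of floats", so the two functions can be chained freely.

```python
def load_values(text):
    """Produce the interface type: parse a comma-separated string
    into a list of floats."""
    return [float(x) for x in text.split(',')]

def summarize(values):
    """Consume the interface type: take a list of floats, return the mean."""
    return sum(values) / len(values)

# Because the output of one matches the input of the other,
# they compose without either knowing the other's internals.
print(summarize(load_values("1.0,2.0,3.0")))  # 2.0
```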
- How should one organise the code (functions defined in same/different file than script, before/after running the script)?
- excellent question! My answer here : Ideally you want to have code that is reusable into a "package" that you can "install" on each project you are running. Example in python from our previous course: https://aaltoscicomp.github.io/python-for-scicomp/packaging/ More generally, this paper is a must read for organising data/code/projects -> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510
- How to get to a break out room?
- zoom should be v5.3 or >. Not all here have it, so maybe we can try another time.
- What will be the dataset we are going to load?
- The dataset is `../data/wdbc.data`. It is mentioned in the `ch2-X-ex2.ipynb` near the problem definition.
- In what situation would you use classes and class methods rather than writing individual functions?
- This happens a lot in Python when you want to inherit lots of features such as iterating or `__getitem__` (aka `x['something']`) and do not want to write them again and again. This depends on the framework. Working with iteratively generated data is another place where this happens a lot.
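- A minimal sketch of the `__getitem__` case (hypothetical class):

```python
# By defining __getitem__ once, the object supports x['something']
# lookups and can be passed to any code expecting dict-like access.

class Measurements:
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

m = Measurements({'temperature': 21.5, 'humidity': 0.4})
print(m['temperature'])  # 21.5
```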
## Exercise 2:
Exercises until xx:40,
**X_exercises/ch1-X-ex2.ipynb.** *(replace X with 'python' or 'r')*,
*Python exercises here: https://github.com/AaltoSciComp/data-analysis-workflows-course/tree/master/python_exercises
R exercises here: https://github.com/AaltoSciComp/data-analysis-workflows-course/tree/master/r_exercises*
-> https://presemo.aalto.fi/dataanalysis
*questions here below:*
- Why this --> Error: '../data/wdbc.data' does not exist in current working directory
- The code assumes you run it from the `X_exercises`-folder. If you do not run it from there please change the relative path. So, if you see `data`-folder there, change it to `data/wdbc.data` instead.
- No way I'll finish this... today
- Yeah, maybe the exercise is a bit big. I'll keep that in mind for the next sessions.
- I am so used to R Studio that this Jupyter is so alien to me
- It is a bit different. At some point I'll convert the notebooks to RMarkdown, but haven't got the time yet.
- Is it possible to switch this exercise to homework mode? And then check the lesson after?
- Let's see what the presemo tells us.
- Why is shuffling important or needed, if at all?
- I don't have any data folder in the 'X_exercises'-folder. What to do?
- Did you run the `download_datasets.ipynb` at the root folder? It downloads the dataset to the `data`-folder.
- I am not familiar with Pandas. I got the idea how to create the import function, but I had to cheat for the Pandas code by looking into the solution.
- A good idea is to just copy-paste from the previous cells.
- I could not make the connection to the top part of the Jupyter notebook file for the Pandas part. Yes, I am on Python.
- You'll need to run the previous cells so that the imports are done. I recommend running the previous pipeline first.
- Is there a performance penalty if I use dataframe.drop or df.rename with the argument inplace=True?
- Not really. Column dropping operations are very quick. Of course `inplace=True` makes writing code a bit easier. [name=simo]
- What is train_split in the Problem 3?
- The original data is split into training data and test data.
- In this case you'll want to split the new data in the same way as the iris data was split.
- Very nicely designed set of exercises, thank you Simo & co!
- Thanks! [name=simo]
- Why does the accuracy change from problem 2 to problem 3?
- That is part of the exercise. Can you guess why the other model performs better? [name=simo]
- The confusion matrix is also smaller than the previous problem, so.
- What is confusion matrix?
- https://en.wikipedia.org/wiki/Confusion_matrix
- Why is shuffling important? or needed?
- Shuffling is important to make certain that the training and test samples are representative of the whole dataset and that the model does not train based on row indexes etc.
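- A small illustration with scikit-learn (an assumption: the course's `split_dataset()` wraps something like `train_test_split`, which shuffles by default):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)   # labels sorted by class

# Without shuffling, the test set is just the last rows:
_, _, _, y_test = train_test_split(X, y, test_size=0.4, shuffle=False)
print(y_test)  # [1 1 1 1] -- class 0 never appears in the test set

# Shuffled (the default), the split mixes the classes:
_, _, _, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(y_test)
```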
- Do you have some more background information on this kind of data pipeline? The trees etc.
- If I would like to convert my data format from dataframes to (numpy) arrays, is it recommended to do it at the data preparation and loading phase of the pipeline?
- Why is shuffling important? Doesn't the split_dataset() method take the test and training rows randomly?
- I am not sure about the question regarding linearity between variables X4 and X5 ("Can you explain why our plot looks so linear?"). Isn't it just the case that the data happened to be that way, so there might be some ~linear dependence among the columns?
- Random forest does increase accuracy when applied to the wdbc data, but in the case of iris there seems to be no difference compared to a decision tree in terms of accuracy. Did I do something wrong, or is that an intricate property of the data that just happened to be that way, such that random forest doesn't perform better than a decision tree?
## Feedback for the day
*please write something good about today and something that could be improved. Feel free to add any suggestion to make things better for next days as well as future version of this course*
- General comments? (anything goes) I mean, shouldn't there be another bullet for general comments?
- maybe any bullet is good, feel free to write as you please :)
- Personally, I underestimated the installation part. I installed Conda last Friday thinking all was ready; I obviously didn't read the instructions properly. One thing: could you estimate the time needed for each step of the installation? It makes a difference if it's minutes or hours.
- Good: materials were clear and the lecture was easy to follow, the pace was good for me
- To develop: lecturer could mute the mic when not speaking (e.g. during working on exercises)
- Simo explains clearly, Well done!
- I understood your talk about the pipeline, modularity, and interfaces. But I have a problem with the content of the data pipeline. The data pipelines I usually work on look different; I am not familiar with things like random_forest_classifier and shuffling, so I am lost at this point. I am seeing this kind of stuff for the first time in my life. I am also not familiar with Pandas, and I think the Iris problem had a different structure.
- Exercises were well designed for the timeframe
- Thanks! Very important and useful information. For me twitch.tv was not such a good environment: totally new for me and it didn't work all the time => I missed a lot of information.
- I think so too, Twitch is not the same as participating via Zoom.
- Exercises are really needed and good but I needed more time than there was reserved for them.
- Good sides: The pace of the lecture was quite fine, I managed to easily follow everything (it could be even a bit faster). The time spent on explaining concepts vs the time spent on doing exercises was nicely divided. I like how you prepared the course material. Bad sides: Not much so far. It took me a bit of time to do exercises since I am not very familiar with pandas commands.
- Good: very well justified why to learn pipelines. To improve: Well, nothing much you can do to Twich being not so co-operative. Still good stuff!
- Interesting subject matter and I think I understood most what Simo said. The technical difficulties in the beginning were unfortunate but very common in the Covid-era of remote learning so no worries about that. I am such a novice that I was just able to complete Problem 1.
- Good pace, I could follow the talk.
- I think this was very informative but much more challenging than for example that Python course from couple of weeks ago. As I'm not that experienced in coding some minimal working examples would be nice.
- Anaconda feels much better than CoLab, much more responsive!
- Really useful content, well-explained theory on why functions are useful and why to build pipelines instead of separate lines of code. Thank you!
- I had not used Conda or Jupyter earlier and getting started with the exercises took time. Now I'm getting used to them, so it probably gets better.
- I loved the Jupyter notebooks used in this lecture. Overall this is an awesome course on using functions, and the pipelines that are built are very easy to follow. Good stuff!
- I personally would try to prepare and work on the notebooks before the lecture, so that I can take my time to figure things out rather than be rushing through it in the allocated 20 or 30 minutes. Also this setup might speed up the lecture and reduce waiting times.
## Day 2
## Download up-to-date lesson materials using these instructions
- Download [this notebook](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/download_lessons.ipynb) (right click -> save link as) and save it to the course directory (`data-analysis-workflows-course`)
- Run the notebook
- It will download up-to-date lesson materials
- If you get an error that says it's not being run from the course folder, please give `path_to_course_folder=../relative/path/to/my/course/folder` as an additional argument: `download_lessons('lesson 2', path_to_course_folder='../relative/path/to/my/course/folder')`
- If for some reason it does not work, you can go to the [course repository](https://github.com/AaltoSciComp/data-analysis-workflows-course/), go to `X_exercises`, click the notebook you're interested in, click `raw` and download the notebook to your `X_exercises`-folder
## Icebreaker day 2
Which type of data do you work with? How long does it take you to prepare the data? (= make it ready to be analysed)
- Brain images (MRI). It can take about 1 day per subject. The usual number of subjects is 30... so 30 days without parallelisation. :)
- Chemical data (weight fractions of sample material); once the analysis is complete the data is in quite readable format already, but it needs some tidying. I haven't done it that much yet, so I am unable to give any estimates.
- Solar magnetogram data (preparation takes around 10 days, without parallelisation)
- Solar radio data, preparation takes around 10 min or so
- Neuronal data - patch-clamp or extracellular recordings from individual neurons or small neuronal populations (small means less than 100,000 neurons). For this type of data the first step in data preparation is done by experimentalists who use standardized commercial software. It might take an extra week to additionally organize the data for specific data modeling tasks.
- Chemical data. I use excel so it is really quick, but it drives me crazy to modify pictures by hand.
- Genetic Data, genes expressions data, time series for genes expression
- Atmospheric transport model output, netCDF files, takes a lot of time to merge and process all the output files.
- Thermal Desorption Spectroscopy and Casting data (the last one takes a lot of work to prepare because we have to divide by cycles)
- Meteorology data from in-situ measurements. For example: Chernobyl accident chemical component data (txt files); ice crystal images (MAT files).
- Process data related to the manufacturing industry. It has been a month, but I'm still stuck with the data preparation process.
- Long-term ecological data to analyse the effects of climate change in species
- Spatial mobility and environmental data
## Questions
*Write here for any sort of questions. Questions always at the bottom above the last line*
- In downloading the new data, which should be the end directory? data-...? or X_exercises?
- Where do I find the poll?
- Here: https://presemo.aalto.fi/dataanalysis
- Thanks!
- I have a couple of questions regarding last time lectures:
- 1) Why is shuffling important? Doesn't the split_dataset() method take the test and training rows randomly?
- It does, but you don't know that unless you look at the documentation. In this case it was not necessary to shuffle.
- 2) I am not sure about the question regarding linearity between variables X4 and X5 ("Can you explain why our plot looks so linear?"). Isn't it just the case that the data happened to be that way, so there might be some ~linear dependence among the columns? (OK, Simo just answered this, i.e. one variable is the perimeter and the other is the area, so they are obviously dependent for roughly round objects)
- 3) Random forest does increase accuracy when applied to the wdbc data, but in the case of iris there seems to be no difference compared to a decision tree in terms of accuracy. Did I do something wrong, or is that an intricate property of the data, such that random forest doesn't perform better than a decision tree?
- Your answer is completely correct. Using a more advanced model on a problem with insufficient data does not necessarily improve the accuracy of the model.
- Is writing your code in functions considered to be a best practice? Or does it depend on the problem you are trying to solve? Or the length of code?
- Usually people recommend functions. But the end goal is clarity: how can you make your code most understandable? Usually functions help, but you'll discover this over time.
- If we want to use (numpy) arrays instead of data frames, should the conversion to arrays be made in the data loading and preparation stage?
- Probably so, but each column in pandas is also a numpy array, so the difference is not that big.
- Is there any link to github where I can download the lessons? Because by downloading the lessons from the given link, I get "Failure at downloading!" and I have downloaded them to the directory data-analysis-workflows-course
- https://github.com/AaltoSciComp/data-analysis-workflows-course
- I ([name=enrico]) just tested on mac/linux with
- `git clone https://github.com/AaltoSciComp/data-analysis-workflows-course`
- `cd data-analysis-workflows-course`
- `conda activate dataanalysis`
- `jupyter-lab`
- *open and run the download notebooks*
- *note that I already created the conda environment*
- Why did I get a slightly different confusion matrix than the one in the solution? Did I make a mistake or is there some natural reason?
- You need to set the random seed before running the classifier. If you don't, the random seed is always different. For example Exercise 2, solutions, Problem 4. Try to run the cell "iris_fitted_forest = ran..." multiple times. You see that the output is always a little bit different. If you add at the beginning of this cell: `random_state = RandomState(seed=42)` then it will always produce the same result. 42 is just a number used as a seed for the random number generator. Some people cheat by picking the best seed that confirms their hypothesis. :) ([name=enrico])
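- A sketch of the reproducibility point, using scikit-learn's built-in iris dataset as a stand-in for the exercise data:

```python
from numpy.random import RandomState
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# With a fixed seed, re-running this cell gives identical results:
model = RandomForestClassifier(random_state=RandomState(seed=42))
print(model.fit(X, y).score(X, y))

# Without one, each run may differ slightly:
model = RandomForestClassifier()
print(model.fit(X, y).score(X, y))
```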
- How do you quickly know that X4 and X5 are perimeter and area?
- it's explained in the documentation
- What is the advantage of having X-names then?
- Reading the documentation, it seems X4=texture, X5=perimeter, X6=area...
- Just a naming convention
- Why do you choose X4 and X5 for the plot?
- Why didn't we do any preprocessing such as PCA or normalization? Is it because of the data type?
- That is also my ([name=enrico]) guess. The dataset is quite small.
- How to interpret a 3 x 3 sized confusion matrix?
- In general or in this case? For those who are not familiar check the examples at https://en.wikipedia.org/wiki/Confusion_matrix
- what is the random_state?
- See answer above about getting always different results. It sets the state of the random number generator. This is important so that you can replicate *exactly* the same results when you (or your colleague) re-run the analysis. Then you can change it (random seed affects result, and some people *cheat* by finding the best seed to confirm their hypothesis) ([name=enrico]).
- what are iris data and iris dataset?
- A very common simple data set for examples in ML: https://en.wikipedia.org/wiki/Iris_flower_data_set
- Now the data is a snapshot, where one observation can have one value for each variable. How do I apply this if I have time-series data? Is a tidy format defined for time-series data?
- I have seen tidy time series. I ([name=enrico]) am not a user myself, but a quick google gave https://robjhyndman.com/hyndsight/tsibbles/
- Thanks: I assume tslearn and equivalents are then the way to go. :+1:
- We'll demonstrate a time series in a short while.[name=simo]
- What's the name of the book?
- [R for data science](https://r4ds.had.co.nz/) (It's a free book)
- [Python for Data Analysis](https://wesmckinney.com/pages/book.html) (Good book as well)
- What are your thoughts on using SQL for data wrangling and then Python for analysis?
- It can be useful. For example, if you are running a web application and want it to store data in an SQL database. SQL can also be useful if you have a huge dataset and need to filter it before getting the data you need. Another advantage is simultaneous read-write, so multiple processes/users can write and read. However, my ([name=enrico]) suggestion is that it is easier to get the data you need out of the database in a format that is more preservable and compatible with more tools (e.g. export SQL query results to csv). In 10 years maybe that SQL database is unusable, while a csv file might still be readable. There are other data formats of course, e.g. HDF5.
- I'll talk about SQLs when we get to the different data types [name=simo]
- What if you have really nested or complex data? Should you use nested tidy tables or what?
- Part of the point of tidy data is that you don't make things untidy by things like nesting. So, we recommend you find a way to make it tidy. It'll somehow involve re-structuring the table. It will take some thought but *really* be worth it.
- For example, I work with physics simulators optimization. I have experiments, that contain multiple runs, and each run contains different partitions, setup information etc, but I still need all this information in my analysis. I simply do not see a way of making the data tidy, other than either nesting it or splitting it into many small tables. Splitting the data would make it really hard to manage, so is there really a good general rule for what to do?
- You can make the data tidy by having columns such as `run`, `partition`, `setup_information`, `dataset` etc. Then, you can run e.g. analysis on a group-by-group basis by using groupings (more on this later). Sometimes nesting is needed, if your dataset is a dataframe as well. [name=simo].
- I ([name=enrico]) also agree with you that in some cases it is not so simple to go *tidy*, or it becomes very expensive (re-managing all the data). I like the general rule that if I start a new project I try to be "tidy" from the start.
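- A sketch of the tidy layout suggested above (hypothetical columns and values):

```python
import pandas as pd

runs = pd.DataFrame({
    'experiment': ['exp1', 'exp1', 'exp2', 'exp2'],
    'run':        [1, 2, 1, 2],
    'partition':  ['a', 'b', 'a', 'b'],
    'result':     [0.91, 0.87, 0.78, 0.83],
})

# One analysis per group, without splitting into many small tables:
print(runs.groupby('experiment')['result'].mean())
```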
- Downloading the data is Problem2, should we have it done already?
- Please run the `download_datasets.ipynb`. If for some reason your version of `download_datasets.ipynb` does not download the dataset, please download the up-to-date version from [here](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/download_datasets.ipynb) (right click -> save link as)
- Thanks! It worked. I overwrote the file I already had.
## Lecture notebooks:
Right click -> Save link as -> store into `X_exercises`:
* [ch2-python-lecture.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch2-python-lecture.ipynb)
* [ch2-r-lecture.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch2-r-lecture.ipynb)
## Exercises:
Right click -> Save link as -> store into `X_exercises`:
* [ch2-python-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch2-python-ex1.ipynb)
* [ch2-r-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch2-r-ex1.ipynb)
Exercises end at xx:??
## Exercise solutions:
Right click -> Save link as -> store into `X_exercises`:
* [ch2-python-ex1-solution.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch2-python-ex1-solution.ipynb)
* [ch2-r-ex1-solution.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch2-r-ex1-solution.ipynb)
Exercises end at xx:??
## Continuing
- Sort of tidy, since it's nicely divided into columns for variables and rows for observations. But at the same time it is fragmented over multiple Excel files.
- Now audio OK! Thanks
- Should be, we are currently in a break
- We are back
- Do I need to install some plug-in in Chrome to be able to see the Python/R tabs the way Simo is showing them on the stream? Now I only see some kind of table.
- Hm, you shouldn't, it should be normal JavaScript. Do you block JavaScript by any chance?
- Javascript seems to be "Allowed"...
- I have the same problem: I don't see the nice tabs in Safari, just big tables with raw code. Same problem in Chrome; JavaScript is allowed.
- I just figured it out: I am using the lecture page in git. Local lecture pages are not rendered with Sphinx, they are just rst files. Might that be the reason, i.e. should we open the local lecture file?
- Yeah they are not rendered. The rendering looks better under: https://aaltoscicomp.github.io/data-analysis-workflows-course/
- I re-downloaded the lecture 2 file into my local folder and tried to open it in jupyter-lab, but it does not show the correct format. How did you open it locally, please?
- Did you download the lecture notebook or the source code for the lecture page? Check [here](https://hackmd.io/gPreUVcVQuqYKVtqHHp_RQ#Lecture-notebooks) for instructions.
- I re-downloaded the whole folder from git. I see the correctly rendered format only on the Aalto website. How do you open them locally?
- I noticed that there is a new notebook in the root named `download_lessons.ipynb`. It seems to fetch the rendered lecture pages.
- It is mentioned at the top. If it works for you, please use it to download all course materials [name=simo]
- Can we say that the format is not tidy because at least one column contains non-trivial information (like in case 2-1, where it requires extra work to extract meaningful information)?
- Which Python course was that? For the .loc statements
- [Python for Scientific Computing](https://aaltoscicomp.github.io/python-for-scicomp/pandas/), our previous course [name=simo]
- How come there is different code in the lecture notes (python)?
- `atp_players['name'] = atp_players['last_name'] + ', ' + atp_players['first_name']` (webpage)
- `atp_players['name'] = atp_players.loc[:,'last_name'] + ', ' + atp_players.loc[:,'first_name']` (ipynb file)
- Two different ways of doing the same thing, but maybe Simo can confirm?
- Yup, just a copy-paste error on my part. Both work. The `.loc`-way is a bit more complicated, and it can support more complicated queries/modifications etc. The first one is more self-evident.
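- A short sketch of where `.loc` pays off (hypothetical data; the `hand` column is invented for illustration):

```python
import pandas as pd

atp_players = pd.DataFrame({
    'first_name': ['Roger', 'Rafael'],
    'last_name':  ['Federer', 'Nadal'],
    'hand':       ['R', 'L'],
})

# Both forms from the lecture are equivalent for plain concatenation:
atp_players['name'] = atp_players['last_name'] + ', ' + atp_players['first_name']

# .loc additionally supports conditional selection and assignment:
atp_players.loc[atp_players['hand'] == 'L', 'name'] += ' (left-handed)'
print(atp_players)
```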
- Is it better to .copy() the data file before use, in order not to overwrite the original datafile?
- The `.copy()`-method creates a copy of the DataFrame. It is usually only needed once you choose a subset of your data and want to do location searches for that dataset. You can always re-run the `read_csv`-part if you need to reload the original DataFrame.
- I got this problem as problem 1 in my notebook? Did you change the notebooks recently?
- Yes. Please download the up-to-date version.
- You mentioned that there is some Python course where .loc can be reviewed, among other commands. What is the name of that course?
- Here: https://aaltoscicomp.github.io/python-for-scicomp/pandas/ (we just finished it a few weeks ago; there are videos available)
- When I click those links I receive a bunch of text; I could not download the exercises. Can someone help me out?
- Come chat with me ([name=enrico]) on zoom chat
- Right click -> Save link as -> store into `X_exercises`
## Problem 2
We'll check the solution around 13.55.
- I used
```python
def load_matches(datafile):
    matches = pd.read_csv(datafile)
    matches['Date'] = pd.to_datetime(matches['Date'])
    matches['Season'] = pd.DatetimeIndex(matches['Date']).year
    return matches
```
- This solution will give the wrong Season for winter matches.
- Good point. Adding `.min` to the end would probably work.
- Right, the error is actually even worse e.g. a match played in July 2020 had the 2019 season tag.
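- One possible fix as a sketch (assuming, hypothetically, that a season starts in August, so matches played before August belong to the previous season):

```python
import pandas as pd

def load_matches(datafile):
    matches = pd.read_csv(datafile)
    matches['Date'] = pd.to_datetime(matches['Date'])
    year = matches['Date'].dt.year
    month = matches['Date'].dt.month
    # Keep the calendar year from August onwards; otherwise use year - 1,
    # so that e.g. July 2020 gets the 2019 season tag.
    matches['Season'] = year.where(month >= 8, year - 1)
    return matches
```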
## Problem 3
- It's giving an error: `KeyError: "None of [Index(['HomeGoals', 'AwayGoals'], dtype='object')] are in the [columns]"`
- what is the code you are running?
- Python code
- If you're running the exercise materials you first need to create these columns using str.extract (more info in the exercise).
- I meant: what is the code you are running? I think the issue is that the function "format_matches" removes HomeGoals and AwayGoals from the variable, so if you try to run that cell again, you get an error.
- Since match has a Team and an Opponent, isn't the "Side" column useless?
A vs B, Home for A
B vs A, Home for B
(and that's it, no need for "away", unless I want the double of the columns with A and B switched... Or not? :-D )
- I get it, but does that mean that I need two rows for each match? One Home for one team, and one Away for the other team?
- Yes. This happens often when the original data is not tidy, because information is encoded into the column names, not rows. This will result in a larger memory footprint, but it will ease up analyses considerably, because you can use functions that were designed with tidy data in mind. A minimal sketch of the reshaping is below.
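- A sketch of one match becoming two tidy rows (hypothetical columns and values):

```python
import pandas as pd

matches = pd.DataFrame({
    'HomeTeam': ['A'], 'AwayTeam': ['B'],
    'HomeGoals': [2], 'AwayGoals': [1],
})

home = matches.rename(columns={'HomeTeam': 'Team', 'AwayTeam': 'Opponent',
                               'HomeGoals': 'Goals', 'AwayGoals': 'OpponentGoals'})
home['Side'] = 'Home'

away = matches.rename(columns={'AwayTeam': 'Team', 'HomeTeam': 'Opponent',
                               'AwayGoals': 'Goals', 'HomeGoals': 'OpponentGoals'})
away['Side'] = 'Away'

# Two rows per match: one from each team's point of view.
tidy_matches = pd.concat([home, away], ignore_index=True)
print(tidy_matches)
```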
## Categorical data
- Categorical mapping to just 0 and 1 should do the trick for binary outcome variables. The algorithms used with this kind of categorical data are usually adapted to this binary format. Would you still say that categorization with pandas has advantages? Also, for non-binary models there are multinomial algorithms that can deal with categorical data that has more than one category.
- That kind of mapping is basically what the categorical variable does in the case of two categories, so if you already have such a mapping, there's no benefit to using the categorical format.
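- A small sketch of pandas categoricals beyond the binary case (hypothetical data):

```python
import pandas as pd

results = pd.Series(['win', 'loss', 'draw', 'win'], dtype='category')
print(results.cat.categories)  # the distinct levels
print(results.cat.codes)       # small integer codes stored per row

# Each distinct string is stored once; rows only hold integer codes,
# which saves memory on large datasets with few distinct values.
```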
## Joining datasets together
-----
## Feedback for this day
*please write something good about today and something that could be improved. Feel free to add any suggestion to make things better for next days as well as future version of this course*
- It was fortunate we didn't finish, no need to rush through the material :)
- The course material and teaching resources are great (lecture content, jupyter exercises, interactivity). However, you might need to adapt it to the available time - either speed up (which will make the course quite challenging for beginners), drop some of the material, or divide the material into lecture + homework. For the participants, it's good to know whether the course takes only 12h or 12h + time for homework.
- I couldn't open the lessons' file, even with all the instructions. It's weird because it worked fine on Monday
- Wanna stay in the zoom after the course? ([name=enrico]) I can try to
- Yes, please :+1:
- There are so many different windows to follow at the same time that sometimes I lose the point. Twitch, HackMD, Jupyter, exercise, material, additional information. Challenging... Anyway, thanks!
- This was very useful for me. I have been doing some data wrangling, but it is so useful to see how to do it using functions for several datasets. Thanks!!
- I would like a clearer division between breaks and lectures. There has often been some talking after saying e.g. "Let's have a ten-minute break", and then if you stay to listen, the actual break ends up being about half of the announced length.
- I am not familiar with the Pandas library. I did not know this was a requirement for the course. I am falling behind because I don't have the Pandas knowledge; I mostly use numpy and h5py. Also, you say there is a break, but then it was probably the exercise.
- I have problems with .rst files. On the web I don't see the tabs; in my local folder, .rst files open up in Atom with no formatting.
- RST files are rendered into the webpages https://aaltoscicomp.github.io/data-analysis-workflows-course/chapter-2-data/ It is best to look at the render (RST file is like the "code" behind the webpage)
- Sure, but how would I do this locally, please?
- Try to open them with a browser.
- If you stay around the zoom I can quickly show if you want to see RST files locally. In general, they are the same as in the web, so if you have an internet connection working in the machine you use, you can see them there.
- Thank you. I will. On my Safari and Chrome they were not shown correctly online either, from the github repo.
- I think the problems can be looked at, and what needs to be done discussed, during the lecture, but actually solving them should be homework. Then look at the exercise solutions and discuss them in the next lecture.
- If the python for scientific computing course will be run again, could you also have 1 credit for it?
- This time we didn't, but sure!
- Film a `git pull` video manual! Then share it with us.
- We have the code refinery workshop on youtube https://www.youtube.com/watch?v=r1tF2x5OLNA&t=6154s Next workshop in 2 weeks -> https://coderefinery.github.io/2020-10-20-online/
- It seems that I should be a coder to follow the course :( That's something that I'm not
- Things to follow are all over the place... I feel distracted when there are too many windows.
----
## Homework
- `ch2-X-ex1.ipynb`. Especially problem 3.
---
# Today's session (Monday 12 October) will be skipped (teacher is feeling unwell). See you on Wednesday 14th at 12:00 Helsinki time.
:::info
*(please ask questions at the very bottom, just above this line)*
:::
## Day 3
## Download up-to-date lesson materials using these instructions
- Download [this notebook](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/download_lessons.ipynb) (right click -> save link as) and save it to the course directory (`data-analysis-workflows-course`)
- Run the notebook
- It will download up-to-date lesson materials
- If you get an error that says it's not being run from the course folder, please give `path_to_course_folder=../relative/path/to/my/course/folder` as an additional argument: `download_lessons('lesson 2', path_to_course_folder='../relative/path/to/my/course/folder')`
- If for some reason it does not work, you can use the following links to download the notebooks:
**Lecture notebooks (right click -> save)**:
[ch3-python-lecture.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch3-python-lecture.ipynb)
[ch3-r-lecture.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch3-r-lecture.ipynb)
**Exercise notebooks (right click -> save)**:
[ch3-python-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch3-python-ex1.ipynb)
[ch3-r-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch3-r-ex1.ipynb)
**Exercise solution notebooks (right click -> save)**:
[ch3-python-ex1-solution.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch3-python-ex1-solution.ipynb)
[ch3-r-ex1-solution.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch3-r-ex1-solution.ipynb)
## Icebreaker
If you think of your data analysis workflow (data collection, data preparation, modeling, results visualisation, etc.): what is the biggest bottleneck? Where do you spend most of your time?
- For me data modeling is maybe the biggest blocker, because luckily I have standardised the data preparation part. What I find annoying/most time consuming is always going back to the original modeling idea (I planned to run this analysis) and tweaking it to get "better" results... When this is done all the time, there is a potential issue of bias where you just keep on digging to find something significant in your data, increasing the likelihood of making your research non-reproducible.
- Reading/writing data in a useful format
- Data preprocessing is a pain, but now the challenge is how large a dataframe I can load into memory so that the Triton environment does not end up swapping data to disk. In addition, the dataframe format is not so easy to define. Furthermore, what type of data should the models take as input: range 0..1, -1..1, or raw integers?
- For me, data preparation is the biggest time commitment (especially if other people have been involved in collecting/storing the data), followed by results visualisation to the exacting standards of my co-authors
- My work is centered on *mechanistic* computational modeling (a bit different from statistical modeling that you will talk about). So, naturally, modeling is taking most of my time.
- Visualizing the data in the format that looks professional
- Data preparation as it is not always easy to know how to do it
- Getting others to use best practices so that we *can* collaborate
- I have lots of data, like 50.000 - 100.000 2D images. I need to extract positions via fitting. My fitting algorithm is still too slow and if the positions are too close the fitting algorithm bumps around. I spend time on speeding it up, parallelisation. I spend most time on finding a better way to extract the positions. Maybe this is still data preparation, but it includes already a modelling part due to fitting assumptions. Probably further modelling will be the next big issue.
- Data collection 0, data preparation 50, modelling 30, visualisation 20
- Data preparation is the most time-consuming part in our projects, which are mostly concerned with genomic and clinical data
- Data preparation and improvement of the model/parameters
## Going over the exercises from last time
- The current exercise folder in JupyterLab does not have exercises for chapter 3 (modelling) and chapter 4. Could you share the updated notebooks like last time?
- Do the download links at the top of this HackMD work?
- Solution notebook links appear to be the same as the exercises at the moment
- they are being fixed.... and they are fixed. For future reference: open a file on github, click on "raw" button, that's the link pasted here above.
- I used the link at the top to download the ch3 items. Do I have to re-load via raw.githubusercontent.. again?
- no, they are the same links
- Is this syntax pandas-specific?
- the binary format here?
- EDIT: I meant the syntax in the exercises from last time. :-)
- Yes, most of that is pandas specific
- thanks!
- Where does HDF5 belong?
- I was wondering if this is binary too?
- Yes, it's a binary format.
- Can the other data formats not store big 2D matrices? But also my HDF5 data contains motor positions etc. I wonder now if there is a better solution?
- CSVs are probably not good for big 2D stuff, but most other formats should be able to do 2D stuff. One thing about some formats is it's not just the matrix itself, but metadata about it too and storing several matrices together.
- If your data is already in HDF5 you'll most likely want to keep it in that format. Of course, if you do analysis on that data and get results, you might want to use some other format to store your results (as they are most likely in tabular format). [name=Simo]
- I was actually advised to store the results in HDF5 as well, as it is faster. I could also read them in again faster.
- Still curious: do different formats read data at different speeds?
- Yes, some are faster/slower at different things.
- Sorry, maybe I missed it, but what is the fastest?
- Let's ask Simo after the break +1 :+1:
- I have used xarray a lot when working with N-dimensional data. It offers a lot of the same syntax/functionality as pandas. The data format used by xarray is netCDF, which can also store metadata.
- netCDF is HDF5 underneath; it is just a way of writing HDF5 files in a standardized, structured way. If your data works well with that, I'd say go with it.
- I have problems with updating the lesson 2 exercises... Where can I find the instructions? I think they were somewhere in the last lesson's documentation...
- Here for the files: https://github.com/AaltoSciComp/data-analysis-workflows-course/tree/master/python_exercises For the direct download link for lesson 2 solutions: https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch2-python-ex1-solution.ipynb
- For downloading the lessons, you can run the notebook "download_lessons.ipynb" in the main folder of the repository: https://github.com/AaltoSciComp/data-analysis-workflows-course
- If you are familiar with `git`: from the terminal you can go to the folder where you stored the repository and run `git pull`. If you did not make changes, that should work.
- I did not understand why compression makes things difficult.
- It has to be uncompressed. Some compression methods let you uncompress just a part, but some mean you have to uncompress the *whole* data. Depending on use, the CPU usage when decompressing may slow you too much. It's all about how it's used.
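- A sketch of the partial-access point with HDF5 via h5py (hypothetical file and dataset names): HDF5 compresses chunk by chunk, so reading a slice only decompresses the chunks it touches, whereas e.g. a gzipped CSV must be decompressed from the start.

```python
import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    f.create_dataset('data', data=np.random.rand(10000, 100),
                     chunks=(100, 100), compression='gzip')

with h5py.File('example.h5', 'r') as f:
    part = f['data'][0:100, :]   # only the touched chunks are decompressed
print(part.shape)                # (100, 100)
```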
- How should one go about finding a suitable data format? E.g. I'm only interested in formats that allow partial access. I'm currently using hdf5 - how can I find out alternatives to make sure this is the best fit for my project? (I tried googling just now to no avail, maybe I'm not using the correct terminology?)
- Yeah, this is hard and I don't know a single solution for finding the best. I guess this is part of "wisdom", having done it enough and talking with others
- I sometimes get good ideas from papers where people share their code (of similar topic) or github repositories
- Yes! Check what others do, after enough time you learn lots of useful things.
## Chapter 3, modelling
- Sorry, I still have the problem ... ok.
- Ping me on zoom chat direct message [name=enrico]
- I feel like this grouping somehow belongs to chapter 2.
- I think the difference is that this grouping work does not get saved but only serves the purpose of the specific task.
- Are there some good/standard ways to combine Django and pandas? I have lots of data in a db and use Django querysets to model it, but I often miss pandas' power to do certain things.
- Basically, convert data from one memory model to another. Pandas has a `read_sql` function that can read from an SQL database and arrange the result in a dataframe. Maybe raw SQL access isn't the right way with Django, though. You could make a function that reads from Django and produces rows of data, then pass those to another pandas function to create the dataframe.
- Can pandas read querysets?
- Probably not directly. I would use a generator to transform it into an iterator of lists (or similar) to feed to the DataFrame constructor. A sketch is below.
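- A sketch of that idea (hypothetical Django model `Measurement`; `.values()` yields dicts, which the DataFrame constructor accepts directly):

```python
import pandas as pd
# from myapp.models import Measurement   # hypothetical app/model

def queryset_to_df(queryset, fields):
    # .values() avoids instantiating full model objects
    return pd.DataFrame.from_records(queryset.values(*fields))

# Usage (hypothetical):
# df = queryset_to_df(Measurement.objects.filter(year=2020),
#                     ['timestamp', 'value'])
```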
## Exercise problem 1
Links:
[ch3-python-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/python_exercises/ch3-python-ex1.ipynb)
[ch3-r-ex1.ipynb](https://raw.githubusercontent.com/AaltoSciComp/data-analysis-workflows-course/master/r_exercises/ch3-r-ex1.ipynb)
:::warning
**Do Problem 1.**
Work until until xx:45
After you have finished tasks 1 to 3, please click yes on [presemo](https://presemo.aalto.fi/dataanalysis/).
:::
Technical issues -> talk to Enrico in direct message
Conceptual issues, write here
## Continuing chapter 3
- I use this curve_fit. I put in a set of positions where I expect my signals to be. But I noticed that if I switch some positions off/on, the fit is better and faster. How can I make a better decision in the fitting process about what to choose? Would you go via a for loop through different combinations and check where chi2 is the lowest? I am looking for something that helps decide what the best combination of positions is.
- Do you mean that if you selectively remove some of your data, you get a better fit?
- I think so. Background: I have a material with different phases inside. Peak presence varies within the material. And yes, if I try different combinations of the peaks (switching them off/on), the fit converges faster.
- If you report all the different combinations of on/off (e.g. by reporting the average chi2) then you are fine; if you report only the on/off combination that gives you the best chi2, then you are overfitting/cherry-picking/p-hacking, depending on what your field calls it :) ([name=enrico]). Bootstrapping is somewhat similar, where you do the same computation on subsets of data but report the average.
- We'll look into bootstrapping next and we'll use nested dataframes to calculate a distribution for a statistic (mean). You might want to use something similar to do multiple fits and then create a distribution for your goodness-of-fit (e.g. sum of squared residuals). Then you can say that the best fit is well defined as it is the one that minimizes that quantity. [name=simo].
- I would not say it is cherry-picking because technically each peak is fitted on its own, but if it is not present, it causes problems. Visually the fits look most of the time very good. I think I need to implement another parameter, like chi2, check all fits for the best chi2 and make a decision based on this. But it will take a lot of more computing time.
- I agree, if you report all combinations you tried, then you are fine. E.g. all the ranges of chi2 you get for 100 combinations.
- If you have 100 combinations, you can run the fit in parallel over 100 independent jobs on a cluster like Triton (Aalto) or CSC. Then you can estimate the average chi2 and confidence intervals (5% - 95%) to get an idea of the best/worst fits.
- Something is already parallelised. I limited the fitting currently to one position only. Fitting one peak over 5000 images takes 15 min. If I do the fit over the full range, it takes too long, and sometimes a fit does not converge, which increases the time used for each fitting.
- Very interesting problem btw, if you are from Aalto please come over to our [garage](https://scicomp.aalto.fi/help/garage/) sometimes, we are happy to help+discuss and we have soon new Research Software Engineers to help on these types of heavy computational requests.
- Unfortunately, I am not, but super nice service.
- I am sure you might have RSEs at your uni too, we can put you in contact with them :)
- if you are using CSC services (https://www.csc.fi/ they offer computing services, eg a high performance computing cluster, virtual machines, data storage and more ), they also offer help for these kinds of problems
## Exercises 2
:::warning
**Do problem 2.**
ch3-python-ex1.ipynb / ch3-r-ex1.ipynb (links at the top of the page)
Until: xx:35
After you have finished, click yes in [presemo](https://presemo.aalto.fi/dataanalysis/).
:::
Technical issues -> talk to Enrico in direct message
Conceptual issues, write here:
- I'm still a bit fuzzy on why I might want to write plotting code as functions (as opposed to a script), since there tends to be a lot of manual fiddling. E.g. in the exercise, if I want to print the r value onto the plot, the location for that is different in the two plots due to the underlying data being differently distributed.
- I ([name=enrico]) agree on one side when it comes to practicality, but the more can be done with a function, the easier (and more reproducible) future plots will be. You can then eventually only do the final touches manually.
## Homework
**Do problem 3.**
## End of Lecture Questions
- Could we get some info regarding course assessment?
- I believe you are interested in getting a study credit for the course. Simo can give more details, but we will also email about it to make sure everyone will know.
- Where could we learn more about it? Do you have some book recommendations? It is the first time I've heard about bootstrapping.
- Efron's books (he invented it, IIRC). I liked: https://www.amazon.com/Computer-Age-Statistical-Inference-Mathematical-ebook/dp/B01L27MR64 (if you are at Aalto I have a physical copy of the book, we can find a COVID way to share it [name=enrico])
- https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
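- A minimal bootstrap sketch (hypothetical data): resample with replacement, recompute the statistic, and look at the resulting distribution.

```python
import numpy as np

rng = np.random.RandomState(seed=42)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # hypothetical sample

# Resample with replacement and recompute the mean many times:
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]

# 95% confidence interval for the mean:
print(np.percentile(boot_means, [2.5, 97.5]))
```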
- Simo, could we get some info regarding how to get the one credit? (before you go)
## Feedback for this day
*please write something good about today and something that could be improved. Feel free to add any suggestion to make things better for next days as well as future version of this course*
- Thanks a lot for your effort to teach this course. It was quite interesting and useful. The course material is really nicely organized. Good luck with the rest of the course.
- Really nice content, thank you! A second lecture break would have made it easier to focus towards the end of the lecture.
- Feeling really lucky to take this lecture from you guys. Thanks for your help and effort!
- I don't understand Pandas, but I like to learn about the concepts. I will look into bootstrapping. I am happy I had the chance to learn something new.
- For me the exercises are quite challenging, so there is not enough time to follow the real "beef". Anyway, I like the course
- What about that one credit? How do we get it? lol
- it will be pass or fail right?
- Yes
### Exercise: record changes
Link: https://coderefinery.github.io/git-intro/02-basics/#exercise-record-changes
You should be able to do all of this, and then try
10 minutes, ends at xx:45
Breakout room status:
- room 1 needs help - rkdarst there.
- room 3: done
- room 5, help with mac
- room 11: need more time
- If I make extra commits, is that OK?
- in general, yes. For this exercise, yes.
- How can I have git automatically commit files?
-----
Always write at the very bottom of this document, just above this line.