# RDS course meeting notes
## Tuesday 7th September
### Roll Call
- Greg
- Callum
- James B
### Notes
- Discussed progress in each module and plan for the next two months.
- M2: We aim to have a reviewable version by mid-October, pending confirmation from Jack next week. The M2 taught content is missing some elements and the hands-on material has not been developed yet.
- M3-M4: M3 is close to finished; M4 will be worked on this week and in 22 days' time if necessary.
- We agreed to organise a session to macro-harmonise the content, e.g. agree on what to do with overlapping material, some terminology and optics. This might happen end of September or early October.
## Tuesday 31st August
### Roll Call
- Kasra
- Camila
- Jack
- Callum
### Notes
- No one allocated any more
- Module 1: ready to review
- Module 2: PR until end of September (excluding hands-on)
- Module 3&4: ~4 FTE days to complete
- Module 1: Ready to review
- Module 2: [Draft PR](https://github.com/alan-turing-institute/rds-course/pull/49) starting to add what we have so far to the repo and convert it into a Jupyter Book.
- Started work on data wrangling
- Module 3: Mostly done pending exploring actual data.
- Module 4: ~Half done, still need to work on modelling actual data.
- Started work on variable selection, identified ~10 promising ones.
- Investigating independence of variables, planning to narrow down to 3 variables to start an initial model with.
- Will also feed back into module 3 via plots created.
- Feature engineering?
- https://www.featuretools.com/
- interactions
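
A minimal sketch of how the variable-independence check above could start, assuming pairwise Pearson correlation is an acceptable first screen (it only captures linear dependence, so it is not a full independence test). The column names and values are hypothetical toy data, not taken from the real dataset.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical toy columns (assumption: numeric, no missing values).
columns = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income_scaled": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [40.0, 25.0, 60.0, 33.0, 51.0],
}
flagged = strongly_correlated(columns)
print(flagged)
```

Pairs flagged this way are candidates for dropping one of the two variables before narrowing down to the initial three; the 0.8 threshold is an arbitrary illustrative choice.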
## Tuesday 24th August
### Roll Call
- Kasra
- Camila
- Christina
- James
- Callum
- Fede
### TODO
- We decided to extend "Weekly RDS meetings" until the summer school ---> meeting time: Tuesdays 10:00-10:30
### Module 1
- [name=Fede] PR ready for review (taught): https://github.com/alan-turing-institute/rds-course/pull/47
- Hands-on: to-be-discussed today
- Simulation of a project scoping meeting
- After module 1, participants should be familiar with the problem/dataset
- Currently, the idea is to split participants into groups, each with a helper/instructor acting as a PI. The participants will ask clarifying questions (including on EDI) and try to come up with actionable (DS/SE) steps. The sessions will be moderated so that all groups end up with comparable "project briefs".
- In module 1, the participants will also set up GitHub repos and start working with the datasets.
### Module 2
- [name=James] Prettify and leave in Jupyter Book today. PR review for @ChristinaLast today if possible.
### Module 3
- [name=Camila] 95% ready
### Module 4
- [name=Camila] main focus of this week
## Tuesday 17th August
### Roll Call
- Callum
- Kasra
- Camila
- Christina
- Jack
### Module 1
- We see there's a PR: https://github.com/alan-turing-institute/rds-course/pull/47 - ready to review?
### Module 2
- [name=Jack] mostly finished first half of teaching material (finding and loading datasets)
- Second half (wrangling, missing values) yet to be completed; sketched out syllabus sections last week.
- Plan to copy HackMDs to jupyter book, and prettify
### Module 3
- [name=Callum] 70% of material done (stored in 3 separate PRs)
- [name=Kasra] Do you need a reviewer?
- Not yet, consolidating branches first; there will be a PR to review on Thursday
### Module 4
- starting today
- Plan to use logistic regression model on UK in taught part, hands-on part try improving/changing the model or changing the data (e.g. different country)
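
A rough sketch of the plan above: fit a logistic regression on UK rows, then check how the same model does on another country. This is only an illustration under stated assumptions: the data is synthetic, the column names `ses_score` and `good_health` are hypothetical, and a small hand-written gradient-descent fit stands in for whatever library the module ends up using.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit [bias, w1, ...] by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, X, y):
    """Share of rows where the 0.5-thresholded prediction matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        hits += int((p >= 0.5) == bool(yi))
    return hits / len(X)


def make_rows(country, n, slope, rng):
    """Synthetic survey rows: one predictor and a binary health outcome."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        label = 1 if rng.random() < sigmoid(slope * x) else 0
        rows.append({"country": country, "ses_score": x, "good_health": label})
    return rows


def split(rows, country):
    """Select one country's rows and return (features, labels)."""
    subset = [r for r in rows if r["country"] == country]
    return [[r["ses_score"]] for r in subset], [r["good_health"] for r in subset]


rng = random.Random(0)
rows = make_rows("UK", 300, 2.0, rng) + make_rows("FR", 300, 2.0, rng)

X_uk, y_uk = split(rows, "UK")
X_fr, y_fr = split(rows, "FR")
weights = fit_logistic(X_uk, y_uk)  # taught part: model fitted on UK only
acc_uk = accuracy(weights, X_uk, y_uk)
acc_fr = accuracy(weights, X_fr, y_fr)  # hands-on idea: try a different country
print(f"UK accuracy: {acc_uk:.2f}, FR accuracy: {acc_fr:.2f}")
```

In this toy setup both countries share the same underlying relationship, so the UK model transfers well; with real data the interesting hands-on question is what to do when it does not.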
### Data Analysis / Hands-on Exercise
- [name=Christina] Last week: Equivalent of module 2
- This week: Wrangling, logistic regression model. Plan to make PR with analysis steps.
- Later: Visualisations - referring to taught material (e.g. Section 3.5).
### AOB
- [name=Kasra] Course advertisements and selection questions to write.
## Tuesday 10th August
### Roll Call
- Greg
- Camila
- Fede
- Callum
- James B
- Christina
### Module 1
- Fede and Greg working on Module 1, hopefully by the end of the week there will be a reviewable version.
- No expected code for the hands on session.
### Module 2
- Finished a big section about getting data
- Moving into the data cleaning section
- Module 2 can take some material from Christina's analysis for the hands-on session.
### Module 3
- Hopefully have a reviewable draft about the first 4 sections by the end of the week.
- Section 3.4 not started yet but some material from Data Feminism can be used here if there are no license issues.
- 3.5 is exploration of the data and any output from Christina on this would be useful.
### Module 4
- Christina will start looking at the data, using the paper as a guide but not trying to reproduce the results exactly. Focus on the UK.
## Tuesday 3rd August
### Roll Call
- Greg
- Camila
- Kasra
- Callum
- Jack
### Module 1
- WIP: hands-on session
- Fede & Greg will finish Module 1, with some additional time from Fede's LwM time.
- Workflow
- Internal reviews when merging module branches into develop
- Inviting REG (or possibly external reviewers) when merging develop into main
- Maybe ask enrichment students to review?
### Module 2
- Developing material on getting data, different types of data, pandas, wrangling data.
- Will look into missing data cases
- Exploratory data analysis: Is it falling in the gap between M2 and M3?
- M3 will visualise relationships between variables as a starting point to then develop models, which is close to EDA.
- If some elements of EDA fit M2, they could be included and then we can sync/rearrange with M3.
- Variables selection in hands-on exercises:
- We can gradually refine this during the course: some initial discussion in M1 about what we are interested in, then looking at the data and selecting some variables in M2, then doing EDA and more refining in M3/M4.
- Share what each module does in the research question issue
### Module 3
- 5 sections, we have the draft: Spent last week working on 3.1, 3.2
- We might not have enough time to develop all the content:
- Can we remove material?
- Can we use pre-existing plots?
### Module 4
- Drafted but no content generated
- Complicated and might take time
### Others
- Should we define style rules for the code? --> Use black to automatically format notebooks/scripts.
- When do people take leave and how will this impact development?
- Proposed plan to increase bandwidth and get some early results from analysing the dataset:
- Invite REG members that have spare time to read the paper, get the data, clean them, visualise them and build an initial model (similar to the paper's) in order to answer the research question.
- Use this early output as a base for developing the last part of M3 taught and M4 taught and also to support development of the hands-on sessions for all modules.
- Post in slack and ask Martin for support
## Tuesday 27th July
### Roll Call
- Greg
- Camila
- Callum
- James
- Kasra
### Notes
### General
- We are still waiting for feedback from EAG :warning:
- Risk: consents given by the participants and the scope of the consent
- Verbosity level when preparing the course materials
- Preference: as verbose as possible (some people can asynchronously follow the course)
- How to choose a data set given a research question
- Why you pick a dataset
- Hypothesis driven tests vs collecting data and then scoping research questions
- Article about p-hacking story https://www.vox.com/science-and-health/2018/9/19/17879102/brian-wansink-cornell-food-brand-lab-retractions-jama
### Module 1
- planning
- developing some content
- "project life cycle" section
- Challenges: very wide topic, what to include
- Discussions on the differences between research in academic and industry settings - maybe focus more on research?
- Incremental improvements vs building a new system
- Examples of projects where discussion with academics was needed to set up the project: paleoanalytics, LwM, possibly some projects from DSG, SPENSER
- Hands-on session
- Focus on EDI
- Not sure how to run the scoping part? (if we want to include that in the hands-on session)
- One approach:
- Give a broad research question to the participants
- In groups, they need to further scope the project
- This can be done iteratively
### Module 2
- Divide into several sections:
- For details, refer to: https://hackmd.io/2jO67RhQSGysNetJ-ReJ7Q
- Legality, ethics, API for getting data
- Stay very high level in e.g., wrangling due to time constraints
### Module 3
- Project board: https://github.com/alan-turing-institute/rds-course/projects/3
- Meeting with Andrea, who is part of the data visualisation SIG. Notes from the meeting are here: https://hackmd.io/od3NiF2HQZmG79gGVlMi_w
- some discussions on simplicity vs complexity, breadth vs depth when visualising/communicating
### Module 4
- Draft outline: https://github.com/alan-turing-institute/rds-course/issues/30
- Main challenges:
- what to include
- Bayes vs frequentist
- It would be interesting to talk about some of these cultures
- Plan is to develop Module 3 first
- Developing taught and hands-on sessions in parallel, but the hands-on will be finalised after the taught session is finished.
- Some ideas:
- we build a model and discuss results
- this helps to remove barriers so the participants can start interpreting the results
- For part of the hands-on session: ask them to do similar type of analysis but maybe for another region or with more variables
- framework: sklearn, ...
- One approach: one model, add more variables.
## Tuesday 20th July
### Roll Call
- Camila
- Kasra
- Callum
- Jack
### Notes
#### General
- Callum, updates on our meeting with Chris Burr (Thursday):
- We talked about Jupyter Books.
- Callum: we have started JB for the RDS course (also learning how to develop JB is in progress)
- PR for the JB (will be merged to develop)
- We should start the conversation on which features/columns we are going to use?
- In which module?
- Feels a bit strange to start interacting with data in Module 1
- However, during the scoping, some irrelevant columns can be dropped. Given we have around 100 columns in this dataset, maybe we should do this here as well? This will result in a dataset with maybe around 20 or fewer columns. These columns can be further reduced in the following modules.
- In Module 2, we start looking at the data. Maybe that is the right place?
- In Module 3/4, we can decide on the most important columns
- Quick overview on JB:
- It is easy to integrate a notebook/markdown/...
- Callum will create a README/tutorial on how to further develop the JB
- Can integrate different file types (including markdown and ipynb)
- There are different ways to run the cells: Binder, colab
- Proposed GitHub workflow:
- Module/other development on new branches
- PR module development to `develop` branch with review from within the team.
- PR final module drafts to `main` with external review.
- For now, develop branch is the default branch
#### Module 1
#### Module 2
- Work on getting data; GDPR licensing; different ways to load CSV data
- The next topics are:
- wrangling and bias
- hands-on sessions
- We need to make decisions around:
- format; which columns to use
- Camila: maybe we can use a pre-cleaned data
- How to make features for categorical data?
- How to treat categorical data? This will also affect the visualisation in Module 3
- How many columns we should use?
- Callum: probably not more than 10
- Camila: how much are you going to talk about missing data in Module 2?
- Jack: we are going to talk about it in the context of wrangling, not huge amount
- A paper about data wrangling that can be useful https://arxiv.org/abs/2004.12929.
- SQL/DB
- For the hands-on: we have not decided yet
- For the taught session: some topics on DBs and pandas
- In Module 2, EDA:
- [name=Jack] - I don't think we're planning to do EDA at the moment, at least not extensively. Some wrangling, dealing with missing values etc. in the hands on part, but probably not calculating stats, visualising etc.
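
One possible answer to the question above about making features from categorical data is one-hot (indicator) encoding. The sketch below is a dependency-free illustration, not a decision on the approach; the `employment` column and its values are hypothetical and not taken from the real dataset.

```python
def one_hot(values):
    """Map a list of category labels to (category order, 0/1 indicator rows)."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical categorical column.
employment = ["employed", "retired", "student", "employed"]
categories, encoded = one_hot(employment)
print(categories)  # ['employed', 'retired', 'student']
for row in encoded:
    print(row)
```

Each categorical column with k levels becomes k indicator columns, which matters for the "no more than 10 columns" budget discussed above. For regression models it is common to drop one level per variable to avoid perfectly collinear indicators.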
#### Module 3
- Camila made good progress on planning; project [here](https://github.com/alan-turing-institute/rds-course/projects/3).
#### Module 4
- Planning is in progress, deciding on the content
- There are many topics to be covered and time can be an issue
- Explaining the general principles with some comments on the specific methods that we are going to use in practicals
- Some feedback received from REG
- Camila: we can focus on how to work with real data; everybody can work with the Kaggle datasets, but in the practical, we need to focus, e.g., on how to choose a model.
- What are the most important components in modeling?
- How to ensure that the model is good enough? how to compare the models? How to choose the right model for the task?
- Maybe ask the audience to read the first chapter of "Statistical Thinking" book
- What are the EDI aspects in this module?
- e.g., what are the impacts of using the same questionnaire for different groups of people, what happens when we mix different groups of people? (thinking about the differences between western and eastern countries).
- Which package should we use?
- One approach is to first try to answer the question and then find the right package for that (this can be sklearn).
### Actions
## Thursday 8th July
### Roll Call:
- Fede
- Greg
- Callum
- Kasra
- Camila
### Notes:
- Callum says that the paper related to the last research question is not doing a good job
- We shouldn't discuss the quality of the papers in the course, at least not in the open material.
- Greg suggests to use [this paper](https://github.com/alan-turing-institute/rds-course/blob/main/articles/Aldabe_etal_2010.pdf) and [this framework](https://github.com/alan-turing-institute/rds-course/issues/1#issuecomment-872310492) as the central research question of the course
- Greg: all research questions [in the list](https://github.com/alan-turing-institute/rds-course/issues/1#issuecomment-840561125) are more or less in line with regression models. Are we OK with this?
- Camila: it's important that we teach the broad schools of thought in DS, more than specific methods etc. (if they want to do deep learning there's a course for that)
- Kasra: Should we talk about the datasets during the taught sessions? Callum: It depends, but we should avoid having no examples.
- There won't be enough time for the students to build a good model, but we can help them to familiarise themselves with the data.
- Could there be different options for the students to explore? We should provide the options.
- Kasra: Scikit-learn provides many models, so we could ask them to try several of them. We can ask them to benchmark and discuss the different types of models and their performance.
- Provide a minimal viable model, ask them to improve from that.
- Iterative model development
- This is the summary of the paper from Greg's initial notes:
`This study explores associations between socio-economic status (SES), measured using occupation, and self-reported health, using the same data. It examines the contribution of various material, occupational and psychosocial factors to social inequalities in health in Europe. It uses multilevel logistic regression. Policy connection could be which interventions might result in a reduction in social inequalities in health.`
### Actions
- Decided to provisionally choose Question 3.
- Everyone should read the paper [Aldabe et al., 2011](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3678208/), and we will discuss it on Tuesday. If there is time, there is also another related paper that Chris sent through: [Braveman et al., 2011](https://www.annualreviews.org/doi/full/10.1146/annurev-publhealth-031210-101218)
## Tuesday 6th July
This is the project kick-off meeting
### Roll Call:
- Camila
- Callum
- Jack
- Lydia
- Jamesb
- Greg
- Fede
- Iain
- James R
- Christina
- Kasra
### Notes
- Aims
- Develop data science training material for PhD students and researchers
- 4 modules, with 8 hours each (half practical hands-on sessions)
- Delivery in last two weeks of November
- Asynchronous use of the material after the live delivery
- Updates
- EAG decision pending
- Feedback sessions to be organised for October
- Turing Commons collaboration
- Ad, selection form to be written
- Datasets
- European Quality of Life Time Series (2007 and 2011) dataset used for the hands-on exercise that students will do
- We discussed use of a common dataset throughout vs. toy datasets tailored to each example. The former gives familiarity (which will be useful for later modelling) but the latter gives flexibility and possible clarity of explanation. A possible compromise is for each module to pick datasets based on the requirements of the module but also share their choices and discuss with the team to make sure any opportunities for sharing are exploited. An issue will be created for that.
- Do we need EAG approval for each of those (toy) datasets? -> We do not need it for toy or artificial datasets (assuming toy datasets do not contain real sensitive data). We will keep EAG updated if we need to use a new dataset that might be sensitive.
- Discussion about datasets is here: https://github.com/alan-turing-institute/rds-course/issues/1, including details about how we chose the EQLS dataset.
- Teaching content
- interactive notebooks. potentially jupyter book to bring it together
- ultimate aim is an open self-supported course that students can do asynchronously, this is also one of the deliverables that we agreed with Mishka (the other being the live course delivery)
- Risks
- Unknown research question for the hands on session. We should prioritise definition of this.
- EAG application. Unknown outcome but some alternatives exist.
- Time is short, start setting milestones and having regular progress meetings.
- Over-producing material
### Actions
- Camila & CM set up project boards for module 3 ([here](https://github.com/alan-turing-institute/rds-course/projects/3)) & 4 ([here](https://github.com/alan-turing-institute/rds-course/projects/4))
- Weekly team meetings - Greg to set up
- Submodule meetings to be arranged between participants
- Module 2 might need to send one person to the regular meeting due to not having a formal allocation
- Issue where people note their holiday plans to work around them if possible [here](https://github.com/alan-turing-institute/rds-course/issues/12)
- Thursday meeting to be set up
### For next meeting
- Next meeting about research question on Thursday 10am.
- Regular meetings on Tuesdays 10am
- For Thursday meeting (8/7/2021):
- Greg to send material about research questions to Camila/Callum (list of questions, material sent by Chris)
- Camila/Callum to think about what they want to include in the taught and hands-on sessions, i.e. create a brief curriculum
- Camila/Callum to look at the data and the possible questions and start thinking about which questions will:
- allow them to teach what they want
- be practical to answer with the data (check for publications that answer the questions by training models on the data or other examples online)
- Greg to think about what Module 1 will contain and how this can be implemented using the dataset/possible research questions.
- Couple of Module 2 links in this issue (contain taught session list of concepts to teach): https://github.com/alan-turing-institute/rds-course/issues/8
## Tuesday 13th July
### Roll Call:
- Camila
- Callum
- Jamesb
- Greg
- Fede
- Kasra
### Notes
- Discussed the time plan/milestones.
- Aim for delivering first complete draft by 20th of August so that we have a spare week to review, get feedback, iterate.
- We will request feedback from REG whenever there is a reviewable part of the material that we can propose as a PR.
- We will organise a meeting to get feedback from REG and other stakeholders in the last week of August.
- Should we first develop taught or hands-on material? It depends; it is up to the team members and likely to involve iteration. For M4 it might make sense to develop taught first as some of the code will gradually be presented there and then reused in the hands-on sessions.
- M2 might not be complete by the same dates but an early version can be presented in August.
- Updated version of the plan here: https://github.com/alan-turing-institute/rds-course/issues/14
- Research question:
- Research question could be: Understand which factors (e.g. SES/other) are associated with self-reported health.
- Stages to answer it:
- Exploratory data analysis and visualisations.
- A predictive model is built to understand if an initial set of variables can predict the outcome. This is a typical but imperfect way to understand if there are associations
- More variables are added and/or a different model is used.
- The model predicts in one country but maybe fails in others so we discuss methods to address that (possibly multilevel modeling).
- A good predictive model does not necessarily answer the question of which variables are associated. We then discuss other steps to improve our answer.
- More discussion is needed to clarify the way this will run from M1 to M4.
- Another research question could be: Which variables predict health?
- This is easier as we can focus on just building a predictive model with a specific benchmark.
- But it is less representative of a real research question.
- If we do this, we could first aim to build a good model and then highlight that this might not be very useful if we want to understand the relationships/causality.
- M3 sketch had started: https://hackmd.io/3vMmdpcnSaSuo8lDzzUAyg
- Visualisation notes from Camila's book: https://hackmd.io/5xcNlQ2KTn-S1MCzkE19cg