# Module 2: Syllabus Planning
###### tags: `RDS_Module_2`
[toc]
## TODOs:
- Course dataset
- Explore COMPAS - how clean is it? Do we need to introduce artificial missing values etc.?
- EDA: To what extent do we need to do EDA before cleaning?
- Look at each column independently
- Separate datasets for taught vs. hands-on sessions are probably preferable, to challenge attendees in the hands-on sessions
- Examples of issues with raw data: e.g. scenarios where there are issues in collection
- Anything else to include in taught elements or exclude?
- How to store cleaned data? e.g. tabular in dataframe?
- Data protection/"personal data", what to look out for?
- making data public/private
## Course Objectives and Outcomes
The main objectives of the proposed course are the following:
- Teach attendees how to use research data science (RDS) methods in an interdisciplinary research environment.
- Move beyond DS principles offered by existing courses, towards the *craft* of doing RDS: a hands-on, practical understanding, focused on collaboration, reproducibility and openness.
- Provide exposure to a diverse set of real-world RDS projects and demonstrate the decision-making process used to choose the right method and tools for each setting and in each project step.
- Embed data ethics, diversity and inclusion awareness into the attendees' approach to all stages of an RDS project, providing multiple examples.
The expected outcomes are the following:
- Attendees will understand fundamental RDS methods and tools and know when/how to apply them to their PhD/postdoctoral research in order to draw data-driven insights or create data-driven tools.
- Attendees will be familiar with the stages of an RDS project, from scoping and data exploration to deployment, and will become aware of the challenges of tackling ambiguous real-world problems.
- Attendees will be able to recognise power imbalances, bias and diversity issues in their technical work and in their ways of working, and challenge them.
- Combined with our existing RSE courses, attendees will have the necessary tools to apply for junior-level RSE/RDS positions.
## Module Plan
Module 2: Handling data and deployment
- Sub-module A (Taught) - with tailored data/examples for the material:
- Data wrangling, cleaning, provenance, testing
- Handling missing data, different types of data (e.g. text, datetimes)
- Data access
- Basic Databases (SQL, accessing from Python)
- APIs
- EDI: Why raw data are not raw
- sampling bias
- collection error
        - badly designed capture/missed variables (e.g. missing question in survey, leading question)
- subjectivity/inconsistent labels
- For all the above: Focus on which action is appropriate for each dataset or situation.
- Basic deployment (Docker) and reproducibility.
- version control, requirements.txt, virtual environments...
- Sub-module B (Hands-on project) - with COMPAS/equivalent dataset used throughout course:
    - Attendees will need to access a (fake) NHS dataset from an Azure database and figure out how to pre-process it to prepare the data for later use. They will need to discuss and resolve various complexities (e.g. missing/ambiguous values, bias in data collection, data privacy and sensitivity).
- Attendees will need to work in teams, reviewing each other's code and all contributing to a common code base and will have to use a simple Docker container for reproducibility.
## Similar MOOCs / Refs
- Johns Hopkins Data Science Specialization - Course 3, Getting and Cleaning Data: https://www.coursera.org/learn/data-cleaning?specialization=jhu-data-science. Includes:
- reading data
- summarize/sort
- reshaping/merging
- regex
- dates
- Data Feminism: https://mutabit.com/repos.fossil/datafem/uv/datafem.pdf
- Turing way course: https://docs.google.com/document/d/1vnLZ1LpHCUHyESvnTFHob_ObF4OHnWSw/edit
- IBM - Data Analysis with Python: https://www.coursera.org/learn/data-analysis-with-python?specialization=applied-data-science#syllabus. Includes:
- Importing Data
- Data Wrangling
- EDA
- Model Development
- Databricks - Applied Data Science for Data Analysts: https://www.coursera.org/learn/applied-data-science-for-data-analysts?specialization=data-science-with-databricks-for-data-analysts#syllabus. Includes:
- Feature Engineering and Selection:
- feature eng.
- missing values
- data types
# Syllabus
## Getting Data (https://hackmd.io/7rron7vtTLK_uKx_5A_PrQ)
### Where to find data?
Popular online sources of data. Often domain specific.
E.g.
- ONS
- data.gov.uk
- kaggle
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [World Bank](https://microdata.worldbank.org/index.php/home)
- sklearn, tensorflow datasets, pytorch datasets (ready to go w/ framework)
- GitHub
- [Zenodo](https://zenodo.org/)
- https://datasetsearch.research.google.com/
- [UK Data Service](https://www.ukdataservice.ac.uk/)
or:
- How to find data if it's not available in a common repository?
- Creating/collecting your own dataset.
### Legality
#### Licenses
Ensure that the license for your chosen dataset fits your intended use.
- Common license types
- example permissive
- example non permissive
- (GitHub) default (no license)
- Public domain? (TODO: need to find out what's allowed here)
#### GDPR
High level summary. Redirect to specific training.
- What is **personal** data?
- Protected characteristics
- Responsibilities: collection, storage, right to erasure,
- Roles (data protection officers etc.)
#### Commercial agreements
Abide by any agreement made with partners
### Pandas intro
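A minimal sketch of the kind of opening example this section could use (the column names and values are made up for illustration):

```python
import pandas as pd

# A tiny illustrative DataFrame (invented data)
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "score": [91, 85, 78],
})

# Basic inspection and selection
print(df.head())            # first rows
mean_score = df["score"].mean()  # column mean
high = df[df["score"] > 80]      # boolean filtering
```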
### Data sources/querying data
- check what's covered in RSE course
#### Databases
- SQL
- Common flavours: PostgreSQL, SQLite, MySQL...
- (NoSQL)
- Elastic, MongoDB...
Where:
- Local vs hosted
Getting data:
- Querying (select, join...)
- An example with a python package (pandas, sqlite3, ...)
TODO: some broad guide of which type of DB to choose (find a blog to link to?)
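As a concrete sketch of the "query from Python" bullet, an in-memory SQLite database queried straight into a DataFrame (the table name and values are invented for illustration; a real course dataset would likely live in a hosted database):

```python
import sqlite3
import pandas as pd

# Throwaway in-memory database for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(1, 34), (2, 58), (3, 41)])

# Query straight into a DataFrame
df = pd.read_sql_query("SELECT * FROM patients WHERE age > 35", conn)
conn.close()
```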
#### APIs
- https://github.com/public-apis/public-apis
- web api example
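A possible shape for the web API example, using only the standard library; the endpoint and parameters are hypothetical, and the network call is left as a comment so the snippet stays self-contained:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters, for illustration only
base = "https://api.example.com/v1/records"
url = f"{base}?{urlencode({'country': 'GB', 'year': 2020})}"

# In a live setting:
#   from urllib.request import urlopen
#   payload = urlopen(url).read()
payload = b'{"results": [{"country": "GB", "value": 42}]}'  # stand-in response

data = json.loads(payload)
```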
#### Scraping data
Light touch: point towards popular packages like [Scrapy](https://scrapy.org/)
- legality
#### Tabular Data
- csv example
- (xls...)
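A candidate CSV example, with the file contents inlined as a string so the snippet is self-contained (invented values; note how pandas turns the empty field into NaN):

```python
import io
import pandas as pd

# Inline CSV stands in for a file on disk
raw = "id,age,city\n1,34,Leeds\n2,,York\n"
df = pd.read_csv(io.StringIO(raw))

# The empty age field is read as NaN
n_missing = df["age"].isna().sum()
```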
#### Structured Data
- json
- (yaml, xml)
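A candidate JSON example with a small invented nested record, showing why nested structures usually need flattening before tabular analysis:

```python
import json

# A small nested record, as an API response or config file might contain
raw = '{"patient": {"id": 7, "visits": [{"date": "2021-03-01", "score": 5}]}}'
record = json.loads(raw)

# Navigating the nested structure
score = record["patient"]["visits"][0]["score"]
# For tabular analysis, pandas.json_normalize can flatten records like this
```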
#### Other
- Image, text documents, video, spatial, binary
### Security
#### Controlling access
### What's the Data For?
- Covered by Module 1?
What task/question are we trying to answer with the data?
How to choose a dataset based on the task/question
### Bias
- Types of bias to be aware of in data
### Linking datasets
Mention idea but cover in Data Wrangling
## Types of Data
- Data types
- Non tabular data (image, audio, video, spatial, documents)
- Temporal data
## Wrangling and cleaning data
- Data consistency (e.g. consistent datatypes)
- Missing data
- Regex
- Binning
- Date processing
- Privacy/Anonymisation
- *Should* data be used?
    - badly designed capture/missed variables (e.g. missing question in survey, leading question)
- subjectivity/inconsistent labels
- Wrangling text/image data?
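Several of the bullets above (missing data, regex, date processing, binning) could be demonstrated in one short pandas sketch; the column names and values here are invented:

```python
import pandas as pd

# Messy example data: free-text dates, inconsistent formatting, missing values
df = pd.DataFrame({
    "dob": ["1990-01-05", "1985-12-31", None],
    "postcode": ["LS2 9JT", "ls29jt", "YO1 7HH"],
    "income": [25000.0, None, 41000.0],
})

# Date processing: parse strings to proper datetimes (None becomes NaT)
df["dob"] = pd.to_datetime(df["dob"])

# Regex: normalise postcodes to uppercase with no whitespace
df["postcode"] = df["postcode"].str.upper().str.replace(r"\s+", "", regex=True)

# Missing data: flag before imputing, then fill with the median
df["income_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

# Binning: coarse income bands
df["band"] = pd.cut(df["income"], bins=[0, 30000, 50000], labels=["low", "mid"])
```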
## to be sorted:
- Saving/publishing data
- Virtual environments, dependencies etc. (? maybe module 1 or intro to hands-on exercise)
# Chris Burr Ethics Course Meeting (4/5/2021)
- [Course Proposal](https://thealanturininstitute.sharepoint.com/:w:/s/dp/Ed3YNc6MCBdCnOh4M9tYFN8BXmUxhZ8qGKTopKDIe2sAQg?e=PnNnKM&wdLOR=c77BAAB4D-6E64-0D4D-81BC-2230D6B645E6)
- Collaborate between ethics & RDS courses?
- COMPAS: Maybe problematic (controversy/sensitivity of topic may distract from core content of course). Consider alternatives + synthetic data?
- Syllabus
- 3 independent modules:
- ethics & governance
        - responsible research (probably the greatest overlap with the RDS course)
- public communication
- Resp research & Innovation:
- ~10 lectures (of ~1 hour each) + exercises
- No pre-requisite to know Python. Should know general statistics/data analysis.
- Qualitative assessment (exercises to think through problems, not exercises with fixed solutions)
- explore previous examples (cambridge analytica etc.)
- AI lifecycle
- DESIGN
- project planning
- problem formulation
- data extraction/procurement
- data analysis
- preprocessing
- DEVELOP
- model selection
- model training
- model testing & validation
- model reporting
- DEPLOY
            - model implementation
- user training
- system use and monitoring
- model updating/deprovisioning
- Planning an exercise on exploratory analysis looking at biases in a Jupyter notebook, among others.
- Interest in creating synthetic dataset with desired properties (e.g. bias in missing data...)
- https://catalogofbias.org/
- https://alan-turing-institute.github.io/rrp-selfassessment/introduction.html
- Bias self-assessment (examples of types of bias in healthcare and questions you might ask)
- https://cdeiuk.github.io/bias-mitigation/
- How they created a synthetic dataset:
- https://github.com/CDEIUK/bias-mitigation
- See `notebooks/notebooks/recruiting/preprocessing.ipynb`
- Particularly interested in healthcare, criminal justice, and hiring examples (in that order of preference).
- Have a fairly large budget, including for graphic design.
- No requirement for a single dataset used throughout the course - probably want something for exploratory data analysis, and something for reporting.
- See their course as theory, ours as practice, so two should complement each other well.
- Design workshops in June - we could go to observe if we like