# Module 2: Syllabus Planning

###### tags: `RDS_Module_2`

[toc]

## TODOs:

- Course dataset
  - Explore COMPAS
    - How clean is it? Do we need to introduce artificial missing values etc.?
  - EDA: To what extent do we need to do EDA before cleaning?
    - Look at each column independently
  - Separate datasets for taught vs. hands-on sessions are probably preferable, to challenge attendees in the hands-on sessions
- Examples of issues with raw data: e.g. scenarios where there are issues in collection
- Anything else to include in, or exclude from, the taught elements?
- How to store cleaned data? e.g. tabular in a dataframe?
- Data protection/"personal data": what to look out for?
  - Making data public/private

## Course Objectives and Outcomes

The main objectives of the proposed course are the following:

- Teach attendees how to use research data science (RDS) methods in an interdisciplinary research environment.
- Move beyond the data science principles offered by existing courses, towards the *craft* of doing RDS: a hands-on, practical understanding, focused on collaboration, reproducibility and openness.
- Provide exposure to a diverse set of real-world RDS projects and demonstrate the decision-making process used to choose the right methods and tools for each setting and each project step.
- Embed data ethics, diversity and inclusion awareness into the attendees' approach to all stages of an RDS project, providing multiple examples.

The expected outcomes are the following:

- Attendees will understand fundamental RDS methods and tools and know when/how to apply them to their PhD/postdoctoral research in order to draw data-driven insights or create data-driven tools.
- Attendees will be familiar with the stages of an RDS project, from scoping and data exploration to deployment, and will become aware of the challenges of tackling ambiguous real-world problems.
- Attendees will be able to recognise power imbalances, bias and diversity issues in their technical work and in their ways of working, and to challenge them.
- Combined with our existing RSE courses, attendees will have the necessary tools to apply for junior-level RSE/RDS positions.

## Module Plan

Module 2: Handling data and deployment

- Sub-module A (Taught), with tailored data/examples for the material:
  - Data wrangling, cleaning, provenance, testing
    - Handling missing data and different types of data (e.g. text, datetimes)
  - Data access
    - Basic databases (SQL, accessing from Python)
    - APIs
  - EDI: why raw data are not raw
    - sampling bias
    - collection error
    - badly designed capture/missed variables (e.g. missing question in a survey, leading question)
    - subjectivity/inconsistent labels
  - For all the above: focus on which action is appropriate for each dataset or situation.
  - Basic deployment (Docker) and reproducibility
    - version control, requirements.txt, virtual environments...
- Sub-module B (Hands-on project), with a COMPAS/equivalent dataset used throughout the course:
  - Attendees will need to access a (fake) NHS dataset from an Azure database and figure out how to pre-process it to prepare the data for later use. They will need to discuss and decode various complexities (e.g. missing/ambiguous values, bias in data collection, data privacy and sensitivity).
  - Attendees will need to work in teams, reviewing each other's code and all contributing to a common code base, and will have to use a simple Docker container for reproducibility.

## Similar MOOCs / Refs

- Johns Hopkins Data Science Specialization, Course 3 (Data Cleaning): https://www.coursera.org/learn/data-cleaning?specialization=jhu-data-science.
  Includes:
  - reading data
  - summarising/sorting
  - reshaping/merging
  - regex
  - dates
- Data Feminism: https://mutabit.com/repos.fossil/datafem/uv/datafem.pdf
- Turing Way course: https://docs.google.com/document/d/1vnLZ1LpHCUHyESvnTFHob_ObF4OHnWSw/edit
- IBM, Data Analysis with Python: https://www.coursera.org/learn/data-analysis-with-python?specialization=applied-data-science#syllabus. Includes:
  - importing data
  - data wrangling
  - EDA
  - model development
- Databricks, Applied Data Science for Data Analysts: https://www.coursera.org/learn/applied-data-science-for-data-analysts?specialization=data-science-with-databricks-for-data-analysts#syllabus. Includes:
  - Feature engineering and selection:
    - feature engineering
    - missing values
    - data types

# Syllabus

## Getting Data (https://hackmd.io/7rron7vtTLK_uKx_5A_PrQ)

### Where to find data?

Popular online sources of data, often domain-specific, e.g.:

- ONS
- data.gov.uk
- Kaggle
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [World Bank](https://microdata.worldbank.org/index.php/home)
- sklearn, TensorFlow datasets, PyTorch datasets (ready to go with the framework)
- GitHub
- [Zenodo](https://zenodo.org/)
- https://datasetsearch.research.google.com/
- [UK Data Service](https://www.ukdataservice.ac.uk/)

Or:

- How to find data if it's not available in a common repository?
- Creating/collecting your own dataset.

### Legality

#### Licenses

Ensure that the license for your chosen dataset fits your use.

- Common license types
  - example permissive
  - example non-permissive
  - (GitHub) default (no license)
  - Public domain? (TODO: need to find out what's allowed here)

#### GDPR

High-level summary; redirect to specific training.

- What is **personal** data?
- Protected characteristics
- Responsibilities: collection, storage, right to be removed, ...
- Roles (data protection officers etc.)
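One practical point worth demonstrating in the GDPR segment is pseudonymisation of direct identifiers before analysis. Below is a minimal pandas sketch; the column names, example values and salt are all invented for illustration, not part of any real dataset:

```python
import hashlib

import pandas as pd

# Toy records containing a direct identifier (all names/values are invented)
df = pd.DataFrame({
    "nhs_number": ["943-476-5919", "943-476-5920"],
    "age": [34, 57],
})


def pseudonymise(value: str, salt: str = "course-secret") -> str:
    """Replace a direct identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]


# Derive a stable pseudonymous ID, then drop the raw identifier entirely
df["patient_id"] = df["nhs_number"].map(pseudonymise)
df = df.drop(columns=["nhs_number"])
```

Note that under GDPR pseudonymised data generally still counts as personal data, since the mapping could in principle be re-identified; the technique reduces risk but does not remove the legal responsibilities listed above, and the salt must be kept secret.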
#### Commercial agreements

Abide by any agreement made with partners.

### Pandas intro

### Data sources/querying data

- Check what's covered in the RSE course.

#### Databases

- SQL
  - Common flavours: PostgreSQL, SQLite, MySQL...
- (NoSQL)
  - Elasticsearch, MongoDB...

Where:

- Local vs. hosted

Getting data:

- Querying (select, join...)
- An example with a Python package (pandas, sqlite3, ...)

TODO: some broad guide on which type of DB to choose (find a blog to link to?)

#### APIs

- https://github.com/public-apis/public-apis
- web API example

#### Scraping data

Light touch: point towards popular packages like [Scrapy](https://scrapy.org/).

- legality

#### Tabular Data

- CSV example
- (xls...)

#### Structured Data

- JSON
- (YAML, XML)

#### Other

- Images, text documents, video, spatial data, binary formats

### Security

#### Controlling access

### What's the Data For?

- Covered by Module 1?

What task/question are we trying to answer with the data? How to choose a dataset based on the task/question.

### Bias

- Types of bias to be aware of in data

### Linking datasets

Mention the idea here but cover it in Data Wrangling.

## Types of Data

- Data types
- Non-tabular data (images, audio, video, spatial data, documents)
- Temporal data

## Wrangling and cleaning data

- Data consistency (e.g. consistent datatypes)
- Missing data
- Regex
- Binning
- Date processing
- Privacy/anonymisation
- *Should* the data be used?
  - badly designed capture/missed variables (e.g. missing question in a survey, leading question)
  - subjectivity/inconsistent labels
- Wrangling text/image data?

## To be sorted:

- Saving/publishing data
- Virtual environments, dependencies etc. (maybe Module 1, or the intro to the hands-on exercise?)

# Chris Burr Ethics Course Meeting (4/5/2021)

- [Course Proposal](https://thealanturininstitute.sharepoint.com/:w:/s/dp/Ed3YNc6MCBdCnOh4M9tYFN8BXmUxhZ8qGKTopKDIe2sAQg?e=PnNnKM&wdLOR=c77BAAB4D-6E64-0D4D-81BC-2230D6B645E6)
- Collaborate between the ethics & RDS courses?
- COMPAS: maybe problematic (the controversy/sensitivity of the topic may distract from the core content of the course). Consider alternatives + synthetic data?
- Syllabus
  - 3 independent modules:
    - ethics & governance
    - responsible research (probably the greatest overlap with the RDS course)
    - public communication
- Responsible Research & Innovation:
  - ~10 lectures (of ~1 hour each) + exercises
  - No prerequisite to know Python; attendees should know general statistics/data analysis.
  - Qualitative assessment (exercises to think through problems, not exercises with fixed solutions)
    - explore previous examples (Cambridge Analytica etc.)
  - AI lifecycle:
    - DESIGN
      - project planning
      - problem formulation
      - data extraction/procurement
      - data analysis
      - preprocessing
    - DEVELOP
      - model selection
      - model training
      - model testing & validation
      - model reporting
    - DEPLOY
      - model implementation
      - user training
      - system use and monitoring
      - model updating/deprovisioning
  - Planning an exercise on exploratory analysis looking at biases in a Jupyter notebook, and others.
  - Interest in creating a synthetic dataset with desired properties (e.g. bias in missing data...)
- https://catalogofbias.org/
- https://alan-turing-institute.github.io/rrp-selfassessment/introduction.html
  - Bias self-assessment (examples of types of bias in healthcare and questions you might ask)
- https://cdeiuk.github.io/bias-mitigation/
  - How they created a synthetic dataset:
    - https://github.com/CDEIUK/bias-mitigation
    - See `notebooks/notebooks/recruiting/preprocessing.ipynb`
- Particularly interested in healthcare, criminal justice and hiring examples (in that order of preference).
- They have a fairly large budget, including for graphic design.
- No requirement for a single dataset used throughout the course; they probably want something for exploratory data analysis and something for reporting.
- They see their course as theory and ours as practice, so the two should complement each other well.
- Design workshops in June: we could go and observe if we like.
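The idea of a synthetic dataset "with desired properties (e.g. bias in missing data)" can be sketched in a few lines. This is a hypothetical illustration, not CDEI's actual approach; the column names, group proportions and missingness rates are all invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic cohort: two groups of unequal size (all values invented for illustration)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "group": rng.choice(["A", "B"], size=n, p=[0.7, 0.3]),
    "score": rng.normal(50.0, 10.0, size=n),
})

# Plant biased missingness: group B's score is missing 40% of the time,
# group A's only 5%, so naive complete-case analysis under-represents group B.
p_missing = np.where(df["group"] == "B", 0.40, 0.05)
df.loc[rng.random(n) < p_missing, "score"] = np.nan
```

Because the missingness rates are planted deliberately, `df.groupby("group")["score"].apply(lambda s: s.isna().mean())` recovers them, which makes this kind of dataset convenient for hands-on exercises where attendees must detect and handle the bias themselves.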