# Module 3

## 29/07/21 - in office collab

(Notes from initial scoping in Feb)

Module 3: Data visualisation/exploration and intro to modeling

- Taught:
    - Exploratory data analysis, how to start working with new data.
    - Basic visualisation
        - Types of charts
        - Interactive charts
        - Network graphs
    - Guidance for effective data communication and visualisation, including EDI discussion on the importance of always considering visualisations in the context where they are presented and being aware of inclusivity guidelines.
    - Intro to modeling:
        - The two cultures (theorise and estimate, compute and test). Discuss pros and cons.
        - Predictive modeling, focusing on basic methods (e.g. linear/logistic regression) using a standard Python library.
- Hands-on project:
    - Building a simple dashboard with a Gapminder dataset. Attendees will need to discuss the trends seen in the data and will be encouraged to build effective visualisations, then decide which features to use to train a basic predictive model. Teams will be free to experiment with different methods of collaboration.

## Braindump of things that we feel are important in data visualisation.

- Near the start: interactive session with a poor plot. Groups go off and attempt to interpret it and also to pinpoint the interpretation. Come back and discuss different interpretations, and discuss _why_ it is hard to interpret.
- Lead on to Rules of the Game / Recipe? Basics: what is a figure? Labels, axes (dimensions), scale, considering colour blindness, etc.
- Plots should be understandable without reading the caption.
- Plots should be understandable without having domain expertise.
- Figure as story
- Using data visualisation for data exploration as link between Modules 2 and 3.
- Useful tips
    - Dimensionality reduction when we can't visualise all the information in a single graph
- Using data visualisation for communication
- Plotting pitfalls
    - Overplotting
    - Trying to tell multiple stories
- Examples of bad figures
    - NHS
    - Other real-world case studies?
- When to go interactive?
- Data-based figures vs. infographics: do the same rules hold?
- Know your audience: are there different rules for public vs. academia?

## Tools

- seaborn (more advantages when working with dataframes, perhaps do more with less code)
- matplotlib (more low level)
- ggplot? Not in Python. Could translate some examples as additional material.

## Useful books/references:

- [Fundamentals of data visualisation](https://clauswilke.com/dataviz/)
- The Truthful Art: Data, Charts, and Maps for Communication (Alberto Cairo)
- Storytelling with Data (Cole Nussbaumer Knaflic)
- https://ourworldindata.org/
- Grammar of Graphics [Wickham's paper](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)

## Actions 06/07/21

- Read above resources and reiterate syllabus
- Come back together and attempt to formalise syllabus a little more

## Outline of Module 3:

*Preliminary by Camila, 19 July 2021.*

### 1. Wrong, bad and ugly figures.
We choose a few figures from these examples and ask the students to comment on what the problem is with each figure:

- [No axis in graph in a Mexican government Covid briefing.](https://twitter.com/Rodpac/status/1250764503861600256?s=20)
- [Too much information in one slide from UK covid briefing.](https://twitter.com/10DowningStreet/status/1322614557181960195/photo/1)
- [Examples of bad data visualisation.](https://www.jotform.com/blog/bad-data-visualization/)
- [Good-and-bad-data-visualization.](https://www.oldstreetsolutions.com/good-and-bad-data-visualization)
- [bad-data-visualization-in-the-time-of-covid-19.](https://medium.com/nightingale/bad-data-visualization-in-the-time-of-covid-19-5a9f8198ce3e)
- [statisticshowto](https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/misleading-graphs/)
- [small decisions in data viz](https://www.visualisingdata.com/2016/03/little-visualisation-design/)

Theme: Responsible data communication

- [thread about the difficulties of representing uncertainty](https://twitter.com/EvanMPeck/status/1235568532840120321)
    - [referenced from here](https://www.visualisingdata.com/2020/03/communication-themes-from-coronavirus-outbreak/)
- [thread about data doubts](https://www.visualisingdata.com/2016/03/the-little-of-visualisation-design-part-4/)

### 2. Rules of the game:

Material for sections 2 and 3 is mainly based on the [Fundamentals of data visualisation](https://clauswilke.com/dataviz/) book and the Grammar of Graphics [Wickham's paper](https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf). For the examples we will use data from the hands-on sessions, focusing on a particular country.

- What is a figure? Mapping data onto aesthetics.
- Coordinate systems and axes.
- Color scales. [examples in grey](https://www.visualisingdata.com/2015/01/make-grey-best-friend/)
- Statistical transformations (binning or aggregating).
- Annotations: labels, legends, titles.

### 3. Directory of visualisations (or an equivalent name):

- Distributions.
- Proportions.
- Trends.
- Time series.
- Geospatial data.
- Uncertainty.

### 4. Story telling with data visualisation.

This section can be based on the following resources:

- [Telling a story](https://clauswilke.com/dataviz/telling-a-story.html) section of the Fundamentals of data visualisation book.
- [Numbers don't speak for themselves](https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/2) chapter of the Data Feminism book.
- Musings on [what are graphs](https://www.visualisingdata.com/2019/01/what-do-charts-actually-show/)

### 5. Data visualisation for data exploration

- [Visualise patterns of missingness](https://www.geeksforgeeks.org/python-visualize-missing-values-nan-values-using-missingno-library/) (is this discussed in module 2?).
- Relationships between numerical variables with scatter plots, joint plots, and pair plots.
- Relationships between numerical and categorical variables with box-and-whisker plots and complex conditional plots from [here](https://towardsdatascience.com/how-to-perform-exploratory-data-analysis-with-seaborn-97e3413e841d) and heatmaps from [here](https://towardsdatascience.com/how-to-use-python-seaborn-for-exploratory-data-analysis-1a4850f48f14).
- Visualising high-dimensional datasets: PCA, t-SNE + UMAP (?), some ideas in [here](https://www.kaggle.com/parulpandey/part1-visualizing-kannada-mnist-with-pca/notebook?scriptVersionId=29322090).