# Notes from Fundamentals of Data Visualisation

## [1 Ugly, bad, and wrong figures](https://clauswilke.com/dataviz/introduction.html#ugly-bad-and-wrong-figures)

- **ugly**: A figure that has aesthetic problems but otherwise is clear and informative.
- **bad**: A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.
- **wrong**: A figure that has problems related to mathematics; it is objectively incorrect.

## [2 Visualizing data: Mapping data onto aesthetics](https://clauswilke.com/dataviz/aesthetic-mapping.html)

- All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as **aesthetics**.
- Commonly used aesthetics in data visualization: position, shape, size, color, line width, line type. Some of these aesthetics can represent both **continuous and discrete data** (position, size, line width, color) while others can usually only represent discrete data (shape, line type).
- When data is numerical we also call it **quantitative** and when it is categorical we call it **qualitative**. Variables holding qualitative data are **factors**, and the different categories are called **levels**.
- To map data values onto aesthetics, we need to specify which data values correspond to which specific aesthetics values. This mapping between data values and aesthetics values is created via **scales**.

## [3 Coordinate systems and axes](https://clauswilke.com/dataviz/coordinate-systems-axes.html)

- The combination of a set of position scales and their relative geometric arrangement is called a **coordinate system**.

### Cartesian coordinates

- Whenever the two axes are measured in different units, we can stretch or compress one relative to the other and maintain a valid visualization of the data (Figure 3.2). **Which version is preferable may depend on the story we want to convey**.
A tall and narrow figure emphasizes change along the y axis and a short and wide figure does the opposite. Ideally, we want to choose an aspect ratio that ensures that any important differences in position are noticeable.
- On the other hand, if the x and the y axes are measured in the same units, then the grid spacings for the two axes should be equal, such that the same distance along the x or y axis corresponds to the same number of data units.

### Nonlinear coordinates

- The most commonly used nonlinear scale is the logarithmic scale, or log scale for short. Log scales are linear in multiplication, such that a unit step on the scale corresponds to multiplication by a fixed value.
- There is no difference between plotting the log-transformed data on a linear scale and plotting the original data on a logarithmic scale (Figure 3.4). The only difference lies in the labeling of the individual axis ticks and of the axis as a whole. In most cases, the labeling for a logarithmic scale is preferable, because it places less mental burden on the reader to interpret the numbers shown as the axis tick labels. There is also less risk of confusion about the base of the logarithm.
- Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that have been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. **Figure 3.5 is a great example of using log scales**.
- Log scales are frequently used when the data set contains numbers of very different magnitudes.
- Reasons to use other scales (e.g., a square-root scale)?
- Problems with square-root scales: First, while on a **linear scale one unit step corresponds to addition or subtraction of a constant value and on a log scale it corresponds to multiplication with or division by a constant value**, no such rule exists for a square-root scale.
- Despite these problems with square-root scales, they are valid position scales and I do not discount the possibility that they have appropriate applications. For example, just like a log scale is the natural scale for ratios, one could argue that **the square-root scale is the natural scale for data that come in squares** (e.g., the areas of geographic regions, Figure 3.8).

### Coordinate systems with curved axes

- Polar coordinates can be useful for data of a **periodic nature**, such that data values at one end of the scale can be logically joined to data values at the other end.
- A second setting in which we encounter curved axes is in the context of geospatial data, i.e., maps. Locations on the globe are specified by their longitude and latitude. But **because the earth is a sphere, drawing latitude and longitude as Cartesian axes is misleading and not recommended** (Figure 3.11). Instead, we use various types of non-linear projections that attempt to minimize artifacts and that strike different balances between conserving areas or angles relative to the true shape lines on the globe (Figure 3.11).

## [4 Color scales](https://clauswilke.com/dataviz/color-basics.html)

- There are three fundamental use cases for color in data visualizations. The types of colors we use and the way in which we use them are quite different for these three cases.
- **Color as a tool to distinguish**: Use a **qualitative color scale**. Such a scale contains a finite set of specific colors that are chosen to look clearly distinct from each other while also being equivalent to each other. No one color should stand out relative to the others, and the colors should not create the impression of an order.
- **Color to represent data values**: In this case, we use a **sequential color scale**. Such a scale contains a sequence of colors that clearly indicate (i) which values are larger or smaller than which other ones and (ii) how distant two specific values are from each other.
The second point implies that the color scale needs to be perceived to vary uniformly across its entire range. In some cases, we need to visualize the deviation of data values in one of two directions relative to a neutral midpoint. We may want to show those with different colors, so that it is immediately obvious whether a value is positive or negative as well as how far in either direction it deviates from zero. The appropriate color scale in this situation is a **diverging color scale**.
- **Color as a tool to highlight**: An easy way to achieve this emphasis is to color the figure elements you want to highlight in a color or set of colors that vividly stand out against the rest of the figure. This effect can be achieved with **accent color scales**, which are color scales that contain both a set of subdued colors and a matching set of stronger, darker, and/or more saturated colors.

# [Directory of visualisations](https://clauswilke.com/dataviz/directory-of-visualizations.html)

- **Amounts**: Bar plots. If there are two or more sets of categories for which we want to show amounts, stack the bars or use a heatmap.
- **Distributions**: Histograms and density plots provide the most intuitive visualizations of a distribution, but both require arbitrary parameter choices and can be misleading. Cumulative densities and quantile-quantile (q-q) plots always represent the data faithfully but can be more difficult to interpret. Boxplots, violins, strip charts, and sina plots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions.
- **Proportions**: Proportions can be visualized as pie charts, side-by-side bars, or stacked bars (Chapter 10). When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships.
Grouped bars work well as long as the number of conditions compared is moderate, and stacked bars can work for large numbers of conditions. When proportions are specified according to multiple grouping variables, then mosaic plots, treemaps, or parallel sets are useful visualization approaches.
- **x–y relationships**: Scatterplots represent the archetypical visualization when we want to show one quantitative variable relative to another. For large numbers of points, regular scatterplots can become uninformative due to overplotting. In this case, contour lines, 2D bins, or hex bins may provide an alternative (Chapter 18). When we want to visualize more than two quantities, on the other hand, we may choose to plot correlation coefficients in the form of a correlogram instead of the underlying raw data.
- **Uncertainty**: Error bars are meant to indicate the range of likely values for some estimate or measurement. They extend horizontally and/or vertically from some reference point representing the estimate or measurement. Reference points can be shown in various ways, such as by dots or by bars. Graded error bars show multiple ranges at the same time, where each range corresponds to a different degree of confidence. They are in effect multiple error bars with different line thicknesses plotted on top of each other. To achieve a more detailed visualization than is possible with error bars or graded error bars, we can visualize the actual confidence or posterior distributions.

The rest of the book goes into detail about each of the categories in the directory of visualisations.
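
A side note on the chapter 3 points about log and square-root scales: the claims can be checked numerically. The snippet below is a minimal sketch in plain Python, not code from the book. The helper `steps` and the sample values are made up for illustration. It treats the position of a value on a log axis as its logarithm, and its position on a square-root axis as its square root.

```python
import math


def steps(positions):
    """Distances between consecutive positions along an axis."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Illustrative sample data (assumed, not taken from the book).
multiplicative = [1, 10, 100, 1000]  # each value is 10 times the previous one
additive = [0, 1, 2, 3]              # each value is the previous one plus 1

# Log scale: multiplying by a fixed factor moves a fixed distance.
log_steps = steps([math.log10(v) for v in multiplicative])

# Square-root scale: neither multiplicative nor additive sequences
# produce equally spaced positions.
sqrt_steps_mult = steps([math.sqrt(v) for v in multiplicative])
sqrt_steps_add = steps([math.sqrt(v) for v in additive])

# Ratios on a log scale: halving and doubling sit at equal distances
# on either side of a ratio of 1 (whose log is 0).
half, double = math.log2(0.5), math.log2(2.0)

print("log10 steps:", log_steps)
print("sqrt steps (x10 sequence):", sqrt_steps_mult)
print("sqrt steps (+1 sequence):", sqrt_steps_add)
print("log2 of ratios 0.5 and 2:", half, double)
```

The equal log steps are why ratios read naturally on a log axis, while the uneven square-root steps illustrate the "no such rule" remark for square-root scales.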