## 8.01 Collection of Data
**Terminology**
* **Population**: All items in the group being studied. E.g., all students in ACT is the population if we are finding the mean of student height in ACT.
* **Sample**: A portion of the population. E.g., Bob is a sample (of one) if he's a student in ACT.
* **Sample Size**: The number of items in the sample being studied. A sample is usually a subset (i.e., a part of a larger group) of the population. E.g., if the students at Dickson College are the sample, the number of students at Dickson College is the sample size.
* **Random Sample**: The fairest way of sampling, as everything or everyone in the population has an equal chance of being chosen.
* **Unbiased information**: Impartial, fair, and objective. It represents the true state of affairs without any slant or systematic error.
* **Biased Information**: Showing an inclination or prejudice for or against something or someone. It introduces a systematic error that consistently pushes a result or understanding in one direction.
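
To make the idea of a random sample concrete, here is a minimal Python sketch. The population (student ID numbers 1 to 500) and the sample size of 20 are made up for illustration. It uses `random.sample` from the standard library, which picks items without repeats.

```python
import random

# Hypothetical population: student ID numbers 1 to 500 (made up for illustration).
population = list(range(1, 501))

# A seeded generator makes the example repeatable; remove the seed for a fresh draw.
rng = random.Random(2024)

# Draw 20 different students; every student has the same chance of being picked.
sample = rng.sample(population, k=20)
sample_size = len(sample)

print("Sample:", sorted(sample))
print("Sample size:", sample_size)
```

Here `population` is the whole group, `sample` is the portion actually studied, and `sample_size` is simply how many items the sample contains.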
## 8.02 Types of Data
**Terminology**
* **Categorical Data**: Variables that represent qualities or characteristics, described with words.
  * **Nominal**: Data that has no natural order (e.g., eye colour, sex).
  * **Ordinal**: Data that can be sorted into an order (e.g., movie ratings).
* **Numerical Data**: Variables that represent quantities or measurements, recorded as numbers.
  * **Discrete**: Data that can take only particular values, i.e., can be counted (e.g., shoe size, height of buildings in whole cm).
  * **Continuous**: Data that can take any value, usually within a particular range (e.g., weight of a person, height of a tree, population growth of Australia in 5 years).
## 8.03 Displaying Categorical Data
**Terminology**
* **Frequency**: the number of times a particular value or event occurs within a dataset or a specific observation
**Frequency Table**
| Type of pet | Frequency |
| :--- | :--- |
| Dog | 12 |
| Cat | 9 |
| Bird | 6 |
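
A frequency table is just a tally of how often each category appears. The sketch below rebuilds the table above in Python with `collections.Counter`. The list of individual survey answers is an assumption, constructed so that the totals match the table.

```python
from collections import Counter

# Hypothetical survey answers, built so the totals match the table above.
responses = ["Dog"] * 12 + ["Cat"] * 9 + ["Bird"] * 6

# Counter tallies how many times each category appears.
frequency = Counter(responses)

print("Type of pet | Frequency")
for pet, count in frequency.most_common():
    print(f"{pet:<11} | {count}")
```
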
**Column vs. Bar Chart**
The only difference between a column chart and a bar chart is orientation: a column chart draws its bars vertically, while a bar chart draws them horizontally.
**This is a bar chart:**

**This is a column chart:**

### Practice: Visualising Data on Google Sheets
Create the dataset

Select all the elements in the dataset, then click `Insert` -> `Chart`

Then you get a chart. Edit the panel on the right to see what it can do for you

## 8.04 Stem-and-leaf Plots and Dot Plots
### Stem-and-leaf
Basically we are converting the raw dataset

into a plot like this.

**Creating a plot**
1. When you create the plot, you should first find the smallest number and the largest one.
2. Then see what you can split them into, in this case, by 10s so you can span 120s, 130s, ..., 190s.
3. If a number is missing in one category (for example there are no 130s or 140s), still draw the stem but leave the leaf blank.
**Reading a plot**
When you read off the numbers, simply combine the number on the left (stem) and the numbers on the right (leaf) respectively. Therefore, we have: 123, 154, 157, 159, 159, ...
**Some notes**
1. Unless the question says otherwise, always write the leaves in ascending order as you go.
2. Once you've finished the plot, double-check the count of numbers in case you missed any.
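
The steps above can be sketched in Python. This is only an illustration: the heights are made up (they are not the dataset from the notes), and `stem_and_leaf` is a hypothetical helper name. It splits each number into a stem (the tens) and a leaf (the last digit), keeps empty stems, sorts the leaves, and finishes with the count check.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stems in tens and leaves in ascending order."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 157 -> stem 15, leaf 7
        leaves_by_stem[stem].append(leaf)
    # Include every stem from the smallest to the largest, even the empty ones.
    return {stem: leaves_by_stem.get(stem, [])
            for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1)}

# Hypothetical heights in cm (made up for illustration).
heights = [157, 123, 159, 154, 172, 168, 159, 181, 165, 190, 176]

plot = stem_and_leaf(heights)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")

# Double-check that no number was missed.
print("Count check:", sum(len(leaves) for leaves in plot.values()), "of", len(heights))
```

The stems 13 and 14 are printed with no leaves, which mirrors step 3: draw the stem even when no data falls in it.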
### Dot plots
Again, we convert the dataset into a plot. Before you start, again, you should find the minimum and maximum number so you can draw the number line.


As before, group the numbers by value. A dot plot is normally used when there are fewer distinct values to manage, so each value gets its own position on the number line.
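
As a rough illustration, a dot plot can be printed as text. The sketch below uses made-up data (number of siblings for 15 students) and turns the plot on its side: each row is one position on the number line, and each `*` is one data point.

```python
from collections import Counter

# Hypothetical data: number of siblings reported by 15 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2, 4, 1, 2, 3, 1]

counts = Counter(siblings)
lowest, highest = min(siblings), max(siblings)   # ends of the number line

# One row per value on the number line, one star per occurrence.
rows = [f"{value} | " + "*" * counts[value] for value in range(lowest, highest + 1)]
print("\n".join(rows))
```
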
### Comparing stem-and-leaf plots with dot plots
**Stem-and-leaf plots**
Best for medium-sized quantitative (i.e., in number) datasets where you want to see the individual values and the overall shape. Also works well if you can categorize the data into groups (like 100s, 110s, etc). Works well for numbers with multiple digits.
**Dot plots**
Best for small-sized quantitative datasets, especially when dealing with discrete data or a small range of values.
## 8.05 Histograms
When numerical data has a large range of values, the numbers need to be grouped into class intervals for the frequency table. A histogram displays the frequency of each interval, which makes the shape of the dataset easier to see.
**Note that for whole-number data, 15-<20 is equivalent to 15-19.**
### [Tutorial] Constructing a Histogram
First, create a frequency table with intervals (similar to the stems of a stem-and-leaf plot, but you need to be more explicit about the interval range).
Let's use **Question 8** in the textbook as an example:
**"The age, in years, of employees of Burger Heaven were recorded as follows."**
18 19 18 17 20 20 24 15 24 19
15 40 21 17 20 22 23 21 24 23
34 19 45 20 15 21 24 27 19 33
34 24 16 18 30 21 26 31 16 25
49 21 21 35 16 22 15 25 44 23.
Frequency table:
| Interval | Frequency |
|:---------|:----------:|
| 15-19 | 16 |
| 20-24 | 20 |
| 25-29 | 4 |
| 30-34 | 5 |
| 35-39 | 1 |
| 40-44 | 2 |
| 45-49 | 2 |
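
If you want to check the tally, the counting can be sketched in Python. The ages are copied from the question. The class width of 5 and the starting point of 15 follow the table, so the last class is printed as 45-49. The variable names are only illustrative.

```python
ages = [
    18, 19, 18, 17, 20, 20, 24, 15, 24, 19,
    15, 40, 21, 17, 20, 22, 23, 21, 24, 23,
    34, 19, 45, 20, 15, 21, 24, 27, 19, 33,
    34, 24, 16, 18, 30, 21, 26, 31, 16, 25,
    49, 21, 21, 35, 16, 22, 15, 25, 44, 23,
]

WIDTH = 5    # class width in years
START = 15   # lower bound of the first interval

table = {}
for lower in range(START, max(ages) + 1, WIDTH):
    upper = lower + WIDTH - 1   # inclusive upper bound, valid for whole numbers
    table[f"{lower}-{upper}"] = sum(lower <= age <= upper for age in ages)

for label, frequency in table.items():
    print(f"{label} | {frequency}")
print("Total:", sum(table.values()))
```

The total should equal the number of employees (50), which is a quick way to confirm that no age was counted twice or missed.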
However, the histogram chart in Google Sheets works from the raw values, not from the frequency table. You will have to enter the numbers manually, like this:

Once done, select all the numbers, then click `Insert` -> `Chart`.
You might get a histogram with odd values on the x-axis. If so, click `Customise` in the chart settings panel, navigate to `Histogram`, and change the bucket size to 5 to create intervals of 5.

## 8.06 Distribution of Numerical Data
#### Terms:
* Symmetrical
A **symmetrical** distribution is one where the left and right sides of the distribution are mirror images of each other around a central point. In a perfectly symmetrical distribution with a single peak, the **mean**, **median**, and **mode** are all the same.
* Bimodal
A **bimodal** distribution has two distinct peaks, or **modes**. This often suggests that the dataset consists of two different subgroups.

* Skew
**Skewness** refers to the measure of asymmetry in a probability distribution. It indicates whether the data points are skewed to the left or right of the centre.
* Positively Skewed
A distribution is **positively skewed** (or **right-skewed**) when the tail on the right side of the distribution is longer or fatter than the left side. The bulk of the data is on the left. In this case, the **mean is usually greater than the median**, because the long right tail pulls the mean upwards.
* Negatively Skewed
A distribution is **negatively skewed** (or **left-skewed**) when the tail on the left side of the distribution is longer or fatter than the right side. The bulk of the data is on the right. In this case, the **mean is usually less than the median**, because the long left tail pulls the mean downwards.

* Outlier
An **outlier** is a data point that is significantly different from the other observations in a dataset. Outliers can be caused by measurement variability or experimental error and can affect statistical analyses.
* Centre
The **centre** describes the "middle" or a "typical" value of a dataset. The most common measures of centre are:
  * **Mean**: The average of all data points.
  * **Median**: The middle value when the data is sorted.
  * **Mode**: The most frequently occurring value.
* Spread
The **spread** (also known as **dispersion** or **variability**) describes how scattered or spread out the data points are. Common measures of spread include:
**Range**: The difference between the maximum and minimum values.
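
The link between skew and the measures of centre can be checked with a few numbers. The three datasets below are made up for illustration. They differ only in one extreme value, which is enough to move the mean while the median stays put.

```python
from statistics import mean, median

# Hypothetical, made-up datasets chosen to show each shape.
symmetrical = [11, 12, 12, 13, 13, 13, 14, 14, 15]
right_skewed = [11, 12, 12, 13, 13, 13, 14, 14, 24]   # long tail to the right
left_skewed = [2, 12, 12, 13, 13, 13, 14, 14, 15]     # long tail to the left

for name, data in [("symmetrical", symmetrical),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    spread = max(data) - min(data)   # range = maximum - minimum
    print(f"{name:<12} mean={mean(data):.2f} median={median(data)} range={spread}")
```

In the symmetrical case the mean and median agree. The extreme value on the right drags the mean above the median, and the extreme value on the left drags it below. The range also jumps as soon as one extreme value is present.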
#### Exercises:

I would always start with the calculation:
* Centre position = $\frac{Total\,Frequency+1}{2} = \frac{(2+7+11+6+7+5+4+6)+1}{2} = \frac{48+1}{2} = 24.5$, so the median lies between the 24th and 25th values, both of which fall in the 59g column. (Why 59g? Add up the frequencies from the left: 56->2, 57->2+7=9, 58->2+7+11=20, 59->2+7+11+6=26. The running total first passes 24.5 at 59g, so the centre must be there.)
Therefore, the histogram has a range of values of 56-64, with the centre (median) at 59g, found at position 24.5. The graph is **unimodal**, with one significant peak at 58g. Moreover, the graph is **slightly positively skewed**, ignoring the potential outlier, 64g.
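
The running-total method used above can be sketched in Python. The frequencies are read as 2, 7, 11, 6, 7, 5, 4, 6. The masses attached to each column are an assumption (consecutive whole grams starting at 56g), since the chart is not reproduced here. Only the first four columns affect the answer.

```python
# Column heights from the exercise. The masses assigned to the columns are an
# assumption (consecutive whole grams from 56 g); only the first four matter here.
frequencies = [2, 7, 11, 6, 7, 5, 4, 6]
first_mass = 56

total = sum(frequencies)            # total frequency
median_position = (total + 1) / 2   # position of the middle value

# Walk along the columns, keeping a running total of the frequencies.
running_total = 0
for offset, frequency in enumerate(frequencies):
    running_total += frequency
    if running_total >= median_position:
        median_mass = first_mass + offset
        break

print(f"Median position {median_position} falls in the {median_mass} g column")
```

This simple version assumes the values on either side of a ".5" position sit in the same column, which is true here (the 24th and 25th values are both in the 59g column). If they fell in different columns, the median would be the average of the two.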
## 8.07 Mean
Mean is the average of all data points.
Mean ($\bar{x}$) is given by this formula: $\bar{x} = \frac{sum\,of\,all\,values}{number\,of\,values} = \frac{\Sigma x}{n}$
Whenever you see $\Sigma$ in math, it represents "Sum of"
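
As a quick illustration of the formula, here is the same calculation in Python on a small made-up dataset. `sum(values)` plays the role of $\Sigma x$ and `len(values)` plays the role of $n$.

```python
values = [3, 3, 4, 5, 5]   # hypothetical data

total = sum(values)   # sum of all values
n = len(values)       # number of values
mean = total / n

print(f"sum = {total}, n = {n}, mean = {mean}")
```
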
## 8.08 Mode, Median
* **Median** is the **middle value** when the data is sorted (for example, 3.5 in 1, 3, 4, 5; 4 in 3, 3, 4, 5, 5).
* **Mode** is the **most frequently occurring** value (e.g., 5 and 8 in 3, 5, 5, 6, 7, 8, 8).
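
The examples above can be checked with Python's built-in `statistics` module. Note that `multimode` needs Python 3.8 or later. It returns every value tied for most frequent, which is useful when a dataset has more than one mode.

```python
from statistics import median, multimode

print(median([1, 3, 4, 5]))               # even count: average of the two middle values
print(median([3, 3, 4, 5, 5]))            # odd count: the single middle value
print(multimode([3, 5, 5, 6, 7, 8, 8]))   # every value tied for most frequent
```
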
## 8.09 Range, IQR
The mean, median and mode are measures of centre for a data set. There are three summary statistics that are measures of spread: the range, the interquartile range ($IQR$) and the standard deviation ($std.\,dev$ or $\sigma$).
* The range represents the total spread of scores, but it is not a good measure of spread if there are outliers. The interquartile range is far less affected by outliers, because it measures the range of the middle half of the data.
#### Range
* Range = highest data value – lowest data value
For example, range of the dataset 21, 17, 10, 5, 9, 15, 23, 5, 7 is 23-5 = 18.
#### IQR
The IQR is given by: $Q_3 - Q_1$, where
* $Q_1$: The median of the lower group of the dataset (separated by $Q_2$)
* $Q_2$: The median of the entire dataset
* $Q_3$: The median of the upper group of the dataset (separated by $Q_2$)

---

$Q_1 = (13 + 13) / 2 = 13$, $Q_2 = 15$, $Q_3 = (16 + 16) / 2 = 16$, so $IQR = 16 - 13 = 3$

$Q_1 = 22$, $Q_2 = 24.5$, $Q_3 = 27$, so $IQR = 27 - 22 = 5$
---
Let's use this example:

**a.** Sort the data set in ascending order first. We get:
$15, 26, 30, 34, 35, 37, 37, 38, 43, 44, 45, 46, 48, 52, 61$
The median is the value in the middle, i.e., the 8th of the 15 values: $38$
**b.** Mean = Sum of all values/number of values = 591/15 = 39.4
**c.** Q1 = 34, Q2 = 38, Q3 = 46, so IQR = Q3 - Q1 = 12
**d.** Range = Max. - Min. = 61 - 15 = 46
**e.** The median and IQR are more appropriate to use, as the data set has outliers, which distort the mean and the range.
**f.** The middle score is 38 seconds, so half of the group scored above it and half below. The IQR is 12 seconds, so the middle 50% of the group completed the test within 12 seconds of each other.
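
To double-check the answers to parts a to d, here is a minimal Python sketch. It follows the method from these notes: $Q_1$ and $Q_3$ are the medians of the lower and upper groups, leaving out the middle value when the count is odd. The variable names are only illustrative.

```python
from statistics import mean, median

times = sorted([15, 26, 30, 34, 35, 37, 37, 38, 43, 44, 45, 46, 48, 52, 61])

n = len(times)
lower_half = times[: n // 2]         # values below the median position
upper_half = times[(n + 1) // 2:]    # values above the median position

q1, q2, q3 = median(lower_half), median(times), median(upper_half)
iqr = q3 - q1
data_range = max(times) - min(times)

print(f"Median = {q2}, Mean = {mean(times)}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}, Range = {data_range}")
```

Be aware that calculators and spreadsheet functions may use slightly different quartile rules, so their $Q_1$ and $Q_3$ can differ a little from the median-of-halves method on other datasets.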