---
title: Introduction
description:
duration: 300
card_type: cue_card
---
### **Introduction to Data Visualization** (5-10 minutes)
#### Plots Presentation:
<https://docs.google.com/presentation/d/1DkLTjTe6YmGbDHtr4v9Jso553DlCuP3cfSnwvUN1mgE/edit?usp=sharing>
### Summary/Agenda
#### Where is Data Visualization helpful, and why?
- Exploratory - EDA
- Explanatory - Storytelling
#### What is the Science in Data Visualization?
- Anatomy of a plot/chart
- How to use the right plot/chart for given data?
#### What is the Art in Data Visualization?
- Choose the right scale, labels, tick labels
- Identify and remove clutter in the plot
- Ways to highlight information in the plot
---
title: Libraries for data visualization
description:
duration: 700
card_type: cue_card
---
### **Python libraries for Data Visualization**
#### Importing Matplotlib and Seaborn
We don't need to import the entire library, just its submodule `pyplot`.
We'll use the **alias name `plt`**.
#### What is `pyplot`?
- `pyplot` is a **sub-module for visualization** in `matplotlib`
- Think of it as a **high-level API** which **makes plotting an easy task**
- Data Scientists **stick to using `pyplot` only unless** they want to
create **something totally new**.
For seaborn, we will import the whole library under the alias `sns`
#### What is seaborn?
Seaborn is another visualization library which uses matplotlib in the backend for plotting
#### What, then, is the major difference between matplotlib and seaborn?
- Seaborn uses **fascinating themes** and **reduces the number of code lines** by doing a lot of work in the backend
- Matplotlib is used to **plot basic plots and add more functionality** on top of them
- Seaborn is built on top of Pandas and Matplotlib
As we proceed through the lecture, we will see the difference between the two libraries
``` python=
import matplotlib.pyplot as plt
import seaborn as sns
```
Before we dive into learning these libraries, let's answer some general questions
#### Why do we even need to visualize data? When do I even need to visualise?
- It helps us to understand the data in a pictorial format
- It's **extensively used in Exploratory Data Analysis**
- And also to **present performance results of our models**
Two reasons/scopes
- **Exploratory** - I can't see certain patterns just by crunching numbers (avg, rates, %ages)
- **Explanatory** - I have the numbers crunched and the insights ready, but I'd like a visual for storytelling
#### Data
- Rows: Samples, Data-points, Records
- Columns: Features, Variables
#### How many kinds of data do we have?
At the fundamental level, it's just two types:
- Numerical/Continuous
- Categorical
Categorical can be further divided into:
- **Ordinal:** Categorical Data with an order (E.g. low, medium, high)
- **Non-ordinal/nominal:** Categorical Data without any order (E.g. gender as Male/Female)
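As a rough illustration of the ordinal/nominal distinction, the sketch below encodes both kinds of categorical column in pandas. The `toy` DataFrame and its column names are made up for this example and are not part of the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy data, not part of the video game dataset
toy = pd.DataFrame({
    "difficulty": ["low", "high", "medium", "low"],   # ordinal
    "platform": ["PC", "Wii", "PS4", "PC"],           # nominal
    "sales": [1.2, 0.4, 3.1, 0.9],                    # numerical
})

# Ordinal: tell pandas the order of the categories explicitly
toy["difficulty"] = pd.Categorical(
    toy["difficulty"], categories=["low", "medium", "high"], ordered=True
)
# Nominal: categories exist, but no order is attached
toy["platform"] = toy["platform"].astype("category")

# Ordered categories support min/max and sort in their declared order
print(toy["difficulty"].min(), toy["difficulty"].max())
print(toy.sort_values("difficulty")["difficulty"].tolist())
```

Sorting the ordinal column follows low, medium, high rather than alphabetical order, which is what makes the order meaningful.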
---
title: Video Game Data Intro
description:
duration: 500
card_type: cue_card
---
### **Video Games Analysis** (5 minutes)
You are a data scientist at "**Tencent Games**".
You need to analyze what kind of games they should start creating to get higher success in the market.
#### **Downloading the Dataset**
Code:
``` python=
!gdown "https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?1670840166" -O final_vg.csv
```
> Output:
```
Downloading...
From: https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?1670840166
To: /content/final_vg.csv
100% 2.04M/2.04M [00:00<00:00, 29.4MB/s]
```
#### **Importing Dataset using Pandas**
Code:
``` python=
import pandas as pd
import numpy as np
data = pd.read_csv('final_vg.csv')
data.head()
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/588/original/Screenshot_2023-09-25_at_10.03.37_PM.png?1695659848" width="600" height="300">
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/589/original/Screenshot_2023-09-25_at_10.05.12_PM.png?1695659890" width="600" height="300">
If you notice,
- Columns like `Platform`, `Genre` are categorical
- While columns like `NA_Sales`, `Global_Sales`, `Rank` are continuous
On noticing further,
- `Platform` is of nominal type, no proper order between the categories
- `Year` is of ordinal type, there's an order to the categories
---
title: Intro to matplotlib
description:
duration: 900
card_type: cue_card
---
### **Introduction to Matplotlib** (10-15 minutes)
#### Let's learn to create a basic plot using plt
Now say, we want to draw a curve passing through 3 points:
- (0, 3)
- (1, 5)
- (2, 9)
#### How can we draw a curve using matplotlib?
By using the `plt.plot()` function
Code:
``` python=
x_val = [0, 1, 2]
y_val = [3, 5, 9]
plt.plot(x_val, y_val)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/590/original/Screenshot_2023-09-25_at_10.10.30_PM.png?1695660068" width="400" height="300">
#### What can we observe from this plot?
- `plt.plot()` automatically decided the scale of the plot
- It also prints the **type of object** `matplotlib.lines.Line2D`
While this command decided a lot of things for you, you can customise each of these by understanding **components of a matplotlib plot**
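To make the printed object type concrete, here is a small sketch that captures what `plt.plot()` returns instead of letting the notebook print it. The variable names `lines` and `line` are chosen only for this illustration.

```python
import matplotlib.pyplot as plt

x_val = [0, 1, 2]
y_val = [3, 5, 9]

# plt.plot returns a list with one Line2D object per curve drawn
lines = plt.plot(x_val, y_val)
line = lines[0]

print(type(line).__name__)                  # name of the returned object's class
print(line.get_xdata(), line.get_ydata())   # the data the line was built from
```

Holding on to the `Line2D` object is one way to inspect or change a curve after it has been drawn.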
#### **Anatomy of Matplotlib**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/591/original/Screenshot_2023-09-25_at_10.12.20_PM.png?1695660180" width="700" height="400">
Woah! There is a lot of information in this image. Let's understand them one at a time.
- **Figure**: The **overall window** or page that everything is drawn on.
    - You can create multiple independent Figures in Jupyter.
    - If you run the code in a terminal, separate windows will pop up.
- **Axes**: The **area** on which the **data is plotted** with functions such as `plot()`. You can add multiple Axes to a Figure, and each one represents a plot.
    - **Axis**: Simply the `x-axis` and `y-axis`
    - **x-label**: Name of the x-axis
    - **y-label**: Name of the y-axis
    - **Major ticks**: Subdivide the axis into major units. They appear by default during plotting.
    - **Minor ticks**: Subdivide the major tick units. They are hidden by default and can be toggled on.
    - **Title**: Title of each plot **(Axes)**, giving information about it
    - **Legend**: Describes the elements in the plot, the blue and green curves in this case
- **Suptitle**: The common title of all the plots
These are the major components of a matplotlib plot
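The components listed above can be sketched in code as follows. This is a minimal example with made-up data and titles; it builds one Figure with two Axes and sets each named component through matplotlib's object-oriented interface.

```python
import matplotlib.pyplot as plt

# One Figure holding two Axes side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot([0, 1, 2], [3, 5, 9], color="blue", label="curve A")
ax1.plot([0, 1, 2], [9, 5, 3], color="green", label="curve B")
ax1.set_title("Left plot")       # Title of this Axes
ax1.set_xlabel("x values")       # x-label
ax1.set_ylabel("y values")       # y-label
ax1.legend()                     # Legend built from the label= arguments
ax1.minorticks_on()              # Minor ticks are hidden by default

ax2.plot([0, 1, 2], [1, 4, 2])
ax2.set_title("Right plot")

# Suptitle: one common title for the whole Figure
fig.suptitle("Common title for both plots")
```

Each `ax.set_...` call has a `plt.` counterpart (for example `plt.title`) that acts on the currently active Axes.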
#### Now, how to choose the right plot?
First, it depends on your question of interest.
When the question is clear:
- How many variables are involved?
- Whether the variable(s) are numerical or categorical?
#### How many variables are involved?
- 1 Variable - Univariate Analysis
- 2 Variables - Bivariate Analysis
- 2+ Variables - Multivariate Analysis
PS: Bivariate counts under multivariate, but let's keep it separate for ease of communication
What are the possible cases?
#### Univariate
- Numerical
- Categorical
#### Bivariate
- Numerical-Numerical
- Numerical-Categorical
- Categorical-Categorical
#### Multivariate
Let's start with 3 and then we can generalize
- Numerical-Numerical-Categorical
- Categorical-Categorical-Numerical
- Categorical-Categorical-Categorical
- Numerical-Numerical-Numerical
We will work on these one by one
---
title: Univariate Data Visualization of categorical data
description:
duration: 1800
card_type: cue_card
---
### **Univariate Data Visualization - Categorical Data**
#### What kind of questions might we want to ask about a categorical variable?
Questions like:
- What is the Distribution/Frequency of the data across different categories?
- What proportion does a particular category constitute?
Let's take the categorical column "Genre"
#### How can we find the top-5 genres?
Recall, how could we get this data using pandas?
Code:
``` python=
cat_counts = data['Genre'].value_counts()
cat_counts
```
> Output:
```
Action 3316
Sports 2400
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64
```
#### Now what kind of plot can we use to visualize this information?
- We can perhaps plot categories on X-axis and their corresponding frequencies on Y-axis
- Such a chart is called a Bar Chart or a Count Plot
- We can also plot it horizontally when there are many categories
### **Bar Chart** (25-30 minutes)
The data is binned here into categories
#### How can we draw a Bar plot ?
Using `plt.bar()`
Code:
``` python=
x_bar=cat_counts.index
y_bar=cat_counts
plt.bar(x_bar,y_bar)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/592/original/Screenshot_2023-09-25_at_10.17.52_PM.png?1695660555" width="400" height="300">
The names seem to be overlapping a lot
#### How can we handle overlapping labels?
- Maybe decrease the font size (not preferred though)
- Or maybe increase the figure size
- Or rotate the labels
#### How can we change the plot size?
Code:
``` python=
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/593/original/Screenshot_2023-09-25_at_10.21.30_PM.png?1695660708" width="500" height="400">
#### And how can we rotate the tick labels, also maybe increase the fontsize of the same?
``` python
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
plt.xticks(rotation=90, fontsize=12)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/594/original/Screenshot_2023-09-25_at_10.22.44_PM.png?1695660782" width="500" height="400">
If you notice, the bars are fairly wide. By default, `plt.bar()` draws each bar with a width of **0.8**
#### Can we change the width of these bars?
Code:
``` python=
# same code
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2)
plt.xticks(rotation = 90, fontsize=12)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/595/original/Screenshot_2023-09-25_at_10.24.04_PM.png?1695660859" width="500" height="400">
#### What about any additional styling to add to the bars ?
- We can **change colour of bars**
- We can add a **title to the axes**
- We can also add x and y labels
``` python=
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/596/original/Screenshot_2023-09-25_at_10.25.34_PM.png?1695660948" width="500" height="400">
If you notice, some text is always printed before the plots.
This is the text representation of the object returned by the last line of the cell
#### How can we remove the text printed before the plot and just display the plot?
Code:
``` python=
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
plt.show()
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/597/original/Screenshot_2023-09-25_at_10.27.47_PM.png?1695661083" width="500" height="400">
#### How can we draw a bar-chart in Seaborn?
- In Seaborn, the same plot is called a **countplot**.
- Countplot automatically does the counting of frequencies for you
#### Why not called a barplot?
There is **another function** in Seaborn called **barplot, which has a different purpose**. We will discuss it later.
Code:
``` python=
sns.countplot(x = 'Genre', data = data, order=data['Genre'].value_counts().index, color='cornflowerblue')
plt.xticks(rotation=90)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/598/original/Screenshot_2023-09-25_at_10.29.16_PM.png?1695661171" width="500" height="400">
The top 5 genres are Action, Sports, Misc, Role-Playing, and Shooter
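Earlier it was noted that bars can also be drawn horizontally when there are many categories. A minimal sketch with `plt.barh()` is shown here; the counts are copied from the `value_counts()` output above (top 5 only), and the sorting step is just one possible design choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the value_counts() output above (top 5 only)
cat_counts = pd.Series(
    {"Action": 3316, "Sports": 2400, "Misc": 1739,
     "Role-Playing": 1488, "Shooter": 1310}
)

# Sort ascending so the largest bar ends up at the top of the chart
ordered = cat_counts.sort_values()

plt.figure(figsize=(8, 4))
bars = plt.barh(ordered.index, ordered.values, color="orange")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.title("Top 5 genres")
```

With horizontal bars the category names sit on the y-axis, so they stay readable without rotating the tick labels.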
### **Pie charts** (5-10 minutes)
#### What if instead of actual frequencies, I want to see the proportion of the categories relative to each other?
Say, we want to compare the distribution/proportion of sales across the different regions?
Which plot can we use for this?
A pie-chart!
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/280/original/download.png?1670819443" width="300" height="250">
Code:
``` python=
sales_data = data[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]
# Total sales per region: one value for each of the four columns
region_sales = sales_data.sum()
plt.pie(region_sales,
        labels=region_sales.index,
        startangle=90,
        explode=(0.2, 0, 0, 0))  # pull the first wedge (NA_Sales) out slightly
plt.show()
```
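Since a pie chart is about proportions, it can help to print the shares on the wedges. The sketch below uses the `autopct` parameter of `plt.pie()` for that. The regional totals are hypothetical numbers chosen for illustration, not the real totals from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical regional totals for illustration, not the real dataset values
region_sales = pd.Series(
    {"NA_Sales": 40.0, "EU_Sales": 30.0, "JP_Sales": 20.0, "Other_Sales": 10.0}
)

# Proportion of each region in the total
proportions = region_sales / region_sales.sum()
print(proportions)

# autopct prints each wedge's percentage share on the chart
wedges, texts, autotexts = plt.pie(
    region_sales,
    labels=region_sales.index,
    startangle=90,
    autopct="%1.1f%%",
)
```

The percentages drawn on the chart correspond to the values in `proportions`, multiplied by 100.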
---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---
# Question
In the state “Haryana”, we want to find the proportion of people who smoke. Which will be the preferred plot?
# Choices
- [x] Pie Chart
- [ ] Bar Chart
- [ ] Count Plot
- [ ] BoxPlot
---
title: univariate data visualisation of Numerical Data
description:
duration: 1800
card_type: cue_card
---
### **Univariate Data Visualisation - Numerical Data**
#### What kind of questions might we have regarding a numerical variable?
1. How is the data distributed? Say distribution of number of games published in a year.
2. Is the data skewed? Are there any outliers? - Extremely high selling games maybe?
3. What percentage of the data is below/above a certain number?
4. Some special numbers - Min, Max, Mean, Median, nth percentile?
Now say, you want to find the distribution of games released every year.
Unlike a bar plot, **to see the distribution we will need to `bin` the data**.
**How can we understand popularity of video games year by year?**
#### **Histogram**
Code:
``` python=
plt.hist(data['Year'])
plt.show()
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/599/original/Screenshot_2023-09-25_at_10.36.43_PM.png?1695661619" width="500" height="400">
- The distribution is left skewed, with a lot more games being published in 2005-2015
- This shows that games became highly popular in the last 1-2 decades, which could point to increased internet usage worldwide!
If you notice, histograms are basically frequency charts
We can also vary the number of bins. The **default number of bins is 10**.
So to see this data per decade, we need to split the 40 years into 4 bins.
Code:
``` python=
plt.hist(data['Year'], bins=4)
plt.show()
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/600/original/Screenshot_2023-09-25_at_10.38.09_PM.png?1695661717" width="500" height="400">
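`bins=4` splits the observed range into four equal-width bins, which lines up with decades only when the data happens to span exactly 1980 to 2020. A sketch of an alternative is to pass the bin edges explicitly. The `years` list here is hypothetical and stands in for `data['Year']`.

```python
import matplotlib.pyplot as plt

# Hypothetical release years, standing in for data['Year']
years = [1981, 1985, 1992, 1999, 2003, 2005, 2008, 2009, 2011, 2016]

# Explicit edges: 1980-1990, 1990-2000, 2000-2010, 2010-2020
decade_edges = [1980, 1990, 2000, 2010, 2020]

count, bins, _ = plt.hist(years, bins=decade_edges)
print(count)   # number of games falling in each decade
print(bins)    # the edges we passed in
```

Passing a list to `bins` fixes the boundaries regardless of the minimum and maximum values in the data.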
We can also get the data of each bin, such as range of the boundaries, values, etc.
Code:
``` python=
count, bins, _ = plt.hist(data['Year'])
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/601/original/Screenshot_2023-09-25_at_10.38.19_PM.png?1695661762" width="500" height="400">
``` python
count
```
> Output:
```
array([ 112., 70., 92., 449., 1274., 2440., 3921., 5262., 2406.,
355.])
```
``` python=
bins
```
> Output:
```
array([1980., 1984., 1988., 1992., 1996., 2000., 2004., 2008., 2012.,
2016., 2020.])
```
#### Now what do these `count` and `bins` mean?
- **bins** provides the bin edges
- **count** provides the corresponding counts
#### What is the length of `count`?
10
#### What should be the length of `bins`?
10 + 1 = 11
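The relationship between the two lengths can be checked without drawing anything, using `np.histogram()`, which performs the same kind of binning. The `years` array is hypothetical and only stands in for `data['Year']`.

```python
import numpy as np

# Hypothetical release years, standing in for data['Year']
years = np.array([1980, 1984, 1991, 1996, 2001, 2004, 2007, 2010, 2015, 2020])

# Same binning logic as a histogram plot, but without drawing anything
count, bins = np.histogram(years, bins=10)

# n bins are delimited by n + 1 edges
print(len(count), len(bins))

# Width of every bin: (max - min) / number of bins
print(np.diff(bins))
```

Every value lands in exactly one bin, so the counts add up to the number of data points.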
#### How can we plot a histogram in Seaborn?
``` python=
sns.histplot(data['Year'], bins=10)
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/602/original/Screenshot_2023-09-25_at_10.40.49_PM.png?1695661866" width="600" height="400">
Notice,
- The boundaries are more defined than in matplotlib's plot
- The x and y axis are labelled automatically
---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---
# Question
For analyzing marks at an edtech company, we want to find the range of marks in which the largest number of students have scored.
Which one will be the best plot ?
# Choices
- [x] Histogram
- [ ] Line Chart
- [ ] Bar Chart
- [ ] BoxPlot
---
title: Quiz-2 Explanation, Univariate data viz of Num Data (contd.)
description:
duration: 900
card_type: cue_card
---
#### Quiz-2 Explanation
Since we want to find the number of students in a bin, we have to study the distribution of student scores, which is a numerical variable. Hence we will use a histogram.
#### **Kernel Density Estimate (KDE) Plot**
- A KDE plot, similar to a histogram, is a method for visualizing distributions
- But instead of bars, KDE represents data using a **continuous probability density curve**
#### Now, Why do we even need KDE plots?
- Compared to histogram, KDE produces a plot which is **less cluttered** and **more interpretable**
- Think of it as a **smoothened version** of histogram
Let's plot KDE using `seaborn`'s `kdeplot`
Code:
``` python=
sns.kdeplot(data['Year'])
```
> Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/603/original/Screenshot_2023-09-25_at_10.44.26_PM.png?1695662080" width="600" height="400">
#### Can you notice the difference between KDE and histogram?
Y-Axis has **probability density estimation** instead of count
You can read more about this on:
<https://en.wikipedia.org/wiki/Kernel_density_estimation>
<https://www.youtube.com/watch?v=DCgPRaIDYXA>
**`Instructor Note:`**
Just for a brief idea, to find the probability, say probability of `Year` from 1990-2000, we will need to find the `area under the curve` for that interval
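The "area under the curve" idea can be sketched with plain NumPy. This is an illustrative Gaussian kernel density estimate written by hand, with the bandwidth picked by Scott's rule of thumb as an assumption. It is not claimed to reproduce seaborn's `kdeplot` exactly, and the `years` array is hypothetical.

```python
import numpy as np

# Hypothetical release years, standing in for data['Year']
years = np.array([1985, 1992, 1996, 1998, 2001, 2003, 2005, 2007, 2008,
                  2009, 2010, 2011, 2013, 2015], dtype=float)

# Bandwidth by Scott's rule of thumb (an assumption made for this sketch)
h = years.std(ddof=1) * len(years) ** (-1 / 5)

def kde(x):
    """Gaussian kernel density estimate evaluated at the points in x."""
    z = (x[:, None] - years[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(years) * h * np.sqrt(2 * np.pi))

def area(lo, hi, n=2001):
    """Area under the KDE curve between lo and hi (trapezoidal rule)."""
    grid = np.linspace(lo, hi, n)
    dens = kde(grid)
    return float(np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid)))

# Estimated probability that Year falls between 1990 and 2000
p_90s = area(1990, 2000)
print(p_90s)

# Over (practically) the whole range, the area is close to 1
total = area(years.min() - 10 * h, years.max() + 10 * h)
print(total)
```

The y-axis values of a KDE plot are densities, so only areas over an interval, not single heights, can be read as probabilities.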
#### **Boxplot**
Now say I want to find the typical earnings of a game when it is published.
Or maybe find the aggregates like median, min, max and percentiles of the data.
#### What kind of plot can we use to understand the typical earnings from a game?
Box Plot
#### What exactly is a Box Plot?
- A box plot or **box-and-whisker plot** shows the **distribution of quantitative data**
- It facilitates comparisons
    - between attributes
    - across levels of a categorical attribute
The **box**: Shows the **quartiles** of the dataset
The **whiskers**: Show the **rest of the distribution**
#### Box plots show the five-number summary of data:
1. Minimum score
2. First (lower) quartile
3. Median
4. Third (upper) quartile
5. Maximum score
#### **Diagram**
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/627/original/iqr3.jpg?1706525143" width="650" height="200">
#### **Minimum Score**
- It is the **lowest value**, excluding outliers
- It is shown at the **end of the lower whisker**
#### **Lower Quartile**
- **25% of values** fall below the lower quartile value
- It is also known as the **first quartile**.
#### **Median**
- Median marks the **mid-point of the data**
- **Half the scores are greater than or equal to this value and half are less**.
- It is sometimes known as the **second quartile**.
#### **Upper Quartile**
- **75% of the values fall below the upper quartile value**
- It is also known as the **third quartile**.
#### **Maximum Score**
- It is the **highest value**, excluding outliers
- It is shown at the **end of the upper whisker**.
#### **Whiskers**
- The upper and lower whiskers represent **values outside the middle 50%**
- That is, the **lower 25% of values** and the **upper 25% of values**.
#### **Interquartile Range (or IQR)**
- This is the **box** of the box plot, showing the **middle 50% of scores**
- It is the **range between the 25th and 75th percentile**.
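The quantities defined above can be computed directly with NumPy. The sketch below uses a hypothetical `sales` array and assumes the common convention that whiskers reach at most 1.5 × IQR beyond the box, with anything further away treated as an outlier.

```python
import numpy as np

# Hypothetical sales figures (in millions), with one extreme value
sales = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.5, 12.0])

q1, median, q3 = np.percentile(sales, [25, 50, 75])
iqr = q3 - q1

# Common convention (assumed here): whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = sales[(sales >= lower_fence) & (sales <= upper_fence)]
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whisker ends:", inside.min(), inside.max())   # min/max excluding outliers
print("Outliers:", outliers)
```

The whisker ends are the minimum and maximum "excluding outliers" described above, so they can differ from the raw minimum and maximum of the data.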
#### Now, let's plot a box plot to find the typical earnings for a game
Code:
``` python=
plt.figure(figsize=(15,10))
sns.boxplot(y = data["Global_Sales"])
plt.yticks(fontsize=20)
plt.ylabel('Global Sales (in million dollars)', fontsize=20)
plt.title('Global Sales of video games', fontsize=20)
```
>Output:
<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/605/original/Screenshot_2023-09-25_at_10.48.46_PM.png?1695662360" width="500" height="400">
What can we infer from this?
The five-number summary (approx.) here is:
- Minimum, excluding outliers: 0
- Maximum, excluding outliers: 20 million dollars
- 25th percentile: 6 million
- Median: around 7 million
- 75th percentile: 12 million
There are quite a few outliers above 20 million dollars, represented by black diamonds
Key Takeaways:
Categorical - Barplot, Pie Chart
Numerical - Histogram, KDE, Boxplot
Can explore more types: Violin plot, bee-swarm plot, etc.
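The takeaways above can be written as a tiny helper that suggests plot types from a column's dtype. The function name `suggest_univariate_plots` and the toy DataFrame are hypothetical, introduced only for this sketch.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def suggest_univariate_plots(series):
    """Return plot types from the takeaways above for a single column."""
    if is_numeric_dtype(series):
        return ["histogram", "KDE plot", "box plot"]
    return ["bar chart / count plot", "pie chart"]

# Hypothetical toy columns for illustration
toy = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action"],
    "Global_Sales": [1.2, 0.4, 3.1],
})

for col in toy.columns:
    print(col, "->", suggest_univariate_plots(toy[col]))
```

A dtype check is only a first guess: a category stored as numbers would be reported as numerical, so the final choice still depends on what the column means.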
---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---
# Question
The telecom company "Airtel" wants to find the count of each payment mode opted for by its customers. Which will be the preferred plot?
# Choices
- [ ] Boxplot
- [ ] Pie Chart
- [x] Count Plot
- [ ] Line Plot
---
title: Quiz-3 Explanation
description:
duration: 60
card_type: cue_card
---
We are using a single variable, which is categorical in this case. To find the count, we will be using countplot