Data Visualization - 1 (Revamped)

---
title: Introduction
description:
duration: 300
card_type: cue_card
---

### **Introduction to Data Visualization** (5-10 minutes)

#### Plots Presentation:

<https://docs.google.com/presentation/d/1DkLTjTe6YmGbDHtr4v9Jso553DlCuP3cfSnwvUN1mgE/edit?usp=sharing>

### Summary/Agenda

#### Where is Data Visualization helpful? Why?

- Exploratory - EDA
- Explanatory - Storytelling

#### What is the Science in Data Visualization?

- Anatomy of a plot/chart
- How to use the right plot/chart for given data?

#### What is the Art in Data Visualization?

- Choose the right scale, labels, tick labels
- Identify and remove clutter in the plot
- Ways to highlight information in the plot

---
title: Libraries for data visualization
description:
duration: 700
card_type: cue_card
---

### **Python libraries for Data Visualization**

#### Importing Matplotlib and Seaborn

We don't need to import the entire matplotlib library, just its submodule `pyplot`.

We'll use the **alias name `plt`**.

#### What is `pyplot`?

- `pyplot` is a **sub-module for visualization** in `matplotlib`
- Think of it as a **high-level API** which **makes plotting an easy task**
- Data Scientists **stick to using `pyplot` only unless** they want to create **something totally new**.

For seaborn, we will import the whole library under the alias `sns`.

#### What is seaborn?

Seaborn is another visualization library which uses matplotlib in the backend for plotting.

#### What is the major difference between matplotlib and seaborn?
- Seaborn uses **fascinating themes** and **reduces the number of code lines** by doing a lot of work in the backend
- While matplotlib is used to **plot basic plots and add more functionality** on top of that
- Seaborn is built on top of Matplotlib and works closely with Pandas data structures

As we proceed through the lecture, we will see the difference between the two libraries.

``` python=
import matplotlib.pyplot as plt
import seaborn as sns
```

Before we dive into learning these libraries, let's answer some general questions.

#### Why do we even need to visualize data? When do I even need to visualise?

- It helps us understand the data in a pictorial format
- It's **extensively used in Exploratory Data Analysis**
- And also to **present performance results of our models**

Two reasons/scopes:

- **Exploratory** - I can't see certain patterns just by crunching numbers (avg, rates, %ages)
- **Explanatory** - I have the numbers crunched and the insights ready, but I'd like a visual for storytelling

#### Data

- Rows: Samples, Data-points, Records
- Columns: Features, Variables

#### How many kinds of data do we have?

At the fundamental level, it's just two types:

- Numerical/Continuous
- Categorical

Categorical can be further divided into:

- **Ordinal:** Categorical Data with an order (E.g. low, medium, high)
- **Non-ordinal/nominal:** Categorical Data without any order (E.g. gender as Male/Female)

---
title: Video Game Data Intro
description:
duration: 500
card_type: cue_card
---

### **Video Games Analysis** (5 minutes)

You are a data scientist at "**Tencent Games**".

You need to analyze what kind of games they should start creating to get higher success in the market.

#### **Downloading the Dataset**

Code:

``` python=
!gdown https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?1670840166 -O final_vg.csv
```

> Output:

```
Downloading...
From: https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/299/original/final_vg1_-_final_vg_%281%29.csv?1670840166
To: /content/final_vg.csv
100% 2.04M/2.04M [00:00<00:00, 29.4MB/s]
```

#### **Importing Dataset using Pandas**

Code:

``` python=
import pandas as pd
import numpy as np

# Load the downloaded CSV and preview the first five rows
data = pd.read_csv('final_vg.csv')
data.head()
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/588/original/Screenshot_2023-09-25_at_10.03.37_PM.png?1695659848" width="600" height="300">

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/589/original/Screenshot_2023-09-25_at_10.05.12_PM.png?1695659890" width="600" height="300">

If you notice,

- Columns like `Platform`, `Genre` are categorical
- While columns like `NA_Sales`, `Global_Sales`, `Rank` are continuous

On noticing further,

- `Platform` is of nominal type, with no proper order between the categories
- `Year` is of ordinal type, there's an order to the categories

---
title: Intro to matplotlib
description:
duration: 900
card_type: cue_card
---

### **Introduction to Matplotlib** (10-15 minutes)

#### Let's learn to create a basic plot using plt

Now say, we want to draw a curve passing through 3 points:

- (0, 3)
- (1, 5)
- (2, 9)

#### How can we draw a curve using matplotlib?

By using the `plt.plot()` function

Code:

``` python=
x_val = [0, 1, 2]
y_val = [3, 5, 9]
plt.plot(x_val, y_val)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/590/original/Screenshot_2023-09-25_at_10.10.30_PM.png?1695660068" width="400" height="300">

#### What can we observe from this plot?
- `plt.plot()` automatically decided the scale of the plot
- It also prints the **type of object** `matplotlib.lines.Line2D`

While this command decided a lot of things for you, you can customise each of these by understanding the **components of a matplotlib plot**.

#### **Anatomy of Matplotlib**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/591/original/Screenshot_2023-09-25_at_10.12.20_PM.png?1695660180" width="700" height="400">

Woah! There is a lot of information in this image. Let's understand the components one at a time.

- Figure: The **overall window** or page that everything is drawn on.
    - You can create multiple independent Figures in Jupyter.
    - If you run the code in a terminal, separate windows will pop up
- Axes: To the figure you can add multiple **Axes**, each of which represents a plot
- **Axis**: Simply the `x-axis` and `y-axis`
- **Axes**:
    - It is the **area** on which the **data is plotted** with functions such as `plot()`
- **x-label**: Name of x-axis
- **y-label**: Name of y-axis
- **Major ticks**: subdivide the axis into major units. They appear by default during plotting
- **Minor ticks**: subdivide the major tick units. They are hidden by default and can be toggled on.
- **Title**: Title of each plot **(Axes)**, giving information about the same
- **Legend**: describes the elements in the plot, blue and green curves in this case
- **Suptitle**: The common title of all the plots

These are the major components of a matplotlib plot.

#### Now, how to choose the right plot?

First, it depends on your question of interest.

When the question is clear:

- How many variables are involved?
- Are the variable(s) numerical or categorical?

#### How many variables are involved?
- 1 Variable - Univariate Analysis
- 2 Variables - Bivariate Analysis
- 2+ Variables - Multivariate Analysis

PS: Bivariate counts under multivariate, but let's keep it separate for ease of communication.

What are the possible cases?

#### Univariate

- Numerical
- Categorical

#### Bivariate

- Numerical-Numerical
- Numerical-Categorical
- Categorical-Categorical

#### Multivariate

Let's start with 3 variables and then we can generalize.

- Numerical-Numerical-Categorical
- Categorical-Categorical-Numerical
- Categorical-Categorical-Categorical
- Numerical-Numerical-Numerical

We will work on these one by one.

---
title: Univariate Data Visualization of categorical data
description:
duration: 1800
card_type: cue_card
---

### **Univariate Data Visualization - Categorical Data**

#### What kind of questions may we want to ask for a categorical variable?

Questions like:

- What is the Distribution/Frequency of the data across different categories?
- What proportion does a particular category constitute?

Let's take the categorical column "Genre".

#### How can we find the top-5 genres?

Recall, how could we get this data using pandas?

Code:

``` python=
# Frequency of each genre, sorted from most to least common
cat_counts = data['Genre'].value_counts()
cat_counts
```

> Output:

```
Action          3316
Sports          2400
Misc            1739
Role-Playing    1488
Shooter         1310
Adventure       1286
Racing          1249
Platform         886
Simulation       867
Fighting         848
Strategy         681
Puzzle           582
Name: Genre, dtype: int64
```

#### Now what kind of plot can we use to visualize this information?

- We can perhaps plot categories on the X-axis and their corresponding frequencies on the Y-axis
- Such a chart is called a Bar Chart or a Count Plot
- The bars can also be drawn horizontally when there are many categories

### **Bar Chart** (25-30 minutes)

The data is binned here into categories.

#### How can we draw a Bar plot?
Using `plt.bar()`

Code:

``` python=
x_bar=cat_counts.index   # genre names
y_bar=cat_counts         # number of games per genre
plt.bar(x_bar,y_bar)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/592/original/Screenshot_2023-09-25_at_10.17.52_PM.png?1695660555" width="400" height="300">

The names seem to be overlapping a lot.

#### How can we handle overlapping labels?

- Maybe decrease the font size (not preferred though)
- Or maybe increase the figure size
- Or rotate the labels

#### How can we change the plot size?

Code:

``` python=
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/593/original/Screenshot_2023-09-25_at_10.21.30_PM.png?1695660708" width="500" height="400">

#### And how can we rotate the tick labels, and maybe also increase their font size?

``` python
plt.figure(figsize=(12,8))
plt.bar(x_bar,y_bar)
plt.xticks(rotation=90, fontsize=12)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/594/original/Screenshot_2023-09-25_at_10.22.44_PM.png?1695660782" width="500" height="400">

If you notice, each bar has the same width; the default width in `plt.bar()` is **0.8**.

#### Can we change the width of these bars?

Code:

``` python=
# same code, with a narrower bar width
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2)
plt.xticks(rotation = 90, fontsize=12)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/595/original/Screenshot_2023-09-25_at_10.24.04_PM.png?1695660859" width="500" height="400">

#### What about any additional styling to add to the bars?
- We can **change the colour of the bars**
- We can add a **title to the axes**
- We can also add x and y labels

``` python=
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/596/original/Screenshot_2023-09-25_at_10.25.34_PM.png?1695660948" width="500" height="400">

If you notice, some text is always printed before the plot. It is the representation of the object returned by the last plotting call, not part of the plot itself.

#### How can we remove the text printed before the plot and just display the plot?

Code:

``` python=
plt.figure(figsize=(10,8))
plt.bar(x_bar,y_bar,width=0.2,color='orange')
plt.title('Games per Genre',fontsize=15)
plt.xlabel('Genre',fontsize=12)
plt.ylabel('Count',fontsize=12)
plt.xticks(rotation = 90, fontsize=12)
plt.yticks(fontsize=12)
plt.show()  # renders the figure and suppresses the printed return value
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/597/original/Screenshot_2023-09-25_at_10.27.47_PM.png?1695661083" width="500" height="400">

#### How can we draw a bar-chart in Seaborn?

- In Seaborn, the same plot is called a **countplot**.
- Countplot automatically does the counting of frequencies for you

#### Why is it not called a barplot?
There is **another function** in Seaborn called **barplot, which has a different purpose** (discussed later).

Code:

``` python=
# order= sorts the bars from the most to the least frequent genre
sns.countplot(x = 'Genre', data = data, order=data['Genre'].value_counts().index, color='cornflowerblue')
plt.xticks(rotation=90)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/598/original/Screenshot_2023-09-25_at_10.29.16_PM.png?1695661171" width="500" height="400">

The top 5 genres are Action, Sports, Misc, Role-Playing, and Shooter.

### **Pie charts** (5-10 minutes)

#### What if, instead of the actual frequencies, I want to see the proportion of the categories relative to each other?

Say, we want to compare the distribution/proportion of sales across the different regions.

Which plot can we use for this?

A pie-chart!

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/021/280/original/download.png?1670819443" width="300" height="250">

Code:

``` python=
sales_data = data[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]
# Transpose so each region is a row, then sum across games: total sales per region
region_sales = sales_data.T.sum(axis='columns')
# explode pulls the first slice (NA_Sales) out of the pie to highlight it
plt.pie(region_sales, labels=region_sales.index, startangle=90, explode=(0.2,0,0,0))
plt.show()
```

---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---

# Question

In the state “Haryana”, we want to find the proportion of people who smoke. Which will be the preferred plot?

# Choices

- [x] Pie Chart
- [ ] Bar Chart
- [ ] Count Plot
- [ ] BoxPlot

---
title: univariate data visualisation of Numerical Data
description:
duration: 1800
card_type: cue_card
---

### **Univariate Data Visualisation - Numerical Data**

#### What kind of questions may we have regarding a numerical variable?

1. How is the data distributed? Say, the distribution of the number of games published in a year.
2. Is the data skewed? Are there any outliers? - Extremely high selling games maybe?
3. What percentage of the data is below/above a certain number?
4. Some special numbers - Min, Max, Mean, Median, nth percentile?
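Questions 3 and 4 can already be answered numerically with pandas before drawing any plot. The sketch below uses a small made-up `sales` Series (hypothetical values, not the video game dataset); the names `summary`, `threshold` and `pct_below` are illustrative only.

```python
import pandas as pd

# Hypothetical sales figures (in millions); NOT taken from final_vg.csv
sales = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 20.0])

# "Special numbers": min, max, mean, median and an nth percentile
summary = {
    "min": sales.min(),
    "max": sales.max(),
    "mean": sales.mean(),
    "median": sales.median(),
    "p90": sales.quantile(0.90),  # 90th percentile
}

# Share of data points below a chosen threshold, as a percentage
threshold = 3.0
pct_below = (sales < threshold).mean() * 100

print(summary)
print(f"{pct_below:.0f}% of values are below {threshold}")
```

In this toy data the mean sits well above the median because of the single very large value, which is the kind of skew and outlier behaviour that question 2 asks about.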
Now say, you want to find the distribution of games released every year.

Unlike a barplot, **to see the distribution we will need to `bin` the data**.

**How can we understand the popularity of video games year by year?**

#### **Histogram**

Code:

``` python=
plt.hist(data['Year'])
plt.show()
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/599/original/Screenshot_2023-09-25_at_10.36.43_PM.png?1695661619" width="500" height="400">

- The distribution is left skewed, with a lot more games being published in 2005-2015
- This shows that games became highly popular in the last 1-2 decades, which may point to increased usage of the internet worldwide!

If you notice, histograms are basically frequency charts.

We can also vary the number of bins; the **default number of bins is 10**.

So to see this data per decade, we need to put the 40 years into 4 bins.

Code:

``` python=
plt.hist(data['Year'], bins=4)
plt.show()
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/600/original/Screenshot_2023-09-25_at_10.38.09_PM.png?1695661717" width="500" height="400">

We can also get the data of each bin, such as the bin boundaries and the counts.

Code:

``` python=
# plt.hist returns the counts, the bin edges and the drawn bar objects
count, bins, _ = plt.hist(data['Year'])
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/601/original/Screenshot_2023-09-25_at_10.38.19_PM.png?1695661762" width="500" height="400">

``` python
count
```

> Output:

```
array([ 112.,   70.,   92.,  449., 1274., 2440., 3921., 5262., 2406.,  355.])
```

``` python=
bins
```

> Output:

```
array([1980., 1984., 1988., 1992., 1996., 2000., 2004., 2008., 2012., 2016., 2020.])
```

#### Now what do these `count` and `bins` mean?

- `bins` provides the bin edges
- `count` provides the corresponding number of values in each bin

#### What is the length of `count`?

10

#### What should be the length of `bins`?

10 + 1 = 11, because 10 bins need 11 edges

##### How can we plot a histogram in Seaborn?
``` python=
sns.histplot(data['Year'], bins=10)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/602/original/Screenshot_2023-09-25_at_10.40.49_PM.png?1695661866" width="600" height="400">

Notice,

- The bar boundaries are more clearly defined than in matplotlib's plot
- The x and y axes are labelled automatically

---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---

# Question

For analyzing marks at an edtech company, we want to find the range in which the largest number of students have scored. Which one will be the best plot?

# Choices

- [x] Histogram
- [ ] Line Chart
- [ ] Bar Chart
- [ ] BoxPlot

---
title: Quiz-2 Explanation, Univariate data viz of Num Data (contd.)
description:
duration: 900
card_type: cue_card
---

#### Quiz-2 Explanation

Since we want to find the number of students in a bin, we have to study the distribution of the students' scores, which is a numerical variable. Hence we will use a Histogram.

#### **Kernel Density Estimate (KDE) Plot**

- A KDE plot, similar to a histogram, is a method for visualizing distributions
- But instead of bars, KDE represents data using a **continuous probability density curve**

#### Now, why do we even need KDE plots?

- Compared to a histogram, KDE produces a plot which is **less cluttered** and **more interpretable**
- Think of it as a **smoothened version** of a histogram

Let's plot KDE using `seaborn`'s `kdeplot`

Code:

``` python=
sns.kdeplot(data['Year'])
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/603/original/Screenshot_2023-09-25_at_10.44.26_PM.png?1695662080" width="600" height="400">

#### Can you notice the difference between KDE and histogram?
The Y-axis shows the **estimated probability density** instead of the count.

You can read more about this on:

<https://en.wikipedia.org/wiki/Kernel_density_estimation>

<https://www.youtube.com/watch?v=DCgPRaIDYXA>

**`Instructor Note:`** Just for a brief idea: to find a probability, say the probability of `Year` being between 1990 and 2000, we need to find the `area under the curve` for that interval.

#### **Boxplot**

Now say I want to find the typical earnings of a game when it is published.

Or maybe find aggregates like the median, min, max and percentiles of the data.

#### What kind of plot can we use to understand the typical earnings from a game?

Box Plot

#### What exactly is a Box Plot?

- A box plot or **box-and-whisker plot** shows the **distribution of quantitative data**
- It facilitates comparisons between attributes or across levels of a categorical attribute.

The **box**: Shows the **quartiles** of the dataset

The **whiskers**: Show the **rest of the distribution**

#### Box plots show the five-number summary of data:

1. Minimum score
2. First (lower) quartile
3. Median
4. Third (upper) quartile
5. Maximum score

#### **Diagram**

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/627/original/iqr3.jpg?1706525143" width="650" height="200">

#### **Minimum Score**

- It is the **lowest value**, excluding outliers
- It is shown at the **end of the lower whisker**

#### **Lower Quartile**

- **25% of values** fall below the lower quartile value
- It is also known as the **first quartile**.

#### **Median**

- Median marks the **mid-point of the data**
- **Half the scores are greater than or equal to this value and half are less**.
- It is sometimes known as the **second quartile**.

#### **Upper Quartile**

- **75% of the values fall below the upper quartile value**
- It is also known as the **third quartile**.

#### **Maximum Score**

- It is the **highest value**, excluding outliers
- It is shown at the **end of the upper whisker**.
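The five numbers described above can be computed directly. Here is a minimal sketch with NumPy on a made-up `scores` array (hypothetical values, not the video game dataset). It assumes the common 1.5 × IQR rule for deciding which points count as outliers, which is the default in matplotlib and seaborn box plots; IQR here is the distance between the lower and upper quartile.

```python
import numpy as np

# Hypothetical scores; NOT taken from the video game dataset
scores = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Lower quartile, median and upper quartile
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # spread of the middle 50% of the data

# Assumption: points more than 1.5 * IQR away from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inliers = scores[(scores >= lower_fence) & (scores <= upper_fence)]
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

# Whisker ends: lowest and highest values that are not outliers
minimum, maximum = inliers.min(), inliers.max()

print(minimum, q1, median, q3, maximum, outliers)
```

With this toy data the value 30 lies beyond the upper fence, so the upper whisker would end at 9 and 30 would be drawn as a separate outlier point.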
#### **Whiskers**

- The upper and lower whiskers represent **values outside the middle 50%**
- That is, the **lower 25% of values** and the **upper 25% of values**.

#### **Interquartile Range (or IQR)**

- This is the box in the box plot, showing the **middle 50% of scores**
- It is the **range between the 25th and 75th percentile**.

#### Now, let's plot a box plot to find the typical earnings for a game

Code:

``` python=
plt.figure(figsize=(15,10))
sns.boxplot(y = data["Global_Sales"])
plt.yticks(fontsize=20)
plt.ylabel('Global Sales (in million dollars)', fontsize=20)
plt.title('Global Sales of video games', fontsize=20)
```

> Output:

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/050/605/original/Screenshot_2023-09-25_at_10.48.46_PM.png?1695662360" width="500" height="400">

What can we infer from this?

The five-number summary (approx.) here is:

- Minimum, excluding outliers: 0
- Maximum, excluding outliers: 20 million dollars
- 25th percentile: 6 million
- Median: around 7 million
- 75th percentile: 12 million

There are quite a few outliers above 20 million dollars, represented by black diamonds.

Key Takeaways:

- Categorical - Barplot, Pie Chart
- Numerical - Histogram, KDE, Boxplot

You can explore more plot types: Violin plot, bee-swarm plot, etc.

---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---

# Question

The telecom company "Airtel" wants to find the count of each payment mode opted for by its customers. Which will be the preferred plot?

# Choices

- [ ] Boxplot
- [ ] Pie Chart
- [x] Count Plot
- [ ] Line Plot

---
title: Quiz-3 Explanation
description:
duration: 60
card_type: cue_card
---

We are using a single variable, which is categorical in this case. To find the count of each category, we will use a countplot.
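
To make the Quiz-3 reasoning concrete, here is a small sketch with a made-up `payment_mode` column (the DataFrame, column name and values are all hypothetical). `value_counts()` produces the per-category counts that a count plot draws as bars.

```python
import pandas as pd

# Hypothetical customer records; the column name and values are made up
customers = pd.DataFrame(
    {"payment_mode": ["UPI", "Card", "UPI", "Cash", "UPI", "Card"]}
)

# One categorical variable: count the occurrences of each category
mode_counts = customers["payment_mode"].value_counts()
print(mode_counts)

# With seaborn imported as sns, the same counts can be drawn in one call:
# sns.countplot(x="payment_mode", data=customers)
```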
