---
title: Virgil - EDA Visualization - S21 Data Viz Matplotlib
tags: Virgil, LearnWorld, EDAVisualization
---

<a target="_blank" href="https://colab.research.google.com/drive/1gWGBi_5w9n1vBzBY2qucroW2pVkeSTOw"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

# VISUALIZING WITH MATPLOTLIB AND PANDAS .plot()

<img src="https://matplotlib.org/_static/logo2.png" alt="matplotlib" width="50%"/>

Matplotlib is a package for data visualization in Python. The library is built on NumPy arrays and designed to work with the broader SciPy stack.

One of Matplotlib’s most important features is its **ability to play well with many operating systems and graphics backends**. Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you want. This cross-platform, everything-to-everyone approach has been one of the great strengths of Matplotlib. It has led to a large user base, which in turn has led to an active developer base and to Matplotlib’s powerful tools and ubiquity within the scientific Python world.

Matplotlib is designed to help users visualize data as easily as possible while keeping all the necessary control: you use relatively high-level commands most of the time and can still drop down to low-level commands when needed.

For that reason, everything in matplotlib is organized in a hierarchy. At the top of the hierarchy is the matplotlib "state-machine environment", which is provided by the [`matplotlib.pyplot`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) module. At this level, simple functions are used to add plot elements (lines, images, text, etc.) to the current axes in the current figure.
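To make the "current axes in the current figure" idea concrete, here is a minimal sketch. The data points and the title text are made up for illustration, and selecting the `Agg` backend is only an assumption so that the snippet also runs without a display; in a notebook you can drop that line.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

# pyplot keeps track of one "current" figure and one "current" axes.
fig = plt.figure(figsize=(4, 3))
ax = plt.subplot(111)

# These calls do not name a target: they act on the current axes.
plt.plot([1, 2, 3], [2, 4, 1])  # made-up data
plt.title("Drawn through pyplot")

# The same objects can be reached explicitly when finer control is needed.
same_figure = plt.gcf() is fig
same_axes = plt.gca() is ax
line_count = len(ax.lines)
title_text = ax.get_title()
print(same_figure, same_axes, line_count, title_text)
```

`plt.gcf()` and `plt.gca()` return the figure and axes that pyplot is currently drawing on, which is why the high-level calls above never need to mention `fig` or `ax`.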
**COMPONENTS OF A FIGURE**

<img src='https://i.imgur.com/AaqxaKy.png' width=600>

<img src="https://matplotlib.org/_images/anatomy.png" alt="drawing" width="600">

```python
# Libraries used throughout this notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

`matplotlib.pyplot` is a collection of command-style functions that make matplotlib work like **MATLAB**. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

```python
plt.figure()            # start a new, empty figure
plt.subplot(3, 4, 11)   # 11th position in a grid of 3 rows and 4 columns
plt.subplot(222)        # shorthand for plt.subplot(2, 2, 2)
```

## 1. FIGURE AND SUBPLOTS

### Create figure and subplots

👉 To create a figure and subplots:

```plt.figure()``` ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html

```plt.subplot()``` ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html

Nothing is drawn yet because we haven't plotted anything. Let's load the data and try plotting something.

```python
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
df.head()
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"> <th></th> <th>total_bill</th> <th>tip</th> <th>sex</th> <th>smoker</th> <th>day</th> <th>time</th> <th>size</th> </tr>
</thead>
<tbody>
<tr> <th>0</th> <td>16.99</td> <td>1.01</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
<tr> <th>1</th> <td>10.34</td> <td>1.66</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
<tr> <th>2</th> <td>21.01</td> <td>3.50</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
<tr> <th>3</th> <td>23.68</td> <td>3.31</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
<tr> <th>4</th>
<td>24.59</td> <td>3.61</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>4</td> </tr>
</tbody>
</table>
</div>

```python
# Create a new figure; it stays empty (0 Axes) until something is plotted
plt.figure()
```

<Figure size 432x288 with 0 Axes>

```python
# PLOT NUMBER OF BILLS BY DAY
sns.countplot(data=df, x='day')
```

<matplotlib.axes._subplots.AxesSubplot at 0x7efd355dd610>

```python
# Change the size of the figure
plt.figure(figsize=(10, 10))  # width, height in inches
sns.countplot(data=df, x='day')
plt.show()
```

***MULTIPLE SUBPLOTS IN ONE FIGURE***

Addition: line chart of average tip by day.

```python
# Define the figure
plt.figure(figsize=(10, 7))

plt.subplot(121)  # Layout with 1 row and 2 columns; this is position 1
sns.countplot(data=df, x='day')

plt.subplot(122)  # Same layout, position 2
sns.countplot(data=df, x='day')

plt.show()
```

🙋🏻‍♂️ **TASK: Create a new graph with 4 subplots that show:**
- Number of bills by day
- Number of bills by time
- Number of bills by gender
- Number of bills by table size

EXPECTED OUTPUT:

<img src="https://i.imgur.com/2ZRxk5b.png">

```python
# YOUR CODE HERE
```

### Adding titles

Now that we have all the graphs we need, let's give them names.

👉 To add a title:

```plt.suptitle()``` ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.suptitle.html

```plt.title()``` ▸ https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.title.html

```python
plt.figure(figsize=(10, 7))
plt.suptitle('TITLE OF THE WHOLE FIGURE')

# Plot on the first subplot
plt.subplot(121)
sns.countplot(data=df, x='day')
plt.title('Chart A')

# Plot on the second subplot
plt.subplot(122)
sns.countplot(data=df, x='day')
plt.title('Chart B')

plt.show()
```

## 2. FORMATTING AXIS & STYLING

Let's go back to our simple plot.

```python
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.show()
```

```python
# Set the y axis range from 0 to 60
plt.ylim(0, 60)
```

```python
# plt.title() requires the title text as its first argument
plt.title('Title goes here')
```

### plt.xlabel / plt.ylabel

Use `plt.xlabel()` and `plt.ylabel()` to change the name of each axis.
```python
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.xlabel("Days of Week")
plt.ylabel("Number of bills")
plt.show()
```

### plt.xticks / plt.yticks

```python
plt.figure(figsize=(5, 5))
sns.histplot(data=df, x='total_bill')
plt.show()
```

Working with a continuous axis:

```python
plt.figure(figsize=(5, 5))
sns.histplot(data=df, x='total_bill')
# Tick positions to display on the x axis
plt.xticks(ticks=[0, 5, 10, 15, 20, 25, 30, 35])
plt.show()
```

```python
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])
plt.show()
```

Working with a categorical axis:

```python
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])
# Use plt.xticks or plt.yticks to change the ticks:
# only positions 0, 2 and 3 keep a tick, each with a new label
plt.xticks(ticks=[0, 2, 3], labels=['Thursday', 'Saturday', 'Sunday'])
plt.show()
```

### plt.xlim / plt.ylim

```python
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])
# Use plt.ylim to limit or expand the range of the y axis
plt.ylim(0, 100)
plt.show()
```

🙋🏻‍♂️ **TASK: Draw a histogram that shows the distribution of total bill, limited to the range from 10 to 40**

```python
# YOUR CODE HERE
plt.figure(figsize=(7, 5))
sns.histplot(data=df, x='total_bill')
# What should I write here?
plt.show()
```

**How should I finish the above code?**

A. plt.xticks(10, 40)
B. plt.yticks(10, 40)
C. plt.xlim(10, 40)
D. plt.ylim(10, 40)

### plt.twinx() ▸ Shared Axis

Example: Create a shared axis plot with:
- A countplot of the number of bills by day
- A lineplot of the average tip by day
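As a warm-up for that example, here is a minimal sketch of what `plt.twinx()` does on its own: it adds a second axes that shares the x axis and becomes the current axes, so the next `plt.ylabel()` call labels the right-hand side. The numbers are made up and plain matplotlib is used instead of seaborn; the `Agg` backend line is only an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

days = ["Thur", "Fri", "Sat", "Sun"]
bill_counts = [12, 5, 20, 16]      # made-up counts
avg_tips = [2.5, 2.8, 3.1, 3.3]    # made-up averages

plt.figure(figsize=(6, 4))
plt.bar(days, bill_counts, color="lightgray")
plt.ylabel("Number of bills")      # labels the left y axis
left_axes = plt.gca()

right_axes = plt.twinx()           # second axes sharing x; now the current axes
plt.plot(days, avg_tips, color="red", marker="o")
plt.ylabel("Average tip")          # labels the right y axis

print(left_axes.get_ylabel(), "|", right_axes.get_ylabel())
```

Each y axis keeps its own scale, which is what makes it possible to show counts and averages in one chart. The restaurant version of this chart follows.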
```python
# Average tip per day, with 'day' as a regular column again
plot_data = df.groupby('day')['tip'].mean().reset_index()
plot_data
```

```python
plt.figure(figsize=(8, 8))

# Left y axis: number of bills
sns.countplot(data=df, x='day')
plt.ylabel('Number of bills')

# Right y axis: average tip
plt.twinx()
sns.lineplot(data=plot_data, x='day', y='tip')
plt.ylabel('Average Tip')

plt.show()
```

### Styling

Matplotlib offers many style sheets that you can try to style your report. Please refer to this documentation: https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html

```python
# Set up the theme
plt.style.use('default')

plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.show()
```

## 3. PANDAS BUILT-IN PLOT FUNCTION .plot( )

**When is it useful?**

☀︎ Quick visualization. I mean, QUICK!

☀︎ Line plot with time series data.

☀︎ Stacked bar, stacked area, stacked histogram.

**How to do it?**

▸ Simply get the plot table and call ```.plot()``` with ```kind=``` to specify the kind of chart.

```
'line' for line plots (default)
'bar' or 'barh' for bar plots
'hist' for histogram
'box' for boxplot
'kde' or 'density' for density plots
'area' for area plots
'scatter' for scatter plots
'hexbin' for hexagonal bin plots
'pie' for pie plots
```

When creating the ```plot_data``` table for plotting:

❗️ **The dimension you want on the axis must be the index of the plot_data table**.

Official Documentation ▸ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

Guide ▸ https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

Guide ▸ https://realpython.com/pandas-plot-python/

```python
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
df.head()
```

### Example 1: Univariate - Distribution of the total_bill column

```python
# Histogram
df['total_bill'].plot(kind='hist', color='red')
```

```python
# Boxplot
df['total_bill'].plot(kind='box')
```

### Example 2: Bivariate - Average total_bill by day.

Plot data can be computed with the groupby method.
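The reason a groupby result works as plot data is that the grouping column becomes the index, and `.plot()` uses the index for the category axis. A minimal sketch on a tiny made-up table (the `toy` frame and its values are assumptions for illustration, not the restaurant data; the `Agg` backend line is only there so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Made-up data with the same column names as the restaurant table
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0],
})

# After groupby, 'day' is the index of the resulting Series
plot_data = toy.groupby("day")["total_bill"].mean()
print(plot_data)

# .plot() puts the index on the category axis: one bar per day
ax = plot_data.plot(kind="bar")
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```

If `day` were left as a regular column (for example after `reset_index()`), `.plot()` would need to be told explicitly which column goes on which axis.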
```python
# Select the column before aggregating: calling .mean() on the whole
# frame fails on the text columns in recent pandas versions
df.groupby('day')['total_bill'].mean()
```

```python
# Bar
df.groupby('day')['total_bill'].mean().plot(kind='bar')
```

```python
# Barh (total instead of average)
df.groupby('day')['total_bill'].sum().plot(kind='barh');
```

```python
# Line (total instead of average)
df.groupby('day')['total_bill'].sum().plot(kind='line');
```

### Example 3: Multivariate - Average of total_bill by day and sex

Plot data can be computed with the pivot_table method.

```python
# Grouped bar chart
df.groupby(['day', 'sex'])['total_bill'].mean()
```

```python
plot_data = df.groupby(['day', 'sex'])['total_bill'].mean().reset_index()
plot_data
```

```python
sns.barplot(data=plot_data, x='day', y='total_bill', hue='sex')
```

```python
df.groupby(['day', 'sex'])['total_bill'].mean()
```

❗️ **If you have 2 dimensions, the plot_data table must be a pivot table!**

```python
# Stacked bar chart (drop stacked=True for a grouped bar chart)
pd.pivot_table(df, index='day', columns='sex', values='total_bill', aggfunc='mean').plot(kind='bar', stacked=True)
```

## SUMMARY

▸ **MATPLOTLIB**

```python
# Import matplotlib (and seaborn, used in the sample code block below)
import matplotlib.pyplot as plt
import seaborn as sns

# ------ FIGURE AND SUBPLOT -------

# Create a figure with size of 10 inches x 10 inches
plt.figure(figsize=(10, 10))

# Create a subplot at the first position in a grid of 1 row, 2 columns
plt.subplot(121)

# Create a subplot at the fourth position in a grid of 2 rows, 3 columns
plt.subplot(234)

# Name the figure
plt.suptitle("Name")

# Name the subplot
plt.title("Name")

# ------ FORMATTING AND STYLING -------

# Change axis name
plt.xlabel("Name")
plt.ylabel("Name")

# Change x axis ticks at the positions 0, 2, 3 to the new names "A", "B", "C"
plt.xticks(ticks=[0, 2, 3], labels=['A', 'B', 'C'])

# Limit the y axis to the range 20 to 40
plt.ylim(20, 40)

# Create a shared axis chart
plt.twinx()

# Change plotting theme
plt.style.use('theme_name')

# ------ A SAMPLE CODE BLOCK -------

# Visualize a shared axis chart with
# a countplot of number of bills by day
# and a lineplot of average tip by day

# Step 1: Prepare the data for the average tip by day
plot_data = \
df.groupby('day')['tip'].mean().reset_index()

# Step 2: Plot
plt.figure(figsize=(10, 10))
sns.countplot(data=df, x='day')
plt.twinx()
sns.lineplot(data=plot_data, x='day', y='tip')
plt.show()
```

▸ .PLOT( )

```python
# Example: plot a stacked bar from a table named "commerce".
commerce.plot(kind='bar', stacked=True, figsize=(10,10), color='blue')
```

❗️ Whatever you want as the dimension of your chart MUST BE the index of the table.

## 4. FURTHER READING ▸ ADDING TEXT

```plt.text( )``` ▸ https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html

Some important parameters:
- `x`: x location
- `y`: y location
- `s`: the message, as a string
- `horizontalalignment` or `ha`: "center", "right", or "left"
- `verticalalignment` or `va`: 'center', 'top', 'bottom', 'baseline', 'center_baseline'
- Any other formatting parameter can be passed in as a dictionary (`fontdict`). Please refer to the link above.

```python
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
df.head()
```

```python
# Center the text horizontally on x=0
plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.text(x=0, y=80, s="This is a string", ha='center', color='pink', fontsize='large')
plt.show()
```

🙋🏻‍♂️ **TASK: Find a way to put the value of each bar on top of the bar.**

```python
# Number of bills per day, in the same order as the bars
plot_data = df.groupby('day').count()[['size']].loc[['Thur', 'Fri', 'Sat', 'Sun']]
plot_data
```

```python
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='day', order=['Thur', 'Fri', 'Sat', 'Sun'])

# Bar i sits at x position i; write its count slightly above the bar
for i in range(plot_data.shape[0]):
    plt.text(i, plot_data.values[i][0] + 1, plot_data.values[i][0], ha='center')

plt.show()
```

## 5. FURTHER READING ▸ EXPORT

We can use `figure.Figure.savefig()` to save the figure to disk. Several useful flags are shown below:

* `transparent=True` makes the background of the saved figure transparent if the format supports it.
* `dpi=80` controls the resolution (dots per inch) of the output.
* `bbox_inches="tight"` fits the bounds of the figure to our plot.

Read more at: https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.savefig.html

```python
# First, assign the figure to a variable
fig = plt.figure(figsize=(5, 5))
sns.countplot(data=df, x='day')
plt.show()
```

```python
# Save the figure
fig.savefig('sales.png', transparent=False, dpi=80, bbox_inches="tight")
```
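To see the flags in action without the restaurant data, here is a minimal sketch that saves one figure in two formats and checks that the files exist. The bar values, the temporary output folder, and the file names are assumptions made for illustration; the `Agg` backend line is only there so the snippet runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

# Keep a reference to the figure so it can be saved later
fig = plt.figure(figsize=(5, 5))
plt.bar(["Thur", "Fri", "Sat", "Sun"], [12, 5, 20, 16])  # made-up counts
plt.title("Bills by day")

# Hypothetical output location: a throwaway temporary folder
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "sales.png")
pdf_path = os.path.join(out_dir, "sales.pdf")

# The output format is inferred from the file extension
fig.savefig(png_path, dpi=80, bbox_inches="tight")
fig.savefig(pdf_path, transparent=True, bbox_inches="tight")
plt.close(fig)

png_size = os.path.getsize(png_path)
print(png_path, png_size, "bytes")
```

A higher `dpi` gives a larger, sharper PNG; for vector formats such as PDF, `dpi` only matters for rasterized elements.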