{%hackmd theme-dark %}

# Data Visualization and Analysis in Python

In this post we will:

- Make a scatterplot
- Add a linear regression line to the chart
- Add labels to the data points in the chart
- Carry out correlation analysis
    + Add the correlation to the scatterplot

Most of the work here is based on the blog posts on [https://www.marsja.se](https://www.marsja.se).

## Python Packages: Pandas, Seaborn, Pingouin

In this post we will use the Python packages [Pandas](), [Seaborn](), and [Pingouin](). More specifically, we are going to use Pandas to load a dataset from a CSV file, Seaborn (and Matplotlib, actually) to carry out data visualization in Python, and Pingouin for the correlation analysis. Check the link above for Pandas dataframe tutorials, and to learn how to read data from different formats (CSV, Excel, SPSS, JSON, to name a few), carry out ANOVA, and much more.

### How to install packages using Pip

We may need to install these packages before we continue. Here's how to install them using pip:

```
pip install pandas seaborn pingouin
```

It is, of course, also possible to install a Python distribution (like [Anaconda](https://www.anaconda.com/)). This will get you Pandas and Seaborn (and their dependencies), but you may need to install Pingouin separately.

## Scatter Plots

A good way to illustrate a relationship between two variables is through a so-called scatterplot. In a scatterplot, each unit of analysis (for example, different measures of personality) is drawn as a point. One variable determines the position of the point on the X axis and another variable determines its position on the Y axis. A scatterplot shows, in the same way as a correlation measure, the bivariate relationship between two variables, but it can also reveal whether there are any outliers in the data. It can also make a relationship between two variables easier for the reader to understand.

## Scatterplots in Python using Seaborn

In this Seaborn scatterplot example, we are going to work with the *mtcars* dataset. It's an R dataset, but it can be loaded from a URL. To load data from a CSV file at a URL, we can use the Pandas `read_csv` function. In the last line of the code below, we display the first 5 rows of the Pandas dataframe.

```python
import pandas as pd
import seaborn as sns

# URL of the mtcars dataset in CSV format
data = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv'

# Use the first column of the CSV file as the index
df = pd.read_csv(data, index_col=0)

# Show the first 5 rows
df.head()
```

![](https://i.imgur.com/hpizoYr.png)

### Make a scatter plot in Seaborn

In the next code example, we are going to create a scatterplot using the `scatterplot` function from Seaborn. It's quite simple: we pass the variables (i.e., columns) that we are interested in plotting as the `x` and `y` arguments.

```python
sns.scatterplot(x='hp', y='mpg', data=df)
```

![](https://i.imgur.com/u1kG7xV.png)

## Correlation in Python using Pingouin

As can be seen in the figure above, there seems to be a relationship between `hp` and `mpg`. We can test this by examining the correlation between the two variables. In the first code chunk below, we import the `corr` function from Pingouin. This function takes a number of parameters.

![](https://i.imgur.com/lJOQElT.png)

In this simple example we are just going to use the `x` and `y` parameters and leave `method` at its default, the Pearson correlation (type `help(corr)` to get the documentation for this particular correlation function).

```python
from pingouin import corr
```

Next, we select the two columns and get the results:

```python
corr(df['hp'], df['mpg'])
```

![](https://i.imgur.com/lnjybUf.png)

As can be seen in the image above, we get a number of useful values for interpreting this correlation analysis.

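
To make the `r` value reported by `corr` less of a black box, here is a minimal sketch of the Pearson correlation coefficient computed by hand, using only the Python standard library. The helper name `pearson_r` and the two short lists are made up for illustration; they are not part of Pingouin and the values are not taken from *mtcars*.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Sum of the products of each pair's deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)

    return cross / sqrt(ss_x * ss_y)


# Hypothetical values for illustration only (not from the mtcars dataset)
power = [1, 2, 3, 4, 5]
economy = [10, 8, 6, 4, 2]

r = pearson_r(power, economy)
print(r)
```

These made-up values fall on a perfectly straight descending line, so `r` comes out at -1. Real data, such as `hp` and `mpg`, will give a value somewhere between -1 and 1.
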
## Conclusion

In this post we learned how to work with Pandas (load data into a dataframe), create a simple scatter plot using Seaborn, and carry out a correlation analysis using Pingouin.

###### tags: `Pandas` `Seaborn` `Python` `Correlation` `scatterplots`