# Ch 4 Display of Statistical Data

[TOC]

###### tags: `Probability and Statistics`

## 4.1 Data Types

The choice of an **appropriate** statistical procedure depends on the ***data type***. Data can be `categorical` or `numerical`. If the variables are numerical, we are led to a certain statistical strategy; in contrast, if the variables represent qualitative categorizations, we follow a different path.

Data can also be **univariate**, **bivariate**, or **multivariate**:

* Univariate data involve only one variable, e.g., the height of a person.
* Bivariate data have two parameters, e.g., the x/y position in a plane, or income as a function of age.
* Multivariate data have three or more variables, e.g., the position of a particle in space.

![](https://i.imgur.com/iA2AbpC.png)

### A. CATEGORICAL DATA (類別資料)

**Categorical data** represent characteristics, such as a person's gender or language. Categorical data can also take on numerical values (e.g., 1 for female and 0 for male), but note that those numbers have no mathematical meaning.

#### A_1. NOMINAL DATA (名目資料)

**Nominal values** represent discrete units and are used to label variables that have no quantitative value. Just think of them as "**labels**". Note that nominal data have no order. Two examples of nominal features are shown below:

![](https://i.imgur.com/CsdgDiT.png)

#### A_2. ORDINAL DATA (順序資料)

**Ordinal values** represent ***discrete*** and ***ordered*** units. They are therefore nearly the same as nominal data, except that their ordering matters.

![](https://i.imgur.com/XWxlqaO.png)

### B. Numerical Data (數值資料)

#### DISCRETE DATA (離散資料)

We speak of **discrete data** if the values are distinct and separate, i.e., if the data can only take on certain values. This type of data can't be measured, but it can be counted. It basically represents information that can be sorted into distinct categories.
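A minimal sketch (assuming numpy) of such countable data: the simulated number of events observed in each of ten intervals, which can only take integer values.

```python
import numpy as np

# Simulate a countable quantity: the number of events
# observed in each of 10 intervals (hypothetical rate of 4 per interval)
rng = np.random.default_rng(42)
counts = rng.poisson(lam=4, size=10)

# Discrete data take only distinct, separate values (here: 0, 1, 2, ...)
print(counts)
```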
An example is the number of heads in 100 coin flips.

#### CONTINUOUS DATA

**Continuous data** represent measurements; their values can't be counted, but they can be measured. An example would be the height of a person, which you can describe using intervals on the real number line.

#### B_1. Interval Data

**Interval values** represent ordered units with equal, known differences between them. We therefore speak of interval data when a variable contains numeric values that are ordered and where the exact differences between the values are known. An example would be a feature that contains the temperature of a given place, as you can see below:

![](https://i.imgur.com/rQZ66uy.png)

The problem with interval data is that it has no "true zero". In our example, this means there is no such thing as "no temperature". With interval data we can add and subtract, but we cannot **multiply**, **divide**, or **calculate ratios**. Because there is no true zero, many descriptive and inferential statistics can't be applied.

#### B_2. Ratio Data

**Ratio values** are also ordered units with equal differences between them. They are the same as interval values, with the difference that they do have an absolute zero. Good examples are **height**, **weight**, and **length**.

---

## 4.2 Plotting in Python

### A. Plot the sine and cosine functions

Case 1. ***pyplot*** style

~~~
# Import the required packages,
# with their conventional names
import matplotlib.pyplot as plt
import numpy as np

# Generate the data
x = np.arange(0, 10, 0.2)
y = np.sin(x)

# Generate the plot
plt.plot(x, y)

# Display it on the screen
plt.show()
~~~

Case 2. ***matplotlib*** (object-oriented) style

~~~
# Re-uses x, y, and plt from Case 1
fig = plt.figure()         # Generate the figure
ax = fig.add_subplot(111)  # Add an axis to that figure
ax.plot(x, y)              # Add a plot to that axis
plt.show()
~~~

Case 3. ***Matlab-like coding*** style

~~~
from pylab import *

x = arange(0, 10, 0.2)
y = sin(x)
plot(x, y)
show()
~~~

Case 4.
Plot both the sine and the cosine function in one figure

~~~
# Import the required packages
import matplotlib.pyplot as plt
import numpy as np

# Generate the data
x = np.arange(0, 10, 0.2)
y = np.sin(x)
z = np.cos(x)

# Generate the figure and the axes
fig, axs = plt.subplots(nrows=2, ncols=1)

# On the first axis, plot the sine and label the ordinate
axs[0].plot(x, y)
axs[0].set_ylabel('Sine')

# On the second axis, plot the cosine
axs[1].plot(x, z)
axs[1].set_xlabel('0<x<10')
axs[1].set_ylabel('Cosine')

# Display the resulting plot
plt.show()
~~~

---

## 4.3 Displaying Statistical Datasets

The easiest way to find and implement one of the many plot types that **matplotlib** offers is to browse its gallery (http://matplotlib.org/gallery.html) and copy the corresponding Python code into your program.

For statistical data analysis, the Python package **seaborn** (http://www.stanford.edu/~mwaskom/software/seaborn/) builds on matplotlib and aims to provide a concise, high-level interface for drawing statistical graphics that are both informative and attractive.

### 4.3.1 Univariate Data

An example can be found at http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.stripplot.html

#### a) Scatter Plots (散佈圖)

![](https://i.imgur.com/zHMmRxe.png)

~~~
# Import standard packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import seaborn as sns

# Generate the data
x = np.random.randn(500)

# Plot-command start ---------------------
plt.plot(x, '.')
# Plot-command end -----------------------

# Show plot
plt.show()
~~~

#### b) Histograms (直方圖)

![](https://i.imgur.com/ybvXJlp.png)

Histograms provide a good first overview of the distribution of your data. If you divide each count by the overall number of data points, you get a relative frequency histogram; and if you connect the top center points of the bins, you obtain a relative frequency polygon.
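The relative frequency variant can be produced via the `weights` argument of `plt.hist` — a minimal sketch, with `x` again just standard-normal noise:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data: 500 standard-normal values
x = np.random.randn(500)

# Weight each sample by 1/n, so the bar heights sum to 1
counts, bins, _ = plt.hist(x, bins=25, weights=np.ones_like(x) / len(x))
plt.ylabel('Relative frequency')
plt.show()
```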
~~~
plt.hist(x, bins=25)
plt.show()
~~~

#### c) Kernel-Density-Estimation (KDE) Plots

Kernel density estimation (KDE) is a **non-parametric method** for estimating the probability density function of a given random variable. Given a sample of independent, identically distributed (i.i.d.) observations $(x_1,x_2,\ldots,x_n)$ of a random variable from an unknown source distribution, the **kernel density estimate** is given by

$$p(x) = \frac{1}{nh} \sum_{j=1}^{n}K\left(\frac{x-x_j}{h}\right)$$

where $K(a)$ is the **kernel function** and $h$ is the **smoothing parameter**, also called the **bandwidth**.

Suppose we have the sample points $[-2,-1,0,1,2]$, a linear (triangular) kernel $K(a)= 1-|a|$ for $|a|\le 1$, and $h=10$. To evaluate the estimate at $x=0$:

$$\begin{aligned}
x_i &= [-2, -1, 0, 1, 2]\\
|0-x_i| &= [2, 1, 0, 1, 2]\\
\left|\frac{0-x_i}{h}\right| &= [0.2, 0.1, 0, 0.1, 0.2]\\
K\left(\frac{0-x_i}{h}\right) &= [0.8, 0.9, 1, 0.9, 0.8]
\end{aligned}$$

Plugging the above into the formula for $p(x)$:

$$p(0) = \frac{1}{(5)(10)} ( 0.8+0.9+1+0.9+0.8 ) = 0.088$$

##### c)-1 Kernel Density Estimation Using Python

While there are several ways of computing the kernel density estimate in Python, we'll use the popular machine learning library `scikit-learn` for this purpose.

~~~
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
~~~

##### c)-2 Synthetic Data (生成資料)

To demonstrate kernel density estimation, synthetic data are generated from two different distributions: an asymmetric ***log-normal*** distribution and a ***Gaussian*** distribution.
~~~
def generate_data(seed=17):
    # Fix the seed to reproduce the results
    rand = np.random.RandomState(seed)
    x = []
    dat = rand.lognormal(0, 0.3, 1000)
    x = np.concatenate((x, dat))
    dat = rand.normal(3, 1, 1000)
    x = np.concatenate((x, dat))
    return x
~~~

~~~
x_train = generate_data()[:, np.newaxis]

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

axs[0].scatter(np.arange(len(x_train)), x_train, c='red')
axs[0].set_xlabel('Sample no.')
axs[0].set_ylabel('Value')
axs[0].set_title('Scatter plot')

axs[1].hist(x_train, bins=50)
axs[1].set_title('Histogram')

fig.subplots_adjust(wspace=.3)
plt.show()
~~~

##### c)-3 Using Scikit-Learn's KernelDensity

To find the shape of the estimated density function, we generate a set of equidistant test points and estimate the kernel density at each of them.

~~~
x_test = np.linspace(-1, 7, 2000)[:, np.newaxis]
~~~

Now we create a `KernelDensity` object, train it with the `fit()` method, and evaluate the log-density of each test point with `score_samples()`, as shown in the code below. The `KernelDensity()` constructor uses two relevant default parameters, `kernel='gaussian'` and `bandwidth=1.0`.

~~~
model = KernelDensity()
model.fit(x_train)
log_dens = model.score_samples(x_test)
~~~

The shape of the distribution can be viewed by plotting the density (the exponential of the log-density) at each test point:

~~~
plt.fill(x_test, np.exp(log_dens), c='cyan')
plt.show()
~~~

##### c)-4 Understanding the Bandwidth Parameter

Let's experiment with different values of the bandwidth to see how it affects density estimation.
~~~
bandwidths = [0.01, 0.05, 0.1, 0.5, 1, 4]
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(10, 7))
plt_ind = np.arange(6) + 231

for b, ind in zip(bandwidths, plt_ind):
    kde_model = KernelDensity(kernel='gaussian', bandwidth=b)
    kde_model.fit(x_train)
    score = kde_model.score_samples(x_test)

    plt.subplot(ind)
    plt.fill(x_test, np.exp(score), c='cyan')
    plt.title("h="+str(b))

fig.subplots_adjust(hspace=0.5, wspace=.3)
plt.show()
~~~

#### d) Cumulative Frequencies

A ***cumulative frequency*** curve indicates the number (or percentage) of data points with values less than a given value. Cumulative frequencies are also useful for comparing the distribution of values in two or more different groups of individuals. When you use percentage points, the cumulative frequency presentation has the additional advantage that it is bounded: $0\le \mathrm{cumfreq}(x) \le 1$

###### Examples

~~~
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import default_rng
from scipy import stats

rng = default_rng()

x = [1, 4, 2, 1, 3, 1]
res = stats.cumfreq(x, numbins=4, defaultreallimits=(1.5, 5))
print(res.cumcount)
print(res.extrapoints)
~~~

Create a normal distribution with 1000 random values:

~~~
samples = stats.norm.rvs(size=1000, random_state=rng)
~~~

Calculate cumulative frequencies:

~~~
res = stats.cumfreq(samples, numbins=25)
~~~

Calculate the space of x-values:

~~~
x = res.lowerlimit + np.linspace(0, res.binsize*res.cumcount.size,
                                 res.cumcount.size)
~~~

Plot the histogram and the cumulative histogram:

~~~
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

ax1.hist(samples, bins=25)
ax1.set_title('Histogram')

ax2.bar(x, res.cumcount, width=res.binsize)
ax2.set_title('Cumulative histogram')
ax2.set_xlim([x.min(), x.max()])

plt.show()
~~~

#### e) Error-Bars

***Error-bars*** are a common way to show the mean value and the variability when comparing measurement values.
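A minimal sketch with made-up data (assuming numpy and scipy; `scipy.stats.sem` computes the standard error of the mean):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Made-up example: three groups of 20 measurements each
np.random.seed(123)
data = np.random.randn(20, 3) + np.array([1, 2, 3])

means = data.mean(axis=0)
sds = data.std(axis=0, ddof=1)  # standard deviation of each group
sems = stats.sem(data, axis=0)  # standard error of the mean, sd/sqrt(n)

# Here the error-bars show the standard error of the mean (SEM)
plt.errorbar(np.arange(3), means, yerr=sems, fmt='o', capsize=5)
plt.show()
```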
Note that it always has to be stated explicitly whether the error-bars correspond to the ***standard deviation*** or to the ***standard error*** of the data. Using standard errors has a nice feature: when the error-bars for the standard errors of two groups overlap, one can be sure that the difference between the two means is not statistically significant $(p > 0.05)$.

~~~
index = np.arange(5)
y = index**2
errorBar = index/2   # just for demonstration

plt.errorbar(index, y, yerr=errorBar,
             fmt='o', capsize=5, capthick=3)
plt.show()
~~~

#### f) Box-Whiskers Plots

**Boxplots** are frequently used in scientific publications to indicate values in two or more groups.

1. The bottom and top of the box indicate the ***first quartile*** ($Q_1$) and the ***third quartile*** ($Q_3$), respectively, and the line inside the box shows the ***median*** ($Q_2$).
2. Care has to be taken with the **whiskers**, as different conventions exist for them. The most common form is that the lower whisker indicates the lowest value still within $1.5\times$ the **interquartile range** (**IQR** $=Q_3-Q_1$) of the lower quartile, and the upper whisker the highest value still within $1.5\times$ **IQR** of the upper quartile.
3. **Outliers** (outside the whiskers) are plotted separately.
4. Another convention is to have the whiskers indicate the full data **range**.

There are a number of tests to check for outliers. The method suggested by Tukey, for example, is to check for data which lie more than $1.5\times$ **IQR** above the third quartile or below the first quartile.

![](https://i.imgur.com/n1BtsK0.png)

**Box plots are useful as they show the skewness of a data set:**

![](https://i.imgur.com/Q29yzxT.png)

1. When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, the distribution is **symmetric**.
2. When the median is closer to the bottom of the box, and the whisker is shorter on the lower end of the box, the distribution is **positively skewed** (***skewed right***).
3. When the median is closer to the top of the box, and the whisker is shorter on the upper end of the box, the distribution is **negatively skewed** (***skewed left***).

![](https://i.imgur.com/7CCNgRM.png)

~~~
x = np.random.randn(500)
plt.boxplot(x, sym='*')
plt.show()
~~~

Boxplots can be combined with KDE-plots to produce so-called ***violin plots***, where the vertical axis is the same as for the box-plot, but in addition a KDE-plot is shown symmetrically along the horizontal direction.

~~~
import pandas as pd
import seaborn as sns
from scipy import stats

# Generate the data
nd = stats.norm
data = nd.rvs(size=(100))

nd2 = stats.norm(loc=3, scale=1.5)
data2 = nd2.rvs(size=(100))

df = pd.DataFrame({'Girls': data, 'Boys': data2})

sns.violinplot(data=df)
plt.show()
~~~

#### g) Violin plots

Draw a single horizontal violinplot:

~~~
import seaborn as sns
sns.set_theme(style="whitegrid")

tips = sns.load_dataset("tips")
ax = sns.violinplot(x=tips["total_bill"])
~~~

Draw a vertical violinplot grouped by a categorical variable:

~~~
ax = sns.violinplot(x="day", y="total_bill", data=tips)
~~~

Draw a violinplot with nested grouping by two categorical variables:

~~~
ax = sns.violinplot(x="day", y="total_bill", hue="smoker",
                    data=tips, palette="muted")
~~~

Draw split violins to compare across the hue variable:

~~~
ax = sns.violinplot(x="day", y="total_bill", hue="smoker",
                    data=tips, palette="muted", split=True)
~~~

#### h) Grouped Bar Charts

For some applications the plotting abilities of pandas can facilitate the generation of useful graphs, e.g., for **grouped barplots**.
~~~
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar', grid=False)
plt.show()
~~~

#### i) Pie Charts

Pie charts can be generated with a number of different options:

~~~
import seaborn as sns
import matplotlib.pyplot as plt

txtLabels = 'Cats', 'Dogs', 'Frogs', 'Others'
fractions = [45, 30, 15, 10]
offsets = (0, 0.05, 0, 0)

plt.pie(fractions, explode=offsets, labels=txtLabels,
        autopct='%1.1f%%', shadow=True, startangle=90,
        colors=sns.color_palette('muted'))
plt.axis('equal')
plt.show()
~~~

### 4.3.2 Bivariate and Multivariate Plots

#### a) Bivariate Scatter Plots

Simple **scatter plots** are straightforward to generate:

~~~
import numpy as np
import matplotlib.pyplot as plt

# Fixing random state for reproducibility
np.random.seed(19680801)

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = (30 * np.random.rand(N))**2   # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
~~~

#### b) 3D Plots

3D plots in **matplotlib** are a bit awkward, because separate modules have to be imported, and axes for 3D plots have to be explicitly declared. However, once the axis is correctly defined, the rest is straightforward. Here are two examples:

~~~
# Imports specific to the plots in this example
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d.axes3d import get_test_data

# Twice as wide as it is tall
fig = plt.figure(figsize=plt.figaspect(0.5))

#---- First subplot
# Note that the declaration "projection='3d'"
# is required for 3d plots!
ax = fig.add_subplot(1, 2, 1, projection='3d')

# Generate the grid
X = np.arange(-5, 5, 0.1)
Y = np.arange(-5, 5, 0.1)
X, Y = np.meshgrid(X, Y)

# Generate the surface data
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)

# Plot the surface
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                       cmap=cm.GnBu, linewidth=0, antialiased=False)
ax.set_zlim3d(-1.01, 1.01)
fig.colorbar(surf, shrink=0.5, aspect=10)

#---- Second subplot
ax = fig.add_subplot(1, 2, 2, projection='3d')
X, Y, Z = get_test_data(0.05)
ax.plot_wireframe(X, Y, Z, rstride=10, cstride=10)

outfile = '3dGraph.png'
plt.savefig(outfile, dpi=200)
print('Image saved to {0}'.format(outfile))
plt.show()
~~~