As an informaticist, it is quite likely that someone is going to hand you a dataset and ask you to make sense of it. These last two weeks we are going to mimic this scenario. You will download a 100-row dataset having 26 variables which was extracted from the Framingham Heart Study project. The Framingham Heart Study is a multigenerational study, initially about heart disease. You can read more about the study at the Framingham Heart Study Website.
Before running any analyses, you will want to know something about the dataset overall and the individual variables that make up the dataset. How big is the dataset? What is the range of values for each variable? How are the variables distributed? What kind of data is it? You will use Excel to get the answers to these questions.
Mise en Place is a culinary term that refers to getting things ready or gathering what you need to start cooking. This section will describe some activities that will answer the questions posed in the previous paragraph.
Ideally, you will have access to a data dictionary. The data dictionary contains a lot of useful information. For this exercise, the data dictionary includes the variable name and a brief description of the data that is assigned to that variable. There may be a description of the size of the variable and whether it is alpha or numeric or alphanumeric. Here is the data dictionary for this dataset.
Some variables are binary which means that there are only two possible values. In this dataset, sex is one such variable, other variables include the presence or absence of a condition. For these variables, you will want to do a count to see how many instances there are for each possible response.
Excel has a countif function to do that. the syntax is **countif(range, value). See this explanation about countif functions for a more detailed explanation.
Other variables, such as total cholesterol can have a range of values. A scatter plot is a good way to see how the values are clustered and whether there are outliers. See this link on Excel-Easy to see the steps to create a scatter plot if you don't know how.
This graph shows the cholesterol values for 100 study participants regardless of sex.
To get a better idea of differences in values, you may want to sort the data based on a binary variable and create separate graphs to see if there are differences between the sexes.
What do you notice about the differences in values? These graphs provide a great detailed visual, however, it is hard to do a side-by-side comparison.
This graph might be better.
What do you see now? What are the differences between the 1s and 2s?
The scatterplot is one kind of visual for the data, but it's not easily quantifiable. The 5 number summary includes the minimum, maximum, median (don't confuse this value with average), and first and third quartiles for the data. Excel has functions to calculate these values. Min(range), max(range), median(range), quartile.inc(range,quartile). Note: for our purposes quartile will have the value of either 1 or 3.
These five values can be visualized in a Box and Whisker Diagram. The top and bottom "whiskers" represent the min and max values in the data. The box represents the values between the first and third quartiles, while the line in the middle is the median. You can create the diagram without finding these values.
Now let's look at three box and whisker diagrams. The first diagram is everyone in the sample. The second and third diagrams have been broken out by sex.
What can you say about the diagrams? What are the differences and common attributes across the diagrams?
The Box and Whisker diagram is not part of the standard chart offerings in Excel. You will have to add the Analysis Tookpak, a free add-in from Microsoft. The instructions for finding and installing the Analysis Toolpak are here this link on Microsoft Support. Instructions for creating a Box and Whisker diagram can be found on this page
Another way to "slice" the data is to look at how the data are distrubuted. That is, are the data points clustered around the mean (average), or are they spread out, perhaps more to the left or to the right?
Variance measures the data point's distance from the mean.
Standard deviation is an overall measure of how far the data is from the mean. There are three deviations to the right of the mean and three deviations to the left of the mean.
Imagine a bell curve. Thirty-three percent of the data are in the first deviations (left and righ), another 33%, for a total of 66% are within two standard deviations of the mean, and the final 33% are within three standard deviations of the mean. You can visualize this bell curve by graphing the values and their normal distribution. See these instructions
You should have a graph that looks something like this:
It's easy to calculate the mean, standard deviation, and variance with Excel. I've provided the functions for you range of value refers to the set of values that you plan to measure.
Mean: =average(range of values)
Standard Deviation: =stdev.p(range of values)
Variance: =var.p(range of values)
use .p suffix for population and .s for a sample
Up to this point, we have looked at a single variable. We know min and max values, modes, and means. We have measures for variance and standard deviation. We also have a few graphs. But it's only one variable. What if we want to compare variables with more rigor than looking at two graphs side by side–cholesterol by gender, for example? Keep reading.
For those who come from a computer science or database background, variables can be defined as alpha, integer, float, for example. In statistics, there are levels of measurement nominal ordinal, interval, and ratio.
Nominal data are categories or groups. Example: gender, marital status, race
Ordinal data are categories with a defined order. Example: ratings, grade level, faculty rank
Interval data are categories with a proportionate interval between values. Example: credit score, temperature
Ratio data are interval data in which there is an absolute 0 value. Example: height, Kelvin measurement, speed
Think about the variables in your dataset. Can you identify the category for each variable?
The type of statistical test that you use is based on the variables' measure type. Any statistical test will need to have two things upfront
The hypothesis is an assumption about the data, often the data distribution, For example, there are an equal number of male and female patients who have CHF. This is probably the null hypothesis referenced as h0. The alternate hypothesis ha is that there are note equal numbers of male and female patients with CHF.
The level of significance is the value, usually .05, at which something is considered statistically significant. If the result of your statistical test is .05 or less, then the result is considered statistically significant. If the value is greater than .05, then the result is not statistically significant.
Let's assume that we want to compare the level of education between males and females in our dataset. Chi-square is a measure of distribution for nominal (categorical) data
We will set the value of statistical significance to .05 and define the null and alternate hypotheses:
h0: there is an equal distribution of males and females in each education category
ha:there is an unequal distribution of males and females in each education category
See this video to learn how to calculate Chi-square with Excel.
First calculate the actual number of males and females for each level of education. You can use a Pivot table for this step. Or, you can run a custom sort by grade and sex then count the instances. Sum across the rows and columns. Then calculate the percentage for each row total (row total/total count of all rows) You should have something that looks like this.
Second calculate the expected number of males and females for each level of education. For each grade and sex, multiply the row percent by the total number for each sex. So for grade=1 and sex=1 the calculation is .42* 41. When everything is calculated you should have something that looks like this.
Third use the Excel function chisq.test(actual range, perceived range) to caclulate the value of chi-square and compare it to the statistical significance value.
In this example, the calculated chi-square value was .15687 which is greater than .05.
We can interpret this result to state that there is no statistical difference between the actual and perceived number of males and females across the levels of education in this dataset.
A t-test allows you to compare the means of two groups and determine whether the groups are statistically different, that is the difference is not by chance. You've seen a visualization of the means using the box and whisker diagram, but that is not sufficient to compare values. The comparative groups could be the same subjects using pre- and post-test scores, such as before and after an intervention. This comparison is a paired t test>. If comparing separate groups, such as those living in Boise and those living in Columbus, use an independent t-test
Excel has three types of t-tests: a paired t-test (see pre- post-test above), an independent t-test where the variances are equal, and an independent t-test where the variances are unequal. Assuming an independent T-test, first, you must determine the variance. Run an F-Test for Two-Sample Variance. This test can be found in the data analysis tab under data.
You will be presented with a window that looks like this. I annotated this screen so you know what values go in each field.
The output will look something like this. Notice that I bolded the variance line.
As you can see, the variances are different so we will run the independent t-test with unequal variance, not the t-test with equal variance.
T-tests are also located in the data analysis tab where you found the F-test. You will be presented with a window that looks like this. I annotated this screen so you know what values go in each field.
My output looks like this:
You can see that the group means are close, although the variance for group 2 is slightly smaller and there are more observations (data points). The one-tail value (only looking to the left or right of the mean) is not statistically significant since the value is greater than .05. The same is also true for the two-tail value (looking both left and right of the mean). Therefore, we can conclude that the two samples are relatively equivalent.
This lesson has been a very brief introduction to statistics, it is by no means complete. There are many more topics to consider. You've learned how to visualize and provide descriptive statistics on a single variable and how to compare nominal variables as well as compare the means between two variables.