Hypothesis Testing

This guide aims to give a high-level overview of the concepts behind hypothesis testing and walk through an example hypothesis test. At the end, I give some tips for how to approach Problem 3 on Homework 7.

Motivation for Hypothesis Testing

I think understanding what a hypothesis test is starts with understanding the scenarios in which hypothesis tests are used.

Suppose you're in a class of 20 students, none of whom has studied at all for the upcoming multiple choice midterm. Thus, every student guesses randomly on every question. After the exam, you learn that 16 of the students scored higher than 50%. Naturally, you think that this number is pretty high if everyone truly guessed randomly.

In this scenario, you would run a hypothesis test which tests the hypothesis that everyone guessed randomly.

The goal of the hypothesis test is to obtain a measurement that captures how likely it is to see a number as extreme as 16. Our go-to tool for measuring likelihoods is probability, so our hypothesis test will output the probability, assuming everyone guessed randomly, that the number of students scoring above 50% is at least as extreme as 16. We call this probability the $p$-value of the hypothesis test.

The main takeaway here is that a hypothesis test is really a method of testing how "extreme" the data we observed is.

What is a hypothesis?

Many of you may be familiar with the term hypothesis from the scientific method. In the scientific method, a hypothesis is a proposed explanation for an event that occurred. For example, if your laptop isn't turning on, one hypothesis could be "my laptop battery is dead".

In statistics, the term hypothesis means something slightly different. Since statistics deals with distributions and random processes, a hypothesis in statistics is a proposal of what the underlying distribution/process that generates the observed data looks like. For example, a statistical hypothesis about the process that decides whether the flip of a coin will be heads or tails is "the coin is fair, meaning each flip has an equal chance of being heads or tails".

There is a similarity, though. In both scientific and statistical hypotheses, we are attempting to explain an observed event or outcome. The difference is that statistical hypotheses propose a distribution/process to explain the event, while scientific hypotheses propose a logical explanation of the event (e.g. because A happened, B happened as a result).

Null vs. Alternative Hypotheses

Any time we run a hypothesis test, we are trying to decide between two opposing views of the process that generates the data we observe.

One view of the process is called the null hypothesis. Loosely speaking, the null hypothesis specifies our belief (before seeing the observed data) about the data generation process. It also asserts that any difference between the observed data and the expected data is due to random error. In the example above, the null hypothesis is that everyone guessed randomly on each question.

The opposing view is called the alternative hypothesis. Loosely speaking, the alternative hypothesis suggests that a different process generates the data. It asserts that the difference between the observed data and the expected data is not due to chance. In the example above, the alternative hypothesis is that not everyone guessed randomly on the test; some people did in fact study and knew the answers.

Using this language, it's easier to describe what a hypothesis test aims to do. A hypothesis test computes how likely it is to observe the data we observed if the true underlying process generating the data is the one specified in the null hypothesis.

By convention, the null hypothesis is the more "conservative" hypothesis. This means the null hypothesis makes a weaker claim about the process that generates the data.

Anatomy of a Hypothesis Test

There are 7 main parts to a hypothesis test:

Observed data
Null hypothesis
Alternative hypothesis
Test statistic
$p$ value
Significance level
Conclusion of the test

The first one is somewhat obvious but I thought I'd state it explicitly anyways. You can't test a hypothesis if you don't have any data.

The second and third parts were discussed above.

Test statistic

As was mentioned above, a hypothesis test computes the probability of observing data like ours if the data was generated from the process specified in the null hypothesis.

In order to compute probabilities, we need numbers, but the null and alternative hypotheses are just plain-English statements.

The test statistic is the quantity that we use to translate the plain-English hypotheses into mathematical statements we can use to compute probabilities.

The choice of test statistic has a huge bearing on the outcome of the hypothesis test. You should choose the test statistic appropriately. This is kind of an art and comes with practice.
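As a small sketch of what "translating into numbers" can look like, the Python snippet below reduces a list of exam scores to a single number. The scores and the function name `count_above_half` are invented for illustration; they are not data from the example in this guide.

```python
def count_above_half(scores):
    """Test statistic: the number of scores strictly above 50%."""
    return sum(1 for s in scores if s > 0.5)


# Hypothetical scores (as fractions of the maximum), made up for illustration.
hypothetical_scores = [0.35, 0.60, 0.55, 0.20, 0.75, 0.50]

statistic = count_above_half(hypothetical_scores)
print(statistic)  # 3 (a score of exactly 50% does not count)
```

Once the data is summarized by one number like this, the hypotheses become statements about how that number behaves.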

$p$ value

As was said before, the probability that is computed by a hypothesis test is called the $p$-value of the test.

The $p$-value represents the probability of observing data at least as extreme as the data we observed, assuming it was generated from the process specified in the null hypothesis.

Significance level

Once we have a $p$-value, we need to use it to decide whether the observed data is consistent with the process specified in the null hypothesis.

We know that the $p$-value represents the probability of observing data at least as extreme as ours if the null hypothesis is true. But at what probability can we say the data is consistent with the null hypothesis? 10%? 50%? 99%?

Conventionally, we say that if the $p$-value is 5% or greater, then the observed data is consistent with the null hypothesis.

This 5% is referred to as the significance level of the hypothesis test. It doesn't always have to be 5%, but it usually is.
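The comparison against the significance level can be written as a tiny decision rule. This is only a sketch: the function name `conclude` and the wording of the returned strings are illustrative choices, and the default of 0.05 mirrors the conventional 5% level.

```python
def conclude(p_value, significance_level=0.05):
    """Compare a p-value against the significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < significance_level:
        # The data would be very unusual if the null hypothesis were true.
        return "reject the null hypothesis"
    # The data is consistent with the null hypothesis.
    return "fail to reject the null hypothesis"


print(conclude(0.30))   # fail to reject the null hypothesis
print(conclude(0.001))  # reject the null hypothesis
```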

Conclusion of the test

This will be covered in the following example.

Example Hypothesis Test: Multiple Choice Test

Let's actually conduct the hypothesis test I've been using as an example.

As a reminder, here's the setup: Suppose you're in a class of $n = 20$ students, none of whom has studied at all for the upcoming multiple choice midterm. Thus, every student guesses randomly on every question. After the exam, you learn that 16 of the students scored above 50%.

You want to conduct a hypothesis test using a 5% significance level to test the hypothesis that everyone actually guessed randomly.

The null hypothesis: Everyone guessed randomly on every question. The event that 16 students scored above 50% occurred due to random chance.

The alternative hypothesis: Not everyone guessed randomly on every question. Some people knew the answers and were thus able to get higher scores.

The test statistic: Ultimately, we want to calculate the probability of seeing at least 16 out of 20 students scoring above 50% if every student guessed randomly on every question.

The reason we want to calculate the probability of seeing at least 16 out of 20, and not exactly 16 out of 20, is that we want to figure out how likely it is to observe an event as extreme as, or even more extreme than, the event we observed. For more, see the definition of $p$-value in the Data 8 textbook.

Thus, a natural choice for a test statistic is the number of students who scored higher than 50%. Let's call this statistic $Z$.

The $p$-value: We can use $Z$ to get a concise mathematical statement of the probability we want to calculate:

$$P(Z \geq 16)$$

If each student guessed randomly, then the probability that a given student scored above 50% is $\frac{1}{2}$. We can then treat $Z$ as a random variable with a binomial distribution where $n = 20$ and $p = \frac{1}{2}$.

Then

$$P(Z \geq 16) = \sum_{k=16}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{20-k} = \sum_{k=16}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{20} \approx 0.006 = 0.6\%$$
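To check the arithmetic, here is a minimal Python sketch that evaluates the same binomial sum with the standard library. The function name `binomial_upper_tail` is an illustrative choice, not something from the course materials.

```python
from math import comb


def binomial_upper_tail(n, k_min, p=0.5):
    """P(Z >= k_min) for Z ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# 20 students, at least 16 scoring above 50%, each with probability 1/2.
p_value = binomial_upper_tail(20, 16)
print(f"{p_value:.4f}")  # 0.0059
```

The sum of the binomial coefficients for $k = 16, \dots, 20$ is 6196, so the exact value is $6196 / 2^{20}$, which rounds to 0.006, or about 0.6%.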

Significance level: We've already chosen the significance level to be 5%. If not specified, you should assume this is the significance level.

Conclusion of the hypothesis test

Since our $p$-value is below the significance level, we say that the observed data is not consistent with the null hypothesis. We have enough evidence to say that something other than random chance caused the difference between the observed outcome and the expected outcome.

When the observed data is not consistent with the null hypothesis, we reject the null hypothesis.

If the observed data is consistent with the null hypothesis, we fail to reject the null hypothesis. This means the observed data does not give us enough evidence against the null hypothesis.
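The $p$-value in this example can also be approximated by simulating the null hypothesis many times and counting how often the result is at least as extreme as what we observed. The sketch below assumes, as the example does, that each student independently lands above 50% with probability $\frac{1}{2}$; the function name, the number of trials, and the seed are arbitrary illustrative choices.

```python
import random


def simulate_p_value(n_students=20, observed=16, trials=100_000, seed=0):
    """Estimate P(Z >= observed) by simulating the null hypothesis."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    at_least_as_extreme = 0
    for _ in range(trials):
        # One simulated class: each student is above 50% with probability 1/2.
        z = sum(rng.random() < 0.5 for _ in range(n_students))
        if z >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


estimate = simulate_p_value()
print(estimate)           # should land close to the exact value, about 0.006
print(estimate < 0.05)    # below the 5% significance level, so we reject
```

The estimate varies slightly with the seed and the number of trials, but it stays far below 5%, which leads to the same conclusion as the exact calculation.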

Cautions about Hypothesis Testing

We NEVER accept the alternative hypothesis. The reason is that we only computed the probability of observing the observed data if the null hypothesis is true. We never computed the probability of observing the observed data if the alternative hypothesis is true. In fact, we cannot compute this probability because the alternative hypothesis does not specify a distribution that generates data.

The only purpose of the alternative hypothesis is to define what extreme means. In our example, the fact that the alternative hypothesis suggested that people should have gotten higher scores told us that we needed to sum from $k = 16$ up to $k = 20$ instead of from $k = 16$ down to $k = 0$.
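To see how much the direction of "extreme" matters, the following sketch computes both sums for the multiple choice example. The helper name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(n, k, p=0.5):
    """P(Z = k) for Z ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, observed = 20, 16

# "Extreme" means high scores: sum from k = 16 up to k = 20.
upper_tail = sum(binomial_pmf(n, k) for k in range(observed, n + 1))

# The wrong direction for this alternative: sum from k = 16 down to k = 0.
lower_tail = sum(binomial_pmf(n, k) for k in range(0, observed + 1))

print(f"{upper_tail:.4f}")  # 0.0059
print(f"{lower_tail:.4f}")  # 0.9987
```

Summing in the wrong direction would give a probability close to 1 and hide exactly the evidence the test is looking for.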

Additionally, we never accept the null hypothesis; we only reject or fail to reject it. The reason is that a large $p$-value only tells us the observed data is consistent with the null hypothesis. Other processes could have produced the same data, so we can never be 100% certain that the data was generated from the process specified by the null hypothesis.

Homework 7 Problem 3

With this knowledge of hypothesis tests, hopefully the homework question makes more sense.

In the homework question, we are given that 9 out of 12 machines operate faster after the modification and we want to test whether the modifications did nothing or whether they made the robots faster.

To start, I'll give you a choice of null and alternative hypotheses:

Null hypothesis: The modification had no effect on the robots' performance.

Alternative hypothesis: The modification made the robots faster.

Hopefully this choice of hypotheses is clear.

The meat of the question is in choosing the test statistic. For this, I encourage you to look at the example hypothesis test I did above and think about why I chose the test statistic I did. See if that reasoning applies to this problem.

Hint: I chose that example for a specific reason :)

One more piece of information you will need to tackle the homework question: if the null hypothesis is true, what is the probability that a robot will be faster after the modification?

Hint: if the modification truly makes no difference, then whether or not the second completion of the task takes less time than the first completion of the task is basically a flip of a coin.

Using these hints, try to come up with a test statistic and $p$-value for the homework question. Again, I chose my example deliberately to help you with the homework question :)