The first task in evaluating the implications of experimental observations is to formulate a hypothesis test before looking at the experimental outcome.
<aside> <img src="/icons/fleur-de-lis_purple.svg" alt="/icons/fleur-de-lis_purple.svg" width="40px" />
Evaluation: If the observed value of the test statistic satisfies the rejection criterion, then we reject the null hypothesis $H_0$. Otherwise we fail to reject $H_0$. Note that in this framing, we can never “prove the alternative hypothesis”.
</aside>
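To make the rejection-criterion language concrete, here is a minimal sketch in Python of how a two-sided rejection criterion could be set up before looking at the data. The coin-flip setting, the sample size, and the significance level are hypothetical choices for illustration, not examples from the course.

```python
from scipy.stats import binom

# Hypothetical setup: H_0: p = 0.5 (fair coin) vs. H_A: p != 0.5,
# based on the number of heads X in n = 100 flips, at level alpha = 0.05.
n, p0, alpha = 100, 0.5, 0.05

# Two-sided rejection criterion, chosen BEFORE seeing the data:
# reject H_0 when X is far from n*p0 = 50 in either direction, with the
# cutoffs set so the probability of rejecting under H_0 is at most alpha.
lower = int(binom.ppf(alpha / 2, n, p0)) - 1       # reject if X <= lower
upper = int(binom.ppf(1 - alpha / 2, n, p0)) + 1   # reject if X >= upper

def reject_H0(observed_heads: int) -> bool:
    """Apply the rejection criterion to the experimental outcome."""
    return observed_heads <= lower or observed_heads >= upper

print(lower, upper)      # the rejection region: X <= 39 or X >= 61
print(reject_H0(61))     # an outcome of 61 heads leads us to reject H_0
```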
Example 1.1: Framing an investigation in terms of a two-sided hypothesis test.
Example 1.2: Framing an investigation in terms of a one-sided hypothesis test.
Example 1.3: Polling surveys, revisited.
The null hypothesis posits a single value for the parameter of interest. But the alternative hypothesis consists of a range of values, and those values can be extremely close to the null hypothesis. For example, suppose that $H_0$ is that $p = 0.5$ and $H_A$ is $p \neq 0.5$. Then $p = 0.5000001$ is technically a version of the alternative hypothesis, but for all practical purposes this version of the alternative would be indistinguishable from the null. So how do we deal with this?
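One way to see the issue numerically: under a Binomial model, the probability that a level-$0.05$ test of $H_0: p = 0.5$ rejects is essentially identical whether the truth is $p = 0.5$ or $p = 0.5000001$. The sketch below computes this rejection probability exactly; the sample size and cutoffs are hypothetical illustrations.

```python
from scipy.stats import binom

# Hypothetical level-0.05 two-sided test of H_0: p = 0.5 based on n = 1000 flips.
n, p0, alpha = 1000, 0.5, 0.05
lower = int(binom.ppf(alpha / 2, n, p0)) - 1       # reject if X <= lower
upper = int(binom.ppf(1 - alpha / 2, n, p0)) + 1   # reject if X >= upper

def rejection_probability(true_p: float) -> float:
    """Exact probability that the test rejects H_0 when the true proportion is true_p."""
    return binom.cdf(lower, n, true_p) + binom.sf(upper - 1, n, true_p)

print(rejection_probability(0.5))        # at most 0.05 by construction
print(rejection_probability(0.5000001))  # indistinguishable from the null: still about 0.05
print(rejection_probability(0.6))        # a practically different p is detected almost always
```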
<aside> <img src="/icons/fleur-de-lis_purple.svg" alt="/icons/fleur-de-lis_purple.svg" width="40px" />
There are two main reasons we want to develop good probability models: (1) to design good rejection criteria for hypothesis tests, and (2) to quantify how surprising an observed outcome is under the null hypothesis.

The first purpose translates into using probability models to construct good rejection criteria for hypothesis tests. Without a probability model, we would have no idea how to set up a good test! The second purpose translates into what we will later call $p$-values. On the positive side, $p$-values provide a universal way to communicate scientific results. On the negative side, $p$-values also provide a way to hide scientific malpractice and to exaggerate scientific findings.
</aside>
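As a concrete, entirely hypothetical illustration of the second purpose, the sketch below uses a Binomial null model to turn an observed outcome into a two-sided $p$-value. The sample size and observed count are made up, and the symmetric "at least as far from the expected count" convention is one common way of defining a two-sided $p$-value.

```python
from scipy.stats import binom

# Hypothetical data: 61 heads in n = 100 flips, with null model Binomial(100, 0.5).
n, p0, observed = 100, 0.5, 61
expected = n * p0                      # 50 heads expected under H_0

# Two-sided p-value: the probability, under H_0, of an outcome at least as far
# from the expected count as the one actually observed (in either direction).
distance = abs(observed - expected)
p_value = binom.cdf(expected - distance, n, p0) + binom.sf(expected + distance - 1, n, p0)
print(p_value)   # a small p-value means the observed outcome is surprising under H_0
```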
To think about the outcomes of experiments, we must propose a mathematical model for uncertainty. There are three primary methods: the first two are purely theoretical, while the third uses the data itself.
<aside> <img src="/icons/fleur-de-lis_purple.svg" alt="/icons/fleur-de-lis_purple.svg" width="40px" />
Methods for modeling the intrinsic uncertainty of experiments.
For part one of this course, we will focus on simulation studies, Hypergeometric models, Binomial models, and Poisson models. In the second and third parts of the course, we will introduce several more probability models for uncertainty.

</aside>
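To preview how a simulation study relates to a theoretical model, the sketch below compares the two approaches on the same question: the chance of seeing 61 or more heads in 100 flips of a fair coin. The numbers are a hypothetical illustration; the point is that a well-designed simulation approximates the answer the Binomial model gives exactly.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Question (hypothetical): how likely is it to see 61 or more heads in 100 fair-coin flips?
n, p = 100, 0.5

# Theoretical Binomial model: exact probability.
exact = binom.sf(60, n, p)             # P(X >= 61)

# Simulation study: repeat the experiment many times and count how often it happens.
reps = 100_000
heads = rng.binomial(n, p, size=reps)
simulated = np.mean(heads >= 61)

print(exact, simulated)                # the simulation estimate should be close to the exact value
```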
For the statistical decisions we are studying here, we start with a testable hypothesis and conclude with a statement that we REJECT the null hypothesis $H_0$ or FAIL TO REJECT $H_0$. To quantify the quality of a given test, we need to evaluate its probability of leading to an incorrect conclusion, or error. Since either of these two conclusions can be wrong, depending on whether $H_0$ is actually true or false, there are two types of error.
<aside> <img src="/icons/fleur-de-lis_purple.svg" alt="/icons/fleur-de-lis_purple.svg" width="40px" />
Definition (Type I and Type II Errors). A Type I error occurs when we reject $H_0$ even though $H_0$ is true. A Type II error occurs when we fail to reject $H_0$ even though $H_0$ is false.
These errors can be summarized by the following chart.

|                      | $H_0$ is true    | $H_0$ is false   |
| -------------------- | ---------------- | ---------------- |
| Reject $H_0$         | Type I error     | Correct decision |
| Fail to reject $H_0$ | Correct decision | Type II error    |
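To connect these definitions back to probability models, here is a minimal simulation sketch estimating both error rates for the hypothetical fair-coin test used above (rejection region $X \le 39$ or $X \ge 61$ out of $n = 100$ flips, chosen for level $0.05$); the alternative value $p = 0.6$ is likewise an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical level-0.05 two-sided test of H_0: p = 0.5 with n = 100 flips,
# using the rejection region X <= 39 or X >= 61.
n, reps = 100, 100_000

def test_rejects(true_p: float) -> np.ndarray:
    """Simulate the experiment many times and record whether the test rejects H_0."""
    heads = rng.binomial(n, true_p, size=reps)
    return (heads <= 39) | (heads >= 61)

type_I_rate = np.mean(test_rejects(0.5))       # H_0 true, but we reject:          Type I error
type_II_rate = 1 - np.mean(test_rejects(0.6))  # H_0 false, but we fail to reject: Type II error
print(type_I_rate, type_II_rate)
```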