Suppose a random variable $X$ has a probability density function $f$ that depends on some unknown parameter $\theta$. We write
$$ X \sim f(x \, ; \, \theta). $$
Some experiment is conducted, and our job is to provide our best estimate for the value of $\theta$, along with an expression for our uncertainty, based on our pre-existing, or prior, assumptions about $\theta$ and a probability model for the data we observe during the experiment.
Bayes' rule for continuous distributions then says the following:
$$ \phi(\theta \, | \, \mathbf{x}) = \frac{f(\mathbf{x} \, ; \, \theta) \phi(\theta)}{f_\phi(\mathbf{x})} $$
where $f_\phi(\mathbf{x}) = \int_\Theta f(\mathbf{x} \, ; \, t) \, \phi(t) \, \mathrm{d} t$ is the average likelihood of observing $\mathbf{x}$, weighted by the prior distribution $\phi$.
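When no closed form is available, the normalizing integral $f_\phi(\mathbf{x})$ can be approximated numerically. Here is a minimal Python sketch of that computation, assuming a Binomial likelihood with made-up data and an illustrative $\mathrm{Beta}(2,2)$ prior (none of these values come from the text):

```python
import numpy as np
from scipy import stats

# Hypothetical data: x successes in n Bernoulli trials, so f(x; p) is Binomial(n, p).
n, x = 20, 7

# Grid of candidate parameter values over Theta = (0, 1).
p_grid = np.linspace(1e-6, 1 - 1e-6, 2001)
dp = p_grid[1] - p_grid[0]

# Prior phi(p): an assumed Beta(2, 2) density evaluated on the grid.
prior = stats.beta(2, 2).pdf(p_grid)

# Likelihood f(x; p) for the observed x, viewed as a function of p.
likelihood = stats.binom(n, p_grid).pmf(x)

# f_phi(x): the average likelihood, approximated by a Riemann sum over Theta.
marginal = np.sum(likelihood * prior) * dp

# Bayes' rule: posterior density phi(p | x) on the grid.
posterior = likelihood * prior / marginal

# Sanity check: the posterior should integrate to (approximately) 1.
print(np.sum(posterior) * dp)
```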
For most of the commonly used probability models, there exists a specially chosen prior distribution such that the posterior distribution belongs to the same family of probability distributions as the prior; such a prior is called conjugate. For example, the Beta distribution is conjugate for the Binomial parameter $p$; the Gamma distribution is conjugate for the Poisson parameter $\mu$; and the normal distribution is conjugate to itself for the parameter $\mu$ when $\sigma$ is assumed to be known.
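As a quick check of the second pairing above, the sketch below updates a Gamma prior on the Poisson mean $\mu$ with some made-up counts; the conjugate posterior $\mathrm{Gamma}\big(a_0 + \sum_i x_i, \; b_0 + n\big)$ (shape and rate) is compared against a brute-force grid computation. The counts and prior parameters are illustrative assumptions only:

```python
import numpy as np
from scipy import stats

# Hypothetical Poisson counts (illustrative only).
counts = np.array([3, 5, 2, 4, 6])
n, total = len(counts), counts.sum()

# Assumed Gamma(a0, rate=b0) prior on the Poisson mean mu.
a0, b0 = 2.0, 1.0

# Conjugate update: posterior is Gamma(a0 + sum(x), rate = b0 + n).
a_post, b_post = a0 + total, b0 + n
posterior = stats.gamma(a_post, scale=1 / b_post)

# Brute-force check on a grid: likelihood times prior, then normalize.
mu = np.linspace(1e-6, 15, 4000)
unnorm = stats.poisson(mu[:, None]).pmf(counts).prod(axis=1) \
         * stats.gamma(a0, scale=1 / b0).pdf(mu)
grid_post = unnorm / (unnorm.sum() * (mu[1] - mu[0]))

# The two densities should agree closely.
print(np.max(np.abs(grid_post - posterior.pdf(mu))))  # small
```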
<aside> <img src="/icons/fleur-de-lis_purple.svg" alt="/icons/fleur-de-lis_purple.svg" width="40px" />
Suppose that $X$ is the outcome of a Binomial experiment with $n$ trials and an unknown YES probability $p$.
If we express our prior uncertainty for $p$ in terms of the $\mathrm{Beta}(a_0,b_0)$ distribution, then the posterior distribution for $p$ is $\mathrm{Beta}\big(a_0 + x, b_0 + (n-x)\big)$.
In particular, the Bayes estimator for $p$ is
$$ \hat{p} = \frac{a_0 + x}{a_0 + b_0 +n}. $$
The standard deviation of the posterior distribution is then
$$ \sigma_{\mathrm{post}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{a_0 + b_0 + n + 1}}. $$
When $a_0 = 0$ and $b_0 = 0$, we say that we are using a non-informative prior; in that case the Bayes estimator reduces to the sample proportion $\hat{p} = x/n$. A short numerical check of these formulas follows this callout.
</aside>
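The formulas in the callout are easy to check numerically. Below is a small sketch, with illustrative values of $a_0$, $b_0$, $n$, and $x$ (not taken from the text), that forms the conjugate posterior and compares the Bayes estimator and posterior standard deviation against the moments reported by `scipy.stats.beta`:

```python
import numpy as np
from scipy import stats

# Illustrative prior and data (assumptions, not from the text).
a0, b0 = 1.0, 1.0   # Beta(a0, b0) prior on p
n, x = 50, 17       # n Binomial trials, x successes

# Conjugate update: posterior is Beta(a0 + x, b0 + (n - x)).
a_post, b_post = a0 + x, b0 + (n - x)
posterior = stats.beta(a_post, b_post)

# Bayes estimator (posterior mean) and posterior standard deviation.
p_hat = (a0 + x) / (a0 + b0 + n)
sigma_post = np.sqrt(p_hat * (1 - p_hat) / (a0 + b0 + n + 1))

# Both should match the moments of the Beta posterior exactly.
print(p_hat, posterior.mean())
print(sigma_post, posterior.std())
```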
Proof. Since the prior for $p$ is a $\mathrm{Beta}\big(a_0, b_0\big)$ distribution, its pdf is
$$ \phi(p) = C_0 \, p^{a_0 - 1}(1-p)^{b_0 - 1} $$
where $C_0$ is a constant that depends only on $a_0$ and $b_0$ (and specifically not on $p$).
Bayes' theorem for the posterior distribution then implies
$$ \begin{aligned} \phi(p \, | \, x) &= \frac{\binom{n}{x} p^x (1-p)^{n-x} C_0 p^{a_0 - 1} (1-p)^{b_0-1}}{f_\phi(x)} \\ &= \frac{C_0 \binom{n}{x}}{f_\phi(x)} \, p^{a_0 + x - 1} (1-p)^{b_0 + (n - x)- 1} \\ &= C_\mathrm{post} \, p^{a_0 + x - 1} (1-p)^{b_0 + (n - x)- 1}. \end{aligned} $$
The key thing to notice is that the terms $C_0$, $\binom{n}{x}$, and $f_\phi(x)$ do not depend on $p$, so they can be gathered together into one constant $C_\mathrm{post}$. We might worry that $C_\mathrm{post}$ is hard to compute, but recall that $\phi(p \, | \, x)$ is a probability density, so it must integrate to 1. Since the terms involving $p$ have exactly the form of a $\mathrm{Beta}\big(a_0 + x, b_0 + (n-x)\big)$ density, $C_\mathrm{post}$ must be the normalizing constant of that distribution, which proves the claim. $\square$
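To make the last step concrete: integrating the kernel $p^{a_0+x-1}(1-p)^{b_0+(n-x)-1}$ over $(0,1)$ gives the Beta function $B\big(a_0+x, \, b_0+(n-x)\big)$, so $C_\mathrm{post} = 1/B\big(a_0+x, \, b_0+(n-x)\big)$. A quick numerical confirmation, using arbitrary illustrative values, is:

```python
from scipy.special import beta as beta_fn
from scipy.integrate import quad

# Illustrative values (assumptions, not from the text).
a0, b0, n, x = 2.0, 3.0, 30, 11

# Unnormalized posterior kernel p^(a0 + x - 1) * (1 - p)^(b0 + (n - x) - 1).
kernel = lambda p: p ** (a0 + x - 1) * (1 - p) ** (b0 + (n - x) - 1)

# Its integral over (0, 1) is the Beta function B(a0 + x, b0 + (n - x)),
# so C_post = 1 / B(a0 + x, b0 + (n - x)).
integral, _ = quad(kernel, 0, 1)
print(integral, beta_fn(a0 + x, b0 + (n - x)))  # should agree
```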
Suppose that I have purchased a bag of Candio candies that contains 100 candies. Of these, 23 are strawberry-flavored. Assuming a non-informative prior, provide a $2\sigma$ posterior credible region for the true proportion of Candios that are strawberry-flavored.
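One way to work this exercise, sketched below using the formulas from the callout: with $a_0 = b_0 = 0$ the posterior is $\mathrm{Beta}(23, 77)$, and the $2\sigma$ region is $\hat{p} \pm 2\sigma_{\mathrm{post}}$.

```python
import numpy as np

# Data: 23 strawberry candies out of n = 100 trials, non-informative prior a0 = b0 = 0.
a0, b0, n, x = 0, 0, 100, 23

# Bayes estimator and posterior standard deviation from the callout.
p_hat = (a0 + x) / (a0 + b0 + n)
sigma_post = np.sqrt(p_hat * (1 - p_hat) / (a0 + b0 + n + 1))

# 2-sigma posterior credible region: p_hat +/- 2 * sigma_post.
lower, upper = p_hat - 2 * sigma_post, p_hat + 2 * sigma_post
print(f"{p_hat:.3f} +/- {2 * sigma_post:.3f}  ->  ({lower:.3f}, {upper:.3f})")
# roughly 0.230 +/- 0.084  ->  (0.146, 0.314)
```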