22. Posterior Predictives#

# Import some helper functions (please ignore this!)
from utils import *

Context: For safety-critical applications of ML, it’s important that our model captures two notions of uncertainty. Aleatoric uncertainty captures inherent stochasticity in the system. In contrast, epistemic uncertainty is uncertainty over possible models that could have fit the data. Multiple models can fit the data when we have a lack of data and/or a lack of mechanistic understanding of the system. We realized that fitting models using the MLE only captured aleatoric uncertainty. To additionally capture epistemic uncertainty, we therefore had to rethink our modeling paradigm. Using Bayes’ rule, we were able to obtain a distribution over model parameters given the data, \(p(\theta | \mathcal{D})\) (the posterior). Using NumPyro, we sampled from the posterior to obtain a diversity of models that fit the data. We interpreted a greater diversity of models as indicating higher epistemic uncertainty.

Challenge: Now that we have a posterior over model parameters, we can capture epistemic uncertainty. But how do we use this diverse set of models to (1) make predictions, and (2) compute the log-likelihood (for evaluation)? To do this, we will derive the posterior predictive, a distribution that translates a distribution over parameters into a distribution over data. This distribution can then be used to make predictions and evaluate the log-likelihood.

Outline:

  • Provide intuition for the posterior predictive

  • Derive the posterior predictive

  • Introduce laws of conditional independence

  • Evaluate the posterior predictive

22.1. Intuition: Model Averaging#

Bayesian Modeling as Ensembling. Recall that in the previous chapter, we introduced ensembling as a way to capture epistemic uncertainty. In ensembling, we train a collection of models independently and hope that, due to quirks in optimization, the resulting models are diverse. In a sense, doesn’t our Bayesian approach provide us with an ensemble as well? After all, each set of parameters \(\theta\) from the posterior \(p(\theta | \mathcal{D})\) represents a different model. Based on this analogy, we can create a “Bayesian” ensemble as follows:

  1. We draw \(S\) samples from the posterior: \(\theta_s \sim p(\theta | \mathcal{D})\).

  2. Each posterior sample represents a different member of our ensemble: \(p(\mathcal{D} | \theta_s)\).

    For regression, we have \(p_{Y | X}(y | x, \theta_s)\).

Predicting. Using this ensemble, we can predict by averaging the predictions of the ensemble members:

  1. We draw \(\mathcal{D}_s \sim p(\cdot | \theta_s)\) for each \(\theta_s\).

    For regression, we draw \(y_s \sim p_{Y | X}(\cdot | x, \theta_s)\).

  2. We average: \(\frac{1}{S} \sum\limits_{s=1}^S \mathcal{D}_s\).

    For regression, we average \(\frac{1}{S} \sum\limits_{s=1}^S y_s\).
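To make this concrete, here is a minimal sketch of prediction by model averaging for a linear-Gaussian regression, \(y \sim \mathcal{N}(\text{slope} \cdot x + \text{intercept}, \sigma^2)\). The arrays slopes and intercepts stand in for posterior samples (in practice you would extract them from your MCMC object); all names and numbers here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples (S = 1000); in practice these come from MCMC.
S = 1000
slopes = rng.normal(2.0, 0.1, size=S)       # slope component of each theta_s
intercepts = rng.normal(-1.0, 0.2, size=S)  # intercept component of each theta_s
sigma = 0.5                                 # assumed known observation noise

x_star = 3.0  # new test input

# Step 1: draw y_s ~ p(y | x*, theta_s) for each posterior sample theta_s.
y_samples = rng.normal(slopes * x_star + intercepts, sigma)

# Step 2: average the draws to form the prediction.
y_pred = y_samples.mean()
print(f"averaged prediction at x* = {x_star}: {y_pred:.3f}")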

Evaluating Log-Likelihood. Given test data, \(\mathcal{D}^*\), we can use the ensemble to evaluate the model’s log-likelihood:

  1. We evaluate \(p(\mathcal{D}^* | \theta_s)\) for each \(\theta_s\).

    For regression, we evaluate \(p_{Y | X}(y^* | x^*, \theta_s)\) for each \(\theta_s\), where \(x^*, y^*\) is a new data point.

  2. We average and take the log: \(\log \frac{1}{S} \sum\limits_{s=1}^S p(\mathcal{D}^* | \theta_s)\).

    For regression, we average and take the log: \(\log \frac{1}{S} \sum\limits_{s=1}^S p_{Y | X}(y^* | x^*, \theta_s)\).
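Continuing the same hypothetical linear-Gaussian setup, the sketch below evaluates the log-likelihood at a test point \((x^*, y^*)\). Note that we average the likelihoods (not the log-likelihoods) and only then take the log; doing this in log-space with logsumexp keeps the computation numerically stable.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical posterior samples, as before.
S = 1000
slopes = rng.normal(2.0, 0.1, size=S)
intercepts = rng.normal(-1.0, 0.2, size=S)
sigma = 0.5

x_star, y_star = 3.0, 5.2  # hypothetical test point

# Step 1: evaluate log p(y* | x*, theta_s) for each posterior sample theta_s.
log_liks = norm.logpdf(y_star, loc=slopes * x_star + intercepts, scale=sigma)

# Step 2: log of the average likelihood, i.e. log (1/S) sum_s p(y* | x*, theta_s).
log_lik = logsumexp(log_liks) - np.log(S)
print(f"estimated log-likelihood at (x*, y*): {log_lik:.3f}")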

Formalizing Intuition. As we will show next, this intuition actually holds for the Bayesian paradigm. That is, we can compute the posterior predictive, \(p(\mathcal{D}^* | \mathcal{D})\), by averaging the likelihood of new data over samples from the posterior.

In the regression case, we have:

(22.1)#\[\begin{align} p(y^* | x^*, \mathcal{D}) &= \mathbb{E}_{\theta \sim p(\theta | \mathcal{D})} \left[ p(y^* | x^*, \theta) \right] \\ &\approx \frac{1}{S} \sum\limits_{s=1}^S p(y^* | x^*, \theta_s), \quad \theta_s \sim p(\theta | \mathcal{D}), \end{align}\]

which is exactly the same formula we got from the “ensembling” analogy, except that the members of the ensemble are draws from the posterior.

22.2. Derivation of the Posterior Predictive#

Goal. We want to derive a formula for \(p(\mathcal{D}^* | \mathcal{D})\), which represents the distribution of new data \(\mathcal{D}^*\) given the observed data, \(\mathcal{D}\).

For a regression model, this distribution is:

(22.2)#\[\begin{align} p_{Y^* | X^*, \mathcal{D}}(y^* | x^*, \mathcal{D}) &= p_{Y^* | X^*, \mathcal{D}}(y^* | x^*, x_1, \dots, x_N, y_1, \dots, y_N), \end{align}\]

where \(x^*\) is a new input for which we’d like to make a prediction, \(y^*\).

A Graphical Model for the Training and Test Data. Notice that our posterior predictive includes a new random variable, \(\mathcal{D}^*\). Let’s incorporate it into our graphical model. This will help us reason about the conditional independence relationships (covered below) that we need in the derivation of the posterior predictive.

As you can see, the original graphical model (for training data) is on the left. We then added a second component on the right for \(M\) test points we have not yet observed. We can similarly create a diagram for regression as follows:

Derivation. Now we have all we need in order to derive a formula for \(p(\mathcal{D}^* | \mathcal{D})\). Our first step is to multiply and divide \(p(\mathcal{D}^* | \mathcal{D})\) by \(p(\mathcal{D})\):

(22.3)#\[\begin{align} p(\mathcal{D}^* | \mathcal{D}) &= \frac{p(\mathcal{D}^* | \mathcal{D}) \cdot p(\mathcal{D})}{p(\mathcal{D})} \end{align}\]

We do this so that we can write the numerator as the joint distribution of \(\mathcal{D}^*\) and \(\mathcal{D}\):

(22.4)#\[\begin{align} &= \frac{p(\mathcal{D}^*, \mathcal{D})}{p(\mathcal{D})} \end{align}\]

Next, we use the law of total probability to re-write the above as a joint distribution over \(\mathcal{D}^*\), \(\mathcal{D}\), and \(\theta\). We do this to introduce \(\theta\) into the equation—since our model’s prior, likelihood, and posterior all depend on \(\theta\), it would be weird if the formula for \(p(\mathcal{D}^* | \mathcal{D})\) didn’t depend on it. This gives us:

(22.5)#\[\begin{align} &= \frac{\int p(\mathcal{D}^*, \mathcal{D}, \theta) \cdot d\theta}{p(\mathcal{D})} \end{align}\]

Now, we can factorize this joint distribution to get one term that’s the posterior, \(p(\theta | \mathcal{D})\), and one term that’s the marginal, \(p(\mathcal{D})\):

(22.6)#\[\begin{align} &= \frac{\int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot p(\mathcal{D}) \cdot d\theta}{p(\mathcal{D})} \end{align}\]

Since \(p(\mathcal{D})\) doesn’t depend on \(\theta\), we can take it out of the integral, thereby canceling it with the \(p(\mathcal{D})\) in the denominator:

(22.7)#\[\begin{align} &= \int p(\mathcal{D}^* | \theta, \mathcal{D}) \cdot p(\theta | \mathcal{D}) \cdot d\theta \end{align}\]

Finally, using the laws of conditional independence, we know that \(p(\mathcal{D}^* | \theta, \mathcal{D}) = p(\mathcal{D}^* | \theta)\). This is because, by conditioning on \(\theta\), we remove all paths connecting \(\mathcal{D}\) to \(\mathcal{D}^*\). In other words, \(\theta\) summarizes all information from \(\mathcal{D}\) needed to predict \(\mathcal{D}^*\) (we cover the laws of conditional independence in depth below). This gives us the following equation:

(22.8)#\[\begin{align} &= \int \underbrace{p(\mathcal{D}^* | \theta)}_{\text{likelihood of new data}} \cdot \underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} \cdot d\theta \end{align}\]

As you can see, \(p(\mathcal{D}^* | \mathcal{D})\) is a function of the posterior and the joint data likelihood of the new data. Adding some syntactic sugar, we can write the above equation as:

(22.9)#\[\begin{align} &= \mathbb{E}_{p(\theta | \mathcal{D})} \left[ p(\mathcal{D}^* | \theta) \right] \end{align}\]

This shows that to evaluate \(p(\mathcal{D}^* | \mathcal{D})\), we need to:

  1. Draw posterior samples \(\theta \sim p(\theta | \mathcal{D})\).

  2. Average the likelihood \(p(\mathcal{D}^* | \theta)\) across these samples.

As you can see, this matches our intuition exactly!
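In NumPyro, these two steps are typically carried out with MCMC followed by the Predictive utility. The sketch below assumes a simple Bayesian linear regression; the model, priors, and data are illustrative, not the chapter’s own.

import jax.numpy as jnp
import jax.random as jrandom
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, Predictive

def model(x, y=None):
    slope = numpyro.sample("slope", dist.Normal(0.0, 1.0))
    intercept = numpyro.sample("intercept", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("y", dist.Normal(slope * x + intercept, sigma), obs=y)

# Hypothetical training data.
x_train = jnp.linspace(-2.0, 2.0, 50)
y_train = 2.0 * x_train - 1.0 + 0.5 * jrandom.normal(jrandom.PRNGKey(0), (50,))

# Step 1: draw posterior samples theta_s ~ p(theta | D).
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(jrandom.PRNGKey(1), x=x_train, y=y_train)
posterior_samples = mcmc.get_samples()

# Step 2: for each theta_s, draw y* ~ p(y* | x*, theta_s), then average.
x_star = jnp.array([3.0])
predictive = Predictive(model, posterior_samples)
y_star_samples = predictive(jrandom.PRNGKey(2), x=x_star)["y"]
print("posterior predictive mean:", y_star_samples.mean(axis=0))

The samples in y_star_samples come from the full posterior predictive, so their spread reflects both aleatoric noise (through \(\sigma\)) and epistemic uncertainty (through the diversity of posterior samples).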

22.3. Laws of Conditional Independence#

Motivation. Recall that in a linear regression model, we often sample the slope and intercept independently under the prior. For example, we may choose to draw each from a normal distribution (with no correlations). However, when we condition on data (i.e. under the posterior), they are no longer independent. Why does this happen? To fit the data, if the slope increases, the intercept has to decrease (and vice versa). Generalizing this insight: when we condition on a variable, we need to rethink the statistical dependencies among the remaining variables. This becomes important when deriving distributions like the posterior predictive.
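As a small numeric illustration of this point, the sketch below uses a conjugate Bayesian linear regression (Gaussian prior, known noise variance, both assumed here for tractability) and checks the sign of the posterior correlation between intercept and slope. With positive inputs it comes out negative, matching the intuition that raising the slope forces the intercept down.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=30)
y = 2.0 * x - 1.0 + 0.5 * rng.normal(size=30)

X = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope
sigma2, tau2 = 0.25, 1.0                   # assumed noise and prior variances

# Prior: theta ~ N(0, tau2 * I), so slope and intercept are independent a priori.
# Posterior covariance (conjugate case): (X^T X / sigma2 + I / tau2)^{-1}.
post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
corr = post_cov[0, 1] / np.sqrt(post_cov[0, 0] * post_cov[1, 1])
print(f"posterior correlation(intercept, slope) = {corr:.2f}")  # negative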

We will present three general cases here that can be applied to more complicated models. In all three cases, we have three random variables: \(A\), \(B\), and \(C\). We will then condition on \(B\) and see what happens to the statistical dependence between \(A\) and \(C\).

Case 1: Intuition. Under the generative process, \(B\) depends on \(A\) and \(C\) depends on \(B\). When conditioning on \(B\), however, \(A\) and \(C\) become statistically independent:

(22.10)#\[\begin{align} p_{A, C | B}(a, c | b) &= p_{A | B}(a | b) \cdot p_{C | B}(c | b) \end{align}\]

Example: Let \(A\) be a latent variable, describing whether a patient has or doesn’t have COVID. \(B\) is the result of a COVID-test; it depends on \(A\), since having the disease means a greater chance of testing positive. Of course, there’s some probability that even the test could be wrong. Finally, \(C\) describes whether the doctor will prescribe the patient COVID medication. \(C\) only depends on \(B\) because the doctor can only act on the results of the test—they have no other way of knowing whether the patient has or doesn’t have the disease. Again, even given a positive test, the doctor might still choose not to prescribe medicine.

Given \(B\), \(A\) and \(C\) are independent. This is because given a positive test result (\(B = 1\)), we can infer the chance that the patient actually has COVID (\(A = 1\)). But the doctor’s decision to prescribe medication (\(C = 1\)) is based only on the result of the test.

Case 1: Derivation. Before we begin the derivation, notice that we can write a conditional distribution by dividing the joint by the marginal:

(22.11)#\[\begin{align} p_{A, C | B}(a, c | b) \cdot p_B(b) &= p_{A, B, C}(a, b, c) \\ p_{A, C | B}(a, c | b) &= \frac{p_{A, B, C}(a, b, c)}{p_B(b)} \end{align}\]

We start by using this fact:

(22.12)#\[\begin{align} p_{A, C | B}(a, c | b) &= \frac{p_{A, B, C}(a, b, c)}{p_B(b)} \\ &= \frac{p_A(a) \cdot p_{B | A}(b | a) \cdot p_{C | B}(c | b)}{p_B(b)} \quad (\text{factorizing the joint using the DGM in Case 1}) \\ &= \frac{p_A(a) \cdot \frac{p_{A | B}(a | b) \cdot p_B(b)}{p_A(a)} \cdot p_{C | B}(c | b)}{p_B(b)} \quad (\text{Bayes' rule}) \\ &= p_{A | B}(a | b) \cdot p_{C | B}(c | b) \quad (\text{cancel out terms}) \end{align}\]
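If you’d like to see this numerically, the toy check below (made-up probability tables) builds the joint for the chain \(A \to B \to C\), conditions on \(B\), and verifies that the conditional joint factorizes into the product of its marginals.

import numpy as np

p_a = np.array([0.3, 0.7])             # p(A)
p_b_given_a = np.array([[0.9, 0.1],    # p(B | A), rows indexed by a
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.6, 0.4],    # p(C | B), rows indexed by b
                        [0.1, 0.9]])

# Joint p(a, b, c) = p(a) p(b | a) p(c | b).
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]

b = 1  # condition on B = 1
p_ac_given_b = joint[:, b, :] / joint[:, b, :].sum()
p_a_given_b = p_ac_given_b.sum(axis=1)
p_c_given_b_cond = p_ac_given_b.sum(axis=0)

# Case 1: p(a, c | b) = p(a | b) p(c | b).
print(np.allclose(p_ac_given_b, np.outer(p_a_given_b, p_c_given_b_cond)))  # True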

Case 2: Intuition. Under the generative process, both \(A\) and \(C\) depend on \(B\). When conditioning on \(B\), however, \(A\) and \(C\) become statistically independent:

(22.13)#\[\begin{align} p_{A, C | B}(a, c | b) &= p_{A | B}(a | b) \cdot p_{C | B}(c | b) \end{align}\]

Example: Let \(B\) be a latent variable, describing whether a patient has or doesn’t have COVID. Let \(A\) be the result of a COVID test; it depends on \(B\), since having the disease means a greater probability of testing positive. Finally, let \(C\) describe whether the patient infects someone else with COVID. \(C\) depends on \(B\), since the patient can only infect someone else if they actually have COVID. In general, \(A\) and \(C\) are not independent, since knowing \(A\) tells us something about \(C\) (and vice versa); if a patient tests positive, they are more likely to have COVID, and therefore also more likely to infect someone else. Here, information passes from \(A\) to \(C\) through \(B\). However, conditioning on \(B\) (i.e. knowing whether the patient has COVID) means that \(A\) and \(C\) are now statistically independent. This is because information can no longer travel from \(A\) to \(C\) through \(B\). Given \(B\), there’s a fixed probability of testing positive, and that probability is independent of whether the patient will infect someone else.

Case 2: Derivation. We start the same way as we did for Case 1:

(22.14)#\[\begin{align} p_{A, C | B}(a, c | b) &= \frac{p_{A, B, C}(a, b, c)}{p_B(b)} \\ &= \frac{p_B(b) \cdot p_{A | B}(a | b) \cdot p_{C | B}(c | b)}{p_B(b)} \quad (\text{factorizing the joint using the DGM in Case 2}) \\ &= p_{A | B}(a | b) \cdot p_{C | B}(c | b) \quad (\text{cancel out terms}) \end{align}\]
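The same kind of toy check (again with made-up tables) works for the fork \(A \leftarrow B \rightarrow C\): conditioning on \(B\) makes the conditional joint factorize.

import numpy as np

p_b = np.array([0.1, 0.9])             # p(B)
p_a_given_b = np.array([[0.7, 0.3],    # p(A | B), rows indexed by b
                        [0.05, 0.95]])
p_c_given_b = np.array([[0.8, 0.2],    # p(C | B), rows indexed by b
                        [0.3, 0.7]])

# Joint p(b, a, c) = p(b) p(a | b) p(c | b).
joint = p_b[:, None, None] * p_a_given_b[:, :, None] * p_c_given_b[:, None, :]

b = 0  # condition on B = 0
p_ac_given_b = joint[b] / joint[b].sum()

# Case 2: p(a, c | b) = p(a | b) p(c | b).
print(np.allclose(p_ac_given_b, np.outer(p_a_given_b[b], p_c_given_b[b])))  # True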

Case 3: Intuition. Under the generative process, \(A\) and \(C\) are independent. \(B\) then depends on \(A\) and \(C\). When conditioning on \(B\), however, \(A\) and \(C\) are statistically dependent:

(22.15)#\[\begin{align} p_{A, C | B}(a, c | b) &\neq p_{A | B}(a | b) \cdot p_{C | B}(c | b) \end{align}\]

Example: Let \(B\) describe whether a patient has heart disease. Let \(A\) be lifestyle factors (like diet) that could increase the chance of having heart disease, and let \(C\) be genetic factors that contribute to heart disease. \(B\) depends on both \(A\) and \(C\): the probability of heart disease increases with the presence of both lifestyle and genetic factors. Moreover, here we assume that \(A\) and \(C\) are independent—whether you have a certain lifestyle doesn’t tell us about your genes and vice versa. However, conditioning on \(B\), \(A\) and \(C\) are no longer independent. If we know an individual has heart disease, and we know they don’t have a lifestyle that contributes to the disease, then they are more likely to have genetic factors. Similarly, if we know an individual has heart disease, but they don’t have genetic factors, they are more likely to have lifestyle factors.

Case 3: Derivation. Applying the same tricks we used in the previous two cases to factorize \(p_{A, C | B}(a, c | b)\), we always end up back with \(p_{A, C | B}(a, c | b)\) itself: the conditional joint cannot be factorized into \(p_{A | B}(a | b) \cdot p_{C | B}(c | b)\).
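And for the collider \(A \rightarrow B \leftarrow C\), the analogous toy check (made-up numbers, roughly mirroring the heart-disease example) shows the factorization failing once we condition on \(B\).

import numpy as np

p_a = np.array([0.5, 0.5])                 # p(A): lifestyle factor absent/present
p_c = np.array([0.5, 0.5])                 # p(C): genetic factor absent/present
p_b1_given_ac = np.array([[0.05, 0.5],     # p(B = 1 | a, c): disease more likely
                          [0.5, 0.9]])     # when either factor is present

# Joint p(a, c, b) with b in {0, 1}, indexed as joint[a, c, b].
joint = np.stack([p_a[:, None] * p_c[None, :] * (1 - p_b1_given_ac),
                  p_a[:, None] * p_c[None, :] * p_b1_given_ac], axis=-1)

b = 1  # condition on B = 1 (disease present)
p_ac_given_b = joint[:, :, b] / joint[:, :, b].sum()
p_a_given_b = p_ac_given_b.sum(axis=1)
p_c_given_b = p_ac_given_b.sum(axis=0)

# Case 3: p(a, c | b) != p(a | b) p(c | b).
print(np.allclose(p_ac_given_b, np.outer(p_a_given_b, p_c_given_b)))  # False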

Exercise: Laws of Conditional Independence

Part 1: Look at the graphical model for Bayesian regression that includes both the training and test data. Having conditioned on \(X_1, X_2, X_1^*, X_2^*\),

  • Is \(Y_1\) independent of \(Y_2\)?

  • Is \(Y_1\) independent of \(Y_1^*\)?

  • Is \(Y_1^*\) independent of \(Y_2^*\)?

Justify your reasoning.

Part 2: Look at the graphical model for Bayesian regression that includes both the training and test data. Having conditioned on \(X_1, X_2, X_1^*, X_2^*\), as well as on \(\theta\),

  • Is \(Y_1\) independent of \(Y_2\)?

  • Is \(Y_1\) independent of \(Y_1^*\)?

  • Is \(Y_1^*\) independent of \(Y_2^*\)?

Justify your reasoning.

Part 3: Consider the directed graphical model below.

Using the laws of conditional probability,

  • Factorize \(p_{A, B, D, E | C}(a, b, d, e | c)\).

  • Factorize \(p_{A, C, D, E | B}(a, c, d, e | b)\).

  • Factorize \(p_{A, B, C, D | E}(a, b, c, d | e)\).

Part 4: Consider the directed graphical model below.

Using the laws of conditional probability,

  • Factorize \(p_{B, C, D, E | A}(b, c, d, e | a)\).

  • Factorize \(p_{A, C, D, E | B}(a, c, d, e | b)\).

  • Factorize \(p_{A, B, D, E | C}(a, b, d, e | c)\).

Part 5: Consider the directed graphical model below.

Using the laws of conditional probability,

  • Factorize \(p_{B, C, D, E | A}(b, c, d, e | a)\).

  • Factorize \(p_{A, C, D, E | B}(a, c, d, e | b)\).

  • Factorize \(p_{A, B, C, E | D}(a, b, c, e | d)\).

22.4. Posterior Predictive of Different Models#

Exercise: Derive the Posterior Predictive Distribution

For each of the models below, derive the posterior predictive. You may find it helpful to draw the directed graphical model that captures both the train and test data.

Part 1: Bayesian predictive model.

(22.16)#\[\begin{align} \theta &\sim p_\theta(\cdot) \\ y_n | x_n, \theta &\sim p_{Y | X}(\cdot | x_n, \theta) \end{align}\]

The posterior predictive is: \(p_{Y^* | X^*, X_{1:N}, Y_{1:N}}(y^* | x^*, x_{1:N}, y_{1:N})\), where \(x_{1:N}\) and \(y_{1:N}\) denote the full data.

Part 2: Bayesian Factor Analysis.

(22.17)#\[\begin{align} \theta &\sim p_\theta(\cdot) \\ z_n &\sim p_Z(\cdot) \\ x_n | z_n, \theta &\sim p_{X | Z, \theta}(\cdot | z_n, \theta) \end{align}\]

The posterior predictive is: \(p_{X^* | X_{1:N}}(x^* | x_{1:N})\), where \(x_{1:N}\) denotes the full data.

Part 3: Bayesian predictive model with latent variable.

(22.18)#\[\begin{align} \theta &\sim p_\theta(\cdot) \\ z_n &\sim p_Z(\cdot) \\ y_n | x_n, z_n, \theta &\sim p_{Y | X, Z, \theta}(\cdot | x_n, z_n, \theta) \end{align}\]

The posterior predictive is: \(p_{Y^* | X^*, X_{1:N}, Y_{1:N}}(y^* | x^*, x_{1:N}, y_{1:N})\), where \(x_{1:N}\) and \(y_{1:N}\) denote the full data.

Part 4: Bayesian Concept-Bottleneck model (CBM). CBMs aim to make it easier to interpret model predictions. They do this by combining two models:

  • CBMs first learn to predict “concepts” \(c_n\) associated with input \(x_n\). In a CBM, a concept is just a discrete attribute associated with the input; for example, if \(x_n\) is an image of wildlife, a concept could be rain, grass, dog, etc. You can think of \(p_{C | X}\) as a classifier.

  • After having predicted the concept \(c_n\) from the input \(x_n\), CBMs attempt to predict the final output \(y_n\) from the concept only. In this way, predictions of \(y_n\) can be analyzed in terms of the concepts, which are easier to understand, instead of with respect to the inputs, which could be high dimensional and difficult to reason about.

A Bayesian CBM has the following generative process:

(22.19)#\[\begin{align} \theta &\sim p_\theta(\cdot) \\ \phi &\sim p_\phi(\cdot) \\ c_n | x_n, \theta &\sim p_{C | X}(\cdot | x_n, \theta) = \mathrm{Cat}(\pi(x_n; \theta)) \\ y_n | c_n, \phi &\sim p_{Y | C}(\cdot | c_n, \phi), \end{align}\]

where \(\pi(\cdot; \theta)\) is a function that maps \(x_n\) to the parameters of a categorical distribution.

The posterior predictive is: \(p_{Y^* | X^*, X_{1:N}, C_{1:N}, Y_{1:N}}(y^* | x^*, x_{1:N}, c_{1:N}, y_{1:N})\), where \(x_{1:N}\), \(c_{1:N}\), and \(y_{1:N}\) denote the full data.