Probabilistic modeling in Python

Bayesian modeling and related

Guangyuan (Frank) Li
Jan 5, 2023 · 12 min read

Compared to deep learning topics and techniques, I have found very few tutorials focused on probabilistic modeling. Strictly speaking, deep learning can itself be viewed as a form of probabilistic modeling, depending on how you frame the problem. Probabilistic modeling is a powerful tool for modeling complex dependencies, integrating prior knowledge, and accounting for uncertainty. So I want to offer a focused tutorial on it, specifically in Python.

Distributions

The fundamental idea of probabilistic modeling is that whatever we observe, in terms of data, is just a sample from a distribution. If you observe a tail in a "coin toss" game, it is a sample from a Bernoulli distribution; if you observe a 4 in a "dice toss" game, it is a sample from a categorical distribution, and so on. We call what we observe a random variable, and this random variable follows an underlying distribution.

There are all kinds of well-defined distributions out there, so the first part of this tutorial focuses on how to choose the right distribution for the observed data. Then I will briefly show how to write code for them.

Starting from the simplest one: the Bernoulli distribution models a binary outcome like tossing a coin, where the random variable can only be 0 or 1; we say the support of this random variable is {0,1}. Given a distribution, it is very important to know four things (illustrated in the snippet after this list):

  1. the parameters that define the distribution
  2. the moments (mean, variance, skewness, kurtosis)
  3. the log probability
  4. a visual of what the distribution looks like
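As a quick sketch, here is how you might inspect those four things for a Bernoulli distribution with scipy (the value p=0.3 is just an arbitrary example):

from scipy.stats import bernoulli

d = bernoulli(p=0.3)       # 1. the single parameter p defines the distribution
d.stats(moments='mvsk')    # 2. mean, variance, skewness, kurtosis
d.logpmf(1)                # 3. log probability of observing a 1
d.pmf([0, 1])              # 4. the PMF over the support {0, 1}, ready to plot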

Those can all be found on each distribution's Wikipedia page, and there are a few websites I really like that can generate interactive PDF/PMF plots for each distribution:

  1. Distribution Calculator (Gamma distribution as an example)
  2. Distribution Explorer (Gamma distribution as an example)

Then the most important thing is to know under which scenarios to use which distribution. I think this is heavily empirical, and I have shared some of my experience here.

Regarding the actual Python code, I generally suggest using either the scipy or the pytorch module. The pytorch module seems to have a parameter space more consistent with the classic definitions, whereas scipy has its own loc and scale parameters that you need to be aware of (related to the affine transformation in pytorch). However, scipy has a lot of handy functions like ppf to facilitate quick visualization, whereas pytorch mainly implements the sample and log_prob functions for optimization purposes.

Here we use a normal distribution to look at the handy functions included in the scipy package. Also, as mentioned in the paragraph above, sometimes reparameterization is necessary, and there are generic reparameterization tricks specifically designed for better training or better interpretability.

from scipy.stats import norm
# probability density at a given point
norm.pdf(2, loc=0, scale=1)  # ~0.054
# cumulative distribution function at a given point
norm.cdf(0, loc=0, scale=1)  # 0.5
# percent point function: the x at which a given quantile lies to the left of x
norm.ppf(0.5, loc=0, scale=1)  # 0
# survival function, the complement of the cdf: sf = 1 - cdf
norm.sf(0, loc=0, scale=1)  # 0.5
# inverse survival function, the complement of ppf: the x at which a given quantile lies to the right of x
norm.isf(0.5, loc=0, scale=1)  # 0
# sample from the distribution
norm.rvs(size=100, loc=0, scale=1, random_state=1)
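For comparison, here is a rough pytorch counterpart, a minimal sketch using torch.distributions (note that icdf plays the role of scipy's ppf):

import torch
from torch.distributions import Normal

d = Normal(loc=0.0, scale=1.0)
d.log_prob(torch.tensor(2.0)).exp()  # density at 2, ~0.054
d.cdf(torch.tensor(0.0))             # 0.5
d.icdf(torch.tensor(0.5))            # 0, the ppf equivalent
d.sample((100,))                     # draw 100 samples
d.mean, d.stddev                     # the first two moments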

Bayesian inference

The whole idea of Bayesian inference is that, given observed data and underlying parameters of interest, instead of directly estimating the optimal parameters, we hold certain beliefs (the prior) about the parameters and use the observed data to correct those beliefs, which results in the posterior distribution. The essence of this transformation is Bayes' theorem. As an example, suppose we want to know whether a coin is biased. We hold the prior belief that it is fair (p=0.5), but after 100 tosses, 80 land on one side; now our belief shifts toward the coin being biased (p closer to 0.8).

Usually, we assume the prior follows some distribution, and then we try to derive the posterior. In an ideal situation, we can derive the posterior distribution directly: for instance, if the prior is a beta distribution and the observed data happen to be binomial, then the posterior is a beta as well, with a very intuitive interpretation. The difficulty is that we can rarely derive a closed form of the posterior when the model becomes more complicated. So we need methods to help us solve this problem, which we call Bayesian computation.
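To make the conjugate case concrete, here is a small sketch of the beta-binomial update, reusing the coin example: a Beta(a, b) prior plus k successes in m trials gives a Beta(a+k, b+m-k) posterior.

from scipy.stats import beta

a, b = 1, 1        # Beta(1,1) is a uniform prior on the coin's p
k, m = 80, 100     # observe 80 of 100 tosses landing on one side
posterior = beta(a + k, b + m - k)
posterior.mean()   # ~0.79, the belief shifts toward a biased coin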

Bayesian computation

Bayesian computation consists of an array of methods that help us sample points from virtually any posterior distribution, from which we can make point estimates of the parameters of interest.

1. Accept-Reject method

Let's start with an example: suppose we want to draw points from a posterior distribution f(x) with the following functional form:

Example posterior f(x) (image by author)

We call this f(x) the target distribution/density, and we then propose a simpler distribution (the candidate distribution) from which we can sample points. Here we use a uniform distribution g(x), and we hope that when multiplying g(x) by a constant M, Mg(x) envelops f(x). In this case, M can simply be chosen as the maximum value of f(x).

Now we first independently sample from another uniform distribution, Uniform(0,1), and denote the sampled points as u. We also sample from the candidate distribution g(x) and denote those sampled points as y. We then test whether the sampled points satisfy the following criterion:

Accept-reject criterion (image by author)

If the criterion holds, we treat y as a sample from the target distribution; otherwise, we reject y. This step might be difficult to understand at first glance. The idea is that if there is a region, say [0.2,0.3], where the target density f(x) is very high, then f(y)/(M*g(y)) within this region will be high as well, probably close to 1. Since u is randomly drawn from a standard uniform distribution (u just controls whether to reject or accept y), the criterion is more likely to hold, which in turn results in the acceptance of the sampled y. Now let's write the code:

import numpy as np
import seaborn as sns
from scipy.stats import uniform

# the target density f(x) and its maximum M over a fine grid
x = np.linspace(0, 1, 1000)
f_x = x**1.7 * ((1-x)/(1+x))**5.3
M = f_x.max()

def f(x):
    return x**1.7 * ((1-x)/(1+x))**5.3

def g(x):
    return uniform.pdf(x, loc=0, scale=1)

n = 2500
u = uniform.rvs(loc=0, scale=1, size=n, random_state=1)
y = uniform.rvs(loc=0, scale=1, size=n, random_state=2)
f_y = f(y)
g_y = g(y)
# accept y when u <= f(y) / (M * g(y))
acc_or_rej = u <= f_y / (M * g_y)
accepted_y = y[acc_or_rej]
sns.histplot(accepted_y)

The estimated target distribution is plotted below:

The Accept-Reject method can estimate the target distribution f(x) (image by author)

2. Markov Chain Monte Carlo (MCMC)

Before explaining what the MCMC algorithm is, we need to first separately understand the Monte Carlo method and Markov Chain.

Monte Carlo methods represent a broad class of algorithms that rely on repeated random sampling to estimate a desired quantity. The quantity could be the value of pi or the integral of a function with no familiar form. The most classic Monte Carlo example is estimating pi: we repeatedly sample x and y from two standard uniform distributions and ask whether x**2 + y**2 < 1; if so, we increment a counter by 1. At the end of the sampling, the ratio of the counter to the total number of iterations should approach pi/4 (by the area formula for the unit square and the inscribed quarter circle), as sketched below.
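A minimal sketch of that pi estimate:

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x, y = rng.uniform(size=n), rng.uniform(size=n)
inside = (x**2 + y**2 < 1).sum()  # points falling inside the quarter circle
pi_hat = 4 * inside / n           # ~3.14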

A Markov chain is a random process where the current value depends only on its immediate predecessor. A simple example I like a lot is the weather forecast: if we assume there are three possible states for the weather on a day (sunny, cloudy, rainy), we have an initial distribution where p(sunny)=0.6, p(cloudy)=0.3, and p(rainy)=0.1. Now we are given a transition kernel K, which in this example is a matrix/look-up table where the rows are the states on the previous day and the columns are the states on the current day:

transition kernel K (image by author)

We can derive the distribution of the states on any day using the initial distribution and the transition kernel. As we keep doing this, if the state distribution stops changing, the chain has reached its stationary distribution, as the sketch below illustrates.
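The transition kernel in the figure is not reproduced here, so the entries below are hypothetical, but the propagation logic is the same:

import numpy as np

# hypothetical transition kernel: rows are yesterday's state, columns today's
K_mat = np.array([[0.7, 0.2, 0.1],   # sunny  -> sunny/cloudy/rainy
                  [0.3, 0.5, 0.2],   # cloudy -> ...
                  [0.2, 0.4, 0.4]])  # rainy  -> ...
p = np.array([0.6, 0.3, 0.1])        # initial distribution
for _ in range(50):
    p = p @ K_mat                    # propagate one day forward
print(p)  # no longer changing: the stationary distribution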

Although the discrete scenario for a Markov chain is easier to intuit, we eventually need to generalize it to the continuous case where the state space is infinite: instead of having only three states (sunny, cloudy, rainy), we have an infinite number of states. So we need to use a distribution to represent the state distribution at each time point. For instance, we might say the initial distribution of states is Normal(0,3), and the transition probability is also a distribution, p(x_{n+1}|x_n) = K(x_{n+1}|x_n) = Normal(x_n, 0.1); all the above-mentioned properties carry over from the discrete scenario.
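Simulating that continuous-state chain takes one draw per step (here treating the second parameter of Normal as the standard deviation):

import numpy as np

rng = np.random.default_rng(1)
x = np.empty(1000)
x[0] = rng.normal(0, 3)              # x_0 ~ Normal(0, 3)
for i in range(1, 1000):
    x[i] = rng.normal(x[i-1], 0.1)   # x_i | x_{i-1} ~ Normal(x_{i-1}, 0.1)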

Now that we understand Monte Carlo and Markov chains, what is the MCMC algorithm, and how does it help us sample points from hard-to-solve posterior distributions? The idea is that if we create a Markov chain (we need an initial distribution and a transition distribution) and specifically require its stationary distribution to equal the target density f(x), then as we generate samples along the Markov chain, the sample points will converge to the stationary state. Compared to the Accept-Reject method, MCMC is more general and particularly useful when dealing with high-dimensional modeling (not just one variable but multiple). Now let's again use a concrete example to illustrate the MCMC algorithm. In particular, we will use one of the most popular MCMC algorithms, the Metropolis-Hastings (MH) algorithm.

Now we want to sample from the target density f(x):

target density f(x) (image by author)

Assume we create a Markov chain whose transition kernel q(x,y) is as follows (where x is the current time point and y is the previous one):

transition kernel (image by author)

The Metropolis-Hastings (MH) algorithm starts by initializing the first value x[0] of the vector x (imagine we are going to sample n=10,000 values of x using MCMC). Then in each iteration, we first generate an x_cand using the transition kernel, and we compute an acceptance rate alpha to decide whether the next x adopts x_cand or just repeats the last x value. The acceptance rate is shown below:

the acceptance rate in the MH algorithm (image by author)

This formula encodes two rules for traversing the Markov chain. One is to make sure the accepted x_cand approaches the high-probability region of the target density, which is determined by the f part. However, this carries the risk that x_cand gets stuck in one high-probability region, which is not ideal, so the q part favors large traversal steps to make sure x explores different regions of the domain. Let's look at the code:

import math
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm, uniform

def f(x):
    # target density (unnormalized)
    return 2 * x**2 * (1-x)**8 * math.cos(4*math.pi*x)**2

def q(x, y):
    # transition kernel: density of proposing x given the previous value y
    return norm.pdf(x, loc=y, scale=0.1)

n = 10000
x = np.zeros(n)
x[0] = norm.rvs(loc=0, scale=0.1, size=1)[0]
for i in range(n-1):
    # propose candidates until one falls inside the support [0, 1]
    while True:
        x_cand = norm.rvs(loc=x[i], scale=0.1, size=1)[0]
        if 0 <= x_cand <= 1:
            break
    rho = (q(x[i], x_cand) / q(x_cand, x[i])) * (f(x_cand) / f(x[i]))
    alpha = min(1, rho)
    u = uniform.rvs(loc=0, scale=1, size=1)[0]
    if u < alpha:
        x[i+1] = x_cand   # accept the candidate
    else:
        x[i+1] = x[i]     # keep the current value
sns.histplot(x)
fig, ax = plt.subplots()
ax.plot(np.arange(n), x)

The estimated target density f(x):

MCMC to estimate target density f(x) (image by author)

And we can trace the dynamics of x:

trace plot of x (image by author)

There are other variations of MCMC algorithms aside from the MH algorithm. For example, the random-walk Metropolis-Hastings algorithm drops the q part in the computation of the acceptance ratio (the symmetric proposal makes it cancel), while the independent Metropolis-Hastings algorithm writes the q part without conditioning on the current state. Another useful variation of MCMC is the Gibbs sampler, which aims to tackle multi-dimensional parameters: instead of needing a suitable multivariate proposal distribution, we update each parameter using its full conditional. Everything else is similar to the vanilla MH algorithm. Moreover, in the original MH algorithm, the scale of the proposal density q, if chosen to be a smaller value (say 0.03 instead of 0.1), restricts the step size of the Markov chain generation process (see below).

Change the scale from 0.1 to 0.03 (image by author)

Hierarchical Bayesian modeling (an example using Gibbs sampling MCMC)

With all the tools we went over in the last section, you should now be able to model your data with a Bayesian model (find a proper distribution for each parameter of interest) and make point estimates from the posterior distribution of your parameters, whether or not a closed-form solution exists, thanks to the powerful Bayesian computation approaches.

However, in real-world scenarios, it is common to encounter cases where the underlying parameters are associated in some way. For example, say we want to model the post-surgery survival rate lambda for a certain disease A across K hospitals in a city; within each hospital, we have n patients' survival data y from which to infer the parameter. It is easy to represent the model in the following way:

Single-layer Bayesian model (image by author)

But it is natural to sense dependencies between the K lambda values, as there must be some relationship between the survival rate in the first hospital and the second one, rather than them being completely independent. In other words, knowing lambda_1 may tell you something about the value of lambda_2. So we need to build another layer of the Bayesian model:

The second layer of the Bayesian model (image by author)

This is what we call a hierarchical Bayesian model, which is more common in real cases. Now we can formulate the posterior distribution of all the parameters in our model:

The posterior distribution for parameters (image by author)

Where we simplify the prior to be:

simplified prior distribution (image by author)

Since this is a high-dimensional posterior distribution, Gibbs sampling is a natural way to sample points from it. To use Gibbs sampling, we first need to derive the full conditional of each parameter, meaning p(lambda|…), p(sigma|…), p(mu|…), and p(tau|…); the way to derive them is to retain only the terms in the full posterior that contain the parameter of interest:

Full conditional for lambda (image by author)

The second step above requires some rewriting of the PDF of the normal distribution.
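Since the figure is not reproduced here, the standard normal-normal result (assuming sigma and tau denote variances, as the code below does) is:

$$\lambda_i \mid \cdot \;\sim\; \mathcal{N}\!\left(\frac{\mu/\tau + n\bar{y}_i/\sigma}{1/\tau + n/\sigma},\;\; \frac{1}{1/\tau + n/\sigma}\right)$$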

We can derive the other full conditionals in the same way:

full conditional for p(sigma|…) and p(tau|…) (image by author)
Full conditional for p(mu|…) (image by author)
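Again as a reconstruction under the same assumptions (inverse-gamma conditionals for the two variances, and a flat prior on mu):

$$\sigma \mid \cdot \sim \mathrm{InvGamma}\!\left(\frac{Kn}{2},\; \frac{1}{2}\sum_{i=1}^{K}\sum_{j=1}^{n}(y_{ij}-\lambda_i)^2\right), \quad \tau \mid \cdot \sim \mathrm{InvGamma}\!\left(\frac{K}{2},\; \frac{1}{2}\sum_{i=1}^{K}(\lambda_i-\mu)^2\right), \quad \mu \mid \cdot \sim \mathcal{N}\!\left(\bar{\lambda},\; \frac{\tau}{K}\right)$$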

With all the full conditionals derived, we can finally code it:

import math
import numpy as np
from scipy.stats import norm, uniform, invgamma

nruns = 10000
K = 100   # K hospitals
n = 1000  # each hospital has n patients
# simulated observations as a stand-in for real survival data
y = norm.rvs(loc=5, scale=1, size=(K, n), random_state=0)
lambda_est = np.zeros((K, nruns))
sigma_est = np.zeros(nruns)
mu_est = np.zeros(nruns)
tau_est = np.zeros(nruns)

# initialize the chain
for i in range(K):
    loc = uniform.rvs(loc=0, scale=10, size=1)[0]
    scale = uniform.rvs(loc=0, scale=0.1, size=1)[0]
    lambda_est[i, 0] = norm.rvs(loc=loc, scale=scale, size=1)[0]
sigma_est[0] = uniform.rvs(loc=0, scale=0.1, size=1)[0]
mu_est[0] = norm.rvs(loc=uniform.rvs(loc=0, scale=10, size=1)[0], scale=1, size=1)[0]
tau_est[0] = uniform.rvs(loc=0, scale=0.1, size=1)[0]

for runs in range(1, nruns):
    # estimate lambda: normal full conditional (sigma and tau are variances)
    for i in range(K):
        prec = 1/tau_est[runs-1] + n/sigma_est[runs-1]
        var = 1/prec
        mean = var * (mu_est[runs-1]/tau_est[runs-1] + y[i, :].mean()*n/sigma_est[runs-1])
        lambda_est[i, runs] = norm.rvs(loc=mean, scale=math.sqrt(var), size=1)[0]
    # estimate sigma: inverse-gamma full conditional
    sigma_sum_term = ((y - lambda_est[:, runs][:, None])**2).sum()
    sigma_est[runs] = invgamma.rvs(a=K*n/2, scale=sigma_sum_term/2)
    # estimate tau: inverse-gamma full conditional
    tau_sum_term = ((lambda_est[:, runs] - mu_est[runs-1])**2).sum()
    tau_est[runs] = invgamma.rvs(a=K/2, scale=tau_sum_term/2)
    # estimate mu: normal full conditional with variance tau/K (flat prior on mu)
    mu_est[runs] = norm.rvs(loc=lambda_est[:, runs].mean(), scale=math.sqrt(tau_est[runs]/K))

Practical guides

While everything covered so far is a manual implementation of MCMC algorithms, there are off-the-shelf packages that can help us solve complex hierarchical Bayesian models. Thanks to the Hamiltonian Monte Carlo algorithm and its derivative, the No-U-Turn Sampler (NUTS), we can now sample from almost any posterior distribution through a more uniform interface, accessible for example through the PyMC package; I recommend reading these series, which are super intuitive for getting the hang of it.
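As a rough sketch of what the hospital model above might look like in PyMC (the priors here are illustrative choices rather than the ones derived earlier, and note that pm.Normal's sigma is a standard deviation, not a variance):

import numpy as np
import pymc as pm

K, n = 100, 1000
y = np.random.default_rng(0).normal(5, 1, size=(K, n))  # toy data, same shape as above

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=10)        # shared mean across hospitals
    tau = pm.HalfNormal("tau", sigma=1)         # between-hospital spread
    lam = pm.Normal("lam", mu=mu, sigma=tau, shape=K)
    sigma = pm.HalfNormal("sigma", sigma=1)     # within-hospital noise
    pm.Normal("obs", mu=lam[:, None], sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000)          # NUTS by default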

Alternatively, although MCMC algorithms can give more accurate inference, they are time-consuming. So people often use variational inference (VI), which approximates the posterior with a simpler variational distribution so that any optimizer can be used to facilitate the process. In particular, stochastic variational inference (SVI) is super fast for large datasets because it subsamples the observations. I recommend checking the Pyro introduction tutorials to get the hang of performing quick VI in Python.
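A minimal Pyro sketch of SVI on a toy model, a single normal mean with subsampled observations (the variable names and toy data are just for illustration):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(data):
    mu = pyro.sample("mu", dist.Normal(0., 10.))  # prior on the mean
    with pyro.plate("data", len(data), subsample_size=64) as idx:
        pyro.sample("obs", dist.Normal(mu, 1.), obs=data[idx])  # subsampled likelihood

def guide(data):
    # variational distribution: a Normal with learnable loc and scale
    loc = pyro.param("loc", torch.tensor(0.))
    scale = pyro.param("scale", torch.tensor(1.), constraint=dist.constraints.positive)
    pyro.sample("mu", dist.Normal(loc, scale))

data = torch.randn(10000) + 3.0   # toy observations centered at 3
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(data)
print(pyro.param("loc"))          # should approach 3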

