Probabilistic programming allows for automatic Bayesian inference on user-defined probabilistic models. Recent advances in Markov chain Monte Carlo (MCMC) sampling allow inference on increasingly complex models. This class of MCMC, known as Hamiltonian Monte Carlo, requires gradient information which is often not readily available. PyMC3 is a new open source probabilistic programming framework written in Python that uses Theano to compute gradients via automatic differentiation as well as compile probabilistic programs on-the-fly to C for increased speed. Contrary to other probabilistic programming languages, PyMC3 allows model specification directly in Python code. The lack of a domain specific language allows for great flexibility and direct interaction with the model. This paper is a tutorial-style introduction to this software package.

Probabilistic programming (PP) allows for flexible specification and fitting of Bayesian statistical models. PyMC3 is a new, open-source PP framework with an intuitive and readable, yet powerful, syntax that is close to the natural syntax statisticians use to describe models. It features next-generation Markov chain Monte Carlo (MCMC) sampling algorithms such as the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014), a self-tuning variant of Hamiltonian Monte Carlo.

A number of probabilistic programming languages and systems have emerged over the past 2–3 decades. One of the earliest to enjoy widespread usage was the BUGS language (Spiegelhalter et al., 1995), which allows for the easy specification of Bayesian models and their fitting via Markov chain Monte Carlo methods.

Probabilistic programming in Python confers a number of advantages, including multi-platform compatibility, an expressive yet clean and readable syntax, powerful standard libraries, and extensive third-party packages for scientific computing.

Here, we present a primer on the use of PyMC3 for solving general Bayesian statistical inference and prediction problems. We will first describe basic PyMC3 usage, including installation, data creation, model definition, model fitting and posterior analysis. We will then employ two case studies to illustrate how to define and fit more sophisticated models. Finally we will show how PyMC3 can be extended and discuss more advanced features, such as the Generalized Linear Models (GLM) subpackage, custom distributions, custom transformations and alternative storage backends.

Running PyMC3 requires a working Python interpreter, either version 2.7 (or more recent) or 3.4 (or more recent). The latest release of PyMC3 can be installed from the command line using pip:

`pip install git+https://github.com/pymc-devs/pymc3`

PyMC3 depends on several third-party Python packages which will be automatically installed when installing via pip. The four required dependencies are: Theano, NumPy, SciPy, and Matplotlib. To take full advantage of PyMC3, the optional dependencies Patsy and Pandas should also be installed:

`pip install patsy pandas`

The source code for PyMC3 is hosted on GitHub at https://github.com/pymc-devs/pymc3.

To introduce model definition, fitting and posterior analysis, we first consider a simple Bayesian linear regression model with normal priors on the parameters. We are interested in predicting outcomes Y as normally-distributed observations with an expected value mu that is a linear function of two predictor variables, X_1 and X_2:

Y ~ Normal(mu, sigma^2)
mu = alpha + beta_1 X_1 + beta_2 X_2

where alpha is the intercept, beta_i is the coefficient for covariate X_i, and sigma represents the observation error. Since we are constructing a Bayesian model, the unknown variables must be assigned prior distributions. We choose zero-mean normal priors with a variance of 100 for both regression coefficients, and a half-normal distribution with a standard deviation of 1 for sigma.

We can simulate some data from this model using NumPy's `random` module:

```
import numpy as np
import matplotlib.pyplot as plt

# Initialize random number generator
np.random.seed(123)

# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variables
X1 = np.linspace(0, 1, size)
X2 = np.linspace(0, .2, size)

# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma
```

Specifying this model in PyMC3 is straightforward because the syntax is similar to the statistical notation. For the most part, each line of Python code corresponds to a line in the model notation above. First, we import the components we will need from PyMC3.

`from pymc3 import Model, Normal, HalfNormal`

The following code implements the model in PyMC3:

```
basic_model = Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = Normal('alpha', mu=0, sd=10)
    beta = Normal('beta', mu=0, sd=10, shape=2)
    sigma = HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
```

The first line,

`basic_model = Model()`

creates a new `Model` object, which is a container for the model's random variables. Following instantiation of the model, the subsequent specification of the model components is performed inside a `with` statement:

`with basic_model:`

This creates a context manager, with our `basic_model` as the context, that includes all statements until the indented block ends. This means all PyMC3 objects introduced in the indented code block below the `with` statement are added to the model behind the scenes.

The first three statements in the context manager create stochastic random variables:

```
alpha = Normal('alpha', mu=0, sd=10)
beta = Normal('beta', mu=0, sd=10, shape=2)
sigma = HalfNormal('sigma', sd=1)
```

These variables are stochastic because their values are partly determined by their parents in the dependency graph of random variables (which for priors are simple constants), and partly random, according to the specified probability distribution.

The first argument to each random variable constructor is the name of the variable, which should nearly always match the name of the Python variable being assigned to, since it is used to retrieve the variable from the model when summarizing output. The remaining required arguments are the parameters of the distribution; for `Normal`, these are the mean `mu` and the standard deviation `sd`.

The `beta` variable has an additional `shape` argument to denote it as a vector-valued parameter of size 2. The `shape` argument is available for all distributions and specifies the length or shape of the random variable; it is optional for scalar variables, since it defaults to a value of one.

Detailed notes about distributions, sampling methods and other PyMC3 functions are available via the `help` function:

`help(Normal)`

```
...
 |  tau : float
 |      Precision of the distribution, which corresponds to
 |      1/sigma^2 (tau > 0).
 |  sd : float
 |      Standard deviation of the distribution. Alternative
 |      parameterization.
 |
 |  .. note::
 |      - :math:`E(X) = \mu`
 |      - :math:`Var(X) = 1 / \tau`
```

Having defined the priors, the next statement creates the expected value `mu` of the outcomes, specifying the linear relationship:

`mu = alpha + beta[0]*X1 + beta[1]*X2`

This creates a deterministic random variable, which implies that its value is completely determined by its parents' values. That is, there is no uncertainty in the variable beyond that which is inherent in the parents' values.

PyMC3 random variables and data can be arbitrarily added, subtracted, divided, or multiplied together, as well as indexed (extracting a subset of values) to create new random variables. Many common mathematical functions like `sum`, `sin`, and `exp`, as well as linear algebra functions like `dot` (for inner product) and `inv` (for inverse), are also provided.

The final line of the model defines `Y_obs`, the sampling distribution of the outcomes in the dataset.

`Y_obs = Normal('Y_obs', mu=mu, sd=sigma, observed=Y)`

This is a special case of a stochastic variable that we call an observed stochastic, and it represents the data likelihood of the model. It is identical to a standard stochastic, except that its `observed` argument, which passes the data to the variable, indicates that the values for this variable were observed and should not be changed by any fitting algorithm applied to the model. The data can be passed in the form of either a numpy.ndarray or a pandas.DataFrame object.

Notice that, unlike the prior distributions, the parameters for the normal distribution of `Y_obs` are not fixed values, but rather are the deterministic object `mu` and the stochastic `sigma`. This creates parent-child relationships between the likelihood and these two variables.

Having completely specified our model, the next step is to obtain posterior estimates for the unknown variables in the model. Ideally, we could derive the posterior estimates analytically, but for most non-trivial models this is not feasible. We will consider two approaches, whose appropriateness depends on the structure of the model and the goals of the analysis: finding the maximum a posteriori (MAP) point using optimization methods, and computing summaries based on samples drawn from the posterior distribution using MCMC sampling methods.

The maximum a posteriori (MAP) estimate for a model is the mode of the posterior distribution, and is generally found using numerical optimization methods. This is often fast and easy to do, but only gives a point estimate for the parameters, which can be misleading if the mode is not representative of the distribution. PyMC3 provides this functionality with the `find_MAP` function.

Below we find the MAP for our original model. The MAP is returned as a parameter point, which is always represented by a Python dictionary mapping variable names to NumPy arrays of parameter values.

```
from pymc3 import find_MAP

map_estimate = find_MAP(model=basic_model)
print(map_estimate)
```

```
{'alpha': array(1.0136638069892534),
'beta': array([ 1.46791629, 0.29358326]),
'sigma_log': array(0.11928770010017063)}
```

By default, `find_MAP` uses the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization algorithm to find the maximum of the log-posterior, but it also allows selection of other optimization algorithms from the `scipy.optimize` module. For example, below we use Powell's method:

```
from scipy import optimize

map_estimate = find_MAP(model=basic_model, fmin=optimize.fmin_powell)
print(map_estimate)
```

```
{'alpha': array(1.0175522109423465),
'beta': array([ 1.51426782, 0.03520891]),
'sigma_log': array(0.11815106849951475)}
```

It is important to note that the MAP estimate is not always reasonable, especially if the mode is at an extreme. This can be a subtle issue; with high dimensional posteriors, one can have areas of extremely high density but low total probability because the volume is very small. This will often occur in hierarchical models with the variance parameter for the random effect. If the individual group means are all the same, the posterior will have near infinite density if the scale parameter for the group means is almost zero, even though the probability of such a small scale parameter will be small since the group means must be extremely close together.
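This density-versus-volume intuition can be checked numerically: in high dimensions, essentially none of the probability mass of a standard normal lies near its mode, even though the mode is the point of highest density. A quick NumPy sketch (illustrative only, not part of the model above):

```
import numpy as np

np.random.seed(0)
d = 100                                # dimensionality
samples = np.random.randn(10000, d)    # standard normal, mode at the origin

# Distance of each sample from the mode
dist = np.linalg.norm(samples, axis=1)

# Despite the mode having the highest density, virtually all mass
# concentrates in a thin shell of radius ~sqrt(d) = 10 around it,
# and essentially no samples fall near the mode itself.
print(dist.mean())        # close to 10
print((dist < 5).mean())  # fraction of samples near the mode: ~0
```

This is why a point of maximum density can be wholly unrepresentative of where the posterior mass actually lies.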

Also, most techniques for finding the MAP estimate only find a local optimum, which is often good enough, but can be misleading for multimodal posteriors.

Though finding the MAP is a fast and easy way of obtaining parameter estimates of well-behaved models, it is limited because there is no associated estimate of uncertainty produced with the MAP estimates. Instead, a simulation-based approach such as MCMC can be used to obtain a Markov chain of values that, given the satisfaction of certain conditions, are indistinguishable from samples from the posterior distribution.
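The idea can be illustrated with a bare-bones random-walk Metropolis sampler. This is a hand-rolled sketch targeting a N(3, 1) posterior, not PyMC3's implementation:

```
import math
import random

random.seed(42)

def logp(x):
    """Unnormalized log-density of a N(3, 1) target."""
    return -0.5 * (x - 3.0) ** 2

x, scale, samples = 0.0, 1.0, []
for i in range(20000):
    proposal = x + random.gauss(0, scale)  # symmetric random-walk proposal
    # Accept with probability min(1, p(proposal) / p(x))
    if math.log(random.random()) < logp(proposal) - logp(x):
        x = proposal
    samples.append(x)

posterior = samples[2000:]               # discard burn-in
mean = sum(posterior) / len(posterior)   # close to the true mean of 3
```

Unlike the MAP point, the retained samples characterize the full posterior: their spread estimates the posterior uncertainty, not just its mode.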

To conduct MCMC sampling to generate posterior samples in PyMC3, we specify a step method object that corresponds to a particular MCMC algorithm, such as Metropolis, Slice sampling, or the No-U-Turn Sampler (NUTS). PyMC3's `step_methods` submodule contains these samplers.

PyMC3 implements several standard sampling algorithms, such as adaptive Metropolis-Hastings and adaptive slice sampling, but PyMC3’s most capable step method is the No-U-Turn Sampler. NUTS is especially useful for sampling from models that have many continuous parameters, a situation where older MCMC algorithms work very slowly. It takes advantage of information about where regions of higher probability are, based on the gradient of the log posterior-density. This helps it achieve dramatically faster convergence on large problems than traditional sampling methods achieve. PyMC3 relies on Theano to analytically compute model gradients via automatic differentiation of the posterior density. NUTS also has several self-tuning strategies for adaptively setting the tunable parameters of Hamiltonian Monte Carlo. For random variables that are undifferentiable (namely, discrete variables) NUTS cannot be used, but it may still be used on the differentiable variables in a model that contains undifferentiable variables.
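How gradient information is used can be sketched with a single leapfrog trajectory of Hamiltonian Monte Carlo for a standard normal target. This is toy code for intuition, not PyMC3's NUTS implementation:

```
def grad_logp(x):
    # Gradient of log p(x) for a standard normal target
    return -x

def leapfrog(x, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p + 0.5 * step_size * grad_logp(x)      # half momentum step
    for _ in range(n_steps - 1):
        x = x + step_size * p                   # full position step
        p = p + step_size * grad_logp(x)        # full momentum step
    x = x + step_size * p
    p = p + 0.5 * step_size * grad_logp(x)      # final half momentum step
    return x, p

def hamiltonian(x, p):
    return 0.5 * x**2 + 0.5 * p**2              # potential + kinetic energy

x0, p0 = 1.0, 0.5
x1, p1 = leapfrog(x0, p0, step_size=0.1, n_steps=50)

# The trajectory travels far through parameter space while nearly
# conserving the Hamiltonian, so the proposal is almost always accepted.
energy_error = abs(hamiltonian(x1, p1) - hamiltonian(x0, p0))
```

This distant, high-acceptance proposal is what lets gradient-based samplers avoid the slow random-walk behavior of Metropolis-Hastings.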

NUTS requires a scaling matrix parameter, which is analogous to the variance parameter for the jump proposal distribution in Metropolis-Hastings, although NUTS uses it somewhat differently. The matrix gives an approximate shape of the posterior distribution, so that NUTS does not make jumps that are too large in some directions and too small in other directions. It is important to set this scaling parameter to a reasonable value to facilitate efficient sampling. This is especially true for models that have many unobserved stochastic random variables or models with highly non-normal posterior distributions. Poor scaling parameters will slow down NUTS significantly, sometimes almost stopping it completely. A reasonable starting point for sampling can also be important for efficient sampling, but not as often.

Fortunately, NUTS can often make good guesses for the scaling parameters. If you pass a point in parameter space (as a dictionary of variable names to parameter values, the same format as returned by `find_MAP`) to NUTS, it will look at the local curvature of the log posterior-density (the diagonal of the Hessian matrix) at that point to guess values for a good scaling vector, which often works well.
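The curvature-based guess can be illustrated numerically: the negative second derivative of the log-density at the mode is the precision of a local Gaussian approximation, and its inverse square root gives a natural per-parameter scale. A one-dimensional finite-difference sketch (illustrative, not PyMC3's actual code):

```
import math

def logp(x):
    # Log-density of N(0, sigma=2), up to an additive constant
    return -0.5 * x**2 / 4.0

mode, h = 0.0, 1e-4

# Central finite difference for the second derivative at the mode
d2 = (logp(mode + h) - 2 * logp(mode) + logp(mode - h)) / h**2

# Inverse square root of the curvature recovers the scale sigma = 2
scale = 1.0 / math.sqrt(-d2)
```

In practice PyMC3 obtains these second derivatives from Theano's automatic differentiation rather than finite differences, but the scaling logic is the same.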

Here, we will use NUTS to sample 2000 draws from the posterior using the MAP as the starting and scaling point. Sampling must also be performed inside the context of the model.

```
from pymc3 import NUTS, sample

with basic_model:
    # obtain starting values via MAP
    start = find_MAP(fmin=optimize.fmin_powell)

    # instantiate sampler
    step = NUTS(scaling=start)

    # draw 2000 posterior samples
    trace = sample(2000, step, start=start)
```

The `sample` function runs the step method(s) assigned (or passed) to it for the given number of iterations and returns a trace object containing the samples collected, in the order they were collected. The trace object can be queried like a dict mapping variable names to NumPy arrays; the first dimension of the array is the sampling index, and the later dimensions match the shape of the variable. We can see the last five values for the `alpha` variable as follows:

`trace['alpha'][-5:]`

`array([ 0.98134501, 1.04901676, 1.03638451, 0.88261935, 0.95910723])`

```
from pymc3 import traceplot
traceplot(trace)
```

The left column consists of a smoothed histogram (using kernel density estimation) of the marginal posteriors of each stochastic random variable, while the right column contains the samples of the Markov chain plotted in sequential order. The `beta` variable, being vector-valued, produces two histograms and two sample traces, corresponding to both predictor coefficients.

For a tabular summary, the `summary` function provides a text-based output of common posterior statistics:

```
from pymc3 import summary
summary(trace['alpha'])
```
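The kinds of quantities such a summary reports can be reproduced directly from the trace array with NumPy. Below is an illustrative sketch using a synthetic array in place of `trace['alpha']` (the synthetic trace is an assumption made so the example is self-contained):

```
import numpy as np

np.random.seed(1)
# Stand-in for trace['alpha']: 2,000 posterior draws
alpha_trace = np.random.normal(1.0, 0.2, size=2000)

# Posterior mean and standard deviation
mean = alpha_trace.mean()
sd = alpha_trace.std()

# Central 95% posterior interval from the empirical quantiles
lower, upper = np.percentile(alpha_trace, [2.5, 97.5])
```

This is only a sketch of the arithmetic involved; the actual `summary` output also includes Monte Carlo error and other diagnostics.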

We present a case study of stochastic volatility, time varying stock market volatility, to illustrate PyMC3’s capability for addressing more realistic problems. The distribution of market returns is highly non-normal, which makes sampling the volatilities significantly more difficult. This example has 400+ parameters so using older sampling algorithms like Metropolis-Hastings would be inefficient, generating highly auto-correlated samples with a low effective sample size. Instead, we use NUTS, which is dramatically more efficient.

Asset prices have time-varying volatility (variance of day-over-day returns). In some periods, returns are highly variable, while in others they are very stable. Stochastic volatility models address this with a latent volatility variable, which is modeled as a stochastic process. The following model is similar to the one described in the NUTS paper (Hoffman & Gelman, 2014):

sigma ~ exp(50)
nu ~ exp(.1)
s_i ~ Normal(s_{i-1}, sigma^-2)
log(r_i) ~ t(nu, 0, exp(-2 s_i))

Here, s_i are the individual daily log volatilities in the latent log volatility process.
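To make the generative story concrete, the latent process and returns can be simulated forward with NumPy. This is a sketch of the model's generative side only; the parameter values (`n`, `sigma`, `nu`) are arbitrary choices for illustration:

```
import numpy as np

np.random.seed(123)
n, sigma, nu = 400, 0.1, 10.0

# Latent log-volatility: a Gaussian random walk with innovation scale sigma
s = np.cumsum(np.random.normal(0, sigma, size=n))

# Returns: Student-t innovations scaled by the volatility.
# A precision of 1/exp(-2s) corresponds to a scale of exp(-s).
r = np.random.standard_t(nu, size=n) * np.exp(-s)
```

Periods where the random walk drifts downward produce visibly larger swings in `r`, which is exactly the clustering of volatility the model is designed to capture.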

Our data consist of daily returns of the S&P 500 during the 2008 financial crisis.

```
import pandas as pd

returns = pd.read_csv('data/SP500.csv', index_col=0, parse_dates=True)
```

A plot of the daily returns data is shown in the accompanying figure.

As with the linear regression example, implementing the model in PyMC3 mirrors its statistical specification. This model employs several new distributions: the `Exponential` distribution for the nu and sigma priors, the Student-t (`StudentT`) distribution for the distribution of returns, and the `GaussianRandomWalk` for the prior of the latent volatilities.

In PyMC3, variables with positive support like `Exponential` are transformed with a log transform, which makes sampling more robust. Behind the scenes, a variable in the unconstrained space (named '<variable-name>_log') is added to the model for sampling.
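The log transform is a standard change of variables: if y = log(x) for x ~ Exponential(1), the transformed density is p_y(y) = exp(y - e^y), which is defined on the whole real line and still integrates to one. A quick numerical check of this (illustrative only, not PyMC3's internal code):

```
import numpy as np

# Density of y = log(x) when x ~ Exponential(1):
# p_y(y) = p_x(exp(y)) * |d exp(y)/dy| = exp(-exp(y)) * exp(y)
y = np.linspace(-10, 10, 20001)
p_y = np.exp(y - np.exp(y))

# Trapezoid-rule integral over the real line is ~1, confirming
# the Jacobian correction gives a proper unconstrained density
total = np.sum((p_y[1:] + p_y[:-1]) / 2) * (y[1] - y[0])
```

Sampling in the unconstrained y-space avoids the hard boundary at x = 0, which is precisely why the transform makes sampling more robust.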

Although (unlike model specification in PyMC2) we do not typically provide starting points for variables at the model specification stage, it is possible to provide an initial value for any distribution (called a “test value” in Theano) using the `testval` argument. This overrides the default test value for the distribution (usually the mean, median or mode), and is most often useful if some values are illegal and we want to ensure a legal one is selected.

The vector of latent volatilities `s` is given a prior distribution by `GaussianRandomWalk`. As its name suggests, this is a vector-valued distribution where the values of the vector form a random normal walk of a length specified by the `shape` argument. The scale of the innovations of the random walk is specified in terms of the precision of the normally distributed innovations, and can be a scalar or vector.

```
from pymc3 import Exponential, StudentT, exp, Deterministic
from pymc3.distributions.timeseries import GaussianRandomWalk

with Model() as sp500_model:
    nu = Exponential('nu', 1./10, testval=5.)
    sigma = Exponential('sigma', 1./.02, testval=.1)

    s = GaussianRandomWalk('s', sigma**-2, shape=len(returns))

    volatility_process = Deterministic('volatility_process', exp(-2*s))

    r = StudentT('r', nu, lam=1/volatility_process, observed=returns['S&P500'])
```

Notice that we transform the log volatility process `s` into the volatility process by `exp(-2*s)`. Here, `exp` is a Theano function, rather than the corresponding function in NumPy; Theano provides a large subset of the mathematical functions that NumPy does.

Also note that we have declared the `Model` name `sp500_model` in the first occurrence of the context manager, rather than splitting it into two lines as we did for the first example.

Before we draw samples from the posterior, it is prudent to find a decent starting value, by which we mean a point of relatively high probability. For this model, the full maximum a posteriori (MAP) point over all variables is degenerate and has infinite density. We therefore find the MAP with respect only to the volatility process `s`, keeping `nu` and `sigma` constant at their default (test) values. We use the limited-memory BFGS (L-BFGS) optimizer, provided by the `scipy.optimize` package, as it is more efficient for high-dimensional functions; this model includes more than 400 parameters.

As a sampling strategy, we execute a short initial run to locate a volume of high probability, then start again at the new starting point to obtain a sample that can be used for inference.

```
import scipy

with sp500_model:
    start = find_MAP(vars=[s], fmin=scipy.optimize.fmin_l_bfgs_b)

    step = NUTS(scaling=start)
    trace = sample(100, step, progressbar=False)

    # Start next run at the last sampled position.
    step = NUTS(scaling=trace[-1], gamma=.25)
    trace = sample(2000, step, start=trace[-1], progressbar=False, njobs=2)
```

Notice that the call to `sample` includes an optional `njobs=2` argument, which enables the parallel sampling of two chains (assuming that we have two processors available).

We can check our samples by looking at the traceplot for `nu` and `sigma`:

`traceplot(trace, [nu, sigma]);`

Each plotted line represents a single independent chain sampled in parallel.

Finally, we plot the distribution of volatility paths by plotting many of our sampled volatility paths on the same graph. Each is rendered partially transparent (via the `alpha` argument in Matplotlib's `plot` function), so the regions where many paths overlap are shaded more darkly.

```
fig, ax = plt.subplots(figsize=(15, 8))
returns.plot(ax=ax)
ax.plot(returns.index, 1/np.exp(trace['s', ::30].T), 'r', alpha=.03)
ax.set(title='volatility_process', xlabel='time', ylabel='volatility')
ax.legend(['S&P500', 'stochastic volatility process'])
```

As you can see, the model correctly infers the increase in volatility during the 2008 financial crash.

It is worth emphasizing the complexity of this model, due to its high dimensionality and the dependency structure in the random walk distribution. NUTS as implemented in PyMC3, however, correctly infers the posterior distribution with ease.

This case study implements a change-point model for a time series of recorded coal mining disasters in the UK from 1851 to 1962 (Jarrett, 1979).

Our objective is to estimate when the change occurred, in the presence of missing data, using multiple step methods to allow us to fit a model that includes both discrete and continuous random variables.

```
disaster_data = np.ma.masked_values([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                                     3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                                     2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,
                                     1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                                     0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                                     3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                                     0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], value=-999)
year = np.arange(1851, 1962)

plt.plot(year, disaster_data, 'o', markersize=8)
plt.ylabel("Disaster count")
plt.xlabel("Year")
```

Counts of disasters in the time series are thought to follow a Poisson process, with a relatively large rate parameter in the early part of the time series and a smaller rate in the later part. The Bayesian approach to such a problem is to treat the change point as an unknown quantity in the model and assign it a prior distribution, which we update to a posterior using the evidence in the dataset.
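The generative assumption behind this model can be sketched by simulating such a series with NumPy. The rates and switchpoint below are hypothetical values chosen only for illustration:

```
import numpy as np

np.random.seed(0)
year = np.arange(1851, 1962)

# Hypothetical ground truth for the simulation
true_switchpoint = 1890
early_rate, late_rate = 3.0, 1.0

# Poisson counts with a higher rate before the switchpoint
rate = np.where(year < true_switchpoint, early_rate, late_rate)
simulated = np.random.poisson(rate)
```

Fitting the change-point model to such simulated data should recover a posterior for the switchpoint concentrated around 1890, which is a useful sanity check before applying it to the real disasters data.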

In our model:

D_t: the number of disasters in year t

r_t: the rate parameter of the Poisson distribution of disasters in year t

s: the year in which the rate parameter changes (the switchpoint)

e: the rate parameter before the switchpoint s

l: the rate parameter after the switchpoint s

t_l, t_h: the lower and upper boundaries of year t

```
from pymc3 import DiscreteUniform, Poisson, switch

with Model() as disaster_model:
    switchpoint = DiscreteUniform('switchpoint', lower=year.min(),
                                  upper=year.max(), testval=1900)

    # Priors for pre- and post-switch disaster rates
    early_rate = Exponential('early_rate', 1)
    late_rate = Exponential('late_rate', 1)

    # Allocate appropriate Poisson rates to years before and after current
    rate = switch(switchpoint >= year, early_rate, late_rate)

    disasters = Poisson('disasters', rate, observed=disaster_data)
```

This model introduces discrete variables with the Poisson likelihood and a discrete-uniform prior on the change-point `switchpoint`. The rate for a given year is then defined in terms of the switchpoint:

`rate = switch(switchpoint >= year, early_rate, late_rate)`

The conditional statement is realized using the Theano function `switch`, which uses the first argument to select either of the next two arguments.
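Theano's `switch` behaves much like NumPy's `where` here: the elementwise condition `switchpoint >= year` broadcasts over the `year` vector, selecting the early rate for years up to the switchpoint and the late rate afterwards. A NumPy analogue of the rate computation:

```
import numpy as np

year = np.arange(1851, 1962)
switchpoint, early_rate, late_rate = 1900, 3.0, 1.0

# Elementwise: early_rate where switchpoint >= year, late_rate otherwise
rate = np.where(switchpoint >= year, early_rate, late_rate)
```

The result is a rate vector with one entry per year, which is exactly the shape the Poisson likelihood over the yearly counts expects.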

Missing values are handled concisely by passing a `MaskedArray` or a pandas.DataFrame with NaN values to the `observed` argument when creating an observed stochastic random variable. Behind the scenes, another random variable, `disasters.missing_values`, is created to model the missing values.
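A smaller example of the masked-array mechanism, using a hypothetical six-element series: `np.ma.masked_values` flags the sentinel entries, and it is those masked positions that PyMC3 treats as unobserved random variables to be imputed.

```
import numpy as np

# -999 marks missing observations, as in the disasters data
data = np.ma.masked_values([4, 5, -999, 1, -999, 3], value=-999)

print(data.mask.sum())    # number of missing entries: 2
print(data.compressed())  # the observed values only: [4 5 1 3]
```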

Unfortunately, because they are discrete variables and thus have no meaningful gradient, we cannot use NUTS for sampling either `switchpoint` or the missing disaster observations. Instead, we sample them using a `Metropolis` step method, which implements adaptive Metropolis-Hastings and is designed to handle discrete values.

Here, the NUTS step method is assigned to the continuous rate parameters, while the discrete `switchpoint` and the missing values are handled by Metropolis. We sample with both samplers at once by passing them to the `sample` function in a list; each new sample is generated by applying `step1` and then `step2`.

```
from pymc3 import Metropolis

with disaster_model:
    step1 = NUTS([early_rate, late_rate])
    step2 = Metropolis([switchpoint, disasters.missing_values[0]])

    trace = sample(10000, step=[step1, step2])
```

In the trace plot we can see that there is about a ten-year span that is plausible for a significant change in safety, but a five-year span that contains most of the probability mass. The distribution is jagged because of the discrete, jumpy relationship between the year switchpoint and the likelihood, not because of sampling error.

Due to its reliance on Theano, PyMC3 provides many mathematical functions and operators for transforming random variables into new random variables. However, the library of functions in Theano is not exhaustive, therefore PyMC3 provides functionality for creating arbitrary Theano functions in pure Python, and including these functions in PyMC3 models. This is supported with the `as_op` function decorator:

```
import theano.tensor as T
from theano.compile.ops import as_op

@as_op(itypes=[T.lscalar], otypes=[T.lscalar])
def crazy_modulo3(value):
    if value > 0:
        return value % 3
    else:
        return (-value + 1) % 3

with Model() as model_deterministic:
    a = Poisson('a', 1)
    b = crazy_modulo3(a)
```

Theano requires the types of the inputs and outputs of a function to be declared, which are specified for `as_op` by `itypes` for inputs and `otypes` for outputs. An important drawback of this approach is that it is not possible for Theano to inspect these functions in order to compute the gradient required for the Hamiltonian-based samplers. Therefore, it is not possible to use the HMC or NUTS samplers for a model that uses such an operator.

The library of statistical distributions in PyMC3, though large, is not exhaustive, but PyMC3 allows for the creation of user-defined probability distributions. For simple statistical distributions, the `DensityDist` function takes as an argument any function that calculates a log-probability log(p(x)).

The logarithms of these density functions can be specified as lambda functions passed to `DensityDist`:

```
import theano.tensor as T
from pymc3 import DensityDist, Uniform

with Model() as model:
    alpha = Uniform('intercept', -100, 100)

    # Create custom densities
    beta = DensityDist('beta', lambda value: -1.5 * T.log(1 + value**2), testval=0)
    eps = DensityDist('eps', lambda value: -T.log(T.abs_(value)), testval=1)

    # Create likelihood
    like = Normal('y_est', mu=alpha + beta * X, sd=eps, observed=Y)
```

For more complex distributions, one can create a subclass of `Continuous` or `Discrete` and provide the custom `logp` function, as required.

Implementing the `beta` variable above as a `Continuous` subclass is shown below, along with an associated log-probability function implemented with the `as_op` decorator.

```
from pymc3.distributions import Continuous

class Beta(Continuous):
    def __init__(self, mu, *args, **kwargs):
        super(Beta, self).__init__(*args, **kwargs)
        self.mu = mu
        self.mode = mu

    def logp(self, value):
        mu = self.mu
        return beta_logp(value - mu)

@as_op(itypes=[T.dscalar], otypes=[T.dscalar])
def beta_logp(value):
    return -1.5 * np.log(1 + (value)**2)

with Model() as model:
    beta = Beta('slope', mu=0, testval=0)
```

The generalized linear model (GLM) is a class of flexible models that is widely used to estimate regression relationships between a single outcome variable and one or multiple predictors. Because these models are so common, PyMC3 offers a `glm` submodule that allows flexible creation of simple GLMs with an intuitive R-like syntax, implemented via the patsy module.

The `glm` submodule requires data to be included as a pandas DataFrame. Hence, we convert the data simulated in the linear regression example above:

```
# Convert X and Y to a pandas DataFrame
import pandas

df = pandas.DataFrame({'x1': X1, 'x2': X2, 'y': Y})
```

The model can then be very concisely specified in one line of code.

```
from pymc3.glm import glm

with Model() as model_glm:
    glm('y ~ x1 + x2', df)
```

The error distribution, if not specified via the `family` argument, is assumed to be normal. In the case of logistic regression, this can be modified by passing in a `Binomial` family object:

```
from pymc3.glm.families import Binomial

df_logistic = pandas.DataFrame({'x1': X1, 'x2': X2, 'y': Y > 0})

with Model() as model_glm_logistic:
    glm('y ~ x1 + x2', df_logistic, family=Binomial())
```
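Behind the scenes, patsy converts a formula like 'y ~ x1 + x2' into a design matrix with an intercept column. A rough NumPy equivalent of that construction (illustrative only, not patsy's actual code path; the data are re-simulated here, with a randomized `x2`, so the sketch is self-contained):

```
import numpy as np

np.random.seed(123)
size = 100
x1 = np.linspace(0, 1, size)
x2 = np.random.randn(size) * 0.2
y = 1 + 1.0 * x1 + 2.5 * x2 + np.random.randn(size)

# 'y ~ x1 + x2' corresponds to the columns [1, x1, x2]
design = np.column_stack([np.ones(size), x1, x2])

# Ordinary least squares on the design matrix as a sanity check;
# the GLM machinery fits coefficients for these same columns
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```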

Models specified via `glm` behave like standard PyMC3 models and can be fit with the usual step methods. PyMC3 also supports different storage backends for MCMC samples, including in-memory (the default), text file, and SQLite. If the sampled trace would be too large to hold in memory, we can use the SQLite backend:

```
from pymc3.backends import SQLite

with model_glm_logistic:
    backend = SQLite('logistic_trace.sqlite')
    trace = sample(5000, Metropolis(), trace=backend)
```

A secondary advantage to using an on-disk backend is the portability of model output, as the stored trace can later (e.g., in another session) be re-loaded using the `load` function:

```
from pymc3.backends.sqlite import load

with basic_model:
    trace_loaded = load('logistic_trace.sqlite')
```

Probabilistic programming is an emerging paradigm in statistical learning, of which Bayesian modeling is an important sub-discipline. The signature characteristics of probabilistic programming, specifying variables as probability distributions and conditioning variables on other variables and on observations, make it a powerful tool for building models in a variety of settings and over a range of model complexity. Accompanying the rise of probabilistic programming has been a burst of innovation in fitting methods for Bayesian models that represent notable improvements over existing MCMC methods. Yet, despite this expansion, there are few software packages available that have kept pace with the methodological innovation, and still fewer that allow non-expert users to implement models.

PyMC3 provides a probabilistic programming platform for quantitative researchers to implement statistical models flexibly and succinctly. A large library of statistical distributions and several pre-defined fitting algorithms allow users to focus on the scientific problem at hand, rather than the implementation details of Bayesian modeling. The choice of Python as a development language, rather than a domain-specific language, means that PyMC3 users are able to work interactively to build models, introspect model objects, and debug or profile their work, using a dynamic, high-level programming language that is easy to learn. The modular, object-oriented design of PyMC3 means that adding new fitting algorithms or other features is straightforward. In addition, PyMC3 comes with several features not found in most other packages, most notably Hamiltonian-based samplers, as well as automatic transforms of constrained random variables, a feature otherwise offered only by Stan. Unlike Stan, however, PyMC3 supports discrete variables as well as non-gradient-based sampling algorithms such as Metropolis-Hastings and slice sampling.

Development of PyMC3 is an ongoing effort, and several features are planned for future versions. Most notably, variational inference techniques are often more efficient than MCMC sampling, at the cost of generalizability. More recently, however, black-box variational inference algorithms have been developed, such as automatic differentiation variational inference (ADVI) (Kucukelbir et al., 2015), which we plan to implement in PyMC3.

Thomas V. Wiecki is an employee of Quantopian Inc. John Salvatier is an employee of AI Impacts.

The following information was supplied regarding data availability: