Getting Started

Installation

Install package CmdStanPy

CmdStanPy is a pure-Python package which can be installed from PyPI:

pip install --upgrade cmdstanpy

or from GitHub

pip install -e git+https://github.com/stan-dev/cmdstanpy#egg=cmdstanpy

To install CmdStanPy together with all of its optional packages (ujson for faster JSON processing, tqdm for a progress bar):

pip install --upgrade cmdstanpy[all]

Note for PyStan users: PyStan and CmdStanPy should be installed in separate environments. If you already have PyStan installed, you should take care to install CmdStanPy in its own virtual environment.

The optional packages are

  • ujson which provides faster IO
  • tqdm which displays a progress bar during sampling

To install these manually

pip install ujson
pip install tqdm

Install CmdStan

CmdStanPy requires a local install of CmdStan.

Prerequisites

CmdStanPy requires an installed C++ toolchain.
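
Before installing CmdStan, it can help to confirm that a compiler and a make tool are on the PATH. The sketch below is not part of CmdStanPy: the helper name find_toolchain and the list of candidate program names are assumptions for illustration, and the tools actually required depend on your platform.

```python
import shutil

# Candidate program names are assumptions; adjust them for your platform.
CANDIDATES = ("g++", "clang++", "make", "mingw32-make")


def find_toolchain(names=CANDIDATES):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_toolchain().items():
        print(f"{name:14s} {path or 'not found'}")
```

If neither compiler is found, install a C++ toolchain before running install_cmdstan.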

Function install_cmdstan

CmdStanPy provides the function install_cmdstan, which downloads CmdStan from GitHub and builds the CmdStan utilities. It can be called from within Python or from the command line. By default it installs the latest version of CmdStan into a directory named .cmdstanpy in your $HOME directory:

  • From Python
import cmdstanpy
cmdstanpy.install_cmdstan()
  • From the command line on Linux or macOS
install_cmdstan
ls -F ~/.cmdstanpy
  • On Windows
python -m cmdstanpy.install_cmdstan
dir "%HOME%/.cmdstanpy"

The named arguments -d <directory> and -v <version> can be used to override these defaults:

install_cmdstan -d my_local_cmdstan -v 2.20.0
ls -F my_local_cmdstan

Specifying CmdStan installation location

The default for the CmdStan installation location is a directory named .cmdstanpy in your $HOME directory.

If you have installed CmdStan in a different directory, then you can set the environment variable CMDSTAN to this location and it will be picked up by CmdStanPy:

export CMDSTAN='/path/to/cmdstan-2.20.0'
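
The precedence described above (the CMDSTAN environment variable first, then the default directory) can be sketched in a few lines. This is an illustration only: resolve_cmdstan_path is a hypothetical helper, not a CmdStanPy function, and the library's real lookup may do more, for example selecting a versioned subdirectory.

```python
import os


def resolve_cmdstan_path(environ=None):
    """Return CMDSTAN if it is set and non-empty, else the default location."""
    env = os.environ if environ is None else environ
    custom = env.get("CMDSTAN")
    if custom:
        return custom
    # Default described above: a directory named .cmdstanpy under $HOME.
    return os.path.join(os.path.expanduser("~"), ".cmdstanpy")


# Pass a plain dict to try the logic without touching the real environment.
print(resolve_cmdstan_path({"CMDSTAN": "/path/to/cmdstan-2.20.0"}))
print(resolve_cmdstan_path({}))
```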

The CmdStanPy commands cmdstan_path and set_cmdstan_path get and set this environment variable:

import os
from cmdstanpy import cmdstan_path, set_cmdstan_path

oldpath = cmdstan_path()
set_cmdstan_path(os.path.join('path', 'to', 'cmdstan'))
newpath = cmdstan_path()

Specifying a custom make tool

To use a custom make tool, call the set_make_env function.

from cmdstanpy import set_make_env
set_make_env("mingw32-make.exe") # On Windows with mingw32-make

“Hello, World”

Bayesian estimation via Stan’s HMC-NUTS sampler

To exercise the essential functions of CmdStanPy, we will compile the example Stan model bernoulli.stan and fit it to the example data bernoulli.data.json, both of which are distributed with CmdStan. The fit uses Stan’s HMC-NUTS sampler to estimate the posterior distribution of the model parameters conditioned on the data.

Specify a Stan model

The CmdStanModel class specifies the Stan program and its corresponding compiled executable. By default, the Stan program is compiled on instantiation.

import os
from cmdstanpy import cmdstan_path, CmdStanModel

bernoulli_stan = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
bernoulli_model = CmdStanModel(stan_file=bernoulli_stan)

The CmdStanModel class provides properties and functions to inspect the model code and filepaths.

bernoulli_model.name
bernoulli_model.stan_file
bernoulli_model.exe_file
bernoulli_model.code()

Run the HMC-NUTS sampler

The CmdStanModel method sample runs the Stan HMC-NUTS sampler on the model and data and returns a CmdStanMCMC object:

bernoulli_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }
bern_fit = bernoulli_model.sample(data=bernoulli_data, csv_basename='./bern')

By default, the sample command runs 4 sampler chains. The csv_basename argument specifies the path and filename prefix of the sampler output files. If no output file path is specified, the sampler outputs are written to a temporary directory which is deleted when the current Python session is terminated.
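
For this small dataset the posterior is available in closed form, which gives a handy sanity check on the sampler output. The sketch below assumes that the example program places a uniform Beta(1, 1) prior on theta and a Bernoulli likelihood on y. That prior is an assumption about bernoulli.stan, not something stated above. Under that assumption the posterior is Beta(1 + successes, 1 + failures).

```python
from fractions import Fraction

y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Assumed prior: theta ~ Beta(1, 1), i.e. uniform on [0, 1].
prior_a, prior_b = 1, 1

successes = sum(y)
failures = len(y) - successes

# Conjugate update: Beta(a + successes, b + failures).
post_a = prior_a + successes
post_b = prior_b + failures

posterior_mean = Fraction(post_a, post_a + post_b)
print(f"posterior: Beta({post_a}, {post_b}), mean = {float(posterior_mean):.3f}")
```

If the assumption holds, the mean of theta estimated from the draws should be close to this value.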

Access the sample

The sample command returns a CmdStanMCMC object which provides methods to retrieve the sampler outputs, the arguments used to run CmdStan, the names of the per-chain stan-csv output files, and the per-chain console message files.

print(bern_fit)

The resulting sample from the posterior is lazily instantiated the first time that any of the properties sample, metric, or stepsize are accessed. At this point the stan-csv output files are read into memory. For large files this may take several seconds; for the example dataset, this should take less than a second. The sample property of the CmdStanMCMC object is a 3-D numpy.ndarray (i.e., a multi-dimensional array) which contains the set of all draws from all chains arranged as dimensions: (draws, chains, columns).

bern_fit.sample.shape

The get_drawset method returns the draws from all chains as a pandas.DataFrame, with one draw per row and one column per model parameter, transformed parameter, and generated quantity variable. The params argument restricts the DataFrame columns to just the specified parameter names.

bern_fit.get_drawset(params=['theta'])

Python’s index slicing operations can be used to access the information by chain. For example, to select all draws and all output columns from the first chain, we specify the chain index (the second index dimension). Because array indexing starts at 0, the index ‘0’ corresponds to the first chain in the CmdStanMCMC object:

chain_1 = bern_fit.sample[:,0,:]
chain_1.shape       # (1000, 8)
chain_1[0]          # sample first draw:
                    # array([-7.99462  ,  0.578072 ,  0.955103 ,  2.       ,  7.       ,
                    # 0.       ,  9.44788  ,  0.0934208])
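
To make the (draws, chains, columns) layout concrete without running the sampler, the sketch below uses nested Python lists as a stand-in for the numpy array. The toy dimensions and the encoded values are made up for illustration; only the index order follows the description above.

```python
# Toy dimensions; a real run would have e.g. 1000 draws, 4 chains, 8 columns.
n_draws, n_chains, n_cols = 3, 2, 4

# sample[d][c][k] holds column k of draw d from chain c.
# Each value encodes its own indices so the layout is easy to follow.
sample = [
    [[100 * d + 10 * c + k for k in range(n_cols)] for c in range(n_chains)]
    for d in range(n_draws)
]

# Equivalent of sample[:, 0, :] on a numpy array: every draw of the first chain.
chain_1 = [draw[0] for draw in sample]

print(len(chain_1), len(chain_1[0]))  # number of draws and columns in chain 1
print(chain_1[0])                     # first draw of the first chain
```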

Summarize or save the results

CmdStan is distributed with a posterior analysis utility stansummary that reads the outputs of all chains and computes summary statistics on the model fit for all parameters. The CmdStanMCMC method summary runs the CmdStan stansummary utility and returns the output as a pandas.DataFrame:

bern_fit.summary()

CmdStan is distributed with a second posterior analysis utility diagnose that reads the outputs of all chains and checks for the following potential problems:

  • Transitions that hit the maximum treedepth
  • Divergent transitions
  • Low E-BFMI values (the sampler explores the energy distribution inefficiently)
  • Low effective sample sizes
  • High R-hat values

The CmdStanMCMC method diagnose runs the CmdStan diagnose utility and prints the output to the console.

bern_fit.diagnose()
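
To give a feel for one of the checks listed above, here is a minimal sketch of the classic (non-split) R-hat statistic, which compares the variance between chains with the variance within chains. It is an illustration under stated assumptions, not the computation the diagnose utility performs: CmdStan's own utilities may use a refined variant, and the function name rhat and the toy chains are made up for this example.

```python
import math
from statistics import mean, variance


def rhat(chains):
    """Classic (non-split) potential scale reduction for equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    # W: average within-chain sample variance.
    w = mean(variance(c) for c in chains)
    # B: n times the sample variance of the chain means.
    b = n * variance([mean(c) for c in chains])
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)


# Two chains that agree, and two chains stuck in different regions.
agree = [[0.1, 0.3, 0.2, 0.4], [0.2, 0.4, 0.1, 0.3]]
apart = [[0.1, 0.3, 0.2, 0.4], [0.7, 0.9, 0.8, 1.0]]
print(rhat(agree), rhat(apart))
```

Values well above 1 suggest that the chains have not mixed, which is what the "High R-hat values" check is meant to flag.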

By default, CmdStanPy saves all CmdStan outputs in a temporary directory which is deleted when the Python session exits. In particular, unless the csv_basename argument to the sample function is explicitly specified, all the csv output files are written into this temporary directory and then deleted when the session exits. The save_csvfiles function moves the CmdStan csv output files to the specified location, renaming them using the specified basename.

bern_fit.save_csvfiles(dir='some/path', basename='descriptive-name')
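
The move-and-rename behaviour described above can be sketched with the standard library alone. This is not CmdStanPy's implementation: move_csv_files is a hypothetical helper, and the '<basename>-<chain>.csv' naming scheme is an assumption chosen for illustration.

```python
import os
import shutil
import tempfile


def move_csv_files(csv_paths, target_dir, basename):
    """Move each file into target_dir as '<basename>-<chain>.csv' (assumed scheme)."""
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for chain_id, src in enumerate(csv_paths, start=1):
        dst = os.path.join(target_dir, f"{basename}-{chain_id}.csv")
        shutil.move(src, dst)
        moved.append(dst)
    return moved


# Demonstration with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    tmp_out = os.path.join(workdir, "tmp")
    os.makedirs(tmp_out)
    sources = []
    for i in range(1, 5):
        path = os.path.join(tmp_out, f"output-{i}.csv")
        open(path, "w").close()
        sources.append(path)
    saved = move_csv_files(
        sources, os.path.join(workdir, "some", "path"), "descriptive-name"
    )
    print([os.path.basename(p) for p in saved])
```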