Getting Started¶
Installation¶
Install package CmdStanPy¶
CmdStanPy is a pure-Python package which can be installed from PyPI
pip install --upgrade cmdstanpy
or from GitHub
pip install -e git+https://github.com/stan-dev/cmdstanpy#egg=cmdstanpy
To install CmdStanPy with all the optional packages (ujson; json processing, tqdm; progress bar)
pip install --upgrade cmdstanpy[all]
Note for PyStan users: PyStan and CmdStanPy should be installed in separate environments. If you already have PyStan installed, you should take care to install CmdStanPy in its own virtual environment.
User can install optional packages with pip with the CmdStanPy installation
pip install --upgrade cmdstanpy[all]
The optional packages are
ujson
which provides faster IOtqdm
which displays a progress during sampling
To install these manually
pip install ujson
pip install tqdm
Install CmdStan¶
CmdStanPy requires a local install of CmdStan.
Prerequisites¶
CmdStanPy requires an installed C++ toolchain consisting of a modern C++ compiler and the GNU-Make utility.
Function install_cmdstan
¶
CmdStanPy provides the function install_cmdstan
which
downloads CmdStan from GitHub and builds the CmdStan utilities.
It can be can be called from within Python or from the command line.
By default it installs the latest version of CmdStan into a directory named
.cmdstanpy
in your $HOME
directory:
- From Python
import cmdstanpy
cmdstanpy.install_cmdstan()
- From the command line on Linux or MacOSX
install_cmdstan
ls -F ~/.cmdstanpy
- On Windows
python -m cmdstanpy.install_cmdstan
dir "%HOME%/.cmdstanpy"
The named arguments: -d <directory> and -v <version> can be used to override these defaults:
install_cmdstan -d my_local_cmdstan -v 2.20.0
ls -F my_local_cmdstan
Specifying CmdStan installation location¶
The default for the CmdStan installation location
is a directory named .cmdstanpy
in your $HOME
directory.
If you have installed CmdStan in a different directory,
then you can set the environment variable CMDSTAN
to this
location and it will be picked up by CmdStanPy:
export CMDSTAN='/path/to/cmdstan-2.20.0'
The CmdStanPy commands cmdstan_path
and set_cmdstan_path
get and set this environment variable:
from cmdstanpy import cmdstan_path, set_cmdstan_path
oldpath = cmdstan_path()
set_cmdstan_path(os.path.join('path','to','cmdstan'))
newpath = cmdstan_path()
Specifying a custom Make tool¶
To use custom make
-tool use set_make_env
function.
from cmdstanpy import set_make_env
set_make_env("mingw32-make.exe") # On Windows with mingw32-make
Installion via the command line¶
If you with to install CmdStan yourself, follow the instructions in the CmdStan User’s Guide.
“Hello, World”¶
Bayesian estimation via Stan’s HMC-NUTS sampler¶
To exercise the essential functions of CmdStanPy we show how to run
Stan’s HMC-NUTS sampler to estimate the posterior probability
of the model parameters conditioned on the data.
Do do this we use the example Stan model bernoulli.stan
and corresponding dataset bernoulli.data.json
which are
distributed with CmdStan.
This is a simple model for binary data: given a set of N observations of i.i.d. binary data y[1] … y[N], it calculates the Bernoulli chance-of-success theta.
data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
theta ~ beta(1,1); // uniform prior on interval 0,1
y ~ bernoulli(theta);
}
The data file specifies the number of observations and their values.
{
"N" : 10,
"y" : [0,1,0,0,0,0,0,0,0,1]
}
Specify a Stan model¶
The: CmdStanModel class manages the Stan program and its corresponding compiled executable. It provides properties and functions to inspect the model code and filepaths. By default, the Stan program is compiled on instantiation.
import os
from cmdstanpy import cmdstan_path, CmdStanModel
bernoulli_stan = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
bernoulli_model = CmdStanModel(stan_file=bernoulli_stan)
bernoulli_model.name
bernoulli_model.stan_file
bernoulli_model.exe_file
bernoulli_model.code()
Run the HMC-NUTS sampler¶
The CmdStanModel method sample
is used to do Bayesian inference
over the model conditioned on data using using Hamiltonian Monte Carlo
(HMC) sampling. It runs Stan’s HMC-NUTS sampler on the model and data and
returns a CmdStanMCMC object.
bernoulli_data = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.data.json')
bern_fit = bernoulli_model.sample(data=bernoulli_data, output_dir='.')
By default, the sample
command runs 4 sampler chains.
This is a set of per-chain
Stan CSV files
The filenames follow the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’
plus the file suffix ‘.csv’.
There is also a correspondingly named file with suffix ‘.txt’
which contains all messages written to the console.
If the output_dir
argument is omitted, the output files are written
to a temporary directory which is deleted when the current Python session is terminated.
Access the sample¶
The CmdStanMCMC object stores the CmdStan config information and the names of the the per-chain output files. It manages and retrieves the sampler outputs as Python objects.
print(bern_fit)
The resulting set of draws produced by the sampler is lazily instantiated
as a 3-D numpy.ndarray
(i.e., a multi-dimensional array)
over all draws from all chains arranged as draws X chains X columns.
Instantiation happens the first time that any of the information
in the posterior is accesed via properties:
draws
, metric
, or stepsize
are accessed.
At this point the stan-csv output files are read into memory.
For large files this may take several seconds; for the example
dataset, this should take less than a second.
bern_fit.draws().shape
Python’s index slicing operations can be used to access the information by chain. For example, to select all draws and all output columns from the first chain, we specify the chain index (2nd index dimension). As arrays indexing starts at 0, the index ‘0’ corresponds to the first chain in the CmdStanMCMC:
chain_1 = bern_fit.draws()[:,0,:]
chain_1.shape # (1000, 8)
chain_1[0] # first draw:
# array([-7.99462 , 0.578072 , 0.955103 , 2. , 7. ,
# 0. , 9.44788 , 0.0934208])
To work with the draws from all chains for a parameter or quantity of interest
in the model, use the stan_variable
method to obtains
a numpy.ndarray which contains the set of draws in the sample for the named Stan program variable
by flattening the draws by chains into a single column:
bern_fit.stan_variable('theta')
Summarize or save the results¶
CmdStan is distributed with a posterior analysis utility
stansummary
that reads the outputs of all chains and computes summary statistics
for all sampler and model parameters and quantities of interest.
The CmdStanMCMC method summary
runs this utility and returns
summaries of the total joint log-probability density lp__ plus
all model parameters and quantities of interest in a pandas.DataFrame:
bern_fit.summary()
CmdStan is distributed with a second posterior analysis utility
diagnose
which analyzes the per-draw sampler parameters across all chains
looking for potential problems which indicate that the sample
isn’t a representative sample from the posterior.
The diagnose
method runs this utility and prints the output to the console.
bern_fit.diagnose()
The save_csvfiles
function moves the CmdStan csv output files
to a specified directory.
bern_fit.save_csvfiles(dir='some/path')