“Hello, World!”#
Fitting a Stan model using the NUTS-HMC sampler#
In order to verify the installation and also to demonstrate
the CmdStanPy workflow, we use CmdStanPy to fit the
the example Stan model bernoulli.stan
to the dataset bernoulli.data.json
.
The bernoulli.stan
is a Hello, World!
program which illustrates the basic syntax of the Stan language.
It allows the user to verify that CmdStanPy, CmdStan,
the StanC compiler, and the C++ toolchain have all been properly installed.
For substantive example models and guidance on coding statistical models in Stan, see the Stan User’s Guide.
The Stan model#
The model bernoulli.stan
is a trivial model:
given a set of N observations of i.i.d. binary data
y[1] … y[N], it calculates the Bernoulli chance-of-success theta.
data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
theta ~ beta(1,1); // uniform prior on interval 0,1
y ~ bernoulli(theta);
}
The CmdStanModel
class manages the Stan program and its corresponding compiled executable.
It provides properties and functions to inspect the model code and filepaths.
A CmdStanModel can be instantiated from a Stan file or its corresponding compiled executable file.
# import packages
In [1]: import os
In [2]: from cmdstanpy import CmdStanModel
# specify Stan program file
In [3]: stan_file = os.path.join('examples', 'bernoulli.stan')
# instantiate the model object
In [4]: model = CmdStanModel(stan_file=stan_file)
INFO:cmdstanpy:compiling stan file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.4/docsrc/examples/bernoulli.stan to exe file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.4/docsrc/examples/bernoulli
INFO:cmdstanpy:compiled model executable: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.4/docsrc/examples/bernoulli
# inspect model object
In [5]: print(model)
CmdStanModel: name=bernoulli
stan_file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.4/docsrc/examples/bernoulli.stan
exe_file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.4/docsrc/examples/bernoulli
compiler_options=stanc_options={}, cpp_options={}
# inspect compiled model
In [6]: print(model.exe_info())
{'stan_version_major': '2', 'stan_version_minor': '29', 'stan_version_patch': '2', 'STAN_THREADS': 'false', 'STAN_MPI': 'false', 'STAN_OPENCL': 'false', 'STAN_NO_RANGE_CHECKS': 'false', 'STAN_CPP_OPTIMS': 'false'}
Data inputs#
CmdStanPy accepts input data either as a Python dictionary which maps data variable names to values, or as the corresponding JSON file.
The bernoulli model requires two inputs: the number of observations N, and an N-length vector y of binary outcomes. The data file bernoulli.data.json contains the following inputs:
{
"N" : 10,
"y" : [0,1,0,0,0,0,0,0,0,1]
}
Fitting the model#
The sample()
method is used to do Bayesian inference
over the model conditioned on data using using Hamiltonian Monte Carlo
(HMC) sampling. It runs Stan’s HMC-NUTS sampler on the model and data and
returns a CmdStanMCMC
object. The data can be specified
either as a filepath or a Python dictionary; in this example, we use the
example datafile bernoulli.data.json:
By default, the sample method runs 4 sampler chains.
# specify data file
In [7]: data_file = os.path.join('examples', 'bernoulli.data.json')
# fit the model
In [8]: fit = model.sample(data=data_file)
INFO:cmdstanpy:CmdStan start processing
INFO:cmdstanpy:CmdStan done processing.
Underlyingly, the CmdStan outputs are a set of per-chain
Stan CSV files.
The filenames follow the template ‘<model_name>-<YYYYMMDDHHMMSS>-<chain_id>’
plus the file suffix ‘.csv’.
CmdStanPy also captures the per-chain console and error messages.
The output_dir
argument is an optional argument which specifies
the path to the output directory used by CmdStan.
If this argument is omitted, the output files are written
to a temporary directory which is deleted when the current Python session is terminated.
# printing the object reports sampler commands, output files
In [9]: print(fit)
CmdStanMCMC: model=bernoulli chains=4['method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
csv_files:
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_1.csv
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_2.csv
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_3.csv
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_4.csv
output_files:
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_0-stdout.txt
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_1-stdout.txt
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_2-stdout.txt
/tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_3-stdout.txt
Accessing the results#
The sample
method returns a CmdStanMCMC
object,
which provides access to the information from the Stan CSV files.
The CSV header and data rows contain the outputs from each iteration of the sampler.
CSV comment blocks are used to report the inference engine configuration and timing information.
The NUTS-HMC adaptive sampler algorithm also outputs the per-chain HMC tuning parameters step_size and metric.
The CmdStanMCMC
object parses the set of Stan CSV files into separate in-memory data structures for
the set of sampler iterations, the metadata, and the step_size and metric and provides accessor methods for each.
The primary object of interest are the draws from all iterations of the sampler, i.e., the CSV data rows.
The CmdStanMCMC
methods allow the user to extract the sample in whatever data format is needed for their analysis.
The sample can be extracted in tabular format, either as
a numpy.ndarray:
draws()
a pandas.DataFrame:
draws_pd()
In [10]: print(fit.draws().shape)
(1000, 4, 8)
In [11]: print(fit.draws(concat_chains=True).shape)
(4000, 8)
In [12]: fit.draws_pd()
Out[12]:
lp__ accept_stat__ stepsize__ ... divergent__ energy__ theta
0 -7.57011 1.000000 1.02846 ... 0.0 8.23025 0.427618
1 -7.19039 0.887861 1.02846 ... 0.0 8.10070 0.146248
2 -7.55690 0.883966 1.02846 ... 0.0 8.83854 0.116982
3 -6.88085 1.000000 1.02846 ... 0.0 7.37643 0.189540
4 -7.19404 0.891222 1.02846 ... 0.0 7.72335 0.145882
... ... ... ... ... ... ... ...
3995 -7.57873 0.794021 1.00988 ... 0.0 7.63636 0.428605
3996 -6.77030 1.000000 1.00988 ... 0.0 7.28446 0.276983
3997 -6.75470 0.997901 1.00988 ... 0.0 6.77972 0.235742
3998 -6.75470 0.554333 1.00988 ... 0.0 8.34367 0.235742
3999 -6.75470 0.502289 1.00988 ... 0.0 9.65013 0.235742
[4000 rows x 8 columns]
The sample can be treated as a collection of named, structured variables. CmdStanPy makes a distinction between the per-iteration model outputs and the per-iteration algorithm outputs: the former are ‘stan_variables’ and the information reported by the sampler are ‘method_variables’. Accessor functions extract these as:
a structured numpy.ndarray:
stan_variable()
which contains the set of all draws in the sample for the named Stan program variable. The draws from all chains are flattened, i.e., the first ndarray dimension is the number of draws X number of chains. The remaining ndarray dimensions correspond to the Stan program variable dimension.an xarray.Dataset:
draws_xr()
a Python dict mapping Stan variable names to numpy.ndarray objects, where the chains are flattened, as above:
stan_variables()
.a Python dict mapping the algorithm outputs to numpy.ndarray objects. Because these outputs are used for within-chain and cross-chain diagnostics, they are not flattened.
stan_variables()
.
In [13]: print(fit.stan_variable('theta'))
[0.427618 0.146248 0.116982 ... 0.235742 0.235742 0.235742]
In [14]: print(fit.draws_xr('theta'))
<xarray.Dataset>
Dimensions: (draw: 1000, chain: 4)
Coordinates:
* chain (chain) int64 1 2 3 4
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
theta (chain, draw) float64 0.4276 0.1462 0.117 ... 0.2357 0.2357 0.2357
Attributes:
stan_version: 2.29.2
model: bernoulli_model
num_draws_sampling: 1000
In [15]: for k, v in fit.stan_variables().items():
....: print(f'{k}\t{v.shape}')
....:
theta (4000,)
In [16]: for k, v in fit.method_variables().items():
....: print(f'{k}\t{v.shape}')
....:
lp__ (1000, 4)
accept_stat__ (1000, 4)
stepsize__ (1000, 4)
treedepth__ (1000, 4)
n_leapfrog__ (1000, 4)
divergent__ (1000, 4)
energy__ (1000, 4)
In addition to the MCMC sample itself, the CmdStanMCMC object provides access to the the per-chain HMC tuning parameters from the NUTS-HMC adaptive sampler, (if present).
In [17]: print(fit.metric_type)
diag_e
In [18]: print(fit.metric)
[[0.619973]
[0.473161]
[0.51851 ]
[0.556243]]
In [19]: print(fit.step_size)
[1.02846 0.91694 0.949542 1.00988 ]
The CmdStanMCMC object also provides access to metadata about the model and the sampler run.
In [20]: print(fit.metadata.cmdstan_config['model'])
bernoulli_model
In [21]: print(fit.metadata.cmdstan_config['seed'])
91511
In [22]: print(fit.metadata.stan_vars_cols.keys())
dict_keys(['theta'])
In [23]: print(fit.metadata.method_vars_cols.keys())
dict_keys(['lp__', 'accept_stat__', 'stepsize__', 'treedepth__', 'n_leapfrog__', 'divergent__', 'energy__'])
CmdStan utilities: stansummary
, diagnose
#
CmdStan is distributed with a posterior analysis utility
stansummary
that reads the outputs of all chains and computes summary statistics
for all sampler and model parameters and quantities of interest.
The CmdStanMCMC
method summary()
runs this utility and returns
summaries of the total joint log-probability density lp__ plus
all model parameters and quantities of interest in a pandas.DataFrame:
In [24]: fit.summary()
Out[24]:
Mean MCSE StdDev ... N_Eff N_Eff/s R_hat
name ...
lp__ -7.274110 0.018740 0.751004 ... 1605.98 21131.3 1.00055
theta 0.255293 0.003017 0.120481 ... 1594.24 20976.9 1.00143
[2 rows x 9 columns]
CmdStan is distributed with a second posterior analysis utility
diagnose
which analyzes the per-draw sampler parameters across all chains
looking for potential problems which indicate that the sample
isn’t a representative sample from the posterior.
The diagnose()
method runs this utility and prints the output to the console.
In [25]: print(fit.diagnose())
Processing csv files: /tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_1.csv, /tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_2.csv, /tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_3.csv, /tmp/tmpqlune80m/bernoullihdh8feyi/bernoulli-20220630204016_4.csv
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
Managing Stan CSV files#
The CmdStanMCMC
object keeps track of all output files produced
by the sampler run.
The save_csvfiles()
function moves the CSV files
to a specified directory.
In [26]: fit.save_csvfiles(dir='some/path')