“Hello, World”¶
Fitting a Stan model using the NUTS-HMC sampler¶
In order to verify the installation and also to demonstrate
the CmdStanPy workflow, we use CmdStanPy to fit the
the example Stan model bernoulli.stan
to the dataset bernoulli.data.json
.
This model and data are included with the CmdStan distribution
in subdirectory examples/bernoulli
.
This example allows the user to verify that CmdStanPy, CmdStan,
the StanC compiler, and the C++ toolchain have all been properly installed.
For substantive example models and
guidance on coding statistical models in Stan, see
the CmdStan User’s Guide.
The Stan model¶
The model bernoulli.stan
is a simple model for binary data:
given a set of N observations of i.i.d. binary data
y[1] … y[N], it calculates the Bernoulli chance-of-success theta.
data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
theta ~ beta(1,1); // uniform prior on interval 0,1
y ~ bernoulli(theta);
}
The CmdStanModel
class manages the Stan program and its corresponding compiled executable.
It provides properties and functions to inspect the model code and filepaths.
CmdStanPy, manages the environment variable CMDSTAN
which specifies the path to
the local CmdStan installation.
The function cmdstan_path()
returns the value of this environment variable.
# import packages
In [1]: import os
In [2]: from cmdstanpy import cmdstan_path, CmdStanModel
# specify Stan program file
In [3]: stan_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
# instantiate the model; compiles the Stan program as needed.
In [4]: model = CmdStanModel(stan_file=stan_file)
INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli
# inspect model object
In [5]: print(model)
CmdStanModel: name=bernoulli
stan_file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli.stan
exe_file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli
compiler_options=stanc_options={}, cpp_options={}
Data inputs¶
CmdStanPy accepts input data either as a Python dictionary which maps data variable names to values, or as the corresponding JSON file.
The bernoulli model requires two inputs: the number of observations N, and an N-length vector y of binary outcomes. The data file bernoulli.data.json contains the following inputs:
{
"N" : 10,
"y" : [0,1,0,0,0,0,0,0,0,1]
}
Fitting the model¶
The sample()
method is used to do Bayesian inference
over the model conditioned on data using using Hamiltonian Monte Carlo
(HMC) sampling. It runs Stan’s HMC-NUTS sampler on the model and data and
returns a CmdStanMCMC
object. The data can be specified
either as a filepath or a Python dictionary; in this example, we use the
example datafile bernoulli.data.json:
By default, the sample()
method runs 4 sampler chains.
The output_dir
argument is an optional argument which specifies
the path to the output directory used by CmdStan.
If this argument is omitted, the output files are written
to a temporary directory which is deleted when the current Python session is terminated.
# specify data file
In [6]: data_file = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.data.json')
# fit the model
In [7]: fit = model.sample(data=data_file)
INFO:cmdstanpy:sampling: ['/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli', 'id=1', 'random', 'seed=91202', 'data', 'file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli.data.json', 'output', 'file=/tmp/tmpjrgdbtmq/bernoulli-20211019134821-1-g44875ar.csv', 'method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
INFO:cmdstanpy:sampling completed
# printing the object reports sampler commands, output files
In [8]: print(fit)
CmdStanMCMC: model=bernoulli chains=4['method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
csv_files:
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-1-g44875ar.csv
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-2-i0xnamet.csv
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-3-2_vap_cb.csv
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-4-sr1c4vdj.csv
output_files:
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-1-g44875ar-stdout.txt
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-2-i0xnamet-stdout.txt
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-3-2_vap_cb-stdout.txt
/tmp/tmpjrgdbtmq/bernoulli-20211019134821-4-sr1c4vdj-stdout.txt
Accessing the sample¶
The sample()
method outputs are a set of per-chain
Stan CSV files.
The filenames follow the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’
plus the file suffix ‘.csv’.
The CmdStanMCMC
class provides methods to assemble the contents
of these files in memory as well as methods to manage the disk files.
Underlyingly, the draws from all chains are stored as an
a numpy.ndarray with dimensions: draws, chains, columns.
CmdStanPy provides accessor methods which return the sample
either in terms of the CSV file columns or in terms of the
sampler and Stan program variables.
The draws()
and draws_pd()
methods return the sample contents
in columnar format.
The stan_variable()
method to returns a numpy.ndarray object
which contains the set of all draws in the sample for the named Stan program variable.
The draws from all chains are flattened into a single drawset.
The first ndarray dimension is the number of draws X number of chains.
The remaining ndarray dimensions correspond to the Stan program variable dimension.
The stan_variables()
method returns a Python dict over all Stan model variables.
In [9]: fit.draws().shape
Out[9]: (1000, 4, 8)
In [10]: fit.draws(concat_chains=True).shape
Out[10]: (4000, 8)
In [11]: draws_theta = fit.stan_variable(var='theta')
In [12]: draws_theta.shape
Out[12]: (4000,)
CmdStan utilities: stansummary, diagnose¶
CmdStan is distributed with a posterior analysis utility
stansummary
that reads the outputs of all chains and computes summary statistics
for all sampler and model parameters and quantities of interest.
The CmdStanMCMC
method summary()
runs this utility and returns
summaries of the total joint log-probability density lp__ plus
all model parameters and quantities of interest in a pandas.DataFrame:
In [13]: fit.summary()
Out[13]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -7.30 0.018 0.70 -8.700 -7.00 -6.70 1600.0 17000.0 1.0
theta 0.25 0.003 0.12 0.083 0.23 0.47 1500.0 16000.0 1.0
CmdStan is distributed with a second posterior analysis utility
diagnose
which analyzes the per-draw sampler parameters across all chains
looking for potential problems which indicate that the sample
isn’t a representative sample from the posterior.
The diagnose()
method runs this utility and prints the output to the console.
In [14]: print(fit.diagnose())
Processing csv files: /tmp/tmpjrgdbtmq/bernoulli-20211019134821-1-g44875ar.csv, /tmp/tmpjrgdbtmq/bernoulli-20211019134821-2-i0xnamet.csv, /tmp/tmpjrgdbtmq/bernoulli-20211019134821-3-2_vap_cb.csv, /tmp/tmpjrgdbtmq/bernoulli-20211019134821-4-sr1c4vdj.csv
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
Managing Stan CSV files¶
The CmdStanMCMC
object keeps track of all output files produced
by the sampler run.
The save_csvfiles()
function moves the CSV files
to a specified directory.
In [15]: fit.save_csvfiles(dir='some/path')
Progress bar¶
By default, CmdStanPy displays a progress bar during sampling.
In [16]: fit = model.sample(data=data_file)
To suppress the progress bar, specify argument show_progress=False
.
In [17]: fit = model.sample(data=data_file, show_progress=False)
To see the CmdStan console outputs instead, specify show_console=True
.
In [18]: fit = model.sample(data=data_file, show_console=True)
This will stream all sampler messages to the console. It provides an alternative way of monitoring progress. In conjunction with Stan programs which contain print statments, this provides a way to inspect and debug model behavoir.
Jupyter Lab Notebook requirements¶
In a Jupyter notebook, this package requires the ipywidgets package. For help on installation and configuration, see ipywidgets installation instructions and this tqdm GitHub issue.