MCMC Sampling

The CmdStanModel class method sample invokes Stan’s adaptive HMC-NUTS sampler which uses the Hamiltonian Monte Carlo (HMC) algorithm and its adaptive variant the no-U-turn sampler (NUTS) to produce a set of draws from the posterior distribution of the model parameters conditioned on the data. It returns a CmdStanMCMC object which provides properties to retrieve information about the sample, as well as methods to run CmdStan’s summary and diagnostics tools.

In order to evaluate the fit of the model to the data, it is necessary to run several Monte Carlo chains and compare the set of draws returned by each. By default, the sample command runs 4 sampler chains, i.e., CmdStanPy invokes CmdStan 4 times. CmdStanPy uses Python’s subprocess and multiprocessing libraries to run these chains in separate processes. This processing can be done in parallel, up to the number of processor cores available.

Prerequisites

CmdStanPy displays progress bars during sampling via use of package tqdm. In order for these to display properly, you must have the ipywidgets package installed, and depending on your version of Jupyter or JupyterLab, you must enable it via command:

[1]:
!jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

For more information, see the the installation instructions, also this tqdm GitHub issue.

Fitting a model to data

In this example we use the CmdStan example model bernoulli.stan and data file bernoulli.data.json.

We instantiate a CmdStanModel from the Stan program file

[2]:
import os
from cmdstanpy import CmdStanModel, cmdstan_path

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
INFO:cmdstanpy:compiling stan file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/bin/cmdstan/examples/bernoulli/bernoulli.stan to exe file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/bin/cmdstan/examples/bernoulli/bernoulli
INFO:cmdstanpy:compiled model executable: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/bin/cmdstan/examples/bernoulli/bernoulli

By default, the model is compiled during instantiation. The compiled executable is created in the same directory as the program file. If the directory already contains an executable file with a newer timestamp, the model is not recompiled.

We run the sampler on the data using all default settings: 4 chains, each of which runs 1000 warmup and sampling iterations.

[3]:
# run CmdStan's sample method, returns object `CmdStanMCMC`
fit = model.sample(data=data_file)
INFO:cmdstanpy:CmdStan start processing

INFO:cmdstanpy:CmdStan done processing.

The sample method returns a CmdStanMCMC object, which contains: - metadata - draws - HMC tuning parameters metric, step_size

[4]:
print('sampler diagnostic variables:\n{}'.format(fit.metadata.method_vars_cols.keys()))
print('stan model variables:\n{}'.format(fit.metadata.stan_vars_cols.keys()))
sampler diagnostic variables:
dict_keys(['lp__', 'accept_stat__', 'stepsize__', 'treedepth__', 'n_leapfrog__', 'divergent__', 'energy__'])
stan model variables:
dict_keys(['theta'])
[5]:
fit.summary()
[5]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -7.30 0.0180 0.71 -8.700 -7.00 -6.80 1500.0 17000.0 1.0
theta 0.25 0.0029 0.12 0.081 0.24 0.46 1600.0 18000.0 1.0

The sampling data from the fit can be accessed either as a numpy array or a pandas DataFrame:

[6]:
print(fit.draws().shape)
fit.draws_pd().head()
(1000, 4, 8)
[6]:
lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ theta
0 -6.76311 1.000000 0.864132 2.0 3.0 0.0 7.70169 0.228715
1 -6.79635 0.986741 0.864132 1.0 3.0 0.0 6.85193 0.290134
2 -8.04407 0.801004 0.864132 2.0 3.0 0.0 8.04431 0.091230
3 -7.08429 0.892570 0.864132 1.0 3.0 0.0 9.51795 0.360480
4 -7.70584 0.827265 0.864132 1.0 1.0 0.0 7.73255 0.442635

Additionally, if xarray is installed, this data can be accessed another way:

[7]:
fit.draws_xr()
[7]:
<xarray.Dataset>
Dimensions:  (draw: 1000, chain: 4)
Coordinates:
  * chain    (chain) int64 1 2 3 4
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
    theta    (chain, draw) float64 0.2287 0.2901 0.09123 ... 0.1542 0.2446
Attributes:
    stan_version:        2.28.2
    model:               bernoulli_model
    num_draws_sampling:  1000

The fit object records the command, the return code, and the paths to the sampler output csv and console files. The string representation of this object displays the CmdStan commands and the location of the output files.

Output filenames are composed of the model name, a timestamp in the form YYYYMMDDhhmm and the chain id, plus the corresponding filetype suffix, either ‘.csv’ for the CmdStan output or ‘.txt’ for the console messages, e.g. bernoulli-201912081451-1.csv. Output files written to the temporary directory contain an additional 8-character random string, e.g. bernoulli-201912081451-1-5nm6as7u.csv.

[8]:
fit
[8]:
CmdStanMCMC: model=bernoulli chains=4['method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']
 csv_files:
        /tmp/tmplzf18i6_/bernoulli-20220214161211_1.csv
        /tmp/tmplzf18i6_/bernoulli-20220214161211_2.csv
        /tmp/tmplzf18i6_/bernoulli-20220214161211_3.csv
        /tmp/tmplzf18i6_/bernoulli-20220214161211_4.csv
 output_files:
        /tmp/tmplzf18i6_/bernoulli-20220214161211_0-stdout.txt
        /tmp/tmplzf18i6_/bernoulli-20220214161211_1-stdout.txt
        /tmp/tmplzf18i6_/bernoulli-20220214161211_2-stdout.txt
        /tmp/tmplzf18i6_/bernoulli-20220214161211_3-stdout.txt

The sampler output files are written to a temporary directory which is deleted upon session exit unless the output_dir argument is specified. The save_csvfiles function moves the CmdStan CSV output files to a specified directory without having to re-run the sampler. The console output files are not saved. These files are treated as ephemeral; if the sample is valid, all relevant information is recorded in the CSV files.

Sampler Progress

Your model make take a long time to fit. The sample method provides two arguments:

  • visual progress bar: show_progress=True

  • stream CmdStan ouput to the console - show_console=True

To illustrate how progress bars work, we will run the bernoulli model. Since the progress bars are only visible while the sampler is running and the bernoulli model takes no time at all to fit, we run this model for 200K iterations, in order to see the progress bars in action.

[9]:
fit = model.sample(data=data_file, iter_warmup=100000, iter_sampling=100000, show_progress=True)

INFO:cmdstanpy:CmdStan start processing

INFO:cmdstanpy:CmdStan done processing.

The Stan language print statement can be use to monitor the Stan program state. In order to see this information as the sampler is running, use the show_console=True argument. This will stream all CmdStan messages to the terminal while the sampler is running.

[10]:
fit = model.sample(data=data_file, chains=2, parallel_chains=1, show_console=True)


INFO:cmdstanpy:Chain [1] start processing
INFO:cmdstanpy:Chain [1] done processing
INFO:cmdstanpy:Chain [2] start processing
INFO:cmdstanpy:Chain [2] done processing
Chain [1] method = sample (Default)
Chain [1] sample
Chain [1] num_samples = 1000 (Default)
Chain [1] num_warmup = 1000 (Default)
Chain [1] save_warmup = 0 (Default)
Chain [1] thin = 1 (Default)
Chain [1] adapt
Chain [1] engaged = 1 (Default)
Chain [1] gamma = 0.050000000000000003 (Default)
Chain [1] delta = 0.80000000000000004 (Default)
Chain [1] kappa = 0.75 (Default)
Chain [1] t0 = 10 (Default)
Chain [1] init_buffer = 75 (Default)
Chain [1] term_buffer = 50 (Default)
Chain [1] window = 25 (Default)
Chain [1] algorithm = hmc (Default)
Chain [1] hmc
Chain [1] engine = nuts (Default)
Chain [1] nuts
Chain [1] max_depth = 10 (Default)
Chain [1] metric = diag_e (Default)
Chain [1] metric_file =  (Default)
Chain [1] stepsize = 1 (Default)
Chain [1] stepsize_jitter = 0 (Default)
Chain [1] num_chains = 1 (Default)
Chain [1] id = 1 (Default)
Chain [1] data
Chain [1] file = /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/bin/cmdstan/examples/bernoulli/bernoulli.data.json
Chain [1] init = 2 (Default)
Chain [1] random
Chain [1] seed = 20561
Chain [1] output
Chain [1] file = /tmp/tmplzf18i6_/bernoulli-20220214161220_1.csv
Chain [1] diagnostic_file =  (Default)
Chain [1] refresh = 100 (Default)
Chain [1] sig_figs = -1 (Default)
Chain [1] profile_file = profile.csv (Default)
Chain [1] num_threads = 1 (Default)
Chain [1]
Chain [1]
Chain [1] Gradient evaluation took 4e-06 seconds
Chain [1] 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
Chain [1] Adjust your expectations accordingly!
Chain [1]
Chain [1]
Chain [1] Iteration:    1 / 2000 [  0%]  (Warmup)
Chain [1] Iteration:  100 / 2000 [  5%]  (Warmup)
Chain [1] Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain [1] Iteration:  300 / 2000 [ 15%]  (Warmup)
Chain [1] Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain [1] Iteration:  500 / 2000 [ 25%]  (Warmup)
Chain [1] Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain [1] Iteration:  700 / 2000 [ 35%]  (Warmup)
Chain [1] Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain [1] Iteration:  900 / 2000 [ 45%]  (Warmup)
Chain [1] Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain [1] Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain [1] Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain [1] Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain [1] Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain [1] Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain [1] Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain [1] Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain [1] Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain [1] Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain [1] Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain [1] Iteration: 2000 / 2000 [100%]  (Sampling)
Chain [1]
Chain [1] Elapsed Time: 0.008 seconds (Warm-up)
Chain [1] 0.011 seconds (Sampling)
Chain [1]                0.019 seconds (Total)
Chain [1]
Chain [1]
Chain [2] method = sample (Default)
Chain [2] sample
Chain [2] num_samples = 1000 (Default)
Chain [2] num_warmup = 1000 (Default)
Chain [2] save_warmup = 0 (Default)
Chain [2] thin = 1 (Default)
Chain [2] adapt
Chain [2] engaged = 1 (Default)
Chain [2] gamma = 0.050000000000000003 (Default)
Chain [2] delta = 0.80000000000000004 (Default)
Chain [2] kappa = 0.75 (Default)
Chain [2] t0 = 10 (Default)
Chain [2] init_buffer = 75 (Default)
Chain [2] term_buffer = 50 (Default)
Chain [2] window = 25 (Default)
Chain [2] algorithm = hmc (Default)
Chain [2] hmc
Chain [2] engine = nuts (Default)
Chain [2] nuts
Chain [2] max_depth = 10 (Default)
Chain [2] metric = diag_e (Default)
Chain [2] metric_file =  (Default)
Chain [2] stepsize = 1 (Default)
Chain [2] stepsize_jitter = 0 (Default)
Chain [2] num_chains = 1 (Default)
Chain [2] id = 2
Chain [2] data
Chain [2] file = /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/bin/cmdstan/examples/bernoulli/bernoulli.data.json
Chain [2] init = 2 (Default)
Chain [2] random
Chain [2] seed = 20561
Chain [2] output
Chain [2] file = /tmp/tmplzf18i6_/bernoulli-20220214161220_2.csv
Chain [2] diagnostic_file =  (Default)
Chain [2] refresh = 100 (Default)
Chain [2] sig_figs = -1 (Default)
Chain [2] profile_file = profile.csv (Default)
Chain [2] num_threads = 1 (Default)
Chain [2]
Chain [2]
Chain [2] Gradient evaluation took 3e-06 seconds
Chain [2] 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
Chain [2] Adjust your expectations accordingly!
Chain [2]
Chain [2]
Chain [2] Iteration:    1 / 2000 [  0%]  (Warmup)
Chain [2] Iteration:  100 / 2000 [  5%]  (Warmup)
Chain [2] Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain [2] Iteration:  300 / 2000 [ 15%]  (Warmup)
Chain [2] Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain [2] Iteration:  500 / 2000 [ 25%]  (Warmup)
Chain [2] Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain [2] Iteration:  700 / 2000 [ 35%]  (Warmup)
Chain [2] Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain [2] Iteration:  900 / 2000 [ 45%]  (Warmup)
Chain [2] Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain [2] Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain [2] Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain [2] Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain [2] Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain [2] Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain [2] Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain [2] Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain [2] Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain [2] Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain [2] Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain [2] Iteration: 2000 / 2000 [100%]  (Sampling)
Chain [2]
Chain [2] Elapsed Time: 0.006 seconds (Warm-up)
Chain [2] 0.014 seconds (Sampling)
Chain [2] 0.02 seconds (Total)
Chain [2]
Chain [2]

Running a data-generating model

In this example we use the CmdStan example model data_filegen.stan to generate a simulated dataset given fixed data values.

[11]:
model_datagen = CmdStanModel(stan_file='bernoulli_datagen.stan')
datagen_data = {'N':300, 'theta':0.3}
fit_sim = model_datagen.sample(data=datagen_data, fixed_param=True)
fit_sim.summary()
INFO:cmdstanpy:compiling stan file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.1/docsrc/examples/bernoulli_datagen.stan to exe file /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.1/docsrc/examples/bernoulli_datagen
INFO:cmdstanpy:compiled model executable: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.1/docsrc/examples/bernoulli_datagen
INFO:cmdstanpy:CmdStan start processing

INFO:cmdstanpy:CmdStan done processing.

[11]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ 0 NaN 0.0 0 0 0.0 NaN NaN NaN
theta_rep 90 0.24 7.7 77 90 100.0 990.0 120000.0 1.0

Compute, plot histogram of total successes for N Bernoulli trials with chance of success theta:

[12]:
drawset_pd = fit_sim.draws_pd()
drawset_pd.columns

# restrict to columns over new outcomes of N Bernoulli trials
y_sims = drawset_pd.drop(columns=['lp__', 'accept_stat__'])

# plot total number of successes per draw
y_sums = y_sims.sum(axis=1)
y_sums.astype('int32').plot.hist(range(0,datagen_data['N']+1))
Matplotlib is building the font cache; this may take a moment.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [12], in <module>
      7 # plot total number of successes per draw
      8 y_sums = y_sims.sum(axis=1)
----> 9 y_sums.astype('int32').plot.hist(range(0,datagen_data['N']+1))

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/plotting/_core.py:1346, in PlotAccessor.hist(self, by, bins, **kwargs)
   1286 def hist(self, by=None, bins=10, **kwargs):
   1287     """
   1288     Draw one histogram of the DataFrame's columns.
   1289
   (...)
   1344         >>> ax = df.plot.hist(column=["age"], by="gender", figsize=(10, 8))
   1345     """
-> 1346     return self(kind="hist", by=by, bins=bins, **kwargs)

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/plotting/_core.py:972, in PlotAccessor.__call__(self, *args, **kwargs)
    969             label_name = label_kw or data.columns
    970             data.columns = label_name
--> 972 return plot_backend.plot(data, kind=kind, **kwargs)

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/plotting/_matplotlib/__init__.py:70, in plot(data, kind, **kwargs)
     68             ax = plt.gca()
     69         kwargs["ax"] = getattr(ax, "left_ax", ax)
---> 70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
     71 plot_obj.generate()
     72 plot_obj.draw()

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/plotting/_matplotlib/hist.py:49, in HistPlot.__init__(self, data, bins, bottom, **kwargs)
     47 self.bottom = bottom
     48 # Do not call LinePlot.__init__ which may fill nan
---> 49 MPLPlot.__init__(self, data, **kwargs)

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/plotting/_matplotlib/core.py:165, in MPLPlot.__init__(self, data, kind, by, subplots, sharex, sharey, use_index, figsize, grid, legend, rot, ax, fig, title, xlim, ylim, xticks, yticks, xlabel, ylabel, sort_columns, fontsize, secondary_y, colormap, table, layout, include_bool, column, **kwds)
    162 # For `hist` plot, need to get grouped original data before `self.data` is
    163 # updated later
    164 if self.by is not None and self._kind == "hist":
--> 165     self._grouped = data.groupby(self.by)
    167 self.kind = kind
    169 self.sort_columns = sort_columns

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/series.py:1922, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   1918 axis = self._get_axis_number(axis)
   1920 # error: Argument "squeeze" to "SeriesGroupBy" has incompatible type
   1921 # "Union[bool, NoDefault]"; expected "bool"
-> 1922 return SeriesGroupBy(
   1923     obj=self,
   1924     keys=by,
   1925     axis=axis,
   1926     level=level,
   1927     as_index=as_index,
   1928     sort=sort,
   1929     group_keys=group_keys,
   1930     squeeze=squeeze,  # type: ignore[arg-type]
   1931     observed=observed,
   1932     dropna=dropna,
   1933 )

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:893, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    888         in_axis = False
    890     # create the Grouping
    891     # allow us to passing the actual Grouping as the gpr
    892     ping = (
--> 893         Grouping(
    894             group_axis,
    895             gpr,
    896             obj=obj,
    897             level=level,
    898             sort=sort,
    899             observed=observed,
    900             in_axis=in_axis,
    901             dropna=dropna,
    902         )
    903         if not isinstance(gpr, Grouping)
    904         else gpr
    905     )
    907     groupings.append(ping)
    909 if len(groupings) == 0 and len(obj):

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:545, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna)
    542     t = self.name or str(type(self.grouping_vector))
    543     raise ValueError(f"Grouper for '{t}' not 1-dimensional")
--> 545 self.grouping_vector = index.map(self.grouping_vector)
    547 if not (
    548     hasattr(self.grouping_vector, "__len__")
    549     and len(self.grouping_vector) == len(index)
    550 ):
    551     grper = pprint_thing(self.grouping_vector)

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/indexes/base.py:6076, in Index.map(self, mapper, na_action)
   6056 """
   6057 Map values using an input mapping or function.
   6058
   (...)
   6072     a MultiIndex will be returned.
   6073 """
   6074 from pandas.core.indexes.multi import MultiIndex
-> 6076 new_values = self._map_values(mapper, na_action=na_action)
   6078 # we can return a MultiIndex
   6079 if new_values.size and isinstance(new_values[0], tuple):

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/core/base.py:880, in IndexOpsMixin._map_values(self, mapper, na_action)
    877         raise ValueError(msg)
    879 # mapper is a function
--> 880 new_values = map_f(values, mapper)
    882 return new_values

File ~/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.1/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2870, in pandas._libs.lib.map_infer()

TypeError: 'range' object is not callable