Generating new quantities of interest.

The generated quantities block computes quantities of interest based on the data, transformed data, parameters, and transformed parameters. It can be used to:

  • generate simulated data for model testing by forward sampling

  • generate predictions for new data

  • calculate posterior event probabilities, including multiple comparisons, sign tests, etc.

  • calculating posterior expectations

  • transform parameters for reporting

  • apply full Bayesian decision theory

  • calculate log likelihoods, deviances, etc. for model comparison

Example: add posterior predictive checks to bernoulli.stan

In this example we use the CmdStan example model bernoulli.stan and data file bernoulli.data.json as our existing model and data.

We instantiate the model bernoulli, as in the “Hello World” section of the CmdStanPy tutorial notebook.

[1]:
import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
print(model.code())
INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli
data {
  int<lower=0> N;
  array[N] int<lower=0,upper=1> y; // or int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

The input data consists of N - the number of bernoulli trials and y - the list of observed outcomes. Inspection of the data shows that on average, there is a 20% chance of success for any given Bernoulli trial.

[2]:
# examine bernoulli data
import ujson
import statistics
with open(data_file,'r') as fp:
    data_dict = ujson.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))
{'N': 10, 'y': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
mean of y: 0.2

As in the “Hello World” tutorial, we produce a sample from the posterior of the model conditioned on the data:

[3]:
# fit the model to the data
fit = model.sample(data=data_file)
INFO:cmdstanpy:sampling: ['/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli', 'id=1', 'random', 'seed=40882', 'data', 'file=/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/conda/v1.0.0rc1/bin/cmdstan/examples/bernoulli/bernoulli.data.json', 'output', 'file=/tmp/tmptbcr50_o/bernoulli-20211019134707-1-o1z_3dg5.csv', 'method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']

INFO:cmdstanpy:sampling completed

The fitted model produces an estimate of theta - the chance of success

[4]:
fit.summary()
[4]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -7.30 0.0170 0.71 -8.600 -7.00 -6.80 1800.0 21000.0 1.0
theta 0.25 0.0031 0.12 0.086 0.24 0.47 1500.0 17000.0 1.0

To run a prior predictive check, we add a generated quantities block to the model, in which we generate a new data vector y_rep using the current estimate of theta. The resulting model is in file bernoulli_ppc.stan

[5]:
model_ppc = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(model_ppc.code())
INFO:cmdstanpy:compiling stan program, exe file: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.0rc1/docsrc/examples/bernoulli_ppc
INFO:cmdstanpy:compiler options: stanc_options={}, cpp_options={}
INFO:cmdstanpy:compiled model file: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.0rc1/docsrc/examples/bernoulli_ppc
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = bernoulli_rng(theta);
}

We run the generate_quantities method on bernoulli_ppc using existing sample fit as input. The generate_quantities method takes the values of theta in the fit sample as the set of draws from the posterior used to generate the corresponsing y_rep quantities of interest.

The arguments to the generate_quantities method are: + data - the data used to fit the model + mcmc_sample - either a CmdStanMCMC object or a list of stan-csv files

[6]:
new_quantities = model_ppc.generate_quantities(data=data_file, mcmc_sample=fit)
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 4

The generate_quantities method returns a CmdStanGQ object which contains the values for all variables in the generated quantitites block of the program bernoulli_ppc.stan. Unlike the output from the sample method, it doesn’t contain any information on the joint log probability density, sampler state, or parameters or transformed parameter values.

In this example, each draw consists of the N-length array of replicate of the bernoulli model’s input variable y, which is an N-length array of Bernoulli outcomes.

[7]:
print(new_quantities.draws().shape, new_quantities.column_names)
for i in range(3):
    print (new_quantities.draws()[i,:])
(1000, 4, 10) ('y_rep[1]', 'y_rep[2]', 'y_rep[3]', 'y_rep[4]', 'y_rep[5]', 'y_rep[6]', 'y_rep[7]', 'y_rep[8]', 'y_rep[9]', 'y_rep[10]')
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 1. 0. 1.]
 [1. 0. 1. 1. 0. 0. 0. 1. 1. 1.]]
[[1. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

We can also use draws_pd(inc_sample=True) to get a pandas DataFrame which combines the input drawset with the generated quantities.

[8]:
sample_plus = new_quantities.draws_pd(inc_sample=True)
print(type(sample_plus),sample_plus.shape)
names = list(sample_plus.columns.values[7:18])
sample_plus.iloc[0:3, :]
<class 'pandas.core.frame.DataFrame'> (4000, 18)
[8]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 -7.47305 0.734708 0.965126 2.0 3.0 0.0 8.69192 0.122640 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 -8.58632 0.729771 0.965126 2.0 3.0 0.0 10.84510 0.071361 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 -6.74850 0.966129 0.965126 1.0 3.0 0.0 9.27414 0.253873 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0

For models as simple as the bernoulli models here, it would be trivial to re-run the sampler and generate a new sample which contains both the estimate of the parameters theta as well as y_rep values. For models which are difficult to fit, i.e., when producing a sample is computationally expensive, the generate_quantities method is preferred.