# Using Variational Estimates to Initialize the NUTS-HMC Sampler¶

In this example we show how to use the parameter estimates returned by Stan’s variational inference algorithm as the initial parameter values for Stan’s NUTS-HMC sampler, using the earnings-logearn_height model and data from the posteriordb package.

The experiments reported in the paper Pathfinder: Parallel quasi-Newton variational inference by Zhang et al. show that mean-field ADVI provides a better estimate of the posterior, as measured by the 1-Wasserstein distance to the reference posterior, than 75 iterations of the warmup Phase I algorithm used by the NUTS-HMC sampler. Furthermore, ADVI is more computationally efficient, requiring fewer evaluations of the log density and gradient functions. Therefore, using the estimates from ADVI to initialize the parameter values for the NUTS-HMC sampler allows the sampler to do a better job of adapting the stepsize and metric during warmup, resulting in better performance and estimation.
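
The 1-Wasserstein distance used as the quality metric above has a particularly simple form for one-dimensional empirical distributions with equal sample sizes: sort both sets of draws and average the absolute differences between paired order statistics. A minimal sketch (the draws here are made up for illustration):

```python
# 1-Wasserstein distance between two equal-size 1-D samples:
# sort both and average the absolute differences between paired order statistics.
def wasserstein_1d(xs, ys):
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

draws_a = [0, 1, 2, 3]
draws_b = [1, 2, 3, 4]
print(wasserstein_1d(draws_a, draws_b))  # → 1.0
```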

## Model and data¶

For convenience, we have copied the posteriordb model and data to this directory, in files logearn_height.stan and earnings.json.

[1]:

import os
from cmdstanpy import CmdStanModel

stan_file = 'logearn_height.stan'
data_file = 'earnings.json'

# instantiate, compile the logearn_height model
model = CmdStanModel(stan_file=stan_file)

INFO:cmdstanpy:compiling stan program, exe file: /home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.0rc1/docsrc/examples/logearn_height
INFO:cmdstanpy:compiler options: stanc_options={}, cpp_options={}


The earnings dataset is a set of 1192 observations of annual earnings (USD), height (inches), and an indicator for sex==male.

[2]:

import json
with open(data_file, 'r') as fd:
    data_dict = json.load(fd)
print(data_dict.keys())
print(data_dict['N'])
for i in range(5):
    print(data_dict['earn'][i], data_dict['height'][i])


dict_keys(['N', 'earn', 'height', 'male'])
1192
50000 74
60000 66
30000 64
50000 63
51000 63


The “logearn_height” model regresses the log earnings on height.

[3]:

print(model.code())

data {
  int<lower=0> N;
  vector[N] earn;
  vector[N] height;
}
transformed data {           // log transformation
  vector[N] log_earn;
  log_earn = log(earn);
}
parameters {
  vector[2] beta;
  real<lower=0> sigma;
}
model {
  log_earn ~ normal(beta[1] + beta[2] * height, sigma);
}
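
As a point of comparison outside of Stan, the same mean structure, beta[1] + beta[2] * height for the log earnings, can be fit by ordinary least squares. The sketch below uses synthetic data generated near the reference posterior values (the data and coefficients are illustrative, not the earnings dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the earnings data, generated near the reference
# posterior values: log(earn) ≈ 5.78 + 0.059 * height, residual sd ≈ 0.89
N = 1192
height = rng.uniform(60, 75, size=N)
log_earn = 5.78 + 0.059 * height + rng.normal(0.0, 0.89, size=N)

# least-squares fit of log_earn on [1, height], the same mean structure
# as the Stan model's normal likelihood
X = np.column_stack([np.ones(N), height])
beta_hat, *_ = np.linalg.lstsq(X, log_earn, rcond=None)
print(beta_hat)  # close to [5.78, 0.059], up to sampling noise
```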



## Run Stan’s variational inference algorithm, obtain fitted estimates¶

The CmdStanModel method variational runs CmdStan’s ADVI algorithm. Conditioning the model on the data results in a posterior geometry that is difficult to navigate. Because the ADVI algorithm is unstable and may fail to converge, we run it with the argument require_converged set to False. We also specify a seed, both to avoid instabilities and for reproducibility.

[4]:

vb_fit = model.variational(data=data_file, require_converged=False, seed=123)

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
WARNING:cmdstanpy:The algorithm may not have converged.
Proceeding because require_converged is set to False


The ADVI algorithm provides estimates of all model parameters as well as the step size scaling factor eta.

The variational method returns a CmdStanVB object with methods eta and stan_variables, which return the step size scaling factor and a Python dictionary of the model parameter estimates, respectively.

[5]:

print(vb_fit.eta, vb_fit.stan_variables())

10.0 {'beta': array([4.18195  , 0.0841979]), 'sigma': 3.87097}


Posteriordb provides reference posteriors for all models. For the logearn_height model conditioned on the dataset earnings.json, the reference posterior means are:

• beta[1]: 5.782

• beta[2]: 0.059

• sigma: 0.894
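
To see how far the ADVI estimates are from the reference posterior, and how they compare to the best possible draw from the default uniform(-2, 2) initialization, we can check directly (values copied from the output and list above; sigma is omitted because its default initialization happens on the unconstrained log scale):

```python
# ADVI estimates and reference posterior means copied from above
advi = {'beta[1]': 4.18195, 'beta[2]': 0.0841979}
reference = {'beta[1]': 5.782, 'beta[2]': 0.059}

for name, ref in reference.items():
    advi_gap = abs(advi[name] - ref)
    # default inits are drawn from uniform(-2, 2); the luckiest possible
    # draw is this far from the reference value:
    best_default_gap = max(abs(ref) - 2.0, 0.0)
    print(f'{name}: ADVI gap = {advi_gap:.3f}, best default-init gap = {best_default_gap:.3f}')
```

For beta[1], even the luckiest default draw (2.0) is farther from the reference mean (5.782) than the ADVI estimate is.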

By default, the sampler algorithm randomly initializes all model parameters in the range uniform[-2, 2]. The ADVI estimates provide a better starting point than the defaults, especially with respect to parameter beta[1]. In addition, we can use the step size scaling factor to scale the initial step size, which allows us to skip the first phase of warmup (by default, 75 iterations).

[6]:

vb_vars = vb_fit.stan_variables()
vb_stepsize = 1.0 / vb_fit.eta
mcmc_vb_inits_fit = model.sample(
    data=data_file, inits=vb_vars, step_size=vb_stepsize,
    adapt_init_phase=0,  # skip warmup phase I
)

INFO:cmdstanpy:sampling: ['/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.0rc1/docsrc/examples/logearn_height', 'id=1', 'random', 'seed=123', 'data', 'file=earnings.json', 'init=/tmp/tmpxedjwyz2/dlwazbgc.json', 'output', 'file=/tmp/tmpxedjwyz2/logearn_height-20211019134745-1-4lxw741n.csv', 'method=sample', 'algorithm=hmc', 'stepsize=0.1', 'adapt', 'engaged=1', 'init_buffer=0']



INFO:cmdstanpy:sampling completed



[7]:

mcmc_vb_inits_fit.summary()

[7]:

name        Mean     MCSE  StdDev        5%       50%      95%   N_Eff  N_Eff/s  R_hat
lp__    -460.000  0.03200  1.2000  -460.000  -460.000  -460.00  1400.0    130.0    1.0
beta[1]    5.800  0.01200  0.4400     5.100     5.800     6.50  1293.0    124.0    1.0
beta[2]    0.059  0.00018  0.0065     0.048     0.059     0.07  1289.0    124.0    1.0
sigma      0.900  0.00000  0.0000     0.900     0.900     0.90  1653.5    158.8    1.0

The sampler results match the reference posterior (taking the MCSE into account):

• beta[1]: 5.782

• beta[2]: 0.059

• sigma: 0.894

To see how this is useful, we run the sampler with the default initializations, step size, and warmup, and compare the results.

[8]:

mcmc_random_inits_fit = model.sample(data=data_file, seed=123)

INFO:cmdstanpy:sampling: ['/home/docs/checkouts/readthedocs.org/user_builds/cmdstanpy/checkouts/v1.0.0rc1/docsrc/examples/logearn_height', 'id=1', 'random', 'seed=123', 'data', 'file=earnings.json', 'output', 'file=/tmp/tmpxedjwyz2/logearn_height-20211019134756-1-eylq9uzj.csv', 'method=sample', 'algorithm=hmc', 'adapt', 'engaged=1']



INFO:cmdstanpy:sampling completed



[9]:

mcmc_random_inits_fit.summary()

[9]:

name        Mean     MCSE  StdDev        5%       50%      95%   N_Eff  N_Eff/s  R_hat
lp__    -460.000  0.03900  1.2000  -460.000  -460.000  -460.00   970.0     93.0    1.0
beta[1]    5.800  0.01400  0.4500     5.000     5.800     6.50  1001.0     95.0    1.0
beta[2]    0.059  0.00021  0.0066     0.048     0.058     0.07  1008.0     96.0    1.0
sigma      0.900  0.00000  0.0000     0.900     0.900     0.90  1558.4    148.2    1.0

Using the variational estimates to skip warmup phase I shows improved N_Eff/s (number of effective samples per second) values for all parameters. This is a simple model run on a small dataset. For complex models where the initial parameter values are far from the default initializations, this procedure may allow for faster and better adaptation during warmup.
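
As a quick way to quantify the improvement, the N_Eff/s columns of the two summaries can be compared directly (values copied from the tables above):

```python
# N_Eff/s values copied from the two summary tables above
vb_inits = {'lp__': 130.0, 'beta[1]': 124.0, 'beta[2]': 124.0, 'sigma': 158.8}
random_inits = {'lp__': 93.0, 'beta[1]': 95.0, 'beta[2]': 96.0, 'sigma': 148.2}

# ratio > 1 means the variational-initialized run produced
# effective samples faster for that quantity
for name in vb_inits:
    speedup = vb_inits[name] / random_inits[name]
    print(f'{name}: {speedup:.2f}x effective samples per second')
```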