API Reference
Classes
CmdStanModel
-
class cmdstanpy.CmdStanModel(model_name: str = None, stan_file: str = None, exe_file: str = None, compile: bool = True, stanc_options: Dict = None, cpp_options: Dict = None, logger: logging.Logger = None)
Stan model.
- Stores pathnames to the Stan program, the compiled executable, and the collection of compiler options.
- Provides functions to compile the model and perform inference on the model given data.
- By default, compiles the model on instantiation; override with argument compile=False.
- By default, property name corresponds to the basename of the Stan program or exe file; override with argument model_name=<name>.
-
compile(force: bool = False, stanc_options: Dict = None, cpp_options: Dict = None, override_options: bool = False) → None
Compile the given Stan program file. Translates the Stan code to C++, then calls the C++ compiler.
By default, this function compares the timestamps on the source and executable files; if the executable is newer than the source file, it will not recompile the file unless argument force is True.
Parameters: - force – When True, always compile, even if the executable file is newer than the source file. Used for Stan models which have #include directives in order to force recompilation when changes are made to the included files.
- stanc_options – Options for the stanc compiler.
- cpp_options – Options for the C++ compiler.
- override_options – When True, override existing options. When False, add/replace existing options. Default is False.
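The timestamp rule described for compile can be sketched in a few lines. This is only an illustration of the comparison, not the library's implementation; the helper name needs_recompile and the treatment of a missing executable are assumptions.

```python
import os


def needs_recompile(stan_file: str, exe_file: str, force: bool = False) -> bool:
    """Return True when the executable should be rebuilt (hypothetical helper)."""
    # Assumption: a missing executable always triggers a build.
    if force or not os.path.exists(exe_file):
        return True
    # Skip the build only when the executable is strictly newer than the source.
    return os.path.getmtime(stan_file) >= os.path.getmtime(exe_file)
```

Because only the two files named above are compared, edits to files pulled in through #include would go unnoticed, which is the case the force argument is meant to cover.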
-
cpp_options
Options for the C++ compiler.
-
exe_file
¶ Full path to Stan exe file.
-
generate_quantities(data: Union[Dict, str] = None, mcmc_sample: Union[cmdstanpy.stanfit.CmdStanMCMC, List[str]] = None, seed: int = None, gq_output_dir: str = None) → cmdstanpy.stanfit.CmdStanGQ
Run CmdStan’s generate_quantities method, which runs the generated quantities block of a model given an existing sample.
This function takes a CmdStanMCMC object and the dataset used to generate that sample and calls the CmdStan generate_quantities method to generate additional quantities of interest.
The CmdStanGQ object records the command, the return code, and the paths to the generate method output csv and console files. The output files are written either to a specified output directory or to a temporary directory which is deleted upon session exit.
Output filenames correspond to the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’ plus the file suffix, which is either ‘.csv’ for the CmdStan output or ‘.txt’ for the console messages, e.g. ‘bernoulli-201912081451-1.csv’. Output files written to the temporary directory contain an additional 8-character random string, e.g. ‘bernoulli-201912081451-1-5nm6as7u.csv’.
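As a rough illustration of the filename template, the sketch below assembles a name from its parts. The function output_basename is hypothetical, and the alphabet used for the 8-character random string (lowercase letters and digits) is an assumption based on the example above.

```python
import random
import string
from datetime import datetime


def output_basename(model_name, chain_id, when, suffix=".csv", temp=False, rng=None):
    """Build '<model_name>-<YYYYMMDDHHMM>-<chain_id>' plus suffix (illustrative only)."""
    stamp = when.strftime("%Y%m%d%H%M")
    name = f"{model_name}-{stamp}-{chain_id}"
    if temp:
        # Assumption: the random string is drawn from lowercase letters and digits.
        rng = rng or random.Random()
        alphabet = string.ascii_lowercase + string.digits
        name += "-" + "".join(rng.choice(alphabet) for _ in range(8))
    return name + suffix


print(output_basename("bernoulli", 1, datetime(2019, 12, 8, 14, 51)))
# bernoulli-201912081451-1.csv
```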
Parameters: - data – Values for all data variables in the model, specified either as a dictionary with entries matching the data variables, or as the path of a data file in JSON or Rdump format.
- mcmc_sample – Can be either a CmdStanMCMC object returned by the sample method or a list of stan-csv files generated by fitting the model to the data using any Stan interface.
- seed – The seed for the random number generator. Must be an integer between 0 and 2^32 - 1. If unspecified, numpy.random.RandomState() is used to generate a seed which will be used for all chains. NOTE: Specifying the seed will guarantee the same result for multiple invocations of this method with the same inputs. However, this will not reproduce results from the sample method given the same inputs because the RNG will be in a different state.
- gq_output_dir – Name of the directory in which the CmdStan output files are saved. If unspecified, files will be written to a temporary directory which is deleted upon session exit.
Returns: CmdStanGQ object
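The seed rule stated in the parameter list (an integer between 0 and 2^32 - 1, generated at random when unspecified) might be checked as follows. This is a sketch under assumptions: resolve_seed is a hypothetical helper, both bounds are taken as inclusive, and the standard-library generator stands in for numpy.random.RandomState().

```python
import random

MAX_SEED = 2**32 - 1  # upper bound stated in the parameter description


def resolve_seed(seed=None, rng=None):
    """Return a valid seed, drawing one at random when none is given."""
    if seed is None:
        # Stand-in for numpy.random.RandomState(); any uniform integer source works here.
        rng = rng or random.Random()
        return rng.randint(0, MAX_SEED)
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise ValueError(f"seed must be an integer, found {seed!r}")
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and 2^32 - 1, found {seed}")
    return seed
```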
-
name
¶ Model name used in output filename templates. Default is basename of Stan program or exe file, unless specified in call to constructor via argument model_name.
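A minimal sketch of the default naming rule, assuming the name is the file's basename with its extension removed and that stan_file is consulted before exe_file when both are given. The helper default_model_name is hypothetical.

```python
import os


def default_model_name(stan_file=None, exe_file=None, model_name=None):
    """Pick the model name: explicit argument first, then a file basename."""
    if model_name is not None:
        return model_name
    # Assumption: the Stan program path takes precedence over the exe path.
    path = stan_file if stan_file is not None else exe_file
    if path is None:
        raise ValueError("one of stan_file, exe_file, or model_name is required")
    return os.path.splitext(os.path.basename(path))[0]
```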
-
optimize(data: Union[Dict, str] = None, seed: int = None, inits: Union[Dict, float, str] = None, output_dir: str = None, algorithm: str = None, init_alpha: float = None, iter: int = None) → cmdstanpy.stanfit.CmdStanMLE
Run the specified CmdStan optimize algorithm to produce a penalized maximum likelihood estimate of the model parameters.
This function validates the specified configuration, composes a call to the CmdStan optimize method, spawns one subprocess to run the optimizer, and waits for it to run to completion. Unspecified arguments are not included in the call to CmdStan, i.e., those arguments will have CmdStan default values.
The CmdStanMLE object records the command, the return code, and the paths to the optimize method output csv and console files. The output files are written either to a specified output directory or to a temporary directory which is deleted upon session exit.
Output filenames correspond to the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’ plus the file suffix, which is either ‘.csv’ for the CmdStan output or ‘.txt’ for the console messages, e.g. ‘bernoulli-201912081451-1.csv’. Output files written to the temporary directory contain an additional 8-character random string, e.g. ‘bernoulli-201912081451-1-5nm6as7u.csv’.
Parameters: - data – Values for all data variables in the model, specified either as a dictionary with entries matching the data variables, or as the path of a data file in JSON or Rdump format.
- seed – The seed for the random number generator. Must be an integer between 0 and 2^32 - 1. If unspecified, numpy.random.RandomState() is used to generate a seed.
- inits –
Specifies how the sampler initializes parameter values. Initialization is either uniform random on a range centered on 0, exactly 0, or a dictionary or file of initial values for some or all parameters in the model. The default initialization behavior will initialize all parameter values on range [-2, 2] on the unconstrained support. If the expected parameter values are too far from this range, this option may improve estimation. The following value types are allowed:
- Single number, n > 0 - initialization range is [-n, n].
- 0 - all parameters are initialized to 0.
- dictionary - pairs parameter name : initial value.
- string - pathname to a JSON or Rdump data file.
- output_dir – Name of the directory to which CmdStan output files are written. If unspecified, output files will be written to a temporary directory which is deleted upon session exit.
- algorithm – Algorithm to use. One of: ‘BFGS’, ‘LBFGS’, ‘Newton’
- init_alpha – Line search step size for first iteration
- iter – Total number of iterations
Returns: CmdStanMLE object
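The value types accepted by the inits argument can be summarized with a small dispatch function. This is a sketch only: describe_inits is a hypothetical helper that mirrors the list in the parameter description, and rejecting negative numbers is an assumption.

```python
from numbers import Real


def describe_inits(inits=None):
    """Map an inits value to the initialization behavior it selects."""
    if inits is None:
        return "uniform random on [-2, 2]"  # documented default range
    if isinstance(inits, bool):
        raise TypeError("inits must be a number, dict, or str")
    if isinstance(inits, Real):
        if inits < 0:
            raise ValueError("inits range must be non-negative")  # assumption
        if inits == 0:
            return "all parameters initialized to 0"
        return f"uniform random on [-{inits}, {inits}]"
    if isinstance(inits, dict):
        return "initial values taken from dictionary"
    if isinstance(inits, str):
        return f"initial values read from file {inits}"
    raise TypeError("inits must be a number, dict, or str")
```

All ranges refer to the unconstrained support, as noted in the parameter description.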
-
sample(data: Union[Dict, str] = None, chains: Optional[int] = None, parallel_chains: Optional[int] = None, threads_per_chain: Optional[int] = None, seed: Union[int, List[int]] = None, chain_ids: Union[int, List[int]] = None, inits: Union[Dict, float, str, List[str]] = None, iter_warmup: int = None, iter_sampling: int = None, save_warmup: bool = False, thin: int = None, max_treedepth: float = None, metric: Union[str, List[str]] = None, step_size: Union[float, List[float]] = None, adapt_engaged: bool = True, adapt_delta: float = None, adapt_init_phase: int = None, adapt_metric_window: int = None, adapt_step_size: int = None, fixed_param: bool = False, output_dir: str = None, save_diagnostics: bool = False, show_progress: Union[bool, str] = False, validate_csv: bool = True) → cmdstanpy.stanfit.CmdStanMCMC
Run one or more chains of the NUTS sampler to produce a set of draws from the posterior distribution of a model conditioned on some data.
This function validates the specified configuration, composes a call to the CmdStan sample method, spawns one subprocess per chain to run the sampler, and waits for all chains to run to completion. Unspecified arguments are not included in the call to CmdStan, i.e., those arguments will have CmdStan default values.
For each chain, the CmdStanMCMC object records the command, the return code, the sampler output file paths, and the corresponding console outputs, if any. The output files are written either to a specified output directory or to a temporary directory which is deleted upon session exit.
Output filenames correspond to the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’ plus the file suffix, which is either ‘.csv’ for the CmdStan output or ‘.txt’ for the console messages, e.g. ‘bernoulli-201912081451-1.csv’. Output files written to the temporary directory contain an additional 8-character random string, e.g. ‘bernoulli-201912081451-1-5nm6as7u.csv’.
Parameters: - data – Values for all data variables in the model, specified either as a dictionary with entries matching the data variables, or as the path of a data file in JSON or Rdump format.
- chains – Number of sampler chains, must be a positive integer.
- parallel_chains – Number of processes to run in parallel. Must be a positive integer. Defaults to multiprocessing.cpu_count().
- threads_per_chain – The number of threads to use in parallelized sections within an MCMC chain (e.g., when using the Stan functions reduce_sum() or map_rect()). This will only have an effect if the model was compiled with threading support. The total number of threads used will be parallel_chains * threads_per_chain.
- seed – The seed for the random number generator. Must be an integer between 0 and 2^32 - 1. If unspecified, numpy.random.RandomState() is used to generate a seed which will be used for all chains. When the same seed is used across all chains, the chain-id is used to advance the RNG to avoid dependent samples.
- chain_ids – The offset for the random number generator, either an integer or a list of unique per-chain offsets. If unspecified, chain ids are numbered sequentially starting from 1.
- inits –
Specifies how the sampler initializes parameter values. Initialization is either uniform random on a range centered on 0, exactly 0, or a dictionary or file of initial values for some or all parameters in the model. The default initialization behavior will initialize all parameter values on range [-2, 2] on the unconstrained support. If the expected parameter values are too far from this range, this option may improve adaptation. The following value types are allowed:
- Single number n > 0 - initialization range is [-n, n].
- 0 - all parameters are initialized to 0.
- dictionary - pairs parameter name : initial value.
- string - pathname to a JSON or Rdump data file.
- list of strings - per-chain pathname to data file.
- iter_warmup – Number of warmup iterations for each chain.
- iter_sampling – Number of draws from the posterior for each chain.
- save_warmup – When True, sampler saves warmup draws as part of the Stan csv output file.
- thin – Period between saved samples.
- max_treedepth – Maximum depth of trees evaluated by NUTS sampler per iteration.
- metric –
Specification of the mass matrix, either as a vector consisting of the diagonal elements of the covariance matrix (‘diag’ or ‘diag_e’) or the full covariance matrix (‘dense’ or ‘dense_e’).
If the value of the metric argument is a string other than ‘diag’, ‘diag_e’, ‘dense’, or ‘dense_e’, it must be a valid filepath to a JSON or Rdump file which contains an entry ‘inv_metric’ whose value is either the diagonal vector or the full covariance matrix.
If the value of the metric argument is a list of paths, its length must match the number of chains and all paths must be unique.
- step_size – Initial stepsize for HMC sampler. The value is either a single number or a list of numbers which will be used as the global or per-chain initial step size, respectively. The length of the list of step sizes must match the number of chains.
- adapt_engaged – When True, adapt stepsize and metric.
- adapt_delta – Adaptation target Metropolis acceptance rate. The default value is 0.8. Increasing this value, which must be strictly less than 1, causes adaptation to use smaller step sizes which improves the effective sample size, but may increase the time per iteration.
- adapt_init_phase – Iterations for initial phase of adaptation during which step size is adjusted so that the chain converges towards the typical set.
- adapt_metric_window – The second phase of adaptation tunes the metric and stepsize in a series of intervals. This parameter specifies the number of iterations used for the first tuning interval; window size increases for each subsequent interval.
- adapt_step_size – Number of iterations given over to adjusting the step size given the tuned metric during the final phase of adaptation.
- fixed_param – When True, call CmdStan with argument algorithm=fixed_param, which runs the sampler without updating the Markov Chain. The values of all parameters and transformed parameters are then constant across all draws, and only those values in the generated quantities block that are produced by RNG functions may change. This provides a way to use Stan programs to generate simulated data via the generated quantities block. This option must be used when the parameters block is empty. Default value is False.
- output_dir – Name of the directory to which CmdStan output files are written. If unspecified, output files will be written to a temporary directory which is deleted upon session exit.
- save_diagnostics – Whether or not to save diagnostics. If True, csv output files are written to an output file with filename template ‘<model_name>-<YYYYMMDDHHMM>-diagnostic-<chain_id>’, e.g. ‘bernoulli-201912081451-diagnostic-1.csv’.
- show_progress – Use tqdm progress bar to show sampling progress. If show_progress==’notebook’ use tqdm_notebook (needs nodejs for jupyter).
- validate_csv – If False, skip the scan of the sample csv output files. When the sample is large or disk I/O is slow, this will speed up processing. Default is True, i.e., sample csv files are scanned for completeness and consistency.
Returns: CmdStanMCMC object
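The per-chain rules in the parameter list (default chain ids starting from 1, a step size given either globally or per chain, and a thread total of parallel_chains * threads_per_chain) can be sketched as below. The helper resolve_chain_config is hypothetical, and treating an integer chain_ids value as the id of the first chain is an assumption.

```python
def resolve_chain_config(chains, chain_ids=None, step_size=None,
                         parallel_chains=1, threads_per_chain=1):
    """Expand per-chain settings (illustrative helper, not part of CmdStanPy)."""
    if chains < 1:
        raise ValueError("chains must be a positive integer")

    if chain_ids is None:
        ids = list(range(1, chains + 1))  # default: 1, 2, ..., chains
    elif isinstance(chain_ids, int):
        # Assumption: a single integer is the id of the first chain.
        ids = [chain_ids + i for i in range(chains)]
    else:
        ids = list(chain_ids)
        if len(ids) != chains or len(set(ids)) != len(ids):
            raise ValueError("chain_ids must be unique, one per chain")

    if step_size is None or isinstance(step_size, (int, float)):
        steps = [step_size] * chains  # one global value (None keeps the CmdStan default)
    else:
        steps = list(step_size)
        if len(steps) != chains:
            raise ValueError("step_size list must have one entry per chain")

    return {
        "chain_ids": ids,
        "step_size": steps,
        "total_threads": parallel_chains * threads_per_chain,
    }
```

For example, four chains with parallel_chains=2 and threads_per_chain=3 would use up to six threads at a time.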
-
stan_file
¶ Full path to Stan program file.
-
stanc_options
Options for the stanc compiler.
-
variational(data: Union[Dict, str] = None, seed: int = None, inits: float = None, output_dir: str = None, save_diagnostics: bool = False, algorithm: str = None, iter: int = None, grad_samples: int = None, elbo_samples: int = None, eta: numbers.Real = None, adapt_engaged: bool = True, adapt_iter: int = None, tol_rel_obj: numbers.Real = None, eval_elbo: int = None, output_samples: int = None, require_converged: bool = True) → cmdstanpy.stanfit.CmdStanVB
Run CmdStan’s variational inference algorithm to approximate the posterior distribution of the model conditioned on the data.
This function validates the specified configuration, composes a call to the CmdStan variational method, spawns one subprocess to run the optimizer, and waits for it to run to completion. Unspecified arguments are not included in the call to CmdStan, i.e., those arguments will have CmdStan default values.
The CmdStanVB object records the command, the return code, and the paths to the variational method output csv and console files. The output files are written either to a specified output directory or to a temporary directory which is deleted upon session exit.
Output filenames correspond to the template ‘<model_name>-<YYYYMMDDHHMM>-<chain_id>’ plus the file suffix, which is either ‘.csv’ for the CmdStan output or ‘.txt’ for the console messages, e.g. ‘bernoulli-201912081451-1.csv’. Output files written to the temporary directory contain an additional 8-character random string, e.g. ‘bernoulli-201912081451-1-5nm6as7u.csv’.
Parameters: - data – Values for all data variables in the model, specified either as a dictionary with entries matching the data variables, or as the path of a data file in JSON or Rdump format.
- seed – The seed for the random number generator. Must be an integer between 0 and 2^32 - 1. If unspecified, numpy.random.RandomState() is used to generate a seed which will be used for all chains.
- inits – Specifies how the sampler initializes parameter values. Initialization is uniform random on a range centered on 0 with default range of 2. Specifying a single number n > 0 changes the initialization range to [-n, n].
- output_dir – Name of the directory to which CmdStan output files are written. If unspecified, output files will be written to a temporary directory which is deleted upon session exit.
- save_diagnostics – Whether or not to save diagnostics. If True, csv output files are written to an output file with filename template ‘<model_name>-<YYYYMMDDHHMM>-diagnostic-<chain_id>’, e.g. ‘bernoulli-201912081451-diagnostic-1.csv’.
- algorithm – Algorithm to use. One of: ‘meanfield’, ‘fullrank’.
- iter – Maximum number of ADVI iterations.
- grad_samples – Number of MC draws for computing the gradient.
- elbo_samples – Number of MC draws for estimate of ELBO.
- eta – Stepsize scaling parameter.
- adapt_engaged – Whether eta adaptation is engaged.
- adapt_iter – Number of iterations for eta adaptation.
- tol_rel_obj – Relative tolerance parameter for convergence.
- eval_elbo – Number of iterations between ELBO evaluations.
- output_samples – Number of approximate posterior output draws to save.
- require_converged – Whether or not to raise an error if Stan reports that “The algorithm may not have converged”.
Returns: CmdStanVB object
CmdStanMCMC
-
class cmdstanpy.CmdStanMCMC(runset: cmdstanpy.stanfit.RunSet, validate_csv: bool = True, logger: logging.Logger = None)
Container for outputs from CmdStan sampler run.
-
chain_ids
¶ Chain ids.
-
chains
¶ Number of chains.
-
column_names
Names of all per-draw outputs: all sampler and model parameters and quantities of interest.
-
diagnose() → str
Run cmdstan/bin/diagnose over all output csv files. Returns output of diagnose (stdout/stderr).
The diagnose utility reads the outputs of all chains and checks for the following potential problems:
- Transitions that hit the maximum treedepth
- Divergent transitions
- Low E-BFMI values (sampler transitions HMC potential energy)
- Low effective sample sizes
- High R-hat values
-
draws(inc_warmup: bool = False) → numpy.ndarray
A 3-D numpy ndarray which contains all draws, from both warmup and sampling iterations, arranged as (draws, chains, columns) and stored column major, so that the values for each parameter are contiguous in memory; likewise, all draws from a chain are contiguous.
Parameters: inc_warmup – When True and the warmup draws are present in the output, i.e., the sampler was run with save_warmup=True, then the warmup draws are included. Default value is False.
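To see why a column-major (draws, chains, columns) layout keeps each chain's draws and each parameter's values contiguous, the flat memory offset can be worked out by hand. The function below is an illustrative calculation, not library code; the name column_major_offset and the small sizes are made up for the example.

```python
def column_major_offset(draw, chain, column, num_draws, num_chains):
    """Flat offset of element (draw, chain, column) when the first index varies fastest."""
    return draw + num_draws * (chain + num_chains * column)


num_draws, num_chains = 3, 2

# All draws of the second chain (index 1) for the first column sit side by side.
print([column_major_offset(d, 1, 0, num_draws, num_chains) for d in range(num_draws)])
# [3, 4, 5]

# The second column starts only after every draw of every chain of the first column.
print(column_major_offset(0, 0, 1, num_draws, num_chains))
# 6
```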
-
draws_as_dataframe(params: List[str] = None, inc_warmup: bool = False) → pandas.core.frame.DataFrame
Returns the assembled draws as a pandas DataFrame consisting of one column per parameter and one row per draw.
Parameters: - params – list of model parameter names.
- inc_warmup – When True and the warmup draws are present in the output, i.e., the sampler was run with save_warmup=True, then the warmup draws are included. Default value is False.
-
metric
¶ Metric used by sampler for each chain. When sampler algorithm ‘fixed_param’ is specified, metric is None.
-
metric_type
¶ Metric type used for adaptation, either ‘diag_e’ or ‘dense_e’. When sampler algorithm ‘fixed_param’ is specified, metric_type is None.
-
num_draws
¶ Number of draws per chain.
-
sample
¶ Deprecated - use method “draws()” instead.
-
sampler_diagnostics
() → Dict[source]¶ Returns the sampler diagnostics as a map from column name to draws X chains X 1 ndarray.
-
save_csvfiles
(dir: str = None) → None[source]¶ Move output csvfiles to specified directory. If files were written to the temporary session directory, clean filename. E.g., save ‘bernoulli-201912081451-1-5nm6as7u.csv’ as ‘bernoulli-201912081451-1.csv’.
Parameters: dir – directory path
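The filename cleanup described for save_csvfiles could look roughly like this. The regular expression is an assumption: it presumes the filename template given for the sampler outputs and a random string made of lowercase letters, digits, or underscores. The helper clean_filename is hypothetical.

```python
import re

# '<model_name>-<YYYYMMDDHHMM>-<chain_id>' followed by '-<8 random chars>' and a suffix.
_TEMP_NAME = re.compile(r"^(.+-\d{12}-\d+)-[a-z0-9_]{8}(\.\w+)$")


def clean_filename(name: str) -> str:
    """Drop the 8-character random string from a temporary output filename."""
    match = _TEMP_NAME.match(name)
    if match is None:
        return name  # already clean, or not a name this sketch recognizes
    return match.group(1) + match.group(2)


print(clean_filename("bernoulli-201912081451-1-5nm6as7u.csv"))
# bernoulli-201912081451-1.csv
```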
-
stan_variable(name: str) → numpy.ndarray
Return a new ndarray which contains the set of post-warmup draws for the named Stan program variable. Flattens the chains. The underlying draws are in chain order, i.e., for a sample consisting of N chains of M draws each, the first M array elements are from chain 1, the next M are from chain 2, and the last M elements are from chain N.
- If the variable is a scalar variable, this returns a 1-d array, length(draws X chains).
- If the variable is a vector, this is a 2-d array, shape ( draws X chains, len(vector))
- If the variable is a matrix, this is a 3-d array, shape ( draws X chains, matrix nrows, matrix ncols ).
- If the variable is an array with N dimensions, this is an N+1-d array, shape ( draws X chains, size(dim 1), … size(dim N)).
Parameters: name – variable name
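The chain-order flattening and the resulting array shapes can be illustrated without NumPy. Both helpers below are hypothetical; the dims convention (the integer 1 for a scalar, a list of sizes otherwise) is assumed to match the stan_variable_dims property.

```python
def flatten_chains(draws_by_chain):
    """Concatenate per-chain lists of draws: all of chain 1, then chain 2, and so on."""
    return [value for chain in draws_by_chain for value in chain]


def stan_variable_shape(num_draws, num_chains, dims):
    """Shape of the flattened array for a variable with the given dims."""
    total = num_draws * num_chains
    if isinstance(dims, int):  # scalar variable
        return (total,)
    return (total, *dims)      # vector, matrix, or N-dimensional array
```

With 1000 draws in each of 4 chains, a scalar gives shape (4000,), a vector[10] gives (4000, 10), and a 2 by 3 matrix gives (4000, 2, 3).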
-
stan_variable_dims
Dict mapping Stan program variable names to variable dimensions. Scalar types have int value ‘1’. Structured types have a list of dims, e.g., program variable vector[10] foo has entry ('foo', [10]).
-
stan_variables
() → Dict[source]¶ Return a dictionary of all Stan program variables. Creates copies of the data in the draws matrix.
-
stepsize
¶ Stepsize used by sampler for each chain. When sampler algorithm ‘fixed_param’ is specified, stepsize is None.
-
summary(percentiles: List[int] = None) → pandas.core.frame.DataFrame
Run cmdstan/bin/stansummary over all output csv files. Echo stansummary stdout/stderr to console. Assemble csv tempfile contents into a pandas DataFrame.
Parameters: percentiles – Ordered non-empty list of percentiles to report. Must be integers from (1, 99), inclusive.
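The constraints on percentiles can be expressed as a short check. This is a sketch under assumptions: validate_percentiles is a hypothetical helper, and "ordered" is read as strictly increasing.

```python
def validate_percentiles(percentiles):
    """Check that percentiles is a non-empty, increasing list of integers in 1..99."""
    if not percentiles:
        raise ValueError("percentiles must be a non-empty list")
    for p in percentiles:
        if isinstance(p, bool) or not isinstance(p, int) or not 1 <= p <= 99:
            raise ValueError(f"percentile must be an integer from 1 to 99, found {p!r}")
    # Assumption: "ordered" means strictly increasing.
    if any(a >= b for a, b in zip(percentiles, percentiles[1:])):
        raise ValueError("percentiles must be in increasing order")
    return list(percentiles)
```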
-
validate_csv_files
() → None[source]¶ Checks that csv output files for all chains are consistent. Populates attributes for draws, column_names, num_params, metric_type. Raises exception when inconsistencies detected.
-
warmup
¶ Deprecated - use “draws(inc_warmup=True)”
CmdStanMLE
-
class cmdstanpy.CmdStanMLE(runset: cmdstanpy.stanfit.RunSet)
Container for outputs from CmdStan optimization.
-
column_names
Names of estimated quantities, including the joint log probability and all parameters, transformed parameters, and generated quantities.
-
optimized_params_dict
¶ Returns optimized params as Dict.
-
optimized_params_np
¶ Returns optimized params as numpy array.
-
optimized_params_pd
¶ Returns optimized params as pandas DataFrame.
CmdStanGQ
-
class cmdstanpy.CmdStanGQ(runset: cmdstanpy.stanfit.RunSet, mcmc_sample: pandas.core.frame.DataFrame)
Container for outputs from CmdStan generate_quantities run.
-
chains
¶ Number of chains.
-
column_names
¶ Names of generated quantities of interest.
-
generated_quantities
A 2-D numpy ndarray which contains the generated quantities draws for all chains. The columns correspond to the generated quantities block variables and the rows correspond to the draws from all chains in flattened chain, draw order, i.e., the first M rows are the M draws of chain 1 and the last M rows are the M draws of chain N.
-
generated_quantities_pd
¶ Returns the generated quantities as a pandas DataFrame consisting of one column per quantity of interest and one row per draw.
-
sample_plus_quantities
Returns the column-wise concatenation of the input drawset with the generated quantities drawset. If there are duplicate columns in both the input and the generated quantities, the input column is dropped in favor of the recomputed values in the generated quantities drawset.
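The duplicate-column rule can be sketched with plain dictionaries standing in for the two drawsets. The function merge_drawsets and the column names are hypothetical; the real property works on pandas DataFrames.

```python
def merge_drawsets(sample_cols, gq_cols):
    """Column-wise concatenation; on a name clash the generated-quantities column wins."""
    merged = {name: col for name, col in sample_cols.items() if name not in gq_cols}
    merged.update(gq_cols)
    return merged


sample_cols = {"theta": [0.2, 0.3], "y_rep": [0, 1]}  # made-up input draws
gq_cols = {"y_rep": [1, 1]}                           # recomputed values
print(merge_drawsets(sample_cols, gq_cols))
# {'theta': [0.2, 0.3], 'y_rep': [1, 1]}
```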
CmdStanVB
-
class cmdstanpy.CmdStanVB(runset: cmdstanpy.stanfit.RunSet)
Container for outputs from CmdStan variational run.
-
column_names
¶ Names of information items returned by sampler for each draw. Includes approximation information and names of model parameters and computed quantities.
-
columns
¶ Total number of information items returned by sampler. Includes approximation information and names of model parameters and computed quantities.
-
save_csvfiles
(dir: str = None) → None[source]¶ Move output csvfiles to specified directory. If files were written to the temporary session directory, clean filename. E.g., save ‘bernoulli-201912081451-1-5nm6as7u.csv’ as ‘bernoulli-201912081451-1.csv’.
Parameters: dir – directory path
-
variational_params_dict
¶ Returns inferred parameter means as Dict.
-
variational_params_np
¶ Returns inferred parameter means as numpy array.
-
variational_params_pd
¶ Returns inferred parameter means as pandas DataFrame.
-
variational_sample
¶ Returns the set of approximate posterior output draws.
RunSet
-
class cmdstanpy.stanfit.RunSet(args: cmdstanpy.cmdstan_args.CmdStanArgs, chains: int = 4, chain_ids: List[int] = None, logger: logging.Logger = None)
Record of CmdStan run for a specified configuration and number of chains.
-
chain_ids
¶ Chain ids.
-
chains
¶ Number of chains.
-
cmds
¶ Per-chain call to CmdStan.
-
csv_files
¶ List of paths to CmdStan output files.
-
diagnostic_files
¶ List of paths to CmdStan diagnostic output files.
-
method
¶ Returns the CmdStan method used to generate this fit.
-
model
¶ Stan model name.
-
save_csvfiles
(dir: str = None) → None[source]¶ Moves csvfiles to specified directory.
Parameters: dir – directory path
-
stderr_files
¶ List of paths to CmdStan stderr transcripts.
-
stdout_files
¶ List of paths to CmdStan stdout transcripts.