Planning Simulation Code in R


When I’m presented with a problem that I think I can solve by simulation, I tend to experience strong, adrenaline-laden urges to jump right into coding and leave all the details to sort themselves out on the fly. Unfortunately, this is not a good idea (at least for me). The time an adequate simulation plan saves easily justifies at least identifying the simulation objectives, inputs, and outputs explicitly up-front. Since I write primarily in R, all but my hairiest simulation projects happen there. I’ve received relatively little advice about how to structure simulations. This is my own protocol, so I’ll take full blame for all its anomalies. It’s worked for me for a while.

Two minutes of planning
In a simulation intended to investigate the role age structure plays in epidemic processes for a particular pathogen, I might want to look at maximum epidemic size under varying age structures, transmission rates, infectious periods, and disease-induced mortality rates. In this case, my overall inputs would be the conditions I want to vary, plus any structural constraints (an observed social contact structure, a population growth rate, etc.) that I want fixed throughout all simulations. The output would be maximum epidemic size. I want to write my functions so that they take my input values as arguments and return my desired outputs.
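As a rough illustration of that arguments-in, output-out idea, here is a minimal sketch (the function name, arguments, and internal calculation are all hypothetical stand-ins, not a real epidemic model):

MaxEpidemicSizeFun <- function(age.structure, transmission.rate,
                               infectious.period, mortality.rate,
                               contact.matrix){
  # a real version would simulate the full epidemic process; this stand-in
  # just shows the varied and fixed inputs entering as arguments and a
  # single output (a "maximum epidemic size") coming back out
  exposures <- contact.matrix %*% age.structure
  infected  <- exposures * transmission.rate * infectious.period
  max(infected * (1 - mortality.rate))
}

MaxEpidemicSizeFun(age.structure     = c(100, 250, 150),
                   transmission.rate = 0.3,
                   infectious.period = 5,
                   mortality.rate    = 0.1,
                   contact.matrix    = diag(3))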

An overarching strategy
My sim projects tend to (sometimes inadvertently) follow an approximate Bayesian computation (ABC) approach. In the simplest case, I simulate a dataset under some process (for me, usually a disease invasion) with varying conditions (i.e., parameters). Next, I describe that particular simulated dataset with some predetermined summary statistics, and finally I compare the simulated statistics to the same statistics calculated from field data. I throw out parameters that produce simulations whose statistics don’t match the observed data sufficiently well.
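A toy version of that accept/reject step might look like the following (the observed statistic, tolerance, and stand-in simulator are invented purely for illustration):

# toy ABC rejection step: keep only the parameter draws whose simulated
# summary statistic lands within some tolerance of the observed statistic
observed.stat <- 42      # summary statistic calculated from field data (made up)
tolerance     <- 5       # how close counts as "sufficiently well"
n.draws       <- 1000

candidate.params <- runif(n.draws, min = 0, max = 1)   # draws from a flat prior
sim.stats <- sapply(candidate.params, function(p){
  # stand-in simulator; in practice this would be the full disease-invasion sim
  sum(rbinom(100, size = 1, prob = p))
})

accepted.params <- candidate.params[abs(sim.stats - observed.stat) <= tolerance]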

Implementation
I break this procedure down into as many bite-sized chunks as possible. This usually gives me at least the following chunks (each embodied in its own function).

Priors.fun
  Utility: determines the parameter values that will govern a given simulation
  Inputs: parameters for all prior distributions
  Outputs: realized parameter values for a given simulation

DataSim.fun
  Utility: simulates one dataset (for me, usually longitudinal or time series) under a given set of parameters
  Inputs: parameter values (user-specified or drawn from the priors)
  Outputs: either the entire simulated dataset, or a predetermined summary statistic

Batch.fun
  Utility: simulates replicate datasets under the same set of parameter values (i.e., replicates DataSim.fun)
  Inputs: number of replications, and the parameter values to be passed to DataSim.fun
  Outputs: a list or vector of length equal to the number of replicates, containing the summary measures for all replicate simulations

FullSim.fun
  Utility: links Priors.fun, DataSim.fun, and Batch.fun into a single package
  Inputs: parameters describing the priors (passed to Priors.fun), and the number of replications per parameter set (for Batch.fun)
  Outputs: a matrix containing prior values and the corresponding summary statistics

My preference is to write the DataSim function first. This is typically the hardest bit: it contains the nuts and bolts of the biological process you’re exploring. I find that I often identify overlooked aspects of the process as I write this function, so writing it first helps me clarify exactly how complex a problem I’m tackling. I often write smaller functions for critical and complicated steps in the data simulation (for me, these tend to be Disease Transmission, Survival, Recrudescence, etc.), so that I can deal with them in isolation (a toy decomposition of this sort is sketched below). The Priors function is probably the next most challenging, since it involves all the philosophical and scientific growing pains associated with identifying reasonable priors. Batching over a set of parameters may not be necessary; I sometimes use it to quantify how variable my summary measures are under identical parameters. I often split FullSim.fun up into a few batches to pass to clusters, though this will depend on simulation size.
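To make the helper-function idea concrete, here is one way that decomposition might look (TransmissionFun, SurvivalFun, and their internals are hypothetical stand-ins rather than a real disease model):

# hypothetical helpers, each handling one complicated step in isolation
TransmissionFun <- function(S, I, beta){
  # stand-in: new infections among S susceptibles exposed to I infecteds
  rbinom(1, size = S, prob = 1 - exp(-beta * I))
}

SurvivalFun <- function(I, mortality){
  # stand-in: infecteds surviving disease-induced mortality this time step
  rbinom(1, size = I, prob = 1 - mortality)
}

# a DataSim-style function that strings the helpers together through time
ToyDataSimFun <- function(beta, mortality, S0 = 99, I0 = 1, n.steps = 50){
  S <- S0; I <- I0; total.infected <- I0
  for(t in 1:n.steps){
    new.inf <- TransmissionFun(S, I, beta)
    S <- S - new.inf
    I <- SurvivalFun(I, mortality) + new.inf
    total.infected <- total.infected + new.inf
  }
  total.infected   # one possible summary measure: final epidemic size
}

ToyDataSimFun(beta = 0.05, mortality = 0.2)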

In general, I like to write all these functions in a file that contains the functions and nothing else (i.e., the “Source” file), and put everything else into another file (i.e., the “Run” file). More specs and suggestions for those files in the next post. Below is pseudo-code for a source file (set up in R).

PriorsFun <- function(hyperpriors){
  priors.out <- fun(hyperpriors)
    # fun(hyperpriors) will likely use random number generators like runif, rnorm, etc.
    # the hyperpriors are the arguments that the random number generators need
  return(priors.out)
}

DataSimFun <- function(priors.out){
  Data <- fun(priors.out)
    # fun(priors.out) is a placeholder; this step might be complicated
    # and include a loop through time
  summary.stat <- fun(Data)
  # return a list so downstream functions can pull out pieces by name
  return(list(Data = Data, summary.stat = summary.stat))
}

BatchFun <- function(priors.out, n.reps){
  summary.stat.list <- vector("list", n.reps)
  # storage object for recording each sim's output
  for(i in 1:n.reps){
    summary.stat.list[[i]] <- DataSimFun(priors.out)$summary.stat
    # this could also be done with "replicate" but I like for-loops in examples,
    # since they're easier for me to understand
  }
  return(summary.stat.list)
}

FullSimFun <- function(hyperpriors, n.reps, n.sims){
  param.list <- sim.out.list <- vector("list", n.sims)
  # storage objects for recording what parameters generated what summary measures
  for(i in 1:n.sims){
    param.list[[i]] <- PriorsFun(hyperpriors)
    sim.out.list[[i]] <- BatchFun(param.list[[i]], n.reps)
  }
  return(list(sim.out.list = sim.out.list, param.list = param.list))
}
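For completeness, here is a minimal sketch of what the matching “Run” file might look like, assuming the functions above are saved in a file called SimSource.R (the file name and input values are hypothetical):

# Run file: source the function definitions, set inputs, run, and save
source("SimSource.R")                    # hypothetical name for the Source file

hyperpriors <- list(min = 0, max = 1)    # made-up hyperprior specification
n.reps <- 10                             # replicate datasets per parameter draw
n.sims <- 100                            # number of parameter draws to evaluate

full.out <- FullSimFun(hyperpriors = hyperpriors, n.reps = n.reps, n.sims = n.sims)
saveRDS(full.out, file = "FullSimOutput.rds")   # keep results for post-processing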

 
