**Planning simulation code**

When I’m presented with a problem that I think I can solve by simulation, I tend to experience strong, adrenaline-laden urges to jump right into coding and leave all the details to sort themselves out on the fly. Unfortunately, this is not a good idea (at least for me). **The time saved by an adequate simulation plan easily justifies explicitly identifying the simulation’s objectives, inputs, and outputs up-front**. Since I write primarily in R, all but my hairiest simulation projects happen there. I’ve received relatively little advice about how to structure simulations. This is my own protocol, so I’ll take full blame for all its anomalies. It’s worked for me for a while.

__Two minutes of planning__

In a simulation intended to investigate the role age structure plays in epidemic processes for a particular pathogen, I might want to look at maximum epidemic size under varying age structures, transmission rates, infectious periods, and disease-induced mortality rates. In this case, my overall inputs would be the conditions I want to vary, plus any other structural constraints (like an observed social contact structure, or a population growth rate) that I want held fixed across all simulations. The output would be maximum epidemic size. I want to write my functions so that they take my input values as arguments and return my desired outputs.
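For concreteness, a toy sketch of that input-to-output mapping might look like the following. The function name, the crude final-size formula, and all values are illustrative only (and age structure is dropped for brevity); the point is just that varied conditions become arguments and the planned output is the return value.

```r
# Sketch only: the body is a toy placeholder, not a real epidemic model.
# Conditions I want to vary are arguments; fixed constraints get defaults.
MaxEpidemicSize <- function(beta, infectious.period, pop.size = 1000) {
  R0 <- beta * infectious.period     # basic reproduction number
  if (R0 <= 1) return(0)             # no epidemic below threshold
  # crude placeholder: fraction infected grows with R0
  round(pop.size * (1 - 1 / R0))
}

MaxEpidemicSize(beta = 0.5, infectious.period = 4)  # R0 = 2, returns 500
```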

__An overarching strategy__

My sim projects tend to (sometimes inadvertently) follow an approximate Bayesian computation (ABC) approach. In the simplest case, I simulate a dataset under some process — for me, it’s usually a disease invasion — with varying conditions (i.e., parameters). Next, I describe that particular simulated dataset with some predetermined summary statistics, and finally I compare the simulated statistics to the same statistics calculated from field data. I throw out parameters that produce simulations with statistics that don’t match the observed data sufficiently well.
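That loop can be sketched as a toy ABC rejection step. Everything here is a placeholder — the “process” is just normal draws, the summary statistic is the mean, and the observed value and tolerance are made-up numbers — but the draw/simulate/compare/keep structure is the same.

```r
# Toy ABC rejection sketch; all values are illustrative.
set.seed(1)
observed.stat <- 5        # summary statistic computed from "field data"
tolerance     <- 0.2
n.draws       <- 10000

mu.prior  <- runif(n.draws, 0, 10)   # draw candidate parameter values
sim.stats <- sapply(mu.prior, function(mu) mean(rnorm(20, mean = mu)))

# keep only parameters whose simulated statistic is close to the observed one
kept <- mu.prior[abs(sim.stats - observed.stat) < tolerance]
```

The retained values in `kept` then approximate the posterior for the parameter, concentrated around whatever value reproduces the observed statistic.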

__Implementation__

I break this procedure down into as many bite-sized chunks as possible. This usually gives me at least the following chunks (each embodied in its own function).

Chunk | Function Utility | Inputs | Outputs
---|---|---|---
Priors.fun | Determines the parameter values that will govern a given simulation | Parameters for all prior distributions | Realized parameter values for a given simulation
DataSim.fun | Simulates one dataset (for me, usually longitudinal or time series) under a given set of parameters | Parameter values (user-specified or drawn from priors) | Either the entire simulated dataset, or a predetermined summary statistic
Batch.fun | Simulates replicate datasets under the same set of parameter values (i.e., replicates DataSim.fun) | Number of replicates; parameter values to be passed to DataSim.fun | List or vector of length equal to the number of replicates, containing summary measures for all replicate simulations
FullSim.fun | Links Priors.fun, DataSim.fun, and Batch.fun into a single package | Parameters describing the priors (to be passed to Priors.fun); number of replicates per parameter set (for Batch.fun) | Matrix containing prior values and corresponding summary statistics

My preference is to write the DataSim function first. This is typically the hardest bit: it contains the nuts and bolts of the biological process you’re exploring. I find that I often identify overlooked aspects of the process as I write this function, so writing it first helps me clarify exactly how complex a problem I’m tackling. I often write smaller functions for critical and complicated steps in the data simulation (for me, these tend to be disease transmission, survival, recrudescence, etc.), so that I can deal with them in isolation. The Priors function is probably the next-most challenging, since it involves all the philosophical and scientific growing pains associated with identifying reasonable priors. Batching over a set of parameters may not be necessary; I sometimes use it to quantify how variable my summary measures are under identical parameters. I often split FullSim.fun up into a few batches to pass to clusters, though this will depend on simulation size.
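As an example of such a helper, a transmission step might be isolated like this. The function name and the simple exponential contact model are illustrative stand-ins, not the post’s actual model:

```r
# Hypothetical helper: one disease-transmission step, handled in isolation
# so it can be tested separately from the full data simulation.
TransmitFun <- function(n.susceptible, n.infected, beta) {
  # per-susceptible infection probability under density-dependent contact
  p.infect <- 1 - exp(-beta * n.infected)
  rbinom(1, size = n.susceptible, prob = p.infect)  # number of new infections
}
```

Keeping steps like this in their own functions makes each one easy to sanity-check in isolation (e.g., zero infected individuals should yield zero new infections) before it gets buried inside the full DataSim loop.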

In general, I like to write all these functions in a file that contains the functions and nothing else (i.e., the “Source” file), and put everything else into another file (i.e., the “Run” file). More specs and suggestions for those in the next post. Below is pseudo-code for a source file (set up in R).

```r
PriorsFun <- function(hyperpriors){
  # fun() is a placeholder; in practice this will use random number
  # generators like runif(), rnorm(), etc., and the hyperpriors are the
  # arguments those generators need
  priors.out <- fun(hyperpriors)
  return(priors.out)
}

DataSimFun <- function(priors.out){
  Data <- fun(priors.out) # might be complicated and include a loop through time
  summary.stat <- fun(Data)
  return(list(Data = Data, summary.stat = summary.stat))
}

BatchFun <- function(priors.out, n.reps){
  summary.stat.list <- vector("list", n.reps) # storage for each sim's output
  for(i in 1:n.reps){
    summary.stat.list[[i]] <- DataSimFun(priors.out)$summary.stat
    # this could also be done with replicate(), but I like for-loops in
    # examples, since they're easier for me to understand
  }
  return(summary.stat.list)
}

FullSimFun <- function(hyperpriors, n.reps, n.sims){
  # storage objects recording which parameters generated which summary measures
  param.list <- sim.out.list <- vector("list", n.sims)
  for(i in 1:n.sims){
    param.list[[i]] <- PriorsFun(hyperpriors)
    sim.out.list[[i]] <- BatchFun(param.list[[i]], n.reps)
  }
  return(list(sim.out.list = sim.out.list, param.list = param.list))
}
```