Estimates the mean and standard deviation of the serial interval distribution from outbreak data using the Expectation-Maximization (EM) algorithm developed by Vink et al. (2014). The serial interval is defined as the time between symptom onset in a primary case and symptom onset in a secondary case infected by that primary case.
Arguments
- dat
numeric vector; index case-to-case (ICC) intervals in days. These are calculated as the time difference between symptom onset in each case and symptom onset in the index case (case with earliest onset). Must contain at least 2 values. Values should be non-negative in most epidemiological contexts, though negative values are allowed for normal distribution
- n
integer; number of EM algorithm iterations to perform. More iterations generally improve convergence but increase computation time. Defaults to 50, which is typically sufficient for convergence
- dist
character; the assumed parametric family for the serial interval distribution. Must be either:
"normal"(default): Allows negative serial intervals, uses 7 mixture components"gamma": Restricts to positive serial intervals, uses 4 mixture components
- init
numeric vector of length 2; initial values for the mean and standard deviation to start the EM algorithm. If
NULL(default), uses the sample mean and sample standard deviation of the input data. Providing good initial values can improve convergence, especially for challenging datasets- tol
numeric; convergence tolerance for the EM algorithm. The algorithm stops early if the relative change in both mean and standard deviation between iterations is less than this value. Set to 0 to disable early stopping and always run all
niterations. Defaults to 1e-6- n_starts
integer; number of random restarts for the EM algorithm. The algorithm can converge to local optima, especially when initial values are far from the true parameters. Using multiple restarts with different starting points helps find the global optimum. The first restart uses the provided
initvalues (or data-derived values ifinit = NULL), and subsequent restarts use random starting points sampled from the data range. The result with the highest log-likelihood is returned. Defaults to 1 (no additional restarts)
Value
A named list containing:
mean: Estimated mean of the serial interval distribution (days)sd: Estimated standard deviation of the serial interval distribution (days)wts: Numeric vector of estimated component weights representing the probability that cases belong to each transmission route. Length depends on distribution choice (7 for normal, 4 for gamma)converged: Logical indicating whether the algorithm converged before reaching the maximum number of iterationsiterations: Integer indicating the number of iterations performedloglik: Log-likelihood of the fitted model (used for model comparison)n_restarts: Number of restarts performed
Details
The Vink method addresses the challenge that individual transmission pairs are typically unknown in outbreak investigations. Instead, it uses index case-to-case (ICC) intervals - the time differences between the case with earliest symptom onset (index case) and all other cases - to infer the underlying serial interval distribution through a mixture modeling approach.
Methodological Approach:
The method models ICC intervals as arising from a mixture of four transmission routes:
Co-Primary (CP): Cases infected simultaneously from the same source
Primary-Secondary (PS): Direct transmission from index case
Primary-Tertiary (PT): Second-generation transmission
Primary-Quaternary (PQ): Third-generation transmission
The EM algorithm iteratively:
E-step: Calculates the probability that each ICC interval belongs to each transmission route component
M-step: Updates the serial interval parameters (mean, standard deviation) and component weights based on these probabilities
Distribution Choice:
Normal distribution: Allows negative serial intervals (useful for modeling co-primary infections) and uses 7 components (positive and negative pairs for PS, PT, PQ routes plus CP)
Gamma distribution: Restricts to positive values only, uses 4 components (CP, PS, PT, PQ without negative pairs). Recommended when negative serial intervals are epidemiologically implausible
Key Assumptions:
The case with earliest symptom onset is the index case
Transmission occurs through at most 4 generations
Serial intervals follow the specified parametric distribution
Cases represent a single, homogeneously-mixing outbreak
Input Data Preparation:
To prepare ICC intervals from outbreak data:
Identify the case with the earliest symptom onset date (index case)
Calculate the time difference (in days) between each case's onset date and the index case onset date
The resulting values are the ICC intervals for input to this function
Convergence and Diagnostics:
The EM algorithm typically converges within 20-50 iterations. Users should:
Examine the fitted distribution using
plot_si_fitConsider alternative distribution choices if fit is poor
Try different initial values if results seem unreasonable
Ensure adequate sample size (generally >20 cases recommended)
References
Vink MA, Bootsma MCJ, Wallinga J (2014). Serial intervals of respiratory infectious diseases: A systematic review and analysis. American Journal of Epidemiology, 180(9), 865-875. doi:10.1093/aje/kwu209
See also
plot_si_fit for diagnostic visualization,
integrate_component for the underlying likelihood calculations
Examples
# Example 1:Basic usage with simulated data
set.seed(123)
simulated_icc <- c(
rep(1, 20), # Short intervals (co-primary cases)
rep(2, 25), # Medium intervals (primary-secondary)
rep(3, 15), # Longer intervals (higher generation)
rep(4, 8)
)
result <- si_estim(simulated_icc)
# \donttest{
# Example 2: Larger simulated outbreak, specifying distribution
large_icc <- c(
rep(1, 38), # Short intervals (co-primary cases)
rep(2, 39), #
rep(3, 30), # Medium intervals (primary-secondary)
rep(4, 17), #
rep(5, 7), # Longer intervals (higher generation)
rep(6, 4),
rep(7, 2)
)
result_normal <- si_estim(large_icc, dist = "normal")
result_gamma <- si_estim(large_icc, dist = "gamma")
# Example 3: Using custom initial values
result_custom <- si_estim(large_icc, dist = "normal", init = c(3.0, 1.5))
# Example 4: Specify iterations
result_iter <- si_estim(large_icc, n=100)
# Example 5: Check convergence status
result_conv <- si_estim(large_icc)
if (result_conv$converged) {
message("Converged in ", result_conv$iterations, " iterations")
}
#> Converged in 32 iterations
# Example 6: Use multiple restarts to avoid local optima
# Useful when initial values might be far from true parameters
result_restarts <- si_estim(large_icc, n_starts = 5)
message("Best result from ", result_restarts$n_restarts, " restarts")
#> Best result from 5 restarts
# }
