Skip to contents

Estimates the mean and standard deviation of the serial interval distribution from outbreak data using the Expectation-Maximization (EM) algorithm developed by Vink et al. (2014). The serial interval is defined as the time between symptom onset in a primary case and symptom onset in a secondary case infected by that primary case.

Usage

si_estim(dat, n = 50, dist = "normal", init = NULL, tol = 1e-06, n_starts = 1)

Arguments

dat

numeric vector; index case-to-case (ICC) intervals in days. These are calculated as the time difference between symptom onset in each case and symptom onset in the index case (case with earliest onset). Must contain at least 2 values. Values should be non-negative in most epidemiological contexts, though negative values are allowed for normal distribution

n

integer; number of EM algorithm iterations to perform. More iterations generally improve convergence but increase computation time. Defaults to 50, which is typically sufficient for convergence

dist

character; the assumed parametric family for the serial interval distribution. Must be either:

  • "normal" (default): Allows negative serial intervals, uses 7 mixture components

  • "gamma": Restricts to positive serial intervals, uses 4 mixture components

init

numeric vector of length 2; initial values for the mean and standard deviation to start the EM algorithm. If NULL (default), uses the sample mean and sample standard deviation of the input data. Providing good initial values can improve convergence, especially for challenging datasets

tol

numeric; convergence tolerance for the EM algorithm. The algorithm stops early if the relative change in both mean and standard deviation between iterations is less than this value. Set to 0 to disable early stopping and always run all n iterations. Defaults to 1e-6

n_starts

integer; number of random restarts for the EM algorithm. The algorithm can converge to local optima, especially when initial values are far from the true parameters. Using multiple restarts with different starting points helps find the global optimum. The first restart uses the provided init values (or data-derived values if init = NULL), and subsequent restarts use random starting points sampled from the data range. The result with the highest log-likelihood is returned. Defaults to 1 (no additional restarts)

Value

A named list containing:

  • mean: Estimated mean of the serial interval distribution (days)

  • sd: Estimated standard deviation of the serial interval distribution (days)

  • wts: Numeric vector of estimated component weights representing the probability that cases belong to each transmission route. Length depends on distribution choice (7 for normal, 4 for gamma)

  • converged: Logical indicating whether the algorithm converged before reaching the maximum number of iterations

  • iterations: Integer indicating the number of iterations performed

  • loglik: Log-likelihood of the fitted model (used for model comparison)

  • n_restarts: Number of restarts performed

Details

The Vink method addresses the challenge that individual transmission pairs are typically unknown in outbreak investigations. Instead, it uses index case-to-case (ICC) intervals - the time differences between the case with earliest symptom onset (index case) and all other cases - to infer the underlying serial interval distribution through a mixture modeling approach.

Methodological Approach:

The method models ICC intervals as arising from a mixture of four transmission routes:

  • Co-Primary (CP): Cases infected simultaneously from the same source

  • Primary-Secondary (PS): Direct transmission from index case

  • Primary-Tertiary (PT): Second-generation transmission

  • Primary-Quaternary (PQ): Third-generation transmission

The EM algorithm iteratively:

  1. E-step: Calculates the probability that each ICC interval belongs to each transmission route component

  2. M-step: Updates the serial interval parameters (mean, standard deviation) and component weights based on these probabilities

Distribution Choice:

  • Normal distribution: Allows negative serial intervals (useful for modeling co-primary infections) and uses 7 components (positive and negative pairs for PS, PT, PQ routes plus CP)

  • Gamma distribution: Restricts to positive values only, uses 4 components (CP, PS, PT, PQ without negative pairs). Recommended when negative serial intervals are epidemiologically implausible

Key Assumptions:

  • The case with earliest symptom onset is the index case

  • Transmission occurs through at most 4 generations

  • Serial intervals follow the specified parametric distribution

  • Cases represent a single, homogeneously-mixing outbreak

Input Data Preparation:

To prepare ICC intervals from outbreak data:

  1. Identify the case with the earliest symptom onset date (index case)

  2. Calculate the time difference (in days) between each case's onset date and the index case onset date

  3. The resulting values are the ICC intervals for input to this function

Convergence and Diagnostics:

The EM algorithm typically converges within 20-50 iterations. Users should:

  • Examine the fitted distribution using plot_si_fit

  • Consider alternative distribution choices if fit is poor

  • Try different initial values if results seem unreasonable

  • Ensure adequate sample size (generally >20 cases recommended)

References

Vink MA, Bootsma MCJ, Wallinga J (2014). Serial intervals of respiratory infectious diseases: A systematic review and analysis. American Journal of Epidemiology, 180(9), 865-875. doi:10.1093/aje/kwu209

See also

plot_si_fit for diagnostic visualization, integrate_component for the underlying likelihood calculations

Examples

# Example 1:Basic usage with simulated data
set.seed(123)
simulated_icc <- c(
  rep(1, 20),   # Short intervals (co-primary cases)
  rep(2, 25),   # Medium intervals (primary-secondary)
  rep(3, 15),   # Longer intervals (higher generation)
  rep(4, 8)
)

result <- si_estim(simulated_icc)

# \donttest{
# Example 2: Larger simulated outbreak, specifying distribution
large_icc <- c(
  rep(1, 38),   # Short intervals (co-primary cases)
  rep(2, 39),   #
  rep(3, 30),   # Medium intervals (primary-secondary)
  rep(4, 17),   #
  rep(5, 7),    # Longer intervals (higher generation)
  rep(6, 4),
  rep(7, 2)
)

result_normal <- si_estim(large_icc, dist = "normal")
result_gamma <- si_estim(large_icc, dist = "gamma")

# Example 3: Using custom initial values
result_custom <- si_estim(large_icc, dist = "normal", init = c(3.0, 1.5))

# Example 4: Specify iterations
result_iter <- si_estim(large_icc, n=100)

# Example 5: Check convergence status
result_conv <- si_estim(large_icc)
if (result_conv$converged) {
  message("Converged in ", result_conv$iterations, " iterations")
}
#> Converged in 32 iterations

# Example 6: Use multiple restarts to avoid local optima
# Useful when initial values might be far from true parameters
result_restarts <- si_estim(large_icc, n_starts = 5)
message("Best result from ", result_restarts$n_restarts, " restarts")
#> Best result from 5 restarts

# }