Predicts an outcome variable Y using an ensemble of machine learning algorithms combined with multiple sample splitting. This function implements the estimation strategy developed by Fava (2025), which improves statistical power by averaging predictions across M repetitions of K-fold cross-fitting.
Unlike ensemble_hte which estimates heterogeneous treatment effects,
this function performs standard prediction of Y given X without any treatment
variable or causal structure.
The function supports two interfaces:
Formula interface: Specify
formulaanddataMatrix interface: Specify
YandXdirectly
In the matrix interface, Y and X can be provided as column
name(s) instead of raw data. When Y is a single string and/or X
is a character vector, the corresponding columns are extracted from data.
Usage
ensemble_pred(
formula = NULL,
data = NULL,
Y = NULL,
X = NULL,
train_idx = NULL,
M = 2,
K = 3,
algorithms = c("lm", "grf"),
ensemble_folds = 5,
task_type = NULL,
scale_covariates = TRUE,
tune = FALSE,
tune_params = list(time = 30, cv_folds = 3, stagnation_iters = 250,
stagnation_threshold = 0.01, measure = NULL),
learner_params = NULL,
ensemble_strategy = c("cv", "average"),
individual_id = NULL,
n_cores = 1
)Arguments
- formula
A formula specifying the outcome and covariates (e.g.,
Y ~ X1 + X2orY ~ .). Use~ . - Zto exclude variables. Required ifYandXare not provided.- data
A data.frame or data.table containing the variables. Required for the formula interface or when
Y/Xare passed as column name(s).- Y
Numeric vector of outcomes, or a single string naming a column in
data. Use this withXas an alternative to the formula interface. Can contain NA values for observations wheretrain_idx = FALSE.- X
Matrix, data.frame, or character vector of column names in
data. Use this withYas an alternative to the formula interface. When a character vector is supplied (e.g.,X = c("age", "income")orX = microcredit_covariates), the corresponding columns are extracted fromdata.- train_idx
Optional logical or integer vector indicating which observations to use for training. If
NULL(default), all observations are used. If provided:Logical vector:
TRUEfor training observationsInteger vector: indices of training observations
Y must not have NA values for training observations. See Training on a Subset section for details on how this affects cross-fitting and predictions.
- M
Integer. Number of sample splitting repetitions (default: 2). Higher values improve stability but increase computation time.
- K
Integer. Number of cross-fitting folds within each repetition (default: 3). Each observation appears in exactly one test fold per repetition.
- algorithms
Character vector of ML algorithms to include in the ensemble. Default is
c("lm", "grf"). Algorithms can come from two sources:grf package: Use
"grf"for generalized random forest (viagrf::regression_forestorgrf::probability_forest)mlr3 learners: Any algorithm available in mlr3 or its extensions. Specify just the algorithm name without the task prefix (e.g., use
"ranger"not"regr.ranger"). The function will automatically add the appropriate prefix based on the task type. Common examples include:"lm": Linear regression"ranger": Random forest"glmnet": Elastic net regularization"xgboost": Gradient boosting"nnet": Neural network"kknn": K-nearest neighbors"svm": Support vector machine
To see all available learners, run
mlr3::mlr_learners$keys(). Additional learners may require installingmlr3learnersormlr3extralearnerspackages.
- ensemble_folds
Integer. Number of folds for cross-validated ensemble weight estimation (default: 5).
- task_type
Character. Type of prediction task:
"regr"for continuous outcomes or"classif"for binary outcomes. IfNULL(default), automatically detected from the outcome.- scale_covariates
Logical. Whether to standardize non-binary numeric covariates to mean 0 and standard deviation 1 before ML training (default:
TRUE). Binary variables (0/1) are not scaled. The original data is preserved in the returned object.- tune
Logical. Whether to perform hyperparameter tuning for ML algorithms (default:
FALSE). WhenTRUE, uses random search with early stopping.- tune_params
List of tuning parameters. Supports two modes:
- Simple mode (default)
A list of scalar parameters that configure the built-in auto-tuner:
time: Maximum tuning time in seconds (default: 30)cv_folds: Number of CV folds for tuning (default: 3)stagnation_iters: Stop if no improvement for this many iterations (default: 250)stagnation_threshold: Minimum improvement threshold (default: 0.01)measure: Performance measure string (default: R² for regression, AUC for classification)
- Advanced mode
Pass mlr3tuning objects directly for full control:
tuner: ATunerobject (e.g.,mlr3tuning::tnr("grid_search"))terminator: ATerminatorobject (e.g.,mlr3tuning::trm("evals", n_evals = 50))resampling: AResamplingobject (e.g.,mlr3::rsmp("holdout"))search_space: AParamSetor tuning space objectmeasure: AMeasureobject or string
- learner_params
Optional named list of algorithm-specific parameters for mlr3 learners. Each element name should match an algorithm in
algorithms, and the value should be a list of parameter-value pairs. This only affects mlr3-based algorithms; it is ignored for"grf"(which uses its own internal defaults).Example:
learner_params = list( ranger = list(num.trees = 1000, min.node.size = 5), glmnet = list(alpha = 0), # ridge regression xgboost = list(nrounds = 200, max_depth = 6) )Parameters are applied after algorithm-specific defaults and override them when there is a conflict. To see which parameters are available for a given learner, run
mlr3::lrn("regr.<algorithm>")$param_set.- ensemble_strategy
Character. Strategy for combining algorithm predictions:
"cv"(default): Cross-validated OLS regression of Y on algorithm predictions. Learns optimal weights for each algorithm."average": Simple average of all algorithm predictions. No weight learning; all algorithms contribute equally. Works with a single algorithm.
See Ensemble Strategy section for more details.
- individual_id
Required when the dataset is a panel (e.g., individuals observed over multiple time periods). Specifies the column that identifies individuals so that (1) all observations for the same individual are placed in the same cross-fitting fold, and (2) cluster-robust standard errors are used in all downstream analyses.
Example: for a panel of students observed across semesters, set
individual_id = student_id.Can be an unquoted column name, a quoted string (
"student_id"), or a vector of identifiers.- n_cores
Integer. Number of cores for parallel processing of repetitions. Default is 1 (sequential). Set to higher values to parallelize the M repetitions. Uses the
futureframework, so users can also set up their own parallel backend viafuture::plan()before calling this function.
Value
An object of class ensemble_pred_fit containing:
- predictions
data.table of predictions with M columns (one per repetition)
- call
The matched function call
- formula
The formula used (or constructed from Y/X)
- data
The original data (or constructed data.table from Y/X)
- Y
Vector of outcomes
- X
data.table of covariates (unscaled)
- train_idx
Logical vector indicating training observations
- splits
List of fold assignments for each repetition
- n
Number of observations
- n_train
Number of training observations
- M, K
Number of repetitions and folds
- algorithms
Algorithms used in ensemble
- ensemble_folds
Number of ensemble CV folds
- ensemble_strategy
Strategy used for combining predictions ("cv" or "average")
- task_type
Task type (regr or classif)
- scale_covariates
Whether covariates were scaled
- tune, tune_params
Tuning settings
- individual_id
Vector of individual identifiers (if panel data)
- n_cores
Number of cores used for parallel processing
Cross-Fitting Procedure
The estimation proceeds as follows:
The data is randomly split into \(K\) folds. This random splitting is repeated \(M\) times (each time with a fresh random partition).
For each repetition \(m = 1, \ldots, M\) and each fold \(k = 1, \ldots, K\):
Each ML algorithm in
algorithmsis trained on the \(K - 1\) folds that exclude fold \(k\).Out-of-sample predictions of Y are generated for all observations in fold \(k\).
The per-algorithm predictions are combined into a single ensemble prediction using the
ensemble_strategy(cross-validated OLS or simple average).This produces one complete vector of out-of-sample predictions per repetition (each observation appears in exactly one test fold per repetition).
The resulting \(M\) prediction vectors are stored and used by downstream
analysis functions (blp_pred, gavs,
gates, clan), which compute their estimands
separately for each repetition and then average the estimates and standard
errors across the \(M\) repetitions.
Training on a Subset
The train_idx parameter allows training models on a subset of observations
while generating predictions for all observations. This is useful when the outcome
Y is only observed for some units (e.g., only treated units in an experiment) but
you want predictions for everyone.
When train_idx is provided:
Y can have NA values for observations where
train_idx = FALSECross-fitting splits ALL observations into K folds, stratifying by
train_idxFor each fold k, models are trained on training observations NOT in fold k
Predictions are generated for ALL observations in fold k (both training and non-training)
Each observation gets exactly one prediction per repetition (from the model where they were in the test fold)
Ensemble weights are estimated using only training observations
Summary statistics are computed only on training observations
Ensemble Strategy
The ensemble combines predictions from multiple ML algorithms. The
ensemble_strategy parameter controls how predictions are combined:
"cv"(default): Uses cross-validated OLS regression of Y on the predicted values from each algorithm to learn optimal combination weights."average": Uses simple averaging across all algorithm predictions with equal weights. This is useful when you want a simpler ensemble or when using only a single algorithm.
References
Fava, B. (2025). Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators. arXiv preprint arXiv:2511.04957.
See also
gavs for Group Averages analysis,
blp_pred for Best Linear Predictor analysis,
clan for Classification Analysis,
ensemble_hte for heterogeneous treatment effect estimation
Examples
# \donttest{
# --- Predict bank profits using the microcredit data ---
# Bank profits are only observed for borrowers (loan_size > 0 & treat == 1).
# Use train_idx to train on borrowers, predict for the full sample.
# (Athey, Fava, Karlan, Osman & Zinman, 2025)
data(microcredit)
# Subset of covariates for speed (full set: microcredit_covariates)
covars <- c("age", "gender", "education", "hhinc_yrly_base",
"css_creditscorefinal")
# Include hhinc_yrly_end and treat for downstream gates() analysis
dat <- microcredit[, c("bank_profits_pp", "hhinc_yrly_end", "treat",
"prop_score", covars)]
f <- as.formula(paste("bank_profits_pp ~", paste(covars, collapse = " + ")))
fit <- ensemble_pred(
f, data = dat,
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
algorithms = c("lm", "grf"), M = 3, K = 3
)
print(fit)
#> Ensemble Prediction Fit
#> =======================
#>
#> Call:
#> ensemble_pred(formula = f, data = dat, train_idx = microcredit$loan_size >
#> 0 & microcredit$treat == 1, M = 3, K = 3, algorithms = c("lm",
#> "grf"))
#>
#> Data:
#> Observations: 1113
#> Training obs: 650 (subset)
#> Outcome: bank_profits_pp
#> Covariates: 5
#>
#> Model specification:
#> Algorithms: lm, grf
#> Task type: regression (continuous outcome)
#>
#> Split-sample parameters:
#> Repetitions (M): 3
#> Folds (K): 3
#> Ensemble strategy: cross-validated OLS
#> Ensemble folds: 5
#> Covariate scaling: enabled
#> Hyperparameter tuning: disabled
summary(fit)
#> Ensemble Prediction Summary
#> ===========================
#>
#> Call:
#> ensemble_pred(formula = f, data = dat, train_idx = microcredit$loan_size >
#> 0 & microcredit$treat == 1, M = 3, K = 3, algorithms = c("lm",
#> "grf"))
#>
#> Outcome: bank_profits_pp
#> Observations: 1113
#> Training obs: 650
#> Repetitions: 3
#>
#> Prediction Accuracy (averaged across 3 repetitions):
#> R-squared: 0.19
#> RMSE: 15.39
#> MAE: 11.08
#> Correlation: 0.44
#>
#> Best Linear Predictor (BLP):
#> intercept: -0.00 (SE: 0.60, p: 0.995)
#> slope: 0.98 (SE: 0.07, p: 0.000) ***
#> -> Intercept close to 0 and slope close to 1 indicate good calibration
#> Note: ML model trained on 650 of 1113 observations; GAVS evaluated on 650 observations. Use subset = "all" to use all observations.
#>
#> Group Averages (GAVS) with 3 groups:
#> Group Estimate Std.Error Pr(>|t|)
#> --------------------------------------------
#> 1 -9.24 1.32 0.000 ***
#> 2 -0.62 0.94 0.511
#> 3 7.65 0.99 0.000 ***
#>
#> Top - Bottom: 16.90 (SE: 1.65, p: 0.000) ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Can we predict bank profits? (blp_pred and gavs use training obs by default)
blp_pred(fit)
#> BLP Results (Best Linear Predictor - Prediction)
#> =================================================
#>
#> Fit type: Prediction (ensemble_pred)
#> Outcome analyzed: bank_profits_pp
#> Observations used: 650
#> Repetitions: 3
#>
#> Coefficients:
#> intercept: Regression intercept (0 = well-calibrated)
#> beta: Prediction loading (1 = well-calibrated)
#>
#> Term Estimate Std.Error t value Pr(>|t|)
#> ----------------------------------------------------
#> intercept -0.00 0.60 -0.01 0.995
#> beta 0.98 0.07 13.01 0.000 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gavs(fit, n_groups = 3)
#> Note: ML model trained on 650 of 1113 observations; GAVS evaluated on 650 observations. Use subset = "all" to use all observations.
#> GAVS Results (Group Averages)
#> =============================
#>
#> Outcome analyzed: bank_profits_pp
#> Number of groups: 3
#> Repetitions: 3
#>
#> Data usage:
#> ML trained on: 650 of 1113 obs
#> Analysis evaluated on: 650 of 1113 obs
#> Groups (3) formed on: all observations (1113 obs)
#>
#> Group Average Outcomes (groups by predicted Y):
#>
#> Group Estimate Std.Error t value Pr(>|t|)
#> ----------------------------------------------------
#> 1 -9.24 1.32 -6.99 0.000 ***
#> 2 -0.62 0.94 -0.66 0.511
#> 3 7.65 0.99 7.73 0.000 ***
#>
#> Heterogeneity Tests:
#> ----------------------------------------------------
#> Test Estimate Std.Error t value Pr(>|t|)
#> ----------------------------------------------------
#> Top-Bottom 16.90 1.65 10.23 0.000 ***
#> Top-All 7.66 0.80 9.57 0.000 ***
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Do predicted profits correlate with treatment effects on household income?
gates(fit, outcome = "hhinc_yrly_end", treatment = "treat",
prop_score = dat$prop_score, subset = "all", n_groups = 3)
#> Note: ML model trained on 650 of 1113 observations; GATES evaluated on 1113 observations. Use subset = "train" to restrict to training observations.
#> GATES Results
#> =============
#>
#> Fit type: Prediction (ensemble_pred)
#> Outcome analyzed: hhinc_yrly_end
#> (Groups based on: bank_profits_pp predicted Y)
#> Number of groups: 3
#> Repetitions: 3
#>
#> Data usage:
#> ML trained on: 650 of 1113 obs
#> Analysis evaluated on: 1113 of 1113 obs
#> Groups (3) formed on: all observations (1113 obs)
#>
#> Group Average Treatment Effects:
#>
#> Group Estimate Std.Error t value Pr(>|t|)
#> ----------------------------------------------------
#> 1 -418.75 2148.27 -0.19 0.845
#> 2 883.33 1568.04 0.56 0.573
#> 3 3990.92 4364.06 0.91 0.360
#>
#> Heterogeneity Tests:
#> ----------------------------------------------------
#> Test Estimate Std.Error t value Pr(>|t|)
#> ----------------------------------------------------
#> Top-Bottom 4409.67 4863.93 0.91 0.365
#> Top-All 2506.92 3041.13 0.82 0.410
#>
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# }
if (FALSE) { # \dontrun{
# --- Additional interface examples ---
# Matrix interface
fit <- ensemble_pred(
Y = microcredit$bank_profits_pp,
X = microcredit[, covars],
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
algorithms = c("lm", "ranger"), M = 5, K = 5
)
# Using all covariates from Athey et al. (2025)
dat_full <- microcredit[, c("bank_profits_pp", microcredit_covariates)]
fit_full <- ensemble_pred(
bank_profits_pp ~ ., data = dat_full,
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
algorithms = c("lm", "grf"), M = 5, K = 5
)
# Column-name interface (equivalent, no need to subset data)
fit_cov <- ensemble_pred(
Y = "bank_profits_pp",
X = microcredit_covariates,
data = microcredit,
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
algorithms = c("lm", "grf"), M = 5, K = 5
)
# With parallel processing (4 cores)
fit <- ensemble_pred(
bank_profits_pp ~ ., data = dat,
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
M = 10, K = 5, n_cores = 4
)
# With algorithm-specific learner parameters (mlr3 algorithms only)
fit <- ensemble_pred(
bank_profits_pp ~ ., data = dat,
train_idx = microcredit$loan_size > 0 & microcredit$treat == 1,
algorithms = c("ranger", "glmnet", "lm"),
learner_params = list(
ranger = list(num.trees = 1000, min.node.size = 5),
glmnet = list(alpha = 0) # ridge regression
)
)
# Panel data: individuals observed across multiple time periods
panel_data <- data.frame(
id = rep(1:100, each = 3),
time = rep(1:3, 100),
Y = rnorm(300),
X1 = rnorm(300),
X2 = rnorm(300)
)
fit_panel <- ensemble_pred(
formula = Y ~ X1 + X2,
data = panel_data,
individual_id = id,
algorithms = c("lm", "grf"),
M = 5, K = 3
)
gavs(fit_panel, n_groups = 3)
blp_pred(fit_panel)
} # }