dml.Rd
Estimates a target parameter of interest, such as the average treatment effect (ATE), using Debiased Machine Learning (DML).
The function dml_gate
is a convenience function that adds groups to a dml
object after the model has been fit.
dml(
y,
d,
x,
model = c("plm", "npm"),
target = "ate",
groups = NULL,
cf.folds = 5,
cf.reps = 1,
ps.trim = 0.01,
reg = "ranger",
yreg = reg,
dreg = reg,
dirty.tuning = TRUE,
save.models = FALSE,
y.class = FALSE,
d.class = FALSE,
verbose = TRUE,
warnings = FALSE
)
dml_gate(dml.fit, groups, ...)
numeric
vector with the outcome.
numeric
vector with the treatment. If the treatment is binary, it must be encoded as: zero = absence of treatment, one = presence of treatment.
numeric
vector or matrix
with covariates. We suggest constructing x
using model.matrix
.
specifies the model. Currently available options are plm
for a partially linear model, and npm
for a fully nonparametric model.
specifies the target causal quantity of interest. The currently available option is ate
(ATE, average treatment effect). Note that for the partially linear model with a continuous treatment, the ATE also equals the average causal derivative (ACD). For the nonparametric model, the ATE is only available for binary treatments. Other options (e.g., ACD for the nonparametric model, ATT) will be available soon.
a factor
or numeric
vector indicating group membership. Groups must be a deterministic function of x
.
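As a minimal illustration of the deterministic requirement (base R only; the toy matrix and the binary covariate `marr` below are illustrative, not taken from the package), a valid groups argument is any factor computed directly from the columns of x:

```r
# Toy covariate matrix; 'marr' is a hypothetical binary covariate.
x <- cbind(inc = c(10, 20, 30, 40), marr = c(0, 1, 1, 0))

# A valid `groups` value: a factor that depends only on x.
g <- factor(ifelse(x[, "marr"] == 1, "married", "single"))
table(g)
#> g
#> married  single 
#>       2       2
```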
number of cross-fitting folds. Default is 5
.
number of cross-fitting repetitions. Default is 1
.
trims propensity scores lower than ps.trim
and greater than 1-ps.trim
, in order to obtain more stable estimates. This is only relevant for the case of a binary treatment.
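One common way to implement such trimming is to clip (winsorize) the estimated scores at the bounds; whether dml clips or instead drops trimmed observations is an internal detail, so the base R sketch below is only illustrative:

```r
# Illustrative propensity-score trimming at ps.trim = 0.01 (base R):
# scores below ps.trim or above 1 - ps.trim are clipped to those bounds.
ps.trim <- 0.01
ps <- c(0.001, 0.50, 0.999)  # toy propensity scores
ps.clipped <- pmin(pmax(ps, ps.trim), 1 - ps.trim)
ps.clipped
#> [1] 0.01 0.50 0.99
```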
details of the machine learning method used for estimating the nuisance parameters (e.g., the regression functions of the treatment and the outcome). Currently, this should be specified using the same arguments as caret
's train
function. The default is random forest using ranger
. The default method is fast and usually works well for many applications.
same as reg
, but specifies arguments for the outcome regression alone. Default is the same value of reg
.
same as reg
, but specifies arguments for the treatment regression alone. Default is the same value of reg
.
should the tuning of the machine learning method happen within each cross-fitting fold ("clean"), or using all the data ("dirty")? Default is dirty tuning (dirty.tuning = TRUE
). As long as the number of candidate tuning parameters is not too big, dirty tuning is faster and should not affect the asymptotic guarantees of DML.
should the fitted models of each repetition be saved? Default is FALSE
. Note that setting this to TRUE could use a lot of memory.
when y
is binary, should the outcome regression be treated as a classification problem? Default is FALSE
. Note that for DML we need the class probabilities, and regression gives us that. If you change to classification, you need to make sure the method outputs class probabilities.
when d
is binary, should the treatment regression be treated as a classification problem? Default is FALSE
. Note that for DML we need the class probabilities, and regression gives us that. If you change to classification, you need to make sure the method outputs class probabilities.
if TRUE
(default) prints steps of the fitting procedure.
should caret
's warnings be printed? Default is FALSE
. Note that caret
has many inconsistent and unnecessary warnings.
An object of class dml
with the results of the DML procedure. The object is a list
containing:
data
A list
with the data used.
call
The original call used to fit the model.
info
A list
with general information and arguments of the DML fitting procedure.
fits
A list
with the predictions of each repetition.
results
A list
with the results (influence functions and estimates) for each repetition.
coefs
A list
with the estimates and standard errors for each repetition.
Chernozhukov, V., Cinelli, C., Newey, W., Sharma A., and Syrgkanis, V. (2021). "Long Story Short: Omitted Variable Bias in Causal Machine Learning."
# loads package
library(dml.sensemakr)
## loads data
data("pension")
# set the outcome
y <- pension$net_tfa # net total financial assets
# set the treatment
d <- pension$e401 # 401K eligibility
# set the covariates (a matrix)
x <- model.matrix(~ -1 + age + inc + educ + fsize + marr + twoearn + pira + hown, data = pension)
## compute income quartiles for group ATE.
g1 <- cut(x[,"inc"], quantile(x[,"inc"], c(0, 0.25,.5,.75,1), na.rm = TRUE),
          labels = c("q1", "q2", "q3", "q4"), include.lowest = TRUE)
# run DML (nonparametric model)
## 2 folds (change as needed)
## 1 repetition (change as needed)
dml.401k <- dml(y, d, x, model = "npm", groups = g1, cf.folds = 2, cf.reps = 1)
#> Debiased Machine Learning
#>
#> Model: Nonparametric
#> Target: ate
#> Cross-Fitting: 2 folds, 1 reps
#> ML Method: outcome (ranger), treatment (ranger)
#> Tuning: dirty
#>
#>
#> ====================================
#> Tuning parameters using all the data
#> ====================================
#>
#> - Tuning Model for D.
#> -- Best Tune:
#> mtry min.node.size splitrule
#> 1 2 5 variance
#>
#> - Tuning Model for Y (non-parametric).
#> -- Best Tune:
#> mtry min.node.size splitrule
#> 1 3 5 variance
#>
#>
#> ======================================
#> Repeating 2-fold cross-fitting 1 times
#> ======================================
#>
#> -- Rep 1 -- Folds: 1 2
#>
# summary of results with median method (default)
summary(dml.401k, combine.method = "median")
#>
#> Debiased Machine Learning
#>
#> Model: Nonparametric
#> Cross-Fitting: 2 folds, 1 reps
#> ML Method: outcome (ranger, R2 = 26.059%), treatment (ranger, R2 = 10.467%)
#> Tuning: dirty
#>
#> Average Treatment Effect:
#>
#> Estimate Std. Error t value P(>|t|)
#> ate.all 7474 1239 6.03 1.63e-09 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Group Average Treatment Effect:
#>
#> Estimate Std. Error t value P(>|t|)
#> gate.q1 3862.5 835.7 4.622 3.8e-06 ***
#> gate.q2 2651.4 1359.9 1.950 0.051207 .
#> gate.q3 6772.3 1887.6 3.588 0.000334 ***
#> gate.q4 16608.8 4291.6 3.870 0.000109 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Note: DML estimates combined using the median method.
#>
#> Verbal interpretation of DML procedure:
#>
#> -- Average treatment effects were estimated using DML with 2-fold cross-fitting. In order to reduce the variance that stems from sample splitting, we repeated the procedure 1 times. Estimates are combined using the median as the final estimate, incorporating variation across experiments into the standard error as described in Chernozhukov et al. (2018). The outcome regression uses Random Forest from the R package ranger; the treatment regression uses Random Forest from the R package ranger.
# coef median method (default)
coef(dml.401k, combine.method = "median")
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 7474.047 3862.492 2651.363 6772.252 16608.821
# se median method (default)
se(dml.401k, combine.method = "median")
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 1239.3764 835.6746 1359.8547 1887.6231 4291.5565
# confint median method
confint(dml.401k, combine.method = "median")
#> 2.5 % 97.5 %
#> ate.all 5044.91384 9903.180
#> gate.q1 2224.60031 5500.384
#> gate.q2 -13.90369 5316.629
#> gate.q3 3072.57856 10471.925
#> gate.q4 8197.52458 25020.117
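These are standard normal-approximation intervals, so they can be reproduced from the output of coef() and se(); a base R check using the printed ate.all values:

```r
# Reproduce the 95% CI for ate.all from the printed estimate and SE.
est <- 7474.047    # coef(dml.401k, combine.method = "median")["ate.all"]
se  <- 1239.3764   # se(dml.401k, combine.method = "median")["ate.all"]
ci  <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 2)
#> [1] 5044.91 9903.18
```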
## Sensitivity Analysis
### Robustness Values
robustness_value(dml.401k, alpha = 0.05)
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 4.157224e-02 6.083497e-02 6.751708e-09 3.543867e-02 4.887655e-02
### Confidence Bounds
confidence_bounds(dml.401k, cf.y = 0.03, cf.d = 0.04, level = 0.95)
#> Error in dml_bounds(model, cf.y = cf.y, cf.d = cf.d, rho2 = rho2): argument "cf.y" is missing, with no default
### Contour Plots
ovb_contour_plot(dml.401k, cf.y = 0.03, cf.d = 0.04,
bound.label = "Max Match (3x years)")
#> Error in ovb_contour_plot.dml(dml.401k, cf.y = 0.03, cf.d = 0.04, bound.label = "Max Match (3x years)"): unused arguments (cf.y = 0.03, cf.d = 0.04)