Estimates a target parameter of interest, such as the average treatment effect (ATE), using Debiased Machine Learning (DML).
The function dml_gate is a convenience function that adds groups (and the corresponding group average treatment effects, GATEs) to a dml object after the model has been fit.

Usage

dml(
  y,
  d,
  x,
  model = c("plm", "npm"),
  target = "ate",
  groups = NULL,
  cf.folds = 5,
  cf.reps = 1,
  cf.seed = NULL,
  ps.trim = 0.01,
  reg = "ranger",
  yreg = reg,
  dreg = reg,
  dirty.tuning = TRUE,
  save.models = FALSE,
  y.class = FALSE,
  d.class = FALSE,
  verbose = TRUE,
  warnings = FALSE
)

dml_gate(dml.fit, groups, ...)

Arguments

y: numeric vector with the outcome.
d: numeric vector with the treatment. If the treatment is binary, it must be encoded as: zero = absence of treatment, one = presence of treatment.
x: numeric vector or matrix with covariates. We suggest constructing x using model.matrix.
model: specifies the model. Currently available options are plm for a partially linear model and npm for a fully nonparametric model.
target: specifies the target causal quantity of interest. Available options are ate (ATE, average treatment effect), att (ATT, average treatment effect on the treated), and atu (ATU, average treatment effect on the untreated). Note that for the partially linear model with a continuous treatment the ATE also equals the average causal derivative (ACD). For the nonparametric model, these targets are only available for binary treatments.
groups: a factor or numeric vector indicating group membership. Groups must be a deterministic function of x.
cf.folds: number of cross-fitting folds. Default is 5.
cf.reps: number of cross-fitting repetitions. Default is 1.
cf.seed: optional integer. A random seed for reproducibility of fold assignments.
ps.trim: trims propensity scores lower than ps.trim and greater than 1 - ps.trim, in order to obtain more stable estimates. Alternatively, a named list with elements lower and upper specifying the lower and upper bounds for trimming. Only relevant for a binary treatment with model = "npm".
reg: details of the machine learning method used to estimate the nuisance parameters (e.g., the regression functions of the treatment and the outcome). Currently, this should be specified using the same arguments as caret's train function; see the sketch after this list. The default is random forest via ranger, which is fast and usually works well in many applications.
yreg: same as reg, but specifies arguments for the outcome regression alone. Default is the same value as reg. Alternatively, a named list with elements yreg0 and yreg1 specifying separate methods for each.
dreg: same as reg, but specifies arguments for the treatment regression alone. Default is the same value as reg.
dirty.tuning: should the tuning of the machine learning method happen within each cross-fitting fold ("clean"), or using all the data ("dirty")? Default is dirty tuning (dirty.tuning = TRUE). As long as the grid of candidate tuning parameters is not too large, dirty tuning is faster and should not affect the asymptotic guarantees of DML.
save.models: should the fitted models of each repetition be saved? Default is FALSE. Note that setting this to TRUE could use a lot of memory.
y.class: when y is binary, should the outcome regression be treated as a classification problem? Default is FALSE. Note that DML needs class probabilities, and regression provides them directly; if you switch to classification, make sure the method outputs class probabilities.
d.class: when d is binary, should the treatment regression be treated as a classification problem? Default is FALSE. Note that DML needs class probabilities, and regression provides them directly; if you switch to classification, make sure the method outputs class probabilities.
verbose: if TRUE (default), prints the steps of the fitting procedure.
warnings: should caret's warnings be printed? Default is FALSE. Note that caret emits many inconsistent and unnecessary warnings.
dml.fit: an object of class dml.
...: arguments passed to other methods.
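As a concrete illustration of reg, yreg, and ps.trim, here is a minimal sketch, with y, d, and x as constructed in the examples below. It assumes reg-type arguments accept a caret method name (as the "ranger" default suggests); the learner choices ("gbm") and trimming bounds are illustrative, not package defaults.

# minimal sketch: customizing the nuisance learners and trimming
# (illustrative choices, not defaults; any caret::train method applies)
fit.custom <- dml(
  y, d, x, model = "npm",
  dreg    = "gbm",                                  # boosting for the treatment regression
  yreg    = list(yreg0 = "ranger", yreg1 = "gbm"),  # separate methods for the two outcome regressions
  ps.trim = list(lower = 0.02, upper = 0.98)        # trim propensity scores below 0.02 and above 0.98
)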

Value

An object of class dml with the results of the DML procedure. The object is a list containing:

data: A list with the data used.
call: The original call used to fit the model.
info: A list with general information and arguments of the DML fitting procedure.
fits: A list with the predictions of each repetition.
results: A list with the results (influence functions and estimates) for each repetition.
coefs: A list with the estimates and standard errors for each repetition.
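For instance, after running the examples below, these components can be inspected directly; a minimal sketch:

# inspect the pieces of a fitted dml object (dml.401k from the examples below)
names(dml.401k)                     # includes "data", "call", "info", "fits", "results", "coefs"
str(dml.401k$coefs, max.level = 1)  # estimates and standard errors per repetition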

References

Chernozhukov, V., Cinelli, C., Newey, W., Sharma, A., and Syrgkanis, V. (2026). "Long Story Short: Omitted Variable Bias in Causal Machine Learning." Review of Economics and Statistics. doi:10.1162/REST.a.1705

Examples
# loads package
library(dml.sensemakr)
## loads data
data("pension")
# set the outcome
y <- pension$net_tfa # net total financial assets
# set the treatment
d <- pension$e401 # 401K eligibility
# set the covariates (a matrix)
x <- model.matrix(~ -1 + age + inc + educ + fsize + marr + twoearn + pira + hown, data = pension)
## compute income quartiles for group ATE.
g1 <- cut(x[,"inc"], quantile(x[,"inc"], c(0, 0.25,.5,.75,1), na.rm = TRUE),
labels = c("q1", "q2", "q3", "q4"), include.lowest = TRUE)
# run DML (nonparametric model)
## 2 folds (change as needed)
## 1 repetition (change as needed)
dml.401k <- dml(y, d, x, model = "npm", groups = g1, cf.folds = 2, cf.reps = 1)
#> Debiased Machine Learning
#>
#> Model: Nonparametric
#> Target: ate
#> Cross-Fitting: 2 folds, 1 reps
#> ML Method: outcome (yreg0:ranger, yreg1:ranger), treatment (ranger)
#> Tuning: dirty
#>
#>
#> ====================================
#> Tuning parameters using all the data
#> ====================================
#>
#> - Tuning Model for D.
#> -- Best Tune:
#> mtry min.node.size splitrule
#> 1 2 5 variance
#>
#> - Tuning Model for Y (non-parametric).
#> -- Best Tune:
#> mtry min.node.size splitrule
#> 1 2 5 variance
#> mtry min.node.size splitrule
#> 1 2 5 variance
#>
#>
#> ======================================
#> Repeating 2-fold cross-fitting 1 times
#> ======================================
#>
#> -- Rep 1 -- Folds: 1 2
#>
# summary of results with median method (default)
summary(dml.401k, combine.method = "median")
#>
#> Debiased Machine Learning
#>
#> Model: Nonparametric
#> Cross-Fitting: 2 folds, 1 reps
#> ML Method: outcome (yreg0:ranger, yreg1:ranger, R2 = 26.034%), treatment (ranger, R2 = 11.251%)
#> Tuning: dirty
#>
#> Average Treatment Effect:
#>
#> Estimate Std. Error t value P(>|t|)
#> ate.all 7763 1145 6.781 1.2e-11 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Group Average Treatment Effect:
#>
#> Estimate Std. Error t value P(>|t|)
#> gate.q1 4772 840 5.681 1.34e-08 ***
#> gate.q2 3093 1217 2.541 0.011048 *
#> gate.q3 6330 1760 3.596 0.000323 ***
#> gate.q4 16854 3955 4.262 2.03e-05 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Note: DML estimates combined using the median method.
#>
#> Verbal interpretation of DML procedure:
#>
#> -- Average treatment effects were estimated using DML with 2-fold cross-fitting. In order to reduce the variance that stems from sample splitting, we repeated the procedure 1 times. Estimates are combined using the median as the final estimate, incorporating variation across experiments into the standard error as described in Chernozhukov et al. (2018). The outcome regression uses Random Forest from the R package ranger; the treatment regression uses Random Forest from the R package ranger.
# coef median method (default)
coef(dml.401k, combine.method = "median")
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 7762.915 4772.061 3093.195 6330.041 16854.429
# se median method (default)
se(dml.401k, combine.method = "median")
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 1144.8702 839.9974 1217.2293 1760.0668 3954.6826
# confint median method
confint(dml.401k, combine.method = "median")
#> 2.5 % 97.5 %
#> ate.all 5519.0104 10006.819
#> gate.q1 3125.6963 6418.426
#> gate.q2 707.4699 5478.921
#> gate.q3 2880.3735 9779.708
#> gate.q4 9103.3935 24605.465
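Groups need not be fixed at fit time: dml_gate adds a new grouping to the fitted object. A minimal sketch using marital status (an illustrative grouping; recall that groups must be a deterministic function of x):

## GATEs by marital status, added after the fit via dml_gate
g2 <- factor(x[, "marr"], labels = c("not.married", "married"))
dml.401k.marr <- dml_gate(dml.401k, groups = g2)
summary(dml.401k.marr)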
## Sensitivity Analysis
### Robustness Values
robustness_value(dml.401k, alpha = 0.05)
#> ate.all gate.q1 gate.q2 gate.q3 gate.q4
#> 0.04377913 0.07882806 0.01530738 0.03302966 0.05237003
### Confidence Bounds
confidence_bounds(dml.401k, cf.y = 0.03, cf.d = 0.04, level = 0.95)
#> lwr upr
#> ate.all 1256.9657 14264.6146
#> gate.q1 1951.1698 7711.1079
#> gate.q2 -1451.3430 6982.7950
#> gate.q3 -187.0972 12637.9706
#> gate.q4 3587.5139 30200.0130
#>
#> Confidence level: point = 95%; region = 90%.
#> Sensitivity parameters: cf.y = 0.03; cf.d = 0.04; rho2 = 1.
### Contour Plots
ovb_contour_plot(dml.401k, cf.y = 0.03, cf.d = 0.04,
bound.label = "Max Match (3x years)")
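Finally, cross-fitting uses random fold assignments, so results vary across runs. For reproducible folds and more stable estimates, set cf.seed and increase cf.reps; a minimal sketch (the settings below are illustrative):

## a reproducible run: fixed seed, 5 folds, 5 repetitions
dml.401k.reps <- dml(y, d, x, model = "npm", groups = g1,
                     cf.folds = 5, cf.reps = 5, cf.seed = 123)
summary(dml.401k.reps)  # repetitions combined with the median method by default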