Problems of causal inference after selecting controls

By Gabriel Vasconcelos

Inference after model selection

In many cases, when we want to estimate some causal relationship between two variables we have to solve the problem of selecting the right control variables. If we fail, our results will be very fragile and the estimator potentially biased because we left some important control variables out. This problem is known as omitted variables bias. It happens because some variables correlated with our variable of interest were left out and went to the errors term, making the errors correlated with the variable of interest.

Suppose we want to test the causal effect of education on wages. Many potential controls come to my mind such as parents education, experience, age, etc. There are several methods to select relevant variables, the most naive one is to run a regression with all variables and then select those which are significant. Then we run the regression again only with these variables.

No mater how we select the variables, we may always fall on a type II error situation, when some variable is significant but we wrongly drop it from our model. Additionally, some variables may have a very small, but significant effect which makes it more difficult for variable selection methods to catch them.

The traditional single selection approach

Traditionally, this is how the procedure is done. Suppose our response variable is y_i, and we are interested on the effects of d_i on y_i using the controls x_i. First we estimate the following model:

y_i = \alpha_0 d_i + \beta' x_i + \varepsilon_i ,

Then we select the significant variables from x_i and re-estimate the equation using d_i and x_i^s, which is the group of the selected variables. This is not good and there is a high probability that your inference will be wrong!!

The double selection approach

Beloni and Chernozhukov (2012) showed that the most adequate way to do inference after model selection is by a procedure called double selection. Consider the following equations:

y_i = \alpha_0 d_i + \beta' x_i + \varepsilon_i,
d_i = \gamma'x_i + \epsilon_i,

we can substitute the second equation on the first and have the following:

y_i = \alpha_0(\gamma'+\beta')x_i + (\alpha_0\epsilon_i +\varepsilon_i),
d_i = \gamma'x_i + \epsilon_i,

Note that the y_i now does not depend on d_i. The double selection procedure is as follows:

  • Estimate the equation of x_i on y_i and select the relevant variables,
  • Estimate the equation of x_i on d_i and select the relevant variables,
  • Estimate the equation of x_i and d_i on y_i using the elements of x_i that were selected on any of the two steps.

This procedure gives a new chance to the variables that were dropped on the fist step and are related to d_i to be included in the final model.


The R code presented here is for a simulation on a simplified version of the data generation process from Beloni Chernozhukov (2012). You are going to need to install the mvtnorm package to generate data from multivariate normal distributions. The code below declares the parameters for the simulation.

## Declare parameters ##
p=50 # = Number of potential controls = #
N=100 # = Number of observations = #
alpha0=0.5 # = Parameter of interest = #
b=1/((1:p)^2) # = Control parameters (some are big and some small) = #

# = Controls covariance matrix = #
for(k in 1:p){
  for(j in 1:p){
M=5000 # = Number of simulations = #
set.seed(123) # = Seed for replication = #
store=matrix(NA,M,2) # = Matrix to store the parameters in each simulation = #
colnames(store)=c("Single Selection", "Double Selection")

We are going to perform 5000 simulations. This may take a couple minutes. The objective is to compare the single selection and the double selection in terms of bias and variance. The code below performs the simulation loop.

# = Simulation Loop = #
for(i in 1:M){
  X=rmvnorm(N,rep(0,p),sigma) # = Generate controls = #

  ed=rnorm(N) # = Errors from di equation = #
  ey=rnorm(N) # = Errors from yi equation = #

  cd=1/sqrt(var(X%*%b)) # = This will keep the R2 close to 0.5 = #
  d=X%*%(as.vector(cd)*b)+ed # = generate di = #

  cy=1/sqrt(var(cbind(d,X)%*%c(alpha0,b))) # = R2 control = #
  y=cbind(d,X)%*%(c(alpha0,as.vector(cy)*b))+ey # = Generate yi = #

  # = Single Selection Approach = #
  model0=lm(y~d+X) # = Run first step = #
  selected=which(summary(model0)$coefficients[-c(1:2),4]<0.05) # = Variable selection using significance = #

  # = Just a control for the cases the model selects 0 variables = #

  store[i,1]=coef(model1)[2] # = save results = #

  # = Double Selection = #
  modeld0=lm(y~X) # = Regression of xi on yi = #
  selected0=which(summary(modeld0)$coefficients[-1,4]<0.05) # = Variable selection = #
  modeld1=lm(d~X) # = Regression of xi on di = #
  selected1=which(summary(modeld1)$coefficients[-1,4]<0.05) # = Variable  selection = #

  selected=sort(union(selected0,selected1)) # = Selected variables = # 

  # = Just a control for the cases the model selects 0 variables = #

  # = save results = #

colMeans(store) # = Average alpha0 (real value is 0.5) = #
## Single Selection Double Selection
##        0.5895864        0.5075527
apply(store,2,sd) # = alpha0 standard deviation = #
## Single Selection Double Selection
##        0.1609284        0.1117405
# = Plot densities = #
legend("topleft",legend=c("Double","Single"),col=c(1,2),lty=1,seg.len = 0.8,cex=0.8,bty="n")

plot of chunk unnamed-chunk-8

The results show the damage that can be caused if we ignore the existence of the variable selection to do inference. The single selection procedure is biased and the distribution is bimodal. There is a high probability that the effects of d_i on y_i will be overestimated. The omitted variable bias is behind the problem. When the fist step of the single selection fails to recover important variables they are left to the error term. The double selection makes traditional inference suitable again. Note that the single selection fails even if we use sophisticated variable selection techniques such as LASSO.


Belloni, A., V. Chernozhukov, and C. Hansen. “Inference on treatment effects after selection amongst high-dimensional controls.”

This entry was posted in R and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s