The package hdm for double selection inference with a simple example

By Gabriel Vasconcelos

In an earlier post I discussed Double Selection (DS), a procedure for inference after selecting controls. I showed an example, discussed in an article by Belloni, Chernozhukov and Hansen, of the consequences of ignoring the variable selection step.

Some of the authors of that article created the hdm package, which implements double selection using the Rigorous LASSO (RLASSO) to select the controls. The RLASSO uses the theory they developed (instead of cross-validation or an information criterion) to choose the regularization parameter, usually referred to as \lambda.

Application

I am going to show an application based on the package’s vignettes, which is in turn based on an article by Barro and Lee (1994). The hypothesis we want to test is whether less developed countries, with lower GDP per capita, grow faster than developed countries. In other words, whether there is a catch-up effect. The model equation is as follows:

\displaystyle y_i=\alpha_0d_i+\sum_{j=1}^p\beta_jx_{i,j}+\varepsilon_i

where y_i is the GDP growth rate over a specific decade in country i, d_i is the log of GDP at the beginning of the decade, and x_{i,j} are controls that may affect GDP growth. We want to know the effect of d_i on y_i, which is measured by \alpha_0. If our catch-up hypothesis is true, \alpha_0 should be negative and, hopefully, significant.

The dataset is available in the package. It has 62 variables and 90 observations. Each observation is a country, but the same country may appear more than once if it was analysed in two different decades. The large number of variables calls for some variable selection, and I will show what happens if we use a single LASSO selection and if we use the Double Selection. The hdm package does all the DS steps in a single line of code, so we do not need to estimate the two selection models and the Post-OLS individually. I will also run a naive OLS with all variables just for illustration.
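For reference, the two selection models and the Post-OLS mentioned above can be written out as three steps. This is only a sketch of the procedure as described by Belloni, Chernozhukov and Hansen; the symbols \hat I_1, \hat I_2, \hat I and the superscripts (y) and (d) are notation introduced here for illustration and do not come from the package.

```latex
\begin{aligned}
&\text{Step 1: LASSO of } y_i \text{ on } x_{i,1},\dots,x_{i,p}
  \;\Rightarrow\; \hat I_1 = \{\, j : \hat\beta^{(y)}_j \neq 0 \,\} \\
&\text{Step 2: LASSO of } d_i \text{ on } x_{i,1},\dots,x_{i,p}
  \;\Rightarrow\; \hat I_2 = \{\, j : \hat\beta^{(d)}_j \neq 0 \,\} \\
&\text{Step 3: OLS of } y_i \text{ on } d_i \text{ and } \{\, x_{i,j} : j \in \hat I \,\},
  \qquad \hat I = \hat I_1 \cup \hat I_2
\end{aligned}
```

The estimate of \alpha_0 is the coefficient on d_i in Step 3. A single selection keeps only the controls that help predict y_i, so it can drop a control that is strongly related to d_i but only weakly related to y_i. Taking the union with \hat I_2 is what guards against that omission.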

library(hdm)
data("GrowthData") # = use ?GrowthData for more information = #
dataset=GrowthData[,-2] # = The second column is just a vector of ones = #

# = Naive OLS with all variables = #
# = Keep only the summary row for the initial log GDP (gdpsh465); row 1 would be the intercept = #
OLS = summary(lm(Outcome ~ ., data = dataset))$coefficients["gdpsh465", ]

# = Single step selection LASSO and Post-OLS = #
# = Again keep only the gdpsh465 row = #
lasso = rlasso(Outcome ~ ., data = dataset, post = FALSE) # = Run the Rigorous LASSO = #
selected = which(coef(lasso)[-c(1:2)] != 0) # = Selected controls (intercept and gdpsh465 excluded) = #
formula = paste(c("Outcome ~ gdpsh465", names(selected)), collapse = "+")
SS = summary(lm(as.formula(formula), data = dataset))$coefficients["gdpsh465", ]

# = Double Selection = #
DS = rlassoEffects(Outcome ~ ., I = ~gdpsh465, data = dataset)
DS = summary(DS)$coefficients[1, ] # = Only one row here: gdpsh465 = #
(results = rbind(OLS, SS, DS))
## (OLS and SS rows as originally printed with $coefficients[1, ], i.e. the intercept rows)
##        Estimate Std. Error    t value    Pr(>|t|)
## OLS  0.24716089 0.78450163  0.3150547 0.755056170
## SS   0.31168793 0.09832465  3.1699876 0.002169693
## DS  -0.04432403 0.01531925 -2.8933558 0.003811493

A word of caution about this printout. The OLS and SS rows are the first row of each lm summary, which is the intercept and not gdpsh465, so they cannot be read as estimates of \alpha_0. Indexing the coefficient table by "gdpsh465", as in the code above, gives the right row, and the OLS and SS figures have to be regenerated that way before their sign and significance can be compared with the DS. The DS row is the gdpsh465 coefficient. It is negative (about -0.044) and significant, which is consistent with the catch-up effect. With only 90 observations for about 60 regressors, the naive OLS is bound to be very imprecise. We can’t say for sure that the DS is correct, but it is backed by strong theory and by many simulations showing that the SS is problematic, so it is very unlikely that the SS results are more accurate than the DS results. You should at least be skeptical when you see this type of modelling and the selection of controls is not clear.
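As a quick check on the DS row, the t value and an approximate 95% interval can be recomputed by hand from the printed estimate and standard error. The 1.96 critical value assumes the usual normal approximation; that is an assumption made here for illustration, not something stated in the output.

```latex
t = \frac{\hat\alpha_0}{\widehat{\mathrm{se}}(\hat\alpha_0)}
  = \frac{-0.04432403}{0.01531925} \approx -2.893,
\qquad
\hat\alpha_0 \pm 1.96\,\widehat{\mathrm{se}}(\hat\alpha_0)
  \approx -0.0443 \pm 0.0300 = [-0.0743,\; -0.0143]
```

The whole interval lies below zero, which is what "negative and significant" means for the DS estimate.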

The hdm package has several other implementations in this framework, such as instrumental variables and logit models, and there are more examples in the package vignette.

References

Belloni, A., V. Chernozhukov, and C. Hansen. “Inference on treatment effects after selection amongst high-dimensional controls.” https://arxiv.org/abs/1201.0224

Barro, Robert J., and Jong-Wha Lee. “Sources of economic growth.” Carnegie-Rochester conference series on public policy. Vol. 40. North-Holland, 1994. http://www.sciencedirect.com/science/article/pii/0167223194900027
