
Double Debiased Machine Learning (part 2)


How to remove regularization bias using post-double selection

Image by Author

In causal inference, we are usually interested in the effect of a treatment variable on a specific outcome. In randomized control trials or AB tests, conditioning the analysis on a set of other variables — the control variables or covariates — can increase the power of the analysis, by reducing imbalances that have emerged despite randomization. However, conditioning is even more important in observational studies, where, absent randomization, it might be essential to recover causal effects.

Often we do not have a strong opinion on which variables to condition the analysis on, or with which functional form, and we might be tempted to let the data decide, either through variable selection or through flexible machine learning methods. In the previous part of this blog post, we have seen how this procedure can distort inference, i.e. produce wrong confidence intervals for the causal effect of interest. This bias is generally called regularization bias or pre-test bias.

In this blog post, we are going to explore a solution to the variable selection problem, post-double selection, and we are going to introduce a general approach to deal with many control variables and/or non-linear functional forms, double-debiased machine learning.

Recap

To better understand the source of the bias, in the first part of this post we explored the example of a firm that is interested in testing the effectiveness of an ad campaign. The firm has information on its current ad spending and on the level of sales. The problem arises because the firm is uncertain whether it should condition its analysis on the level of past sales.

The following Directed Acyclic Graph (DAG) summarizes the data generating process.

I import the data generating process dgp_pretest() from src.dgp and some plotting functions and libraries from src.utils.

from src.utils import *
from src.dgp import dgp_pretest

df = dgp_pretest().generate_data()
df.head()
First rows of the simulated dataset, image by Author

We have data on 1000 different markets, for which we observe current sales, the amount spent on advertising and past sales.

We want to understand whether ad spending is effective in increasing sales. One possibility is to regress the latter on the former, using the following regression model, also called the short model.

salesᵢ = α adsᵢ + εᵢ (short model)

Should we also include past sales in the regression? Then, the regression model would be the following, also called the long model.

salesᵢ = α adsᵢ + β past_salesᵢ + εᵢ (long model)

One naive approach would be to let the data decide: we could run the second regression and, if the estimated effect of past sales, β̂, is statistically significant, keep the long model; otherwise, run the short model. This procedure is called pre-testing.
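As a minimal sketch (assuming the df simulated above, with columns sales, ads and past_sales, and a 5% significance level), pre-testing amounts to:

import statsmodels.formula.api as smf

# Long regression: is past_sales statistically significant?
long_fit = smf.ols('sales ~ ads + past_sales', data=df).fit()
if long_fit.pvalues['past_sales'] < 0.05:
    alpha_hat = long_fit.params['ads']  # keep the long model
else:
    alpha_hat = smf.ols('sales ~ ads', data=df).fit().params['ads']  # fall back to the short model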

The problem with this procedure is that it introduces a bias that is called regularization or pre-test bias. Pre-testing ensures that this bias is small enough not to distort the estimated coefficient. However, it does not ensure that it is small enough not to distort the confidence intervals around the estimated coefficient.

Is there a solution? Yes!

Post-Double Selection

The solution is called post-double selection. The method was first introduced by Belloni, Chernozhukov, Hansen (2014) and later expanded in a variety of papers having Victor Chernozhukov as a common denominator.

The authors assume the following data generating process:

Y = αD + βX + ε
D = δX + u

In our example, Y corresponds to sales, D corresponds to ads, X corresponds to past_sales and the effect of interest is α. Here, X is 1-dimensional for simplicity but, in general, we are interested in cases where X is high-dimensional, potentially with even more dimensions than observations. In that case, variable selection is essential in linear regression, since with more features than observations the OLS coefficients are not uniquely determined.
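A quick illustration of the last point (a toy example of mine, not from the original notebook):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))  # more features (10) than observations (5)
y = rng.normal(size=5)
# lstsq returns *a* solution, but with more features than observations
# infinitely many coefficient vectors fit the data perfectly:
# the OLS coefficients are not uniquely determined
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # rank 5: only 5 of the 10 coefficient directions are pinned down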

Post-double selection consists of the following procedure.

  1. Reduced form selection: regress Y on X (e.g. via lasso). Select the statistically significant variables in the set S₁ ⊆ X
  2. First stage selection: regress D on X (e.g. via lasso). Select the statistically significant variables in the set S₂ ⊆ X
  3. Regress Y on D and the union of the variables selected in the first two steps, S₁ ∪ S₂ (a code sketch follows below)
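As an illustration, here is a minimal sketch of the generic procedure (my implementation choices: scikit-learn's LassoCV for the two selection steps, keeping the variables with non-zero lasso coefficients, and statsmodels for the final OLS):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def post_double_selection(y, D, X):
    # Step 1, reduced form selection: lasso y on X
    S1 = np.flatnonzero(LassoCV().fit(X, y).coef_)
    # Step 2, first stage selection: lasso D on X
    S2 = np.flatnonzero(LassoCV().fit(X, D).coef_)
    # Step 3: regress y on D and the union of the selected controls
    S = np.union1d(S1, S2).astype(int)
    W = sm.add_constant(np.column_stack([D, X[:, S]]))
    return sm.OLS(y, W).fit()  # the coefficient on D estimates α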

The authors show that this procedure gives confidence intervals for the estimated coefficient of interest α̂ that have the correct coverage, i.e. the correct probability of type 1 error.

Note (1): this procedure is always less parsimonious, in terms of variable selection, than pre-testing. In fact, we still select all the variables we would have selected with pre-testing but, in the first stage, we might select additional ones.

Note (2): the terms first stage and reduced form come from the instrumental variables literature in econometrics. Indeed, the first application of post-double selection was to select instrumental variables in Belloni, Chen, Chernozhukov, Hansen (2012).

Note (3): the name post-double selection comes from the fact that now we are not performing variable selection once but twice.

Intuition

The idea behind post-double selection is to bound the omitted variable bias. In case you are not familiar with it, I wrote a separate blog post on the omitted variable bias.

In our setting, we can express the omitted variable bias as

Bias(α̂) = β · δ (omitted variable bias)

As we can see, the omitted variable bias comes from the product of two quantities related to the omitted variable X:

  1. Its partial correlation with the outcome Y, i.e. β
  2. Its partial correlation with the treatment D, i.e. δ

With pre-testing, we ensure that the partial correlation between X and the outcome Y, β, is small. In fact, we only rarely omit X when we shouldn't (i.e. we rarely commit a type 2 error). What exactly do small and rarely mean?

When we select a variable based on its significance, we ensure that its magnitude is smaller than c/√n for some number c, where n is the sample size. Therefore, with pre-testing, we ensure that, no matter what the value of δ is, the magnitude of the bias is smaller than c/√n, which means that it converges to zero for sufficiently large n. This is why the pre-testing estimator is still consistent, i.e. it converges to the true value for a sufficiently large sample size n.

However, in order for our confidence intervals to have the right coverage, this is not enough. In practice, we need the bias to converge to zero faster than 1/√n. Why?

To get an intuition for this result, we need to turn to the Central Limit Theorem. The CLT tells us that, for a large sample size n, the distribution of the sample average of a random variable X converges to a normal distribution with mean μ and standard deviation σ/√n, where μ and σ are the mean and standard deviation of X. To do inference, we usually apply the Central Limit Theorem to our estimator to obtain its asymptotic distribution, which in turn allows us to build confidence intervals (using the mean and the standard deviation). If the bias is not substantially smaller than the standard deviation of the estimator, the confidence intervals are going to be wrong. Therefore, we need the bias to converge to zero faster than the standard deviation, i.e. faster than 1/√n.
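To see this in symbols (a sketch, writing bₙ for the bias and σ/√n for the estimator's standard deviation):

α̂ ≈ N(α + bₙ, σ²/n),  so bias / standard deviation = bₙ / (σ/√n) = √n·bₙ/σ

If bₙ is of order c/√n, this ratio converges to c/σ > 0 and the confidence intervals remain shifted even asymptotically; coverage is correct only if √n·bₙ → 0, i.e. only if the bias vanishes faster than 1/√n.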

In our setting, the omitted variable bias is βδ and we want it to converge to zero faster than 1/√n. Post-double selection guarantees that

  • Reduced form selection: any “missing” variable j has |βⱼ| ≤ c/√n
  • First stage selection: any “missing” variable j has |δⱼ| ≤ c/√n
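Combining the two bounds, for a finite number J of omitted variables (a sketch in the notation above):

|OVB| = |Σⱼ βⱼ·δⱼ| ≤ Σⱼ |βⱼ|·|δⱼ| ≤ J·(c/√n)·(c/√n) = J·c²/n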

As a consequence, as long as the number of omitted variables is finite, the omitted variable bias is going to converge to zero at a rate 1/n, which is faster than 1/√n. Problem solved!

Application

Let’s now go back to our example and test the post-double selection procedure. In practice, we want to do the following:

  1. First Stage selection: regress ads on past_sales. Check if past_sales is statistically significant
  2. Reduced Form selection: regress sales on past_sales. Check if past_sales is statistically significant
  3. Regress sales on ads and include past_sales only if it was significant in either of the two previous regressions

I update the pre_test function from the first part of the post to also compute the post-double selection estimator.
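The updated function is in the notebook linked at the end; a minimal sketch of the post-double selection estimator for a single simulated dataset (my own re-implementation, assuming a 5% significance level) could look like this:

import statsmodels.formula.api as smf

def post_double(df):
    # First stage selection: is past_sales significant for ads?
    s1 = smf.ols('ads ~ past_sales', data=df).fit().pvalues['past_sales'] < 0.05
    # Reduced form selection: is past_sales significant for sales?
    s2 = smf.ols('sales ~ past_sales', data=df).fit().pvalues['past_sales'] < 0.05
    # Include past_sales if it was selected in either step
    formula = 'sales ~ ads + past_sales' if (s1 or s2) else 'sales ~ ads'
    return smf.ols(formula, data=df).fit().params['ads']

Collecting post_double(dgp_pretest().generate_data()) over many simulations gives the distribution of α̂.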

We can now plot the distributions (over simulations) of the estimated coefficients.

Distribution of α̂, image by Author

As we can see, the post-double selection procedure always correctly selects the long regression and therefore the estimator has the correct distribution.

Double-checks

In the last post, we ran some simulations in order to investigate when pre-testing bias emerges. We saw that pre-testing is a problem for

  • Small sample sizes n
  • Intermediate values of β
  • When the value of β depends on the sample size

Let’s check that post-double selection removes regularization bias in all the previous cases.

First, let's simulate the distribution of the post-double selection estimator α̂ for different sample sizes.

Ns = [100, 300, 1000, 3000]
alphas = {f'N = {n:.0f}': pre_test(N=n) for n in Ns}
compare_alphas(alphas, true_alpha=1)

We can now plot and compare the distributions of the pre-testing and post-double estimators.

Distribution of α̂, image by Author

For small sample sizes, the distribution of the pre-testing estimator is not normal but rather bimodal. The post-double estimator, instead, is Gaussian even in small samples.

Now we repeat the same exercise, but for different values of β, the coefficient of past_sales on sales.

betas = 0.3 * np.array([0.1,0.3,1,3])
alphas = {rf'$\beta$ = {b:.2f}': pre_test(b=b) for b in betas}
compare_alphas(alphas, true_alpha=1)
Distribution of α̂, image by Author

Again, the post-double selection estimator has a Gaussian distribution irrespective of the value of β, while the pre-testing estimator suffers from regularization bias.

For the last simulation, we change both the coefficient and the sample size at the same time.

betas = 0.3 * 30 / np.sqrt(Ns)
alphas = {rf'N = {n:.0f}, $\beta$ = {b:.2f}': pre_test(b=b, N=n) for n,b in zip(Ns,betas)}
compare_alphas(alphas, true_alpha=1)
Distribution of α̂, image by Author

Also in this last case, the post-double selection estimator has the correct Gaussian distribution across simulations.

Double Debiased Machine Learning

So far, we have only analyzed a linear, univariate example. What happens if the dimension of X increases and we do not know the functional form through which X affects Y and D? In these cases, we can use machine learning algorithms to uncover these high-dimensional non-linear relationships.

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) investigate this setting. In particular, the authors consider the following partially linear model.

Y = αD + g(X) + ε,  with E[ε | D, X] = 0
D = m(X) + u,  with E[u | X] = 0

where Y is the outcome variable, D is the treatment of interest and X is a potentially high-dimensional set of control variables. The difference with respect to the previous setting is that now we leave the relationships of X with Y and with D unspecified, through the functions g() and m().

Naive approach

A naive approach to the estimation of α using machine learning methods would be, for example, to construct a sophisticated machine learning estimator for learning the regression function αD + g(X).

  1. Split the sample in two: main sample and auxiliary sample (why? see note below)
  2. Use the auxiliary sample to estimate ĝ(X)
  3. Use the main sample to compute the orthogonalized component of Y on X:
Ỹ = Y − ĝ(X)

4. Use the main sample to estimate the residualized OLS estimator from regressing Ỹ on D:

α̂ = (Σᵢ Dᵢ²)⁻¹ Σᵢ Dᵢ Ỹᵢ
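A minimal sketch of this naive single-split estimator (my implementation: a random forest for ĝ, a choice of mine rather than the paper's; y, D, X are numpy arrays, with X two-dimensional):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def naive_ml(y, D, X):
    # 1. Split into main and auxiliary samples
    y_main, y_aux, D_main, D_aux, X_main, X_aux = train_test_split(y, D, X, test_size=0.5)
    # 2. Estimate g_hat on the auxiliary sample (ignoring D, to keep the sketch simple)
    g_hat = RandomForestRegressor().fit(X_aux, y_aux)
    # 3. Orthogonalized component of Y on X, computed on the main sample
    y_res = y_main - g_hat.predict(X_main)
    # 4. Residualized OLS of the residuals on D
    return np.sum(D_main * y_res) / np.sum(D_main ** 2)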

This estimator is going to have two problems:

  1. A slow rate of convergence, i.e. slower than the usual √n rate
  2. Bias, because we are employing high-dimensional regularized estimators (e.g. we are doing variable selection)

Note (1): so far we have not talked about it, but the variable selection procedure also introduces another type of bias: overfitting bias. This bias emerges because the sample used to select the variables is the same one used to estimate the coefficient of interest. This bias is easily accounted for with sample splitting: using different sub-samples for the selection and the estimation procedures.

Note (2): why can we use the residuals from step 3 to estimate α in step 4? Because of the Frisch-Waugh-Lovell theorem. If you are not familiar with it, I have written a blog post on the Frisch-Waugh-Lovell theorem here.

Double Orthogonalization

Double-debiased machine learning solves the problem by repeating the orthogonalization procedure twice. The idea is the same behind post-double selection: reduce the regularization bias by performing variable selection twice. The estimator is still valid because of the Frisch-Waugh-Lovell theorem.

In practice, double-debiased machine learning consists of the following steps.

  1. Split the sample in two: the main sample and the auxiliary sample
  2. Use the auxiliary sample to estimate ĝ(X) from
Y = αD + g(X) + ε

3. Use the auxiliary sample to estimate m̂(X) from

D = m(X) + u

4. Use the main sample to compute the orthogonalized component of D on X as

D̃ = D − m̂(X)

5. Use the main sample to estimate the double-residualized OLS estimator as

α̂ = (1/n Σᵢ D̃ᵢ Dᵢ)⁻¹ (1/n Σᵢ D̃ᵢ (Yᵢ − ĝ(Xᵢ)))

The estimator is root-N consistent! This means not only that the estimator converges to the true value as the sample size increases (i.e. it's consistent), but also that it does so at the √n rate: its standard deviation shrinks at rate 1/√n, so we can build standard confidence intervals around it.

However, the estimator loses efficiency because of sample splitting: each estimate uses only half of the data. The problem is solved by inverting the two sub-samples, re-estimating the coefficient, and averaging the two estimates, a procedure known as cross-fitting. This is valid because, thanks to sample splitting, each estimate remains free of overfitting bias.
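Putting the steps together, here is a minimal sketch of the cross-fitted estimator (my implementation choices: random forests as the ML learners, although the method works with any sufficiently well-behaved learner; y, D, X are numpy arrays, with X two-dimensional):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def double_ml(y, D, X, n_splits=2):
    y_res = np.zeros_like(y, dtype=float)  # Y - g_hat(X)
    D_res = np.zeros_like(D, dtype=float)  # D - m_hat(X)
    for aux, main in KFold(n_splits=n_splits, shuffle=True).split(X):
        # Estimate the nuisance functions on the auxiliary fold
        g_hat = RandomForestRegressor().fit(X[aux], y[aux])
        m_hat = RandomForestRegressor().fit(X[aux], D[aux])
        # Residualize on the main fold
        y_res[main] = y[main] - g_hat.predict(X[main])
        D_res[main] = D[main] - m_hat.predict(X[main])
    # Double-residualized OLS; with n_splits=2 each half serves once as
    # auxiliary and once as main sample, which averages the swapped estimates
    return np.sum(D_res * y_res) / np.sum(D_res * D)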

A Cautionary Tale

Before we conclude, I have to mention a recent research paper by Hünermund, Louw, and Caspi (2022), in which the authors show that double-debiased machine learning can easily backfire if applied blindly.

The problem is related to bad control variables. If you have never heard of this term, I have written an introductory blog post on good and bad control variables here. In short, conditioning the analysis on additional features is not always good for causal inference. Depending on the setting, there might exist variables that we want to leave out of our analysis, since their inclusion can bias the coefficient of interest, preventing a causal interpretation. The simplest example is variables that are common outcomes of both the treatment D and the outcome variable Y.

The double-debiased machine learning model implicitly assumes that the control variables X are (weak) common causes of both the outcome Y and the treatment D. If this is the case, and no further mediated/indirect relationship exists between X and Y, there is no problem. However, if, for example, some variable among the controls X is a common effect instead of a common cause, its inclusion will bias the coefficient of interest. Moreover, such a variable is likely to be highly correlated with either the outcome Y or the treatment D. In the latter case, post-double selection might include it in cases in which simple selection would not have. Therefore, in the presence of bad control variables, double-debiased machine learning might perform even worse than simple pre-testing.

In short, as with any method, it is crucial to have a clear understanding of its assumptions and to always check for potential violations.

Conclusion

In this post, we have seen how to use post-double selection and, more generally, double debiased machine learning to get rid of an important source of bias: regularization bias.

This contribution by Victor Chernozhukov and co-authors has undoubtedly been one of the most relevant advances in causal inference in the last decade. It is now widely employed in industry and included in the most widely used causal inference packages, such as EconML (Microsoft) and causalml (Uber).

If you (understandably) feel the need for more material on double-debiased machine learning, but you do not feel like reading academic papers (also very understandable), here is a good compromise.

In this video lecture, Victor Chernozhukov himself presents the idea. The video lecture is relatively heavy on math and statistics, but you cannot get a more qualified and direct source than this!

References

[1] A. Belloni, D. Chen, V. Chernozhukov, C. Hansen, Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain (2012), Econometrica.

[2] A. Belloni, V. Chernozhukov, C. Hansen, Inference on treatment effects after selection among high-dimensional controls (2014), The Review of Economic Studies.

[3] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, J. Robins, Double/debiased machine learning for treatment and structural parameters (2018), The Econometrics Journal.

[4] P. Hünermund, B. Louw, I. Caspi, Double Machine Learning and Automated Confounder Selection — A Cautionary Tale (2022), working paper.


Code

You can find the original Jupyter Notebook here:

Blog-Posts/pds.ipynb at main · matteocourthoud/Blog-Posts

Thank you for reading!

I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.

Also, a small disclaimer: I write to learn so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!

