# Deconfounding explained

## Credits

The original material for this demonstration was written in R by Jeroen
de Mast. His original code was ported to Python by Daniel Kapitan.

## Setting the scene

Suppose that we want to test whether $X$ has a causal effect on $Y$:

$$X \longrightarrow Y$$

Suppose also that our data consists of 1000 $(X, Y)$ tuples and that we
want to build a regression model.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


# setting up our experiment: seed the RNG so the draws are reproducible
np.random.seed(1973)
N = 1000
C = np.random.normal(loc=0.0, scale=1.0, size=N)
error_x = np.random.normal(loc=0.0, scale=1.0, size=N)
error_y = np.random.normal(loc=0.0, scale=0.01, size=N)
X = 10 + 5*C + error_x
Y = 1 + 0.5*C + error_y
df = pd.DataFrame({'X': X, 'Y': Y, 'C': C})

In [2]:
confounded = smf.ols("Y ~ X", data=df).fit()
confounded.summary()

So our first (confounded) model yields approximately $Y = 0.03 + 0.1X$.
The exact numbers depend on the random draw, so you may see small
differences when you re-run this notebook. Most importantly, the fitted
model has a high $R^2 = 0.95$ and the coefficient of $X$ is highly
significant ($p < 0.001$)!

However, if you look closely at the Python code, you see that the real
model has a confounder $C$:

$$C \longrightarrow X$$ $$C \longrightarrow Y$$

In other words, $X$ and $Y$ are both causally affected by $C$. As a
consequence, $X$ and $Y$ are correlated, but neither causally affects
the other. So the regression above fits the data well, but reading its
slope as a causal effect of $X$ on $Y$ is wrong: the correlation between
$X$ and $Y$ is *spurious*, and $C$ is called a confounder.
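
The fitted numbers can be checked against the data-generating code. As a
back-of-the-envelope sketch, assuming $C$, `error_x` and `error_y` are
independent with the variances set in the code ($1$, $1$ and $0.01^2$):

$$\operatorname{Cov}(X, Y) = 5 \cdot 0.5 \cdot \operatorname{Var}(C) = 2.5, \qquad \operatorname{Var}(X) = 5^2 + 1 = 26, \qquad \operatorname{Var}(Y) = 0.5^2 + 0.01^2 = 0.2501$$

$$\text{slope} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \frac{2.5}{26} \approx 0.096, \qquad \text{intercept} = 1 - 0.096 \cdot 10 \approx 0.04, \qquad R^2 = \frac{2.5^2}{26 \cdot 0.2501} \approx 0.96$$

These population values are in line with the fit reported above, even
though the code contains no arrow from $X$ to $Y$: the slope is produced
entirely by the shared dependence on $C$.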

Now here is the great deconfounding trick: suppose that we include both
$X$ and $C$ in the regression analysis and fit the following model:

$$Y = \beta_0 + \beta_1 X + \beta_2 C + \epsilon$$

In [3]:
deconfounded = smf.ols("Y ~ X + C", data=df).fit()
deconfounded.summary()

Note that, by including $C$ as an independent variable in the regression
analysis, $X$ has suddenly stopped being significant ($p = 0.36$ in the
run shown here)!
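
Reading the true coefficients off the data-generating code shows what
this fit should recover. Since `Y = 1 + 0.5*C + error_y` contains no $X$
term, the population values are

$$\beta_0 = 1, \qquad \beta_1 = 0, \qquad \beta_2 = 0.5,$$

so the estimate of $\beta_1$ should be statistically indistinguishable
from zero, while the estimate of $\beta_2$ should be close to $0.5$.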

This holds in general for linear relationships like these: if the true
causal relationships are as given in the second diagram, then including
the confounder $C$ in the regression analysis gives the direct effect of
$X$ on $Y$ (if any such direct effect exists), and the part of the
correlation that is induced by the confounder is attributed to $C$
rather than to $X$. This approach is called “deconfounding”.
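
The "if any such direct effect exists" part can be illustrated with a
small simulation. The sketch below is not part of the original
demonstration: it reuses the same data-generating equations but adds a
hypothetical `direct_effect` term for $X$ in the equation for $Y$, uses
a larger sample (`n=100_000`) to reduce noise, and fits the models with
plain NumPy least squares (`np.linalg.lstsq`) instead of `statsmodels`
to stay dependency-light. The helper names `ols_coefficients` and
`simulate` are made up for this illustration.

```python
import numpy as np


def ols_coefficients(y, *regressors):
    """Least-squares coefficients [intercept, b_1, ..., b_k] of y on the regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def simulate(direct_effect, n=100_000, seed=1973):
    """Same setup as above, plus an optional (hypothetical) direct effect of X on Y."""
    rng = np.random.default_rng(seed)
    c = rng.normal(0.0, 1.0, n)
    x = 10 + 5 * c + rng.normal(0.0, 1.0, n)
    y = 1 + 0.5 * c + direct_effect * x + rng.normal(0.0, 0.01, n)
    return x, y, c


for direct_effect in (0.0, 0.2):  # 0.2 is an arbitrary illustrative value
    x, y, c = simulate(direct_effect)
    naive = ols_coefficients(y, x)        # Y ~ X
    adjusted = ols_coefficients(y, x, c)  # Y ~ X + C
    print(f"true direct effect {direct_effect}: "
          f"naive slope {naive[1]:.3f}, adjusted slope {adjusted[1]:.3f}")
```

Under these assumptions, the naive slope should overstate the direct
effect by roughly $2.5 / 26 \approx 0.096$ in both scenarios, while the
slope from the model that includes $C$ should land close to the true
direct effect (zero in the first case, $0.2$ in the second).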