import altair as alt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# setting up our experiment
np.random.seed(1973)
N = 1000
# C is the confounder: it causally drives both X and Y
C = np.random.normal(loc=0.0, scale=1.0, size=N)
error_x = np.random.normal(loc=0.0, scale=1.0, size=N)
error_y = np.random.normal(loc=0.0, scale=0.01, size=N)
X = 10 + 5*C + error_x   # X depends on C, not on Y
Y = 1 + 0.5*C + error_y  # Y depends on C, not on X
df = pd.DataFrame({'X': X, 'Y': Y, 'C': C})
Deconfounding explained
A demonstration of how correlations ‘magically’ disappear once confounders are added to your model.
Credits
The original material for this demonstration was written in R by Jeroen de Mast. His original code was ported to Python by Daniel Kapitan.
Setting the scene
Suppose that we want to test whether \(X\) has a causal effect on \(Y\):
\[X \longrightarrow Y\]
Suppose also that we have 1000 \((X, Y)\) tuples as our data and that we want to fit the regression model
\[Y = \beta_0 + \beta_1 X + \epsilon\]
= smf.ols("Y ~ X", data=df).fit()
confounded confounded.summary()
Dep. Variable: | Y | R-squared: | 0.960 |
Model: | OLS | Adj. R-squared: | 0.960 |
Method: | Least Squares | F-statistic: | 2.385e+04 |
Date: | Thu, 21 Dec 2023 | Prob (F-statistic): | 0.00 |
Time: | 23:47:40 | Log-Likelihood: | 854.10 |
No. Observations: | 1000 | AIC: | -1704. |
Df Residuals: | 998 | BIC: | -1694. |
Df Model: | 1 |
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
Intercept | 0.0446 | 0.007 | 6.389 | 0.000 | 0.031 | 0.058 |
X | 0.0958 | 0.001 | 154.420 | 0.000 | 0.095 | 0.097 |
Omnibus: | 1.331 | Durbin-Watson: | 1.855 |
Prob(Omnibus): | 0.514 | Jarque-Bera (JB): | 1.406 |
Skew: | 0.080 | Prob(JB): | 0.495 |
Kurtosis: | 2.911 | Cond. No. | 24.2 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
So our first (confounded) model yields \(Y \approx 0.04 + 0.10X\). Note there can be small differences each time you re-run this notebook. But most importantly, the fitted model has a high \(R^2 = 0.96\) and \(X\) is highly significant (\(p \approx 0\))!
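If you prefer to read these numbers programmatically rather than off the summary table, the fitted results object exposes them as attributes. A quick illustrative check (not part of the original code):
# Programmatic look at the confounded fit (illustrative addition).
print(confounded.params)     # Intercept ~ 0.045, X ~ 0.096
print(confounded.rsquared)   # ~ 0.96
print(confounded.pvalues)    # the p-value for X is essentially 0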
However, if you look closely at the Python code, you see that the real model has a confounder \(C\):
\[C \longrightarrow X\] \[C \longrightarrow Y\]
In other words, \(X\) and \(Y\) are both causally affected by \(C\). As a consequence, \(X\) and \(Y\) are correlated, but they do not causally affect each other. The regression analysis above is therefore misleading: the correlation between \(X\) and \(Y\) is called spurious, and \(C\) is called a confounder.
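You can verify this directly in the simulated data: \(X\) and \(Y\) are strongly correlated with each other because both are strongly correlated with \(C\). The check and scatter plot below (using the altair import from the setup cell) are illustrative additions, not part of the original code:
# X and Y are strongly correlated purely through their common cause C.
print(df.corr().round(2))

# Scatter plot of the spurious X-Y relationship, colored by the confounder.
alt.Chart(df).mark_circle(opacity=0.5).encode(
    x=alt.X('X', scale=alt.Scale(zero=False)),
    y='Y',
    color='C',
).properties(title='Spurious correlation between X and Y')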
Now here is the great deconfounding trick: suppose that we include both \(X\) and \(C\) in the regression analysis and fit the following model:
\[Y = \beta_0 + \beta_1 X + \beta_2 C + \epsilon\]
= smf.ols("Y ~ X + C", data=df).fit()
deconfounded deconfounded.summary()
Dep. Variable: | Y | R-squared: | 1.000 |
Model: | OLS | Adj. R-squared: | 1.000 |
Method: | Least Squares | F-statistic: | 1.261e+06 |
Date: | Thu, 21 Dec 2023 | Prob (F-statistic): | 0.00 |
Time: | 23:47:40 | Log-Likelihood: | 3164.9 |
No. Observations: | 1000 | AIC: | -6324. |
Df Residuals: | 997 | BIC: | -6309. |
Df Model: | 2 |
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
Intercept | 1.0055 | 0.003 | 323.164 | 0.000 | 0.999 | 1.012 |
X | -0.0005 | 0.000 | -1.671 | 0.095 | -0.001 | 9.05e-05 |
C | 0.5023 | 0.002 | 316.780 | 0.000 | 0.499 | 0.505 |
Omnibus: | 2.983 | Durbin-Watson: | 2.042 |
Prob(Omnibus): | 0.225 | Jarque-Bera (JB): | 3.073 |
Skew: | -0.063 | Prob(JB): | 0.215 |
Kurtosis: | 3.240 | Cond. No. | 122. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Note that, by including \(C\) as an independent variable in the regression analysis, \(X\) suddenly stops being significant (\(p = 0.095\))!
This holds in general: if the true causal relationships are as given in the second diagram, then including the confounder \(C\) in the regression analysis yields the direct effect of \(X\) on \(Y\) (if any such direct effect exists), and the part of the correlation that is induced by the confounder is now entirely attributed to \(C\) rather than to \(X\). This approach is called “deconfounding”.
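One way to see why this works is the Frisch-Waugh-Lovell theorem: the coefficient of \(X\) in the deconfounded model equals the slope you get by first removing \(C\)'s influence from both \(X\) and \(Y\) and then regressing the residuals on each other. A minimal sketch (an illustrative addition, not part of the original notebook):
# Frisch-Waugh-Lovell check: strip C out of both X and Y, then regress
# what is left. The slope equals X's coefficient in the deconfounded model.
# (Illustrative sketch; not part of the original notebook.)
residuals = pd.DataFrame({
    'x_resid': smf.ols("X ~ C", data=df).fit().resid,
    'y_resid': smf.ols("Y ~ C", data=df).fit().resid,
})
fwl = smf.ols("y_resid ~ x_resid", data=residuals).fit()
print(fwl.params['x_resid'])      # same as the X coefficient above (~ -0.0005)
print(deconfounded.params['X'])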