import altair as alt
import pandas as pd
import ppscore as pps
from pycaret.classification import *
import shap
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from ydata_profiling import ProfileReport
# customize Altair
def y_axis():
    return {
        "config": {
            "axisX": {"grid": False},
            "axisY": {
                "domain": False,
                "gridDash": [2, 4],
                "tickSize": 0,
                "titleAlign": "right",
                "titleAngle": 0,
                "titleX": -5,
                "titleY": -10,
            },
            "view": {
                "stroke": "transparent",
                # To keep the same height and width as the default theme:
                "continuousHeight": 300,
                "continuousWidth": 400,
            },
        }
    }

alt.themes.register("y_axis", y_axis)
alt.themes.enable("y_axis")
Predicting diabetes with PyCaret
Objectives
- Walk through an example end-to-end supervised learning workflow on the Pima Indians diabetes dataset
- Focus on conceptual understanding of machine learning
- Demonstrate use of Predictive Power Score (PPS)
- Demonstrate capabilities of low-code tools
- Demonstrate use of average_precision_score (link)
- Demonstrate SHAP values
Attribution
Dataset
- Pima Indians paper (original paper)
- Kaggle datacard (link)
Python libraries
- Altair (docs)
- ydata-profiling (docs)
- Predictive Power Score (PPS, GitHub, blog)
- PyCaret: an open-source, low-code machine learning library in Python that automates machine learning workflows (link)
Read and explore the data
%%time
df = pd.read_csv("diabetes.csv").astype({"Outcome": bool})
train, test = train_test_split(df, test_size=0.3)
profile = ProfileReport(train, minimal=True, title="Pima Indians Profiling Report")
profile.to_file("pima-indians-profiling-report-minimal.html")
CPU times: user 5.54 s, sys: 184 ms, total: 5.72 s
Wall time: 1.7 s
profile.to_notebook_iframe()
Investigate features with the largest predictive power
We use the Predictive Power Score to evaluate which features have the highest predictive power with respect to Outcome.
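As a quick single-pair check (the choice of Glucose here is illustrative), pps.score evaluates one predictor against one target and returns the score together with its metadata:

# Score one predictor/target pair; the result is a dict containing
# "ppscore" alongside fields such as "baseline_score" and "model_score".
pps.score(train, "Glucose", "Outcome")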
predictors = (
    pps.predictors(train, "Outcome")
    .round(3)
    .iloc[:, :-1]
)
base = (
    alt.Chart(predictors)
    .encode(
        y=alt.Y("x:N").sort("-x"),
        x="ppscore",
        tooltip=["x", "ppscore"],
    )
)
base.mark_bar() + base.mark_text(align="center", dy=-5)
Investigate collinearity
pps.matrix(train)
| | x | y | ppscore | case | is_valid_score | metric | baseline_score | model_score | model |
---|---|---|---|---|---|---|---|---|---|
0 | Pregnancies | Pregnancies | 1.000000 | predict_itself | True | None | 0.000000 | 1.000000 | None |
1 | Pregnancies | Glucose | 0.000000 | regression | True | mean absolute error | 23.651769 | 24.024994 | DecisionTreeRegressor() |
2 | Pregnancies | BloodPressure | 0.000000 | regression | True | mean absolute error | 12.748603 | 12.961633 | DecisionTreeRegressor() |
3 | Pregnancies | SkinThickness | 0.000000 | regression | True | mean absolute error | 13.467412 | 13.630828 | DecisionTreeRegressor() |
4 | Pregnancies | Insulin | 0.000000 | regression | True | mean absolute error | 79.240223 | 85.015086 | DecisionTreeRegressor() |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
76 | Outcome | Insulin | 0.000000 | regression | True | mean absolute error | 79.240223 | 84.802415 | DecisionTreeRegressor() |
77 | Outcome | BMI | 0.035133 | regression | True | mean absolute error | 5.672812 | 5.473511 | DecisionTreeRegressor() |
78 | Outcome | DiabetesPedigreeFunction | 0.000000 | regression | True | mean absolute error | 0.226384 | 0.234531 | DecisionTreeRegressor() |
79 | Outcome | Age | 0.000000 | regression | True | mean absolute error | 8.888268 | 8.997094 | DecisionTreeRegressor() |
80 | Outcome | Outcome | 1.000000 | predict_itself | True | None | 0.000000 | 1.000000 | None |
81 rows × 9 columns
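For contrast, a Pearson correlation matrix (a common alternative to PPS) is symmetric and only captures linear relationships, whereas PPS is asymmetric and can also detect non-linear ones:

# Pearson correlation matrix for the training set (linear, symmetric)
train.corr(numeric_only=True).round(2)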
pps_matrix = (
    pps.matrix(train.loc[:, predictors.query("ppscore > 0")["x"].tolist()])
    .loc[:, ["x", "y", "ppscore"]]
    .round(3)
)
(
    alt.Chart(pps_matrix)
    .mark_rect()
    .encode(
        x="x:O",
        y="y:O",
        color="ppscore:Q",
        tooltip=["x", "y", "ppscore"],
    )
).properties(width=500, height=500)
Build models
cls = setup(
    data=train,
    target='Outcome',
    numeric_imputation='mean',
    feature_selection=False,
    pca=False,
    remove_multicollinearity=True,
    remove_outliers=False,
    normalize=True,
)
add_metric('apc', 'APC', average_precision_score, target='pred_proba');
| | Description | Value |
---|---|---|
0 | Session id | 5752 |
1 | Target | Outcome |
2 | Target type | Binary |
3 | Original data shape | (537, 9) |
4 | Transformed data shape | (537, 9) |
5 | Transformed train set shape | (375, 9) |
6 | Transformed test set shape | (162, 9) |
7 | Numeric features | 8 |
8 | Preprocess | True |
9 | Imputation type | simple |
10 | Numeric imputation | mean |
11 | Categorical imputation | mode |
12 | Remove multicollinearity | True |
13 | Multicollinearity threshold | 0.900000 |
14 | Normalize | True |
15 | Normalize method | zscore |
16 | Fold Generator | StratifiedKFold |
17 | Fold Number | 10 |
18 | CPU Jobs | -1 |
19 | Use GPU | False |
20 | Log Experiment | False |
21 | Experiment Name | clf-default-name |
22 | USI | 10e4 |
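To make the custom APC metric concrete, here is a minimal sketch of what average_precision_score computes, using the toy arrays from the scikit-learn documentation rather than this dataset:

import numpy as np

# Average precision summarizes the precision-recall curve as the
# weighted mean of precisions achieved at each threshold.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
average_precision_score(y_true, y_score)  # ~0.83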
%%time
best_model = compare_models(include=["et", "lightgbm", "rf", "dt"], sort="APC")
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | APC | TT (Sec) |
---|---|---|---|---|---|---|---|---|---|---|
et | Extra Trees Classifier | 0.7569 | 0.8039 | 0.5090 | 0.6832 | 0.5719 | 0.4106 | 0.4261 | 0.6795 | 0.1740 |
rf | Random Forest Classifier | 0.7412 | 0.7893 | 0.5013 | 0.6420 | 0.5538 | 0.3784 | 0.3896 | 0.6661 | 0.0330 |
lightgbm | Light Gradient Boosting Machine | 0.7252 | 0.7776 | 0.5090 | 0.5987 | 0.5452 | 0.3519 | 0.3574 | 0.6378 | 0.1310 |
dt | Decision Tree Classifier | 0.6691 | 0.6275 | 0.5019 | 0.5002 | 0.4904 | 0.2503 | 0.2559 | 0.4269 | 0.0050 |
CPU times: user 537 ms, sys: 135 ms, total: 672 ms
Wall time: 3.96 s
Evaluation
predictions = predict_model(best_model, data=test.iloc[:, :-1])
predictions.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | prediction_label | prediction_score |
---|---|---|---|---|---|---|---|---|---|---|
752 | 3 | 108 | 62 | 24 | 0 | 26.000000 | 0.223 | 25 | 0 | 0.76 |
295 | 6 | 151 | 62 | 31 | 120 | 35.500000 | 0.692 | 28 | 0 | 0.57 |
532 | 1 | 86 | 66 | 52 | 65 | 41.299999 | 0.917 | 29 | 0 | 0.88 |
426 | 0 | 94 | 0 | 0 | 0 | 0.000000 | 0.256 | 25 | 0 | 0.89 |
68 | 1 | 95 | 66 | 13 | 38 | 19.600000 | 0.334 | 25 | 0 | 0.97 |
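As a sketch of scoring the hold-out set with the same metric used for model selection (assuming a PyCaret version where predict_model accepts raw_score=True; the prediction_score_1 probability column name may differ between releases):

# Compare hold-out labels against predicted probabilities of the positive class
scored = predict_model(best_model, data=test.iloc[:, :-1], raw_score=True)
average_precision_score(test["Outcome"], scored["prediction_score_1"])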
evaluate_model(best_model)
SHAP
interpret_model(best_model)
="reason", observation=1) interpret_model(best_model, plot
Visualization omitted, Javascript library not loaded!
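The warning above means SHAP's JavaScript bundle was never loaded in this session; calling shap.initjs() once in the notebook should make the force plots render inline:

shap.initjs()  # load SHAP's JS runtime so force plots display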
="reason") interpret_model(best_model, plot
Visualization omitted, Javascript library not loaded!