Predicting house prices with PyCaret

Author

Daniel Kapitan

Published

October 21, 2023

Objectives

  • Example end-to-end supervised learning workflow with Ames Housing dataset
  • Focus on conceptual understanding of machine learning
  • Demonstrate use of Predictive Power Score (PPS)
  • Demonstrate capabilities of low-code tools

Attribution

Dataset

  • Ames Housing dataset paper (original paper)
  • Kaggle competition advanced regression techniques (link)

Python libraries

  • Altair (docs)
  • ydata-profiling (docs)
  • Predictive Power Score (PPS, GitHub, blog)
  • PyCaret: open-source, low-code machine learning library in Python that automates machine learning workflows (link)
import altair as alt
import pandas as pd
import ppscore as pps
from pycaret.regression import *
from ydata_profiling import ProfileReport


# customize Altair
def y_axis():
    return {
        "config": {
            "axisX": {"grid": False},
            "axisY": {
                "domain": False,
                "gridDash": [2, 4],
                "tickSize": 0,
                "titleAlign": "right",
                "titleAngle": 0,
                "titleX": -5,
                "titleY": -10,
            },
            "view": {
                "stroke": "transparent",
                # To keep the same height and width as the default theme:
                "continuousHeight": 300,
                "continuousWidth": 400,
            },
        }
    }


alt.themes.register("y_axis", y_axis)
alt.themes.enable("y_axis")


def get_descriptions():
    "Parse descriptions of columns of Ames Housing dataset"
    with open("data_description.txt") as reader:
        descriptions = {}
        for line in reader.readlines():
            if ":" in line and "2nd level" not in line:
                descriptions[line.split(": ")[0].strip()] = line.split(": ")[1].strip()
    return pd.Series(descriptions).rename("descriptions")


descriptions = get_descriptions()
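
Each line in data_description.txt follows the pattern "Column: description", so the parsed result is a Series indexed by column name. A quick lookup as a sanity check (a sketch; the exact wording depends on the file shipped with the competition):

# Sketch: the Series maps column names to their one-line descriptions.
print(descriptions.loc["OverallQual"])
print(descriptions.shape)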

Read and explore the data

%%time
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
profile = ProfileReport(train, minimal=True, title="Ames Housing Profiling Report")
profile.to_file("ames-housing-profiling-report-minimal.html")
CPU times: user 45 s, sys: 1.78 s, total: 46.7 s
Wall time: 16.4 s
profile.to_notebook_iframe()

Investigate features with the largest predictive power

We use the Predictive Power Score to evaluate which features have the highest predictive power with respect to SalePrice.
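
For a single feature-target pair, ppscore also exposes pps.score, which returns a dictionary with the score, the metric used and the baseline; a minimal sketch (key names assumed from the ppscore documentation):

# Sketch: PPS of a single feature with respect to SalePrice.
# pps.score returns a dict with keys such as 'ppscore', 'metric' and 'baseline_score'.
single = pps.score(train, "OverallQual", "SalePrice")
print(single["ppscore"], single["metric"])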

predictors = (
    pps.predictors(train, "SalePrice")
    .round(3)
    .iloc[:, :-1]
    .merge(descriptions, how="left", left_on="x", right_index=True)
)
base = (
    alt.Chart(predictors)
    .encode(
        x=alt.X("x:N").sort("-y"),
        y="ppscore",
        tooltip=["x", "ppscore", "descriptions"],
    )
    .transform_filter("datum.ppscore > 0")
)
base.mark_bar() + base.mark_text(align="center", dy=-5).encode(text="ppscore")

Investigate collinearity

pps_matrix = (
    pps.matrix(
        train.loc[:, predictors.query("ppscore > 0")["x"].tolist()],
    )
    .loc[:, ["x", "y", "ppscore"]]
    .round(3)
)
(
    alt.Chart(pps_matrix)
    .mark_rect()
    .encode(
        x="x:O",
        y="y:O",
        color="ppscore:Q",
        tooltip=["x", "y", "ppscore"],
    )
)

Build models

We select the 30 features with the highest Predictive Power Score and pass them to PyCaret's setup function.

selected_predictors = (
    predictors.sort_values("ppscore", ascending=False).head(30)["x"].to_list()
)
reg = setup(
    data=train.loc[:, selected_predictors + ["SalePrice"]],
    target="SalePrice",
    numeric_imputation="mean",
    categorical_features=list(
        train.loc[:, selected_predictors].select_dtypes("object").columns
    ),
    feature_selection=False,
    pca=False,
    remove_multicollinearity=True,
    remove_outliers=False,
    normalize=True,
)
  Description Value
0 Session id 8378
1 Target SalePrice
2 Target type Regression
3 Original data shape (1460, 31)
4 Transformed data shape (1460, 116)
5 Transformed train set shape (1021, 116)
6 Transformed test set shape (439, 116)
7 Ordinal features 1
8 Numeric features 16
9 Categorical features 14
10 Rows with missing values 94.7%
11 Preprocess True
12 Imputation type simple
13 Numeric imputation mean
14 Categorical imputation mode
15 Maximum one-hot encoding 25
16 Encoding method None
17 Remove multicollinearity True
18 Multicollinearity threshold 0.900000
19 Normalize True
20 Normalize method zscore
21 Fold Generator KFold
22 Fold Number 10
23 CPU Jobs -1
24 Use GPU False
25 Log Experiment False
26 Experiment Name reg-default-name
27 USI 81f6
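
The jump from 31 original columns to 116 transformed columns is mostly due to one-hot encoding of the 14 categorical features. The data produced by the preprocessing pipeline can be inspected through get_config; a sketch assuming the PyCaret 3.x configuration keys:

# Sketch: look at the training data after PyCaret's preprocessing pipeline.
# 'X_train_transformed' is one of the variables exposed by get_config in PyCaret 3.x.
X_train_transformed = get_config("X_train_transformed")
print(X_train_transformed.shape)
X_train_transformed.head()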
%%time
selected_models = [model for model in models().index if model not in ["lar", "lr", "ransac"]]
best_model = compare_models(sort='RMSLE', include=selected_models)
  Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
lightgbm Light Gradient Boosting Machine 18267.8967 969345616.0929 30245.0381 0.8400 0.1474 0.1051 0.3780
gbr Gradient Boosting Regressor 18349.4461 1064907228.1139 31464.4286 0.8221 0.1497 0.1059 0.0810
rf Random Forest Regressor 18834.6022 1052157810.7295 31669.8884 0.8263 0.1530 0.1091 0.1370
par Passive Aggressive Regressor 18695.3332 1145943934.1128 32527.7429 0.8093 0.1535 0.1061 0.0560
en Elastic Net 19941.1185 1212771199.2709 33679.9238 0.8018 0.1536 0.1131 0.0370
et Extra Trees Regressor 19749.8604 1158574471.9795 33510.6172 0.8073 0.1591 0.1138 0.1370
huber Huber Regressor 18580.7407 1172797296.1965 32571.8573 0.8024 0.1602 0.1069 0.0420
br Bayesian Ridge 20557.3468 1251454965.3245 34036.3809 0.7934 0.1715 0.1191 0.0380
ard Automatic Relevance Determination 20446.5401 1229331466.4696 33711.3986 0.7969 0.1747 0.1193 0.2740
omp Orthogonal Matching Pursuit 21882.7966 1294135379.9217 34955.8947 0.7847 0.1849 0.1296 0.0340
ada AdaBoost Regressor 24866.3282 1379609584.9159 36498.7175 0.7707 0.2036 0.1621 0.0580
knn K Neighbors Regressor 26571.2016 1730405638.7521 40931.3774 0.7200 0.2050 0.1518 0.0360
dt Decision Tree Regressor 27747.5148 2157234242.4490 45330.0191 0.6512 0.2169 0.1564 0.0350
llar Lasso Least Angle Regression 21458.2025 1320695830.3446 35006.4301 0.7809 0.2187 0.1268 0.0380
lasso Lasso Regression 21455.6793 1320742178.3808 35006.1951 0.7809 0.2189 0.1268 0.2100
ridge Ridge Regression 21439.2241 1318937040.3720 34981.0548 0.7812 0.2196 0.1266 0.0360
svm Support Vector Regression 55543.3805 6417749387.4994 79739.3850 -0.0524 0.3979 0.3195 0.0450
dummy Dummy Regressor 57352.4774 6133919031.8184 78021.7431 -0.0086 0.4061 0.3635 0.0340
tr TheilSen Regressor 29178.3219 2564758742.0908 49572.3895 0.5667 0.4258 0.1978 4.0290
kr Kernel Ridge 182040.0692 34133507087.4154 184731.2672 -4.7500 1.7994 1.1623 0.0380
mlp MLP Regressor 166456.5847 32851392125.7179 181040.0031 -4.4796 2.7703 0.9182 0.2890
CPU times: user 4.09 s, sys: 446 ms, total: 4.53 s
Wall time: 1min 3s

Evaluation

  • With a standard, AutoML-like workflow, we achieve an RMSLE of 0.13–0.14 (over different runs), which already places us in the top 25% of the 4,200 submissions on the leaderboard
  • We can now make predictions on the test set
predictions = (
    predict_model(best_model, data=test)
    .rename(columns={"prediction_label": "SalePrice"})
    .loc[:, ["Id", "SalePrice"]]
)
predictions.head()
Id SalePrice
0 1461 126951.931078
1 1462 142402.002648
2 1463 185086.014955
3 1464 191718.590497
4 1465 186412.972060
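
To enter these predictions into the Kaggle leaderboard, the DataFrame only needs to be written to a CSV file with the Id and SalePrice columns; a minimal sketch (the filename is arbitrary):

# Sketch: write the submission file expected by the Kaggle competition.
predictions.to_csv("submission.csv", index=False)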

Pipeline

plot_model(best_model, 'pipeline')

plot_model(best_model, 'feature')

plot_model(best_model, 'residuals')
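
As a possible next step (not run here), the best model could be tuned, refit on the complete training set and saved as a single pipeline. A sketch using PyCaret's tune_model, finalize_model and save_model:

# Sketch of follow-up steps, not executed in this notebook.
tuned = tune_model(best_model, optimize="RMSLE")  # hyperparameter search over the CV folds
final = finalize_model(tuned)                     # refit the pipeline on the full training data
save_model(final, "ames-housing-pipeline")        # persist preprocessing + model to disk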