Predicting house prices with PyCaret

Author

Daniel Kapitan

Published

October 21, 2023

Objectives

  • Example end-to-end supervised learning workflow with the Ames Housing dataset
  • Focus on conceptual understanding of machine learning
  • Demonstrate use of Predictive Power Score (PPS)
  • Demonstrate capabilities of low-code tools

Attribution

Dataset

  • Ames Housing dataset (original paper)
  • Kaggle competition: House Prices - Advanced Regression Techniques (link)

Python libraries

  • Altair (docs)
  • ydata-profiling (docs)
  • Predictive Power Score (PPS, GitHub, blog)
  • PyCaret: an open-source, low-code Python library that automates machine learning workflows (link)

import altair as alt
import pandas as pd
import ppscore as pps
from pycaret.regression import *
from ydata_profiling import ProfileReport


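# Customize Altair: register a theme that restyles the y-axis
def y_axis():
    """Altair theme: no x-axis grid, dashed y-axis grid, horizontal y-axis title."""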
    return {
        "config": {
            "axisX": {"grid": False},
            "axisY": {
                "domain": False,
                "gridDash": [2, 4],
                "tickSize": 0,
                "titleAlign": "right",
                "titleAngle": 0,
                "titleX": -5,
                "titleY": -10,
            },
            "view": {
                "stroke": "transparent",
                # To keep the same height and width as the default theme:
                "continuousHeight": 300,
                "continuousWidth": 400,
            },
        }
    }


alt.themes.register("y_axis", y_axis)
alt.themes.enable("y_axis")


def get_descriptions():
    """Parse the column descriptions of the Ames Housing dataset."""
    descriptions = {}
    with open("data_description.txt") as reader:
        for line in reader:
            # Column entries look like "Name: description". Category values
            # that contain a colon (e.g. "...: 2nd level finished") are skipped.
            name, sep, text = line.partition(": ")
            if sep and "2nd level" not in line:
                descriptions[name.strip()] = text.strip()
    return pd.Series(descriptions).rename("descriptions")


descriptions = get_descriptions()
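
To see what the parser keeps and what it drops, here is a minimal sketch that runs the same filtering logic on a short in-memory excerpt. The excerpt is made up for illustration and assumes data_description.txt uses "Name: description" lines followed by indented category values. `parse_descriptions` and `SAMPLE` are hypothetical names, and the sketch returns a plain dict rather than a pandas Series so it has no dependencies.

```python
import io

# Made-up excerpt in the layout data_description.txt is assumed to follow:
# a "Name: description" line, then indented category values.
SAMPLE = (
    "MSSubClass: Identifies the type of dwelling involved in the sale.\n"
    "\n"
    "        20\t1-STORY 1946 & NEWER ALL STYLES\n"
    "HouseStyle: Style of dwelling\n"
    "\n"
    "       1.5Fin\tOne and one-half story: 2nd level finished\n"
)


def parse_descriptions(reader):
    """Return a dict mapping column name to description (hypothetical helper)."""
    descriptions = {}
    for line in reader:
        name, sep, text = line.partition(": ")
        # Keep only "Name: description" lines; drop category values
        # that happen to contain a colon.
        if sep and "2nd level" not in line:
            descriptions[name.strip()] = text.strip()
    return descriptions


sample_descriptions = parse_descriptions(io.StringIO(SAMPLE))
for column, text in sample_descriptions.items():
    print(f"{column}: {text}")
```

Only the two column entries survive. The category rows are skipped, either because they contain no ": " separator or because of the "2nd level" filter.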

Read and explore the data

%%time
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
profile = ProfileReport(train, minimal=True, title="Ames Housing Profiling Report")
profile.to_file("ames-housing-profiling-report-minimal.html")

CPU times: user 45 s, sys: 1.78 s, total: 46.7 s
Wall time: 16.4 s

profile.to_notebook_iframe()
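
Alongside the profiling report, a quick pandas summary can give a first impression of how much data is missing per column. The sketch below is an illustration only: `toy` is a made-up stand-in for `train`, and its values are invented rather than taken from train.csv. The same two lines should work on the real frame.

```python
import pandas as pd

# Made-up stand-in for `train`; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "LotFrontage": [60.0, None, 70.0, 80.0],
        "Alley": [None, None, "Grvl", None],
        "SalePrice": [100000, 150000, 200000, 250000],
    }
)

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```

Columns with a large share of missing values are worth checking against the column descriptions before modelling, since in some datasets a missing value encodes "feature not present" rather than unknown data.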