import altair as alt
import pandas as pd
import ppscore as pps
from pycaret.regression import *
from ydata_profiling import ProfileReport
# customize Altair
def y_axis():
    return {
        "config": {
            "axisX": {"grid": False},
            "axisY": {
                "domain": False,
                "gridDash": [2, 4],
                "tickSize": 0,
                "titleAlign": "right",
                "titleAngle": 0,
                "titleX": -5,
                "titleY": -10,
            },
            "view": {
                "stroke": "transparent",
                # To keep the same height and width as the default theme:
                "continuousHeight": 300,
                "continuousWidth": 400,
            },
        }
    }


alt.themes.register("y_axis", y_axis)
alt.themes.enable("y_axis")
def get_descriptions():
    "Parse descriptions of columns of Ames Housing dataset"
    with open("data_description.txt") as reader:
        descriptions = {}
        for line in reader.readlines():
            # Keep only "name: description" lines; skip category rows mentioning "2nd level"
            if ":" in line and "2nd level" not in line:
                descriptions[line.split(": ")[0].strip()] = line.split(": ")[1].strip()
    return pd.Series(descriptions).rename("descriptions")


descriptions = get_descriptions()
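The parsing rule in get_descriptions can be tried without the real file. The sketch below is a minimal, standard-library-only illustration: the SAMPLE text is an assumed excerpt in the style of data_description.txt, and parse_descriptions is a hypothetical helper name, not part of the notebook. It uses str.partition instead of split, which is a small deviation that keeps the full description even if it contains a second ": ".

```python
import io

# Assumed excerpt in the style of data_description.txt (illustrative, not the full file).
SAMPLE = """MSSubClass: Identifies the type of dwelling involved in the sale.

        20\t1-STORY 1946 & NEWER ALL STYLES

HouseStyle: Style of dwelling

       2.5Fin\tTwo and one-half story: 2nd level finished
"""


def parse_descriptions(reader):
    """Hypothetical helper: map column names to descriptions."""
    descriptions = {}
    for line in reader:
        # Column lines contain a colon; category rows that also contain
        # one (the "2nd level" house styles) are filtered out explicitly.
        if ":" in line and "2nd level" not in line:
            name, _, text = line.partition(": ")
            descriptions[name.strip()] = text.strip()
    return descriptions


parsed = parse_descriptions(io.StringIO(SAMPLE))
print(parsed)
```

The category row for 2.5Fin contains a colon, so without the "2nd level" check it would be recorded as if it were a column.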
Predicting house prices with PyCaret
Objectives
- Example end-to-end supervised learning workflow with the Ames Housing dataset
- Focus on conceptual understanding of machine learning
- Demonstrate use of Predictive Power Score (PPS)
- Demonstrate capabilities of low-code tools
Attribution
Dataset
- Ames Housing dataset paper (original paper)
- Kaggle competition: House Prices - Advanced Regression Techniques (link)
Python libraries
- Altair (docs)
- ydata-profiling (docs)
- Predictive Power Score (PPS, GitHub, blog)
- PyCaret: open-source, low-code machine learning library in Python that automates machine learning workflows (link)
Read and explore the data
%%time
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
profile = ProfileReport(train, minimal=True, title="Ames Housing Profiling Report")
profile.to_file("ames-housing-profiling-report-minimal.html")
CPU times: user 45 s, sys: 1.78 s, total: 46.7 s
Wall time: 16.4 s
profile.to_notebook_iframe()