Exploratory data analysis with Altair

This notebook explores some of Altair’s most interesting features, using the Ames housing dataset. It was originally published on Kaggle.

Published: January 1, 2023

import altair as alt
import numpy as np 
import pandas as pd


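# Ames housing data, read directly from GitHub (?raw=true serves the CSV itself rather than the HTML page).
ames_data = "https://github.com/eaisi/discover-projects/blob/main/ames-housing/AmesHousing.csv?raw=true"
# Strip spaces from the column names, e.g. "Lot Area" becomes "LotArea".
train = pd.read_csv(ames_data).rename(columns=lambda s: s.replace(" ",""))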

train
Order PID MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2925 2926 923275080 80 RL 37.0 7937 Pave NaN IR1 Lvl ... 0 NaN GdPrv NaN 0 3 2006 WD Normal 142500
2926 2927 923276100 20 RL NaN 8885 Pave NaN IR1 Low ... 0 NaN MnPrv NaN 0 6 2006 WD Normal 131000
2927 2928 923400125 85 RL 62.0 10441 Pave NaN Reg Lvl ... 0 NaN MnPrv Shed 700 7 2006 WD Normal 132000
2928 2929 924100070 20 RL 77.0 10010 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2006 WD Normal 170000
2929 2930 924151050 60 RL 74.0 9627 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 11 2006 WD Normal 188000

2930 rows × 82 columns
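The column clean-up in the loading step can be tried in isolation. The sketch below applies the same rename to a tiny hand-made frame; the frame and its values are illustrative stand-ins, not the real CSV.

```python
import pandas as pd

# Hypothetical one-row frame whose headers contain spaces,
# standing in for the raw CSV header.
raw = pd.DataFrame({"Lot Area": [31770], "MS SubClass": [20], "SalePrice": [215000]})

# Same transformation as in the loading cell: drop every space in each name.
clean = raw.rename(columns=lambda s: s.replace(" ", ""))

print(list(clean.columns))  # ['LotArea', 'MSSubClass', 'SalePrice']
```

Space-free names give compact identifiers such as OverallQual and SalePrice, which are the names the charts in this notebook use in their encodings.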

Bar Chart with Highlighted Bar

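A basic bar chart in which bars are highlighted according to how much of the missing data each column accounts for. The value computed here is each column's share of all missing cells in the dataset, not the fraction of rows that are missing within that column.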

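# Each column's share (in %) of all missing cells in the dataset; the shares sum to 100.
missing = train.isnull().sum()*100/train.isnull().sum().sum()
# Keep only the columns that have at least one missing value.
missing = missing[missing > 0].reset_index()
# Despite its name, 'Count missing' holds a percentage, not a raw count.
missing.columns = ['Column', 'Count missing']
missing.head()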
        Column  Count missing
0  LotFrontage       3.111309
1        Alley      17.347133
2   MasVnrType      11.270557
3   MasVnrArea       0.146041
4     BsmtQual       0.507969
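The percentages above divide each column's missing count by the total number of missing cells. A different and perhaps more common measure is the share of rows missing within each column. The sketch below compares the two on a small made-up frame; `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 rows, with a known number of gaps per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],  # 2 of 4 rows missing
    "b": [np.nan, 2.0, 3.0, 4.0],     # 1 of 4 rows missing
    "c": [1.0, 2.0, 3.0, 4.0],        # complete
})

n_missing = toy.isnull().sum()

# Share of all missing cells (the measure used in this notebook).
share_of_missing = n_missing * 100 / n_missing.sum()

# Per-column missing rate: missing rows divided by total rows.
pct_rows_missing = toy.isnull().mean() * 100

comparison = pd.DataFrame({
    "share_of_missing": share_of_missing,
    "pct_rows_missing": pct_rows_missing,
})
print(comparison)
```

The first measure always sums to 100 across columns, so it ranks columns against each other. The second answers "how incomplete is this column?" and does not depend on the other columns.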
alt.Chart(missing).mark_bar().encode(
    x=alt.X('Column', sort='-y'),
    y='Count missing',
    color=alt.condition(
        alt.datum['Count missing'] > 10,  # Columns holding more than 10% of all missing values
        alt.value('orange'),              # are drawn in orange;
        alt.value('steelblue')            # all other bars are drawn in steelblue.
    ),
    tooltip=['Count missing']
).properties(
    width=500,
    height=300
).configure_axis(
    grid=False
)

Boxplot

A basic boxplot of sale price for each overall quality rating, created with the .mark_boxplot() method.

alt.Chart(train).mark_boxplot().encode(
    x='OverallQual:O',
    y='SalePrice:Q',
    color='OverallQual:N'
).properties(
    width=500,
    height=300
)
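To make explicit what each box summarizes, the sketch below computes comparable statistics with pandas on a small made-up sample. It assumes Tukey-style whiskers at 1.5 times the interquartile range, which is the documented default extent for Vega-Lite boxplots; the quantile interpolation used here is pandas' default and may differ slightly from what the chart computes. `toy` and `box_stats` are hypothetical names.

```python
import pandas as pd

# Made-up sample: two quality levels, one with an obvious outlier.
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 5, 5, 7, 7, 7, 7, 7],
    "SalePrice": [100, 120, 130, 140, 400, 200, 210, 220, 230, 240],
})

def box_stats(prices: pd.Series) -> pd.Series:
    """Quartiles, whisker ends, and outlier count for one group."""
    q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers extend to the most extreme points that lie within the fences.
    inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "n_outliers": len(prices) - len(inside),
    })

# One row of statistics per quality level.
stats = toy.groupby("OverallQual")["SalePrice"].apply(box_stats).unstack()
print(stats)
```

In this sample the value 400 falls outside the upper fence for quality level 5, so it would be drawn as an individual point rather than stretching the whisker.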