Getting started with the composable data stack
September 8, 2024
This lecture is based on the following open access materials:
Source code: https://github.com/anthology-of-data-science/lecture-composable-data-stack
Daniel Kapitan, Modern, open and downward-scaleable data engineering.
This work is licensed under CC BY-SA 4.0
The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.
Pedreira, Pedro, et al. The composable data management system manifesto. Proceedings of the VLDB Endowment 16.10 (2023): 2679-2685.
E. F. Codd. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377–387.
Source: Jadwiga Wilkens on Medium. The Best BigQuery SQL Cheat Sheet for Beginners.
Armbrust, Michael, et al. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. Proceedings of CIDR. Vol. 8. 2021.
NOW: Application-Centric | FUTURE: Data-Centric |
---|---|
Exorbitant, often prohibitive, cost of change. | Reasonable cost of change. |
Data is tied up in applications because applications own data. | Data is an open resource that outlives any given application. |
Every new project comes with a big data conversion project. | Every new project taps into existing data stores. |
Data exists in wide variety of heterogeneous formats, structures, meaning, and terminology. | Data is globally integrated sharing a common meaning, being exported from a common source into any needed format. |
Data integration consumes 35%-65% of IT budget. | Data integration will be nearly free. |
Hard or impossible to integrate external data with internal data. | Internal and external data readily integrated. |
Source: Basil Bourque on Stack Overflow.
Source: Apache Arrow overview.
Source: Apache Arrow: Introducing ADBC.
Source: Apache Iceberg specification.
Source: Jordan Tigani, Big Data Is Dead.
Source: The Data Quarry::blog Embedded databases (1): The harmony of DuckDB, KùzuDB and LanceDB.
Source: Andreessen Horowitz, Emerging Architectures for Modern Data Infrastructure.
Source: Hamilton.
Source: Dagster Introducing Software-Defined Assets.
just
Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1–29. https://doi.org/10.18637/jss.v040.i01
concept | pandas | polars | ibis | PySpark | dplyr | SQL |
---|---|---|---|---|---|---|
split | groupby() | group_by() | group_by() | groupBy() | group_by() | GROUP BY |
combine | join(), merge() | join() | left_join(), inner_join() etc. | join() | left_join(), inner_join() etc. | LEFT JOIN, JOIN etc. |
filtering (row based) | loc[], query() | filter() | filter() | filter() | filter() | WHERE |
select (column based) | loc[], iloc[] | select() | select() | select() | select() | SELECT |
mutate | assign() | with_columns() | mutate() | withColumn() | mutate() | SELECT ... AS |
ordering | sort_values() | sort() | order_by() | orderBy() | arrange() | ORDER BY |
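The SQL column of the table above can be tried without installing anything, using the `sqlite3` module from the Python standard library. The following is a minimal sketch, not part of the original lecture: the `scores` and `teams` tables and their contents are made up for illustration, and the derived column is written as an aliased expression in `SELECT`.

```python
import sqlite3

# Hypothetical toy data, invented for this sketch.
scores = [
    ("ann", "red", 10),
    ("bob", "red", 20),
    ("cat", "blue", 5),
    ("dan", "blue", 40),
    ("eve", "green", 1),
]
teams = [("red", "Utrecht"), ("blue", "Delft")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, team TEXT, score INTEGER)")
con.execute("CREATE TABLE teams (team TEXT, city TEXT)")
con.executemany("INSERT INTO scores VALUES (?, ?, ?)", scores)
con.executemany("INSERT INTO teams VALUES (?, ?)", teams)

query = """
SELECT s.team,                            -- select: keep only needed columns
       t.city,
       SUM(s.score)     AS total,         -- apply: one aggregate per group
       SUM(s.score) * 2 AS double_total   -- mutate: derived, aliased column
FROM scores AS s
JOIN teams AS t ON t.team = s.team        -- combine: join a second table
WHERE s.score >= 5                        -- filtering: rows dropped before grouping
GROUP BY s.team, t.city                   -- split: one group per team
ORDER BY total DESC                       -- ordering
"""
result = con.execute(query).fetchall()
con.close()

print(result)
```

Each clause maps onto one row of the table, so the same query can be rewritten verb by verb in any of the dataframe libraries listed there.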
tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            "jack"),
        "crown"),
    "jill"
)
(jack_jill
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
    .broke("crown")
    .tumble_after("jill")
)
Source: Tom’s (Augspurger) Blog. Method Chaining.
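To make the comparison runnable, here is a minimal sketch of how such an object could be built. The `JackAndJill` class, its `steps` log, and the function-style wrappers are hypothetical and exist only for this illustration. The key design choice is that every method returns `self`, which is what allows the calls to be chained.

```python
class JackAndJill:
    """Hypothetical stand-in: each method logs a step and returns self."""

    def __init__(self):
        self.steps = []

    def _log(self, verb, arg):
        self.steps.append((verb, arg))
        return self  # returning self is what makes chaining possible

    def went_up(self, place):
        return self._log("went_up", place)

    def fetch(self, thing):
        return self._log("fetch", thing)

    def fell_down(self, who):
        return self._log("fell_down", who)

    def broke(self, what):
        return self._log("broke", what)

    def tumble_after(self, who):
        return self._log("tumble_after", who)


# Function-style wrappers, only needed to write the nested form.
def went_up(obj, place): return obj.went_up(place)
def fetch(obj, thing): return obj.fetch(thing)
def fell_down(obj, who): return obj.fell_down(who)
def broke(obj, what): return obj.broke(what)
def tumble_after(obj, who): return obj.tumble_after(who)


# Nested calls: the reader has to start in the middle and work outwards.
nested = tumble_after(
    broke(fell_down(fetch(went_up(JackAndJill(), "hill"), "water"), "jack"), "crown"),
    "jill",
)

# Method chain: the steps read top to bottom, in execution order.
chained = (
    JackAndJill()
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
    .broke("crown")
    .tumble_after("jill")
)

print(chained.steps)
```

Both forms record the same five steps in the same order; only the readability differs.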
Backend | Catalog | Database |
---|---|---|
bigquery | project | database |
clickhouse | | database |
datafusion | catalog | schema |
druid | dataSourceType | dataSource |
duckdb | database | schema |
flink | catalog | database |
impala | | database |
mssql | database | schema |
mysql | | database |
postgres | database | schema |
pyspark | | database |
snowflake | | database |
trino | catalog | schema |
Source: Ibis documentation.
operation | ibis | polars | duckdb |
---|---|---|---|
Flatten Array into multiple rows | ArrayValue.unnest() | DataFrame.explode() | UNNEST |
Unnest Struct into multiple columns | Table.unpack(*columns) | DataFrame.unnest() | UNNEST |
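The two operations in this table are easy to confuse, so here is a library-free sketch of what each one does to the shape of the data. It uses plain Python lists and dicts rather than any of the three libraries; the helper names `explode` and `unnest_struct` and the sample rows are hypothetical, chosen only for this illustration.

```python
# Hypothetical rows: 'tags' is an array column, 'address' is a struct column.
rows = [
    {"id": 1, "tags": ["a", "b"], "address": {"city": "Utrecht", "zip": "3511"}},
    {"id": 2, "tags": ["c"], "address": {"city": "Delft", "zip": "2611"}},
]


def explode(rows, column):
    """Flatten an array column: one output row per array element.

    In this sketch a row with an empty array yields no output rows;
    real engines differ in how they treat empty or null arrays.
    """
    return [{**row, column: item} for row in rows for item in row[column]]


def unnest_struct(rows, column):
    """Unpack a struct column: one output column per struct field."""
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        out.append({**rest, **row[column]})
    return out


exploded = explode(rows, "tags")          # more rows, same columns
unpacked = unnest_struct(rows, "address")  # same rows, more columns

print(exploded)
print(unpacked)
```

Flattening an array changes the number of rows, while unpacking a struct changes the number of columns.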
Ibis also has methods that operate directly on a column of structs, such as StructValue.lift() and StructValue.destructure().
component | Ibis analytics demo | My preference |
---|---|---|
Workflow orchestration | Dagster | Hamilton |
Persistent storage | parquet, native DuckDB files | Apache Iceberg |
Dashboarding app | Streamlit | Shiny for Python, Quarto |
Visualization | plotly | vega-altair |