Part II: Designing, building and operating data & AI platforms
This lecture is based on the following open access materials:
Source code: https://github.com/anthology-of-data-science/lecture-engineering-data-ai-platforms
This work is licensed under CC BY-SA 4.0

Source: Kleppmann & Riccomini (2026)

Source: Reis & Housley (2022)

Designing data science & AI platforms
Source: the Composable Codex
Source: MAD landscape
Source: Pedreira et al. (2023)
The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.
Source: E.F. Codd (1970)

| System | Catalog | Database |
|---|---|---|
| postgresql, mysql, mssql, duckdb | database | schema |
| datafusion, trino | catalog | schema |
| druid | dataSourceType | dataSource |
| bigquery | project | database |
| flink | catalog | database |
| clickhouse, impala, mysql, pyspark, snowflake | database | |
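
The table above shows that the same two-level hierarchy goes by different names in different systems. As a minimal sketch of how a client could translate between them, the snippet below looks up a system's own terms and renders a qualified table name. The `HIERARCHY` mapping and the `describe` helper are hypothetical names introduced for illustration only, and just the unambiguous rows of the table are included.

```python
# Hypothetical lookup table (not part of any listed system):
# system -> (term used for the catalog level, term used for the database level).
# The values are copied from the rows of the table above.
HIERARCHY = {
    "postgresql": ("database", "schema"),
    "duckdb": ("database", "schema"),
    "datafusion": ("catalog", "schema"),
    "trino": ("catalog", "schema"),
    "druid": ("dataSourceType", "dataSource"),
    "bigquery": ("project", "database"),
    "flink": ("catalog", "database"),
}


def describe(system, catalog, database, table):
    """Render a qualified table name annotated with the system's own vocabulary."""
    catalog_term, database_term = HIERARCHY[system]
    qualified = f"{catalog}.{database}.{table}"
    return f"{qualified} ({catalog_term}={catalog}, {database_term}={database})"


print(describe("trino", "hive", "sales", "orders"))
print(describe("bigquery", "my-project", "sales", "orders"))
```

The point of the sketch is only that a neutral (catalog, database) pair can be kept internally while each system's preferred term is used at the edges.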
Source: Kleppmann & Riccomini (2026), chapter 1
| Property | Operational systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts |
| Query volume | Lots of small queries | Few queries, each is complex |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
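
To make the read and write patterns in the table concrete, the sketch below runs both query shapes against one in-memory SQLite database from Python's standard library. This is an illustration under simplifying assumptions: the `orders` table and its columns are made up, and in practice the operational and analytic workloads would usually run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0), (4, "carol", 50.0)],
)

# OLTP-style write: update one individual record, identified by its key.
conn.execute("UPDATE orders SET amount = 25.0 WHERE order_id = ?", (1,))

# OLTP-style read: point query that fetches a single record by key.
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (1,)
).fetchone()

# OLAP-style read: aggregate over all records in the table.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point)
print(totals)
```

The point query touches one row through the primary key, whereas the aggregate has to scan every row, which is why the two workloads tend to favour different storage layouts.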
Source: Armbrust et al. (2021)
Source: Kleppmann & Riccomini (2026), chapter 4
| Component | Function | Open source software |
|---|---|---|
| Open query engine | Parses SQL queries, optimizes them into execution plans, and executes them against the data | Apache DataFusion, Apache Spark, DuckDB |
| Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance |
| Open table format | Supports row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi |
| Open storage formats | Determine how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC |
| Open memory formats | Determine how the rows of a table are encoded as bytes in memory | Apache Arrow |
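
The last two rows of the table are about how rows of a table become bytes. The sketch below contrasts a row-oriented and a column-oriented encoding of the same three records using only Python's `struct` module. It is a toy layout chosen for illustration, not the actual Parquet, ORC, Lance, or Arrow format, all of which add metadata, encodings, and alignment rules that are omitted here.

```python
import struct

# Three (id, score) records; the field names are made up for this example.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Row-oriented: the fields of each record are stored next to each other.
# "<id" = little-endian, one 4-byte int followed by one 8-byte double, no padding.
row_bytes = b"".join(struct.pack("<id", rid, score) for rid, score in rows)

# Column-oriented: all values of one column are stored contiguously.
ids = [rid for rid, _ in rows]
scores = [score for _, score in rows]
col_bytes = struct.pack("<3i", *ids) + struct.pack("<3d", *scores)

# Reading only the score column from the columnar layout is one contiguous slice,
# starting right after the three 4-byte ids.
scores_from_columns = struct.unpack_from("<3d", col_bytes, offset=3 * 4)

# Reading the same column from the row layout means striding over every record.
record_size = struct.calcsize("<id")
scores_from_rows = tuple(
    struct.unpack_from("<id", row_bytes, i * record_size)[1] for i in range(len(rows))
)

print(scores_from_columns)
print(scores_from_rows)
```

Both layouts hold the same bytes in a different order; the columnar one makes scans over a single column cheap, which fits the aggregate-heavy analytic workloads described earlier.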
Source: Apache Arrow documentation
Source: the Composable Codex
Source: Apache Arrow project
Source: Andreessen Horowitz
Source: Jordan Tigani
Source: the Data Quarry blog
Source: the Composable Codex
Source: DuckLake
Source: CloudZero

Patterns for building data & AI systems
Source: Buschmann et al. (2013)
Source: Dagster
Source: Kleppmann & Riccomini (2026), chapter 5
Source: Hamilton
Source: Wickham (2011)
| concept | pandas | polars | ibis | PySpark | dplyr | SQL |
|---|---|---|---|---|---|---|
| split | groupby() | group_by() | group_by() | groupBy() | group_by() | GROUP BY |
| combine | join(), merge() | join() | left_join(), inner_join() etc. | join() | left_join(), inner_join() etc. | LEFT JOIN, JOIN etc. |
| filtering (row based) | loc[], query() | filter() | filter() | filter() | filter() | WHERE |
| select (column based) | loc[], iloc[] | select() | select() | select() | select() | SELECT |
| mutate | assign() | with_columns() | mutate() | withColumn() | mutate() | SELECT ... AS |
| ordering | sort_values() | sort() | order_by() | orderBy() | arrange() | ORDER BY |
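
The verbs in the table can be traced through a small library-free example. The sketch below applies filter, mutate, select, split, a per-group aggregate, and ordering to a list of dictionaries in plain Python, so that it runs without pandas, polars, ibis, or PySpark installed. The records and column names are invented for illustration, and the join row of the table is not shown.

```python
from collections import defaultdict

records = [
    {"city": "Utrecht", "year": 2023, "sales": 10},
    {"city": "Utrecht", "year": 2024, "sales": 14},
    {"city": "Leiden", "year": 2023, "sales": 7},
    {"city": "Leiden", "year": 2024, "sales": 9},
    {"city": "Delft", "year": 2022, "sales": 5},
]

# filter (row based): keep only rows from 2023 onwards.
recent = [r for r in records if r["year"] >= 2023]

# mutate: derive a new column from an existing one.
with_eur = [{**r, "sales_eur": r["sales"] * 1000} for r in recent]

# select (column based): keep only the columns needed downstream.
selected = [{"city": r["city"], "sales_eur": r["sales_eur"]} for r in with_eur]

# split: group the rows by key.
groups = defaultdict(list)
for r in selected:
    groups[r["city"]].append(r["sales_eur"])

# apply an aggregate per group and gather the per-group results into one table.
totals = [{"city": city, "total_eur": sum(values)} for city, values in groups.items()]

# ordering: sort the result by the aggregated column, largest first.
result = sorted(totals, key=lambda r: r["total_eur"], reverse=True)

print(result)
```

Each step corresponds to one row of the table, which is why pipelines written against one of the listed APIs usually translate to the others almost line by line.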
Source: Tom Augspurger's blog (2016)
Source: Maxime Beauchemin (2018)
Source: Dagster
MLOps
Source: Google Cloud docs
Source: MLflow
Source: MLflow Tracking
Source: MLflow Model Registry
Source: MLflow
Source: Sculley et al. (2015)
