Engineering data science & AI platforms

Part II: Designing, building and operating data & AI platforms

Daniel Kapitan

e. daniel@kapitan.net

Talking to AI by Yutong Liu & Kingston School of Art / CC-BY 4.0

Attribution & copyright notice

This lecture is based on the following open access materials:

Voltron Data, The Composable Codex
Documentation of the following Python libraries:
DuckDB, polars, Ibis, dagster, hamilton, Shiny for Python, marimo

Source code: https://github.com/anthology-of-data-science/lecture-engineering-data-ai-platforms

This work is licensed under CC BY-SA 4.0

How to manage the data science lifecycle in real life?

The best book on data engineering

Source: Kleppmann & Riccomini (2026)

Chapter 1: Tradeoffs in Data Systems Architecture
Analytical versus Operational Systems - Cloud versus Self-Hosting - Distributed versus Single-Node Systems - Data Systems, Law, and Society
Chapter 2: Defining Nonfunctional Requirements
Case Study: Social Network Home Timelines - Describing Performance - Reliability and Fault Tolerance - Scalability - Maintainability
Chapter 3: Data Models and Query Languages
Relational Model versus Document Model - Graph-Like Data Models - Event Sourcing and CQRS - DataFrames, Matrices, and Arrays
Chapter 4: Storage and Retrieval
Storage and Indexing for OLTP - Data Storage for Analytics - Multidimensional and Full-Text Indexes
(…)
Chapter 11: Batch Processing
Batch Processing in Distributed Systems - Batch Processing Models - Batch Use Cases

The common definition of data engineering

Source: Reis & Housley (2022)

Designing data science & AI platforms

Our dream: a data system that just works

Source: the Composable Codex

The problem

Source: MAD landscape

The Composable Data Management System Manifesto

Source: Pedreira et al. (2023)

The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.

It all started with the relational database management systems (RDBMS)

Source: E.F. Codd (1970)

What makes a database?

System	Catalog	Database
postgresql, mssql, duckdb	database	schema
datafusion, trino	catalog	schema
druid	dataSourceType	dataSource
bigquery	project	database
flink	catalog	database
clickhouse, impala, mysql, pyspark, snowflake		database

Comparing operational and analytical data systems

Source: Kleppmann & Riccomini (2026), chapter 1

Property	Operational systems (OLTP)	Analytic systems (OLAP)
Main read pattern	Point queries (fetch individual records by key)	Aggregate over large number of records
Main write pattern	Create, update, and delete individual records	Bulk import (ETL) or event stream
Human user example	End user of web/mobile application	Internal analyst, for decision support
Type of queries	Fixed, predefined by application	Arbitrary, ad-hoc exploration by analysts
Query volume	Lots of small queries	Few queries, each is complex
Data represents	Latest state of data (current point in time)	History of events that happened over time
Dataset size	Gigabytes to terabytes	Terabytes to petabytes
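
The two read patterns in the table above can be sketched with Python's built-in sqlite3 module, used here as a stand-in for both kinds of system. The orders table, its columns, and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# In-memory database; the table and its rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
con.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)],
)

# Operational (OLTP) read pattern: point query, fetch one record by key.
point = con.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# Analytical (OLAP) read pattern: aggregate over many records.
totals = con.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point)
print(totals)
```

In a real platform the two queries would typically run on different systems: the point query on the operational database, the aggregate on a warehouse or lakehouse.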

Evolution of analytical system architectures

Source: Armbrust et al. (2021)

Unbundling data warehouses

Source: Kleppmann & Riccomini (2026), chapter 4

Component	Function	Open source software
Open query engine	Parse SQL queries, optimize them into execution plans, and execute them against the data	Apache DataFusion, Apache Spark, DuckDB
Open catalog format	Defines which tables are contained in the database	Apache Iceberg, Unity Catalog, Lance
Open table format	Support row inserts and deletions	Apache Iceberg, Lance, Delta Lake, Apache Hudi
Open storage formats	Determines how the rows of a table are encoded as bytes in a file	Apache Parquet, Lance, Apache ORC
Open memory formats	Determines how the rows of a table are encoded as bytes in memory	Apache Arrow

From row-based to column-based systems

Source: Apache Arrow documentation
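
A minimal sketch of the difference between the two layouts, using plain Python lists and dictionaries rather than Apache Arrow itself. The records and field names are hypothetical.

```python
# Hypothetical records, used only to contrast the two layouts.
rows = [
    {"id": 1, "city": "Utrecht", "temp": 11.5},
    {"id": 2, "city": "Delft", "temp": 12.5},
    {"id": 3, "city": "Zwolle", "temp": 9.0},
]

# Row-based layout: all values of one record sit together.
# Reading a single record is cheap, scanning one field touches every record.
first_record = rows[0]

# Column-based layout: all values of one field sit together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only needs that column's list.
mean_temp = sum(columns["temp"]) / len(columns["temp"])

print(columns["city"])
print(mean_temp)
```

Columnar formats apply the same idea to contiguous memory buffers, which is what makes scans and aggregates over a few columns fast.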

The Composable Data Stack takes the unbundling even further

Source: the Composable Codex

End-to-end columnar data with Apache Arrow ADBC and Flight SQL

Source: Apache Arrow project

From components to a whole platform architecture

Source: Andreessen Horowitz

Big Data Is Dead

Source: Jordan Tigani

The rise of embedded databases

Source: the Data Quarry blog

Type of data	Open source embeddable database
Multi-modal	LanceDB, CozoDB
Relational	DuckDB
Key-value documents	RocksDB
Vector	Chroma
Labeled-property graph with Cypher query engine	LadybugDB
Triple store with SPARQL query engine	QLever

So has our dream come true?

Source: the Composable Codex

We are getting very close …

Source: DuckLake

Diagonal scaling of a DuckLake lakehouse

Source: Cloudzero

Vertical scaling of single node query engines that can process up to 100 TB, covering 99% of use cases
Horizontal scaling of blob storage in (sovereign!) data centres

Patterns for building data & AI systems

The most common patterns for building data & AI systems

Source: Buschmann et al. (2013)

Pipes and filters pattern for building data processing flows
Layers pattern for achieving interoperability
Hub-and-spoke event broker topology pattern for federated data systems
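
The first of these patterns, pipes and filters, can be sketched with Python generators. This is one possible reading of the pattern, not a reference implementation; the filter names and the sample input are hypothetical.

```python
def strip_blank(lines):
    """Filter: drop empty lines and trim whitespace."""
    for line in lines:
        if line.strip():
            yield line.strip()


def to_int(records):
    """Filter: convert each record to an integer."""
    for record in records:
        yield int(record)


def pipeline(source, *filters):
    """Pipe: feed the output of each filter into the next one."""
    stream = source
    for apply_filter in filters:
        stream = apply_filter(stream)
    return stream


raw = ["1", " 2 ", "", "40"]
result = list(pipeline(raw, strip_blank, to_int))
print(result)
```

Each filter only knows about its input stream, so filters can be reordered, reused, or tested in isolation.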

ETL, ELT, DAGs, pipelines, dataflows: it’s all the same

Source: Dagster
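
What these terms share is a directed acyclic graph of steps. A minimal sketch, assuming a hypothetical five-step pipeline, using graphlib from the Python standard library (3.9 or later) to derive a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}

# A topological order runs every step after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two extract steps have no mutual dependency, so an orchestrator is free to run them in parallel; only the relative order along the edges is fixed.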

Business process models are also directed acyclic graphs

Source: Kleppmann & Riccomini (2026), chapter 5

Batch data processing has a strong functional flavour

Source: Hamilton

The split-apply-combine strategy for data analysis

Source: Wickham (2011)
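
A minimal sketch of the three steps in plain Python, without any dataframe library. The records and the choice of the mean as the applied function are hypothetical.

```python
from collections import defaultdict

# Hypothetical (key, value) records.
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

# Split: partition the values by key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: compute the same function (here the mean) on every group.
applied = {key: sum(values) / len(values) for key, values in groups.items()}

# Combine: collect the per-group results into one sorted result.
combined = sorted(applied.items())
print(combined)
```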

Overview of data transformations in different libraries

concept	pandas	polars	ibis	PySpark	dplyr	SQL
split	groupby()	group_by()	group_by()	groupBy()	group_by()	GROUP BY
combine	join(), merge()	join()	join(), left_join(), inner_join() etc.	join()	left_join(), inner_join() etc.	LEFT JOIN, JOIN etc.
filtering (row based)	loc[], query()	filter()	filter()	filter()	filter()	WHERE
select (column based)	loc[], iloc[]	select()	select()	select()	select()	SELECT
mutate	assign()	with_columns()	mutate()	withColumn()	mutate()	SELECT ... AS
ordering	sort_values()	sort()	order_by()	orderBy()	arrange()	ORDER BY

Method chaining makes functional code more readable

Source: Tom's (Augspurger) Blog (2016)

tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            jack),
        "crown"),
    "jill"
)

(jack_jill
  .went_up("hill")
  .fetch("water")
  .fell_down("jack")
  .broke("crown")
  .tumble_after("jill")
)
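
The same idea as a runnable sketch. The Table class and its methods are hypothetical and only mimic the shape of a dataframe API: every method returns a new object, which is what makes chaining possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical, minimal table: a tuple of dict rows."""

    rows: tuple

    def filter(self, predicate):
        # Return a new Table instead of modifying this one.
        return Table(tuple(r for r in self.rows if predicate(r)))

    def with_column(self, name, func):
        return Table(tuple({**r, name: func(r)} for r in self.rows))

    def sort(self, key):
        return Table(tuple(sorted(self.rows, key=lambda r: r[key])))


data = Table((
    {"name": "jill", "height": 5},
    {"name": "jack", "height": 3},
    {"name": "hill", "height": 1},
))

result = (data
  .filter(lambda r: r["height"] > 2)
  .with_column("double", lambda r: r["height"] * 2)
  .sort("name")
)
print(result.rows)
```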

Functional programming

Imperative vs. declarative programming
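
A small illustration of the contrast, assuming a hypothetical list of amounts. The imperative version spells out how to compute the result step by step; the declarative version states what result is wanted and leaves the iteration to the language.

```python
amounts = [10, 25, 5, 40]

# Imperative: describe how, with explicit state and control flow.
total_imperative = 0
for amount in amounts:
    if amount > 8:
        total_imperative += amount

# Declarative: describe what, as a single expression.
total_declarative = sum(amount for amount in amounts if amount > 8)

print(total_imperative, total_declarative)
```

SQL takes the declarative style further: the query states the result and the query engine chooses the execution plan.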

Idempotency

Immutability: table partitions as immutable objects

Source: Maxime Beauchemin (2018)
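
Idempotency and immutable partitions can be sketched together, assuming a table modelled as a plain dictionary from partition key to rows. The function names, the partition key, and the data are hypothetical.

```python
def append_load(table, partition_key, rows):
    """Not idempotent: every rerun appends the same rows again."""
    table.setdefault(partition_key, []).extend(rows)


def overwrite_partition(table, partition_key, rows):
    """Idempotent: the partition is replaced as a whole, as an immutable tuple."""
    table[partition_key] = tuple(rows)


appended, overwritten = {}, {}
daily_rows = [("order-1", 10), ("order-2", 25)]

# Simulate running the same daily load twice, for example after a retry.
for _ in range(2):
    append_load(appended, "2024-01-01", daily_rows)
    overwrite_partition(overwritten, "2024-01-01", daily_rows)

print(len(appended["2024-01-01"]))     # duplicates after the rerun
print(len(overwritten["2024-01-01"]))  # same result as a single run
```

Treating each partition as an object that is only ever written in full makes reruns and backfills safe, because the outcome does not depend on how often the load ran.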

Introducing Software-Defined Assets

Source: Dagster

Software-Defined Assets bring it all together

Source: Dagster

Declarative Nature: you declare the end state of an asset and the orchestrator takes care of the execution. This shifts the focus from task execution to asset production.
Observability and Scheduling: assets give better observability and allow for advanced scheduling, so it is easier to see the state of each asset and when it should be updated.
Environment Agnosticism: the same asset definitions can be used across different environments, such as development and production, without changes to the asset code.
Data Lineage: clear data lineage makes it easier to understand data flows and debug issues.
Integration with External Tools: the orchestrator can integrate assets generated by other tools such as dbt.
Rich Metadata and Grouping: assets carry rich metadata, which is useful for organizing and searching assets.
Partitioning and Backfills: SDAs support time partitioning and backfills out of the box, which is useful for managing historical data and ensuring data consistency.
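
The declarative idea behind the list above can be sketched in a few lines of standard-library Python. This is not the Dagster API: the asset decorator, the ASSETS registry, and the materialize function are hypothetical names. The sketch assumes that an asset's upstream dependencies are given by its parameter names.

```python
import inspect

ASSETS = {}  # hypothetical registry: asset name -> defining function


def asset(func):
    """Register a function as the definition of the asset with the same name."""
    ASSETS[func.__name__] = func
    return func


@asset
def raw_orders():
    return [10, 25, 5]


@asset
def order_total(raw_orders):
    return sum(raw_orders)


@asset
def report(order_total):
    return f"total={order_total}"


def materialize(name, cache=None):
    """Produce an asset by first producing its upstream assets (lineage)."""
    cache = {} if cache is None else cache
    if name not in cache:
        func = ASSETS[name]
        upstream = inspect.signature(func).parameters
        inputs = {dep: materialize(dep, cache) for dep in upstream}
        cache[name] = func(**inputs)
    return cache[name]


result = materialize("report")
print(result)
```

The caller only names the asset it wants; the order of execution follows from the declared dependencies rather than from a hand-written task sequence.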

Same workflow for machine learning pipelines

MLOps

Stages of machine learning CI/CD automation pipeline

Source: Google Cloud docs

MLOps level 0

Source: Google Cloud docs

MLOps level 1

Source: Google Cloud docs

MLOps level 2

Source: Google Cloud docs

The pipeline consists of the following stages:

Development and experimentation: You iteratively try out new ML algorithms and new modeling where the experiment steps are orchestrated. The output of this stage is the source code of the ML pipeline steps that are then pushed to a source repository.
Pipeline continuous integration: You build source code and run various tests. The outputs of this stage are pipeline components (packages, executables, and artifacts) to be deployed in a later stage.
Pipeline continuous delivery: You deploy the artifacts produced by the CI stage to the target environment. The output of this stage is a deployed pipeline with the new implementation of the model.
Automated triggering: The pipeline is automatically executed in production based on a schedule or in response to a trigger. The output of this stage is a trained model that is pushed to the model registry.
Model continuous delivery: You serve the trained model as a prediction service for the predictions. The output of this stage is a deployed model prediction service.
Monitoring: You collect statistics on the model performance based on live data. The output of this stage is a trigger to execute the pipeline or to execute a new experiment cycle.
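
The last stage can be illustrated with a minimal sketch of a monitoring check that decides whether to trigger the pipeline again. The function name, the accuracy metric, the baseline, and the tolerance are all hypothetical choices; real systems may monitor other statistics.

```python
from statistics import mean


def should_retrain(live_accuracy, baseline, tolerance=0.05):
    """Return True when mean live accuracy drops below baseline minus tolerance."""
    return mean(live_accuracy) < baseline - tolerance


# Hypothetical accuracy measurements collected on live data.
healthy = should_retrain([0.91, 0.90, 0.92], baseline=0.92)
degraded = should_retrain([0.80, 0.82, 0.78], baseline=0.92)

print(healthy, degraded)
```

In an automated setup the True result would be the trigger that executes the training pipeline or starts a new experiment cycle.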

The most complete open source MLOps library

Source: MLflow

MLflow Tracking & Experiments

Source: MLflow Tracking

MLflow Model Registry

Source: MLflow Model Registry

MLflow Model Deployment

Source: MLflow

Hidden technical debt in ML systems

Source: Sculley et al. (2015)
