Engineering data science & AI platforms

Part II: Designing, building and operating data & AI platforms

How to manage the data science lifecycle in real life?




The best book on data engineering

Source: Kleppmann & Riccomini (2026)


  • Chapter 1: Tradeoffs in Data Systems Architecture
    Analytical versus Operational Systems - Cloud versus Self-Hosting - Distributed versus Single-Node Systems - Data Systems, Law, and Society
  • Chapter 2: Defining Nonfunctional Requirements
    Case Study: Social Network Home Timelines - Describing Performance - Reliability and Fault Tolerance - Scalability - Maintainability
  • Chapter 3: Data Models and Query Languages
    Relational Model versus Document Model - Graph-Like Data Models - Event Sourcing and CQRS - DataFrames, Matrices, and Arrays
  • Chapter 4: Storage and Retrieval
    Storage and Indexing for OLTP - Data Storage for Analytics - Multidimensional and Full-Text Indexes
  • (…)
  • Chapter 11: Batch Processing
    Batch Processing in Distributed Systems - Batch Processing Models - Batch Use Cases

The common definition of data engineering

Source: Reis & Housley (2022)



Designing data science & AI platforms

Our dream: a data system that just works

Source: the Composable Codex



The problem

Source: MAD landscape


The Composable Data Management System Manifesto

Source: Pedreira et al. (2023)



The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.

It all started with the relational database management systems (RDBMS)

Source: E.F. Codd (1970)


What makes a database?


System | Catalog | Database
postgresql, mysql, mssql, duckdb | database | schema
datafusion, trino | catalog | schema
druid | dataSourceType | dataSource
bigquery | project | database
flink | catalog | database
clickhouse, impala, mysql, pyspark, snowflake | - | database

Comparing operational and analytical data systems

Source: Kleppmann & Riccomini (2026), chapter 1



Property | Operational systems (OLTP) | Analytical systems (OLAP)
Main read pattern | Point queries (fetch individual records by key) | Aggregate over a large number of records
Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream
Human user example | End user of web/mobile application | Internal analyst, for decision support
Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts
Query volume | Lots of small queries | Few queries, each is complex
Data represents | Latest state of data (current point in time) | History of events that happened over time
Dataset size | Gigabytes to terabytes | Terabytes to petabytes
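
The two read patterns in the table can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table and its rows are made up, and SQLite stands in for both kinds of system.

```python
import sqlite3

# Hypothetical in-memory table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0), (4, "carol", 50.0)],
)

# Operational (OLTP) style: a point query that fetches one record by key.
point_row = conn.execute(
    "SELECT id, customer, amount FROM orders WHERE id = ?", (3,)
).fetchone()

# Analytical (OLAP) style: an aggregate that scans every record.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point_row)
print(totals)
```

The point query touches a single row through the primary key, while the aggregate has to read the whole `amount` column, which is why analytical systems tend to favour column-oriented storage.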

Evolution of analytical system architectures

Source: Armbrust et al. (2021)


Unbundling data warehouses

Source: Kleppmann & Riccomini (2026), chapter 4



Component | Function | Open source software
Open query engine | Parses SQL queries, optimizes them into execution plans, and executes them against the data | Apache DataFusion, Apache Spark, DuckDB
Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance
Open table format | Supports row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi
Open storage formats | Determine how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC
Open memory formats | Determine how the rows of a table are encoded as bytes in memory | Apache Arrow

From row-based to column-based systems

Source: Apache Arrow documentation
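
A minimal sketch of the difference between the two layouts, using only the Python standard library. The records are invented, and the `array` module is used here as a rough stand-in for a contiguous typed column buffer; it is not how Apache Arrow is implemented.

```python
from array import array

# Row-based layout: all fields of one record are stored together.
rows = [
    {"id": 1, "city": "Ghent", "amount": 20.0},
    {"id": 2, "city": "Leuven", "amount": 35.0},
    {"id": 3, "city": "Ghent", "amount": 30.0},
]

# Column-based layout: all values of one field are stored together.
columns = {
    "id": array("q", (r["id"] for r in rows)),          # 64-bit integers
    "city": [r["city"] for r in rows],                  # variable-length strings
    "amount": array("d", (r["amount"] for r in rows)),  # 64-bit floats
}

# An aggregate over one field has to visit every record in the row layout ...
total_from_rows = sum(r["amount"] for r in rows)

# ... but only a single contiguous buffer in the column layout.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both layouts hold the same data. The column layout simply lets an analytical query skip the fields it does not need.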


The Composable Data Stack takes the unbundling even further

Source: the Composable Codex



End-to-end columnar data with Apache Arrow ADBC and Flight SQL

Source: Apache Arrow project


From components to a whole platform architecture

Source: Andreessen Horowitz



Big Data Is Dead

Source: Jordan Tigani


The rise of embedded databases

Source: the Data Quarry blog


Type of data Open source embeddable database
Multi-modal
Relational
Key-value documents
Vector
Labeled-property graph with Cypher query engine
Triple-store with SPARQL query engine

So has our dream come true?

Source: the Composable Codex


We are getting very close …

Source: DuckLake



Diagonal scaling of a DuckLake lakehouse

Source: CloudZero



  • Vertical scaling of single-node query engines that can process up to 100 TB, covering 99% of use cases
  • Horizontal scaling of blob storage in (sovereign!) data centres

Patterns for building data & AI systems

The most common patterns for building data & AI systems

Source: Buschmann et al. (2013)



  • Pipes and filters pattern for building data processing flows
  • Layers pattern for achieving interoperability
  • Hub-and-spoke event broker topology pattern for federated data systems
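
The first pattern in the list, pipes and filters, can be sketched with plain Python generators. The filter names (`parse`, `drop_negative`, `to_cents`) and the input lines are hypothetical; each filter consumes a stream and yields a new one, and the "pipe" is simply function composition.

```python
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Filter 1: turn 'name, value' text lines into records."""
    for line in lines:
        name, value = line.split(",")
        yield {"name": name.strip(), "value": float(value)}


def drop_negative(records: Iterable[dict]) -> Iterator[dict]:
    """Filter 2: keep only records with a non-negative value."""
    for record in records:
        if record["value"] >= 0:
            yield record


def to_cents(records: Iterable[dict]) -> Iterator[dict]:
    """Filter 3: convert the value to integer cents."""
    for record in records:
        yield {**record, "value": round(record["value"] * 100)}


raw = ["a, 1.5", "b, -2.0", "c, 0.25"]

# The pipe: each filter reads from the previous one, record by record.
result = list(to_cents(drop_negative(parse(raw))))
print(result)
```

Because every filter has the same shape (stream in, stream out), filters can be reordered, tested in isolation, or replaced without touching the others.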

ETL, ELT, DAGs, pipelines, dataflows: it’s all the same

Source: Dagster


Business process models are also directed acyclic graphs

Source: Kleppmann & Riccomini (2026), chapter 5



Batch data processing has a strong functional flavour

Source: Hamilton


The split-apply-combine strategy for data analysis

Source: Wickham (2011)



Overview of data transformations in different libraries


concept | pandas | polars | ibis | PySpark | dplyr | SQL
split | groupby() | group_by() | group_by() | groupBy() | group_by() | GROUP BY
combine | join(), merge() | join() | left_join(), inner_join() etc. | join() | left_join(), inner_join() etc. | LEFT JOIN, JOIN etc.
filtering (row based) | loc[], query() | filter() | filter() | filter() | filter() | WHERE
select (column based) | loc[], iloc[] | select() | select() | select() | select() | SELECT
mutate | assign() | with_columns() | mutate() | withColumn() | mutate() | ADD
ordering | sort_values() | sort() | order_by() | orderBy() | arrange() | ORDER BY
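
To make the split-apply-combine strategy itself concrete, here is a library-free sketch in plain Python. The records are invented for the example. In pandas the same three steps would typically collapse into a single `groupby(...)` followed by an aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input records.
records = [
    {"species": "adelie", "mass": 3700},
    {"species": "gentoo", "mass": 5000},
    {"species": "adelie", "mass": 3900},
    {"species": "gentoo", "mass": 5200},
]

# Split: partition the records by the grouping key.
groups = defaultdict(list)
for record in records:
    groups[record["species"]].append(record["mass"])

# Apply: compute a summary independently for each group.
means = {species: mean(masses) for species, masses in groups.items()}

# Combine: assemble the per-group results into one output table.
summary = [
    {"species": species, "mean_mass": value}
    for species, value in sorted(means.items())
]
print(summary)
```

The apply step never looks across groups, which is what makes the strategy easy to parallelize.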

Method chaining makes functional code more readable

Source: Tom's (Augspurger) Blog (2016)



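# Nested function calls: the steps have to be read inside-out
tumble_after(
    broke(
        fell_down(
            fetch(went_up(jack_jill, "hill"), "water"),
            jack),
        "crown"),
    "jill"
)

# Method chaining: the same steps, read top to bottom
(jack_jill
  .went_up("hill")
  .fetch("water")
  .fell_down("jack")
  .broke("crown")
  .tumble_after("jill")
)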
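
The chained version above is illustrative pseudocode. One way to make it runnable is sketched below, under the assumption of a hypothetical `Rhyme` class that is not part of the original source. Each method returns a new object, which is what allows the calls to be chained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rhyme:
    """Hypothetical stand-in for jack_jill: an immutable log of events."""

    events: tuple = ()

    def _then(self, event: str) -> "Rhyme":
        # Return a new object instead of mutating, so every step can be chained.
        return Rhyme(self.events + (event,))

    def went_up(self, place: str) -> "Rhyme":
        return self._then(f"went up the {place}")

    def fetch(self, thing: str) -> "Rhyme":
        return self._then(f"fetched {thing}")

    def fell_down(self, who: str) -> "Rhyme":
        return self._then(f"{who} fell down")

    def broke(self, what: str) -> "Rhyme":
        return self._then(f"broke {what}")

    def tumble_after(self, who: str) -> "Rhyme":
        return self._then(f"{who} tumbled after")


jack_jill = Rhyme()
story = (jack_jill
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
    .broke("crown")
    .tumble_after("jill")
)
print(story.events)
```

The starting object `jack_jill` is left untouched by the chain, which ties method chaining to the functional ideas on the following slides.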

Functional programming


Imperative vs. declarative programming
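
A small sketch of the contrast, with made-up sensor readings. The imperative version spells out how to compute the totals step by step. The declarative version only states what result is wanted and leaves the how to the query engine (SQLite here, purely as an example).

```python
import sqlite3

# Hypothetical input data.
readings = [("north", 12.0), ("south", 7.5), ("north", 8.0), ("south", 2.5)]

# Imperative: describe HOW, one step and one state change at a time.
totals_imperative = {}
for region, value in readings:
    if region not in totals_imperative:
        totals_imperative[region] = 0.0
    totals_imperative[region] += value

# Declarative: describe WHAT, and let the engine plan the execution.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (region TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
totals_declarative = dict(
    conn.execute("SELECT region, SUM(value) FROM readings GROUP BY region")
)

print(totals_imperative)
print(totals_declarative)
```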


Idempotency


Immutability: table partitions as immutable objects

Source: Maxime Beauchemin (2018)
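
A minimal sketch of how immutable partitions and idempotency fit together, assuming a hypothetical `write_partition` helper and a made-up `day=...` directory layout. A task never appends to a partition. It rebuilds the whole partition and swaps it in, so re-running the task for the same day leaves the data in the same state.

```python
import json
import tempfile
from pathlib import Path


def write_partition(root: Path, day: str, rows: list) -> Path:
    """Rebuild one daily partition from scratch and swap it into place."""
    target = root / f"day={day}" / "part-0.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)  # overwrite the previous version, never append to it
    return target


with tempfile.TemporaryDirectory() as directory:
    root = Path(directory)
    rows = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 1}]

    write_partition(root, "2024-01-01", rows)
    # Re-running the same task (a retry or a backfill) is safe.
    path = write_partition(root, "2024-01-01", rows)

    stored = json.loads(path.read_text())
    partition_files = sorted(p.name for p in path.parent.iterdir())

print(stored)
print(partition_files)
```

An append-based task would have stored every row twice after the second run. The overwrite makes the task idempotent.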


Introducing Software-Defined Assets

Source: Dagster


Software-Defined Assets bring it all together

Source: Dagster


  • Declarative Nature: you declare the end state of an asset and the orchestrator takes care of the execution. The focus shifts from task execution to asset production.
  • Observability and Scheduling: better observability into your data assets and more advanced scheduling. It becomes easier to understand the state of your assets and when they should be updated.
  • Environment Agnosticism: the same asset definitions can be used across different environments, such as development and production, without changes to the asset code.
  • Data Lineage: clear data lineage makes it easier to understand data flows and debug issues.
  • Integration with External Tools: the orchestrator can integrate assets generated by other tools, such as dbt.
  • Rich Metadata and Grouping: assets carry rich metadata, which is useful for organizing and searching assets.
  • Partitioning and Backfills: Software-Defined Assets (SDAs) support time partitioning and backfills out of the box, which is useful for managing historical data and ensuring data consistency.
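
The declarative idea behind the list above can be sketched in a few lines of standard-library Python. This is a toy, not Dagster's implementation: the `asset` decorator, the `ASSETS` registry and `materialize_all` are hypothetical names. The sketch only borrows the convention that a function's parameter names declare its upstream assets.

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}  # hypothetical registry: asset name -> function that produces it


def asset(fn):
    """Hypothetical decorator: register fn as the recipe for asset fn.__name__."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [("alice", 20.0), ("bob", 40.0), ("alice", 15.0)]


@asset
def revenue_per_customer(raw_orders):
    totals = {}
    for customer, amount in raw_orders:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


@asset
def top_customer(revenue_per_customer):
    return max(revenue_per_customer, key=revenue_per_customer.get)


def materialize_all():
    """Derive the DAG from parameter names and run the assets in dependency order."""
    deps = {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}
    order = list(TopologicalSorter(deps).static_order())
    values = {}
    for name in order:
        values[name] = ASSETS[name](**{dep: values[dep] for dep in deps[name]})
    return order, values


order, values = materialize_all()
print(order)
print(values["top_customer"])
```

The asset code never says when or in which order to run. The execution order and the lineage both fall out of the declared dependencies.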

Same workflow for machine learning pipelines


MLOps

Stages of machine learning CI/CD automation pipeline

Source: Google Cloud docs


MLOps level 0

Source: Google Cloud docs


MLOps level 1

Source: Google Cloud docs


MLOps level 2

Source: Google Cloud docs


The most complete open source MLOps library

Source: MLflow


MLflow Tracking & Experiments

Source: MLflow Tracking


MLflow Model Registry

Source: MLflow Model Registry


MLflow Model Deployment

Source: MLflow


Hidden technical debt in ML systems

Source: Sculley et al. (2015)