Engineering for data science & AI

A two-day Hitchhikers’ guide to the stuff
that data professionals spend most of their time on

Daniel Kapitan

e. daniel@kapitan.net

Talking to AI by Yutong Liu & Kingston School of Art / CC-BY 4.0

Attribution & copyright notice

This lecture is based on many open access materials for which references are given on each slide. We highlight the following resources, which provide the backbone of the main narrative:

L. Floridi (2009), Philosophical Conceptions of Information
Voltron Data, The Composable Codex
Documentation of the following Python libraries:
DuckDB, polars, Ibis, dagster, hamilton, Shiny for Python, marimo
Better Images of AI, for non-stereotypical and thought-provoking images

Source code: https://github.com/anthology-of-data-science/lecture-engineering-data-ai-platforms

This work is licensed under CC BY-SA 4.0

The scope of real-world data science

Source: Fairness and machine learning (2023)

Etymology

datum: something given (fr. donné)
informare: formation of the mind
modulus: unit of measurement
gnosis: knowledge

What are data?

Source: Fairness and machine learning (2023)

Data begins with counting
From relational databases to document databases and knowledge databases
From data to information to knowledge

But what is “learning” and what are “models”?

Source: Fairness and machine learning (2023)

Models as tools to obtain factual knowledge
Different types of models: machine learning model, generative AI model, causal model, deductive model
How do data science and “AI” change how we work with models?

The Ladder of Causality

Source: Judea Pearl (2018)

Remember:

Machine learning models are not the only models
Machine learning models are only based on correlations
Machine learning cannot reason (although LLMs look like they can)
Read more on analytical problems with different types of models

Using data and models in real-world applications

Source: Fairness and machine learning (2023)

Platforms for working with data and models
Governance of data & models in organizations
Ethical, legal and societal impact of using data & models
Human-computer interactions

Thinking, Fast and Slow

Source: Daniel Kahneman (2011)

Engineering for data science & AI

Part I

Data as measurement: the basics of working with data
The relational database
Bringing semantics back into data engineering

Part II

Designing data science & AI platforms
Patterns for building data & AI systems
MLOps

Data as measurement

Data starts with counting

Source: H. Bruderer (2024)

Bone of Lebombo

Oldest tally stick, 40,000 years old
29 marks: lunar cycle, menstrual cycle?
… but a piece is missing, so the count may have continued

Bone of Ishango (picture left)

20,000 years old
168 parallel notches in all, engraved on three sides, arranged in groups
Meaning remains unclear

Data starts with counting

Source: K. Housten (2023)

Mesopotamia, around 3300 BC

One of the world’s earliest civilizations: animal husbandry, cultivation of crops
Symbolic accounting system in cuneiform
Use of sexagesimal numbers with 60 as the base: twelve finger joints of one hand times five fingers of the other

Stevens’ levels of measurement

Source: Wikipedia

scale	measure property	math operations	advanced operations	central tendency
nominal	classification, membership	=, ≠	grouping	mode
ordinal	comparison, level	>, <	sorting	median
interval	difference, affinity	+, -	yardstick	mean, deviation
ratio	magnitude, amount	x, /	ratio	geometric mean, variation
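
As a minimal sketch of the table above, the snippet below computes the measure of central tendency that each scale permits, using only Python's standard `statistics` module. The sample values (blood types, satisfaction ranks, temperatures, weights) are made up for illustration.

```python
from statistics import geometric_mean, mean, median_low, mode

# Hypothetical sample data, one list per measurement scale.
blood_types = ["A", "O", "O", "B", "O"]   # nominal: only equality is defined
satisfaction = [1, 2, 2, 3, 5]            # ordinal: ranks, so order is defined
temperature_c = [18.0, 20.0, 22.0]        # interval: differences are meaningful
weight_kg = [2.0, 8.0]                    # ratio: true zero, so ratios are meaningful

nominal_center = mode(blood_types)          # most frequent category
ordinal_center = median_low(satisfaction)   # middle rank, always an observed value
interval_center = mean(temperature_c)       # arithmetic mean
ratio_center = geometric_mean(weight_kg)    # requires strictly positive values
```

Each statistic is also valid on the scales below it in the table (a mode can be taken of ratio data), but not the other way round.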

Types of data

Source: Floridi (2009)

Type	Description
Primary data	The principal data stored e.g. in a database, for example a simple array of numbers that measures the battery of a car. The red light of the low battery indicator flashing is assumed to be an instance of primary data conveying primary information.
Secondary data	The converse of primary data, constituted by their absence (one could call them anti-data). You usually suspect your car battery is flat when the engine fails to make any of the usual noise.
Metadata	These are indications about the nature of some other (usually primary) data. They describe properties such as location, format, updating, availability, usage restrictions, and so forth.

Types of data

Source: Floridi (2009)

Type	Description
Operational data	These are data regarding the operations of the whole data system and the system's performance. Correspondingly, operational information is information about the dynamics of an information system. Suppose the car has a yellow light that, when flashing, indicates that the car checking system is malfunctioning. The fact that the light is on may indicate that the low battery indicator is not working properly, thus undermining the hypothesis that the battery is flat.
Derivative data	These are data that can be extracted from some data whenever the latter are used as indirect sources in search of patterns, clues or inferential evidence about other things than those directly addressed by the data themselves, e.g. for comparative and quantitative analyses. From someone's credit card bill, concerning e.g. the purchase of petrol in a certain petrol station, one may derive the information of her whereabouts at a given time. This category is difficult to define precisely.

Tidy data

Source: R for Data Science (2nd edition, 2023)

The relational database

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)

Databases are often at the heart of an information system

Relational algebra as the mathematical foundation

Source: GeeksforGeeks (2026)

Consider this description of our domain of interest

Our company is divided into departments. Each department has a unique name, a unique department number and a department manager. The date on which this manager started is registered. Furthermore, a department can be located at multiple locations.
All employees work for a specific department. Almost everyone has one supervisor; a supervisor may supervise several employees. Every employee has a unique number, a name, a birth date, a gender, and a home address.
An employee works on one or more projects. Each project is controlled by one department, has a unique name and number, and is based at one location. Project members do not necessarily work full-time on a project, but usually a fixed number of hours per week.
An employee may have several persons who are economically dependent on them: children and/or a spouse. These are identified by name and have a birth date and a gender. The type of relationship is recorded as well.

Chen notation

Solution
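
One possible way to carry part of the domain description into a relational schema is sketched below with Python's built-in `sqlite3` module. The table and column names are assumptions made for this example, not the lecture's reference solution, and only departments, employees, projects and hours per week are covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE department (
    dept_no INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    emp_no        INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    dept_no       INTEGER NOT NULL REFERENCES department(dept_no),
    supervisor_no INTEGER REFERENCES employee(emp_no)  -- nullable: not everyone has one
);
CREATE TABLE project (
    proj_no  INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    location TEXT,
    dept_no  INTEGER NOT NULL REFERENCES department(dept_no)
);
-- Many-to-many relationship between employees and projects.
CREATE TABLE works_on (
    emp_no         INTEGER REFERENCES employee(emp_no),
    proj_no        INTEGER REFERENCES project(proj_no),
    hours_per_week REAL,
    PRIMARY KEY (emp_no, proj_no)
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO department VALUES (1, 'Research')")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(10, "Ada", 1, None), (11, "Bob", 1, 10)],
)
conn.execute("INSERT INTO project VALUES (100, 'Lakehouse', 'Utrecht', 1)")
conn.executemany(
    "INSERT INTO works_on VALUES (?, ?, ?)",
    [(10, 100, 8.0), (11, 100, 32.0)],
)

# Total hours per week spent on each project.
hours = conn.execute("""
    SELECT p.name, SUM(w.hours_per_week)
    FROM works_on AS w JOIN project AS p USING (proj_no)
    GROUP BY p.name
""").fetchall()
```

The many-to-many relationship between employees and projects becomes its own table, which is where the hours-per-week attribute naturally lives.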

Four layers of data modeling

Source: Geonovum (2024)

Semantic:
fact-based modeling, ontology modeling
Conceptual:
Entity-relationship modeling, UML modeling
Logical:
dimensional modeling, data vault modeling
Physical: DuckDB, Polars, SQL Server etc.

Relational models in data warehouses: the dimensional model (aka star schema)

Source: dbt documentation (2023)

Relational models in data warehouses: the data vault

Source: Rahma Hassan (2023)

The relational model vs. the document model

{
  "user_id":     251,
  "first_name":  "Barack",
  "last_name":   "Obama",
  "headline":    "Former President of the United States of America",
  "region_id":   "us:91",
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "President", "organization": "United States of America"},
    {"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
  ],
  "education": [
    {"school_name": "Harvard University",  "start": 1988, "end": 1991},
    {"school_name": "Columbia University", "start": 1981, "end": 1983}
  ],
  "contact_info": {
    "website": "https://barackobama.com",
    "twitter": "https://twitter.com/barackobama"
  }
}
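
To make the contrast concrete, the sketch below takes a shortened copy of the document above and splits it the way a relational model would: one row for the user, and one row per position that points back to the user through `user_id`. The names `users` and `positions` are hypothetical table names chosen for this example.

```python
import json

# A shortened copy of the profile document shown above.
profile = json.loads("""
{
  "user_id": 251,
  "first_name": "Barack",
  "last_name": "Obama",
  "positions": [
    {"job_title": "President", "organization": "United States of America"},
    {"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
  ]
}
""")

# Relational view: the one-to-many "positions" list becomes separate rows,
# each carrying the user_id as a foreign key.
users = [(profile["user_id"], profile["first_name"], profile["last_name"])]
positions = [
    (profile["user_id"], p["job_title"], p["organization"])
    for p in profile["positions"]
]
```

The document keeps everything about one user together (good locality for reading a whole profile), while the relational rows make it easier to query across users, for example all people who worked for the same organization.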

But semantics often get lost in data models
Example: different conceptualizations of sex and gender

Source: Zhang, Cornet & Benis (2024)

Bringing semantics back into data engineering

From data to information and knowledge

Source: Floridi (2009)

General definition of information

Source: Floridi (2009)

Information = Data + Meaning

multiple data points
data is well-formed (syntax)
data is meaningful in a certain context (semantics)

What is the meaning of this statement?

“Daniel has a high blood pressure”

The semantics of blood pressure

Source: Edelman et al. (2024)

From data to information

Source: Floridi (2009)

Level	Dimension	Theoretical anchor	Example
Data	Signal & Measurement	Data is not “raw” but produced by instruments designed under specific physical theories.	The voltage fluctuations from the sensor translated into a numerical output (e.g., 145/95).
Information (Statistical)	Entropy & Surprise	Shannon Information Theory: The statistical novelty or reduction of uncertainty within a signal, regardless of meaning.	A reading of 145 mmHg is “surprising” (high information content) if the patient’s historical baseline is 120 mmHg.
Information (Semantic)	Context & Relations	Ontologies & Knowledge Graphs: Data structured via schemas to provide meaning (units, subjects, and temporal states).	Linking the “145” to “mmHg,” “Patient X,” and “Resting State” within a standardized medical ontology.
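
The statistical notion of "surprise" in the table can be made concrete with Shannon's self-information, which is the negative base-2 logarithm of an event's probability. The sketch below is illustrative only: the probabilities are invented and do not come from real patient data.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Shannon self-information of an event, in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical frequencies of systolic readings in one patient's history.
p_typical_reading = 0.5    # e.g. a reading near the 120 mmHg baseline
p_high_reading = 0.03125   # e.g. a reading around 145 mmHg (1 in 32)

typical_bits = surprisal_bits(p_typical_reading)
high_bits = surprisal_bits(p_high_reading)
```

The rarer reading carries more bits, regardless of what the number means clinically; meaning only enters at the semantic level.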

… and from information to knowledge

Source: Floridi (2009)

Level	Dimension	Theoretical anchor	Example
Knowledge (Propositional)	Justified True Belief (JTB)	Classical Epistemology: Knowledge-that. A claim that is believed, is true, and has rigorous justification.	The clinician’s belief that “Patient X has hypertension,” justified by multiple readings and clinical guidelines.
Knowledge (Procedural)	Non-Propositional Skill	Ryle’s “Knowledge-How”: The tacit ability to perform a task that cannot be fully captured in formal data structures.	The nurse’s skill in correctly placing the cuff and identifying the specific Korotkoff sounds amidst noise.
Knowledge (Acquaintance)	Personal experience	Familiarity with a person, place, or thing.	The nurse is familiar with the hospital.

The return of the knowledge graph

Source: Barrasa & Webber (2023)

Semantic triples

Source: Wikipedia

A semantic triple, or RDF triple or simply triple, is the atomic data entity in the Resource Description Framework (RDF) data model
A triple is a sequence of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions
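
A minimal sketch of the subject-predicate-object idea in plain Python, without any RDF library: triples are stored as tuples and queried with a pattern in which `None` is a wildcard. The `ex:` prefix and all identifiers are made up for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements; "ex:" is a made-up namespace prefix.
graph = [
    Triple("ex:Daniel", "ex:hasMeasurement", "ex:bp1"),
    Triple("ex:bp1", "ex:systolic_mmHg", "145"),
    Triple("ex:bp1", "ex:diastolic_mmHg", "95"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


about_bp1 = match(graph, subject="ex:bp1")
```

A real triple store adds globally unique identifiers (IRIs), typed literals and a query language such as SPARQL on top of this basic pattern matching.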

Stelselgegevens basisregistraties

Source: stelselcatalogus.nl

KIK-V ontology for the Dutch care sector

Source: Zorginstituut KIK-V

The Ontology Pipeline

Source: Jessica Talisman (2025)

Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and with the introduction of logical reasoning. By assigning classes, properties, relations and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems. Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.

Are graphs the future?

I certainly think so

Read this comprehensive introduction by Hogan et al. (2021)
Labeled property graph (Cypher/GQL): try the Neo4j sandbox
RDF Triple Store (SPARQL): try the data.world SPARQL tutorial

Caution

The RDF/SPARQL stack is mostly used by the ontology community
Labeled Property Graphs are more commonly used by companies
Graph Query Language (GQL) is the new ISO standard, similar to SQL and derived from Cypher (Neo4j)
There is also GraphQL, which is an API query language and something completely different from GQL

SPARQL vs. Cypher/GQL comparison

Source: Arthur Keen (2018)

Choose your data structure and engine wisely

Source: Rishabh Agarwal (2025)

The elephant is your friend

Source: Pigsty

One physical engine: PostgreSQL is the most widely used open source relational database management system
Supports different data structures: not only relational but also document model (JSONB) and labeled property graph (Apache AGE)
… plus many use-case specific extensions such as time-series database, spatial database etc.

Online safari

Designing data science & AI platforms

How to manage the data science lifecycle in real-world applications?

The best book on data engineering

Source: Kleppmann & Riccomini (2026)

Chapter 1: Tradeoffs in Data Systems Architecture
Analytical versus Operational Systems - Cloud versus Self-Hosting - Distributed versus Single-Node Systems - Data Systems, Law, and Society
Chapter 2: Defining Nonfunctional Requirements
Case Study: Social Network Home Timelines - Describing Performance - Reliability and Fault Tolerance - Scalability - Maintainability
Chapter 3: Data Models and Query Languages
Relational Model versus Document Model - Graph-Like Data Models - Event Sourcing and CQRS - DataFrames, Matrices, and Arrays
Chapter 4: Storage and Retrieval
Storage and Indexing for OLTP - Data Storage for Analytics - Multidimensional and Full-Text Indexes
(…)
Chapter 11: Batch Processing
Batch Processing in Distributed Systems - Batch Processing Models - Batch Use Cases

The common definition of data engineering

Source: Reis & Housley (2022)

Our dream: a data system that just works

Source: the Composable Codex

The problem

Source: MAD landscape

The Composable Data Management System Manifesto

Source: Pedreira et al. (2023)

The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)

What makes a database?

System	Catalog	Database
postgresql, mysql, mssql, duckdb	database	schema
datafusion, trino	catalog	schema
druid	dataSourceType	dataSource
bigquery	project	database
flink	catalog	database
clickhouse		database
clickhouse, impala, mysql, pyspark, snowflake		database

Comparing operational and analytical data systems

Source: Kleppmann & Riccomini (2026), chapter 1

Property	Operational systems (OLTP)	Analytic systems (OLAP)
Main read pattern	Point queries (fetch individual records by key)	Aggregate over large number of records
Main write pattern	Create, update, and delete individual records	Bulk import (ETL) or event stream
Human user example	End user of web/mobile application	Internal analyst, for decision support
Type of queries	Fixed, predefined by application	Arbitrary, ad-hoc exploration by analysts
Query volume	Lots of small queries	Few queries, each is complex
Data represents	Latest state of data (current point in time)	History of events that happened over time
Dataset size	Gigabytes to terabytes	Terabytes to petabytes
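
The two read patterns in the table can be illustrated on a toy table with Python's built-in `sqlite3` module. This is only a sketch of the query shapes: the `orders` table and its rows are invented, and SQLite itself is an operational-style, row-oriented engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 15.0)],
)

# Operational (OLTP) style: fetch one record by its key.
point_query = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analytical (OLAP) style: aggregate over many records.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```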

Evolution of analytical system architectures

Source: Armbrust et al. (2021)

Opening up the data warehouses

Source: Kleppmann & Riccomini (2026), chapter 4

Component	Function	Open source software
Open query engine	Parse SQL queries, optimize them into execution plans, and execute them against the data	Apache DataFusion, Apache Spark, DuckDB
Open catalog format	Defines which tables are contained in the database	Apache Iceberg, Unity Catalog, Lance
Open table format	Support row inserts and deletions	Apache Iceberg, Lance, Delta Lake, Apache Hudi
Open storage formats	Determines how the rows of a table are encoded as bytes in a file	Apache Parquet, Lance, Apache ORC
Open memory formats	Determines how the rows of a table are encoded as bytes in memory	Apache Arrow

From row-based to column-based systems

Source: Apache Arrow documentation
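
The difference between the two layouts can be sketched with plain Python containers. This is a conceptual illustration with made-up records, not how Arrow lays out memory internally.

```python
# The same three records in a row-oriented layout (one record after another) ...
rows = [
    {"id": 1, "city": "Utrecht", "temp_c": 18.5},
    {"id": 2, "city": "Delft", "temp_c": 17.0},
    {"id": 3, "city": "Leiden", "temp_c": 19.5},
]

# ... and in a column-oriented layout (one list per column).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate has to visit every record in the row layout,
# but only reads a single list in the column layout.
mean_from_rows = sum(row["temp_c"] for row in rows) / len(rows)
mean_from_columns = sum(columns["temp_c"]) / len(columns["temp_c"])
```

Analytical queries typically touch a few columns of many rows, which is why column-based storage and memory formats suit them well.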

The Composable Data Stack takes the unbundling even further

Source: the Composable Codex

End-to-end columnar data with Apache Arrow ADBC and Flight SQL

JDBC/ODBC: row-based database connectivity protocols

Source: Apache Arrow project

Arrow Database Connectivity (ADBC) is column-based

Source: Apache Arrow project

Iceberg: an open table format and catalog

Source: Apache Iceberg specification

Iceberg catalog

Source: Dremio

Iceberg metadata file

Source: Dremio

Big Data Is Dead

Source: Jordan Tigani

The rise of embedded databases

Source: the Data Quarry blog

Type of data	Open source embeddable database
Multi-modal	LanceDB, CozoDB
Relational	DuckDB
Key-value documents	RocksDB
Vector	Chroma
Labeled-property graph with Cypher query engine	LadybugDB
Triple store with SPARQL query engine	QLever

So has our dream come true?

Source: the Composable Codex

We are getting very close …

Source: DuckLake

Diagonal scaling of a DuckLake lakehouse

Source: Cloudzero

Vertical scaling of single-node query engines that can process up to 100 TB, covering 99% of use cases
Horizontal scaling of blob storage in (sovereign!) data centres

From components to a whole platform architecture

Source: Andreessen Horowitz

Introducing the Single-Repo Data Platform (SRDP)

Source: SRDP Hub

Patterns for building data & AI systems

The most common patterns for building data & AI systems

Source: Buschmann et al. (2013)

Pipes and filters pattern for batch data processing flows
Layers pattern for achieving interoperability
Hub-and-spoke event broker topology pattern for federated data systems

ETL, ELT, DAGs, pipelines, dataflows: it’s all the same

Source: Dagster
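
Whatever the name, each of these boils down to a directed acyclic graph of steps that must run in dependency order. A minimal sketch with the standard library's `graphlib` (Python 3.9+); the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "aggregate_revenue": {"join_orders_customers"},
    "publish_report": {"aggregate_revenue"},
}

# static_order() yields a valid execution order and raises
# graphlib.CycleError if the graph contains a cycle.
execution_order = list(TopologicalSorter(pipeline).static_order())
```

An orchestrator does essentially this, plus scheduling, retries, parallel execution of independent steps and bookkeeping of what has already run.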

Business process models are also directed acyclic graphs

Source: Kleppmann & Riccomini (2026), chapter 5

Batch data processing has a strong functional flavour

Source: Hamilton
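
The functional flavour can be sketched as follows: every node in the dataflow is a pure function, and the names of its parameters declare which upstream nodes it depends on. The tiny resolver below only illustrates that idea; it is not Hamilton's actual API, and all function names are invented.

```python
import inspect


# Each function is one node; its parameter names are its upstream dependencies.
def raw_prices() -> list:
    return [10.0, 12.0, 11.0]


def mean_price(raw_prices: list) -> float:
    return sum(raw_prices) / len(raw_prices)


def deviations(raw_prices: list, mean_price: float) -> list:
    return [p - mean_price for p in raw_prices]


def compute(target: str, nodes: dict, cache: dict) -> object:
    """Resolve `target` by recursively computing the inputs named in its signature."""
    if target not in cache:
        func = nodes[target]
        kwargs = {
            name: compute(name, nodes, cache)
            for name in inspect.signature(func).parameters
        }
        cache[target] = func(**kwargs)
    return cache[target]


nodes = {f.__name__: f for f in (raw_prices, mean_price, deviations)}
result = compute("deviations", nodes, {})
```

Because the functions have no side effects, each node can be unit tested in isolation and the graph can be derived from the code itself.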

Why functional data engineering?
The problem of slowly changing dimensions

Source: Wikipedia
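
One common answer to this problem is a "type 2" slowly changing dimension: instead of overwriting an attribute, the current row is closed and a new version is appended. The sketch below shows that bookkeeping in plain Python; the `CustomerVersion` record and its fields are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: list, customer_id: int, new_city: str, change_date: date) -> list:
    """Type 2 update: close the current version and append a new one."""
    updated = []
    for version in history:
        if version.customer_id == customer_id and version.valid_to is None:
            version = replace(version, valid_to=change_date)
        updated.append(version)
    updated.append(CustomerVersion(customer_id, new_city, change_date))
    return updated


history = [CustomerVersion(1, "Utrecht", date(2020, 1, 1))]
history = apply_change(history, 1, "Delft", date(2024, 6, 1))
```

This in-place bookkeeping of validity intervals is exactly the kind of mutation that functional data engineering tries to avoid by keeping full, immutable snapshots per partition instead.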

Why functional data engineering?
Immutability, snapshots and partitions

Source: Maxime Beauchemin (2018)

Idempotency

Warning

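# MyWeatherService is a placeholder for an external weather API client.
def current_temperature(location: str) -> int:
    return MyWeatherService().get_current_temperature(location)


def non_idempotent_function(location: str, destination: str) -> None:
    # The written value depends on when the function runs, so reruns can differ.
    with open(destination, 'w') as f:
        f.write(str(current_temperature(location)))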

Tip

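# MyWeatherService is a placeholder for an external weather API client.
def get_temperature(timestamp: str, location: str) -> int:
    return MyWeatherService().get_temperature(location, timestamp)


def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    # The same arguments give the same file content, however often this is rerun.
    with open(destination, 'w') as f:
        f.write(str(get_temperature(timestamp, location)))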
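
A runnable variant of the idempotent version is sketched below. `FakeWeatherService` is a hypothetical stand-in for the weather service used above, with one hard-coded historical reading, so that the rerun behaviour can actually be checked.

```python
import tempfile
from pathlib import Path


class FakeWeatherService:
    """Hypothetical stand-in for an external weather API with historical lookups."""

    _history = {("2024-06-01T12:00", "Utrecht"): 21}

    def get_temperature(self, location: str, timestamp: str) -> int:
        return self._history[(timestamp, location)]


def write_temperature(timestamp: str, location: str, destination: Path) -> None:
    # The output is fully determined by the arguments, and write_text
    # overwrites the file, so a rerun leaves exactly the same file behind.
    temperature = FakeWeatherService().get_temperature(location, timestamp)
    destination.write_text(str(temperature))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "temperature.txt"
    write_temperature("2024-06-01T12:00", "Utrecht", target)
    first_run = target.read_text()
    write_temperature("2024-06-01T12:00", "Utrecht", target)
    second_run = target.read_text()
```

Passing the timestamp in explicitly is what makes backfills safe: rerunning the task for a past partition reproduces that partition rather than writing today's value into it.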

Software-Defined Assets bring it all together

Source: Dagster

Declarative Nature: declare the end state of an asset, orchestrator takes care of the execution. Shifts the focus from task execution to asset production.
Observability and Scheduling: enhanced observability into your data assets and allow for advanced scheduling. Easier to understand the state of your assets and when they should be updated.
Environment Agnosticism: environment-agnostic, same asset definitions can be used across different environments, such as development and production, without changes to the asset code.
Data Lineage: clear data lineage, easier to understand data flows and debug issues.
Integration with External Tools: the orchestrator can be integrated with assets generated by other tools such as dbt.
Rich Metadata and Grouping: assets have rich metadata, which is useful for organizing and searching assets.
Partitioning and Backfills: SDAs support time partitioning and backfills out of the box, which is useful for managing historical data and ensuring data consistency.

Same approach for machine learning pipelines

Source: Dagster

OpenLineage as the standard for metadata collection and data lineage

Source: OpenLineage docs

Marquez is the open source reference implementation of OpenLineage

Source: Marquez project

MLOps

Hidden technical debt in ML systems

Source: Sculley et al. (2015)

Stages of machine learning CI/CD automation pipeline

Source: Google Cloud docs

MLOps level 0

Source: Google Cloud docs

MLOps level 1

Source: Google Cloud docs

MLOps level 2

Source: Google Cloud docs

The pipeline consists of the following stages:

Development and experimentation: You iteratively try out new ML algorithms and new modeling where the experiment steps are orchestrated. The output of this stage is the source code of the ML pipeline steps that are then pushed to a source repository.
Pipeline continuous integration: You build source code and run various tests. The outputs of this stage are pipeline components (packages, executables, and artifacts) to be deployed in a later stage.
Pipeline continuous delivery: You deploy the artifacts produced by the CI stage to the target environment. The output of this stage is a deployed pipeline with the new implementation of the model.
Automated triggering: The pipeline is automatically executed in production based on a schedule or in response to a trigger. The output of this stage is a trained model that is pushed to the model registry.
Model continuous delivery: You serve the trained model as a prediction service for the predictions. The output of this stage is a deployed model prediction service.
Monitoring: You collect statistics on the model performance based on live data. The output of this stage is a trigger to execute the pipeline or to execute a new experiment cycle.
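
The monitoring stage closes the loop by turning live statistics into a trigger. A deliberately simple sketch of such a trigger follows; the error metric, the threshold of 10% and the sample numbers are all assumptions, and real systems usually track several metrics and data drift as well.

```python
from statistics import mean


def should_retrain(live_errors: list, baseline_error: float, tolerance: float = 0.10) -> bool:
    """Trigger when the mean live error exceeds the baseline by more than `tolerance` (relative)."""
    return mean(live_errors) > baseline_error * (1.0 + tolerance)


# Hypothetical error rates measured on live data.
baseline_error = 0.20
stable_week = [0.19, 0.21, 0.20]
drifting_week = [0.26, 0.30, 0.28]

trigger_stable = should_retrain(stable_week, baseline_error)
trigger_drifting = should_retrain(drifting_week, baseline_error)
```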

ONNX as the standard for AI interoperability

Source: ONNX explained

The most complete open source MLOps library

Source: MLflow

MLflow Tracking & Experiments

Source: MLflow Tracking

MLflow Model Registry

Source: MLflow Model Registry

MLflow Model Deployment

Source: MLflow

How about testing?

Source: the Practical Test Pyramid

Write tests with different granularity
The more high-level you get the fewer tests you should have

How about testing?

Source: Breck et al. (2017)

The test strategy depends on the end product

Type of test	Guidance
UI	Relevant when developing a dashboard or app (information product). End-user testing of interactive visualisations can be very time-consuming and hence costly!
Contract	Relevant when developing REST APIs. Differentiate between consumer and producer tests.
Integration	Often the starting point of testing, e.g. interacting with data stores. Consider writing simple checks on datasets when developing ETL pipelines (row counts, number of unique IDs, min-max dates etc.).
Unit	You shouldn’t need to write unit tests if you use standard ML libraries. Relevant for data processing, e.g. your own utility functions for data cleaning and data validation (row counts!).
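
The simple dataset checks mentioned for ETL pipelines (row counts, unique IDs, min-max dates) could look like the sketch below. The row structure with `id` and `date` keys and the function name `check_dataset` are assumptions for illustration.

```python
from datetime import date


def check_dataset(rows: list, expected_rows: int, earliest: date, latest: date) -> list:
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    if len(rows) != expected_rows:
        failures.append(f"expected {expected_rows} rows, got {len(rows)}")
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        failures.append("duplicate IDs found")
    dates = [row["date"] for row in rows]
    if dates and (min(dates) < earliest or max(dates) > latest):
        failures.append("dates outside expected range")
    return failures


# Hypothetical datasets: one clean, one with an extra, duplicated, out-of-range row.
good = [{"id": 1, "date": date(2024, 1, 1)}, {"id": 2, "date": date(2024, 1, 2)}]
bad = good + [{"id": 2, "date": date(2025, 1, 1)}]

good_failures = check_dataset(good, 2, date(2024, 1, 1), date(2024, 12, 31))
bad_failures = check_dataset(bad, 2, date(2024, 1, 1), date(2024, 12, 31))
```

Returning all failures at once, rather than stopping at the first one, tends to make pipeline logs more useful when several things break together.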
