Engineering for data science & AI

A two-day Hitchhikers’ guide to the stuff
that data professionals spend most of their time on

The scope of real-world data science

Source: Fairness and machine learning (2023)




Etymology

  • datum: something given (fr. donné)
  • informare: formation of the mind
  • modulus: unit of measurement
  • gnosis: knowledge

What are data?

Source: Fairness and machine learning (2023)



  • Data begins with counting
  • From relational databases to document databases and knowledge databases
  • From data to information to knowledge

But what is “learning” and what are “models”?

Source: Fairness and machine learning (2023)



  • Models as tools to obtain factual knowledge
  • Different types of models: machine learning model, generative AI model, causal model, deductive model
  • How do data science and “AI” change how we work with models?

The Ladder of Causality

Source: Judea Pearl (2018)



Remember:

Using data and models in real-world applications

Source: Fairness and machine learning (2023)



  • Platforms for working with data and models
  • Governance of data & models in organizations
  • Ethical, legal and societal impact of using data & models
  • Human-computer interactions

Thinking, Fast and Slow

Source: Daniel Kahneman (2011)


Engineering for data science & AI



Part I

  • Data as measurement:
    the basics of working with data
  • The relational database
  • Bringing semantics back
    into data engineering

Part II

  • Designing
    data science & AI platforms
  • Patterns for building
    data & AI systems
  • MLOps

Data as measurement

Data starts with counting

Source: H. Bruderer (2024)



Lebombo bone

  • Oldest tally stick, 40,000 years old
  • 29 marks: lunar cycle, menstrual cycle?
  • … but a piece is missing, so the tally may have continued


Bone of Ishango (picture left)

  • 20,000 years old
  • 168 parallel notches in all, engraved on three sides and arranged in groups
  • Meaning remains unclear

Data starts with counting

Source: K. Houston (2023)



Mesopotamia, around 3300 BC

  • One of the world’s earliest civilizations: animal husbandry, cultivation of crops
  • Symbolic accounting system in cuneiform
  • Sexagesimal number system (base 60): counting the twelve finger joints of one hand against the five fingers of the other gives 12 × 5 = 60

Stevens’ levels of measurement

Source: Wikipedia



| Scale | Measure property | Math operations | Advanced operations | Central tendency |
|---|---|---|---|---|
| Nominal | classification, membership | =, ≠ | grouping | mode |
| Ordinal | comparison, level | >, < | sorting | median |
| Interval | difference, affinity | +, − | yardstick | mean, deviation |
| Ratio | magnitude, amount | ×, / | ratio | geometric mean, variation |
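
The table above implies that the scale of a variable limits which summary statistic is meaningful. As a minimal sketch, the snippet below picks the central tendency per scale using only Python's standard `statistics` module. The function name `central_tendency` and the sample values are assumptions made for illustration.

```python
import statistics

def central_tendency(values, scale):
    """Return the central tendency that is meaningful for the given scale."""
    if scale == "nominal":    # only = and ≠ are defined: most frequent category
        return statistics.mode(values)
    if scale == "ordinal":    # < and > are defined: middle value after sorting
        return statistics.median_low(values)
    if scale == "interval":   # + and - are defined: arithmetic mean
        return statistics.mean(values)
    if scale == "ratio":      # x and / are defined: geometric mean
        return statistics.geometric_mean(values)
    raise ValueError(f"unknown scale: {scale}")

print(central_tendency(["red", "blue", "red"], "nominal"))
print(central_tendency([1, 2, 2, 3, 5], "ordinal"))
print(central_tendency([18.0, 20.0, 22.0], "interval"))
print(central_tendency([1.0, 4.0, 16.0], "ratio"))
```

`median_low` is used for the ordinal case because it always returns a value that actually occurs in the data, which avoids averaging two ranks.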

Types of data

Source: Floridi (2009)



| Type | Description |
|---|---|
| Primary data | The principal data stored, e.g., in a database, for example a simple array of numbers that measures the battery of a car. The red light of the low battery indicator flashing is assumed to be an instance of primary data conveying primary information. |
| Secondary data | The converse of primary data, constituted by their absence (one could call them anti-data). You usually suspect your car battery is flat when the engine fails to make any of the usual noise. |
| Metadata | These are indications about the nature of some other (usually primary) data. They describe properties such as location, format, updating, availability, usage restrictions, and so forth. |

Types of data

Source: Floridi (2009)



| Type | Description |
|---|---|
| Operational data | These are data regarding the operations of the whole data system and the system's performance. Correspondingly, operational information is information about the dynamics of an information system. Suppose the car has a yellow light that, when flashing, indicates that the car checking system is malfunctioning. The fact that the light is on may indicate that the low battery indicator is not working properly, thus undermining the hypothesis that the battery is flat. |
| Derivative data | These are data that can be extracted from some data whenever the latter are used as indirect sources in search of patterns, clues or inferential evidence about things other than those directly addressed by the data themselves, e.g. for comparative and quantitative analyses. From someone's credit card bill, concerning e.g. the purchase of petrol at a certain petrol station, one may derive information about their whereabouts at a given time. This category is difficult to define precisely. |

Tidy data

Source: R for Data Science (2nd edition, 2023)



The relational database

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)


Databases are often at the heart of an information system




Relational algebra as the mathematical foundation

Source: GeeksforGeeks (2026)


Consider this description of our domain of interest


  • Our company is divided into departments. Each department has a unique name, a unique department number and a department manager. The date on which the manager started is recorded. A department can be spread over multiple locations.
  • Every employee works for one specific department. Almost everyone has one supervisor, and a supervisor may supervise several employees. Every employee has a unique number, a name, a birth date, a gender and a home address.
  • An employee works on one or more projects. Each project is controlled by one department, has a unique name and number, and is based at one location. Project members need not work full-time on a project; they usually work a fixed number of hours per week on it.
  • An employee may have several dependents: children and/or a spouse. Dependents are identified by name and have a birth date and a gender. The type of relationship to the employee is recorded as well.

Chen notation


Solution
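
One possible translation of the domain description into relational tables is sketched below, using Python's built-in `sqlite3` module with an in-memory database. All table and column names are assumptions chosen for illustration; they are one reading of the description, not the only valid model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE department (
    dept_no            INTEGER PRIMARY KEY,
    name               TEXT NOT NULL UNIQUE,
    manager_emp_no     INTEGER REFERENCES employee(emp_no),
    manager_start_date TEXT
);
CREATE TABLE department_location (      -- a department can have many locations
    dept_no  INTEGER NOT NULL REFERENCES department(dept_no),
    location TEXT NOT NULL,
    PRIMARY KEY (dept_no, location)
);
CREATE TABLE employee (
    emp_no        INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    birth_date    TEXT,
    gender        TEXT,
    home_address  TEXT,
    dept_no       INTEGER NOT NULL REFERENCES department(dept_no),
    supervisor_no INTEGER REFERENCES employee(emp_no)   -- NULL: no supervisor
);
CREATE TABLE project (
    project_no INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    location   TEXT,
    dept_no    INTEGER NOT NULL REFERENCES department(dept_no)
);
CREATE TABLE works_on (                 -- many-to-many: employee <-> project
    emp_no         INTEGER NOT NULL REFERENCES employee(emp_no),
    project_no     INTEGER NOT NULL REFERENCES project(project_no),
    hours_per_week REAL,
    PRIMARY KEY (emp_no, project_no)
);
CREATE TABLE dependent (                -- identified by name within one employee
    emp_no       INTEGER NOT NULL REFERENCES employee(emp_no),
    name         TEXT NOT NULL,
    birth_date   TEXT,
    gender       TEXT,
    relationship TEXT,
    PRIMARY KEY (emp_no, name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Note how the multi-valued attribute (locations) and the many-to-many relationship (works on) each become a separate table, while the one-to-many relationships become foreign key columns.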


Four layers of data modeling

Source: Geonovum (2024)



  • Semantic: fact-based modeling, ontology modeling
  • Conceptual: entity-relationship modeling, UML modeling
  • Logical: dimensional modeling, data vault modeling
  • Physical: DuckDB, Polars, SQL Server etc.

Relational models in data warehouses: the dimensional model (aka star schema)

Source: dbt documentation (2023)


Relational models in data warehouses: the data vault

Source: Rahma Hassan (2023)


The relational model vs. the document model




{
  "user_id":     251,
  "first_name":  "Barack",
  "last_name":   "Obama",
  "headline":    "Former President of the United States of America",
  "region_id":   "us:91",
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "President", "organization": "United States of America"},
    {"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
  ],
  "education": [
    {"school_name": "Harvard University",  "start": 1988, "end": 1991},
    {"school_name": "Columbia University", "start": 1981, "end": 1983}
  ],
  "contact_info": {
    "website": "https://barackobama.com",
    "twitter": "https://twitter.com/barackobama"
  }
}
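
To make the contrast with the relational model concrete, the sketch below splits a trimmed copy of such a document into flat rows, one list per table. The function name `shred` and the resulting table layout are assumptions made for illustration.

```python
import json

profile = json.loads("""
{
  "user_id": 251,
  "first_name": "Barack",
  "last_name": "Obama",
  "positions": [
    {"job_title": "President", "organization": "United States of America"},
    {"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1988, "end": 1991},
    {"school_name": "Columbia University", "start": 1981, "end": 1983}
  ]
}
""")

def shred(doc):
    """Split one nested document into flat rows, one list per table."""
    user_id = doc["user_id"]
    users = [(user_id, doc["first_name"], doc["last_name"])]
    # Each nested list becomes a child table that points back via user_id.
    positions = [(user_id, p["job_title"], p["organization"])
                 for p in doc["positions"]]
    education = [(user_id, e["school_name"], e["start"], e["end"])
                 for e in doc["education"]]
    return {"users": users, "positions": positions, "education": education}

tables = shred(profile)
for name, rows in tables.items():
    print(name, rows)
```

The document keeps the one-to-many data next to the user it belongs to, whereas the relational form needs a join on `user_id` to reassemble the profile.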

But semantics often get lost in data models
Example: different conceptualizations of sex and gender

Source: Zhang, Cornet & Benis (2024)



Bringing semantics back into data engineering

From data to information and knowledge

Source: Floridi (2009)


General definition of information

Source: Floridi (2009)



Information = Data + Meaning

  • multiple data points
  • data is well-formed (syntax)
  • data is meaningful in a certain context (semantics)

What is the meaning of this statement?






“Daniel has a high blood pressure”

The semantics of blood pressure

Source: Edelman et al. (2024)



From data to information

Source: Floridi (2009)



| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Data | Signal & Measurement | Data is not “raw” but produced by instruments designed under specific physical theories. | The voltage fluctuations from the sensor translated into a numerical output (e.g., 145/95). |
| Information (Statistical) | Entropy & Surprise | Shannon Information Theory: The statistical novelty or reduction of uncertainty within a signal, regardless of meaning. | A reading of 145 mmHg is “surprising” (high information content) if the patient’s historical baseline is 120 mmHg. |
| Information (Semantic) | Context & Relations | Ontologies & Knowledge Graphs: Data structured via schemas to provide meaning (units, subjects, and temporal states). | Linking the “145” to “mmHg,” “Patient X,” and “Resting State” within a standardized medical ontology. |
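
The “surprise” of a reading can be quantified as Shannon self-information, which is −log2(p) bits for an outcome with probability p. The sketch below estimates p from a patient's past readings. The binning by 10 mmHg, the add-one smoothing, the number of bins and the baseline values are all assumptions made for illustration, not a clinical method.

```python
import math
from collections import Counter

def surprisal_bits(probability):
    """Shannon self-information of an outcome, in bits."""
    return -math.log2(probability)

def reading_surprisal(history, reading, bin_width=10, n_bins=10):
    """Surprisal of a new reading given a patient's past readings.

    Readings are grouped into bins of `bin_width` mmHg; add-one smoothing
    keeps the probability of a never-seen bin above zero.
    """
    counts = Counter(value // bin_width for value in history)
    seen = counts[reading // bin_width]
    probability = (seen + 1) / (len(history) + n_bins)
    return surprisal_bits(probability)

# Made-up systolic baseline around 120 mmHg.
history = [118, 121, 119, 122, 120, 117, 123, 120, 121, 119]
print(round(reading_surprisal(history, 121), 2))  # typical reading
print(round(reading_surprisal(history, 145), 2))  # unusual reading
```

The unusual reading yields more bits than the typical one, and this holds regardless of what the number means: meaning only enters at the semantic level.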

… and from information to knowledge

Source: Floridi (2009)



| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Knowledge (Propositional) | Justified True Belief (JTB) | Classical Epistemology: Knowledge-that. A claim that is believed, is true, and has rigorous justification. | The clinician’s belief that “Patient X has hypertension,” justified by multiple readings and clinical guidelines. |
| Knowledge (Procedural) | Non-Propositional Skill | Ryle’s “Knowledge-How”: The tacit ability to perform a task that cannot be fully captured in formal data structures. | The nurse’s skill in correctly placing the cuff and identifying the specific Korotkoff sounds amidst noise. |
| Knowledge (Acquaintance) | Personal experience | Familiarity with a person, place, or thing. | The nurse is familiar with the hospital. |

The return of the knowledge graph

Source: Barrasa & Webber (2023)


Semantic triples

Source: Wikipedia



  • A semantic triple, or RDF triple or simply triple, is the atomic data entity in the Resource Description Framework (RDF) data model
  • A triple is a sequence of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions
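
As a minimal sketch of the subject–predicate–object idea, the snippet below stores triples as plain Python tuples and answers simple pattern queries. It does not use an RDF library; the `ex:` identifiers and the `match` helper are made up for illustration.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Illustrative statements about one blood pressure measurement.
graph = [
    Triple("ex:Daniel", "ex:hasMeasurement", "ex:bp1"),
    Triple("ex:bp1", "ex:systolic", "145"),
    Triple("ex:bp1", "ex:diastolic", "95"),
    Triple("ex:bp1", "ex:unit", "ex:mmHg"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# Everything that is stated about the measurement ex:bp1.
for t in match(graph, subject="ex:bp1"):
    print(t.predicate, t.obj)
```

Query languages such as SPARQL generalise this wildcard matching to patterns that span several triples at once.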

System data of the Dutch base registries (Stelselgegevens basisregistraties)

Source: stelselcatalogus.nl



KIK-V ontology for the Dutch care sector

Source: Zorginstituut KIK-V


The Ontology Pipeline

Source: Jessica Talisman (2025)


Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and with the introduction of logical reasoning. By assigning classes, properties, relations and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems. Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.

Are graphs the future?



I certainly think so

Caution

  • The RDF/SPARQL stack is mostly used by the ontology community
  • Labeled property graphs are more commonly used by companies
  • Graph Query Language (GQL) is the new ISO standard; it is similar to SQL and derived from Cypher (Neo4j)
  • There is also GraphQL, which is an API query language and something completely different from GQL

SPARQL vs. Cypher/GQL comparison

Source: Arthur Keen (2018)


Choose your data structure and engine wisely

Source: Rishabh Agarwal (2025)


The elephant is your friend

Source: Pigsty



  • One physical engine: PostgreSQL is the most widely used open source relational database management system
  • Supports different data structures: not only relational but also document model (JSONB) and labeled property graph (Apache AGE)
  • … plus many use-case specific extensions such as time-series database, spatial database etc.

Online safari


Designing data science & AI platforms

How to manage
the data science lifecycle in real-world applications?




The best book on data engineering

Source: Kleppmann & Riccomini (2026)


  • Chapter 1: Tradeoffs in Data Systems Architecture
    Analytical versus Operational Systems - Cloud versus Self-Hosting - Distributed versus Single-Node Systems - Data Systems, Law, and Society
  • Chapter 2: Defining Nonfunctional Requirements
    Case Study: Social Network Home Timelines - Describing Performance - Reliability and Fault Tolerance - Scalability - Maintainability
  • Chapter 3: Data Models and Query Languages
    Relational Model versus Document Model - Graph-Like Data Models - Event Sourcing and CQRS - DataFrames, Matrices, and Arrays
  • Chapter 4: Storage and Retrieval
    Storage and Indexing for OLTP - Data Storage for Analytics - Multidimensional and Full-Text Indexes
  • (…)
  • Chapter 11: Batch Processing
    Batch Processing in Distributed Systems - Batch Processing Models - Batch Use Cases

The common definition of data engineering

Source: Reis & Housley (2022)



Our dream: a data system that just works

Source: the Composable Codex



The problem

Source: MAD landscape


The Composable Data Management System Manifesto

Source: Pedreira et al. (2023)



The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)


What makes a database?



| System | Catalog | Database |
|---|---|---|
| postgresql, mysql, mssql, duckdb | database | schema |
| datafusion, trino | catalog | schema |
| druid | dataSourceType | dataSource |
| bigquery | project | database |
| flink | catalog | database |
| clickhouse, impala, mysql, pyspark, snowflake | | database |

Comparing operational and analytical data systems

Source: Kleppmann & Riccomini (2026), chapter 1


| Property | Operational systems (OLTP) | Analytical systems (OLAP) |
|---|---|---|
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts |
| Query volume | Lots of small queries | Few queries, each is complex |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
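
The two read patterns in the table can be illustrated on a toy dataset. The sketch below uses Python's built-in `sqlite3` module; the `orders` table and its contents are made up for illustration, and a single small engine is used only to show the shape of the queries, not their performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
             "customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)],
)

# Operational (OLTP) style: a point query fetches one record by its key.
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analytical (OLAP) style: an aggregate scans many records at once.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point)
print(totals)
```

The point query touches one row through the primary key, while the aggregate has to read the `amount` of every row, which is why analytical engines favour column-oriented storage.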

Evolution of analytical system architectures

Source: Armbrust et al. (2021)


Opening up the data warehouses

Source: Kleppmann & Riccomini (2026), chapter 4


| Component | Function | Open source software |
|---|---|---|
| Open query engine | Parse SQL queries, optimize them into execution plans, and execute them against the data | Apache DataFusion, Apache Spark, DuckDB |
| Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance |
| Open table format | Support row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi |
| Open storage formats | Determines how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC |
| Open memory formats | Determines how the rows of a table are encoded as bytes in memory | Apache Arrow |

From row-based to column-based systems

Source: Apache Arrow documentation


The Composable Data Stack takes the unbundling even further

Source: the Composable Codex



End-to-end columnar data with Apache Arrow ADBC and Flight SQL


JDBC/ODBC: row-based database connectivity protocols

Source: Apache Arrow project


Arrow Database Connectivity (ADBC) is column-based

Source: Apache Arrow project


Iceberg: an open table format and catalog

Source: Apache Iceberg specification



Iceberg catalog

Source: Dremio


Iceberg metadata file

Source: Dremio


Big Data Is Dead

Source: Jordan Tigani


The rise of embedded databases

Source: the Data Quarry blog


Type of data Open source embeddable database
Multi-modal
Relational
Key-value documents
Vector
Labeled-property graph with Cypher query engine
Triple store with SPARQL query engine

So has our dream come true?

Source: the Composable Codex


We are getting very close …

Source: DuckLake



Diagonal scaling of a DuckLake lakehouse

Source: Cloudzero



  • Vertical scaling of single-node query engines that can process up to 100 TB, covering 99% of use cases
  • Horizontal scaling of blob storage in (sovereign!) data centres

From components to a whole platform architecture

Source: Andreessen Horowitz



Introducing the Single-Repo Data Platform (SRDP)

Source: SRDP Hub


Patterns for building data & AI systems

The most common patterns for building data & AI systems

Source: Buschmann et al. (2013)



  • Pipes and filters pattern for batch data processing flows
  • Layers pattern for achieving interoperability
  • Hub-and-spoke event broker topology pattern for federated data systems
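
As a minimal sketch of the pipes and filters pattern for batch processing, the snippet below chains Python generators: each filter consumes the stream of the previous stage and yields a new stream. The stage names and the sample records are assumptions made for illustration.

```python
def read_lines(lines):
    """Source: yield raw records one at a time."""
    for line in lines:
        yield line

def parse(records):
    """Filter: turn 'name,value' strings into (name, float) tuples."""
    for record in records:
        name, value = record.split(",")
        yield name.strip(), float(value)

def drop_negative(rows):
    """Filter: discard rows with a negative value."""
    for name, value in rows:
        if value >= 0:
            yield name, value

def total(rows):
    """Sink: reduce the stream to a single number."""
    return sum(value for _, value in rows)

raw = ["a, 1.5", "b, -2.0", "c, 3.0"]
# Each stage consumes the output of the previous one, like a Unix pipe.
result = total(drop_negative(parse(read_lines(raw))))
print(result)
```

Because every filter only depends on its input stream, stages can be tested, reordered or replaced independently.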

ETL, ELT, DAGs, pipelines, dataflows: it’s all the same

Source: Dagster


Business process models are also directed acyclic graphs

Source: Kleppmann & Riccomini (2026), chapter 5




Batch data processing has a strong functional flavour

Source: Hamilton



Why functional data engineering?
The problem of slowly changing dimensions

Source: Wikipedia


Why functional data engineering?
Immutability, snapshots and partitions

Source: Maxime Beauchemin (2018)


Idempotency


Warning

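# MyWeatherService is assumed to be a weather API client defined elsewhere.
def current_temperature(location: str) -> int:
    # The result depends on the moment of the call, not only on the arguments.
    return MyWeatherService().get_current_temperature(location)

def non_idempotent_function(location: str, destination: str) -> None:
    # Each run may overwrite the file with a different temperature.
    with open(destination, 'w') as f:
        f.write(str(current_temperature(location)))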

Tip

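# MyWeatherService is assumed to be a weather API client defined elsewhere.
def get_temperature(timestamp: str, location: str) -> int:
    # The timestamp is an explicit input, so the lookup is repeatable.
    return MyWeatherService().get_temperature(location, timestamp)

def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    # Re-running with the same arguments rewrites the file with the same value.
    with open(destination, 'w') as f:
        f.write(str(get_temperature(timestamp, location)))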
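
The difference between the two variants becomes visible when each is run twice. The sketch below is self-contained: `FakeWeatherService` is a hypothetical stand-in for the undefined `MyWeatherService`, and its drifting "current" temperature and single historical value are made up for illustration.

```python
import itertools
import tempfile
from pathlib import Path

class FakeWeatherService:
    """Hypothetical stand-in for a weather API, for illustration only."""
    _clock = itertools.count(15)   # the "current" temperature drifts per call
    _history = {("2024-01-01T12:00", "Utrecht"): 4}

    def get_current_temperature(self, location: str) -> int:
        return next(self._clock)

    def get_temperature(self, location: str, timestamp: str) -> int:
        return self._history[(timestamp, location)]

def write_current(location: str, destination: Path) -> None:
    # Not idempotent: the result depends on when the function runs.
    temperature = FakeWeatherService().get_current_temperature(location)
    destination.write_text(str(temperature))

def write_at(timestamp: str, location: str, destination: Path) -> None:
    # Idempotent: the same arguments always produce the same file.
    temperature = FakeWeatherService().get_temperature(location, timestamp)
    destination.write_text(str(temperature))

def run_twice(func, *args):
    """Run `func` twice against a scratch file and return both outputs."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "temperature.txt"
        outputs = []
        for _ in range(2):
            func(*args, target)
            outputs.append(target.read_text())
        return outputs

print(run_twice(write_current, "Utrecht"))
print(run_twice(write_at, "2024-01-01T12:00", "Utrecht"))
```

Only the second variant can be safely retried or backfilled, because a rerun for the same timestamp reproduces the same output.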

Software-Defined Assets bring it all together

Source: Dagster


  • Declarative Nature: declare the end state of an asset, orchestrator takes care of the execution. Shifts the focus from task execution to asset production.
  • Observability and Scheduling: enhanced observability into your data assets and allow for advanced scheduling. Easier to understand the state of your assets and when they should be updated.
  • Environment Agnosticism: environment-agnostic, same asset definitions can be used across different environments, such as development and production, without changes to the asset code.
  • Data Lineage: clear data lineage, easier to understand data flows and debug issues.
  • Integration with External Tools: the orchestrator can be integrated with assets generated by other tools such as dbt.
  • Rich Metadata and Grouping: assets have rich metadata, which is useful for organizing and searching assets.
  • Partitioning and Backfills: SDAs support time partitioning and backfills out of the box, which is useful for managing historical data and ensuring data consistency.
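
The declarative idea behind the list above can be sketched in plain Python. This is not Dagster's API: the `asset` decorator, the `ASSETS` registry and the `materialize` function are hypothetical names for a toy orchestrator that infers dependencies from parameter names and uses the standard library `graphlib` to order execution.

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}

def asset(func):
    """Register a function as an asset; its parameters name upstream assets."""
    ASSETS[func.__name__] = func
    return func

@asset
def raw_orders():
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]

@asset
def order_totals(raw_orders):
    return sum(row["amount"] for row in raw_orders)

@asset
def report(order_totals):
    return f"total revenue: {order_totals}"

def materialize(assets):
    """Build every asset in dependency order and return their values."""
    deps = {name: set(inspect.signature(func).parameters)
            for name, func in assets.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        values[name] = assets[name](**{d: values[d] for d in deps[name]})
    return values

print(materialize(ASSETS)["report"])
```

The author of an asset only declares what it is and what it needs; the execution order and the lineage graph follow from those declarations.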

Same approach for machine learning pipelines

Source: Dagster


OpenLineage as the standard for metadata collection and data lineage

Source: OpenLineage docs


Marquez is the open source reference implementation of OpenLineage

Source: Marquez project


MLOps

Hidden technical debt in ML systems

Source: Sculley et al. (2015)


Stages of machine learning CI/CD automation pipeline

Source: Google Cloud docs


MLOps level 0

Source: Google Cloud docs


MLOps level 1

Source: Google Cloud docs


MLOps level 2

Source: Google Cloud docs


ONNX as the standard for AI interoperability

Source: ONNX explained


The most complete open source MLOps library

Source: MLflow


MLflow Tracking & Experiments

Source: MLflow Tracking


MLflow Model Registry

Source: MLflow Model Registry


MLflow Model Deployment

Source: MLflow


How about testing?

Source: the Practical Test Pyramid




  • Write tests with different granularity
  • The more high-level you get the fewer tests you should have

How about testing?

Source: Breck et al. (2017)


The test strategy depends on the end product


Type of test
UI
  • Relevant when developing a dashboard or app (an information product)
  • End-user testing of interactive visualisations can be very time-consuming and hence costly!
Contract
  • Relevant when developing REST APIs
  • Differentiate between consumer and producer tests
Integration
  • Often the starting point of testing, e.g. interacting with data stores
  • Consider writing simple checks on datasets when developing ETL pipelines (row counts, number of unique IDs, min-max dates etc.)
Unit
  • You shouldn’t need to write unit tests if you use standard ML libraries
  • Relevant for data processing, e.g. your own utility functions for data cleaning and data validation (row counts!)
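
The simple dataset checks mentioned under integration tests (row counts, unique IDs, min-max dates) could look like the sketch below. The function name `check_dataset`, the field names `id` and `date`, and the sample rows are assumptions made for illustration.

```python
from datetime import date

def check_dataset(rows, expected_rows, min_date, max_date):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate IDs found")
    dates = [row["date"] for row in rows]
    if dates and (min(dates) < min_date or max(dates) > max_date):
        problems.append("dates outside the expected range")
    return problems

good = [
    {"id": 1, "date": date(2024, 1, 1)},
    {"id": 2, "date": date(2024, 1, 2)},
]
# One extra row that repeats an ID and carries an implausible date.
bad = good + [{"id": 2, "date": date(2030, 1, 1)}]

print(check_dataset(good, 2, date(2024, 1, 1), date(2024, 12, 31)))
print(check_dataset(bad, 2, date(2024, 1, 1), date(2024, 12, 31)))
```

Returning a list of problems rather than raising on the first one makes it easy to log every issue in a pipeline run before deciding whether to fail it.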