Engineering for data science & AI

Part I: Bringing semantics back into data & AI engineering
Part II: Designing, building and operating data & AI platforms

The scope of real-world data science

Source: Fairness and machine learning (2023)


From data to information and knowledge

Source: Floridi (2009)


General definition of information

Source: Floridi (2009)



Information = Data + Meaning

  • multiple data points
  • data is well-formed (syntax)
  • data is meaningful in a certain context (semantics)

What is the meaning of this statement?






“Daniel has a high blood pressure”

The semantics of blood pressure

Source: Edelman et al. (2024)



From data to information

Source: Floridi (2009)



Level | Dimension | Theoretical anchor | Example
Data | Signal & Measurement | Data is not “raw” but produced by instruments designed under specific physical theories. | The voltage fluctuations from the sensor translated into a numerical output (e.g., 145/95).
Information (Statistical) | Entropy & Surprise | Shannon Information Theory: The statistical novelty or reduction of uncertainty within a signal, regardless of meaning. | A reading of 145 mmHg is “surprising” (high information content) if the patient’s historical baseline is 120 mmHg.
Information (Semantic) | Context & Relations | Ontologies & Knowledge Graphs: Data structured via schemas to provide meaning (units, subjects, and temporal states). | Linking the “145” to “mmHg,” “Patient X,” and “Resting State” within a standardized medical ontology.
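
The statistical reading of "information" in the table can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the patient's historical systolic readings are normally distributed around a baseline of 120 mmHg with a spread of 8 mmHg (the spread is an assumption, not a figure from the source), and it measures surprisal as the negative base-2 logarithm of the probability of a deviation at least that large.

```python
import math
from statistics import NormalDist

def surprisal_bits(reading: float, baseline: float, spread: float) -> float:
    """Surprisal (in bits) of a reading at least this far from the baseline.

    Assumes readings follow a normal distribution; this is a modelling
    assumption made purely for illustration.
    """
    dist = NormalDist(mu=baseline, sigma=spread)
    distance = abs(reading - baseline)
    # Two-sided tail probability of a deviation of at least `distance`.
    p = 2 * (1 - dist.cdf(baseline + distance))
    return -math.log2(p)

# The spread of 8 mmHg is an assumed value.
typical = surprisal_bits(122, baseline=120, spread=8)
unusual = surprisal_bits(145, baseline=120, spread=8)
print(f"122 mmHg: {typical:.2f} bits, 145 mmHg: {unusual:.2f} bits")
```

Note that the number says nothing about what the reading means clinically: it only quantifies how unexpected the value is given the history.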

… and from information to knowledge

Source: Floridi (2009)



Level | Dimension | Theoretical anchor | Example
Knowledge (Propositional) | Justified True Belief (JTB) | Classical Epistemology: Knowledge-that. A claim that is believed, is true, and has rigorous justification. | The clinician’s belief that “Patient X has hypertension,” justified by multiple readings and clinical guidelines.
Knowledge (Procedural) | Non-Propositional Skill | Ryle’s “Knowledge-How”: The tacit ability to perform a task that cannot be fully captured in formal data structures. | The nurse’s skill in correctly placing the cuff and identifying the specific Korotkoff sounds amidst noise.
Knowledge (Acquaintance) | Personal experience | Familiarity with a person, place, or thing. | The nurse is familiar with the hospital.

But what is “learning” and what are “models”?

Source: Fairness and machine learning (2023)



  • Machine learning model
  • Generative large language model
  • Logical reasoning model
  • Physical model
  • Data model


The Ladder of Causality

Source: Judea Pearl (2018)



Always remember:

  • Machine learning models are not the only models
  • Machine learning models are based only on correlations
  • Machine learning cannot reason (although LLMs look like they can)

Diving into data: different approaches to data modeling



  • The basics: relational modeling
  • Bringing back semantics to data: the ontology pipeline
  • Comparison of different data models & query languages

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)


Databases are often at the heart of an information system




Relational algebra as the mathematical foundation

Source: GeeksforGeeks (2026)


Consider this description of our domain of interest


  • Our company is divided into departments. Each department has a unique name, a unique department number and a department manager. The date on which this manager started is recorded. Furthermore, a department can be located at multiple locations.
  • All employees work for a specific department. Almost everyone has one supervisor; a supervisor may supervise several employees. Every employee has a unique number, a name, a birth date, a gender, and a home address.
  • An employee works on one or more projects. Each project is controlled by one department, has a unique name and number, and is based at one location. Project members do not necessarily work full-time on a project, but usually a fixed number of hours per week.
  • An employee may have several persons who are financially dependent on them: children and/or a spouse. These are identified by name and have a birth date and a gender. The type of relationship is recorded as well.

Chen notation


Solution
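
One possible relational rendering of the domain description, offered as a sketch rather than as the reference solution. All table and column names are illustrative assumptions, the inserted rows are invented sample data, and the in-memory SQLite database is used only so that the example runs without any setup.

```python
import sqlite3

# Table and column names are assumptions chosen for this sketch.
SCHEMA = """
CREATE TABLE department (
    dept_no       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,
    manager_no    INTEGER REFERENCES employee(emp_no),
    manager_since TEXT
);
-- A department can have several locations: separate table.
CREATE TABLE department_location (
    dept_no  INTEGER NOT NULL REFERENCES department(dept_no),
    location TEXT NOT NULL,
    PRIMARY KEY (dept_no, location)
);
CREATE TABLE employee (
    emp_no        INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    birth_date    TEXT,
    gender        TEXT,
    address       TEXT,
    dept_no       INTEGER NOT NULL REFERENCES department(dept_no),
    supervisor_no INTEGER REFERENCES employee(emp_no)  -- nullable: "almost everyone"
);
CREATE TABLE project (
    project_no INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    location   TEXT,
    dept_no    INTEGER NOT NULL REFERENCES department(dept_no)
);
-- Many-to-many between employees and projects, with hours as an attribute.
CREATE TABLE works_on (
    emp_no         INTEGER NOT NULL REFERENCES employee(emp_no),
    project_no     INTEGER NOT NULL REFERENCES project(project_no),
    hours_per_week REAL,
    PRIMARY KEY (emp_no, project_no)
);
-- Weak entity: a dependent is identified by employee plus name.
CREATE TABLE dependent (
    emp_no       INTEGER NOT NULL REFERENCES employee(emp_no),
    name         TEXT NOT NULL,
    birth_date   TEXT,
    gender       TEXT,
    relationship TEXT,
    PRIMARY KEY (emp_no, name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows.
conn.execute("INSERT INTO department (dept_no, name) VALUES (1, 'Research')")
conn.execute("INSERT INTO employee (emp_no, name, dept_no) VALUES (10, 'Ada', 1)")
conn.execute("INSERT INTO project VALUES (100, 'Atlas', 'Utrecht', 1)")
conn.execute("INSERT INTO works_on VALUES (10, 100, 16.0)")

rows = conn.execute(
    """
    SELECT e.name, p.name, w.hours_per_week
    FROM works_on AS w
    JOIN employee AS e ON e.emp_no = w.emp_no
    JOIN project  AS p ON p.project_no = w.project_no
    """
).fetchall()
print(rows)
```

The many-to-many relationship between employees and projects becomes its own table, which is also the natural place for the hours-per-week attribute.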


Relational models in data warehouses: the dimensional model (aka star schema)

Source: dbt documentation (2023)


Relational models in data warehouses: the data vault

Source: Rahma Hassan (2023)


But relational modeling alone is not enough



How would you define a customer?

  • A person, a company?
  • Based on order history, since when?
  • Difference between an ‘occasional’ customer vs a ‘loyal’ customer?
  • Someone who has ordered but whose payment is made by someone else?

KIK-V ontology for the Dutch care sector

Source: Zorginstituut KIK-V


The Ontology Pipeline

Source: Jessica Talisman (2025)


Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and with the introduction of logical reasoning. By assigning classes, properties, relations and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems. Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.

Are graphs the future?



I certainly think so

Caution

  • The RDF/SPARQL stack is mostly used by the ontology community
  • Labeled Property Graphs are more commonly used by companies
  • Graph Query Language (GQL) is the new ISO standard, similar to SQL and derived from Cypher (Neo4j)
  • There is also GraphQL, which is an API format and something completely different from GQL
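
To make the difference between the two graph data models tangible, the sketch below stores the same fact both as RDF-style triples and as a labeled property graph. It uses plain Python structures rather than a real triple store or graph database, and all identifiers (such as "ex:daniel" and "HAS_READING") are made up for illustration.

```python
# RDF style: everything is a (subject, predicate, object) triple.
triples = {
    ("ex:daniel", "rdf:type", "ex:Patient"),
    ("ex:daniel", "ex:hasReading", "ex:reading1"),
    ("ex:reading1", "ex:systolic", 145),
    ("ex:reading1", "ex:unit", "mmHg"),
}

# Labeled property graph: nodes and edges carry labels and key-value properties.
nodes = {
    "daniel": {"labels": {"Patient"}, "props": {"name": "Daniel"}},
    "reading1": {"labels": {"Reading"}, "props": {"systolic": 145, "unit": "mmHg"}},
}
edges = [
    {"from": "daniel", "to": "reading1", "type": "HAS_READING", "props": {}},
]

def systolic_from_triples(patient: str) -> list[int]:
    """Follow hasReading, then systolic, by matching triple patterns."""
    readings = {o for s, p, o in triples if s == patient and p == "ex:hasReading"}
    return sorted(o for s, p, o in triples if s in readings and p == "ex:systolic")

def systolic_from_lpg(patient: str) -> list[int]:
    """Traverse HAS_READING edges and read a property off the target node."""
    return sorted(
        nodes[e["to"]]["props"]["systolic"]
        for e in edges
        if e["from"] == patient and e["type"] == "HAS_READING"
    )

print(systolic_from_triples("ex:daniel"), systolic_from_lpg("daniel"))
```

In the triple model, attribute values are just more triples; in the property graph, they live inside the node. Pattern matching over triples is roughly what SPARQL does, while edge traversal is closer to how Cypher and GQL queries read.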

SPARQL vs. Cypher/GQL comparison

Source: Arthur Keen (2018)


Choose your data structure and engine wisely

Source: Rishabh Agarwal (2025)


The elephant is your friend

Source: Pigsty



  • One physical engine: PostgreSQL is the most widely used open source relational database management system
  • Supports different data structures: not only relational but also document model (JSONB) and labeled property graph (Apache AGE)
  • … plus many use-case specific extensions such as time-series database, spatial database etc.

How to manage the data science lifecycle in real life?




The best book on data engineering

Source: Kleppmann & Riccomini (2026)


  • Chapter 1: Tradeoffs in Data Systems Architecture
    Analytical versus Operational Systems - Cloud versus Self-Hosting - Distributed versus Single-Node Systems - Data Systems, Law, and Society
  • Chapter 2: Defining Nonfunctional Requirements
    Case Study: Social Network Home Timelines - Describing Performance - Reliability and Fault Tolerance - Scalability - Maintainability
  • Chapter 3: Data Models and Query Languages
    Relational Model versus Document Model - Graph-Like Data Models - Event Sourcing and CQRS - DataFrames, Matrices, and Arrays
  • Chapter 4: Storage and Retrieval
    Storage and Indexing for OLTP - Data Storage for Analytics - Multidimensional and Full-Text Indexes
  • (…)
  • Chapter 11: Batch Processing
    Batch Processing in Distributed Systems - Batch Processing Models - Batch Use Cases

The common definition of data engineering

Source: Reis & Housley (2022)



Designing data science & AI platforms

Our dream: a data system that just works

Source: the Composable Codex



The problem

Source: MAD landscape


The Composable Data Management System Manifesto

Source: Pedreira et al. (2023)



The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.

It all started with the relational database management system (RDBMS)

Source: E.F. Codd (1970)


What makes a database?



System Catalog Database
postgresql, mysql, mssql, duckdb database schema
datafusion, trino catalog schema
druid dataSourceType dataSource
bigquery project database
flink catalog database
clickhouse database
clickhouse, impala, mysql, pyspark, snowflake database

Comparing operational and analytical data systems

Source: Kleppmann & Riccomini (2026), chapter 1


Property | Operational systems (OLTP) | Analytical systems (OLAP)
Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records
Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream
Human user example | End user of web/mobile application | Internal analyst, for decision support
Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts
Query volume | Lots of small queries | Few queries, each is complex
Data represents | Latest state of data (current point in time) | History of events that happened over time
Dataset size | Gigabytes to terabytes | Terabytes to petabytes
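
The two read patterns in the table can be shown side by side. This is a minimal sketch on an in-memory SQLite database; the `orders` table and its rows are invented for the example, and a real analytical workload would of course run on a far larger dataset and typically on a different engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
# Sample rows, invented for the example.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0), (3, "alice", 15.0)],
)

# Operational (OLTP) style: fetch one record by its key.
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Analytical (OLAP) style: aggregate over many records.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point, totals)
```

The point query touches a single row through the primary key, whereas the aggregate has to scan every row but needs only two of the columns.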

Evolution of analytical system architectures

Source: Armbrust et al. (2021)


Opening up the data warehouses

Source: Kleppmann & Riccomini (2026), chapter 4


Component | Function | Open source software
Open query engine | Parse SQL queries, optimize them into execution plans, and execute them against the data | Apache DataFusion, Apache Spark, DuckDB
Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance
Open table format | Support row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi
Open storage formats | Determines how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC
Open memory formats | Determines how the rows of a table are encoded as bytes in memory | Apache Arrow

From row-based to column-based systems

Source: Apache Arrow documentation
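
The difference between the two layouts can be sketched with plain Python structures. This only mimics the idea: Arrow stores each column in a contiguous typed buffer, which Python lists do not, and the sample values are invented.

```python
# Row-oriented: one complete record per element.
rows = [
    {"patient": "A", "systolic": 120, "diastolic": 80},
    {"patient": "B", "systolic": 145, "diastolic": 95},
    {"patient": "C", "systolic": 130, "diastolic": 85},
]

# Column-oriented: one sequence of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only needs that column's values.
mean_systolic = sum(columns["systolic"]) / len(columns["systolic"])
print(columns["systolic"], mean_systolic)
```

With the row layout, computing the same mean means visiting every record and skipping past the fields that are not needed, which is why analytical engines favour the columnar form.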


The Composable Data Stack takes the unbundling even further

Source: the Composable Codex



End-to-end columnar data with Apache Arrow ADBC and Flight SQL


JDBC/ODBC: row-based database connectivity protocols

Source: Apache Arrow project


Arrow Database Connectivity (ADBC) is column-based

Source: Apache Arrow project


Iceberg: an open table format and catalog

Source: Apache Iceberg specification



Iceberg catalog

Source: Dremio


Iceberg metadata file

Source: Dremio


Big Data Is Dead

Source: Jordan Tigani


The rise of embedded databases

Source: the Data Quarry blog


Type of data Open source embeddable database
Multi-modal
Relational
Key-value documents
Vector
Labeled-property graph with Cypher query engine
Triple-store with SPARQL query engine

So has our dream come true?

Source: the Composable Codex


We are getting very close …

Source: DuckLake



Diagonal scaling of a DuckLake lakehouse

Source: Cloudzero



  • Vertical scaling of single node query engines that can process up to 100 TB, covering 99% of use cases
  • Horizontal scaling of blob storage in (sovereign!) data centres

From components to a whole platform architecture

Source: Andreessen Horowitz



Introducing the Single-Repo Data Platform (SRDP)

Source: SRDP Hub


Patterns for building data & AI systems

The most common patterns for building data & AI systems

Source: Buschmann et al. (2013)



  • Pipes and filters pattern for batch data processing flows
  • Layers pattern for achieving interoperability
  • Hub-and-spoke event broker topology pattern for federated data systems
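
As an illustration of the first pattern, the sketch below builds a tiny pipes-and-filters flow from Python generators. The filter names, the sample records, and the threshold of 140 are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Filter = Callable[[Iterable[Record]], Iterator[Record]]

def drop_missing(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: discard records without a systolic value."""
    for r in records:
        if r.get("systolic") is not None:
            yield r

def add_flag(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: add a derived field without mutating the input record."""
    for r in records:
        # The threshold of 140 is an illustrative assumption.
        yield {**r, "elevated": r["systolic"] >= 140}

def pipeline(source: Iterable[Record], *filters: Filter) -> Iterator[Record]:
    """Pipe: feed the output of each filter into the next one."""
    stream: Iterable[Record] = source
    for f in filters:
        stream = f(stream)
    return iter(stream)

# Sample records, invented for the example.
source = [
    {"id": 1, "systolic": 145},
    {"id": 2, "systolic": None},
    {"id": 3, "systolic": 118},
]
result = list(pipeline(source, drop_missing, add_flag))
print(result)
```

Each filter knows nothing about its neighbours, so filters can be reordered, reused, or tested in isolation; records flow through lazily, one at a time.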

ETL, ELT, DAGs, pipelines, dataflows: it’s all the same

Source: Dagster
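
What these terms share is a directed acyclic graph of steps that an engine executes in dependency order. A minimal sketch using the standard library's `graphlib`; the step names are illustrative.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are illustrative).
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join": {"clean_orders", "extract_customers"},
    "report": {"join"},
}

# static_order() yields every step only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Steps without a dependency between them (here the two extracts) may appear in either order, which is exactly the freedom an orchestrator uses to run them in parallel.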


Business process models are also directed acyclic graphs

Source: Kleppmann & Riccomini (2026), chapter 5




Batch data processing has a strong functional flavour

Source: Hamilton



Why functional data engineering?
The problem of slowly changing dimensions

Source: Wikipedia


Why functional data engineering?
Immutability, snapshots and partitions

Source: Maxime Beauchemin (2018)
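
A minimal sketch of the snapshot idea, under the assumption that the dimension is stored as one complete, immutable snapshot per partition date instead of being updated in place. All names and values are sample data.

```python
# Dimension snapshots keyed by partition date; each snapshot is written once
# and never updated afterwards. All values are sample data.
customer_dim = {
    "2024-01-01": {"c1": {"segment": "occasional"}},
    "2024-01-02": {"c1": {"segment": "loyal"}},
}

orders = [
    {"date": "2024-01-01", "customer": "c1", "amount": 20.0},
    {"date": "2024-01-02", "customer": "c1", "amount": 35.0},
]

def enrich(order: dict) -> dict:
    """Join an order to the dimension snapshot of the same partition date."""
    snapshot = customer_dim[order["date"]]
    return {**order, "segment": snapshot[order["customer"]]["segment"]}

enriched = [enrich(o) for o in orders]
print([e["segment"] for e in enriched])
```

Because old snapshots are never modified, re-running the job for 2024-01-01 yields the same result even after the customer's segment has changed, at the cost of storing the dimension once per partition.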


Idempotency


Warning

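# MyWeatherService is a placeholder for any external weather API client.
def current_temperature(location: str) -> int:
    return MyWeatherService().get_current_temperature(location)

def non_idempotent_function(location: str, destination: str) -> None:
    # The value depends on when the function runs, so reruns can differ.
    with open(destination, 'w') as f:
        f.write(str(current_temperature(location)))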

Tip

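# MyWeatherService is a placeholder for any external weather API client.
def get_temperature(timestamp: str, location: str) -> int:
    return MyWeatherService().get_temperature(location, timestamp)

def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    # The timestamp is an explicit input, so reruns write the same value.
    with open(destination, 'w') as f:
        f.write(str(get_temperature(timestamp, location)))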
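
The property the Tip example relies on can be checked by running a task twice and comparing the outputs. The sketch below is self-contained: `FakeWeatherService` is a hypothetical stand-in for a weather API with historical lookups, and its single stored value is invented.

```python
import tempfile
from pathlib import Path

class FakeWeatherService:
    """Hypothetical stand-in for a weather API with historical lookups."""

    # Invented sample value.
    _history = {("Utrecht", "2024-01-01T12:00"): 4}

    def get_temperature(self, location: str, timestamp: str) -> int:
        return self._history[(location, timestamp)]

def write_temperature(timestamp: str, location: str, destination: Path) -> None:
    """Overwrite the destination with a value fully determined by the arguments."""
    value = FakeWeatherService().get_temperature(location, timestamp)
    destination.write_text(str(value))

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "temperature.txt"
    write_temperature("2024-01-01T12:00", "Utrecht", target)
    first = target.read_text()
    write_temperature("2024-01-01T12:00", "Utrecht", target)  # rerun
    second = target.read_text()

print(first == second)
```

Two ingredients make the rerun safe: the timestamp is an explicit input rather than "now", and the file is overwritten rather than appended to.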

Software-Defined Assets bring it all together

Source: Dagster


  • Declarative Nature: declare the end state of an asset, orchestrator takes care of the execution. Shifts the focus from task execution to asset production.
  • Observability and Scheduling: enhanced observability into your data assets and allow for advanced scheduling. Easier to understand the state of your assets and when they should be updated.
  • Environment Agnosticism: environment-agnostic, same asset definitions can be used across different environments, such as development and production, without changes to the asset code.
  • Data Lineage: clear data lineage, easier to understand data flows and debug issues.
  • Integration with External Tools: the orchestrator can be integrated with assets generated by other tools such as dbt.
  • Rich Metadata and Grouping: assets have rich metadata, which is useful for organizing and searching assets.
  • Partitioning and Backfills: SDAs support time partitioning and backfills out of the box, which is useful for managing historical data and ensuring data consistency.
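
The declarative idea can be sketched in a few lines of plain Python. This is not Dagster's API: the `asset` decorator and `materialize_all` function below are hypothetical, and they only illustrate how an orchestrator could derive the execution graph from asset definitions by treating parameter names as upstream assets.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical registry; not part of any real orchestrator.
ASSETS: dict[str, Callable] = {}

def asset(fn: Callable) -> Callable:
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn

@asset
def raw_readings() -> list[int]:
    return [120, 145, 130]  # sample data

@asset
def elevated_readings(raw_readings: list[int]) -> list[int]:
    return [r for r in raw_readings if r >= 140]  # threshold is illustrative

@asset
def elevated_count(elevated_readings: list[int]) -> int:
    return len(elevated_readings)

def materialize_all() -> dict[str, object]:
    """Derive the dependency graph from signatures and run assets in order."""
    deps = {name: set(inspect.signature(fn).parameters) for name, fn in ASSETS.items()}
    results: dict[str, object] = {}
    for name in TopologicalSorter(deps).static_order():
        fn = ASSETS[name]
        results[name] = fn(**{d: results[d] for d in deps[name]})
    return results

results = materialize_all()
print(results)
```

The author of an asset only states what it is and what it is built from; ordering, and in a real orchestrator also scheduling, partitioning and lineage, follow from the declared graph.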

Same approach for machine learning pipelines

Source: Dagster


OpenLineage as the standard for metadata collection and data lineage

Source: OpenLineage docs


Marquez is the open source reference implementation of OpenLineage

Source: Marquez project


MLOps

Hidden technical debt in ML systems

Source: Sculley et al. (2015)


Stages of machine learning CI/CD automation pipeline

Source: Google Cloud docs


MLOps level 0

Source: Google Cloud docs


MLOps level 1

Source: Google Cloud docs


MLOps level 2

Source: Google Cloud docs


ONNX as the standard for AI interoperability

Source: ONNX explained


The most complete open source MLOps library

Source: MLflow


MLflow Tracking & Experiments

Source: MLflow Tracking


MLflow Model Registry

Source: MLflow Model Registry


MLflow Model Deployment

Source: MLflow


How about testing?

Source: the Practical Test Pyramid




  • Write tests with different granularity
  • The more high-level you get the fewer tests you should have

How about testing?

Source: Breck et al. (2017)


The test strategy depends on the end product


Type of test
UI
  • Relevant when developing a dashboard or app (information product)
  • End-user testing of interactive visualisations can be very time-consuming and hence costly!
Contract
  • Relevant when developing REST APIs
  • Differentiate between consumer and producer tests
Integration
  • Often starting point of testing e.g. interacting with data stores
  • Consider writing simple checks on datasets when developing ETL pipelines (row counts, number of unique IDs, min-max dates etc.)
Unit
  • You shouldn’t need to write unit tests if you use standard ML libraries
  • Relevant for data processing (e.g. your own utility functions for data cleaning and data validation (row counts!))
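
The simple dataset checks mentioned under integration tests (row counts, number of unique IDs, min-max dates) fit in one small function. A sketch, with hypothetical field names and invented sample rows:

```python
from datetime import date

def dataset_checks(rows: list[dict], id_field: str, date_field: str) -> dict:
    """Summarise a dataset with a few cheap sanity checks."""
    ids = [r[id_field] for r in rows]
    dates = [r[date_field] for r in rows]
    return {
        "row_count": len(rows),
        "unique_ids": len(set(ids)),
        "min_date": min(dates),
        "max_date": max(dates),
    }

# Sample data, invented for the example.
rows = [
    {"id": 1, "measured_on": date(2024, 1, 1)},
    {"id": 2, "measured_on": date(2024, 1, 3)},
    {"id": 2, "measured_on": date(2024, 1, 2)},
]
summary = dataset_checks(rows, id_field="id", date_field="measured_on")

# A duplicated ID shows up as unique_ids < row_count.
has_duplicates = summary["unique_ids"] < summary["row_count"]
print(summary, has_duplicates)
```

Comparing such a summary before and after each pipeline step, or against expected bounds, is often enough to catch dropped rows, duplicated keys, and dates outside the intended window.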