Part I: Bringing semantics back into data & AI engineering
Part II: Designing, building and operating data & AI platforms
This lecture is based on the following open access materials:
Source code: https://github.com/anthology-of-data-science/lecture-engineering-data-ai-platforms
This work is licensed under CC BY-SA 4.0
Source: Fairness and machine learning (2023)
Source: Floridi (2009)
Source: Floridi (2009)
Source: Edelman et al. (2024)
Source: Floridi (2009)
| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Data | Signal & Measurement | Data is not “raw” but produced by instruments designed under specific physical theories. | The voltage fluctuations from the sensor translated into a numerical output (e.g., 145/95). |
| Information (Statistical) | Entropy & Surprise | Shannon Information Theory: The statistical novelty or reduction of uncertainty within a signal, regardless of meaning. | A reading of 145 mmHg is “surprising” (high information content) if the patient’s historical baseline is 120 mmHg. |
| Information (Semantic) | Context & Relations | Ontologies & Knowledge Graphs: Data structured via schemas to provide meaning (units, subjects, and temporal states). | Linking the “145” to “mmHg,” “Patient X,” and “Resting State” within a standardized medical ontology. |
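
The statistical row of the table can be made concrete with Shannon's self-information, I(x) = -log2 p(x). The following is only a sketch under stated assumptions: the patient's systolic readings are modelled as normally distributed around the 120 mmHg baseline, the standard deviation of 8 mmHg is a hypothetical value chosen for illustration, and readings are binned to 1 mmHg so that each one can be assigned a probability.

```python
import math
from statistics import NormalDist

# Hypothetical patient baseline: mean 120 mmHg, standard deviation 8 mmHg (assumed values).
baseline = NormalDist(mu=120.0, sigma=8.0)


def surprisal_bits(reading_mmhg: float, dist: NormalDist, bin_width: float = 1.0) -> float:
    """Self-information -log2 p(x), where p is the probability of the reading's bin."""
    lower = reading_mmhg - bin_width / 2
    upper = reading_mmhg + bin_width / 2
    p = dist.cdf(upper) - dist.cdf(lower)
    return -math.log2(p)


typical = surprisal_bits(120, baseline)
elevated = surprisal_bits(145, baseline)
print(f"120 mmHg: {typical:.1f} bits, 145 mmHg: {elevated:.1f} bits")
```

Under these assumptions the 145 mmHg reading carries more bits than the 120 mmHg reading, which is the sense in which it is "surprising". The number says nothing about units, patient, or clinical meaning; that is what the semantic level adds.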
Source: Floridi (2009)
| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Knowledge (Propositional) | Justified True Belief (JTB) | Classical Epistemology: Knowledge-that. A claim that is believed, is true, and has rigorous justification. | The clinician’s belief that “Patient X has hypertension,” justified by multiple readings and clinical guidelines. |
| Knowledge (Procedural) | Non-Propositional Skill | Ryle’s “Knowledge-How”: The tacit ability to perform a task that cannot be fully captured in formal data structures. | The nurse’s skill in correctly placing the cuff and identifying the specific Korotkoff sounds amidst noise. |
| Knowledge (Acquaintance) | Personal Experience | Knowledge by acquaintance: direct familiarity with a person, place, or thing. | The nurse's familiarity with the hospital. |
Source: Fairness and machine learning (2023)
Source: Judea Pearl (2018)

Source: E.F. Codd (1970)
Source: GeeksforGeeks (2026)
Source: dbt documentation (2023)
Source: Rahma Hassan (2023)
Source: Zorginstituut KIK-V
Source: Jessica Talisman (2025)

Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and with the introduction of logical reasoning. By assigning classes, properties, relations and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems. Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.
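
As a rough illustration of the "classes, properties, relations" idea in the passage above, the sketch below stores a tiny ontology as subject-predicate-object triples and infers additional types through the class hierarchy. It uses plain Python rather than an RDF library, and every class, property, and entity name in it is hypothetical.

```python
# Triples are (subject, predicate, object). All names are hypothetical examples.
TRIPLES = {
    ("BloodPressureReading", "subClassOf", "VitalSignMeasurement"),
    ("VitalSignMeasurement", "subClassOf", "Observation"),
    ("reading_001", "type", "BloodPressureReading"),
    ("reading_001", "hasUnit", "mmHg"),
    ("reading_001", "aboutPatient", "patient_x"),
}


def superclasses(cls: str, triples: set) -> set:
    """All classes reachable from cls by following subClassOf edges."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(entity: str, triples: set) -> set:
    """Asserted types plus the types inferred through the class hierarchy."""
    asserted = {o for s, p, o in triples if s == entity and p == "type"}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return asserted | inferred


print(sorted(types_of("reading_001", TRIPLES)))
```

Only one type is asserted for `reading_001`; the other two follow from the `subClassOf` rules. This is a very small instance of the logical reasoning that separates an ontology from a flat vocabulary.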
Caution
Source: Arthur Keen (2018)
Source: Rishabh Agarwal (2025)
Source: Pigsty


Source: Kleppmann & Riccomini (2026)

Source: Reiss & Housley (2022)

Designing data science & AI platforms
Source: the Composable Codex
Source: MAD landscape
Source: Pedreira et al. (2023)
The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.
Source: E.F. Codd (1970)

| System | Catalog | Database |
|---|---|---|
| postgresql, mssql, duckdb | database | schema |
| datafusion, trino | catalog | schema |
| druid | dataSourceType | dataSource |
| bigquery | project | database |
| flink | catalog | database |
| clickhouse, impala, mysql, pyspark, snowflake | | database |
Source: Kleppmann & Riccomini (2026), chapter 1
| Property | Operational systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over a large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts |
| Query volume | Lots of small queries | Few queries, each is complex |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
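
The two read patterns in the table can be shown side by side on a toy dataset. This sketch uses Python's built-in `sqlite3` module purely for convenience; the `orders` table and its contents are made up, and a single embedded database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 12.5), (4, "carol", 50.0)],
)

# OLTP-style point query: fetch one record by its key.
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style aggregate: scan many records and summarise them.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point)
print(totals)
conn.close()
```

The point query touches one row through the primary key, while the aggregate has to read every row. At the dataset sizes in the table, that difference in access pattern is a main reason the two workloads tend to end up on differently designed storage engines.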
Source: Armbrust et al. (2021)
Source: Kleppmann & Riccomini (2026), chapter 4
| Component | Function | Open source software |
|---|---|---|
| Open query engine | Parse SQL queries, optimize them into execution plans, and execute them against the data | Apache DataFusion, Apache Spark, DuckDB |
| Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance |
| Open table format | Support row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi |
| Open storage formats | Determines how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC |
| Open memory formats | Determines how the rows of a table are encoded as bytes in memory | Apache Arrow |
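
To make "how the rows of a table are encoded as bytes" more tangible, the sketch below encodes the same small table in a row-oriented and a column-oriented layout using only the standard `struct` module. The two-column schema is hypothetical, and the layouts are deliberately naive: real formats such as Parquet and Arrow add metadata, encodings, and (for files) compression on top of the basic columnar idea.

```python
import struct

# Hypothetical table with two columns: (sensor_id: int32, reading: float64).
rows = [(1, 120.0), (2, 145.0), (3, 118.5)]

# Row-oriented layout: the values of each row are stored next to each other.
row_bytes = b"".join(
    struct.pack("<id", sensor_id, reading) for sensor_id, reading in rows
)

# Column-oriented layout: all values of one column are stored contiguously.
ids = [r[0] for r in rows]
readings = [r[1] for r in rows]
col_bytes = struct.pack(f"<{len(ids)}i", *ids) + struct.pack(
    f"<{len(readings)}d", *readings
)

# Reading only the "reading" column from the columnar layout is one contiguous slice.
offset = 4 * len(ids)  # skip the int32 column
decoded = list(struct.unpack_from(f"<{len(readings)}d", col_bytes, offset))
print(decoded)
```

Both layouts occupy the same number of bytes here, but an analytic query that needs a single column can skip the rest of the table in the columnar layout, whereas the row layout forces it to step over every other field.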
Source: Apache Arrow documentation
Source: the Composable Codex
Source: the Composable Codex
Source: Apache Arrow project
Source: Apache Arrow project
Source: Apache Iceberg specification
Source: Dremio
Source: Dremio
Source: Jordan Tigani
Source: the Data Quarry blog
Source: the Composable Codex
Source: DuckLake
Source: Cloudzero

Source: Andreessen Horowitz
Source: Andreessen Horowitz
Source: SRDP Hub
Patterns for building data & AI systems
Source: Buschmann et al. (2013)
Source: Dagster
Source: Kleppmann & Riccomini (2026), chapter 5
Source: Hamilton
Source: Wikipedia
Source: Maxime Beauchemin (2018)


Warning
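def current_temperature(location: str) -> int:
    # MyWeatherService is a placeholder for an external weather API client.
    return MyWeatherService().get_current_temperature(location)

def non_idempotent_function(location: str, destination: str) -> None:
    # The output depends on when the function runs, so a rerun can write a different value.
    with open(destination, 'w') as f:
        f.write(str(current_temperature(location)))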
Tip
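def get_temperature(timestamp: str, location: str) -> int:
    # MyWeatherService is a placeholder for an external weather API client.
    return MyWeatherService().get_temperature(location, timestamp)

def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    # The output depends only on the arguments, so a rerun overwrites the file with the same value.
    with open(destination, 'w') as f:
        f.write(str(get_temperature(timestamp, location)))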
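
A runnable variant of the idempotent pattern can be used to check the property directly. In this sketch `FakeWeatherService`, its hard-coded reading, and `write_temperature` are hypothetical stand-ins introduced only so the example runs without a real weather API.

```python
import tempfile
from pathlib import Path


class FakeWeatherService:
    """Hypothetical stand-in for a weather API that serves historical readings."""

    _history = {("2024-01-01T12:00", "Amsterdam"): 4}

    def get_temperature(self, location: str, timestamp: str) -> int:
        return self._history[(timestamp, location)]


def write_temperature(timestamp: str, location: str, destination: Path) -> None:
    # Overwrite the file with a value that depends only on the arguments.
    temperature = FakeWeatherService().get_temperature(location, timestamp)
    destination.write_text(str(temperature))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "temperature.txt"
    write_temperature("2024-01-01T12:00", "Amsterdam", target)
    first = target.read_text()
    # Rerun with the same arguments, as a scheduler would after a retry or backfill.
    write_temperature("2024-01-01T12:00", "Amsterdam", target)
    second = target.read_text()

print(first == second)
```

Two ingredients make the rerun safe: the timestamp is passed in explicitly instead of being read from the clock, and the write overwrites rather than appends.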
Source: Dagster
Source: Dagster
Source: OpenLineage docs
Source: Marquez project
MLOps
Source: Sculley et al. (2015)
Source: Google Cloud docs
Source: Google Cloud docs
Source: Google Cloud docs
Source: Google Cloud docs
Source: ONNX explained
Source: MLflow
Source: MLflow Tracking
Source: MLflow Model Registry
Source: MLflow
Source: the Practical Test Pyramid

Source: Breck et al. (2017)
| Type of test | |
|---|---|
| UI | |
| Contract | |
| Integration | |
| Unit | |
