A two-day Hitchhikers’ guide to the stuff
that data professionals spend most of their time on
This lecture is based on many open access materials for which references are given on each slide. We highlight the following resources, which provide the backbone of the main narrative:
Source code: https://github.com/anthology-of-data-science/lecture-engineering-data-ai-platforms
This work is licensed under CC BY-SA 4.0
Source: Fairness and machine learning (2023)
Source: Fairness and machine learning (2023)
Etymology
Source: Fairness and machine learning (2023)
Source: Fairness and machine learning (2023)
Source: Judea Pearl (2018)

Source: Fairness and machine learning (2023)
Source: Daniel Kahneman (2011)
Part I
Part II
Data as measurement
Source: H. Bruderer (2024)

Source: K. Housten (2023)

Source: Wikipedia
| scale | measure property | math operations | advanced operations | central tendency |
|---|---|---|---|---|
| nominal | classification, membership | =, ≠ | grouping | mode |
| ordinal | comparison, level | >, < | sorting | median |
| interval | difference, affinity | +, - | yardstick | mean, deviation |
| ratio | magnitude, amount | ×, / | ratio | geometric mean, variation |
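To make the table concrete, here is a minimal sketch that picks the permitted measure of central tendency for each scale. The sample values and variable names are invented for illustration; only the functions from Python's standard `statistics` module are real.

```python
from statistics import geometric_mean, mean, mode

# Hypothetical observations, one list per measurement scale.
blood_types = ["A", "O", "O", "B", "O"]                     # nominal
pain_levels = ["low", "high", "medium", "low", "medium"]    # ordinal
temps_celsius = [36.5, 37.0, 38.5]                          # interval
weights_kg = [50.0, 80.0, 100.0]                            # ratio

# Nominal: only equality is defined, so the mode is the only sensible summary.
most_common_blood_type = mode(blood_types)

# Ordinal: values can be ranked, so we can sort and take the middle one.
rank = {"low": 0, "medium": 1, "high": 2}
ordered = sorted(pain_levels, key=rank.__getitem__)
median_pain = ordered[len(ordered) // 2]  # odd number of observations

# Interval: differences are meaningful, so the arithmetic mean is defined.
mean_temp = mean(temps_celsius)

# Ratio: a true zero exists, so multiplicative summaries are defined too.
typical_weight = geometric_mean(weights_kg)

print(most_common_blood_type, median_pain, round(mean_temp, 2), round(typical_weight, 2))
```

Note that nothing stops you from calling `mean` on numerically coded nominal data; the scale tells you whether the result means anything.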
Source: Floridi (2009)
| Type | Description |
|---|---|
| Primary data | The principal data stored e.g. in a database, for example a simple array of numbers that measures the battery of a car. The red light of the low battery indicator flashing is assumed to be an instance of primary data conveying primary information. |
| Secondary data | The converse of primary data, constituted by their absence (one could call them anti-data). You usually suspect your car battery is flat when the engine fails to make any of the usual noise. |
| Metadata | These are indications about the nature of some other (usually primary) data. They describe properties such as location, format, updating, availability, usage restrictions, and so forth. |
Source: Floridi (2009)
| Type | Description |
|---|---|
| Operational data | These are data regarding the operations of the whole data system and the system's performance. Correspondingly, operational information is information about the dynamics of an information system. Suppose the car has a yellow light that, when flashing, indicates that the car's checking system is malfunctioning. The fact that the light is on may indicate that the low battery indicator is not working properly, thus undermining the hypothesis that the battery is flat. |
| Derivative data | These are data that can be extracted from some data whenever the latter are used as indirect sources in search of patterns, clues or inferential evidence about things other than those directly addressed by the data themselves, e.g. for comparative and quantitative analyses. From someone's credit card bill, concerning e.g. the purchase of petrol at a certain petrol station, one may derive information about her whereabouts at a given time. This category is difficult to define precisely. |
Source: R for Data Science (2nd edition, 2023)
The relational database
Source: E.F. Codd (1970)
Source: GeeksforGeeks (2026)
Source: Geonovum (2024)
Source: dbt documentation (2023)
Source: Rahma Hassan (2023)

{
  "user_id": 251,
  "first_name": "Barack",
  "last_name": "Obama",
  "headline": "Former President of the United States of America",
  "region_id": "us:91",
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "President", "organization": "United States of America"},
    {"job_title": "US Senator (D-IL)", "organization": "United States Senate"}
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1988, "end": 1991},
    {"school_name": "Columbia University", "start": 1981, "end": 1983}
  ],
  "contact_info": {
    "website": "https://barackobama.com",
    "twitter": "https://twitter.com/barackobama"
  }
}
Source: Zhang, Cornet & Benis (2024)
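The JSON profile shown earlier in this section keeps one-to-many data (positions, education) nested inside a single document. As a rough sketch of how the same data could look in a relational model, the snippet below shreds a shortened copy of that document into two tables and joins them back together. The table and column names are assumptions made for this illustration; it uses only Python's built-in `sqlite3` module.

```python
import sqlite3

# Shortened copy of the profile document (only the fields used here).
profile = {
    "user_id": 251,
    "first_name": "Barack",
    "last_name": "Obama",
    "positions": [
        {"job_title": "President", "organization": "United States of America"},
        {"job_title": "US Senator (D-IL)", "organization": "United States Senate"},
    ],
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per user, one row per position.
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name TEXT
    );
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        job_title TEXT,
        organization TEXT
    );
""")

conn.execute(
    "INSERT INTO users VALUES (?, ?, ?)",
    (profile["user_id"], profile["first_name"], profile["last_name"]),
)
# The nested list becomes rows that point back to the user via a foreign key.
conn.executemany(
    "INSERT INTO positions (user_id, job_title, organization) VALUES (?, ?, ?)",
    [(profile["user_id"], p["job_title"], p["organization"]) for p in profile["positions"]],
)

# Reassembling the document's one-to-many relationship requires a join.
rows = conn.execute("""
    SELECT u.last_name, p.job_title
    FROM users AS u
    JOIN positions AS p ON p.user_id = u.user_id
    ORDER BY p.position_id
""").fetchall()
print(rows)
```

The document form has better locality (one read returns everything), while the relational form makes it easier to query across users, for example "everyone who worked at organization X".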
Bringing semantics back into data engineering
Source: Floridi (2009)
Source: Floridi (2009)
“Daniel has a high blood pressure”
Source: Edelman et al. (2024)
Source: Floridi (2009)
| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Data | Signal & Measurement | Data is not “raw” but produced by instruments designed under specific physical theories. | The voltage fluctuations from the sensor translated into a numerical output (e.g., 145/95). |
| Information (Statistical) | Entropy & Surprise | Shannon Information Theory: The statistical novelty or reduction of uncertainty within a signal, regardless of meaning. | A reading of 145 mmHg is “surprising” (high information content) if the patient’s historical baseline is 120 mmHg. |
| Information (Semantic) | Context & Relations | Ontologies & Knowledge Graphs: Data structured via schemas to provide meaning (units, subjects, and temporal states). | Linking the “145” to “mmHg,” “Patient X,” and “Resting State” within a standardized medical ontology. |
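The "surprise" in the statistical row can be quantified as Shannon self-information, $-\log_2 p(x)$. The sketch below estimates $p$ from a made-up history of systolic readings grouped into 10 mmHg bins; the readings, the bin width, and the function name `surprisal_bits` are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical historical systolic readings (mmHg) for one patient.
history = [118, 121, 119, 122, 120, 123, 117, 145]


def to_bin(reading: int, width: int = 10) -> int:
    """Map a reading to the lower edge of its bin, e.g. 145 -> 140."""
    return reading // width * width


def surprisal_bits(reading: int, past: list[int]) -> float:
    """Self-information -log2(p) of a reading's bin, with p estimated from past.

    The bin must occur at least once in `past`; an unseen bin would have
    an estimated probability of zero and infinite surprisal.
    """
    counts = Counter(to_bin(r) for r in past)
    p = counts[to_bin(reading)] / len(past)
    return -math.log2(p)


print(surprisal_bits(121, history))  # common bin: low information content
print(surprisal_bits(145, history))  # rare bin: high information content
```

The measure says nothing about what 145 means clinically; that is the job of the semantic level.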
Source: Floridi (2009)
| Level | Dimension | Theoretical anchor | Example |
|---|---|---|---|
| Knowledge (Propositional) | Justified True Belief (JTB) | Classical Epistemology: Knowledge-that. A claim that is believed, is true, and has rigorous justification. | The clinician’s belief that “Patient X has hypertension,” justified by multiple readings and clinical guidelines. |
| Knowledge (Procedural) | Non-Propositional Skill | Ryle’s “Knowledge-How”: The tacit ability to perform a task that cannot be fully captured in formal data structures. | The nurse’s skill in correctly placing the cuff and identifying the specific Korotkoff sounds amidst noise. |
| Knowledge (Acquaintance) | Personal Experience | Knowledge by acquaintance: direct familiarity with a person, place, or thing. | The nurse's familiarity with the hospital and its wards. |
Source: Barrasa & Webber (2023)
Source: Wikipedia

Source: stelselcatalogus.nl
Source: Zorginstituut KIK-V
Source: Jessica Talisman (2025)

Ontologies enhance vocabularies by describing and contextualizing relationships between concepts and with the introduction of logical reasoning. By assigning classes, properties, relations and attributes, ontologies establish rule bases that define how concepts behave in the wild, ensuring a level of coherence within complex information systems. Machines love ontologies because of their high-fidelity disambiguation and description, which bring clarity to machine understanding for tasks such as information retrieval, entity management, concept discovery, and RAG implementations for AI systems.
Caution
Source: Arthur Keen (2018)
Source: Rishabh Agarwal (2025)
Source: Pigsty

Designing data science & AI platforms

Source: Kleppmann & Riccomini (2026)

Source: Reiss & Housley (2022)

Source: the Composable Codex
Source: MAD landscape
Source: Pedreira et al. (2023)
The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics.
Source: E.F. Codd (1970)

| System | Catalog | Database |
|---|---|---|
| postgresql, mysql, mssql, duckdb | database | schema |
| datafusion, trino | catalog | schema |
| druid | dataSourceType | dataSource |
| bigquery | project | database |
| flink | catalog | database |
| clickhouse, impala, mysql, pyspark, snowflake | database | |
Source: Kleppmann & Riccomini (2026), chapter 1
| Property | Operational systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over a large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Type of queries | Fixed, predefined by application | Arbitrary, ad-hoc exploration by analysts |
| Query volume | Lots of small queries | Few queries, each is complex |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
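The two read patterns in the table can be contrasted on a toy dataset. This is only a sketch: the `orders` table and its contents are invented, and SQLite is used for both queries purely for convenience, whereas real analytic workloads would typically run on a separate, column-oriented system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)],
)

# OLTP-style point query: fetch one record by its key.
point = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style aggregate: scan many records and summarize them.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(point)
print(totals)
```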
Source: Armbrust et al. (2021)
Source: Kleppmann & Riccomini (2026), chapter 4
| Component | Function | Open source software |
|---|---|---|
| Open query engine | Parses SQL queries, optimizes them into execution plans, and executes them against the data | Apache DataFusion, Apache Spark, DuckDB |
| Open catalog format | Defines which tables are contained in the database | Apache Iceberg, Unity Catalog, Lance |
| Open table format | Supports row inserts and deletions | Apache Iceberg, Lance, Delta Lake, Apache Hudi |
| Open storage formats | Determine how the rows of a table are encoded as bytes in a file | Apache Parquet, Lance, Apache ORC |
| Open memory formats | Determine how the rows of a table are encoded as bytes in memory | Apache Arrow |
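What "encoding rows as bytes" means can be shown with a toy example. The sketch below lays out the same three-row table in a row-oriented and a column-oriented byte layout using Python's `struct` module. It is not the Parquet, ORC, or Arrow format; the table and the layouts are simplified assumptions meant only to show why columnar layouts suit analytic scans.

```python
import struct

# Hypothetical table: (sensor_id: int32, temperature: float64).
rows = [(1, 20.5), (2, 21.0), (3, 19.5)]

# Row-oriented layout: all values of one row sit next to each other.
# "<" means little-endian without padding, "i" is int32, "d" is float64.
row_bytes = b"".join(struct.pack("<id", sid, temp) for sid, temp in rows)

# Column-oriented layout: all values of one column sit next to each other.
ids, temps = zip(*rows)
col_bytes = struct.pack("<3i", *ids) + struct.pack("<3d", *temps)

# Reading only the temperature column from the columnar layout is a single
# contiguous read that starts right after the 3 * 4 bytes of ids.
temps_from_cols = struct.unpack_from("<3d", col_bytes, 12)

# The row layout needs one jump per row (each row is 12 bytes, the
# temperature starts 4 bytes into the row).
temps_from_rows = [struct.unpack_from("<d", row_bytes, i * 12 + 4)[0] for i in range(3)]

print(temps_from_cols, temps_from_rows)
```

Both layouts take the same number of bytes here; the difference lies in which access pattern touches contiguous memory.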
Source: Apache Arrow documentation
Source: the Composable Codex
Source: the Composable Codex
Source: Apache Arrow project
Source: Apache Arrow project
Source: Apache Iceberg specification
Source: Dremio
Source: Dremio
Source: Jordan Tigani
Source: the Data Quarry blog
Source: the Composable Codex
Source: DuckLake
Source: Cloudzero

Source: Andreessen Horowitz
Source: Andreessen Horowitz
Source: SRDP Hub
Patterns for building data & AI systems
Source: Buschmann et al. (2013)
Source: Dagster
Source: Kleppmann & Riccomini (2026), chapter 5
Source: Hamilton
Source: Wikipedia
Source: Maxime Beauchemin (2018)


Warning
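def current_temperature(location: str) -> int:
    # MyWeatherService is a placeholder for a weather API client.
    return MyWeatherService().get_current_temperature(location)

def non_idempotent_function(location: str, destination: str) -> None:
    # The output depends on when the function runs, so a rerun writes a different value.
    with open(destination, "w") as f:
        f.write(str(current_temperature(location)))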
Tip
def get_temperature(timestamp: str, location: str) -> int:
    # MyWeatherService is a placeholder for a weather API client.
    return MyWeatherService().get_temperature(location, timestamp)

def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    # The timestamp is an explicit input, so a rerun writes the same value.
    with open(destination, "w") as f:
        f.write(str(get_temperature(timestamp, location)))
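The difference between the two variants becomes visible when each task is run twice. The sketch below is a self-contained demo under stated assumptions: the weather service is replaced by a hypothetical in-memory stand-in (`HISTORY` and a counter that simulates a changing live reading), and the task names are invented for this example.

```python
import itertools
import tempfile
from pathlib import Path

# Hypothetical stand-in for a weather service.
HISTORY = {("2024-01-01T12:00", "Utrecht"): 4}
_live_readings = itertools.count(5)  # the "current" value changes on every call


def current_temperature(location: str) -> int:
    return next(_live_readings)


def get_temperature(timestamp: str, location: str) -> int:
    return HISTORY[(timestamp, location)]


def non_idempotent_task(location: str, destination: Path) -> None:
    # Depends on hidden state (the moment of execution).
    destination.write_text(str(current_temperature(location)))


def idempotent_task(timestamp: str, location: str, destination: Path) -> None:
    # Depends only on its explicit inputs and overwrites its output.
    destination.write_text(str(get_temperature(timestamp, location)))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "temperature.txt"

    non_idempotent_task("Utrecht", out)
    first = out.read_text()
    non_idempotent_task("Utrecht", out)
    second = out.read_text()

    idempotent_task("2024-01-01T12:00", "Utrecht", out)
    third = out.read_text()
    idempotent_task("2024-01-01T12:00", "Utrecht", out)
    fourth = out.read_text()

print(first != second)  # rerun changed the output
print(third == fourth)  # rerun reproduced the output
```

Idempotent tasks are what make retries and backfills safe: an orchestrator can rerun a failed partition without corrupting earlier results.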
Source: Dagster
Source: Dagster
Source: OpenLineage docs
Source: Marquez project
MLOps
Source: Sculley et al. (2015)
Source: Google Cloud docs
Source: Google Cloud docs
Source: Google Cloud docs
Source: Google Cloud docs
Source: ONNX explained
Source: MLflow
Source: MLflow Tracking
Source: MLflow Model Registry
Source: MLflow
Source: the Practical Test Pyramid

Source: Breck et al. (2017)
| Type of test | |
|---|---|
| UI | |
| Contract | |
| Integration | |
| Unit | |
