Introduction to Generative AI

From foundational models to GenAI systems

Daniel Kapitan

Eindhoven AI Systems Institute

June 20, 2025

Attribution


Generative Modeling

The need for generative models

Adding noise that is imperceptible to humans can badly trip up the model


The need for generative models

We need better techniques for quantifying uncertainty


The core rules of probability theory


The product rule

\[ \begin{align} p(x, y) & = p(x|y)p(y) \\ & = p(y|x)p(x) \end{align} \]

This is in fact Bayes’ rule written differently: rearranging gives \( p(y|x) = \frac{p(x|y)p(y)}{p(x)} \)

The sum rule

\[ p(x) = \sum_y p(x, y)\]
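A minimal numerical sketch (the toy joint distribution is made up for illustration) that verifies both rules and, by combining them, Bayes’ rule:

```python
import numpy as np

# Toy joint distribution p(x, y) over x in {0, 1} (rows) and y in {0, 1} (columns)
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

# Sum rule: marginalize to obtain p(x) and p(y)
p_x = p_xy.sum(axis=1)          # [0.4, 0.6]
p_y = p_xy.sum(axis=0)          # [0.3, 0.7]

# Product rule: p(x, y) = p(x|y) p(y) = p(y|x) p(x)
p_x_given_y = p_xy / p_y               # divide each column by p(y)
p_y_given_x = p_xy / p_x[:, None]      # divide each row by p(x)

assert np.allclose(p_x_given_y * p_y, p_xy)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule follows from combining both factorizations
assert np.allclose(p_y_given_x, (p_x_given_y * p_y) / p_x[:, None])
```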

Overview of Generative AI


  • Deep learning provides architecture for parameterizing the model
  • Probabilistic modeling provides mathematical foundation for the model
  • Software engineering provides computing resources for implementing the model

Taxonomy of Deep Generative Modeling


  • We will cover Autoregressive Models (ARM) and Latent Variable Models
  • Quiz: which type of model won the Nobel Prize in Physics in 2024?

Autoregressive Models

Remember time-series forecasting?


In statistics, econometrics, and signal processing, an autoregressive model

  • is a representation of a type of random process
  • can be used to describe time-varying processes in nature, economics, behavior, etc.
  • specifies that the output variable depends linearly on its own previous values and on a stochastic term
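A minimal sketch of such a process (the AR(2) coefficients are chosen arbitrarily for illustration): each value is a linear function of its two previous values plus a stochastic term.

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2, sigma = 0.6, 0.3, 1.0   # illustrative AR(2) coefficients and noise scale
x = np.zeros(200)
for t in range(2, len(x)):
    # output depends linearly on its own previous values and on a stochastic term
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + rng.normal(scale=sigma)
print(x[:5])
```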

Markov property:

A stochastic process is memoryless: the next state depends only on the current state, not on the full history:

\[ \begin{align} & P(\text{coding in Python is fun} \mid \text{coding in Python is}) \\ & \approx P(\text{Python is fun} \mid \text{Python is}) \end{align} \]
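A minimal sketch of the same idea with counts from a toy corpus: the probability of the next word is estimated from a shortened context only (here a first-order Markov, i.e. bigram, model over a made-up sentence).

```python
from collections import Counter, defaultdict

corpus = "coding in Python is fun and coding in Python is easy".split()

# Estimate P(next word | previous word) from bigram counts
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    counts = bigrams[prev]
    return counts[nxt] / sum(counts.values())

print(p_next("is", "fun"))   # 0.5: 'is' is followed by 'fun' once and by 'easy' once
```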

Autoregressive models with neural networks

Factorization into conditional probabilities using the product rule


  • Say we have a variable \(\mathbf{x}\) in \(D\) dimensions and we want to model \(p(\mathbf{x})\)
  • Using the product rule, the joint can be factorized into conditional probabilities: \[ p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d | \mathbf{x}_{< d}) \]
  • In the case of three dimensions (see the sketch below): \[ p(\mathbf{x}) = p(x_1)p(x_2|x_1)p(x_3|x_1,x_2) \]
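A minimal sketch in three dimensions (binary variables, probability tables made up for illustration) showing the joint as the product of conditionals:

```python
import numpy as np

# p(x1), p(x2|x1), p(x3|x1,x2) as explicit tables over binary variables
p_x1 = np.array([0.4, 0.6])                       # indexed by x1
p_x2_given_x1 = np.array([[0.7, 0.3],             # rows: x1, columns: x2
                          [0.2, 0.8]])
p_x3_given_x1_x2 = np.random.default_rng(0).dirichlet([1, 1], size=(2, 2))  # shape (x1, x2, x3)

def joint(x1, x2, x3):
    # product rule: p(x) = p(x1) p(x2|x1) p(x3|x1,x2)
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x1_x2[x1, x2, x3]

# The joint sums to one over all 2**3 configurations
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 6))  # 1.0
```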

Autoregressive models with neural networks

Reduce complexity: finite memory


  • To limit the complexity of a conditional model, assume a finite memory
  • For instance, we can assume that each variable depends on no more than the two preceding variables

\[ p(\mathbf{x}) = p(x_1)p(x_2|x_1) \prod_{d=3}^{D} p(x_d | x_{d-1}, x_{d-2}) \]

Autoregressive models with neural networks

Multilayer perceptron (MLP) depending on the two last inputs
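A minimal sketch (assumed vocabulary size and hidden width, PyTorch) of an MLP that parameterizes \(p(x_d \mid x_{d-1}, x_{d-2})\):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 16, 32   # assumed sizes for illustration

class TwoTokenMLP(nn.Module):
    """Predicts a categorical distribution over x_d from the two previous tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, x_prev2, x_prev1):
        context = torch.cat([self.embed(x_prev2), self.embed(x_prev1)], dim=-1)
        return torch.softmax(self.mlp(context), dim=-1)   # p(x_d | x_{d-1}, x_{d-2})

model = TwoTokenMLP()
probs = model(torch.tensor([3]), torch.tensor([7]))
print(probs.shape)   # torch.Size([1, 100]): a distribution over the vocabulary
```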

Autoregressive models with neural networks

Long Short-Term Memory (LSTM) Recurrent Neural Networks


  • We want to have more long-term memory, but still want to limit the model’s complexity
  • An LSTM RNN is one possible solution:

\[ p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d | RNN(x_{d-1}, h_{d-1})) \]
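A minimal sketch (assumed sizes, PyTorch) of an LSTM that carries the history in its hidden state \(h_{d-1}\) and outputs the conditional distribution for the next token at every position:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 16, 32   # assumed sizes for illustration

class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len); the hidden state summarizes everything before x_d
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)        # logits for p(x_d | x_{d-1}, h_{d-1}) at every position

model = LSTMLanguageModel()
tokens = torch.randint(0, vocab_size, (1, 10))
print(model(tokens).shape)   # torch.Size([1, 10, 100])
```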

Autoregressive models with neural networks

Recurrent Neural Network (RNN) depending on the two last inputs

Autoregressive models with neural networks

Discriminative vs. generative LSTM RNN (Yogatama et al., 2017)

Use-case: text classification

  • Predict the document class \(y\) for each sequence of words \(x_1, x_2, ...\)
  • Inputs are static embeddings of the words
  • The LSTM outputs are combined into a single prediction, typically via a softmax activation function over the classes (see the sketch below)
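A minimal sketch (assumed sizes, PyTorch) of the discriminative variant: the LSTM reads the word embeddings and its final hidden state is mapped to class probabilities with a softmax:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden, n_classes = 100, 16, 32, 4   # assumed sizes

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embed(tokens))        # h_n: final hidden state
        return torch.softmax(self.head(h_n[-1]), dim=-1)   # p(y | x_1, x_2, ...)

clf = LSTMClassifier()
print(clf(torch.randint(0, vocab_size, (1, 12))))   # class probabilities, shape (1, 4)
```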

Use-case: next token prediction

  • Same as before, but we add class embeddings \(\mathbf{V}\)
  • Note the use of the chain rule to calculate the conditional probability of each word
  • The output for \(x_{t-1}\) is recursively fed back as the input for \(x_t\) (see the sketch below)
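A minimal sketch of the generative use. It assumes `model` is an autoregressive language model returning logits of shape (batch, seq_len, vocab_size), e.g. the LSTMLanguageModel sketched earlier: the chain rule turns per-step conditionals into the probability of a whole sequence, and sampling feeds each output back in as the next input.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokens):
    """Chain rule: log p(x) = sum_d log p(x_d | x_{<d})."""
    logits = model(tokens[:, :-1])                # predict each next token from its prefix
    log_probs = F.log_softmax(logits, dim=-1)
    target = tokens[:, 1:].unsqueeze(-1)
    return log_probs.gather(-1, target).sum()

def generate(model, prefix, n_steps):
    """Recursively feed the sampled output back in as the next input."""
    tokens = prefix
    for _ in range(n_steps):
        logits = model(tokens)[:, -1]             # distribution for the next token
        next_token = torch.multinomial(F.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```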

The Transformer

Attention is all you need (source: Jay Alammar)


The Transformer

Attention is all you need (source: Jay Alammar)

  • The concept of Query, Key and Value is analogous to information retrieval in a database
  • Computation is ‘just’ matrix multiplication (see the sketch below)
    • Can be run in parallel (multiple attention heads)
    • Optimized software exists for matrix computations
    • Still, this is computationally the most expensive part
  • Ongoing developments
    • More efficient attention calculation
    • Alternative architectures: xLSTM, state space models (SSM)
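A minimal sketch of scaled dot-product attention in NumPy (sizes chosen arbitrarily), showing that one attention head is just a few matrix multiplications plus a softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices; returns one weighted mixture of values per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                # (4, 8)
```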

The Transformer

Example sequence-to-sequence: translation (source: Jay Alammar)

Latent Variable Models

The Autoencoder

The Variational Autoencoder (VAE)

The Variational Autoencoder (VAE)

Generative AI Systems

From models to systems


Intermezzo: getting the terminology right

Embeddings, encoders, decoders …


  • Embeddings: usually a static embedding like word2vec; always needed to transform text into vectors that carry some context information
  • Encoder: a contextual embedding like BERT, often referred to as an encoder-only transformer
  • Decoder: often the core of the generative AI system, with decoder-only transformers (generative pre-trained transformers, GPT) as a well-known example (see the sketch below)
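A minimal sketch using the Hugging Face transformers library to contrast the two (the model names are just common examples and the weights are downloaded on first use):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Encoder-only (BERT-style): contextual embeddings for each token
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tok("coding in Python is fun", return_tensors="pt")
contextual = encoder(**inputs).last_hidden_state    # (1, seq_len, hidden_size)

# Decoder-only (GPT-style): generate a continuation token by token
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tok("coding in Python is", return_tensors="pt")
generated = decoder.generate(**prompt, max_new_tokens=5)
print(dec_tok.decode(generated[0]))
```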

Retrieval Augmented Generation

Source: Meta blogpost (2020)
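A minimal sketch of the retrieve-then-generate loop. The embedding function is a toy stand-in and `llm_generate` is a hypothetical placeholder for whatever LLM call the system uses; only the mechanism (embed, retrieve by similarity, stuff the prompt) is the point.

```python
import numpy as np

documents = [
    "The grant application deadline is 1 March.",
    "Lesson plans should state learning objectives.",
    "Transformers use attention to mix information across tokens.",
]

def embed_text(text, dim=64):
    """Toy hash-seeded embedding; in practice use a trained encoder model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

doc_vectors = np.stack([embed_text(d) for d in documents])

def retrieve(query, k=2):
    scores = doc_vectors @ embed_text(query)      # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query):
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)                   # hypothetical placeholder for an LLM call

print(retrieve("When do I need to submit the grant?"))
```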

Imagen: Diffusion + Super-resolution

Source: Google DeepMind

Imagen: Diffusion + Super-resolution

Source: Google DeepMind

Examples from Texterous

Scraping job postings

Document retrieval: finding relevant grants

Education: lesson plan generator

Education: lesson plan generator

AlphaFold: predicting 3D structure of proteins

AlphaFold 2 in a nutshell

AlphaFold 2 architecture

AlphaFold 3: prediction of nearly all molecular types in the Protein Data Bank