Introduction to Generative AI

From foundational models to GenAI systems

Daniel Kapitan

Eindhoven AI Systems Institute

June 20, 2025

Attribution


Generative Modeling

The need for generative models

Adding noise that is imperceptible to humans can badly trip up the model


The need for generative models

We need better techniques for quantifying uncertainty


The core rules of probability theory


The product rule

\[ \begin{align} p(x, y) & = p(x|y)p(y) \\ & = p(y|x)p(x) \end{align} \]

This is in fact Bayes’ rule written differently: rearranging gives \( p(y|x) = \frac{p(x|y)p(y)}{p(x)} \)

The sum rule

\[ p(x) = \sum_y p(x, y)\]
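A minimal numerical sketch (the toy joint distribution is made up for illustration) that verifies both rules and, by combining them, Bayes’ rule:

```python
import numpy as np

# Toy joint distribution p(x, y) over x in {0, 1} (rows) and y in {0, 1} (columns)
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

# Sum rule: marginalize to obtain p(x) and p(y)
p_x = p_xy.sum(axis=1)          # [0.4, 0.6]
p_y = p_xy.sum(axis=0)          # [0.3, 0.7]

# Product rule: p(x, y) = p(x|y) p(y) = p(y|x) p(x)
p_x_given_y = p_xy / p_y               # divide each column by p(y)
p_y_given_x = p_xy / p_x[:, None]      # divide each row by p(x)

assert np.allclose(p_x_given_y * p_y, p_xy)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule follows from combining both factorizations
assert np.allclose(p_y_given_x, (p_x_given_y * p_y) / p_x[:, None])
```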

Overview of Generative AI


  • Deep learning provides architecture for parameterizing the model
  • Probabilistic modeling provides mathematical foundation for the model
  • Software engineering provides computing resources for implementing the model

Taxonomy of Deep Generative Modeling


  • We will cover Autoregressive Models (ARM) and Latent Variable Models
  • Quiz: which type of model won the Nobel Prize in Physics in 2024?

Autoregressive Models

Remember time-series forecasting?


In statistics, econometrics, and signal processing, an autoregressive model

  • is a representation of a type of random process
  • can be used to describe time-varying processes in nature, economics, behavior, etc.
  • specifies that the output variable depends linearly on its own previous values and on a stochastic term
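A minimal sketch of such a process (the AR(2) coefficients are chosen arbitrarily for illustration): each value is a linear function of its two previous values plus a stochastic term.

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2, sigma = 0.6, 0.3, 1.0   # illustrative AR(2) coefficients and noise scale
x = np.zeros(200)
for t in range(2, len(x)):
    # output depends linearly on its own previous values and on a stochastic term
    x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + rng.normal(scale=sigma)
print(x[:5])
```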

Markov property:

A stochastic process is memoryless: the next state depends only on the current state, not on the full history:

\[ \begin{align} & P(\text{coding in Python is fun} \mid \text{coding in Python is}) \\ & \approx P(\text{Python is fun} \mid \text{Python is}) \end{align} \]
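A minimal sketch of the same idea with counts from a toy corpus: the probability of the next word is estimated from a shortened context only (here a first-order Markov, i.e. bigram, model over a made-up sentence).

```python
from collections import Counter, defaultdict

corpus = "coding in Python is fun and coding in Python is easy".split()

# Estimate P(next word | previous word) from bigram counts
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, nxt):
    counts = bigrams[prev]
    return counts[nxt] / sum(counts.values())

print(p_next("is", "fun"))   # 0.5: 'is' is followed by 'fun' once and by 'easy' once
```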

Autoregressive models with neural networks

Factorization into conditional probabilities using the product rule


  • Say we have a variable \(\mathbf{x}\) in \(D\) dimensions and we want to model \(p(\mathbf{x})\)
  • Using the product rule, the joint can be factorized into conditional probabilities: \[ p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d | \mathbf{x}_{< d}) \]
  • In the case of three dimensions (see the sketch below): \[ p(\mathbf{x}) = p(x_1)p(x_2|x_1)p(x_3|x_1,x_2) \]
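A minimal sketch in three dimensions (binary variables, probability tables made up for illustration) showing the joint as the product of conditionals:

```python
import numpy as np

# p(x1), p(x2|x1), p(x3|x1,x2) as explicit tables over binary variables
p_x1 = np.array([0.4, 0.6])                       # indexed by x1
p_x2_given_x1 = np.array([[0.7, 0.3],             # rows: x1, columns: x2
                          [0.2, 0.8]])
p_x3_given_x1_x2 = np.random.default_rng(0).dirichlet([1, 1], size=(2, 2))  # shape (x1, x2, x3)

def joint(x1, x2, x3):
    # product rule: p(x) = p(x1) p(x2|x1) p(x3|x1,x2)
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x1_x2[x1, x2, x3]

# The joint sums to one over all 2**3 configurations
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 6))  # 1.0
```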

Autoregressive models with neural networks

Reduce complexity: finite memory


  • To limit the complexity of a conditional model, assume a finite memory
  • For instance, we can assume that each variable depends on no more than the two preceding variables

\[ p(\mathbf{x}) = p(x_1)p(x_2|x_1) \prod_{d=3}^{D} p(x_d | x_{d-1}, x_{d-2}) \]

Autoregressive models with neural networks

Multilayer perceptron (MLP) depending on the two last inputs
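A minimal sketch (assumed vocabulary size and hidden width, PyTorch) of an MLP that parameterizes \(p(x_d \mid x_{d-1}, x_{d-2})\):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 16, 32   # assumed sizes for illustration

class TwoTokenMLP(nn.Module):
    """Predicts a categorical distribution over x_d from the two previous tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, x_prev2, x_prev1):
        context = torch.cat([self.embed(x_prev2), self.embed(x_prev1)], dim=-1)
        return torch.softmax(self.mlp(context), dim=-1)   # p(x_d | x_{d-1}, x_{d-2})

model = TwoTokenMLP()
probs = model(torch.tensor([3]), torch.tensor([7]))
print(probs.shape)   # torch.Size([1, 100]): a distribution over the vocabulary
```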

Autoregressive models with neural networks

Long Short-Term Memory (LSTM) Recurrent Neural Networks


  • We want to have more long-term memory, but still want to limit the model’s complexity
  • An LSTM RNN is one possible solution:

\[ p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d | RNN(x_{d-1}, h_{d-1})) \]
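A minimal sketch (assumed sizes, PyTorch) of an LSTM that carries the history in its hidden state \(h_{d-1}\) and outputs the conditional distribution for the next token at every position:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 16, 32   # assumed sizes for illustration

class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len); the hidden state summarizes everything before x_d
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)        # logits for p(x_d | x_{d-1}, h_{d-1}) at every position

model = LSTMLanguageModel()
tokens = torch.randint(0, vocab_size, (1, 10))
print(model(tokens).shape)   # torch.Size([1, 10, 100])
```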

Autoregressive models with neural networks

Recurrent Neural Network (RNN) depending on the two last inputs

Autoregressive models with neural networks

Discriminative vs. generative LSTM RNN (Yogatama et al., 2017)

Use-case: text classification

  • Predict the document class \(y\) for each sequence of words \(x_1, x_2, ...\)
  • Inputs are static embeddings of the words
  • The LSTM outputs are combined into a single prediction, typically via a softmax activation function over the classes (see the sketch below)
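A minimal sketch (assumed sizes, PyTorch) of the discriminative variant: the LSTM reads the word embeddings and its final hidden state is mapped to class probabilities with a softmax:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden, n_classes = 100, 16, 32, 4   # assumed sizes

class LSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, tokens):
        _, (h_n, _) = self.lstm(self.embed(tokens))        # h_n: final hidden state
        return torch.softmax(self.head(h_n[-1]), dim=-1)   # p(y | x_1, x_2, ...)

clf = LSTMClassifier()
print(clf(torch.randint(0, vocab_size, (1, 12))))   # class probabilities, shape (1, 4)
```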

Use-case: next token prediction

  • Same as before, but we add class embeddings \(\mathbf{V}\)
  • Note the use of the chain rule to calculate the conditional probability of each word
  • The output for \(x_{t-1}\) is recursively fed back as the input for \(x_t\) (see the sketch below)
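A minimal sketch of the generative use. It assumes `model` is an autoregressive language model returning logits of shape (batch, seq_len, vocab_size), e.g. the LSTMLanguageModel sketched earlier: the chain rule turns per-step conditionals into the probability of a whole sequence, and sampling feeds each output back in as the next input.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokens):
    """Chain rule: log p(x) = sum_d log p(x_d | x_{<d})."""
    logits = model(tokens[:, :-1])                # predict each next token from its prefix
    log_probs = F.log_softmax(logits, dim=-1)
    target = tokens[:, 1:].unsqueeze(-1)
    return log_probs.gather(-1, target).sum()

def generate(model, prefix, n_steps):
    """Recursively feed the sampled output back in as the next input."""
    tokens = prefix
    for _ in range(n_steps):
        logits = model(tokens)[:, -1]             # distribution for the next token
        next_token = torch.multinomial(F.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```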

The Transformer

Attention is all you need (source: Jay Alammar)


The Transformer

Attention is all you need (source: Jay Alammar)

  • The concept of Query, Key and Value is analogous to information retrieval in a database
  • Computation is ‘just’ matrix multiplication (see the sketch below)
    • Can be run in parallel (multiple attention heads)
    • Optimized software exists for matrix computations
    • Still, this is computationally the most expensive part
  • Ongoing developments
    • More efficient attention calculation
    • Alternative architectures: xLSTM, state space models (SSM)
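A minimal sketch of scaled dot-product attention in NumPy (sizes chosen arbitrarily), showing that one attention head is just a few matrix multiplications plus a softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices; returns one weighted mixture of values per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                # (4, 8)
```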

The Transformer

Example sequence-to-sequence: translation (source: Jay Alammar)

Latent Variable Models

The Autoencoder

The Variational Autoencoder (VAE)

The Variational Autoencoder (VAE)

Generative AI Systems

From models to systems


Intermezzo: getting the terminology right

Embeddings, encoders, decoders …


  • Embeddings: usually a static embedding like word2vec; always needed to transform text into vectors that carry some context information
  • Encoder: a contextual embedding like BERT, often referred to as an encoder-only transformer
  • Decoder: often the core of the generative AI system, with decoder-only transformers (generative pre-trained transformers, GPT) as a well-known example (see the sketch below)
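A minimal sketch using the Hugging Face transformers library to contrast the two (the model names are just common examples and the weights are downloaded on first use):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Encoder-only (BERT-style): contextual embeddings for each token
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tok("coding in Python is fun", return_tensors="pt")
contextual = encoder(**inputs).last_hidden_state    # (1, seq_len, hidden_size)

# Decoder-only (GPT-style): generate a continuation token by token
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tok("coding in Python is", return_tensors="pt")
generated = decoder.generate(**prompt, max_new_tokens=5)
print(dec_tok.decode(generated[0]))
```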

Retrieval Augmented Generation

Source: Meta blogpost (2020)
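A minimal sketch of the retrieve-then-generate loop. The embedding function is a toy stand-in and `llm_generate` is a hypothetical placeholder for whatever LLM call the system uses; only the mechanism (embed, retrieve by similarity, stuff the prompt) is the point.

```python
import numpy as np

documents = [
    "The grant application deadline is 1 March.",
    "Lesson plans should state learning objectives.",
    "Transformers use attention to mix information across tokens.",
]

def embed_text(text, dim=64):
    """Toy hash-seeded embedding; in practice use a trained encoder model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

doc_vectors = np.stack([embed_text(d) for d in documents])

def retrieve(query, k=2):
    scores = doc_vectors @ embed_text(query)      # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query):
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)                   # hypothetical placeholder for an LLM call

print(retrieve("When do I need to submit the grant?"))
```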

Imagen: Diffusion + Super-resolution

Source: Google DeepMind

Imagen: Diffusion + Super-resolution

Source: Google DeepMind

Examples from Texterous

Scraping job postings

Document retrieval: finding relevant grants

Education: lesson plan generator

Education: lesson plan generator

AlphaFold: predicting 3D structure of proteins

AlphaFold 2 in a nutshell

AlphaFold 2 architecture

AlphaFold 3: prediction of nearly all molecular types in the Protein Data Bank