DALL·E 1#


  • The first version of DALL·E was a GPT-3-style transformer decoder that autoregressively generated a 256×256 image from a text prompt and an optional beginning of the image.

  • The text is encoded with up to 256 BPE tokens, and the image is encoded with 1024 special image tokens produced by a discrete variational autoencoder (dVAE); see the layout sketch after this list.

  • dVAE encodes a 256×256 image into a grid of 32×32 tokens with a vocabulary of 8192 possible values.

  • Because of the dVAE, some details and high-frequency features are lost, so a certain blurriness and smoothness are characteristic of DALL·E-generated images.
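A minimal sketch of the resulting input layout, assuming illustrative vocabulary sizes (the variable names and the text vocabulary size below are ours, not the paper's):

```python
import torch

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # illustrative text vocab; 8192 dVAE codes
MAX_TEXT, N_IMAGE = 256, 32 * 32        # up to 256 BPE tokens, 32x32 image grid

text_tokens = torch.randint(0, TEXT_VOCAB, (MAX_TEXT,))
image_tokens = torch.randint(0, IMAGE_VOCAB, (N_IMAGE,))

# Image token ids are shifted past the text vocabulary so that both
# modalities can share a single embedding table in the decoder.
sequence = torch.cat([text_tokens, image_tokens + TEXT_VOCAB])
assert sequence.numel() == 1280          # 256 text + 1024 image tokens
```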

DALL·E 1 Characteristics#

Controlling Attributes#


Drawing Multiple Objects#


Visualizing Perspective and Three-Dimensionality#


Visualizing Internal and External Structure#


Inferring Contextual Details#


DALL·E 1 Architecture#

  • The transformer is a large model with 12B parameters.

  • It consists of 64 sparse transformer blocks, each containing a complex set of attention mechanisms:

    • classical text-to-text masked attention,

    • image-to-text attention, and

    • image-to-image sparse attention.

  • All three attention types are merged into a single attention operation; a mask-construction sketch follows this list.

  • The model was trained on a dataset of 250M image-text pairs.
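A minimal sketch of how such a merged attention mask could be built, assuming 256 text and 1024 image positions; the real model additionally restricts image-to-image attention with sparse row, column, and convolutional patterns, which are omitted here for brevity:

```python
import torch

T, I = 256, 1024                     # text and image token counts
N = T + I

# Boolean mask: entry (q, k) is True if position q may attend to k.
mask = torch.zeros(N, N, dtype=torch.bool)

# 1) text-to-text: standard causal (masked) attention
mask[:T, :T] = torch.tril(torch.ones(T, T, dtype=torch.bool))

# 2) image-to-text: every image token sees the full text
mask[T:, :T] = True

# 3) image-to-image: plain causal here; the real blocks are sparser
mask[T:, T:] = torch.tril(torch.ones(I, I, dtype=torch.bool))
```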

VQ-VAE#


  • VQ-VAE is a type of variational autoencoder that uses vector quantization to obtain a discrete latent representation.

  • This is in contrast to the continuous latent space that other variational autoencoders have.

  • The objective function of a VQ-VAE, when trained on an image dataset, can be written as:

    \(\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \cdot D_{KL}[q(z|x) || p(z)]\)

    where \(p(x)\) is the data distribution, \(q\) is the approximate posterior over latent variables and \(D_{KL}\) denotes the Kullback-Leibler divergence.

  • This objective function encourages the model to learn an efficient codebook that minimises reconstruction error while also matching the prior distribution over codes.


A latent space is obtained by encoding the input image into the nearest codebook entry. This process is called vector quantization and results in a discrete latent space.
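A minimal sketch of this quantization step, assuming a PyTorch-style encoder output and the 8192-entry codebook mentioned above (all names, and every size except the codebook size, are illustrative); the straight-through gradient copy is the standard trick from the VQ-VAE paper:

```python
import torch

def vector_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (batch, height, width, dim) continuous encoder outputs
    codebook: (K, dim) learnable embedding vectors
    """
    # Euclidean distance from every spatial position to every code.
    dists = torch.cdist(z_e.flatten(0, 2), codebook)      # (B*H*W, K)
    indices = dists.argmin(dim=-1)                        # discrete latents
    z_q = codebook[indices].view_as(z_e)                  # quantized vectors
    # Straight-through estimator: copy gradients from z_q to z_e,
    # since argmin itself is not differentiable.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices.view(z_e.shape[:-1])

codebook = torch.randn(8192, 64, requires_grad=True)      # K = 8192 codes
z_e = torch.randn(1, 32, 32, 64)                          # 32x32 latent grid
z_q, idx = vector_quantize(z_e, codebook)
```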


Autoencoders#


  • An autoencoder is a neural network trained to reconstruct its own input; a minimal sketch follows this list.

  • The objective function of an autoencoder can be written as:

    \(\mathcal{L} = \mathbb{E}_{p(x)}[\log p(x|z)]\)

    where \(p(x)\) is the data distribution and \(p(x|z)\) is the model's reconstruction distribution given the latent code \(z\) produced by the encoder.

  • This objective function encourages the model to learn a latent space that captures the underlying structure of the data.

  • A VQ-VAE can be seen as a type of autoencoder where the latent space is constrained to be discrete.
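A minimal autoencoder sketch in PyTorch; the layer sizes and the use of a mean-squared-error reconstruction loss are illustrative choices, not taken from any particular paper:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input into a small latent code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder reconstructs the input from that code.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # latent code
        return self.decoder(z)       # reconstruction

model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
```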

Variational Autoencoders (VAE)#


  • A variational autoencoder (VAE) is a type of autoencoder whose latent space is continuous; a minimal sketch follows this list.

  • The objective function of a VAE can be written as:

    \(\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}[q(z|x) || p(z)]\)

    where \(p(x)\) is the data distribution, \(q\) is the approximate posterior over latent variables and \(D_{KL}\) denotes the Kullback-Leibler divergence.

  • This objective function encourages the model to learn a latent space that captures the underlying structure of the data while also matching the prior distribution over latent variables.
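A minimal sketch of one VAE training step with a Gaussian posterior, using the reparameterisation trick so that sampling stays differentiable (all sizes are illustrative):

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 32)         # outputs (mu, log_var), latent_dim = 32
dec = nn.Linear(32, 784)

x = torch.rand(16, 784)
mu, log_var = enc(x).chunk(2, dim=-1)

# Reparameterisation: z = mu + sigma * eps keeps gradients flowing to enc.
eps = torch.randn_like(mu)
z = mu + eps * torch.exp(0.5 * log_var)        # z ~ q(z|x)

recon = nn.functional.mse_loss(dec(z), x, reduction="sum")
# Closed-form KL between q(z|x) = N(mu, sigma^2) and the prior p(z) = N(0, I).
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
loss = recon + kl                              # negative ELBO (up to constants)
```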

Discrete Spaces#


  • Discrete spaces are more efficient to represent than continuous spaces.

  • This is because a point in a discrete space can be described exactly with a finite number of bits, whereas a point in a continuous space generally cannot; a worked example follows this list.

  • In addition, discrete spaces are easier to manipulate and reason about than continuous spaces.

  • For these reasons, VQ-VAE is more efficient than VAE at learning latent representations of data.
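A worked example with the dVAE numbers from above: a 32×32 grid of tokens, each drawn from a codebook of 8192 = 2¹³ entries, so an entire image latent fits in under two kilobytes:

```python
import math

tokens = 32 * 32                        # dVAE latent grid
bits_per_token = math.log2(8192)        # 13 bits per codebook index
total_bits = tokens * bits_per_token    # 13,312 bits
print(total_bits / 8 / 1024)            # ~1.6 KiB per image latent
```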

Uncertainty in the Posterior#


  • Uncertainty in the posterior is added by soft-sampling codebook vectors from the Gumbel-Softmax distribution.

  • This results in a softened latent space which can be seen as a continuous approximation of the discrete latent space.

  • The Gumbel-Softmax distribution is a type of distribution that allows for sampling from a discrete space while still allowing for gradients to flow through the samples.

  • This is useful for training models with discrete latent spaces, such as VQ-VAE.

  • The Gumbel-Softmax distribution produces samples of the form:

    \(y_k = \frac{\exp((\log \pi_k + g_k)/\tau)}{\sum_{j=1}^{K} \exp((\log \pi_j + g_j)/\tau)}\)

    where \(\pi_k\) are the class probabilities, \(g_k \sim \mathrm{Gumbel}(0, 1)\) are i.i.d. noise samples, \(\tau\) is the temperature and \(K\) is the number of classes. A minimal sampling sketch follows this list.
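A minimal sketch of soft-sampling codebook vectors with Gumbel-Softmax weights. PyTorch also ships torch.nn.functional.gumbel_softmax; the manual version below just makes the mechanics explicit, and all sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, eps=1e-10):
    # Gumbel(0, 1) noise via the inverse-CDF trick on uniform samples.
    u = torch.rand_like(logits)
    gumbels = -torch.log(-torch.log(u + eps) + eps)
    # As tau -> 0 this approaches a one-hot sample; larger tau is softer.
    return F.softmax((logits + gumbels) / tau, dim=-1)

logits = torch.randn(1024, 8192)        # one logit per codebook entry, per position
weights = gumbel_softmax_sample(logits, tau=0.5)
codebook = torch.randn(8192, 64)
z = weights @ codebook                  # soft mixture of codebook vectors
```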

Comparison of original images (top) and reconstructions from the dVAE (bottom)


Decoder#

A GPT-3-like transformer decoder consumes a sequence of text tokens and (optional) image tokens (here a single image token with id 42) and produces a continuation of the image (here the next image token with id 1369)
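A hedged sketch of this autoregressive continuation loop; the model stub and all names below are ours, standing in for the trained decoder:

```python
import torch
import torch.nn as nn

VOCAB = 16384 + 8192       # illustrative text vocab + 8192 image codes

# Stand-in for the trained decoder: anything mapping a (batch, seq) token
# tensor to (batch, seq, VOCAB) next-token logits works for this sketch.
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

def continue_image(model, text_tokens, image_prefix, n_image=1024):
    # Sample one image token at a time, feeding each back in.
    tokens = torch.cat([text_tokens, image_prefix])
    target_len = text_tokens.numel() + n_image
    while tokens.numel() < target_len:
        logits = model(tokens.unsqueeze(0))[0, -1]       # last position
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_tok])
    return tokens[text_tokens.numel():]                  # the image tokens

text = torch.randint(0, 16384, (8,))
prefix = torch.tensor([16384 + 42])                      # image token id 42
image_tokens = continue_image(model, text, prefix)
```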


Sampling From a Trained DALL·E#


DALL·E 1 Results#

Several image generation examples from the original paper


For a given text prompt, the trained model generated many samples (up to 512!), which were then ranked by a separate model called CLIP; the top-ranked sample was returned as the final output. A hedged sketch of this rerank loop follows.
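The sketch below uses hypothetical stand-ins `generate_image` and `clip_score` for the trained decoder and the CLIP similarity model; neither name comes from the paper:

```python
import random

# Hypothetical stand-ins (placeholders only): the real system would call
# the trained DALL·E decoder and the CLIP text-image similarity model.
def generate_image(prompt):
    return [random.random() for _ in range(10)]    # placeholder "image"

def clip_score(prompt, image):
    return random.random()                         # placeholder similarity

def best_of_n(prompt, n=512):
    # Draw n candidates, then keep the one CLIP ranks highest.
    candidates = [generate_image(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: clip_score(prompt, img))

best = best_of_n("an armchair in the shape of an avocado", n=8)
```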
