
  • The first version of DALL·E was a GPT-3 style transformer decoder that autoregressively generated a 256×256 image based on textual input and an optional beginning of the image.

  • The text is encoded as up to 256 BPE tokens, and the image is encoded as 1024 special image tokens produced by a discrete variational autoencoder (dVAE).

  • The dVAE encodes a 256×256 image into a 32×32 grid of tokens (1024 in total), each drawn from a vocabulary of 8192 possible values.

  • Because the dVAE compression is lossy, fine details and high-frequency features are lost, so DALL·E-generated images tend to look somewhat blurry and overly smooth.

DALL·E 1 Characteristics#

Controlling Attributes#


Drawing Multiple Objects#


Visualizing Perspective and Three-Dimensionality#


Visualizing Internal and External Structure#


Inferring Contextual Details#


DALL·E 1 Architecture#

  • The transformer is a large model with 12B parameters.

  • It consists of 64 sparse transformer blocks that combine three kinds of attention:

    • classical text-to-text masked attention,

    • image-to-text attention, and

    • image-to-image sparse attention.

  • All three attention types are merged into a single attention operation.

  • The model was trained on a dataset of 250M image-text pairs.
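One way to picture how the three attention types fit into a single operation is as regions of one causal mask over the concatenated text and image sequence. The sketch below is only an illustration under simplifying assumptions: it uses a dense causal mask and toy sequence lengths, and it does not reproduce the sparse image-to-image patterns of the actual model. The function names are hypothetical.

```python
def build_causal_mask(n_text, n_image):
    """mask[i][j] is True when query position i may attend to key position j."""
    n = n_text + n_image
    # Autoregressive rule: a position sees itself and everything before it.
    return [[j <= i for j in range(n)] for i in range(n)]


def attention_type(i, j, n_text):
    """Name the region of the mask that the (query i, key j) pair belongs to."""
    if i < n_text:
        return "text-to-text" if j < n_text else "text-to-image"
    return "image-to-text" if j < n_text else "image-to-image"


def count_allowed(mask, n_text):
    """Count the permitted (query, key) pairs in each region."""
    counts = {}
    for i, row in enumerate(mask):
        for j, ok in enumerate(row):
            if ok:
                kind = attention_type(i, j, n_text)
                counts[kind] = counts.get(kind, 0) + 1
    return counts


# Toy sizes for readability; the text above gives 256 text and 1024 image tokens.
N_TEXT, N_IMAGE = 3, 4
mask = build_causal_mask(N_TEXT, N_IMAGE)
counts = count_allowed(mask, N_TEXT)
print(counts)
```

Because text tokens come first in the sequence, the "text-to-image" region is fully masked: text positions never look ahead at image tokens, while every image position can see the whole text.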



  • VQ-VAE is a type of variational autoencoder that uses vector quantization to obtain a discrete latent representation.

  • This is in contrast to the continuous latent space that other variational autoencoders have.

  • The objective function of a VQ-VAE, when trained on an image dataset, can be written as:

    \(\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \cdot D_{KL}[q(z|x) || p(z)]\)

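    where \(q(z|x)\) is the approximate posterior over latent variables (the encoder), \(p(x|z)\) is the decoder likelihood, \(p(z)\) is the prior over codes, \(\beta\) weights the KL term and \(D_{KL}\) denotes the Kullback-Leibler divergence. This is the \(\beta\)-weighted evidence lower bound (ELBO) form; the original VQ-VAE formulation instead uses a reconstruction term plus codebook and commitment losses, because its KL term is constant under a uniform prior.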

  • This objective function encourages the model to learn an efficient codebook that minimises reconstruction error while also matching the prior distribution over codes.

A discrete latent space

The discrete latent representation is obtained by mapping each encoder output vector to its nearest codebook entry. This process is called vector quantization, and the index of the chosen entry serves as the token for that position.
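The nearest-codebook lookup described above can be sketched in a few lines. This is a minimal illustration with a hypothetical four-entry, two-dimensional codebook and squared Euclidean distance as the assumed metric; a real codebook is learned and far larger.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


# Hypothetical codebook and encoder outputs, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(encoder_outputs, codebook)   # discrete codes
quantized = [codebook[k] for k in tokens]      # vectors passed to the decoder
print(tokens)
```

The list `tokens` is the discrete representation; `quantized` holds the codebook vectors a decoder would receive in place of the original encoder outputs.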




  • An autoencoder is a neural network that is trained to reconstruct its input from a compressed latent representation.

  • The objective function of an autoencoder can be written as:

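    \(\mathcal{L} = \mathbb{E}_{p(x)}[\log p(x|z)], \quad z = f(x)\)

    where \(p(x)\) is the data distribution, \(f\) is the deterministic encoder that maps an input to its latent code \(z\), and \(p(x|z)\) is the model distribution defined by the decoder. The objective is maximised during training.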

  • This objective function encourages the model to learn a latent space that captures the underlying structure of the data.

  • A VQ-VAE can be seen as a type of autoencoder where the latent space is constrained to be discrete.

Variational Autoencoders (VAE)#


  • A variational autoencoder (VAE) is a type of autoencoder whose encoder outputs a distribution over a continuous latent space rather than a single point.

  • The objective function of a VAE can be written as:

    \(\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}[q(z|x) || p(z)]\)

    where \(q(z|x)\) is the approximate posterior over latent variables, \(p(x|z)\) is the decoder likelihood, \(p(z)\) is the prior over latent variables and \(D_{KL}\) denotes the Kullback-Leibler divergence.

  • This objective function encourages the model to learn a latent space that captures the underlying structure of the data while also matching the prior distribution over latent variables.
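To make the two terms of this objective concrete, the sketch below evaluates it for a single example. It rests on assumptions the text does not state: a diagonal Gaussian posterior, a standard normal prior (which gives the KL term a closed form), and a unit-variance Gaussian likelihood with its additive constant dropped. The function names are hypothetical, and the `beta` argument is included so the same code also covers the weighted variant.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def elbo(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction log-likelihood minus the (optionally weighted) KL term."""
    # Unit-variance Gaussian log-likelihood, additive constant dropped.
    log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return log_lik - beta * gaussian_kl(mu, log_var)


# Toy values: imperfect reconstruction, posterior slightly off the prior.
x = [1.0, 2.0]
x_recon = [1.0, 1.0]
value = elbo(x, x_recon, mu=[1.0, 0.0], log_var=[0.0, 0.0])
print(value)
```

The KL term is zero exactly when the posterior equals the prior, so a larger `beta` pushes the encoder harder toward the prior at the expense of reconstruction quality.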

Discrete Spaces#


  • Discrete latent spaces are cheaper to store and transmit than continuous ones.

  • A discrete code drawn from a codebook of \(K\) entries needs only \(\log_2 K\) bits, whereas a real-valued latent has no exact finite representation and in practice is stored as a 16- or 32-bit float per dimension.

  • In addition, discrete codes can be treated as tokens, so they can be modelled autoregressively with the same machinery used for text.

  • For these reasons, a VQ-VAE can yield a more compact latent representation than a continuous VAE, at the cost of some reconstruction detail.
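As a rough worked example of the bit-count argument, the figures quoted earlier (a 32×32 grid of codes with a vocabulary of 8192) can be compared against the raw image. The calculation assumes the 256×256 input is stored as 8-bit RGB, which the text does not state.

```python
GRID = 32            # tokens per side of the latent grid
VOCAB = 8192         # number of possible values per token
IMAGE_SIDE = 256     # pixels per side of the input image
CHANNELS = 3         # assumption: RGB
BITS_PER_CHANNEL = 8 # assumption: 8-bit colour depth

# Bits needed to index one of VOCAB values (13 for 8192).
bits_per_token = (VOCAB - 1).bit_length()

latent_bits = GRID * GRID * bits_per_token
raw_bits = IMAGE_SIDE * IMAGE_SIDE * CHANNELS * BITS_PER_CHANNEL
ratio = raw_bits / latent_bits

print(bits_per_token, latent_bits, raw_bits, round(ratio, 1))
```

Under these assumptions the discrete code is over a hundred times smaller than the raw pixels, which is also why some fine detail cannot survive the round trip.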

Uncertainty in the Posterior#


  • Uncertainty in the posterior is added by soft-sampling codebook vectors from the Gumbel-Softmax distribution.

  • This results in a softened latent space which can be seen as a continuous approximation of the discrete latent space.

  • The Gumbel-Softmax distribution is a type of distribution that allows for sampling from a discrete space while still allowing for gradients to flow through the samples.

  • This is useful for training models with discrete latent spaces, such as the dVAE used in DALL·E.

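  • A sample from the Gumbel-Softmax distribution over \(K\) classes is defined as:

    \(y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{k=1}^K \exp((\log \pi_k + g_k)/\tau)}\)

    where \(\pi_i\) are the class probabilities, \(g_i\) are independent samples from the Gumbel(0, 1) distribution, \(\tau\) is the temperature and \(K\) is the number of classes. As \(\tau \to 0\) the sample approaches a one-hot vector, and as \(\tau\) grows it approaches a uniform vector.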
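A minimal sketch of Gumbel-Softmax sampling may help here. It follows the standard recipe: add independent Gumbel(0, 1) noise to the class logits (log-probabilities up to a constant), divide by a temperature, and apply a softmax. The function name and the toy logits are hypothetical, and this plain-Python version only draws the sample; in practice an autodiff framework would be used so that gradients flow through it.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample over len(logits) classes."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        g = -math.log(-math.log(u))       # Gumbel(0, 1) noise
        noisy.append((logit + g) / temperature)

    # Numerically stable softmax.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]

sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)   # near one-hot
smooth = gumbel_softmax_sample(logits, temperature=1e6, rng=rng)  # near uniform
print(sharp, smooth)
```

Low temperatures give samples close to a discrete choice, while high temperatures flatten the sample toward uniform; training schedules often anneal the temperature downward for this reason.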

Comparison of original images (top) and reconstructions from the dVAE (bottom)



A GPT-3-like transformer decoder consumes a sequence of text tokens and, optionally, image tokens (here a single image token with id 42) and produces a continuation of the image (here the next image token with id 1369).
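The token-by-token continuation described in the caption can be sketched as a simple loop. Everything model-related here is a stand-in: `toy_logits` is a hypothetical deterministic function used in place of a trained transformer, the vocabulary has 8 entries instead of 8192, and the loop picks the highest-scoring token where a real system would typically sample.

```python
def generate_image_tokens(text_tokens, next_token_logits, n_image_tokens, prefix=()):
    """Extend an optional image-token prefix until n_image_tokens are produced."""
    image_tokens = list(prefix)
    while len(image_tokens) < n_image_tokens:
        # The model sees the text followed by all image tokens produced so far.
        logits = next_token_logits(text_tokens, image_tokens)
        best = max(range(len(logits)), key=lambda k: logits[k])
        image_tokens.append(best)
    return image_tokens


def toy_logits(text_tokens, image_tokens, vocab=8):
    """Stand-in for a trained model: favours (sum of context) mod vocab."""
    target = (sum(text_tokens) + sum(image_tokens)) % vocab
    return [1.0 if k == target else 0.0 for k in range(vocab)]


result = generate_image_tokens([3, 5], toy_logits, n_image_tokens=4, prefix=[2])
print(result)
```

In the real setting the finished list of image tokens would be reshaped into a grid and passed to the dVAE decoder to obtain pixels.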


Sampling From a Trained DALL·E#



DALL·E 1 Results#

Several image generation examples from the original paper


At sampling time, the trained model generated many candidate images (up to 512) for a given text prompt. These candidates were then ranked by a separate model, CLIP, and the top-ranked image was returned as the final result.
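The generate-then-rerank step can be sketched as follows. The scoring used here, cosine similarity between a text embedding and each candidate's image embedding, is an assumption made for illustration, and the two-dimensional embeddings and sample names are invented toy values rather than real CLIP outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, candidates, top_k=1):
    """Return the names of the top_k candidates, best match first."""
    ranked = sorted(
        candidates,
        key=lambda item: cosine(text_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical embeddings standing in for the outputs of a scoring model.
text_embedding = [1.0, 0.0]
candidates = [
    ("sample_a", [0.0, 1.0]),
    ("sample_b", [0.9, 0.1]),
    ("sample_c", [0.5, 0.5]),
]

best = rerank(text_embedding, candidates)
top_two = rerank(text_embedding, candidates, top_k=2)
print(best, top_two)
```

Generating more candidates raises the chance that at least one matches the prompt well, at a proportional cost in sampling time.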
