Over the last few years, artificial intelligence technology has made enormous strides in the field of image generation. Thanks to advanced algorithms, such as diffusion models, AI is capable of creating detailed, high-quality images. Models like DALL-E or Stable Diffusion enable image generation based solely on textual descriptions. This technology allows individuals to produce high-quality visualizations without needing specialized artistic skills.
AI image generation is not just a technical innovation; it also opens up new possibilities in design, marketing, education, and many other fields. This technology didn’t appear out of nowhere. In this article, we would like to introduce a few models that have influenced the development of image generation up to the present day.
VAE
The Variational Autoencoder (VAE) is a type of generative neural network proposed in 2013 and formally described in the paper “Auto-Encoding Variational Bayes,” published in early 2014. VAE became one of the first effective approaches to generating realistic data (e.g., images) from statistical distributions.
A VAE is an extension of the classic autoencoder, which learns to map input data (e.g., images) to a latent space – where the data’s dimensionality is reduced and important features are preserved – and then reconstruct them.
Unlike regular autoencoders, which learn “rigid” codes, a VAE learns a probability distribution over the latent space. This allows it to generate new data by sampling from this distribution.
The model consists of two components – an encoder and a decoder. The encoder transforms the input data (in this case, an image) into two vectors: means and standard deviations. The decoder, in turn, learns to reconstruct the image from a latent vector sampled from the distribution defined by these means and standard deviations.
The model learns based on the reconstruction error and KL divergence, which it aims to minimize:
The reconstruction error is calculated based on the differences in pixel values between the input image and the image reconstructed by the VAE.
KL divergence measures the difference between the latent distribution produced by the encoder and the prior, which is typically a standard normal distribution.
A trained model enables the generation of new data from the distribution by passing a random vector (sampled from a normal distribution) to the decoder.
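To make this more concrete, below is a minimal PyTorch sketch of a VAE: the encoder produces means and log-variances (from which standard deviations are derived), a latent vector is sampled from them, and the loss combines the reconstruction error with the KL divergence. Layer sizes and names are illustrative, not a tuned model.

```python
# Minimal VAE sketch in PyTorch (illustrative sizes, not a tuned model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(input_dim, 256)
        self.mu = nn.Linear(256, latent_dim)       # vector of means
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: sample a latent vector from N(mu, sigma^2).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction error (pixel-wise) + KL divergence to the standard normal prior.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Generation after training: decode a random vector sampled from N(0, I).
# model = VAE(); new_image = model.dec(torch.randn(1, 32))
```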
GAN
GANs (Generative Adversarial Networks) are a neural network architecture designed for data generation, first proposed in 2014 in the paper titled “Generative Adversarial Nets.”
They represent one of the most significant milestones in AI development—GANs have enabled the generation of incredibly realistic images, videos, and even music.
A GAN consists of two components: a generator network and a discriminator network. The generator’s task is to create new samples from random noise, while the discriminator’s job is to distinguish whether a sample is real (from the training data) or fake (produced by the generator).
During training, the generator and discriminator networks engage in a continuous learning process: the generator attempts to produce images so realistic they can deceive the discriminator, while the discriminator, with each iteration, aims to improve its ability to differentiate between real and generated images.
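This adversarial setup can be summarized in a short training-step sketch, written here in PyTorch; the generator G and discriminator D are assumed to be already-defined networks, and the losses follow the standard binary cross-entropy formulation.

```python
# Sketch of one GAN training iteration (G and D are assumed nn.Module networks;
# D outputs a probability that its input is real).
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real_images, noise_dim=100):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Discriminator: learn to tell real images from generated ones.
    noise = torch.randn(batch, noise_dim)
    fake_images = G(noise).detach()
    d_loss = (F.binary_cross_entropy(D(real_images), real_labels) +
              F.binary_cross_entropy(D(fake_images), fake_labels))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2) Generator: produce images the discriminator classifies as real.
    noise = torch.randn(batch, noise_dim)
    g_loss = F.binary_cross_entropy(D(G(noise)), real_labels)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```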
By the end of training, the generator network is capable of producing high-quality images that are difficult to distinguish from real ones. Besides revolutionizing image generation, GANs have been utilized for various other tasks, such as style transfer (e.g., transforming photos to look like Monet paintings), generating maps from satellite imagery, image-to-image translation (like changing one object in an image to another), image super-resolution, and modifying a person’s age in a photograph.
GAN-generated cat image
Diffusion Models
Diffusion models are a powerful class of generative probabilistic models that have gained immense popularity in recent years, especially in tasks related to image generation. Their contemporary form was popularized by the paper “Denoising Diffusion Probabilistic Models” (DDPM), published in 2020.
The model is trained to reverse a gradual noising process: given an image with noise added at a certain step, a neural network predicts the noise it contains, and training minimizes the error (most commonly the mean squared error) between the predicted noise and the noise that was actually added.
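A simplified sketch of this training objective is shown below; the noise-predicting network epsilon_model and the cumulative noise schedule alpha_bar are assumed to be defined elsewhere, and the tensor shapes are illustrative.

```python
# Sketch of the DDPM training objective: add noise at a random timestep and
# train a network to predict that noise.
import torch
import torch.nn.functional as F

def ddpm_loss(epsilon_model, x0, alpha_bar, num_steps=1000):
    batch = x0.size(0)
    t = torch.randint(0, num_steps, (batch,))         # random timestep per image
    noise = torch.randn_like(x0)                      # noise to be added
    a = alpha_bar[t].view(batch, 1, 1, 1)             # cumulative noise schedule at step t
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # noisy image at step t
    predicted_noise = epsilon_model(x_t, t)           # network predicts the added noise
    return F.mse_loss(predicted_noise, noise)         # mean squared error
```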
The DDPM paper introduced a stable, efficient, and easy-to-train model that quickly became the foundation for many advanced generative systems, such as Stable Diffusion and DALL-E.
The paper “Denoising Diffusion Implicit Models” proposed a variant (DDIM) that enables faster image generation without compromising quality. Unlike the classic DDPM, which requires going through all (e.g., 1000) denoising steps, DDIM allows for image generation in a significantly smaller number of steps, such as 50 or 100.
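As an example, with the Hugging Face diffusers library (assuming the library and a public DDPM checkpoint such as google/ddpm-cat-256), accelerated DDIM sampling might look roughly like this:

```python
# Accelerated sampling with DDIM via the diffusers library
# (checkpoint name and step count are illustrative).
from diffusers import DDIMPipeline

pipe = DDIMPipeline.from_pretrained("google/ddpm-cat-256")
# 50 denoising steps instead of the full 1000 used by classic DDPM sampling.
image = pipe(num_inference_steps=50).images[0]
image.save("ddim_cat.png")
```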
Diffusion model-generated flower images
Latent Diffusion
LDMs (Latent Diffusion Models) are a breakthrough approach that significantly reduces the computational costs of diffusion models. They combine the diffusion process with a latent space, similar to the one used in VAEs.
Instead of applying diffusion directly in pixel space (which is computationally expensive for high-resolution images), LDMs first encode the image into a latent space (e.g., using an autoencoder) and then perform the diffusion process on this compressed representation. After the denoising process is complete, the result is decoded back into pixel space. This enables, among other things, the generation of higher-resolution images at a lower computational cost. The LDM architecture is used in text-to-image models such as Stable Diffusion.
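A high-level sketch of that pipeline could look as follows; vae_decoder, denoiser, and scheduler are hypothetical placeholders standing in for a trained autoencoder, denoising network, and a diffusers-style noise schedule.

```python
# High-level sketch of latent diffusion generation (all components are
# hypothetical placeholders, not a real API).
import torch

def generate_with_ldm(vae_decoder, denoiser, scheduler, latent_shape=(1, 4, 64, 64)):
    latents = torch.randn(latent_shape)            # start from noise in latent space
    for t in scheduler.timesteps:                  # iterative denoising in latent space
        predicted_noise = denoiser(latents, t)
        latents = scheduler.step(predicted_noise, t, latents).prev_sample
    return vae_decoder(latents)                    # decode latents back to pixel space
```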
Images generated by an LDM model and their text prompts
Transformers
The transformer architecture was initially presented in the paper “Attention Is All You Need.” It revolutionized natural language processing (NLP) and is widely used in large language models, such as ChatGPT. Transformers have also found applications in graphics generation. Although initially used mainly in text-based tasks, their ability to model long-range dependencies has made them a powerful tool in computer vision as well.
Transformers consist of an encoder and a decoder, which in turn are built from layers of attention mechanisms and multilayer perceptrons.
They excel at combining, for example, text and images. In systems like DALL-E, text is encoded using a transformer, and its representation is then used as an input condition for the graphics-generating model (e.g., a diffusion model).
Transformers act as a mechanism for understanding text and transforming it into a representation that then guides image generation.
In more advanced generative systems, transformers operate in image space, working on so-called “patches” – fragments of an image divided into a grid (e.g., 16×16). Instead of classic convolutional neural networks, ViT (Vision Transformer) networks analyze and reconstruct images globally, considering dependencies between distant image fragments.
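The sketch below shows, in PyTorch, how an image can be cut into 16×16 patches and fed to a transformer encoder as a sequence of tokens; the dimensions follow common ViT settings but are otherwise illustrative.

```python
# Sketch of how a Vision Transformer turns an image into a sequence of patch tokens.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                          # batch of one RGB image
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # one embedding per 16x16 patch
tokens = patch_embed(image).flatten(2).transpose(1, 2)       # -> (1, 196, 768) sequence
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
features = nn.TransformerEncoder(encoder_layer, num_layers=2)(tokens)
print(features.shape)                                        # torch.Size([1, 196, 768])
```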
Transformers do not generate images in the same way as GANs, VAEs, or diffusion models; instead, they are used to:
process images as sequences (pixels, image fragments, tokens),
participate in understanding text prompts,
create an interpretive layer in text-to-image systems,
be part of hybrid systems with autoencoders and diffusion models.
Their greatest advantage is their ability to model complex dependencies and their versatility – they can combine vision and text in a single model.
CLIP – Connecting Text and Images
One of the breakthrough tools that significantly influenced the development of generative image models is CLIP (Contrastive Language–Image Pre-training), developed by OpenAI. CLIP was designed for a better understanding of the relationship between natural language and images. In simple terms, this model can assess how well a given textual description matches a specific image because it was trained on a vast number of text-image pairs available on the internet.
CLIP consists of two main components: one processing text (e.g., a transformer) and the other processing images (e.g., a CNN or Vision Transformer). Both components encode input data into a vector space where an image and its corresponding text have similar values, while images and unrelated descriptions are far apart.
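Using the Hugging Face transformers implementation of CLIP, scoring how well a few captions match an image might look like this (the checkpoint name follows the library’s documentation; cat.jpg is a placeholder file):

```python
# Score text-image similarity with CLIP via the transformers library.
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Higher logits mean a better text-image match; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```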
In the context of image generation, CLIP evaluates which images best correspond to a given description. This approach revolutionized text-to-image generation because it allowed for guiding the graphics creation process more precisely and in line with the user’s intent. For example, models like Stable Diffusion or DALL-E use CLIP to compare the textual prompt with the generated images and select those that best capture the idea.
The impact of CLIP has been enormous – it enabled the creation of more accurate, complex, and visually coherent illustrations based on texts, even very abstract or surreal ones.
CLIP Guided Diffusion – An implementation of a diffusion model created by OpenAI, which uses the CLIP model to steer the diffusion process. In this implementation, CLIP is used to find a set of images matching the given description, which serve as guides in the diffusion process. Additional information about the desired image, such as resolution or color palette, can also be provided to the model. At each denoising step, CLIP evaluates how well the current image matches the given description. This evaluation is used in the next denoising step, guiding the generation to be consistent with the text.
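Conceptually, the guidance signal can be sketched as the gradient of the CLIP similarity between the current image estimate and the prompt; in the snippet below, clip_image_embed and text_embedding are hypothetical stand-ins for the real CLIP encoders, and the scale factor is arbitrary.

```python
# Conceptual sketch of one CLIP-guidance step (clip_image_embed and the
# surrounding denoising update are hypothetical, not a real API).
import torch
import torch.nn.functional as F

def clip_guidance_gradient(x_t, text_embedding, clip_image_embed, guidance_scale=100.0):
    x = x_t.detach().requires_grad_(True)
    image_embedding = clip_image_embed(x)          # CLIP's embedding of the current image
    # Cosine similarity: how well does the current image match the prompt?
    similarity = F.cosine_similarity(image_embedding, text_embedding, dim=-1).sum()
    grad = torch.autograd.grad(similarity, x)[0]
    # This gradient is added to the denoising update, nudging the image toward the text.
    return guidance_scale * grad
```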
Image generated by a CLIP Guided Diffusion implementation. Prompt: Landscape resembling the Death tarot card by Gerardo Dottori
Stable Diffusion
Stable Diffusion can create high-quality images based on textual descriptions. It consists of three parts: a VAE, a U-Net, and a text encoder. The VAE encoder compresses the image into a latent space, where the diffusion process is carried out and the U-Net performs the denoising. Finally, the VAE decoder reconstructs the image from the latent representation. The denoising step can be conditioned on an image or text, allowing the model to be used for text-to-image and image-to-image tasks. Conditions are injected into the U-Net via a cross-attention mechanism. The CLIP ViT-L/14 model is used to transform the input text into a vector space.
The model is relatively lightweight: the U-Net contains 860 million parameters, and the text encoder has 123 million, allowing it to be run on consumer GPUs.
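In practice, running Stable Diffusion takes only a few lines with the diffusers library; the sketch below assumes the library, a CUDA-capable GPU, and a publicly hosted v1.5 checkpoint (the identifier and prompt are illustrative).

```python
# Text-to-image generation with Stable Diffusion via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```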
Stable Diffusion 3.0 – Version 3.0 utilizes a multimodal diffusion transformer instead of a U-Net network. In the transformer blocks, the encoded text and image are mixed together. In addition to the text influencing the image, the encoded image also influences the text vector.
The Stable Diffusion architecture is open-source. The model’s flexibility allows for fine-tuning it to specific needs, creating specialized models for generating specific images and scenes.
Other text-to-image models include DALL-E and Midjourney. OpenAI has also made it possible to generate high-quality images within ChatGPT, but it has not released the model’s architecture.
Visualization of the denoising process in Stable Diffusion
Summary
In recent years, artificial intelligence technology has significantly advanced in image generation, enabling the creation of realistic and high-quality visualizations based on textual descriptions. Technologies such as diffusion models, GANs, VAEs, and text-to-image systems like DALL-E and Stable Diffusion form the foundation of modern generative systems.
VAEs and GANs are early approaches to image generation that laid the groundwork for modern techniques. Diffusion models, especially those based on DDPM, became popular due to their ability to generate detailed images, although they require more computational time. Latent Diffusion (LDM) offers a more efficient approach, reducing computational costs while maintaining high image quality. Transformers, though initially used mainly in NLP, have also found applications in image generation, especially when combined with CLIP, which enabled precise guidance of the generation process based on text.
CLIP revolutionized image generation by precisely aligning text with visual representations, allowing for the creation of more accurate images based on complex and abstract prompts. Stable Diffusion has become one of the most popular models for text-to-image generation, offering flexibility and an open-source architecture, which allows users to adapt it to various needs.
New models generate ever more accurate images that are increasingly difficult to distinguish from real ones. They are also better at reflecting user prompts.
AI image generation technology as a whole opens up new possibilities in medicine, design, education, and many other fields, changing the way visual content is created, and also represents a significant step in the development of artificial intelligence. The technologies developed for image synthesis are finding applications in other fields, such as natural language processing.
Image generated by GPT-4o. Prompt: A man looking in a mirror in his apartment, with a window and the Eiffel Tower visible in the reflection.