Text-to-image model

Text-to-image models began to be developed in the mid-2010s, during the early stages of the AI boom, as a result of advances in deep neural networks.

In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art.

Early attempts to build text-to-image models were limited to collages formed by arranging existing component images, such as images drawn from a database of clip art.

alignDRAW, introduced in 2015, extended the earlier DRAW architecture (a recurrent variational autoencoder with an attention mechanism) so that generation could be conditioned on text sequences.

A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[9]
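The basic mechanism behind alignDRAW, conditioning each step of a recurrent variational autoencoder on an encoding of the caption, can be sketched in a few lines. The following is a hypothetical, heavily simplified PyTorch illustration: the layer sizes, the GRU-based caption encoder, and the omission of DRAW's attention-based read and write operations are all assumptions made for brevity, not details of the original alignDRAW implementation.

```python
import torch
import torch.nn as nn

class TextConditionedDRAW(nn.Module):
    """Simplified DRAW-style recurrent VAE conditioned on a caption encoding.

    Hypothetical sketch: the real alignDRAW also uses attention-based
    read/write windows and an inference RNN over the image, omitted here.
    """

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256,
                 latent_dim=32, image_dim=28 * 28, steps=10):
        super().__init__()
        self.steps = steps
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.GRUCell(latent_dim + hidden_dim, hidden_dim)
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)  # step-wise prior over z_t
        self.write = nn.Linear(hidden_dim, image_dim)        # additive canvas update

    def forward(self, captions):
        batch = captions.size(0)
        # Encode the caption; the final GRU state serves as the conditioning vector.
        _, text_code = self.text_rnn(self.embed(captions))
        text_code = text_code.squeeze(0)

        h = torch.zeros(batch, self.decoder_cell.hidden_size)
        canvas = torch.zeros(batch, self.write.out_features)
        for _ in range(self.steps):
            mean, logvar = self.prior(h).chunk(2, dim=-1)
            z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
            # Each generation step sees both the latent sample and the caption code.
            h = self.decoder_cell(torch.cat([z, text_code], dim=-1), h)
            canvas = canvas + self.write(h)
        return torch.sigmoid(canvas)  # final image estimate

# Usage: generate 2 images from dummy token-id captions of length 5.
model = TextConditionedDRAW(vocab_size=100)
images = model(torch.randint(0, 100, (2, 5)))
print(images.shape)  # torch.Size([2, 784])
```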

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10]

A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion, which was publicly released in August 2022.

Stable Diffusion's training dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs.
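As a rough illustration of such similarity-based filtering, the sketch below keeps only scraped images whose embedding is sufficiently close to a small reference set of high-quality images. The stand-in encoder, the threshold, and the reference set are placeholders; production pipelines typically score images with a learned aesthetic predictor over CLIP-style embeddings rather than raw pixels.

```python
import numpy as np

def embed_image(image: np.ndarray) -> np.ndarray:
    """Stand-in image encoder: flattens and L2-normalises pixel values.
    A real filtering pipeline would use a learned encoder such as CLIP."""
    v = image.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def filter_by_similarity(candidates, reference_images, threshold=0.9):
    """Keep candidates whose best cosine similarity to any reference
    (high-quality) image meets the threshold."""
    reference = np.stack([embed_image(img) for img in reference_images])
    kept = []
    for image in candidates:
        emb = embed_image(image)
        # Embeddings are unit-norm, so dot products are cosine similarities.
        if float(np.max(reference @ emb)) >= threshold:
            kept.append(image)
    return kept

# Usage with random stand-in "images" of shape (8, 8, 3).
rng = np.random.default_rng(0)
refs = [rng.random((8, 8, 3)) for _ in range(4)]
scraped = [rng.random((8, 8, 3)) for _ in range(100)]
print(len(filter_by_similarity(scraped, refs)))
```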

Some modern AI platforms not only generate images from text but also create synthetic datasets to improve model training and fine-tuning.
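One common way to do this is to run a text-to-image pipeline over a list of prompts and store the resulting prompt/image pairs for later training or fine-tuning. The sketch below uses the Hugging Face diffusers library as one plausible implementation; the model identifier, prompts, and output layout are illustrative assumptions rather than any specific platform's workflow.

```python
import json
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

# Illustrative prompts; a real synthetic-data run would use many more,
# often generated programmatically or by a language model.
PROMPTS = [
    "a photo of a red bicycle leaning against a brick wall",
    "a watercolor painting of a lighthouse at dusk",
    "a close-up photo of a ripe strawberry on a white plate",
]

def build_synthetic_dataset(out_dir="synthetic_data",
                            model_id="runwayml/stable-diffusion-v1-5"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

    records = []
    for i, prompt in enumerate(PROMPTS):
        image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
        filename = f"{i:05d}.png"
        image.save(out / filename)
        records.append({"file_name": filename, "text": prompt})

    # Caption metadata in a simple JSON-lines layout for later fine-tuning.
    with open(out / "metadata.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

if __name__ == "__main__":
    build_synthetic_dataset()
```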

An image generated by Stable Diffusion 3.5 from the prompt "an astronaut riding a horse, by Hiroshige"; the Stable Diffusion family of large-scale text-to-image models was first released in 2022
High-level architecture showing the state of AI art machine learning models, and notable models and applications
Examples of images and captions from three public datasets commonly used to train text-to-image models