Text-to-image models began to be developed in the mid-2010s, during the early years of the AI boom, as a result of advances in deep neural networks.
In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be regarded as approaching the quality of real photographs and human-drawn art.
Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as those from a database of clip art.
One of the earliest deep-learning text-to-image models, alignDRAW, was introduced in 2015 by researchers at the University of Toronto; it extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.
A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[9]
One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10]
A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion, which was publicly released in August 2022.
Stable Diffusion's training dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs.
Some modern AI platforms not only generate images from text but also create synthetic datasets to improve model training and fine-tuning.
Image: "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion 3.5; Stable Diffusion is a large-scale text-to-image model first released in 2022.