Stable Diffusion

Its development involved researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway, with a computational donation from Stability AI and training data from non-profit organizations.

Its code and model weights have been released publicly,[8] and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB of VRAM.

Four of the original five authors (Robin Rombach, Andreas Blattmann, Patrick Esser and Dominik Lorenz) later joined Stability AI and released subsequent versions of Stable Diffusion.

Stability AI also credited EleutherAI and LAION (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project.

The U-Net block, built around a ResNet backbone, iteratively denoises the output of the forward diffusion process, working backwards to recover a latent representation.
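As a minimal sketch of this reverse process, the loop below uses Hugging Face's diffusers library; the model ID, step count, and placeholder text embeddings are illustrative assumptions rather than details from the original release:

```python
# Sketch of the reverse (denoising) diffusion loop in latent space.
# Assumes the Hugging Face diffusers API; model ID and hyperparameters
# are illustrative, and the text embeddings are a zero-tensor stand-in.
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler, AutoencoderKL

repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")

scheduler.set_timesteps(50)                # number of denoising steps
latents = torch.randn(1, 4, 64, 64)        # pure Gaussian noise in latent space
text_emb = torch.zeros(1, 77, 768)         # stand-in for CLIP text embeddings

with torch.no_grad():
    for t in scheduler.timesteps:
        # The U-Net predicts the noise present in the current latents...
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        # ...and the scheduler removes a portion of it, stepping backwards.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # Decode the denoised latents to pixel space (0.18215 is the latent
    # scaling factor used by Stable Diffusion v1 checkpoints).
    image = vae.decode(latents / 0.18215).sample
```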

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web. Its 5 billion image-text pairs were classified by language and filtered into separate datasets by resolution, by the predicted likelihood of containing a watermark, and by a predicted "aesthetic" score (e.g. subjective visual quality).
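A toy sketch of the kind of metadata filtering described above; the field names mirror LAION's published metadata (a predicted watermark probability and an aesthetic score), but the thresholds and records are invented for illustration:

```python
# Hypothetical filter over image-text pair metadata, in the spirit of the
# resolution / watermark / aesthetic filtering described above.
def keep(sample: dict) -> bool:
    return (
        sample["width"] >= 512 and sample["height"] >= 512  # resolution filter
        and sample["pwatermark"] < 0.5    # predicted watermark probability
        and sample["aesthetic"] >= 5.0    # predicted aesthetic score
    )

records = [
    {"width": 640, "height": 640, "pwatermark": 0.1, "aesthetic": 6.2},
    {"width": 256, "height": 256, "pwatermark": 0.8, "aesthetic": 4.0},
]
filtered = [r for r in records if keep(r)]  # keeps only the first record
```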

An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data.

Stable Diffusion XL (SDXL) version 1.0, released in July 2023, introduced native 1024×1024 resolution and improved generation for limbs and text.

To customize the model for new use cases that are not covered by the original dataset, such as generating anime characters ("waifu diffusion"),[40] new data and further training are required.

Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of use cases, from medical imaging[41] to algorithmically generated music.
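For illustration, such a fine-tuned checkpoint can be loaded through the same pipeline interface as the base model; this is a sketch assuming the diffusers library, and the repository ID below is a hypothetical placeholder:

```python
# Loading a (hypothetical) fine-tuned Stable Diffusion checkpoint with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "some-user/anime-finetune",   # placeholder for any community fine-tune
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("portrait of a character, anime style").images[0]
image.save("out.png")
```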

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions.

As a result, generated images reinforce social biases and reflect a Western perspective; the creators note that the model lacks data from other communities and cultures.

Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt.
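Under the hood, classifier-free guidance combines an unconditional and a text-conditioned noise prediction; the sketch below shows that combination (variable names are illustrative, not taken from the Stable Diffusion codebase):

```python
import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_text: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    # Push the prediction away from the unconditional estimate and toward
    # the text-conditioned one; larger scales follow the prompt more closely.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Toy call with random "predictions" and a commonly used scale of 7.5.
combined = classifier_free_guidance(torch.randn(1, 4, 64, 64),
                                    torch.randn(1, 4, 64, 64),
                                    guidance_scale=7.5)
```

A scale of 1.0 reduces the formula to the plain text-conditioned prediction, while very large values trade output diversity for prompt adherence.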

Additional text2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt.

Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0.
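A minimal img2img sketch using the diffusers library, where the strength parameter plays the role described above; the model ID and file names are assumptions for illustration:

```python
# img2img: generate a new image guided by both a prompt and an existing image.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a watercolor painting of a harbor",
    image=init_image,
    strength=0.6,   # 0.0 returns the input nearly unchanged; 1.0 mostly ignores it
).images[0]
result.save("output.png")
```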

A dedicated model specifically fine-tuned for inpainting use cases was created by Stability AI alongside the release of Stable Diffusion 2.0.

Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.
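As an illustration of the inpainting workflow, the sketch below assumes the diffusers library and Stability AI's Stable Diffusion 2.0 inpainting checkpoint; the file names are placeholders, and white pixels in the mask mark the region to repaint:

```python
# Inpainting: regenerate only the masked region of an existing image.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(prompt="a stone bridge over a river",
              image=image, mask_image=mask).images[0]
result.save("inpainted.png")
```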

This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models.

Addressing concerns that the model may be used for abusive purposes, Stability AI CEO Emad Mostaque argues that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology",[10] and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences.

In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end the control and dominance over such technologies by corporations that have previously developed only closed AI systems for image synthesis.

This is reflected in the fact that any restrictions Stability AI places on the content users may generate can easily be bypassed due to the availability of the source code.

Controversy around photorealistic sexualized depictions of underage characters has arisen, as such images generated by Stable Diffusion have been shared on websites such as Pixiv.

In January 2023, Getty Images initiated legal proceedings against Stability AI in the English High Court, alleging significant infringement of its intellectual property rights.

The trial is expected to take place in summer 2025 and has significant implications for UK copyright law and the licensing of AI-generated content.

Diagram of the latent diffusion architecture used by Stable Diffusion
The denoising process used by Stable Diffusion. The model generates images by iteratively denoising random noise until a configured number of steps has been reached, guided by the pretrained CLIP text encoder and the attention mechanism, resulting in an image depicting a representation of the requested concept.