U-Net

[6] This technology underlies many modern image generation models, such as DALL-E, Midjourney, and Stable Diffusion.

[1] One important modification in U-Net is that the upsampling part has a large number of feature channels, which allow the network to propagate context information to higher-resolution layers.

As a consequence, the expansive path is more or less symmetric to the contracting path, which yields a U-shaped architecture.

This tiling strategy is important for applying the network to large images, since otherwise the resolution would be limited by the GPU memory.
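
As a rough illustration of how such tiling can be planned, the sketch below splits an image into output tiles and grows each one by a context margin to get the input window the network would see. The names `Tile` and `plan_tiles`, and the values of `out_size` and `margin`, are hypothetical and chosen only for this example. The sketch also assumes that the caller fills in any part of an input window that falls outside the image (for example by mirroring), which is not shown here.

```python
from typing import Iterator, NamedTuple


class Tile(NamedTuple):
    # Output region in image coordinates (half-open ranges).
    out_y0: int
    out_y1: int
    out_x0: int
    out_x1: int
    # Input region: the output region grown by `margin` on every side.
    # It may extend past the image border; the caller has to fill
    # those pixels (for example by mirroring the image).
    in_y0: int
    in_y1: int
    in_x0: int
    in_x1: int


def plan_tiles(height: int, width: int, out_size: int, margin: int) -> Iterator[Tile]:
    """Yield tiles whose output regions cover the image exactly once."""
    if out_size <= 0 or margin < 0:
        raise ValueError("out_size must be positive and margin non-negative")
    for y0 in range(0, height, out_size):
        y1 = min(y0 + out_size, height)  # last row of tiles may be shorter
        for x0 in range(0, width, out_size):
            x1 = min(x0 + out_size, width)  # last column may be narrower
            yield Tile(y0, y1, x0, x1,
                       y0 - margin, y1 + margin,
                       x0 - margin, x1 + margin)


# Hypothetical image size, tile size, and margin, for illustration only.
tiles = list(plan_tiles(height=1000, width=1500, out_size=388, margin=92))
print(len(tiles))  # number of forward passes needed for this image
```

Because each forward pass only sees one input window of at most `out_size + 2 * margin` pixels per side, peak memory depends on the tile size rather than on the size of the whole image.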

[1] It improves on and extends the fully convolutional network (FCN) of Evan Shelhamer, Jonathan Long, and Trevor Darrell (2014).

This is an example architecture of U-Net for producing k 256-by-256 image masks for a 256-by-256 RGB image.
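
To make the shapes in such an architecture concrete, the following sketch traces (channels, height, width) through a U-Net for a 256-by-256 RGB input and k output masks. It is a bookkeeping exercise under stated assumptions, not the architecture from the figure: it assumes padded convolutions (so that only pooling and upsampling change the spatial size), 64 channels at the first level, and four downsampling steps. The function name `unet_shapes` and its parameters are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def unet_shapes(k: int, size: int = 256, in_channels: int = 3,
                base: int = 64, depth: int = 4) -> List[Tuple[str, Shape]]:
    """Trace tensor shapes through a U-Net with padded convolutions (assumed)."""
    if size % (2 ** depth) != 0:
        raise ValueError("size must be divisible by 2**depth")
    trace: List[Tuple[str, Shape]] = [("input", (in_channels, size, size))]
    skips: List[int] = []
    ch, s = base, size
    # Contracting path: conv block, remember its channels for the skip
    # connection, then halve the resolution with 2x2 pooling.
    for level in range(depth):
        trace.append((f"down{level} conv", (ch, s, s)))
        skips.append(ch)
        s //= 2
        trace.append((f"down{level} pool", (ch, s, s)))
        ch *= 2
    trace.append(("bottleneck conv", (ch, s, s)))
    # Expansive path: upsample and halve the channels, concatenate the
    # matching skip features, then apply a conv block.
    for level in reversed(range(depth)):
        s *= 2
        ch //= 2
        skip_ch = skips.pop()
        trace.append((f"up{level} upsample+concat", (ch + skip_ch, s, s)))
        trace.append((f"up{level} conv", (ch, s, s)))
    # A final 1x1 convolution maps the features to k mask channels.
    trace.append(("output 1x1 conv", (k, s, s)))
    return trace


for name, shape in unet_shapes(k=2):
    print(f"{name:24s} {shape}")
```

Under these assumptions the channel counts of the expansive path mirror those of the contracting path, and the output has k channels at the same 256-by-256 resolution as the input.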