Diffusion Model
A diffusion model is a type of generative AI that creates images by gradually removing noise from a random pattern, guided by text descriptions or other conditioning inputs.
Diffusion models are a class of generative models that produce high-quality images, videos, and audio through a process inspired by non-equilibrium thermodynamics. During training, the model learns to reverse a gradual noising process: clean data is corrupted with a little more noise at each step until the original content is effectively destroyed, and a neural network is trained to undo each of those steps. At generation time, the model starts from pure random noise and iteratively denoises it into a coherent output.
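The noising and denoising loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the linear noise schedule, the step count, and the names `make_schedule`, `add_noise`, `noise_prediction_loss`, `sample`, and `oracle` are all assumptions chosen for the example, and the "network" is replaced by a hand-written noise predictor for a toy dataset that contains a single point.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # share of the original signal left at each step
    return betas, alphas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump directly to noising step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(predict_noise, x0, t, alpha_bars, rng):
    """Training objective sketch: how well is the added noise predicted?"""
    x_t, eps = add_noise(x0, t, alpha_bars, rng)
    return np.mean((predict_noise(x_t, t) - eps) ** 2)

def sample(predict_noise, shape, betas, alphas, alpha_bars, rng):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

betas, alphas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)

# Toy stand-in for a trained network: if the whole dataset is one point,
# the ideal noise prediction can be written in closed form.
target = np.full((2, 2), 0.5)

def oracle(x_t, t):
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

result = sample(oracle, target.shape, betas, alphas, alpha_bars, rng)
```

In a real system, `oracle` would be a trained neural network and `target` would be unknown; the sampling loop itself keeps the same shape.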
A key step toward practical text-to-image generation was conditioning the denoising process on text embeddings, which lets users guide image generation with natural language descriptions. Systems such as Stable Diffusion, DALL-E, and Midjourney build on this idea; in openly documented models such as Stable Diffusion, a text encoder (often CLIP) converts the prompt into embeddings that steer the denoising process toward images matching the description. Latent diffusion models run this process in a compressed latent space rather than in pixel space, which dramatically reduces computational requirements.
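A short calculation gives a sense of how much a latent space shrinks the problem. The numbers below are assumptions modeled on Stable Diffusion v1-style settings (an autoencoder that downsamples each spatial side by 8 and produces 4 latent channels); other models use different factors, and the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an image of the given size (assumed settings)."""
    return (latent_channels, height // downsample, width // downsample)

def num_values(shape):
    """Total number of scalar values in a tensor of this shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_values = num_values((3, 512, 512))              # RGB image in pixel space
latent_values = num_values(latent_shape(512, 512))    # same image in latent space
reduction = pixel_values / latent_values
print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

Under these assumptions, each denoising step operates on 48 times fewer values than it would in pixel space, which is where much of the compute saving comes from.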
Diffusion models have become the dominant approach for image generation because they produce high-quality, diverse outputs and allow fine-grained control. Advanced techniques include ControlNet for precise composition control, inpainting for selective editing, img2img for transforming existing images, and LoRA for lightweight fine-tuning, such as style customization. Video diffusion models extend these concepts to generate short video clips, and audio diffusion models generate music and sound effects.
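Of the techniques listed, LoRA is the easiest to show compactly: instead of updating a full weight matrix, it trains a small low-rank correction on top of frozen weights. The sketch below is a generic illustration, not the code of any specific library; the layer size, the rank, and the name `lora_forward` are assumptions made for the example.

```python
import numpy as np

def lora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """Linear layer with a low-rank update: W x + (alpha / r) * B A x."""
    rank = lora_a.shape[0]
    base = x @ w_frozen.T                  # frozen pretrained weights
    update = (x @ lora_a.T) @ lora_b.T     # trainable low-rank path
    return base + (alpha / rank) * update

rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 4            # assumed sizes for illustration

w_frozen = rng.standard_normal((d_out, d_in))
lora_a = rng.standard_normal((rank, d_in)) * 0.01
lora_b = np.zeros((d_out, rank))           # zero init: the layer starts unchanged

x = rng.standard_normal((2, d_in))
y = lora_forward(x, w_frozen, lora_a, lora_b)

full_params = w_frozen.size                # 589824 values in the full matrix
lora_params = lora_a.size + lora_b.size    # 6144 values in the adapter
```

With these assumed sizes, the adapter holds roughly 1% of the parameters of the full matrix, which is why a style customization can be shared as a small file separate from the base model.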
Real-World Examples
- Stable Diffusion generating images from text prompts in the open-source community
- DALL-E 3 creating detailed illustrations integrated into ChatGPT
- Midjourney producing artistic and photorealistic imagery from natural language descriptions
- Runway Gen-3 using video diffusion to generate short video clips from text