Diffusion model
A diffusion model is a generative neural network that creates images, video, or audio by iteratively denoising random noise toward a learned target distribution.
Diffusion models, including Stable Diffusion, Flux, Midjourney, DALL·E 3, Imagen, Sora, Kling, Veo, and Stable Audio, start from random Gaussian noise and run a denoising network repeatedly, typically for 20–50 steps, to produce a coherent sample. Latent diffusion variants (SD, Flux) denoise in a compressed latent space for speed and decode the result to pixels at the end. Flow matching (Flux, SD3) replaces the stochastic diffusion formulation with a learned velocity field that is integrated as an ODE along a near-straight path from noise to data. Generation is steered by text encoders (CLIP, T5), whose embeddings condition the network at every denoising step. In 2026 diffusion still dominates image and video generation; autoregressive image models (e.g. nano-banana) are catching up on instruction-following but remain a minority.
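The sampling loop described above can be sketched in a few lines. This is a minimal toy, not the implementation of any model listed here: it works on a single number rather than an image, uses the flow-matching (straight-path ODE) form, and replaces the trained network with a closed-form velocity that is exact only because the assumed "data" distribution is a 1-D Gaussian. The names `MU`, `SIGMA`, `velocity`, and `sample` are illustrative assumptions.

```python
import random

# Assumed toy "data" distribution N(MU, SIGMA^2); the values are illustrative.
MU, SIGMA = 3.0, 0.5


def velocity(x: float, t: float) -> float:
    """Closed-form velocity for the path x_t = (1 - t) * noise + t * data.

    Stands in for the trained network; a real model learns this from data.
    """
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2      # variance of x_t
    slope = (t * SIGMA ** 2 - (1.0 - t)) / var_t   # Cov(data - noise, x_t) / Var(x_t)
    return MU + slope * (x - t * MU)


def sample(seed: int, steps: int = 50) -> float:
    """Start from Gaussian noise and take `steps` Euler steps toward the data."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)          # pure noise at t = 0
    dt = 1.0 / steps
    for k in range(steps):           # integrate the ODE from t = 0 to t = 1
        x += dt * velocity(x, k * dt)
    return x
```

In this toy the exact ODE solution is `MU + SIGMA * noise`, so the same seed always maps to the same sample, and raising `steps` only reduces the Euler integration error.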
Common mistakes
- Comparing diffusion samplers without fixing the seed.
- Treating diffusion outputs as deterministic: the same prompt with a different seed produces a different image.
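Both mistakes come down to controlling the starting noise. The sketch below is a hedged illustration with hypothetical names (`drift`, `euler`, `heun`, `start_noise`, `TARGET`): two toy samplers integrate the same stand-in "model", once from identical noise and once from noise drawn with different seeds. The seeds 42 and 43 are arbitrary.

```python
import random

TARGET = 3.0  # hypothetical value the toy "model" pulls samples toward


def drift(x: float) -> float:
    # Toy stand-in for a denoising network: pull x toward TARGET.
    return TARGET - x


def euler(x: float, steps: int) -> float:
    dt = 1.0 / steps
    for _ in range(steps):
        x += dt * drift(x)
    return x


def heun(x: float, steps: int) -> float:
    dt = 1.0 / steps
    for _ in range(steps):
        pred = x + dt * drift(x)                    # Euler predictor
        x += 0.5 * dt * (drift(x) + drift(pred))    # trapezoidal corrector
    return x


def start_noise(seed: int) -> float:
    # The seed fully determines the initial noise.
    return random.Random(seed).gauss(0.0, 1.0)


# Fair comparison: both samplers start from the same noise.
fair_gap = abs(euler(start_noise(42), 10) - heun(start_noise(42), 10))
# Confounded comparison: the seed changes along with the sampler.
confounded_gap = abs(euler(start_noise(42), 10) - heun(start_noise(43), 10))
```

Only `fair_gap` isolates the effect of the sampler; `confounded_gap` mixes the sampler difference with the difference between the two noise draws.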
FAQ
What is a diffusion model?
A diffusion model is a generative neural network that creates images, video, or audio by iteratively denoising random noise toward a learned target distribution.
What are the most common mistakes with diffusion models?
Comparing diffusion samplers without fixing the seed, and treating diffusion outputs as deterministic: the same prompt with a different seed produces a different image.
Related terms
- Negative prompt — A negative prompt is text that tells an image, video, or audio generator what to avoid producing — the opposite of the main prompt.
- Seed — A seed is an integer that initializes the random number generator inside an image, video, or audio model, making generation reproducible.
- CFG scale (classifier-free guidance) — CFG scale controls how strongly a diffusion image model follows its text prompt — higher values stick closer to the prompt, lower values explore more.
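The CFG scale entry above corresponds to a simple arithmetic rule applied at each denoising step. The sketch below shows the standard classifier-free guidance combination on plain Python lists; the function name `apply_cfg` and the example numbers are assumptions for illustration, and real pipelines apply the same rule to tensors. In many implementations the negative prompt is encoded in place of the empty prompt for the unconditional branch.

```python
from typing import List


def apply_cfg(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Combine unconditional and prompt-conditioned noise predictions.

    scale = 0 ignores the prompt, scale = 1 uses the conditioned prediction
    as is, and larger values push further in the prompt's direction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up predictions for a two-element "image".
guided = apply_cfg(uncond=[0.0, 1.0], cond=[1.0, 3.0], scale=7.5)
```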
Last updated: 2026-06-01.