technique

Distillation

Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.

Distillation produces a small, fast model that behaves like a big, slow one. The student trains on the teacher's outputs (logits or generated text) rather than on human-labelled data, which lets it pick up nuanced behaviour cheaply. As of 2026, distillation underpins much of the low-cost tier of consumer-facing inference. Gemini Flash and the Llama 3.2 1B and 3B models are publicly described as distilled from larger siblings; GPT-4o mini and Claude Haiku are widely assumed to be built in a similar way, though their vendors have not published the details. The current frontier is reasoning distillation: teaching small models to produce chain-of-thought reasoning by training on traces from larger reasoning models such as the o-series or Claude with extended thinking.
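When the teacher's logits are available, the usual training signal is a divergence between the teacher's and the student's temperature-softened output distributions. The sketch below is a minimal, dependency-free illustration of that idea, not any vendor's actual recipe. The function names, the default temperature of 2.0, and the toy logit values are assumptions made for illustration. The T² scaling follows the classic knowledge-distillation convention of keeping gradient magnitudes comparable across temperatures.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits over three classes (made-up values, for illustration only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

A student whose distribution matches the teacher's gets a loss of zero, and the loss grows as the two distributions diverge. When only generated text is available (no logits), the student is instead fine-tuned on the teacher's outputs with an ordinary next-token cross-entropy loss.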

When to use distillation

Cost-sensitive inference at high volume, and edge or on-device deployment.

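Common mistakes

Distilling from a teacher that is itself wrong, so the student inherits its errors. Skipping evals, since distilled models can drift on long-tail tasks.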

FAQ

What is distillation?

Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.

When should I use distillation?

Use it for cost-sensitive inference at high volume, and for edge or on-device deployment where the teacher is too large or too slow to run.

What are the most common mistakes with distillation?

The two most common are distilling from a teacher that is itself wrong, so the student inherits its errors, and skipping evals, since distilled models can drift on long-tail tasks.
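One way to catch long-tail drift is to break the eval set into slices and measure how often the student matches the teacher on each one. The sketch below assumes a hypothetical record format (dicts with `slice`, `teacher`, and `student` keys) and an arbitrary 0.9 agreement threshold. Both are illustrative choices, not a standard.

```python
from collections import defaultdict

def agreement_by_slice(examples):
    """Return {slice_name: fraction of examples where the student matches the teacher}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for ex in examples:
        totals[ex["slice"]] += 1
        if ex["student"] == ex["teacher"]:
            matches[ex["slice"]] += 1
    return {name: matches[name] / totals[name] for name in totals}

def drifting_slices(examples, threshold=0.9):
    """List the slices whose agreement rate falls below the threshold."""
    rates = agreement_by_slice(examples)
    return sorted(name for name, rate in rates.items() if rate < threshold)

# Made-up toy records; a real eval set would hold many examples per slice.
examples = [
    {"slice": "common", "teacher": "A", "student": "A"},
    {"slice": "common", "teacher": "B", "student": "B"},
    {"slice": "long_tail", "teacher": "C", "student": "A"},
    {"slice": "long_tail", "teacher": "D", "student": "D"},
]

print(agreement_by_slice(examples))
print(drifting_slices(examples))
```

Agreement with the teacher only measures fidelity. It cannot reveal errors the student inherited from the teacher, so a held-out set with trusted labels is still needed to score both models against ground truth.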

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/distillation.md.