Knowledge Distillation

Knowledge distillation is a model compression technique where a compact 'student' model is trained to replicate the behavior of a larger, more capable 'teacher' model — achieving near-teacher-level accuracy at a fraction of the size, latency, and cost.

Training a state-of-the-art neural network requires enormous compute, memory, and time. Deploying that trained model requires comparably heavy infrastructure: at 16-bit precision, a 70B-parameter LLM needs roughly 140 GB for its weights alone, which is more than a single 80 GB GPU can hold. For most real-world applications (a mobile app, an embedded device, a latency-sensitive API) this is impractical. Knowledge distillation bridges the gap between the accuracy of massive models and the constraints of practical deployment.

The technique was popularised by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their landmark 2015 paper “Distilling the Knowledge in a Neural Network.” The core insight is deceptively simple: rather than training a small model directly on raw ground-truth labels, train it to mimic the output probability distributions of a larger, pre-trained model. These probability distributions contain far more information than one-hot class labels: they encode how similar the model believes each class is to every other class, a form of structured knowledge the smaller model would otherwise never see.

Hinton called this information dark knowledge: the implicit structure buried in the teacher’s soft predictions that is invisible in the original training labels but is nevertheless learnable by the student.

The Core Concept: Soft Labels vs. Hard Labels

To understand why distillation works, consider image classification. For an image of a cat:

  • Hard label (ground truth): [cat: 1.0, dog: 0.0, automobile: 0.0, ...] — the model is told only that this is a cat.
  • Soft label (teacher output): [cat: 0.82, dog: 0.11, tiger: 0.04, ...] — the teacher’s probability distribution reveals that it considers cats more similar to dogs than to automobiles.

The student trained on soft labels learns these inter-class similarities implicitly. The relative probabilities assigned to non-winning classes encode generalisation structure that hard labels throw away. A student trained with soft labels from a good teacher typically outperforms one trained from scratch on the same dataset with hard labels, even when the student’s architecture is identical — because the soft labels carry richer supervisory signal.

Temperature Scaling

A key mechanism for extracting dark knowledge is temperature scaling. In a standard classifier, the softmax function converts logits to probabilities:

p(class_i) = exp(z_i) / Σ exp(z_j)

At the standard temperature T=1, a confident teacher model produces very peaky distributions — e.g., [cat: 0.999, dog: 0.001, ...]. The non-winning class probabilities are so small that they carry negligible information.

Dividing logits by a temperature T > 1 before the softmax softens the distribution:

p_T(class_i) = exp(z_i / T) / Σ exp(z_j / T)

At T=4, the same distribution might become [cat: 0.61, dog: 0.19, tiger: 0.09, ...] — far more informative about which classes the teacher considers similar. Both teacher and student use the same temperature during distillation training. At inference time, the student operates with T=1.
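The softening effect can be checked numerically. The following is a minimal sketch in plain Python (no dependencies); the logit values and class names are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes below (illustrative values only).
class_names = ["cat", "dog", "tiger", "automobile"]
teacher_logits = [9.0, 4.0, 3.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

for name, p1, p4 in zip(class_names, sharp, soft):
    print(f"{name:<11} T=1: {p1:.4f}   T=4: {p4:.4f}")
```

The ranking of the classes is the same at both temperatures. Only the spread changes, which is what lets the non-winning classes contribute a usable training signal.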

The Distillation Loss Function

The training objective for response-based distillation combines two terms:

L_distill = α · T² · KL(soft_teacher || soft_student) + (1 - α) · CE(student_logits, hard_labels)

  • KL divergence term — the student’s softened distribution is pushed toward the teacher’s softened distribution. The T² factor compensates for the reduced magnitude of gradients at higher temperatures (ensuring gradient scale is consistent regardless of T).
  • Cross-entropy term — the student is also trained on ground-truth labels, preventing it from drifting too far toward teacher idiosyncrasies.
  • α — a hyperparameter (typically 0.5–0.9) controlling the relative weight of soft-label versus hard-label supervision.
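The role of the T² factor can be illustrated numerically. For softened teacher probabilities p and student probabilities q, the gradient of the KL term with respect to student logit z_i is (q_i - p_i) / T, which shrinks roughly as 1/T² when T is large. The sketch below uses hypothetical logits to compare the raw gradient with the T²-scaled gradient at two temperatures.

```python
import math

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss_gradient(student_logits, teacher_logits, temperature):
    """Gradient of KL(teacher || student) with respect to the student logits.

    With softened distributions p (teacher) and q (student), the gradient
    for logit i is (q_i - p_i) / T.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return [(qi - pi) / temperature for qi, pi in zip(q, p)]

# Hypothetical logits for a three-class problem.
student = [1.0, 0.0, -1.0]
teacher = [2.0, -0.5, -1.5]

# Gradient for the first logit at two temperatures, before and after T² scaling.
raw = {T: soft_loss_gradient(student, teacher, T)[0] for T in (10.0, 20.0)}
scaled = {T: g * T ** 2 for T, g in raw.items()}

for T in raw:
    print(f"T={T:>4}: raw gradient {raw[T]:+.6f}   T²-scaled {scaled[T]:+.4f}")
```

Doubling the temperature cuts the raw gradient to about a quarter, while the T²-scaled value stays nearly constant. This is why the soft-loss weight does not need retuning every time T changes.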

Four Categories of Knowledge Distillation

Research since 2015 has extended distillation well beyond response-based output matching. A foundational 2020 survey by Gou et al. established three canonical categories; the emergence of frontier LLMs as teachers has added a fourth — reasoning and process distillation — that has become the dominant paradigm in 2024–2025:

graph TD
    classDef default fill:#ffffff,stroke:#4338CA,stroke-width:2px,color:#0F172A
    classDef system fill:#4338CA,stroke:#4338CA,stroke-width:2px,color:#ffffff
    classDef mid fill:#EEF0F7,stroke:#6366F1,stroke-width:2px,color:#0F172A
    classDef leaf fill:#F7F8FC,stroke:#0D9488,stroke-width:2px,color:#0F172A
    classDef new fill:#EEF0F7,stroke:#4338CA,stroke-width:2px,color:#0F172A

    A[Knowledge Distillation]:::system
    B[Response-Based]:::mid
    C[Feature-Based]:::mid
    D[Relation-Based]:::mid
    E[Reasoning & Process\n2023–2025]:::new

    A --> B
    A --> C
    A --> D
    A --> E

    B --> B1[Soft output labels\ntemperature scaling]:::leaf
    B --> B2[Final logit\nmatching]:::leaf

    C --> C1[Intermediate layer\nactivation matching]:::leaf
    C --> C2[Attention map\ntransfer — TinyBERT]:::leaf

    D --> D1[Inter-sample\nrelationship graphs]:::leaf
    D --> D2[Flow of solution\nprocedure — FSP]:::leaf

    E --> E1[Chain-of-thought\ntrace distillation]:::leaf
    E --> E2[Synthetic data\ngeneration — Phi-4]:::leaf
    E --> E3[Reasoning strategy\ntransfer — Orca 2]:::leaf

Response-Based Distillation

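The student mimics the teacher’s final output: output logits for classification, or next-token probabilities for language models. This is the original formulation from Hinton et al. and the most widely used. It requires no access to the teacher’s internal architecture, so it remains applicable when the teacher is a black-box API, provided the API exposes output probabilities (or at least top-k log-probabilities). If only generated text is available, the student can still be fine-tuned on the teacher’s outputs as hard targets.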

Feature-Based Distillation

The student is trained to match the teacher’s intermediate layer representations (feature maps, hidden states, attention patterns). FitNets (Romero et al., 2015) was the first systematic feature distillation approach — a thin but deep student was guided to mimic the teacher’s intermediate feature maps via a regressor network. TinyBERT (Jiao et al., 2020) extends this to Transformer architectures by distilling both attention matrices and hidden states from every layer, achieving far higher compression ratios than response-only distillation for language models.

Feature-based distillation is more powerful but more architecturally constrained — the student must have intermediate layers compatible (in shape or mappable) with the teacher’s, or adapter layers must be introduced.
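As a rough illustration of the adapter idea, the sketch below projects a narrow student hidden vector into the teacher’s width and measures the mean squared error against the teacher’s hidden vector. Plain Python is used so the snippet runs without dependencies; the dimensions, learning rate, and random vectors are hypothetical stand-ins for real layer activations.

```python
import random

def project(hidden, weight):
    """Map a student hidden vector into the teacher's width (weight @ hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

random.seed(0)
STUDENT_DIM, TEACHER_DIM = 4, 8      # hypothetical layer widths
LEARNING_RATE = 0.05                 # hypothetical step size

student_hidden = [random.gauss(0, 1) for _ in range(STUDENT_DIM)]
teacher_hidden = [random.gauss(0, 1) for _ in range(TEACHER_DIM)]
weight = [[random.gauss(0, 0.1) for _ in range(STUDENT_DIM)] for _ in range(TEACHER_DIM)]

initial_loss = feature_loss(student_hidden, teacher_hidden, weight)

# A few gradient steps on the projection only, to show the loss is trainable.
# In real training the student's own weights would receive gradients as well.
for _ in range(200):
    projected = project(student_hidden, weight)
    for i in range(TEACHER_DIM):
        error = projected[i] - teacher_hidden[i]
        for j in range(STUDENT_DIM):
            # d(MSE)/d(weight[i][j]) = 2 * error * student_hidden[j] / TEACHER_DIM
            weight[i][j] -= LEARNING_RATE * 2 * error * student_hidden[j] / TEACHER_DIM

final_loss = feature_loss(student_hidden, teacher_hidden, weight)
print(f"feature loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In practice this term is added to the response-based loss, and the projection is discarded after training so it adds no inference cost.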

Relation-Based Distillation

Rather than matching individual outputs or activations, relation-based methods transfer relationships encoded by the teacher, either between data samples or between layers. These relationships serve as the supervisory signal. The Flow of Solution Procedure (FSP) matrix method captures the relationships between feature maps from different layers. More recent work uses instance relationship graphs, which encode how the teacher clusters and separates different inputs in its latent space, to guide the student toward a similarly structured representation.
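One simple way to express an inter-sample relationship is a matrix of pairwise distances within a batch. The sketch below is an illustrative assumption rather than any specific published loss: it normalises each distance matrix by its mean and penalises the difference between the student’s and the teacher’s matrices. Because only relative distances are compared, the student and teacher embeddings may have different dimensionality.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of samples in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

def normalised(distances):
    """Divide by the mean off-diagonal distance so the overall scale drops out."""
    n = len(distances)
    mean = sum(distances[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return [[d / mean for d in row] for row in distances]

def relation_loss(student_embeddings, teacher_embeddings):
    """Mean squared difference between the two normalised distance matrices."""
    s = normalised(pairwise_distances(student_embeddings))
    t = normalised(pairwise_distances(teacher_embeddings))
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Hypothetical embeddings: the teacher lives in 3-D, the students in 2-D.
teacher_embeddings = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
aligned_student = [[0.0, 0.0], [1.5, 0.0], [0.0, 2.0]]      # same geometry, half the scale
distorted_student = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]    # different geometry

print("aligned  :", relation_loss(aligned_student, teacher_embeddings))
print("distorted:", relation_loss(distorted_student, teacher_embeddings))
```

The aligned student incurs essentially no loss even though its coordinates differ from the teacher’s, because the objective constrains how samples relate to one another and not where they sit.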

Reasoning and Process Distillation (2023–2025)

This is the most significant evolution in distillation methodology since the original Hinton framework. Instead of transferring logits or activations, reasoning distillation transfers the cognitive process by which the teacher arrives at an answer: its chain-of-thought reasoning traces, multi-step problem decompositions, and strategy selection.

The process works differently from classic distillation. The teacher (typically a frontier model like GPT-4, Claude 3 Opus, or DeepSeek-R1) is prompted to generate not just answers but complete reasoning paths — including intermediate steps, self-corrections, and strategy choices. These reasoning traces are used as fine-tuning targets for the student, teaching it to produce similar deliberative reasoning before committing to an answer. This is sometimes called imitation learning from thought rather than logit matching.

Chain-of-thought (CoT) distillation is the most direct form: student models are fine-tuned on (question, <teacher reasoning trace>, answer) tuples. The student learns the format and depth of reasoning the teacher uses across different problem types.
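A minimal sketch of how such tuples might be turned into supervised fine-tuning examples follows. The `TeacherSample` class, the `<think>` tags, the “Question:” / “Answer:” layout, and the answer-matching filter are all illustrative assumptions, not a format prescribed by any particular model or paper.

```python
from dataclasses import dataclass

@dataclass
class TeacherSample:
    question: str
    reasoning: str   # reasoning trace generated by the teacher
    answer: str      # final answer generated by the teacher

def to_training_example(sample: TeacherSample) -> dict:
    """Turn one (question, reasoning trace, answer) tuple into a prompt/completion pair.

    The student is trained to emit the reasoning first and the answer last,
    so the loss covers the whole completion, not just the answer.
    """
    prompt = f"Question: {sample.question}\n"
    completion = f"<think>\n{sample.reasoning}\n</think>\nAnswer: {sample.answer}"
    return {"prompt": prompt, "completion": completion}

def keep_sample(sample: TeacherSample, reference_answer: str) -> bool:
    """Optional filter (an assumption): keep only traces that reach a trusted answer."""
    return sample.answer.strip() == reference_answer.strip()

sample = TeacherSample(
    question="What is 12 * 7?",
    reasoning="12 * 7 = 12 * 5 + 12 * 2 = 60 + 24 = 84.",
    answer="84",
)

if keep_sample(sample, reference_answer="84"):
    example = to_training_example(sample)
    print(example["prompt"] + example["completion"])
```

Filtering on answer correctness is one common way to keep flawed traces out of the student’s training set, at the cost of needing reference answers for the prompts.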

Synthetic data distillation (the Phi approach) uses the teacher as a curriculum designer: instead of sampling from web text, the teacher generates dense, pedagogically structured training data — worked examples, analogies, and explanations calibrated to the student’s learning capacity. The student is trained on teacher-curated data rather than teacher outputs directly.

Strategy distillation (the Orca 2 approach) goes further: the teacher is prompted to identify which reasoning strategy is optimal for each task category, and the student is trained to make that strategy selection itself — learning metacognitive routing in addition to reasoning execution.

The Teacher-Student Training Process

flowchart LR
    classDef default fill:#ffffff,stroke:#4338CA,stroke-width:2px,color:#0F172A
    classDef system fill:#4338CA,stroke:#4338CA,stroke-width:2px,color:#ffffff
    classDef frozen fill:#EEF0F7,stroke:#0D9488,stroke-width:2px,color:#0F172A
    classDef loss fill:#F7F8FC,stroke:#6366F1,stroke-width:2px,color:#0F172A
    classDef output fill:#0D9488,stroke:#0D9488,stroke-width:2px,color:#ffffff

    DATA([Training Data]):::frozen
    T[Teacher Model\n❄ Frozen Weights]:::frozen
    S[Student Model\n🔄 Trainable]:::system
    TL[Teacher Logits\nSoft Probabilities]:::default
    SL[Student Logits]:::default
    KL[KL Divergence\nSoft Loss]:::loss
    CE[Cross-Entropy\nHard Loss]:::loss
    COMB[Combined\nDistillation Loss]:::loss
    OPT[Optimizer\nUpdate Student]:::system
    OUT([Compressed\nStudent Model]):::output

    DATA --> T & S
    T --> TL
    S --> SL
    TL & SL --> KL
    SL --> CE
    KL & CE --> COMB
    COMB --> OPT --> S
    S -->|Training complete| OUT

The training loop is straightforward: the teacher model is loaded in eval mode with frozen weights — it generates soft targets but is never updated. The student processes the same inputs, produces its own logits, and its weights are updated via backpropagation on the distillation loss. No new data is required beyond what was used to train the teacher, though data augmentation and teacher-generated synthetic data can further improve student quality.

Landmark Distilled Models

DistilBERT (Hugging Face, 2019)

DistilBERT is the most widely deployed distilled language model. Sanh et al. trained a 6-layer, 66M-parameter student to mimic the 12-layer, 110M-parameter BERT-base. The distillation combined three signals: response distillation (soft MLM loss), intermediate layer hidden-state matching (cosine distance), and standard MLM hard-label loss.

Results reported in the DistilBERT paper:

  • 40% smaller than BERT-base
  • 60% faster at inference
  • 97% of BERT-base performance on the GLUE benchmark

DistilBERT became the go-to choice for latency-sensitive NLP deployments and remains one of the most downloaded models on Hugging Face Hub — a milestone that demonstrated distillation could reach mainstream production viability.

TinyBERT (Huawei, 2020)

TinyBERT applied feature-based distillation to every Transformer layer — attention matrices, hidden states, and embedding layers — across a two-phase process: general distillation on large corpora followed by task-specific distillation with data augmentation. A 4-layer TinyBERT achieved 96.8% of BERT-base performance on GLUE while being 7.5× smaller and 9.4× faster.

MobileBERT (Google, 2020)

MobileBERT introduced a bottleneck architecture for the student and used progressive layer-by-layer distillation from a specially designed “inverted-bottleneck” teacher (IB-BERT). The result was a model 4.3× smaller and 5.5× faster than BERT-base. The MobileBERT paper reports a GLUE score of 77.7, only 0.6 points below BERT-base, and SQuAD v1.1/v2.0 dev F1 scores that are 1.5/2.1 points above BERT-base.

DistilGPT-2 and Early Generative Distillation

For generative language models, response-based distillation uses token-level KL divergence between the teacher’s and student’s next-token probability distributions rather than class-level logits. DistilGPT-2 applies this to GPT-2, producing a model with half the Transformer layers and about two-thirds of the parameters (82M versus 124M for the smallest GPT-2) that retains coherent text generation quality for many tasks. This laid the conceptual groundwork for what has become the dominant distillation paradigm in the LLM era.
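Token-level distillation can be sketched as a KL divergence computed at every sequence position and then averaged. The example below uses plain Python and nested lists shaped [sequence_length][vocab_size]; the tiny vocabulary and logit values are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one position."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]

def token_level_kl(teacher_logits, student_logits):
    """Average KL(teacher || student) over the positions of one sequence."""
    total = 0.0
    for teacher_row, student_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(teacher_row)
        s_logp = log_softmax(student_row)
        # KL at this position: sum over the vocabulary of p_t * (log p_t - log p_s)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)

# Hypothetical next-token logits: 3 positions, vocabulary of 4 tokens.
teacher_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1], [1.0, 1.0, 1.0, 1.0]]
student_logits = [[1.5, 0.7, 0.0, -0.5], [0.2, 2.0, 0.5, 0.0], [0.9, 1.1, 1.0, 1.0]]

print(f"token-level KL: {token_level_kl(teacher_logits, student_logits):.4f}")
```

With real models the same quantity is computed on batched tensors, usually with padding positions masked out of the average.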

2023–2025: The Generative Era of Distillation

The landscape shifted fundamentally from 2023 onward. Distillation evolved from compressing existing model families into a primary strategy for training capable small models from scratch — with frontier LLMs serving as teachers and synthetic data replacing the need for human-annotated datasets.

Distil-Whisper (Hugging Face, 2023)

Distil-Whisper applied knowledge distillation to OpenAI’s Whisper large-v2 speech recognition model, producing a student that is roughly 6× faster and 49% smaller while staying within 1% word error rate (WER) of the teacher on out-of-distribution audio. The distillation used a pseudo-labelling approach: Whisper large-v2 generated transcriptions for about 22,000 hours of audio, and these transcriptions served as training targets for the student. Distil-Whisper has become a common choice for real-time speech transcription in resource-constrained deployments.

Orca and Orca 2 — Reasoning Strategy Distillation (Microsoft, 2023)

Microsoft’s Orca series marked a conceptual leap: rather than distilling output logits, Orca distilled the reasoning processes of GPT-4. The teacher was prompted to generate step-by-step explanations, chain-of-thought traces, and worked examples across 5 million instruction-following tasks. Orca 2 (2023) extended this with deliberate reasoning strategy training — teaching the 7B and 13B student models when to use different reasoning approaches (step-by-step, direct answer, recall-based) depending on the task type. Orca 2 at 13B parameters matched or exceeded GPT-3.5 on multiple reasoning benchmarks, demonstrating that reasoning style itself is distillable.

Phi Series — Synthetic Data as Distillation (Microsoft, 2023–2024)

Microsoft’s Phi family (Phi-1, Phi-1.5, Phi-2, Phi-3, Phi-4) redefined what small models could achieve by treating high-quality synthetic data generation as a form of knowledge distillation. Rather than distilling weights or logits, Phi distils the knowledge distribution of larger models by using them to generate “textbook-quality” training corpora — mathematically dense, reasoning-rich, diversity-controlled synthetic data that smaller models trained on random web text never encounter.

Phi-4 (14B parameters, December 2024) outperforms models three to four times its size on STEM reasoning benchmarks. Notably, the Phi-4 technical report shows it exceeding GPT-4o on the GPQA and MATH benchmarks despite being a fraction of the size, an outcome that would have been considered implausible before synthetic data distillation became a primary training strategy.

Gemma 2 — Knowledge Distillation During Pre-Training (Google DeepMind, 2024)

Google DeepMind’s Gemma 2 (first released in June 2024) explicitly used knowledge distillation during pre-training, a departure from the conventional post-training distillation approach. According to the Gemma 2 technical report, the 2B and 9B models were trained this way, while the 27B model was trained with standard next-token prediction. A larger teacher model provided a probability distribution for each training token, and the student was trained to match that distribution rather than only the one-hot next token. The report describes the resulting models as competitive with models two to three times their size.

DeepSeek-R1 Distillation — Reasoning at 1.5B Parameters (DeepSeek AI, January 2025)

The most consequential distillation release of 2025 came from DeepSeek AI. DeepSeek-R1 is a frontier reasoning model trained via reinforcement learning to produce extended chain-of-thought traces before answering. Rather than keeping this capability proprietary, DeepSeek released a full suite of distilled reasoning models trained by fine-tuning smaller open-weight base models on 800,000 reasoning samples generated by DeepSeek-R1:

| Distilled Model | Base | Parameters | AIME 2024 Pass@1 |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | 28.9% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | 55.5% |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 8B | 50.4% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | 69.7% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | 72.6% |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B | 70B | 70.0% |

According to the DeepSeek-R1 report, DeepSeek-R1-Distill-Qwen-7B surpassed the much larger QwQ-32B-Preview on AIME 2024, while DeepSeek-R1-Distill-Qwen-32B outperformed OpenAI’s o1-mini across several benchmarks. This demonstrated that chain-of-thought reasoning capability is transferable through distillation: a student can learn not just what to answer but how to think through a problem. It also redefined expectations for what small open-weight models could achieve on hard reasoning tasks.

Offline vs. Online Distillation

Offline distillation (the standard approach): The teacher is fully trained first. Student training is a separate, subsequent process using the frozen teacher. This allows the teacher to be a black-box — distillation via API calls is possible, though expensive.

Online distillation: Teacher and student are trained simultaneously and mutually — they update each other in a collaborative learning framework. Deep Mutual Learning (DML) uses no pre-trained teacher at all; multiple peer networks learn from each other’s predictions. This is useful when no pre-trained teacher exists or when the teacher would be prohibitively expensive to train first.

Self-distillation: The model distils knowledge to itself, typically from deeper layers to shallower ones, or from one training generation to the next. Born-Again Networks (BANs) train a sequence of identical architectures where each new network learns from the previous one. Surprisingly, later generations in a BAN sequence often outperform the original despite having an identical architecture, because the distillation training provides richer supervisory signal.

Knowledge Distillation vs. Other Compression Methods

| Technique | What it reduces | Quality impact | Requires retraining? |
| --- | --- | --- | --- |
| Knowledge Distillation | Parameters (via smaller architecture) | Minimal with good teacher | Yes (student training) |
| Pruning | Individual weights or attention heads | Moderate | Often yes (fine-tune post-prune) |
| Quantization | Numerical precision (FP32 → INT8) | Low to moderate | Sometimes (QAT) |
| Low-Rank Factorization | Parameter count via matrix approximation | Moderate | Yes |
| LoRA / PEFT | Update parameters during fine-tuning | Low | Adapter training only |

Knowledge distillation often achieves a strong accuracy-to-size ratio compared with these techniques because the student is trained from scratch with purpose-built supervisory signal rather than having an existing model surgically modified. The methods are also complementary. For maximum compression, distillation is often combined with quantization: the student is first trained via distillation, then quantized to INT8 or INT4, achieving multiplicative compression gains.
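The multiplicative effect is easy to see with back-of-the-envelope arithmetic. The sketch below counts weight memory only (activations and KV cache are ignored), and the 70B/7B sizes are hypothetical examples.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9

teacher_gb = weight_memory_gb(70e9, 16)       # hypothetical 70B teacher in 16-bit
student_gb = weight_memory_gb(7e9, 16)        # hypothetical 7B distilled student in 16-bit
student_int4_gb = weight_memory_gb(7e9, 4)    # the same student quantized to 4-bit

print(f"teacher (16-bit):        {teacher_gb:.1f} GB")
print(f"student (16-bit):        {student_gb:.1f} GB  ({teacher_gb / student_gb:.0f}x smaller)")
print(f"student (4-bit):         {student_int4_gb:.1f} GB  ({teacher_gb / student_int4_gb:.0f}x smaller)")
```

Distillation contributes a 10× reduction here and 4-bit quantization a further 4×, for 40× overall. How much accuracy survives both steps has to be measured per task.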

Data-Free Distillation

A practical barrier to distillation is access to the original training data — in many enterprise settings, the teacher model was trained on proprietary or privacy-sensitive datasets that cannot be shared with the team performing distillation. Data-free distillation methods synthesise surrogate training data using the teacher itself.

The usual recipe optimises synthetic inputs, either directly or through a generator network, until the frozen teacher assigns them high-confidence predictions. These teacher-approved synthetic samples then substitute for real training data. While quality is typically lower than with dataset-based distillation, data-free methods enable compression in privacy-constrained settings and are increasingly practical as generative models improve the fidelity of synthetic data.
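The core idea can be sketched with a toy teacher. Below, a fixed linear classifier stands in for the frozen teacher, and an input vector is optimised by gradient ascent until the teacher is confident in a chosen class. The weights, step size, and step count are hypothetical; practical methods work on deep networks and typically add regularisers or a generator to keep the synthetic inputs realistic.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(x, weights, biases):
    """Toy frozen 'teacher': a fixed linear classifier followed by softmax."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return softmax(logits)

def synthesise_input(target, weights, biases, steps=300, lr=0.2, seed=0):
    """Gradient ascent on the input so the teacher becomes confident in `target`."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 0.1) for _ in range(len(weights[0]))]
    for _ in range(steps):
        probs = teacher_probs(x, weights, biases)
        for j in range(len(x)):
            # d log p_target / d x_j = W[target][j] - sum_k p_k * W[k][j]
            expected = sum(p * row[j] for p, row in zip(probs, weights))
            x[j] += lr * (weights[target][j] - expected)
    return x

# Hypothetical teacher parameters: 3 classes, 3 input features.
WEIGHTS = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5], [-1.0, -0.5, 1.0]]
BIASES = [0.0, 0.0, 0.0]

for target in range(3):
    x = synthesise_input(target, WEIGHTS, BIASES)
    confidence = teacher_probs(x, WEIGHTS, BIASES)[target]
    print(f"class {target}: teacher confidence on synthetic input = {confidence:.3f}")
```

The synthetic inputs and the teacher’s soft predictions on them can then be fed into the ordinary distillation loss in place of real data.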

Applications

On-device and edge AI: Distilled models run inference directly on smartphones, IoT devices, and embedded systems — enabling applications like real-time translation, on-device voice assistants, and camera-based ML without cloud round-trips or connectivity dependencies. Apple’s on-device foundation models (shipping with Apple Intelligence in 2024) use distillation extensively to fit capable language models within the memory and power constraints of iPhone chips.

Latency-sensitive APIs: A customer support chatbot needs sub-200ms response times. Distilled models of 1–7B parameters running on a single GPU can meet this target; 70B+ models cannot. Distillation makes real-time conversational AI economically and technically feasible at scale, and is the primary reason production voice AI applications can operate with sub-300ms end-to-end latency.

Cost reduction in LLM deployments: At high request volumes, serving a 7B student versus a 70B teacher reduces GPU memory requirements by ~10×, enabling significantly higher concurrency per GPU. With frontier API costs ranging from $10–30 per million output tokens, even a modest 30% traffic diversion to a distilled model can reduce inference spend by hundreds of thousands of dollars annually at enterprise scale.
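To make the savings claim concrete, here is a worked calculation under stated assumptions. The monthly token volume and the cost of serving the distilled model are hypothetical; only the frontier price is taken from the range quoted above.

```python
def annual_savings(tokens_per_month, frontier_price, student_price, diverted_share):
    """Yearly saving from routing a share of output tokens to a cheaper model.

    Prices are in dollars per million output tokens.
    """
    diverted_millions = tokens_per_month * diverted_share / 1e6
    return diverted_millions * (frontier_price - student_price) * 12

savings = annual_savings(
    tokens_per_month=5e9,    # hypothetical volume: 5 billion output tokens per month
    frontier_price=15.0,     # within the $10–30 range quoted above
    student_price=1.0,       # hypothetical cost of serving the distilled model
    diverted_share=0.30,     # 30% of traffic routed to the student
)
print(f"estimated annual savings: ${savings:,.0f}")
```

With these inputs the saving is 1,500 million tokens × $14 × 12 months = $252,000 per year. The result scales linearly with volume, so the figure is only as good as the traffic estimate behind it.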

Reasoning capability in smaller models: The DeepSeek-R1 distillation release (January 2025) demonstrated that extended chain-of-thought reasoning — previously considered exclusive to frontier models — can be transferred to 7B and 14B models via fine-tuning on reasoning traces. Production deployments are now using distilled reasoning models for tasks like code generation, mathematical problem solving, and structured data analysis where multi-step reasoning improves output quality without requiring frontier-scale inference costs.

Real-time speech recognition: Distil-Whisper operates 6× faster than Whisper large-v2 with negligible accuracy loss, enabling real-time transcription on CPU — critical for call centre automation, live captioning, and voice assistant applications where batch processing is not viable.

LLM specialisation: A general-purpose 70B LLM can be distilled into a 7B student specifically optimised for a narrow domain (legal document review, medical coding, customer service). The student matches or exceeds the teacher on the domain while being deployable at a fraction of the cost — a pattern now standard in enterprise AI deployments.

Speculative decoding: A small “draft” model, often distilled from the target, generates candidate token sequences at high speed, and the large target model verifies them in a single forward pass. The technique is widely used in production inference stacks. When the draft model and the target model share a training lineage (i.e., the draft is distilled from the target), acceptance rates are higher, making the speedup more consistent. Reported throughput gains are typically in the 2–4× range, and because the target model checks every token, output quality is preserved.
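The draft-then-verify loop can be sketched with toy next-token functions. This is a simplified greedy variant written for illustration: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification step loops over positions where a real system would score all draft tokens in one batched forward pass.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    Returns the generated sequence and the number of draft tokens accepted.
    """
    tokens = list(prompt)
    accepted = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target checks each proposed position in order.
        for i, token in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if token != expected:
                tokens.extend(proposal[:i])   # keep the agreed prefix
                tokens.append(expected)       # replace the mismatch with the target's token
                accepted += i
                break
        else:
            tokens.extend(proposal)           # every draft token was accepted
            accepted += k
    return tokens[: len(prompt) + max_new_tokens], accepted

def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

def target_next(context):
    return (context[-1] * 3 + 1) % 10        # toy 'large model'

def draft_next(context):
    # Agrees with the target except after token 7, mimicking an imperfect draft.
    return 0 if context[-1] == 7 else target_next(context)

output, accepted = speculative_decode(target_next, draft_next, prompt=[2], max_new_tokens=10)
print("speculative:", output, "| draft tokens accepted:", accepted)
print("reference:  ", greedy_decode(target_next, [2], 10))
```

The speculative output is identical to plain greedy decoding with the target model, which is the property that makes the technique lossless. A draft that agrees with the target more often gets more tokens accepted per verification step.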

Response-Based Knowledge Distillation with PyTorch

python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Distillation loss: combines soft-label KL divergence and hard-label cross-entropy
def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    true_labels: torch.Tensor,
    temperature: float = 4.0,
    alpha: float = 0.7,
) -> torch.Tensor:
    """
    temperature: scales logits to soften probability distributions.
                 Higher T → softer targets → more inter-class information.
    alpha:       weight for the soft-label loss (1-alpha for hard-label loss).
    """
    # Soft targets from teacher (temperature-scaled)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence loss on soft labels (scaled by T² to preserve gradient magnitude)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    # Standard cross-entropy on hard ground-truth labels
    hard_loss = F.cross_entropy(student_logits, true_labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss


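# Training loop sketch. The toy models and random data below are placeholders
# (assumptions) so the example runs end to end; in practice the teacher is a
# pre-trained model and the dataloader yields real (inputs, labels) batches.
teacher_model = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student_model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
dataloader = [(torch.randn(64, 32), torch.randint(0, 10, (64,))) for _ in range(100)]

teacher_model.eval()                 # Inference mode: disables dropout and batch-norm updates
teacher_model.requires_grad_(False)  # Freeze teacher weights so they are never updated
student_model.train()

optimizer = torch.optim.AdamW(student_model.parameters(), lr=3e-4)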

for batch in dataloader:
    inputs, labels = batch

    with torch.no_grad():
        teacher_logits = teacher_model(inputs)   # Teacher inference (no grad)

    student_logits = student_model(inputs)       # Student forward pass

    loss = distillation_loss(
        student_logits=student_logits,
        teacher_logits=teacher_logits,
        true_labels=labels,
        temperature=4.0,
        alpha=0.7,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
