Main Concept
Knowledge distillation is a technique in which a large pre-trained model's knowledge is transferred to a new, smaller model. The large model is called the teacher and the small model the student.
Why it matters
You get roughly 80% of the teacher's performance at a fraction of its cost and latency. Useful when you need efficiency and can tolerate some loss of capability.
Key Points
- It is not considered a Model Customization method.
- The student model is smaller and faster, and therefore more cost-effective (up to 75% cheaper).
- Its accuracy is reduced, but may still be acceptable for the task.
- You collect input/output data from the teacher model (e.g. prompts and their answers), then fine-tune the student on it. This distills the teacher's knowledge into a smaller model that behaves similarly to the original.
- The student model learns to replicate the behavior of the teacher model, not just memorize input/output pairs.
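The last point (matching behavior, not memorizing pairs) is usually implemented by training the student to match the teacher's full output distribution, its "soft targets", rather than only its top answer. A minimal sketch in pure Python, with illustrative logits and a temperature parameter (all numbers here are made up for demonstration):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    The student is trained to minimize this, so it learns the teacher's
    relative preferences over all outputs, not just the single best one.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 3-token vocabulary (purely illustrative values)
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.8]
loss = distillation_loss(teacher, student)    # > 0 while distributions differ
```

In practice a training loop would backpropagate this loss through the student model; the sketch only shows what the loss measures.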
Examples
Illustrative example (Anthropic does not publish parameter counts, so the sizes here are only indicative):
- Teacher model (Claude 3 Opus): large, slow, expensive per token
- Collect: 10,000 prompts with their answers from the teacher
- Student model (Claude 3 Haiku): much smaller, fast, cheap
- Result: a much faster and cheaper model that behaves similarly
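The collection step above can be sketched as a simple pipeline. `ask_teacher` is a hypothetical stand-in for a real call to the teacher model's API, and the record format is an assumption, not a specific provider's schema:

```python
def ask_teacher(prompt):
    """Hypothetical stand-in for a call to the teacher model's API."""
    return f"answer to: {prompt}"

def collect_distillation_data(prompts):
    """Build (prompt, completion) pairs for fine-tuning the student."""
    return [{"prompt": p, "completion": ask_teacher(p)} for p in prompts]

dataset = collect_distillation_data([
    "What is distillation?",
    "Why use a student model?",
])
# Each record pairs a prompt with the teacher's answer; the student is then
# fine-tuned on this dataset so it learns to imitate the teacher's behavior.
```

In a real workflow the 10,000 prompts would be drawn from your production traffic or a curated set, and the resulting dataset fed to a fine-tuning job for the student model.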