Model Distillation

Main Concept

Model distillation is a technique that transfers the knowledge of a large pre-trained model to a new, smaller model. The large model is called the teacher and the small model is called the student.

Why it matters

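As a rough rule of thumb, a distilled model keeps most of the teacher's performance (say, around 80%) at a fraction of the cost and latency (say, around 20%). This is useful when you need efficiency without giving up most of the capability.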

Key Points

It is not classified as a Model Customization method.
The student model is smaller and faster, and therefore more cost-effective (up to 75% cheaper).
Its accuracy is lower than the teacher's, but often still acceptable for the target use case.
You collect inputs and outputs from the teacher model (e.g. prompts and their respective answers), then train the student model on this data, which distills the teacher's knowledge into it. The result is a smaller model that behaves similarly to the original model.
The student model learns to replicate the behavior of the teacher model, not just memorize input/output pairs.
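
The idea of learning the teacher's behavior can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any production model is distilled: the "teacher" is a small fixed network invented for this note, the "student" is a linear softmax classifier, and every name (teacher_predict, distill, agreement) is hypothetical. The student is trained on the teacher's output probabilities (soft labels) rather than on hard class labels, which is what lets it imitate the teacher instead of memorizing answers.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_predict(x):
    """Hypothetical 'large' teacher: a fixed one-hidden-layer network."""
    x0, x1 = x
    hidden = [
        math.tanh(1.5 * x0 - 0.5 * x1),
        math.tanh(0.5 * x0 + 1.0 * x1),
        math.tanh(1.0 * x0 - 0.3 * x1),
    ]
    score = 2.0 * hidden[0] + hidden[1] + hidden[2]
    return softmax([score, -score])  # probabilities for class 0 and class 1


def student_predict(weights, x):
    """Smaller student: a linear softmax classifier (2 inputs plus a bias)."""
    feats = [x[0], x[1], 1.0]
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    return softmax(logits)


def distill(inputs, soft_labels, epochs=300, lr=0.5):
    """Fit the student to the teacher's soft labels with gradient descent
    on the cross-entropy between teacher and student distributions."""
    weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    n = len(inputs)
    for _ in range(epochs):
        grads = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
        for x, q in zip(inputs, soft_labels):
            p = student_predict(weights, x)
            feats = [x[0], x[1], 1.0]
            for k in range(2):
                for j in range(3):
                    # d(cross-entropy)/d(weight) = (student - teacher) * feature
                    grads[k][j] += (p[k] - q[k]) * feats[j] / n
        for k in range(2):
            for j in range(3):
                weights[k][j] -= lr * grads[k][j]
    return weights


def agreement(weights, inputs):
    """Fraction of inputs on which student and teacher pick the same class."""
    same = 0
    for x in inputs:
        t = teacher_predict(x)
        s = student_predict(weights, x)
        if t.index(max(t)) == s.index(max(s)):
            same += 1
    return same / len(inputs)


rng = random.Random(0)
# Step 1: collect inputs and the teacher's outputs for them.
train_inputs = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(200)]
train_soft_labels = [teacher_predict(x) for x in train_inputs]
# Step 2: train the student on that data.
student_weights = distill(train_inputs, train_soft_labels)
# Step 3: check how often the student behaves like the teacher on new inputs.
test_inputs = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(200)]
agreement_rate = agreement(student_weights, test_inputs)
print(f"student/teacher agreement: {agreement_rate:.2f}")
```

The student has fewer parameters than the teacher and cannot match it everywhere, which mirrors the accuracy trade-off noted above.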

Graphical Explanation

Examples

Example:

Teacher model (e.g. Claude 3 Opus): large, slow, expensive (parameter count not published)
Collect: 10,000 prompts with their answers from the teacher
Student model (e.g. Claude 3 Haiku): small, fast, cheap (parameter count not published)
Result: A much faster and cheaper model that behaves similarly
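
The collection step in the example above could look roughly like the following sketch. The function query_teacher is a hypothetical stand-in for a real call to the teacher model, and the "prompt" / "completion" field names and the JSON Lines layout are assumptions chosen for illustration, not the required format of any particular service.

```python
import json


def query_teacher(prompt):
    """Hypothetical stand-in for a call to the teacher model's API."""
    return f"Teacher answer to: {prompt}"


def build_distillation_dataset(prompts):
    """Pair every prompt with the teacher's answer to it."""
    return [{"prompt": p, "completion": query_teacher(p)} for p in prompts]


def to_jsonl(records):
    """Serialize the records as JSON Lines, one prompt/answer pair per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


prompts = ["What is model distillation?", "What is a teacher model?"]
dataset = build_distillation_dataset(prompts)
jsonl_text = to_jsonl(dataset)
print(jsonl_text)
```

In practice the list of prompts would be much larger (the example above uses 10,000) and should cover the kinds of requests the student is expected to handle.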

🌿💻 The Packets Garden
