Main Concept
Fine-tuning is the process of taking a pre-trained foundation model and improving it by training it on a specific dataset to specialize it for a particular task, domain, or behavior.
Key principle: you're NOT training from scratch; you're adjusting an existing model.
Why is it needed?
Problems solved with fine-tuning
- The model doesn’t understand your domain-specific terminology
- The model doesn’t follow your company’s tone/style
- The model isn’t accurate enough for your specific task
- The model needs to learn proprietary knowledge
- You need consistent, specialized behavior
When not to do fine-tuning
- You just need access to current information → use RAG instead
- Simple instruction changes → use prompt engineering
- You don’t have training data 🤷♂️
- Pre-trained model already works well enough
Fine-tuning classification tree
Primary classification: Based on Label Requirements
FINE-TUNING
│
├─── WITH LABELS (Supervised)
│ └─ Traditional Fine-tuning
│ • Input-output pairs
│ • Instruction tuning
│ • Task-specific training
│ • Examples: Classification, Q&A, Style transfer
│
└─── WITHOUT LABELS (Unsupervised)
└─ Continued Pre-training
• Domain-specific text corpus
• No explicit targets
• Examples: Domain adaptation, terminology learning
Fine-Tuning: Detailed Methods
Fine-tuning WITH Labels (Supervised)
- You use labeled example data to train the model.
- The data has an input-output format (question-answer, prompt-completion, description-category, etc.).
- You need hundreds to thousands of labeled examples: high-quality input/output pairs with clear, consistent labeling.
- Benefits: the model learns your specific style/format, performs better, and produces consistent outputs that match your standards/use case.
- Use cases: specific writing style, domain expertise, task specialization
- Example: customer support ticket classification. Sample data (a data-prep sketch follows below):
{"input": "My password reset email never arrived and I can't log in", "output": "Authentication"} {"input": "The mobile app crashes every time I try to upload a photo", "output": "Technical_Bug"} {"input": "I was charged twice for my subscription this month", "output": "Billing"} {"input": "How do I change my profile picture?", "output": "General_Support"} {"input": "The website is loading very slowly on my browser", "output": "Technical_Bug"} {"input": "I want to upgrade to the premium plan", "output": "Sales"} {"input": "Can you delete my account and all my data?", "output": "Account_Management"}
Instruction Tuning
- A specific method of fine-tuning.
- It focuses on teaching the model to follow instructions, rather than just expanding its general knowledge.
- A typical instruction-tuning dataset consists of an Instruction, an Input (query), and an Output (response):
- Instruction: A natural language prompt that specifies the task to be performed.
- Input: The data on which the task is to be executed.
- Output: The expected result after performing the task on the given input.
- The objective is to make the model good at following any instruction, not just at one specific task.
- Example dataset (a prompt-formatting sketch follows at the end of this section):
{"instruction": "Translate the following English sentence to Spanish", "input": "Hello, how are you?", "output": "Hola, ¿cómo estás?"}
{"instruction": "Summarize the following text", "input": "[long text...]", "output": "[brief summary]"}
{"instruction": "Answer the question", "input": "What is the capital of France?", "output": "The capital of France is Paris"}
- Famous instruction-tuned models:
- InstructGPT (OpenAI)
- FLAN (Google)
- Alpaca (Stanford)
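To show how the instruction/input/output fields become a single training example, here is a small sketch using an Alpaca-style prompt template. The template text is an illustrative assumption; in practice you match whatever template your base model expects.

```python
# Sketch: flatten an instruction-tuning record into one training string.
# The Alpaca-style section headers below are illustrative, not a standard.
def format_instruction_record(record: dict) -> str:
    prompt = f"### Instruction:\n{record['instruction']}\n\n"
    if record.get("input"):  # some records have no separate input
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += f"### Response:\n{record['output']}"
    return prompt

record = {
    "instruction": "Translate the following English sentence to Spanish",
    "input": "Hello, how are you?",
    "output": "Hola, ¿cómo estás?",
}
print(format_instruction_record(record))
```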
Continued Pre-training (WITHOUT Labels)
- You continue training the model on your domain-specific text.
- You need to provide a large amount of unlabeled text from your own domain.
- The goal is to improve the model's general knowledge of your specific field.
- Only certain Amazon models support Continued Pre-training in Amazon Bedrock:
- Amazon Titan Text G1 - Express
- Amazon Titan Text G1 - Lite
- Example: you're building an AI assistant for doctors, but foundation models don't have deep medical knowledge or use proper medical terminology consistently. Typical unlabeled sources (a Bedrock job sketch follows after the examples):
- Medical textbooks (no labels needed, just raw text): "Myocardial infarction occurs when blood flow to the heart muscle is blocked, typically due to coronary artery thrombosis. The left anterior descending artery is most commonly affected..."
- Clinical guidelines: "For acute management of ST-elevation myocardial infarction, immediate reperfusion therapy is indicated. Primary percutaneous coronary intervention should be performed within 90 minutes..."
- Research papers: "The pathophysiology of atherosclerosis involves endothelial dysfunction, lipid accumulation, and inflammatory responses. Macrophages play a crucial role in plaque formation..."
- Patient case studies (anonymized): "A 58-year-old male presented with chest pain radiating to the left arm. ECG showed ST-segment elevation. Troponin levels were elevated at 2.5 ng/mL..."
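A rough boto3 sketch of submitting such a Continued Pre-training job to Amazon Bedrock is shown below. The model identifier, role ARN, S3 URIs, and hyperparameter values are placeholders; check the current Bedrock documentation for the values your chosen Titan model actually supports.

```python
# Sketch: launch a Continued Pre-training job in Amazon Bedrock via boto3.
# All names, ARNs, bucket paths, and hyperparameter values are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="medical-domain-cpt-job",
    customModelName="titan-express-medical",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="CONTINUED_PRE_TRAINING",  # unlabeled domain text, no targets
    trainingDataConfig={"s3Uri": "s3://my-bucket/medical-corpus/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/medical-corpus/output/"},
    hyperParameters={"epochCount": "1", "batchSize": "1", "learningRate": "0.00001"},
)
print("Started job:", response["jobArn"])
```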
Fine-tuning vs. Other Methods
Fine-tuning vs. RAG:
| Aspect | Fine-tuning | RAG |
|---|---|---|
| Model changes | ✅ Yes | ❌ No |
| Training needed | ✅ Yes | ❌ No |
| Updates | Retrain needed | Just update docs |
| Knowledge source | In model parameters | External database |
| Best for | Behavior/style | Current information |
| Latency | Lower | Higher (search step) |
Fine-tuning vs. Prompt Engineering:
| Aspect | Fine-tuning | Prompt Engineering |
|---|---|---|
| Model changes | ✅ Yes | ❌ No |
| Cost | High (training) | Low (just API calls) |
| Complexity | High | Low |
| Consistency | Very high | Variable |
| Token usage | Lower per request | Higher (long prompts) |
| Best for | Repeated tasks | Ad-hoc instructions |
Fine-tuning vs. Training from Scratch:
| Aspect | Fine-tuning | From Scratch |
|---|---|---|
| Starting point | Pre-trained model | Nothing |
| Data needed | 100s-1000s | Millions |
| Time | Hours-Days | Months |
| Cost | $1,000s | $Millions |
| Expertise | Medium | Very High |
💰 Cost Considerations
Fine-tuning WITH Labels:
- Data preparation: Manual labeling (most expensive)
- Training: ~$500 (depends on model size)
- Inference: Similar to base model
- Total: ~$5,000 (mostly labor)
Continued Pre-training:
- Data collection: Automated (cheaper)
- Training: ~$5,000 (much more compute)
- Inference: Similar to base model
- Total: ~$10,000 (mostly compute)
📈 Success Metrics
For Fine-tuning WITH Labels:
- Task-specific metrics (accuracy, F1, etc.; see the sketch after this list)
- Compare against base model
- Human evaluation of outputs
- Consistency checks
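As a minimal illustration of the metric comparison, the sketch below scores made-up predictions from the base and fine-tuned models against the same held-out ticket labels using scikit-learn; the label lists are fabricated purely for the example.

```python
# Sketch: compare base vs. fine-tuned model on held-out tickets.
# Labels and predictions are fabricated just to demonstrate the metric calls.
from sklearn.metrics import accuracy_score, f1_score

true_labels      = ["Billing", "Technical_Bug", "Sales", "Authentication", "Billing"]
base_preds       = ["General_Support", "Technical_Bug", "Sales", "Billing", "Billing"]
fine_tuned_preds = ["Billing", "Technical_Bug", "Sales", "Authentication", "Billing"]

for name, preds in [("base model", base_preds), ("fine-tuned model", fine_tuned_preds)]:
    acc = accuracy_score(true_labels, preds)
    macro_f1 = f1_score(true_labels, preds, average="macro")  # average F1 across classes
    print(f"{name}: accuracy={acc:.2f}, macro-F1={macro_f1:.2f}")
```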
For Continued Pre-training:
- Perplexity on domain text (see the sketch after this list)
- Domain-specific benchmarks
- Qualitative assessment
- A/B testing vs. base model
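For the perplexity check, here is a minimal sketch using the Hugging Face transformers library. It uses gpt2 as a stand-in model and a single snippet of domain text, which are assumptions; in practice you would run your base and continued-pre-trained checkpoints over a held-out domain corpus and compare.

```python
# Sketch: estimate perplexity of a causal LM on a piece of domain text.
# "gpt2" is a stand-in; swap in your base and customized checkpoints to compare.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

domain_text = (
    "Myocardial infarction occurs when blood flow to the heart muscle is blocked, "
    "typically due to coronary artery thrombosis."
)
inputs = tokenizer(domain_text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average token-level cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = math.exp(outputs.loss.item())
print(f"Perplexity on domain text: {perplexity:.2f}")
```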
🚨 Common Pitfalls
Overfitting:
- Problem: Model memorizes training data
- Symptom: Great on training, poor on new data
- Solution: More data, regularization, early stopping
Catastrophic Forgetting:
- Problem: Model loses general capabilities
- Symptom: Good at new task, bad at everything else
- Solution: Mix general and specific data
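One simple way to apply the data-mixing advice is to blend a slice of general-purpose examples back into the fine-tuning set, as in the sketch below; the 70/30 split and the placeholder records are arbitrary assumptions for illustration.

```python
# Sketch: mix domain-specific and general examples to reduce catastrophic forgetting.
# The 70/30 ratio and placeholder records are arbitrary illustrative choices.
import random

domain_examples = [{"prompt": f"domain question {i}", "completion": "..."} for i in range(700)]
general_examples = [{"prompt": f"general question {i}", "completion": "..."} for i in range(300)]

mixed = domain_examples + general_examples
random.shuffle(mixed)  # interleave so every batch sees both distributions

general_fraction = len(general_examples) / len(mixed)
print(f"{len(mixed)} training examples, {general_fraction:.0%} general-purpose")
```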
Insufficient Data:
- Problem: Not enough examples to learn
- Symptom: No improvement over base model
- Solution: Collect more data or try RAG instead
Wrong Method:
- Problem: Using fine-tuning when RAG would work
- Symptom: High cost, unnecessary complexity
- Solution: Start simple (prompts/RAG), escalate if needed
Key Takeaways
IMPORTANT
- Fine-tuning: Specialized training on pre-trained model
- Two main types: WITH labels vs. WITHOUT labels
- WITH labels = Task specialization (supervised)
- WITHOUT labels = Domain knowledge (unsupervised)
- NOT the same as training from scratch
- More expensive than RAG/prompts but more powerful
- Changes the model permanently
- Requires retraining for updates
🔗 Related Concepts
- Transfer Learning: Broader concept (fine-tuning is one type)
- Prompt Engineering: Simpler alternative
- Retrieval Augmented Generation (RAG): Complementary technique
- Model Domain Adaptation: Not a method, but an objective
- Model Distillation: Creating smaller versions
- Few-shot Learning: Learning from few examples (in prompts)
- Zero-shot Learning: No examples needed (base model)