Main Concept

Fine-tuning is the process of taking a pre-trained foundation model and improving it by training on a specific dataset, specializing it for a particular task, domain, or behavior.

Key principle: You’re NOT training from scratch; you’re adjusting an existing model.


Why is it needed?

Problems solved with fine-tuning

  • The model doesn’t understand your domain-specific terminology
  • The model doesn’t follow your company’s tone/style
  • The model isn’t accurate enough for your specific task
  • The model needs to learn proprietary knowledge
  • You need consistent, specialized behavior

When not to do fine-tuning

  • You just need access to current information
    • Use RAG instead
  • You just need simple instruction changes
    • Use prompt engineering instead
  • You don’t have training data 🤷‍♂️
  • Pre-trained model already works well enough

Fine-tuning classification tree

Primary classification: Based on Label Requirements

FINE-TUNING
│
├─── WITH LABELS (Supervised)
│    └─ Traditional Fine-tuning
│       • Input-output pairs
│       • Instruction tuning
│       • Task-specific training
│       • Examples: Classification, Q&A, Style transfer
│
└─── WITHOUT LABELS (Unsupervised)
     └─ Continued Pre-training
        • Domain-specific text corpus
        • No explicit targets
        • Examples: Domain adaptation, terminology learning

Fine-Tuning Methods in Detail

Fine-tuning WITH Labels (Supervised)

  • You use labeled example data to train the model.

  • The data has an input-output format (question-answer, prompt-completion, description-category, etc.)

  • You need hundreds to thousands of labeled examples: high-quality input-output pairs with clear, consistent labeling.

  • Benefits: the model learns your specific style/format, performs better, produces consistent outputs, and matches your standards/use case.

  • Use cases: Specific writing style, domain expertise, task specialization

  • Example: customer support ticket classification. Sample data:

    {"input": "My password reset email never arrived and I can't log in",
    "output": "Authentication"}
     
    {"input": "The mobile app crashes every time I try to upload a photo",
    "output": "Technical_Bug"}
     
    {"input": "I was charged twice for my subscription this month",
    "output": "Billing"}
     
    {"input": "How do I change my profile picture?",
    "output": "General_Support"}
     
    {"input": "The website is loading very slowly on my browser",
    "output": "Technical_Bug"}
     
    {"input": "I want to upgrade to the premium plan",
    "output": "Sales"}
     
    {"input": "Can you delete my account and all my data?",
    "output": "Account_Management"}

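Training examples like the ones above are usually stored one JSON object per line (JSONL). A minimal sketch of a validation pass before submitting a fine-tuning job; `validate_jsonl` is a hypothetical helper, not part of any SDK:

```python
import json

# Hypothetical validation helper (not part of any SDK): checks that each
# JSONL line parses and carries the "input"/"output" keys a supervised
# fine-tuning job expects.
def validate_jsonl(lines, required_keys=("input", "output")):
    errors = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: invalid JSON")
            continue
        missing = [k for k in required_keys if k not in record]
        if missing:
            errors.append(f"line {i}: missing keys {missing}")
    return errors

examples = [
    '{"input": "I was charged twice for my subscription this month", "output": "Billing"}',
    '{"input": "The mobile app crashes when I upload a photo"}',  # no "output"
]
print(validate_jsonl(examples))  # → ["line 2: missing keys ['output']"]
```

Catching malformed records up front is much cheaper than discovering them mid-training.
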
Instruction Tuning

  • It is a specific method of fine-tuning.

  • It focuses on teaching the model to respond to instructions, rather than just expanding its general knowledge.

  • A typical dataset for instruction fine-tuning consists of an Instruction, an Input, and an Output.

    • Instruction: A natural language prompt that specifies the task to be performed.
    • Input: The data on which the task is to be executed.
    • Output: The expected result after performing the task on the given input.
  • The objective is to make the model good at following any instruction, not just one specific task.

  • Example dataset:

    {"instruction": "Translate the following English sentence to Spanish",
    "input": "Hello, how are you?",
    "output": "Hola, ¿cómo estás?"}
     
    {"instruction": "Summarize the following text",
    "input": "[long text...]",
    "output": "brief summary"}
     
    {"instruction": "Answer the question",
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris"}
  • Famous instruction-tuned models

    • InstructGPT (OpenAI)
    • FLAN (Google)
    • Alpaca (Stanford)

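Instruction-tuning records are usually serialized into a single prompt string per example before training. A sketch using an Alpaca-style template; the exact wording varies between projects, so treat this template as illustrative:

```python
# Alpaca-style prompt template (the exact wording varies between projects).
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(example):
    """Turn an {instruction, input, output} record into (prompt, target)."""
    prompt = TEMPLATE_WITH_INPUT.format(
        instruction=example["instruction"], input=example["input"]
    )
    return prompt, example["output"]

prompt, target = build_prompt({
    "instruction": "Translate the following English sentence to Spanish",
    "input": "Hello, how are you?",
    "output": "Hola, ¿cómo estás?",
})
print(prompt.endswith("### Response:\n"))  # → True
```

During training the model sees the prompt and learns to produce the target; at inference time the same template is applied to new instructions.
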
Continued Pre-training (WITHOUT Labels)

  • You continue training the model with your domain-specific text

  • You need to provide a large amount of unlabeled text on your own topics

  • The goal is to improve the model’s general knowledge of your specific field.

  • Only certain Amazon models support Continued Pre-training in Amazon Bedrock

    • Amazon Titan Text G1 - Express
    • Amazon Titan Text G1 - Lite
  • Example: You’re building an AI assistant for doctors, but foundation models don’t have deep medical knowledge or use proper medical terminology consistently.

    Medical textbooks (no labels needed - just raw text):
    "Myocardial infarction occurs when blood flow to the heart muscle is blocked,
    typically due to coronary artery thrombosis. The left anterior descending artery
    is most commonly affected..."
     
    Clinical guidelines:
    "For acute management of ST-elevation myocardial infarction, immediate reperfusion
    therapy is indicated. Primary percutaneous coronary intervention should be
    performed within 90 minutes..."
     
    Research papers:
    "The pathophysiology of atherosclerosis involves endothelial dysfunction,
    lipid accumulation, and inflammatory responses. Macrophages play a crucial
    role in plaque formation..."
     
    Patient case studies (anonymized):
    "A 58-year-old male presented with chest pain radiating to the left arm.
    ECG showed ST-segment elevation. Troponin levels were elevated at 2.5 ng/mL..."
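
Since there are no labels, preparing this data mostly means splitting the raw text into fixed-length training windows. A sketch using whitespace "tokens" for simplicity; a real pipeline would use the model's own tokenizer:

```python
# Continued pre-training has no labels: raw domain text is simply split
# into fixed-length windows. Whitespace "tokens" stand in here for a
# real tokenizer's output.
def chunk_corpus(text, window=16, stride=None):
    stride = stride or window  # non-overlapping windows by default
    tokens = text.split()
    return [" ".join(tokens[i:i + window]) for i in range(0, len(tokens), stride)]

corpus = (
    "Myocardial infarction occurs when blood flow to the heart muscle is "
    "blocked, typically due to coronary artery thrombosis."
)
print(len(chunk_corpus(corpus, window=8)))  # → 3
```

A smaller `stride` than `window` would produce overlapping windows, which some pipelines use to avoid losing context at chunk boundaries.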

Fine-tuning vs. Other Methods

Fine-tuning vs. RAG:

| Aspect | Fine-tuning | RAG |
| --- | --- | --- |
| Model changes | ✅ Yes | ❌ No |
| Training needed | ✅ Yes | ❌ No |
| Updates | Retrain needed | Just update docs |
| Knowledge source | In model parameters | External database |
| Best for | Behavior/style | Current information |
| Latency | Lower | Higher (search step) |

Fine-tuning vs. Prompt Engineering:

| Aspect | Fine-tuning | Prompt Engineering |
| --- | --- | --- |
| Model changes | ✅ Yes | ❌ No |
| Cost | High (training) | Low (just API calls) |
| Complexity | High | Low |
| Consistency | Very high | Variable |
| Token usage | Lower per request | Higher (long prompts) |
| Best for | Repeated tasks | Ad-hoc instructions |

Fine-tuning vs. Training from Scratch:

| Aspect | Fine-tuning | From Scratch |
| --- | --- | --- |
| Starting point | Pre-trained model | Nothing |
| Data needed | 100s-1000s | Millions |
| Time | Hours-Days | Months |
| Cost | $1,000s | $Millions |
| Expertise | Medium | Very high |

💰 Cost Considerations

Fine-tuning WITH Labels:

  • Data preparation: Manual labeling (most expensive)
  • Training: ~$500 (depends on model size)
  • Inference: Similar to base model
  • Total: ~$5,000 (mostly labor)

Continued Pre-training:

  • Data collection: Automated (cheaper)
  • Training: ~$5,000 (much more compute)
  • Inference: Similar to base model
  • Total: ~$10,000 (mostly compute)

📈 Success Metrics

For Fine-tuning WITH Labels:

  • Task-specific metrics (accuracy, F1, etc.)
  • Compare against base model
  • Human evaluation of outputs
  • Consistency checks

For Continued Pre-training:

  • Perplexity on domain text
  • Domain-specific benchmarks
  • Qualitative assessment
  • A/B testing vs. base model
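
Perplexity, the headline metric for continued pre-training, is the exponential of the average negative log-likelihood per token: lower means the model finds the domain text less "surprising". A minimal sketch, with illustrative (not real) token probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A domain-adapted model should assign higher probability to domain text,
# i.e. score a lower perplexity (these probabilities are illustrative).
base_model_probs = [0.10, 0.05, 0.20, 0.10]
tuned_model_probs = [0.30, 0.25, 0.40, 0.35]
print(perplexity(base_model_probs) > perplexity(tuned_model_probs))  # → True
```

In practice the per-token probabilities come from the model's output distribution evaluated on held-out domain text.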

🚨 Common Pitfalls

Overfitting:

  • Problem: Model memorizes training data
  • Symptom: Great on training, poor on new data
  • Solution: More data, regularization, early stopping
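
Early stopping, one of the solutions above, halts training once validation loss stops improving for a set number of evaluations. A minimal sketch of the stopping rule (the `patience` value is illustrative):

```python
# Early stopping: halt training when validation loss has not improved for
# `patience` consecutive evaluations, a common guard against overfitting.
def early_stop_epoch(val_losses, patience=2):
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch  # stop here, even if training loss keeps falling
    return None  # never triggered

# Validation loss bottoms out at epoch 2, then rises - the classic
# overfitting symptom (great on training data, worse on new data):
print(early_stop_epoch([0.9, 0.7, 0.6, 0.65, 0.7]))  # → 4
```
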

Catastrophic Forgetting:

  • Problem: Model loses general capabilities
  • Symptom: Good at new task, bad at everything else
  • Solution: Mix general and specific data
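
Mixing general and specific data is often done by "replaying" a fraction of general-purpose examples into the fine-tuning set. A sketch; the 25% ratio is illustrative, not a recommendation from the text:

```python
import random

# Replay mitigation for catastrophic forgetting: blend a fraction of
# general-purpose examples into the domain-specific training set.
# The default 25% general fraction is illustrative only.
def mix_datasets(specific, general, general_fraction=0.25, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible mix
    n_general = round(len(specific) * general_fraction / (1 - general_fraction))
    mixed = specific + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

specific = [f"domain-{i}" for i in range(75)]
general = [f"general-{i}" for i in range(100)]
print(len(mix_datasets(specific, general)))  # → 100 (75 domain + 25 general)
```
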

Insufficient Data:

  • Problem: Not enough examples to learn
  • Symptom: No improvement over base model
  • Solution: Collect more data or try RAG instead

Wrong Method:

  • Problem: Using fine-tuning when RAG would work
  • Symptom: High cost, unnecessary complexity
  • Solution: Start simple (prompts/RAG), escalate if needed

Key Takeaways

IMPORTANT

  • Fine-tuning: Specialized training on pre-trained model
  • Two main types: WITH labels vs. WITHOUT labels
  • WITH labels = Task specialization (supervised)
  • WITHOUT labels = Domain knowledge (unsupervised)
  • NOT the same as training from scratch
  • More expensive than RAG/prompts but more powerful
  • Changes the model permanently
  • Requires retraining for updates