Main Concept

RLHF is a model training technique used for fine-tuning large models, where human feedback is used to help ML to self-learn more efficiently.

Key Aspects

  • Unlike Reinforcement Learning, where the reward function is purely logical
    or mathematical, in RLHF we directly incorporate real human feedback into
    that reward function — allowing the model to align more closely with human
    goals, desires, and values.
  • First, the model’s responses are compared to human’s responses
  • Then, a human assess the quality of model’s responses
  • RLHF is used throughout GenAI applications including MML models.
  • RLHF significantly enhances the model performance

Examples

  • Grading text translation from “technically correct” to “human”

Training Methodology using RLHF

This is the high-level process used to train a model using RLHF:

The goal: Build a chatbot that answers internal company questions the way a human expert would.

Step 1: Data Collection

  • In this step, we build the training dataset, which consist of several question-answer pairs created manually by humans.
  • Those answers are the gold standard human answer for each question that we would expect the model to answer the same or similarly.
    • Q: “Where is the HR department in Boston?”
    • A: “3rd floor, Building A”

Step 2: Supervised Fine-tuning

  • Here we take an existing base model and fine-tune it with our internal data.
  • Now the model generates its own answers to those same questions.
  • Its answers are mathematically compared to the human answers.
  • The model learns to get closer to human-quality responses

At this point we have a decent model - but “mathematically close” doesn’t always mean “what a human would actually prefer”

Step 3: Build a Rewards Model

  • We show humans TWO responses to the same prompt.
    • “Which answers do yo prefer - A or B?”
  • Human picks their preference repeatedly across many examples.
  • A separate model (the reward model) learn from those preferences
  • It can how predict “how much would a human like this response?” without needing a human present every time.

This the key innovation of RLHF - you’re essentially training a model to simulate human judgment.

Step 4: Optimize with the Reward Model

  • Now we use the reward model as an automated judge.
  • The language model generates responses.
  • The reward model scores them.
  • The language model adjust to score higher
  • This loop runs fully automated meaning no humans needed at this stage.

The complete picture

  • Human answers → fine-tune model (Step 2)
  • Human preferences → train reward model (Step 3)
  • Reward model → optimize language model (Step 4)

Exam Domains

Domain 3, Task Statement 3.3.

References