Reinforcement learning from human feedback (RLHF)

Main Concept

RLHF is a training technique for fine-tuning large models in which human feedback helps the model learn more efficiently.

Key Aspects

Unlike standard reinforcement learning, where the reward function is purely
logical or mathematical, RLHF incorporates real human feedback directly into
that reward function. This allows the model to align more closely with human
goals, desires, and values.
First, the model’s responses are compared with human responses.
Then, a human assesses the quality of the model’s responses.
RLHF is used throughout GenAI applications, including LLMs.
RLHF can significantly improve how helpful and natural a model’s responses are to human readers.

Examples

Grading machine translations on how natural they read, from “technically correct” to “human”.

Training Methodology using RLHF

This is the high-level process used to train a model using RLHF:

The goal: Build a chatbot that answers internal company questions the way a human expert would.

Step 1: Data Collection

In this step, we build the training dataset, which consists of question-answer pairs written manually by humans.
Each answer is the gold-standard human answer for its question; we expect the model to give the same or a similar answer.
- Q: “Where is the HR department in Boston?”
- A: “3rd floor, Building A”
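
As a rough illustration, a dataset like this could be held in a simple structure before fine-tuning. This is a minimal sketch: the `QAPair` class, the `to_training_text` helper, the prompt layout, and the second example pair are all hypothetical and only stand in for whatever format a real training pipeline expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One human-written question with its gold-standard answer."""
    question: str
    answer: str


# Only the first pair comes from the note; the second is made up for illustration.
dataset = [
    QAPair("Where is the HR department in Boston?", "3rd floor, Building A"),
    QAPair("Who approves travel expenses?", "Your direct manager"),
]


def to_training_text(pair: QAPair) -> str:
    """Format a pair as one prompt/response string (hypothetical layout)."""
    return f"Question: {pair.question}\nAnswer: {pair.answer}"


for pair in dataset:
    print(to_training_text(pair))
```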

Step 2: Supervised Fine-tuning

Here we take an existing base model and fine-tune it with our internal data.
Now the model generates its own answers to those same questions.
Its answers are compared mathematically to the human answers.
The model learns to get closer to human-quality responses.
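
One common way to make that mathematical comparison is a cross-entropy loss: the lower the probability the model assigns to the tokens of the human answer, the higher the loss. The sketch below assumes that loss and uses invented probabilities; it is not tied to any particular training library.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-probability the model gave to the human answer's tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities the model assigns to each token of the human
# answer "3rd floor, Building A", before and after fine-tuning.
before = [0.10, 0.20, 0.05, 0.10]
after = [0.80, 0.90, 0.70, 0.85]

loss_before = cross_entropy(before)
loss_after = cross_entropy(after)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Fine-tuning pushes the loss down, which is what "getting closer to the human answer" means in numerical terms.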

At this point we have a decent model, but “mathematically close” doesn’t always mean “what a human would actually prefer”.

Step 3: Build a Reward Model

We show humans TWO responses to the same prompt.
- “Which answer do you prefer: A or B?”
Humans pick their preference repeatedly across many examples.
A separate model (the reward model) learns from those preferences.
It can now predict “how much would a human like this response?” without needing a human present every time.

This is the key innovation of RLHF: you’re essentially training a model to simulate human judgment.
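
To make the idea of learning from "A or B" choices concrete, here is a minimal sketch that assumes a pairwise (Bradley-Terry style) preference loss and a tiny linear reward model. The three features and the preference pairs are hypothetical; a real reward model would be a neural network scoring raw text.

```python
import math


def reward(weights, features):
    """Linear reward model: a scalar score for one response."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Probability the model currently assigns to the human's choice.
            p = sigmoid(reward(weights, chosen) - reward(weights, rejected))
            # Gradient ascent on log p (the pairwise log-likelihood).
            for i in range(n_features):
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [is_specific, is_polite, is_rambling]
# Each tuple is (response the human chose, response the human rejected).
pairs = [
    ([1, 1, 0], [0, 1, 0]),  # the specific answer was preferred
    ([1, 0, 0], [1, 0, 1]),  # the concise answer was preferred
    ([1, 1, 0], [0, 0, 1]),
]
weights = train_reward_model(pairs, n_features=3)

# Score two unseen responses without asking a human.
good = reward(weights, [1, 1, 0])
bad = reward(weights, [0, 1, 1])
print(good > bad)
```

After training, the model rewards specificity and penalizes rambling, so it can rank new responses on its own.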

Step 4: Optimize with the Reward Model

Now we use the reward model as an automated judge.
The language model generates responses.
The reward model scores them.
The language model adjusts to score higher.
This loop is fully automated, meaning no humans are needed at this stage.
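
The generate, score, adjust loop can be sketched with a toy policy that chooses among three fixed responses. Everything here is an assumption for illustration: the candidate responses, the hard-coded `reward_model` scores, and the plain policy-gradient update. Real systems update the weights of a neural language model, typically with an algorithm such as PPO.

```python
import math


def softmax(logits):
    """Turn raw preferences (logits) into a probability per response."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate responses the "language model" can produce.
responses = ["3rd floor, Building A", "Somewhere in Boston", "I don't know"]


def reward_model(response):
    """Stand-in scorer; a real reward model would be learned in Step 3."""
    scores = {
        "3rd floor, Building A": 1.0,
        "Somewhere in Boston": 0.2,
        "I don't know": -0.5,
    }
    return scores[response]


logits = [0.0, 0.0, 0.0]  # the policy starts with no preference
rewards = [reward_model(r) for r in responses]
learning_rate = 1.0

for step in range(200):
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # Exact policy gradient for a softmax policy: raise the logits of
    # responses that score above the current average, lower the rest.
    logits = [
        logit + learning_rate * p * (r - expected)
        for logit, p, r in zip(logits, probs, rewards)
    ]

probs = softmax(logits)
best = responses[probs.index(max(probs))]
print(best, round(max(probs), 3))
```

No human appears anywhere in the loop: the reward scores alone shift probability toward the response people would prefer.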

The complete picture

Human answers → fine-tune model (Step 2)
Human preferences → train reward model (Step 3)
Reward model → optimize language model (Step 4)

Exam Domains

Domain 3, Task Statement 3.3.

References

https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/
