Main Concept
Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties for each action it takes, and over time learns the strategy (policy) that maximizes cumulative reward.
Unlike Supervised Learning, there are no labeled examples: the agent discovers the right behavior entirely through trial and error.
Key Concepts
| Concept | Description |
|---|---|
| Agent | The learner or decision-maker |
| Environment | The external system the agent interacts with |
| State | The current situation of the environment |
| Action | A choice made by the agent |
| Reward | Feedback from the environment based on the action taken |
| Policy | The strategy the agent uses to select actions given a state |
How It Works

- The Agent observes the current State of the Environment
- It selects an Action based on its current Policy
- The Environment transitions to a new State and returns a Reward
- The Agent updates its Policy based on the reward received
- This loop repeats; the goal is to maximize cumulative reward over time
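The loop above can be sketched in a few lines of Python. The `Environment` class and `policy` function here are toy stand-ins (a one-dimensional walk toward a goal cell), invented for illustration and not part of any real RL library:

```python
# Minimal sketch of the agent-environment loop.
# The environment and policy are toy assumptions for illustration.

class Environment:
    """Toy 1-D environment: the agent starts at position 0 and must reach position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 (move right) or -1 (move left)
        self.state = max(0, self.state + action)
        done = self.state == 3
        reward = 100 if done else -1       # small cost per step, big reward at the goal
        return self.state, reward, done

def policy(state):
    # A fixed policy for illustration: always move right.
    return +1

env = Environment()
state, total_reward, done = env.state, 0, False
while not done:
    action = policy(state)                  # 1. agent selects an action from its policy
    state, reward, done = env.step(action)  # 2. environment transitions and returns a reward
    total_reward += reward                  # 3. agent accumulates cumulative reward

print(total_reward)  # 98: two -1 steps, then +100 on reaching the goal
```

In a real RL algorithm the `policy` would also be updated after each reward (step 4 in the loop above); that update rule is what distinguishes methods like Q-learning.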
Example: Robot in a Maze
Scenario: Train a robot to escape a maze using the shortest path possible.

Reward structure:
- Move to a free cell → -1 point
- Hit a wall → -10 points
- Reach the exit → +100 points

At the start, the robot has no knowledge of the maze and explores randomly, hitting walls and losing points. Over many simulated episodes, it learns which paths cost fewer points and which lead to the exit.

Eventually the agent learns the optimal route: the one that reaches the exit in the minimum number of steps. It's the equivalent of letting an AI play a video game hundreds of times: at first it loses constantly, but eventually it masters the game.
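This scenario can be sketched with tabular Q-learning, one of the simplest RL algorithms. The reward values come from the structure above; the 4x4 maze layout, wall positions, and hyperparameters (`alpha`, `gamma`, `epsilon`) are illustrative assumptions, not from the text:

```python
# Sketch of tabular Q-learning for the maze example.
# Rewards follow the text (-1 free cell, -10 wall, +100 exit);
# the maze layout and hyperparameters are assumptions.
import random

random.seed(0)
WALLS = {(1, 1), (2, 1), (1, 3)}               # assumed wall cells in a 4x4 grid
EXIT = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < 4 and 0 <= c < 4) or (r, c) in WALLS:
        return state, -10, False               # hit a wall: stay put, -10 points
    if (r, c) == EXIT:
        return (r, c), 100, True               # reached the exit: +100 points
    return (r, c), -1, False                   # moved to a free cell: -1 point

# Q-table: expected cumulative reward for each (state, action) pair
Q = {((r, c), a): 0.0 for r in range(4) for c in range(4) for a in range(4)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1          # learning rate, discount, exploration rate

for _ in range(500):                           # simulate many episodes
    state, done = (0, 0), False
    while not done:
        # epsilon-greedy: explore randomly sometimes, otherwise exploit learned values
        if random.random() < epsilon:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, ACTIONS[a])
        best_next = max(Q[(nxt, b)] for b in range(4))
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        state = nxt

# Greedy rollout of the learned policy from the start cell
state, path = (0, 0), [(0, 0)]
for _ in range(20):                            # cap steps in case learning is incomplete
    a = max(range(4), key=lambda a: Q[(state, a)])
    state, _, done = step(state, ACTIONS[a])
    path.append(state)
    if done:
        break
print(path)
```

After training, the greedy rollout follows the highest-valued action in each cell, which traces the learned route from start to exit.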
Context: Why It Matters for the Exam
RL is distinct from supervised and unsupervised learning in a fundamental way: there is no dataset. The model generates its own experience by interacting with an environment. This makes it well-suited for sequential decision problems where the right answer isn't known in advance.
RLHF (Reinforcement Learning from Human Feedback) applies this same principle to fine-tune LLMs: human raters provide reward signals that teach the model to produce better responses.
Applications
- Gaming: AlphaGo, chess engines, video game AIs
- Robotics: navigation and object manipulation in dynamic environments
- Finance: portfolio management and trading strategies
- Healthcare: optimizing treatment plans
- Autonomous Vehicles: path planning and real-time decision-making
Related Concepts
- Machine Learning (ML)
- Supervised Learning
- Unsupervised Learning
- RLHF (Reinforcement Learning from Human Feedback)
- Deep Learning (DL)
Exam Domain (AIF-C01)
Domain 1 β Fundamentals of AI and ML
- Task Statement 1.1: Types of ML, covering supervised, unsupervised, and reinforcement learning.