Main Concept

Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties for each action it takes, and over time learns the strategy (policy) that maximizes cumulative reward.

Unlike Supervised Learning, there are no labeled examples; the agent discovers the right behavior entirely through trial and error.

Key Concepts

  • Agent: The learner or decision-maker
  • Environment: The external system the agent interacts with
  • State: The current situation of the environment
  • Action: A choice made by the agent
  • Reward: Feedback from the environment based on the action taken
  • Policy: The strategy the agent uses to select actions given a state

How It Works

  1. The Agent observes the current State of the Environment
  2. It selects an Action based on its current Policy
  3. The Environment transitions to a new State and returns a Reward
  4. The Agent updates its Policy based on the reward received
  5. This loop repeats; the goal is to maximize cumulative reward over time
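The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real library: `ToyEnv` and `random_policy` are hypothetical names, and the environment is a trivial 1-D world with states 0 to 3 and an exit at state 3.

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment: states 0..3, exit at state 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):              # action is -1 (left) or +1 (right)
        self.state = max(0, min(3, self.state + action))
        reward = 100 if self.state == 3 else -1
        return self.state, reward, self.state == 3

def random_policy(state):
    # Placeholder for step 2: a learning agent would improve this from rewards.
    return random.choice([-1, +1])

env = ToyEnv()
state, done, total = env.state, False, 0
while not done:
    action = random_policy(state)            # 2. select an action
    state, reward, done = env.step(action)   # 3. new state and reward
    total += reward                          # 5. track cumulative reward
    # 4. a learning agent would update its policy here
print("cumulative reward:", total)
```

Because the policy here is purely random, the episode return varies from run to run; learning (step 4) is what pushes that return toward its maximum.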

Example: Robot in a Maze

Scenario: Train a robot to escape a maze using the shortest path possible.

Reward structure:

  • Move to a free cell → -1 point
  • Hit a wall → -10 points
  • Reach the exit → +100 points

At the start, the robot has no knowledge of the maze and explores randomly, hitting walls and losing points. After many simulated runs, it learns which paths cost fewer points and which lead to the exit.

Eventually the agent learns the optimal route, the one that reaches the exit in the minimum number of steps. It’s the equivalent of letting an AI play a video game hundreds of times: at first it loses constantly, but eventually it masters the game.
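This trial-and-error process can be made concrete with tabular Q-learning, one standard RL algorithm. The sketch below is an assumption-laden toy: the "maze" is a 1x5 corridor rather than a 2-D grid, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are arbitrary illustrative values. Only the reward structure matches the one above.

```python
import random

random.seed(0)

# Hypothetical 1x5 corridor "maze": cells 0..4, exit at cell 4,
# with a wall to the left of cell 0 (a stand-in for a full 2-D grid).
EXIT = 4
ACTIONS = [-1, +1]                       # move left / move right

def step(cell, action):
    nxt = cell + action
    if nxt < 0:
        return cell, -10, False          # hit a wall: -10 points, stay put
    if nxt == EXIT:
        return nxt, 100, True            # reach the exit: +100 points
    return nxt, -1, False                # move to a free cell: -1 point

# One Q-value per (cell, action) pair; hyperparameters are assumptions.
Q = {(c, a): 0.0 for c in range(EXIT) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):               # "after many simulated runs"
    cell, done = 0, False
    while not done:
        if random.random() < epsilon:    # sometimes explore randomly...
            action = random.choice(ACTIONS)
        else:                            # ...otherwise exploit learned values
            action = max(ACTIONS, key=lambda a: Q[(cell, a)])
        nxt, reward, done = step(cell, action)
        future = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(cell, action)] += alpha * (reward + gamma * future - Q[(cell, action)])
        cell = nxt

# The greedy policy now heads straight for the exit from every cell.
policy = {c: max(ACTIONS, key=lambda a: Q[(c, a)]) for c in range(EXIT)}
print("learned policy:", policy)
```

Early episodes are dominated by wall hits and wandering; as the Q-values accumulate, the greedy policy converges to "always move right", the corridor's shortest path to the exit.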

Context: Why It Matters for the Exam

RL is distinct from supervised and unsupervised learning in a fundamental way: there is no pre-existing dataset. The model generates its own experience by interacting with an environment, which makes it well suited to sequential decision problems where the right answer isn’t known in advance.

RLHF (Reinforcement Learning from Human Feedback) applies this same principle to fine-tune LLMs: human raters provide the reward signals that teach the model to produce better responses.

Applications

  • Gaming: AlphaGo, chess engines, video game AIs
  • Robotics: navigation and object manipulation in dynamic environments
  • Finance: portfolio management and trading strategies
  • Healthcare: optimizing treatment plans
  • Autonomous Vehicles: path planning and real-time decision-making

Exam Domain (AIF-C01)

Domain 1: Fundamentals of AI and ML

  • Task Statement 1.1: Types of ML (supervised, unsupervised, and reinforcement learning)