
Reinforcement Learning

Pronunciation

/ˌriːɪnˈfɔːrsmənt ˈlɜːrnɪŋ/

Also known as: RL, reward-based learning, trial-and-error learning

What is Reinforcement Learning?

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. Unlike supervised learning where the model learns from labeled examples, RL agents learn through trial and error, receiving rewards or penalties based on their actions. The goal is to discover a policy—a strategy for choosing actions—that maximizes cumulative reward over time.
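
To make "cumulative reward over time" concrete, here is a minimal sketch of one common way to score a sequence of rewards. It assumes a discount factor `gamma` that weights later rewards less; the discounting and the function name are illustrative assumptions, not something the entry itself specifies.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma**t for its time step t."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, `discounted_return([1, 1, 1], gamma=0.5)` returns 1.75 (1 + 0.5 + 0.25), and with `gamma=1.0` the function reduces to a plain sum.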

The paradigm is inspired by behavioral psychology: just as animals learn behaviors through positive and negative reinforcement, RL agents learn by experiencing the consequences of their actions.

Key Components

Agent: The learner or decision-maker that takes actions in the environment.

Environment: The world the agent interacts with, which changes based on the agent's actions.

State: A representation of the current situation the agent finds itself in.

Action: A choice the agent can make that affects the environment.

Reward: A numerical signal indicating how good or bad an action was.

Policy: The strategy the agent uses to choose actions given states.
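
The components above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the `Corridor` environment, its reward values, and the choice of tabular Q-learning with an epsilon-greedy policy are all hypothetical picks made for the example, not something this entry prescribes.

```python
import random


class Corridor:
    """Hypothetical environment: states 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01  # small step penalty, bonus at the goal
        return self.state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    # Q-table: estimated return for each (state, action) pair.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy policy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((0, 1), key=lambda a: q[state][a])
            next_state, reward, done = env.step(action)
            # Q-learning update: move toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best-known action for each state according to the Q-table."""
    return [max((0, 1), key=lambda a: row[a]) for row in q]
```

Here the agent is the training loop, the environment is `Corridor`, the state is the position, the action is left or right, the reward comes back from `step`, and the policy is the epsilon-greedy rule over the Q-table. After training, `greedy_policy(train())` picks "right" in every non-terminal state.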

Why Reinforcement Learning Matters for AI

Reinforcement learning has been central to many of AI's most impressive achievements:

Game playing: DeepMind's AlphaGo combined RL with tree search to defeat Lee Sedol, one of the world's top Go players, in 2016
Robotics: RL enables robots to learn complex motor skills through practice
LLM alignment: RLHF (Reinforcement Learning from Human Feedback) became the default technique for aligning large language models like ChatGPT, Claude, and Gemini

In 2025, RL saw a resurgence with breakthroughs like DeepSeek-R1, which used RL-based training to achieve major reasoning improvements. Researchers are increasingly turning to RL to strengthen reasoning capabilities and agentic behavior in AI systems.

RLHF: The LLM Connection

The most significant application of RL in modern AI is Reinforcement Learning from Human Feedback (RLHF). The typical pipeline involves:

Pre-training: Train a foundation model on large datasets
Supervised Fine-tuning: Refine with human-labeled examples
Reward Modeling: Humans rank outputs to train a reward model
RL Fine-tuning: Use PPO (Proximal Policy Optimization) to optimize against the reward model
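
The reward modeling step can be illustrated with a small sketch. It assumes a Bradley-Terry style pairwise loss, which is a common choice for training on human comparisons but is not named in this entry; the function names and the idea of expanding a ranking into pairs are illustrative assumptions.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # Equivalent to softplus(-margin); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the reward model scores the human-preferred output higher than the rejected one, and large when the order is reversed. In a real pipeline the scores would come from a neural network and the loss would be minimized by gradient descent; this sketch only shows the quantity being optimized.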

John Schulman, a co-founder of OpenAI, was lead author of the 2017 paper that introduced PPO, the algorithm that powered much of this work. RLHF has become a standard approach for making AI systems helpful, harmless, and honest.
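
As a rough illustration of what PPO optimizes, the sketch below computes the per-sample clipped surrogate objective. The parameter names and the default `clip_eps=0.2` are assumptions for the example; production RLHF setups usually add further terms (such as a penalty for drifting from the original model) that are left out here.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # far outside the [1 - eps, 1 + eps] band in a single update.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + `clip_eps`; with a negative advantage, it stops improving once the ratio drops below 1 - `clip_eps`. This keeps each policy update close to the policy that collected the data.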

Limitations and Critiques

Despite its successes, RL has fundamental limitations. As Yejin Choi notes, reinforcement learning provides reward signals but doesn't teach models how to reason. The model learns what outputs get high rewards without necessarily understanding why.

This is why some researchers argue that pure RL approaches may hit ceilings—they optimize for outcomes without developing genuine understanding or the ability to discover novel solutions.

John Schulman - Co-founder of OpenAI, inventor of PPO
Abductive Reasoning - A form of reasoning RL struggles to capture

Mentioned In

Yejin Choi at 00:22:15

"Reinforcement learning gives you a reward signal, but it doesn't teach the model how to reason about the world."

Related Terms

RLHF, PPO, Reward Modeling, Agents