Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning AI models with human preferences. It enables models to learn what humans want without requiring a hand-crafted reward function—instead, a reward model is learned directly from human feedback.
RLHF and related preference-learning methods (PPO-based RLHF, DPO, RLAIF) are key post-training techniques behind many assistant models like ChatGPT, Claude, and Gemini.
Why RLHF Matters
In classical reinforcement learning, an agent maximizes a predefined reward function. But for complex tasks involving human values—like “write a helpful response” or “generate safe content”—defining a good reward function is extremely difficult.
RLHF solves this by learning the reward function from human preferences.
| Challenge | Traditional RL | RLHF |
|---|---|---|
| Reward Specification | Must define explicit reward function | Learns reward from human feedback |
| Complex Tasks | Struggles with ambiguous objectives | Excels when tasks are “easy to judge but hard to specify” |
| Human Values | Difficult to encode | Captured through human comparisons |
How RLHF Works
RLHF consists of three main phases:
Phase 1: Supervised Fine-Tuning (SFT)
Before RLHF, the base model is first fine-tuned on high-quality demonstration data.
| Aspect | Details |
|---|---|
| Goal | Teach the model to follow instructions and engage in dialogue |
| Data | (prompt, response) pairs created by humans |
| Scale | Typically tens of thousands of examples |
| Result | A model that generates plausible responses |
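For intuition, SFT is ordinary next-token prediction on the demonstration data, usually with the loss restricted to the response tokens. The sketch below is a minimal PyTorch illustration; the function name and masking convention are assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy over response tokens only, for one (prompt, response) example.

    logits: (seq_len, vocab_size) model outputs; labels: (seq_len,) token ids.
    """
    # Standard next-token prediction: position t predicts token t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    # Mask the prompt so the model is only trained to imitate the response.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```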
Key Point: SFT alone shows the model what to do, but not how well it’s doing. The model needs feedback on response quality.
Phase 2: Train a Reward Model
The reward model (RM) learns to predict human preferences.
How Comparison Data is Collected
Humans rank multiple responses to the same prompt:
| Prompt | Response A | Response B | Response C |
|---|---|---|---|
| ”Explain quantum computing” | [Response] | [Response] | [Response] |
Humans rank: A > B > C
This creates pairwise comparisons: (A > B), (A > C), (B > C)
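As a small illustration of that expansion (the function name here is just for this sketch):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand one ranking (best first) into (winner, loser) comparison pairs."""
    return list(combinations(ranked_responses, 2))

# A > B > C yields [("A", "B"), ("A", "C"), ("B", "C")]
pairs = ranking_to_pairs(["A", "B", "C"])
```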
Training the Reward Model
The RM is trained to output higher scores for preferred responses:
Loss = -log(σ(score_winning - score_losing))
where σ is the sigmoid function. This loss ensures the RM assigns higher scores to responses humans prefer.
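In code, this pairwise loss is a one-liner. A minimal PyTorch sketch, assuming `score_winning` and `score_losing` are batched scalar scores from the RM for the preferred and rejected responses:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_winning: torch.Tensor, score_losing: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(score_w - score_l)), using logsigmoid for numerical stability.
    return -F.logsigmoid(score_winning - score_losing).mean()

# Example: a batch of three comparisons.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.4, -0.1, 1.5]))
```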
| Aspect | Details |
|---|---|
| Data scale | Often hundreds of thousands to millions of comparison pairs |
| Inter-rater agreement | Often imperfect (commonly ~60-80% depending on task) |
| Initialization | Often initialized from the base or SFT model |
Why ranking instead of absolute scores? Humans are much better at comparing two options than assigning absolute scores.
Phase 3: RL Fine-Tuning with PPO
The final phase uses reinforcement learning to optimize the language model to generate responses that maximize the reward model’s scores.
The Setup
| RL Component | Language Model Equivalent |
|---|---|
| State | Prompt + tokens generated so far |
| Action | Next token |
| Policy | The language model itself |
| Reward | RM score of the final response (plus KL penalty) |
PPO (Proximal Policy Optimization) optimizes the language model to:
- Maximize reward model scores
- Stay close to the SFT model (via KL penalty to prevent drift)
- Optionally preserve language capabilities (some recipes mix in a language-modeling loss, called “PPO-ptx”)
| Term | Purpose |
|---|---|
| Reward | Maximize scores from the reward model |
| KL Penalty | Prevent the model from drifting too far from SFT behavior |
| LM Loss (optional) | Preserve the model’s original language capabilities |
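For reference, the policy part of PPO is the clipped surrogate objective. Below is a minimal sketch, assuming per-token log-probabilities and advantage estimates have already been computed during the rollout; the 0.2 clip range is a common default, not a fixed requirement.

```python
import torch

def ppo_policy_loss(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor,
                    advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Per-token probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic minimum so overly large policy updates are not rewarded.
    return -torch.min(unclipped, clipped).mean()
```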
Why the KL Penalty Matters
Without the KL penalty, the model might exploit the reward model by generating responses that get high scores but are actually low quality (reward hacking). The KL penalty reduces (but doesn’t eliminate) this risk by keeping the model grounded in its original behavior.
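In practice the penalty is usually folded into the per-token reward the policy is trained on, with the RM score added only at the final token. A rough sketch under those assumptions (the coefficient 0.1 is illustrative):

```python
import torch

def shaped_rewards(rm_score: float, logprobs_policy: torch.Tensor,
                   logprobs_sft: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one generated response."""
    # Per-token penalty for drifting away from the frozen SFT model.
    kl_penalty = kl_coef * (logprobs_policy - logprobs_sft)
    rewards = -kl_penalty
    # The reward model's scalar score is granted only at the final token.
    rewards[-1] = rewards[-1] + rm_score
    return rewards
```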
The Complete Pipeline
1. Pretraining → foundation model
2. Supervised fine-tuning: learn from (prompt, response) pairs → SFT model
3. Reward model training: learn from human rankings of responses → reward model
4. RL fine-tuning (PPO): maximize reward model scores → final aligned model
RLHF in Practice
Well-Known Models Using RLHF
| Model | Organization | Notes |
|---|---|---|
| ChatGPT | OpenAI | Popularized RLHF for dialogue |
| InstructGPT | OpenAI | Precursor to ChatGPT |
| Claude | Anthropic | Combines RLHF with Constitutional AI (RLAIF) |
| Sparrow | DeepMind | Dialogue agent with evidence-based responses |
| Gemini | Google DeepMind | Uses preference optimization for alignment |
What RLHF Improves
According to OpenAI’s InstructGPT paper:
| Metric | GPT-3 (175B) | InstructGPT (1.3B) |
|---|---|---|
| Human Preference | Baseline | +30% preferred |
| Truthfulness | Baseline | Improved |
| Toxicity | Baseline | Reduced |
A smaller model with RLHF (1.3B) was preferred over a much larger model without it (175B).
Limitations and Challenges
1. Subjectivity of Human Preferences
Human preferences are diverse and sometimes contradictory. What one person considers a “good” response, another might dislike.
2. Bias Amplification
If the human labelers have biases, the reward model will learn and amplify those biases.
| Risk | Description |
|---|---|
| Cultural bias | Labelers from one culture may penalize responses from another |
| Demographic skew | If labelers aren’t representative, the model won’t serve all users equally |
3. Reward Hacking
Models may learn to generate responses that score high on the reward model but aren’t actually good—similar to how students might learn to pass tests without understanding the material.
4. Hallucination
RLHF improves instruction-following and safety-related behavior, but it doesn’t guarantee factuality. Reducing hallucinations often requires retrieval, verification, or targeted training signals beyond basic RLHF.
5. Cost and Scalability
| Challenge | Details |
|---|---|
| Human labeling | Expensive, time-consuming |
| Reward model training | Requires significant compute |
| PPO training | Multiple forward/backward passes per update |
Alternatives to RLHF
RLAIF (Reinforcement Learning from AI Feedback)
Instead of humans providing feedback, AI systems generate preferences based on a set of rules or principles. Used in Anthropic’s Constitutional AI.
Direct Preference Optimization (DPO)
A simpler approach that directly optimizes the policy from preference data without training a separate reward model. Eliminates the RL loop entirely.
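The core of DPO is a single classification-style loss on preference pairs. A minimal sketch, assuming summed log-probabilities of the chosen and rejected responses under both the policy being trained and the frozen reference (SFT) model; beta = 0.1 is an illustrative value:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins of the policy and the reference model.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```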
| Method | Pros | Cons |
|---|---|---|
| RLHF | Well-studied, proven results | Complex, expensive |
| RLAIF | Scalable, consistent | Depends on rule quality |
| DPO | Simpler, no RL loop | Newer, less proven |
- RLHF aligns AI models with human preferences by learning a reward model from human comparisons
- Three phases: SFT (learn dialogue) → Reward Model (learn preferences) → PPO (optimize for rewards)
- Why it works: Humans are good at judging (“which response is better?”) even when they can’t define the objective function
- Key applications: ChatGPT, Claude, Gemini, and most modern helpful AI assistants
- Limitations: Subjective preferences, bias amplification, reward hacking, high cost
RLHF represents a paradigm shift: instead of specifying what we want, we teach AI systems to learn from our judgments.