Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning AI models with human preferences. It enables models to learn what humans want without requiring a hand-crafted reward function—instead, a reward model is learned directly from human feedback.
RLHF and related preference-learning methods (PPO-based RLHF, DPO, RLAIF) are key post-training techniques behind many assistant models like ChatGPT, Claude, and Gemini.
Why RLHF Matters
In classical reinforcement learning, an agent maximizes a predefined reward function. But for complex tasks involving human values—like “write a helpful response” or “generate safe content”—defining a good reward function is extremely difficult.
RLHF solves this by learning the reward function from human preferences.
| Challenge | Traditional RL | RLHF |
|---|---|---|
| Reward Specification | Must define explicit reward function | Learns reward from human feedback |
| Complex Tasks | Struggles with ambiguous objectives | Excels when tasks are “easy to judge but hard to specify” |
| Human Values | Difficult to encode | Captured through human comparisons |
How RLHF Works
RLHF consists of three main phases:
Phase 1: Supervised Fine-Tuning (SFT)
Before RLHF, the base model is first fine-tuned on high-quality demonstration data.
| Aspect | Details |
|---|---|
| Goal | Teach the model to follow instructions and engage in dialogue |
| Data | (prompt, response) pairs created by humans |
| Scale | Typically tens of thousands of examples |
| Result | A model that generates plausible responses |
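For intuition, SFT is ordinary next-token prediction on the demonstration data, usually with the loss restricted to the response tokens. The sketch below is a minimal PyTorch illustration; the function name and masking convention are assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Cross-entropy over response tokens only, for one (prompt, response) example.

    logits: (seq_len, vocab_size) model outputs; labels: (seq_len,) token ids.
    """
    # Standard next-token prediction: position t predicts token t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    # Mask the prompt so the model is only trained to imitate the response.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```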
Key Point: SFT alone shows the model what to do, but not how well it’s doing. The model needs feedback on response quality.
Phase 2: Train a Reward Model
The reward model (RM) learns to predict human preferences.
How Comparison Data is Collected
Humans rank multiple responses to the same prompt:
| Prompt | Response A | Response B | Response C |
|---|---|---|---|
| ”Explain quantum computing” | [Response] | [Response] | [Response] |
Humans rank: A > B > C
This creates pairwise comparisons: (A > B), (A > C), (B > C)
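As a small illustration of that expansion (the function name here is just for this sketch):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand one ranking (best first) into (winner, loser) comparison pairs."""
    return list(combinations(ranked_responses, 2))

# A > B > C yields [("A", "B"), ("A", "C"), ("B", "C")]
pairs = ranking_to_pairs(["A", "B", "C"])
```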
Training the Reward Model
The RM is trained to output higher scores for preferred responses:
Loss = -log(σ(score_winning - score_losing))
where σ is the sigmoid function. This loss ensures the RM assigns higher scores to responses humans prefer.
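In code, this pairwise loss is a one-liner. A minimal PyTorch sketch, assuming `score_winning` and `score_losing` are batched scalar scores from the RM for the preferred and rejected responses:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_winning: torch.Tensor, score_losing: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(score_w - score_l)), using logsigmoid for numerical stability.
    return -F.logsigmoid(score_winning - score_losing).mean()

# Example: a batch of three comparisons.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.4, -0.1, 1.5]))
```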
| Aspect | Details |
|---|---|
| Data scale | Often hundreds of thousands to millions of comparison pairs |
| Inter-rater agreement | Often imperfect (commonly ~60-80% depending on task) |
| Initialization | Often initialized from the base or SFT model |
Why ranking instead of absolute scores? Humans are much better at comparing two options than assigning absolute scores.
Phase 3: RL Fine-Tuning with PPO
The final phase uses reinforcement learning to optimize the language model to generate responses that maximize the reward model’s scores.
The Setup
| RL Component | Language Model Equivalent |
|---|---|
| State | Prompt + tokens generated so far |
| Action | Next token |
| Policy | The language model itself |
| Reward | RM score of the final response (plus KL penalty) |
PPO (Proximal Policy Optimization) optimizes the language model to:
- Maximize reward model scores
- Stay close to the SFT model (via KL penalty to prevent drift)
- Optionally preserve language capabilities (some recipes mix in a language-modeling loss, called “PPO-ptx”)
| Term | Purpose |
|---|---|
| Reward | Maximize scores from the reward model |
| KL Penalty | Prevent the model from drifting too far from SFT behavior |
| LM Loss (optional) | Preserve the model’s original language capabilities |
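For reference, the policy part of PPO is the clipped surrogate objective. Below is a minimal sketch, assuming per-token log-probabilities and advantage estimates have already been computed during the rollout; the 0.2 clip range is a common default, not a fixed requirement.

```python
import torch

def ppo_policy_loss(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor,
                    advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    # Per-token probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic minimum so overly large policy updates are not rewarded.
    return -torch.min(unclipped, clipped).mean()
```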
Why the KL Penalty Matters
Without the KL penalty, the model might exploit the reward model by generating responses that get high scores but are actually low quality (reward hacking). The KL penalty reduces (but doesn’t eliminate) this risk by keeping the model grounded in its original behavior.
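In practice the penalty is usually folded into the per-token reward the policy is trained on, with the RM score added only at the final token. A rough sketch under those assumptions (the coefficient 0.1 is illustrative):

```python
import torch

def shaped_rewards(rm_score: float, logprobs_policy: torch.Tensor,
                   logprobs_sft: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards for one generated response."""
    # Per-token penalty for drifting away from the frozen SFT model.
    kl_penalty = kl_coef * (logprobs_policy - logprobs_sft)
    rewards = -kl_penalty
    # The reward model's scalar score is granted only at the final token.
    rewards[-1] = rewards[-1] + rm_score
    return rewards
```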
The Complete Pipeline
1. Pretraining → foundation model
2. Supervised fine-tuning: learn from (prompt, response) pairs → SFT model
3. Reward model training: learn from human rankings of responses → reward model
4. RL fine-tuning (PPO): maximize reward model scores → final aligned model
RLHF in Practice
Well-Known Models Using RLHF
| Model | Organization | Notes |
|---|---|---|
| ChatGPT | OpenAI | Popularized RLHF for dialogue |
| InstructGPT | OpenAI | Precursor to ChatGPT |
| Claude | Anthropic | Combines RLHF with Constitutional AI (RLAIF) |
| Sparrow | DeepMind | Dialogue agent with evidence-based responses |
| Gemini | Google DeepMind | Uses preference optimization for alignment |
What RLHF Improves
According to OpenAI’s InstructGPT paper:
| Metric | GPT-3 (175B) | InstructGPT (1.3B) |
|---|---|---|
| Human Preference | Baseline | +30% preferred |
| Truthfulness | Baseline | Improved |
| Toxicity | Baseline | Reduced |
A smaller model with RLHF (1.3B) was preferred over a much larger model without it (175B).
Limitations and Challenges
1. Subjectivity of Human Preferences
Human preferences are diverse and sometimes contradictory. What one person considers a “good” response, another might dislike.
2. Bias Amplification
If the human labelers have biases, the reward model will learn and amplify those biases.
| Risk | Description |
|---|---|
| Cultural bias | Labelers from one culture may penalize responses from another |
| Demographic skew | If labelers aren’t representative, the model won’t serve all users equally |
3. Reward Hacking
Models may learn to generate responses that score high on the reward model but aren’t actually good—similar to how students might learn to pass tests without understanding the material.
4. Hallucination
RLHF improves instruction-following and safety-related behavior, but it doesn’t guarantee factuality. Reducing hallucinations often requires retrieval, verification, or targeted training signals beyond basic RLHF.
5. Cost and Scalability
| Challenge | Details |
|---|---|
| Human labeling | Expensive, time-consuming |
| Reward model training | Requires significant compute |
| PPO training | Multiple forward/backward passes per update |
Alternatives to RLHF
RLAIF (Reinforcement Learning from AI Feedback)
Instead of humans providing feedback, AI systems generate preferences based on a set of rules or principles. Used in Anthropic’s Constitutional AI.
Direct Preference Optimization (DPO)
A simpler approach that directly optimizes the policy from preference data without training a separate reward model. Eliminates the RL loop entirely.
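The core of DPO is a single classification-style loss on preference pairs. A minimal sketch, assuming summed log-probabilities of the chosen and rejected responses under both the policy being trained and the frozen reference (SFT) model; beta = 0.1 is an illustrative value:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins of the policy and the reference model.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Push the policy to prefer the chosen response more strongly than the reference does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```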
| Method | Pros | Cons |
|---|---|---|
| RLHF | Well-studied, proven results | Complex, expensive |
| RLAIF | Scalable, consistent | Depends on rule quality |
| DPO | Simpler, no RL loop | Newer, less proven |
- RLHF aligns AI models with human preferences by learning a reward model from human comparisons
- Three phases: SFT (learn dialogue) → Reward Model (learn preferences) → PPO (optimize for rewards)
- Why it works: Humans are good at judging (“which response is better?”) even when they can’t define the objective function
- Key applications: ChatGPT, Claude, Gemini, and most modern helpful AI assistants
- Limitations: Subjective preferences, bias amplification, reward hacking, high cost
RLHF represents a paradigm shift: instead of specifying what we want, we teach AI systems to learn from our judgments.