
Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning AI models with human preferences. It enables models to learn what humans want without requiring a hand-crafted reward function—instead, a reward model is learned directly from human feedback.

RLHF and related preference-learning methods (PPO-based RLHF, DPO, RLAIF) are key post-training techniques behind many assistant models like ChatGPT, Claude, and Gemini.


In classical reinforcement learning, an agent maximizes a predefined reward function. But for complex tasks involving human values—like “write a helpful response” or “generate safe content”—defining a good reward function is extremely difficult.

RLHF solves this by learning the reward function from human preferences.

| Challenge | Traditional RL | RLHF |
|---|---|---|
| Reward specification | Must define an explicit reward function | Learns a reward from human feedback |
| Complex tasks | Struggles with ambiguous objectives | Excels when tasks are “easy to judge but hard to specify” |
| Human values | Difficult to encode | Captured through human comparisons |

RLHF consists of three main phases: supervised fine-tuning (SFT), reward model training, and RL fine-tuning.

In the first phase, the base model is fine-tuned on high-quality human demonstration data.

| Aspect | Details |
|---|---|
| Goal | Teach the model to follow instructions and engage in dialogue |
| Data | (prompt, response) pairs created by humans |
| Scale | Typically tens of thousands of examples |
| Result | A model that generates plausible responses |
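
As a concrete illustration, here is a minimal sketch of an SFT training step, assuming PyTorch and Hugging Face Transformers (the "gpt2" checkpoint is just a placeholder base model): the prompt tokens are masked out of the loss so the model is only trained to reproduce the human-written response.

```python
# Minimal SFT sketch (illustrative; assumes PyTorch + Hugging Face transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, response: str) -> float:
    # Tokenize the prompt alone and the full (prompt + response) sequence.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    # Train only on the response tokens: -100 labels are ignored by the loss.
    # (Approximate: assumes the prompt tokenization is a prefix of the full one.)
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```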

Key Point: SFT alone shows the model what to do, but not how well it’s doing. The model needs feedback on response quality.


In the second phase, a reward model (RM) is trained to predict human preferences.

Humans rank multiple responses to the same prompt:

| Prompt | Response A | Response B | Response C |
|---|---|---|---|
| “Explain quantum computing” | [Response] | [Response] | [Response] |

Humans rank: A > B > C

This creates pairwise comparisons: (A > B), (A > C), (B > C)
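
Turning a ranking into training pairs is mechanical; a small illustrative Python helper (not from any particular library) might look like this:

```python
# Expand a human ranking (best to worst) into pairwise (winner, loser) comparisons.
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    # Earlier items are preferred over later ones, so each pair is (winner, loser).
    return list(combinations(ranked_responses, 2))

print(ranking_to_pairs(["A", "B", "C"]))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```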

The RM is trained to output higher scores for preferred responses:

Loss = -log(σ(score_winning - score_losing))

Where σ is the sigmoid function. This loss ensures the RM assigns higher scores to responses humans prefer.
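
In code, this pairwise objective is only a few lines; the sketch below assumes PyTorch, with the scores coming from a reward model that maps each (prompt, response) pair to a scalar:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_winning: torch.Tensor, score_losing: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(score_winning - score_losing)), averaged over the batch.
    # logsigmoid is the numerically stable form of log(sigmoid(x)).
    return -F.logsigmoid(score_winning - score_losing).mean()

# Example: scalar scores for a batch of (winner, loser) response pairs.
score_w = torch.tensor([1.2, 0.3, 2.0])
score_l = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(score_w, score_l))  # lower when winners outscore losers
```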

| Aspect | Details |
|---|---|
| Data scale | Often hundreds of thousands to millions of comparison pairs |
| Inter-rater agreement | Often imperfect (commonly ~60-80%, depending on task) |
| Initialization | Often initialized from the base or SFT model |

Why ranking instead of absolute scores? Humans are much better at comparing two options than assigning absolute scores.


The final phase uses reinforcement learning to optimize the language model to generate responses that maximize the reward model’s scores.

| RL Component | Language Model Equivalent |
|---|---|
| State | Prompt + tokens generated so far |
| Action | Next token |
| Policy | The language model itself |
| Reward | RM score of the final response (plus KL penalty) |

PPO (Proximal Policy Optimization) optimizes the language model to:

  • Maximize reward model scores
  • Stay close to the SFT model (via KL penalty to prevent drift)
  • Optionally preserve language capabilities (some recipes mix in a language-modeling loss, called “PPO-ptx”)

| Term | Purpose |
|---|---|
| Reward | Maximize scores from the reward model |
| KL penalty | Prevent the model from drifting too far from SFT behavior |
| LM loss (optional) | Preserve the model’s original language capabilities |

Without the KL penalty, the model might exploit the reward model by generating responses that get high scores but are actually low quality (reward hacking). The KL penalty reduces (but doesn’t eliminate) this risk by keeping the model grounded in its original behavior.
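
As a rough sketch (the names and the coefficient β are illustrative, not any specific library's API), the per-response reward typically combines the RM score with the KL penalty like this:

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                logprobs_policy: torch.Tensor,
                logprobs_sft: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    # Per-token KL estimate between the current policy and the frozen SFT model,
    # summed over the generated tokens of one response.
    kl = (logprobs_policy - logprobs_sft).sum()
    # Reward model score for the full response, minus the KL penalty.
    return rm_score - beta * kl
```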


1. PRETRAINING (foundation model)
2. SUPERVISED FINE-TUNING
   Learn: (prompt, response) pairs
   Output: SFT model
3. REWARD MODEL TRAINING
   Learn: human rankings of responses
   Output: reward model
4. RL FINE-TUNING (PPO)
   Optimize: maximize reward model scores
   Output: final aligned model

| Model | Organization | Notes |
|---|---|---|
| ChatGPT | OpenAI | Popularized RLHF for dialogue |
| InstructGPT | OpenAI | Precursor to ChatGPT |
| Claude | Anthropic | Combines RLHF with Constitutional AI (RLAIF) |
| Sparrow | DeepMind | Dialogue agent with evidence-based responses |
| Gemini | Google | Uses preference optimization for alignment |

According to OpenAI’s InstructGPT paper:

| Metric | GPT-3 (175B) | InstructGPT (1.3B) |
|---|---|---|
| Human preference | Baseline | +30% preferred |
| Truthfulness | Baseline | Improved |
| Toxicity | Baseline | Reduced |

A smaller model with RLHF (1.3B) was preferred over a much larger model without it (175B).


Human preferences are diverse and sometimes contradictory. What one person considers a “good” response, another might dislike.

If the human labelers have biases, the reward model will learn and amplify those biases.

| Risk | Description |
|---|---|
| Cultural bias | Labelers from one culture may penalize responses from another |
| Demographic skew | If labelers aren’t representative, the model won’t serve all users equally |

Models may learn to generate responses that score high on the reward model but aren’t actually good—similar to how students might learn to pass tests without understanding the material.

RLHF improves instruction-following and safety-style behavior, but it doesn’t guarantee factuality. Reducing hallucinations often requires retrieval, verification, or targeted training signals beyond basic RLHF.

| Challenge | Details |
|---|---|
| Human labeling | Expensive and time-consuming |
| Reward model training | Requires significant compute |
| PPO training | Multiple forward/backward passes per update |

RLAIF (Reinforcement Learning from AI Feedback)


Instead of humans providing feedback, an AI system generates preferences based on a set of rules or principles. This approach is used in Anthropic’s Constitutional AI.

DPO (Direct Preference Optimization)

DPO is a simpler approach that directly optimizes the policy from preference data without training a separate reward model, eliminating the RL loop entirely.
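
A minimal sketch of the DPO objective, assuming the per-response log-probabilities under the policy and under a frozen reference (SFT) model have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta: float = 0.1):
    # Implicit "rewards": how much the policy has shifted probability mass
    # toward each response relative to the reference (SFT) model.
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)
    # Same pairwise form as the reward model loss, applied directly to the policy.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```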

| Method | Pros | Cons |
|---|---|---|
| RLHF | Well-studied, proven results | Complex, expensive |
| RLAIF | Scalable, consistent | Depends on rule quality |
| DPO | Simpler, no RL loop | Newer, less proven |

  • RLHF aligns AI models with human preferences by learning a reward model from human comparisons
  • Three phases: SFT (learn dialogue) → Reward Model (learn preferences) → PPO (optimize for rewards)
  • Why it works: Humans are good at judging (“which response is better?”) even when they can’t define the objective function
  • Key applications: ChatGPT, Claude, Gemini, and most modern helpful AI assistants
  • Limitations: Subjective preferences, bias amplification, reward hacking, high cost

RLHF represents a paradigm shift: instead of specifying what we want, we teach AI systems to learn from our judgments.