
Reinforcement Learning from Human Feedback (RLHF)

RLHF is a training technique for aligning AI models with human preferences. Human raters compare pairs of AI outputs and indicate which is better; these judgments are used to train a reward model, and the AI is then fine-tuned with reinforcement learning to maximize that model's score. RLHF is how models like Claude and GPT are trained to be helpful, harmless, and honest rather than merely statistically likely: it takes a capable base model and shapes its behavior to match what humans actually want.
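The reward-model step can be sketched with a toy example. A common choice (not stated above, so an assumption here) is the Bradley-Terry preference model: the probability that raters prefer one response over another is a sigmoid of the difference in their rewards, and the reward model is trained to minimize the negative log of that probability. Everything below is illustrative: the linear reward model, the synthetic "human" preferences, and all names are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each response is a feature vector; the reward model is linear.
# (Real RLHF uses a neural reward model over text; this is only a sketch.)
dim = 4
w = np.zeros(dim)  # reward-model parameters


def reward(w, x):
    return x @ w


# Synthetic preference data: pretend raters prefer responses whose first
# feature is larger. Each pair is (chosen, rejected).
pairs = []
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if a[0] > b[0] else (b, a)
    pairs.append((chosen, rejected))

# Gradient descent on the Bradley-Terry loss -log sigmoid(r_chosen - r_rejected).
# Minimizing it pushes the chosen response's reward above the rejected one's.
lr = 0.1
for _ in range(100):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        p = 1.0 / (1.0 + np.exp(-margin))        # P(chosen preferred)
        grad += (p - 1.0) * (chosen - rejected)  # d(-log p)/dw
    w -= lr * grad / len(pairs)

# The trained reward model should now rank preferred responses higher.
correct = sum(reward(w, c) > reward(w, r) for c, r in pairs)
print(correct / len(pairs))  # fraction of pairs ranked correctly
```

In a full RLHF pipeline this learned reward model would then serve as the objective for a reinforcement-learning stage (commonly PPO) that fine-tunes the language model itself.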

Most modern LLMs use some form of RLHF or its variants as a key part of their training process.