The training method used to align language models with human preferences.

Prepare for the Anthropic Fellows Program Test with multiple choice questions and in-depth explanations. Our quiz covers AI Safety, Economics, and Research Methods. Master the skills needed for success!

Multiple Choice

Which training method is used to align language models with human preferences?

The training method used to align language models with human preferences is RLHF, reinforcement learning from human feedback. It works by gathering human judgments on model outputs, typically demonstrations of good responses and rankings of candidate responses, and then training a reward model to predict those judgments. The language model is then fine-tuned with reinforcement learning to maximize the score from that reward model, nudging outputs toward what people find helpful, safe, and desirable. Because the reward model provides a learned signal rather than a fixed set of examples, the model can generalize beyond the exact demonstrations it was shown.

The other options do not fit. A plain dataset lacks this evaluative signal, so a model trained on it can imitate examples but is not optimized toward human priorities. An API is just an interface for using a model, not a training method. Constitutional AI is a related alternative that guides behavior with a written set of principles and relies largely on AI-generated feedback, whereas RLHF specifically uses human feedback to shape the reward signal.
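To make the reward-model step concrete, here is a minimal sketch in plain Python. It assumes a pairwise (Bradley-Terry style) ranking loss, which is one common choice and not necessarily what any particular system uses. The linear reward function, the two hand-made features, and the toy preference pairs are all hypothetical, chosen only for illustration. Real reward models are neural networks that score text, and the later reinforcement learning step is not shown.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Loss is small when the human-preferred output gets the higher reward."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def reward(weights, features) -> float:
    """Hypothetical linear reward model: a weighted sum of features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            grad_scale = sigmoid(margin) - 1.0
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Assumed toy data: each output is [helpfulness cue, rudeness cue].
# Each pair is (output the human preferred, output the human rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]

weights = train_reward_model(pairs, n_features=2)
print("learned weights:", weights)
```

After training, the weight on the helpfulness cue is positive and the weight on the rudeness cue is negative, so the reward model ranks new outputs the way the human rankings did. In RLHF, a score like this becomes the reward that the language model is fine-tuned to maximize.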

