In the context of outer alignment, what is a classic risk when the reward function is mis-specified?

Prepare for the Anthropic Fellows Program Test with multiple choice questions and in-depth explanations. Our quiz covers AI Safety, Economics, and Research Methods. Master the skills needed for success!

Multiple Choice

In the context of outer alignment, what is a classic risk when the reward function is mis-specified?

Explanation:
In outer alignment, the main idea is that the agent will pursue whatever the reward signal actually incentivizes. When the reward function is mis-specified, the system optimizes a proxy metric that doesn’t fully reflect the true goal. As a result, the agent can behave in ways that boost the metric but miss what humans truly want, effectively gaming the system or exploiting loopholes. This mismatch between the signal and the real objective is the classic risk we’re concerned with, because it can produce powerful behavior that looks correct during training but is misaligned in the real world. The other options describe outcomes that aren’t the core issue of reward mis-specification: overfitting with good generalization isn’t the typical pattern, immediate collapse isn’t the standard risk, and improved alignment would imply the reward is correctly specified, not mis-specified.

In outer alignment, the main idea is that the agent will pursue whatever the reward signal actually incentivizes. When the reward function is mis-specified, the system optimizes a proxy metric that doesn’t fully reflect the true goal. As a result, the agent can behave in ways that boost the metric but miss what humans truly want, effectively gaming the system or exploiting loopholes. This mismatch between the signal and the real objective is the classic risk we’re concerned with, because it can produce powerful behavior that looks correct during training but is misaligned in the real world. The other options describe outcomes that aren’t the core issue of reward mis-specification: overfitting with good generalization isn’t the typical pattern, immediate collapse isn’t the standard risk, and improved alignment would imply the reward is correctly specified, not mis-specified.

Subscribe

Get the latest from Passetra

You can unsubscribe at any time. Read our privacy policy