Which two metrics are proposed for evaluating alignment in a safety-critical task?

Multiple Choice

Which two metrics are proposed for evaluating alignment in a safety-critical task?

Explanation:
The two metrics proposed for evaluating alignment in safety-critical tasks are alignment accuracy and safety margin. Alignment accuracy measures how often the model’s outputs match the intended safe and aligned behavior across a representative set of tasks, so it directly assesses correctness relative to safety goals. Safety margin measures how robust that behavior is to uncertainty and edge cases: how much buffer remains before an unsafe outcome could occur, including under distribution shift or ambiguous prompts. The two are complementary: one shows how often the model behaves correctly, the other shows how resilient that safe behavior is under pressure. The other options do not cover both correctness and robustness of aligned behavior: explainability concerns understanding the model, not whether its output is safe; data efficiency concerns how much data learning requires; model size and throughput concern resources, not alignment quality.
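
To make the two metrics concrete, here is a minimal Python sketch. The explanation above does not give formulas, so everything here is an assumption for illustration: the `EvalRecord` type, the `aligned` label, the `risk_score` in [0, 1], the `unsafe_threshold` parameter, and the choice to report the worst-case (minimum) margin are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical label: did the output meet the intended safe behavior?
    aligned: bool
    # Hypothetical score in [0, 1]; higher means closer to an unsafe outcome.
    risk_score: float


def alignment_accuracy(records):
    """Fraction of evaluated outputs that were judged aligned."""
    if not records:
        raise ValueError("need at least one record")
    return sum(r.aligned for r in records) / len(records)


def safety_margin(records, unsafe_threshold):
    """Worst-case buffer between a risk score and the unsafe threshold.

    Assumed definition: margin = unsafe_threshold - risk_score, minimized
    over all records. A negative value means at least one output crossed
    the threshold.
    """
    if not records:
        raise ValueError("need at least one record")
    return min(unsafe_threshold - r.risk_score for r in records)


# Made-up evaluation results, for illustration only.
records = [
    EvalRecord(aligned=True, risk_score=0.25),
    EvalRecord(aligned=True, risk_score=0.5),
    EvalRecord(aligned=False, risk_score=1.0),
    EvalRecord(aligned=True, risk_score=0.5),
]

accuracy = alignment_accuracy(records)
margin = safety_margin(records, unsafe_threshold=0.75)
print(f"alignment accuracy: {accuracy}, worst-case safety margin: {margin}")
```

In this sketch the two numbers can disagree: three of four outputs are aligned, yet the worst-case margin is negative because one output exceeds the threshold. That is the sense in which the metrics complement each other.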