What are two commonly used safety evaluation metrics for language models, and what do they measure?

Multiple Choice

Explanation:
Both metrics target safety and alignment: one tracks how reliably the model avoids harmful content, and the other tracks how well its outputs match human judgments about what is acceptable.

Harmlessness, often reported as a safety filter rate, measures how often unsafe or harmful outputs are blocked or mitigated by safety systems. It is the proportion of risky prompts that either trigger a safety filter or produce a safe response, and it reflects how effective the safeguards are in real use.
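As a rough illustration of the proportion described above, the sketch below computes a safety rate over a set of risky prompts. The outcome labels ("blocked", "safe_response", "unsafe_response") and the function name are hypothetical, chosen only for this example; real evaluations define their own categories.

```python
# Hypothetical outcome labels for each risky prompt (assumed for illustration).
SAFE_OUTCOMES = {"blocked", "safe_response"}


def safety_rate(outcomes):
    """Return the fraction of risky prompts that were blocked or answered safely."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    safe_count = sum(1 for outcome in outcomes if outcome in SAFE_OUTCOMES)
    return safe_count / len(outcomes)


# Four risky prompts: three handled safely, one unsafe response.
example_outcomes = ["blocked", "safe_response", "unsafe_response", "blocked"]
print(safety_rate(example_outcomes))  # 0.75
```

A higher value means more risky prompts were handled safely; it says nothing about whether harmless prompts were blocked unnecessarily.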

Alignment with human preferences evaluates how closely the model’s outputs match what people consider safe and desirable. It is typically measured with human judgments or preference data, such as pairwise comparisons or collected feedback, and it indicates how well the model’s behavior aligns with human values and expectations.
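One simple way to summarize pairwise comparison data is a win rate: the share of comparisons in which human raters preferred the model's output over a baseline. The sketch below assumes each judgment is recorded as "model", "baseline", or "tie", and counts a tie as half a win; both the labels and the tie rule are assumptions made for this example, not a standard.

```python
def preference_win_rate(judgments):
    """Return the share of pairwise comparisons won by the model.

    Each judgment is "model", "baseline", or "tie" (hypothetical labels).
    A tie counts as half a win, which is an assumed convention.
    """
    if not judgments:
        raise ValueError("judgments must not be empty")
    score = 0.0
    for winner in judgments:
        if winner == "model":
            score += 1.0
        elif winner == "tie":
            score += 0.5
    return score / len(judgments)


# Four comparisons: two wins, one loss, one tie.
example_judgments = ["model", "baseline", "model", "tie"]
print(preference_win_rate(example_judgments))  # 0.625
```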

The other options focus on quality, speed, or resource use rather than safety or alignment, so they are not primary safety evaluation metrics.
