January 7, 2025
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
This blog examines concerns that many AI safety benchmarks track general capability improvements rather than genuine safety advances, allowing capability gains to be presented as safety progress, a practice termed "safetywashing."
Key Takeaways
  • Safetywashing Risks: Many AI safety benchmarks inadvertently reflect improvements in general capabilities rather than safety-specific advancements, leading to misrepresentation of progress.
  • Correlation Challenges: High correlations between safety metrics, model capabilities, and training compute emphasize the need for benchmarks that measure distinct safety properties.
  • Redefining Safety Research: Prioritizing benchmarks that decouple safety progress from capability scaling is critical to addressing genuine risks in AI systems.
  • Actionable Recommendations: Transparency, decorrelated benchmarks, and a focus on persistent safety challenges can guide the AI community toward meaningful safety progress.

AI safety is increasingly at the forefront of discussions about responsible technology development, with benchmarks becoming the primary tools for evaluating progress. Yet, the recently released paper, "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?", highlights troubling concerns about the validity and utility of these benchmarks. The paper's findings suggest that many benchmarks may not reflect genuine safety advancements but instead track improvements in general AI capabilities, a phenomenon the authors term "safetywashing."

This blog post explores the implications of this research for the AI, data, and security communities, providing a technical review of the paper and integrating insights from a LinkedIn post by Mindgard CEO Peter Garraghan.

The Core Argument

As the paper outlines, benchmarks are pivotal in shaping AI research by defining metrics for success and guiding resource allocation. For AI safety, benchmarks ostensibly measure properties such as bias mitigation, adversarial robustness, and alignment with human preferences. However, the authors conducted a meta-analysis of dozens of benchmarks and discovered that many safety metrics correlate strongly with upstream model capabilities and training compute. This suggests that general improvements in AI performance, such as scaling model size or dataset quality, are being misconstrued as progress in safety.
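
To make this concrete, here is a minimal sketch, in the spirit of the paper's meta-analysis rather than its actual code, of how a single "capabilities score" can be extracted from a matrix of model scores on general-capability benchmarks. The score matrix below is entirely hypothetical.

```python
import numpy as np

# Hypothetical scores: rows are models, columns are general-capability
# benchmarks (values invented purely for illustration).
scores = np.array([
    [0.42, 0.31, 0.55],
    [0.58, 0.47, 0.63],
    [0.71, 0.66, 0.78],
    [0.83, 0.79, 0.88],
])

# Standardize each benchmark column, then take the first principal component;
# each model's projection onto it serves as a single "capabilities score".
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
capabilities_score = z @ vt[0]   # one scalar per model

print(capabilities_score)
```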

This misrepresentation is problematic for two reasons:

  1. Lack of Differential Progress: Safety should advance faster than capabilities, so that the new risks introduced by more powerful AI systems are actually addressed. If benchmarks fail to measure this differential progress, they risk providing a false sense of security.
  2. Distortion of Research Priorities: If capability enhancements are mistaken for safety advancements, researchers and funding bodies may allocate resources ineffectively, focusing on benchmarks that fail to address critical safety challenges.

Key Findings

The paper provides empirical evidence of safetywashing across several dimensions:

  1. Capabilities Correlation: By analyzing the relationship between safety benchmarks and general AI performance metrics, the authors found a high correlation for many benchmarks (a minimal sketch of this check follows the list). For example, alignment benchmarks like MT-Bench showed capabilities correlations exceeding 70%, indicating they primarily measure model sophistication rather than distinct safety attributes.
  2. Compute Correlation: Safety scores often scaled with training compute, further supporting the idea that these metrics reflect general capabilities rather than targeted safety improvements.
  3. Benchmark Design Flaws: Benchmarks like TruthfulQA and ETHICS conflate safety with capabilities. In contrast, benchmarks focusing on adversarial robustness and calibration showed more promise in isolating safety attributes from general performance.
  4. Weaponization Risks: While some benchmarks effectively measure harmful capabilities, such as biosecurity and cybersecurity risks, these remain underutilized in mainstream safety research.
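
A minimal sketch of the check described in point 1, using invented numbers rather than the paper's data: given a per-model capabilities score (for example, from the earlier PCA sketch) and per-model results on a candidate safety benchmark, a high correlation suggests the benchmark is largely tracking general capability rather than a distinct safety property.

```python
import numpy as np

# Per-model capabilities score (e.g. from the PCA sketch above) and per-model
# scores on a candidate safety benchmark; both vectors are invented.
capabilities_score = np.array([-2.4, -0.8, 0.9, 2.3])
safety_scores = np.array([0.35, 0.48, 0.61, 0.77])

# Pearson correlation as a simple stand-in for the paper's correlation measure.
r = np.corrcoef(capabilities_score, safety_scores)[0, 1]
print(f"capabilities correlation: {r:.1%}")
```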

Implications for AI Security

Peter Garraghan’s LinkedIn post underscores the importance of distinguishing between safety and performance in AI systems, particularly for those working in security contexts.

Peter highlights three key observations:

  1. Terminology Confusion: AI red teaming, often conflated with both safety and security, illustrates the ambiguity in defining and addressing AI safety. Developers working on mission-critical systems struggle to differentiate between AI safety as a domain-specific concern and as a broader research agenda.
  2. Adversarial Robustness Paradox: While adversarial robustness is crucial, it’s unclear how improvements in this area translate to broader system security. Increasing a model’s operational scope to address adversarial risks could expand its attack surface, introducing new vulnerabilities.
  3. Future-Proofing AI Security: Arguments against adversarial robustness often hinge on speculative scenarios involving AGI. However, Garraghan suggests focusing on practical and immediate challenges, such as ensuring that AI systems do not compromise other components of critical infrastructures.

Measuring Disproportionate Safety Gains

One of the most intriguing opportunities in AI safety lies in identifying instances where safety or security improvements outpace general capability gains. Developing a method to predict or measure such instances could offer immense value: it would let organizations map which interventions produce meaningful safety progress, clearly decoupled from upstream performance improvements, and prioritize resources accordingly. By making this distinction measurable, the AI community could focus on interventions that genuinely improve safety outcomes.
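
A rough sketch of one way such a measure could look, using invented data and a simple linear baseline rather than any method from the paper: fit the safety-versus-capabilities trend across baseline models, then treat the residual as safety gain that is not explained by general capability.

```python
import numpy as np

# Hypothetical data: four baseline models plus a fifth model with the same
# capabilities score as the third but a targeted safety intervention applied.
capabilities = np.array([-2.4, -0.8, 0.9, 2.3, 0.9])
safety       = np.array([0.35, 0.48, 0.61, 0.77, 0.74])

# Fit the baseline safety-vs-capabilities trend on the first four models,
# then measure each model's deviation from that trend.
slope, intercept = np.polyfit(capabilities[:4], safety[:4], deg=1)
residuals = safety - (slope * capabilities + intercept)

# A clearly positive residual for the last model marks a capability-decoupled
# safety gain attributable to the intervention.
print(residuals)
```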

Moving Beyond Safetywashing

To advance the science of AI safety, the paper proposes three actionable recommendations:

  1. Report Capabilities Correlations: Researchers should disclose the extent to which safety metrics are influenced by general capabilities. This transparency would help distinguish genuine safety progress from safetywashing.
  2. Develop Decorrelated Benchmarks: New benchmarks should focus on properties orthogonal to capabilities, ensuring they measure distinct safety attributes. For example, RMS calibration error offers a more reliable measure of model uncertainty than the Brier score, which is highly correlated with accuracy (a small sketch contrasting the two follows this list).
  3. Reassess Research Priorities: Safety research should prioritize areas that remain challenging as capabilities scale. This includes designing methods to mitigate risks from weaponization, malicious misuse, and systemic vulnerabilities.
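
To illustrate the contrast drawn in point 2, here is a minimal sketch, with invented predictions and a simple equal-width binning scheme (an assumption, not the paper's implementation): RMS calibration error only compares confidence against observed accuracy within bins, whereas the Brier score also rewards raw accuracy, which is why it tends to track general capability.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    # Equal-width confidence bins; accumulate the weighted squared gap between
    # mean confidence and mean accuracy in each non-empty bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    err_sq = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = confidences[mask].mean() - correct[mask].mean()
            err_sq += mask.mean() * gap ** 2   # bin weight * squared gap
    return np.sqrt(err_sq)

def brier_score(confidences, correct):
    # Mean squared error between predicted confidence and the 0/1 outcome;
    # improves with accuracy even at fixed calibration.
    return np.mean((confidences - correct) ** 2)

# Made-up model outputs: predicted confidence and whether the answer was correct.
conf = np.array([0.9, 0.8, 0.7, 0.6, 0.95, 0.55])
hit  = np.array([1, 1, 0, 1, 1, 0], dtype=float)

print(rms_calibration_error(conf, hit), brier_score(conf, hit))
```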

Conclusion

The findings in "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?" are particularly interesting, and I would encourage those working in the AI safety and AI security space to read the full report. Benchmarks are not just neutral tools; they shape the direction of research and set the incentives for how safety is pursued. As Garraghan aptly notes, conflating safety with capabilities obscures the path toward genuinely secure AI systems.

By adopting rigorous, empirically validated benchmarks and focusing on challenges that persist as models scale, we can align the incentives of researchers, policymakers, and developers toward creating AI systems that are not only powerful but also safe and trustworthy.

Hero Photo by Kelly Mora from Unsplash