Fergal Glynn
AI safety is increasingly at the forefront of discussions about responsible technology development, with benchmarks becoming the primary tools for evaluating progress. Yet, the recently released paper, "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?", highlights troubling concerns about the validity and utility of these benchmarks. The paper's findings suggest that many benchmarks may not reflect genuine safety advancements but instead track improvements in general AI capabilities, a phenomenon the authors term "safetywashing."
This blog post explores the implications of this research for the AI, data, and security communities, providing a technical review of the paper and integrating insights from a LinkedIn post by Mindgard CEO Peter Garraghan.
As the paper outlines, benchmarks are pivotal in shaping AI research by defining metrics for success and guiding resource allocation. For AI safety, benchmarks ostensibly measure properties such as bias mitigation, adversarial robustness, and alignment with human preferences. However, the authors conducted a meta-analysis of dozens of benchmarks and discovered that many safety metrics correlate strongly with upstream model capabilities and training compute. This suggests that general improvements in AI performance, such as scaling model size or dataset quality, are being misconstrued as progress in safety.
This misrepresentation is problematic for two reasons. First, it creates a false sense of progress: a benchmark score that rises simply because models are becoming larger and more capable can be presented as evidence that they are becoming safer. Second, it obscures which interventions actually improve safety, making it harder to direct research effort and resources toward techniques whose benefits are not just by-products of scaling.
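To make the paper's correlation analysis concrete, here is a minimal sketch in Python using hypothetical models, benchmarks, and scores rather than the paper's actual data: collapse several capability benchmarks into a single capabilities score, then measure how strongly a candidate safety benchmark tracks that score across models.

```python
# A minimal sketch of the paper's core check: does a "safety" benchmark
# mostly track general capabilities? Benchmarks and scores below are
# hypothetical placeholders, not the paper's actual data.
import numpy as np
from scipy.stats import spearmanr

# Rows = models, columns = capability benchmarks (scores in [0, 1]).
capability_scores = np.array([
    [0.62, 0.55, 0.48],
    [0.71, 0.60, 0.57],
    [0.78, 0.69, 0.66],
    [0.85, 0.74, 0.72],
    [0.90, 0.81, 0.79],
])

# Scores of the same models on one candidate safety benchmark.
safety_scores = np.array([0.40, 0.47, 0.46, 0.58, 0.66])

# Summarize general capability as the first principal component of the
# standardized capability matrix (one common way to collapse many
# capability benchmarks into a single score).
standardized = (capability_scores - capability_scores.mean(axis=0)) / capability_scores.std(axis=0)
_, _, vt = np.linalg.svd(standardized, full_matrices=False)
capabilities_component = standardized @ vt[0]

# SVD leaves the sign of the component arbitrary; orient it so that higher
# values mean more capable models.
if np.corrcoef(capabilities_component, capability_scores.mean(axis=1))[0, 1] < 0:
    capabilities_component = -capabilities_component

# A high correlation suggests the "safety" benchmark is largely measuring
# capabilities; a low one suggests it captures something distinct.
rho, p_value = spearmanr(capabilities_component, safety_scores)
print(f"Spearman correlation with capabilities: {rho:.2f} (p = {p_value:.3f})")
```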
The paper provides empirical evidence of safetywashing across several dimensions:
Peter Garraghan’s LinkedIn post underscores the importance of distinguishing between safety and performance in AI systems, particularly for those working in security contexts.
Peter highlights three key observations:
One of the most intriguing opportunities in AI safety lies in identifying interventions whose safety or security improvements are disproportionately greater than their general capability gains. Developing a way to predict or measure such cases could offer immense value: a capability-adjusted metric would let practitioners map which interventions produce meaningful safety progress, clearly decoupled from upstream performance improvements, and help organizations prioritize resources toward areas where safety advances genuinely outpace capability gains. Making this distinction measurable would give the AI community a far clearer picture of which interventions truly improve safety outcomes.
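As a rough illustration of what such a capability-adjusted measure might look like, the sketch below (again in Python, with hypothetical numbers rather than anything from the paper) fits a simple linear baseline that predicts a model's safety score from its capability score, then treats the part of an intervention's safety gain that the baseline does not explain as its residual, capability-adjusted improvement.

```python
# Hedged sketch of one way to flag interventions whose safety gains exceed
# what their capability gains alone would predict. All numbers are
# hypothetical; this is an illustration, not the paper's method.
import numpy as np

# Per-model (capability_score, safety_score) pairs observed across a fleet
# of baseline models.
capability = np.array([0.50, 0.60, 0.70, 0.80, 0.90])
safety = np.array([0.42, 0.50, 0.57, 0.66, 0.73])

# Fit a simple linear baseline: expected safety score given capability alone.
slope, intercept = np.polyfit(capability, safety, deg=1)

def capability_adjusted_gain(cap_before, saf_before, cap_after, saf_after):
    """Safety gain left over after subtracting the gain predicted by the
    capability change under the linear baseline."""
    predicted_gain = slope * (cap_after - cap_before)
    return (saf_after - saf_before) - predicted_gain

# An intervention that barely moves capability but improves safety:
print(capability_adjusted_gain(0.70, 0.57, 0.71, 0.68))   # clearly positive
# An intervention whose safety gain merely tracks a capability jump:
print(capability_adjusted_gain(0.70, 0.57, 0.85, 0.69))   # close to zero
```

The design choice here is deliberately simple: any model of the safety-versus-capability relationship could replace the linear fit, but the residual framing is what separates gains that merely ride capability scaling from gains that stand on their own.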
To advance the science of AI safety, the paper proposes three actionable recommendations:
The findings in "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?" are particularly interesting, and I would encourage those working in the AI safety and AI security space to read the full report. Benchmarks are not just neutral tools; they shape the direction of research and set the incentives for how safety is pursued. As Garraghan aptly notes, conflating safety with capabilities obscures the path toward genuinely secure AI systems.
By adopting rigorous, empirically validated benchmarks and focusing on challenges that persist as models scale, we can align the incentives of researchers, policymakers, and developers toward creating AI systems that are not only powerful but also safe and trustworthy.