Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First- and Third-Party Evaluations
Abstract
Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting, examining 186 first-party release reports and 248 third-party evaluation sources, supplemented by developer interviews. We find a stark division of labor: first-party reporting is sparse, often superficial, and declining in areas such as environmental impact and bias, while third-party evaluators provide broader, more rigorous coverage of bias, harmful content, and performance disparities. Yet only developers can authoritatively report on data provenance, content moderation labor, costs, and infrastructure, and our interviews reveal that these disclosures are deprioritized unless tied to product adoption or compliance. Current practices thus leave major gaps in assessing societal impacts, underscoring the need for policies that mandate developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure for aggregating third-party evaluations.