Poster in Workshop: Actionable Interpretability
Sat, Jul 19, 2025 • 10:40 AM – 11:40 AM PDT

Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations

Ailin Deng · Shaoliang Nie · Lijuan Liu · Xianjun Yang · Ujjwal Karn · Dat Huynh · Fulton Wang · Ying Xu · Madian Khabsa · Saghar Hosseini

Abstract
