Poster in Workshop: Actionable Interpretability

Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations

Ailin Deng · Shaoliang Nie · Lijuan Liu · Xianjun Yang · Ujjwal Karn · Dat Huynh · Fulton Wang · Ying Xu · Madian Khabsa · Saghar Hosseini
2025

Abstract
