Poster in Workshop: Actionable Interpretability

Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations

Ailin Deng ⋅ Shaoliang Nie ⋅ Lijuan Liu ⋅ Xianjun Yang ⋅ Ujjwal Karn ⋅ Dat Huynh ⋅ Fulton Wang ⋅ Ying Xu ⋅ Madian Khabsa ⋅ Saghar Hosseini
2025 Poster in Workshop: Actionable Interpretability

Abstract
