

Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Lillian Sun · Martin Pawelczyk · Zhenting Qi · Aounon Kumar · Himabindu Lakkaraju

Abstract
