

Poster in Workshop: Challenges in Deployable Generative AI

Surely You’re Lying, Mr. Model: Improving and Analyzing CCS

Naomi Bashkansky · Chloe Loughridge · Chuyue Tang

Keywords: [ model internals ] [ GPT-J ] [ transformers ] [ deception ] [ Safety ] [ large language models ] [ trust ]


Abstract:

Contrast-Consistent Search (CCS; Burns et al., 2022) is a method for eliciting latent knowledge without supervision. In this paper, we explore several directions for improving CCS. We use conjunctive logic to make CCS fully unsupervised. We investigate which factors contribute to CCS’s poor performance on autoregressive models. Replicating Belrose & Mallen (2023), we improve CCS’s performance on autoregressive models and study the effect of multi-shot context. Finally, replicating Halawi et al. (2023), we add early-exit baselines to the original CCS experiments to better characterize where CCS techniques add value.
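
For readers unfamiliar with CCS, the following is a minimal sketch of the unsupervised objective as described in Burns et al. (2022), not the code used in this paper. It scores a linear probe on hidden states from contrast pairs (a statement completed with "yes" and with "no"), rewarding answers that are consistent (the two probabilities sum to one) and confident. The names `probe`, `ccs_loss`, `pos_states`, and `neg_states` are illustrative assumptions, and the per-class normalization of hidden states used in the original method is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe(theta, bias, hidden):
    """Linear probe followed by a sigmoid: p = sigmoid(theta . hidden + bias)."""
    return sigmoid(sum(t * h for t, h in zip(theta, hidden)) + bias)


def ccs_loss(theta, bias, pos_states, neg_states):
    """Mean CCS objective over contrast pairs (hypothetical helper).

    pos_states[i] and neg_states[i] are the hidden states for the
    "yes" and "no" completions of the same statement.
    """
    total = 0.0
    for h_pos, h_neg in zip(pos_states, neg_states):
        p_pos = probe(theta, bias, h_pos)
        p_neg = probe(theta, bias, h_neg)
        # Consistency: p(x+) should equal 1 - p(x-).
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_states)
```

In this sketch, an all-zero probe outputs 0.5 for every input, so it is perfectly consistent but pays the confidence penalty; a probe that separates the two completions drives both terms toward zero. No labels are used at any point, which is why the method counts as unsupervised apart from the final choice of which direction means "true".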
