Poster
Stealing part of a production language model
Nicholas Carlini · Daniel Paleka · Krishnamurthy Dvijotham · Thomas Steinke · Jonathan Hayase · A. Feder Cooper · Katherine Lee · Matthew Jagielski · Milad Nasr · Arthur Conmy · Eric Wallace · David Rolnick · Florian Tramer
Hall C 4-9 #2308
Abstract:
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the GPT-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
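The hidden-dimension recovery rests on a rank argument: a transformer's final logits are a linear function of its hidden state (logits = W h(x), with W of shape vocab_size x hidden_dim), so logit vectors collected across many prompts span a subspace whose dimension equals the hidden dimension. Below is a minimal sketch of that idea, using a simulated model in place of the real API; the `query_logits` helper, the matrix sizes, and the singular-value gap heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 2048   # assumed vocabulary size for this simulation
hidden_dim = 512    # ground truth the "attacker" pretends not to know
num_queries = 1024  # must exceed hidden_dim to expose the rank

# Simulated model internals: the embedding projection matrix W.
W = rng.standard_normal((vocab_size, hidden_dim))

def query_logits(prompt_id: int) -> np.ndarray:
    """Stand-in for an API call returning the full logit vector
    for one prompt; the real attack queries a production model."""
    h = rng.standard_normal(hidden_dim)  # hidden state for this prompt
    return W @ h

# Stack logit vectors as columns: they all lie in the hidden_dim-
# dimensional column space of W inside R^vocab_size.
Q = np.stack([query_logits(i) for i in range(num_queries)], axis=1)

# Singular values drop sharply after index hidden_dim; the largest
# drop in log-magnitude estimates the numerical rank of Q, i.e. the
# model's hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
recovered_dim = int(np.argmin(np.diff(np.log(s + 1e-12)))) + 1
print(f"recovered hidden dimension: {recovered_dim}")  # prints 512
```

In practice the same statistic computed from real API logit vectors is what lets the attack confirm hidden dimensions such as 1024 for Ada and 2048 for Babbage; recovering the projection matrix itself additionally requires resolving the subspace up to the symmetries noted in the abstract.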