Barriers to Counterfactual Credit Attribution for Autoregressive Models
Abstract
Generative AI disrupts the practice of giving credit to work that came before. Ideally, a generative model would give credit to any work on which its output depends in a significant way. Counterfactual credit attribution (CCA) is a technical condition formalizing this goal---a relaxation of differential privacy---recently introduced by Livni, Moran, Nissim, and Pabbaraju (2024), who studied it in the PAC learning setting. We initiate the study of CCA for generative models. Specifically, we consider autoregressive models that give credit to a deployment-time dataset (e.g., a RAG database). We uncover barriers to two natural approaches to building CCA autoregressive models. First, we show that imposing CCA on the underlying next-token predictor does not guarantee that the model as a whole is CCA: unlike differential privacy, CCA does not compose autoregressively. Second, we consider a different approach, which we call retrofitting: taking a model that does not attribute credit and adding credit attribution on top of it. We show that, given only black-box access to the starting model, retrofitting requires a number of queries exponential in the length of the model's outputs.