Membership Inference Attacks for Unseen Classes
Abstract
A key tool for developing safe AI models is data auditing: using statistical tests to determine whether harmful content may have been used in the training data of a black-box model. Unfortunately, most membership inference attacks (MIAs) used for this type of auditing themselves assume access to examples of harmful content drawn from the same distribution as the query data. In real-world auditing scenarios, auditors often face legal and ethical restrictions that prevent them from accessing a representative set of harmful samples with which to train these attacks. We abstract and formalize this setting into a new data access model, the “unseen class” setting, and show that state-of-the-art MIAs fail when they lack access to the full target distribution. In this setting, quantile regression attacks outperform approaches typically considered state of the art. We demonstrate this both empirically and theoretically: quantile regression attacks achieve up to 11× the true positive rate (TPR) of shadow-model-based approaches in practice, and we provide a theoretical model that characterizes the generalization properties required for this approach to succeed. Our work identifies an important failure mode in existing MIAs and offers a cautionary tale for practitioners who aim to apply existing tools directly to real-world AI safety applications.
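To make the mechanism concrete, the following is a minimal, hypothetical sketch of a quantile-regression attack on synthetic data. The score function, the data-generating process, and all names below are illustrative assumptions, not the paper's implementation; the key property it demonstrates is that the attacker calibrates a per-example score threshold using only public non-member data, which is exactly what the unseen-class setting permits.

```python
# Illustrative sketch of a quantile-regression membership inference attack.
# Everything here (score function, synthetic data, names) is a hypothetical
# stand-in, not the paper's code.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def target_model_score(x):
    # Stand-in for the black-box target model's per-example score
    # (e.g., confidence or negative loss). Hypothetical: members, flagged
    # by the hidden last column, receive systematically higher scores.
    return x[:, :-1].sum(axis=1) * 0.1 + 2.0 * x[:, -1] + rng.normal(0, 0.5, len(x))

# Public NON-member data only: the attacker never needs member samples.
n, d = 5000, 8
X_pub = np.c_[rng.normal(size=(n, d)), np.zeros(n)]
s_pub = target_model_score(X_pub)

# Fit a conditional quantile model: for each example, predict the
# (1 - alpha)-quantile of the non-member score distribution.
alpha = 0.05  # target false-positive rate
q_model = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha)
q_model.fit(X_pub[:, :d], s_pub)

def predict_membership(x_query):
    """Flag a query as a member if its target-model score exceeds its
    per-example non-member quantile threshold (calibrated FPR ~ alpha)."""
    threshold = q_model.predict(x_query[:, :d])
    return target_model_score(x_query) > threshold

# Evaluate on held-out non-members and (synthetic) members.
X_non = np.c_[rng.normal(size=(1000, d)), np.zeros(1000)]
X_mem = np.c_[rng.normal(size=(1000, d)), np.ones(1000)]
print("FPR:", predict_membership(X_non).mean())  # should be close to alpha
print("TPR:", predict_membership(X_mem).mean())
```

On this toy distribution the realized false-positive rate tracks the chosen alpha, while members, whose scores are shifted upward, are flagged at a much higher rate; no shadow models trained on member data are involved.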