Workshop: ES-FoMo: Efficient Systems for Foundation Models

Mental Calibration: Discovering and Adjusting for Latent Factors Improves Zero-Shot Inference of CLIP

Bang An · Sicheng Zhu · Michael-Andrei Panaitescu-Liess · Chaithanya Kumar Mummadi · Furong Huang


The CLIP model demonstrates remarkable zero-shot inference capability that can be understood by humans through natural language. However, interpreting this zero-shot inference process and designing suitable methods, including crafting text description templates, remains an open problem. In this paper, we develop an understanding of the zero-shot inference process of CLIP by explicitly considering the latent factors in the data generation process along with their corresponding text descriptions. Building on this, we first find that conditioning on the correct latent factors improves inference, meaning that CLIP can adjust for them. Then, we find that CLIP can infer latent factors from images, meaning it can discover them. With these two findings, we propose an inference method that automatically discovers and adjusts for latent factors as long as we provide CLIP with a comprehensive set of potential latent factors. We empirically verify that this inference method improves both generalization and interpretability of the zero-shot inference of CLIP.
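The discover-then-adjust idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify a scoring rule or prompt template, so the log-sum-exp score over classes, the `describe` template, the `temperature` value, and the `toy_embed_text` encoder (a stand-in for CLIP's text encoder) are all assumptions made for illustration.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def describe(label: str, factor: str) -> str:
    """Hypothetical template joining a class name and a latent factor."""
    return f"a photo of a {label}, {factor}"


def discover_factor(
    image_emb: Vector,
    labels: Sequence[str],
    factors: Sequence[str],
    embed_text: Callable[[str], Vector],
    temperature: float = 0.01,
) -> str:
    """Step 1 (discover): pick the factor whose descriptions best match the image.

    Assumption: each factor is scored by a log-sum-exp over all classes, so
    the choice of factor does not depend on knowing the class in advance.
    """
    best_factor, best_score = factors[0], -math.inf
    for factor in factors:
        logits = [
            cosine(image_emb, embed_text(describe(label, factor))) / temperature
            for label in labels
        ]
        peak = max(logits)
        score = peak + math.log(sum(math.exp(x - peak) for x in logits))
        if score > best_score:
            best_factor, best_score = factor, score
    return best_factor


def classify_adjusted(
    image_emb: Vector,
    labels: Sequence[str],
    factors: Sequence[str],
    embed_text: Callable[[str], Vector],
) -> Tuple[str, str]:
    """Step 2 (adjust): classify using descriptions conditioned on the factor."""
    factor = discover_factor(image_emb, labels, factors, embed_text)
    label = max(
        labels,
        key=lambda name: cosine(image_emb, embed_text(describe(name, factor))),
    )
    return label, factor


# Toy stand-in for a text encoder: sums one basis vector per known word.
TOY_VECTORS = {
    "cat": (1.0, 0.0, 0.0, 0.0),
    "dog": (0.0, 1.0, 0.0, 0.0),
    "upright": (0.0, 0.0, 1.0, 0.0),
    "upside-down": (0.0, 0.0, 0.0, 1.0),
}


def toy_embed_text(text: str) -> Vector:
    """Hypothetical encoder: adds the vectors of every known word in `text`."""
    total = [0.0, 0.0, 0.0, 0.0]
    for word, vec in TOY_VECTORS.items():
        if word in text:
            total = [t + v for t, v in zip(total, vec)]
    return total
```

With a real model, `embed_text` would wrap CLIP's text encoder and `image_emb` would come from its image encoder. Returning the inferred factor alongside the label is what makes the prediction inspectable: a caller can see which latent factor the classification was conditioned on.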
