Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks

Bruce C Xu ⋅ Jose James ⋅ Alexander Ryu

Abstract

Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) across six downstream architectures spanning an $11\times$ parameter range. The crossover predicted by theory appears at $n_g \approx 100$ on PCAM, $20$--$50$ on ISIC, and $250$--$500$ on NIH-CXR; above the crossover, adding weak labels degrades AUC by up to $0.10$. The crossover location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep ($2.5\times$ parameters, identical pretraining) confirms that the labeler, not the student, is the binding constraint. The calibration in turn yields a decision rule that is usable with only $10$--$20$ gold labels: compare gold-only AUC to VLM accuracy on the user's gold set, and add weak labels only while the former is below the latter. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
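
The decision rule stated in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (`auc`, `accuracy`, `use_weak_labels`) are hypothetical, the gold-only scores are assumed to come from held-out or cross-validated predictions, and treating an exact tie as "past the crossover" is an assumption based on the abstract's wording that weak labels stop helping once the gold-trained classifier matches the labeler.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative gold label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of gold labels that the labeler's hard predictions match."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and of equal length")
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def use_weak_labels(
    gold: Sequence[int],
    gold_only_scores: Sequence[float],
    vlm_preds: Sequence[int],
) -> bool:
    """Return True if the gold-only classifier is still below the VLM labeler.

    Assumption: gold_only_scores are held-out scores from a classifier trained
    on gold labels alone; vlm_preds are the VLM's hard labels on the same items.
    """
    return auc(gold, gold_only_scores) < accuracy(gold, vlm_preds)


if __name__ == "__main__":
    # Toy gold set of four items (hypothetical values, for illustration only).
    gold = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(use_weak_labels(gold, scores, vlm_preds=[1, 0, 1, 0]))
    print(use_weak_labels(gold, scores, vlm_preds=[1, 0, 1, 1]))
```

With only $10$--$20$ gold labels both estimates are noisy, so in practice one might prefer to compare them with some margin or confidence interval; the abstract does not specify one, and the sketch above uses a plain comparison.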