Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors
Jesse Cummings · Jonas Mueller · Elías Snorrason

We present a straightforward statistical test to detect certain violations of the assumption that the data are Independent and Identically Distributed (IID). The specific form of violation considered is common across real-world applications: whether the examples are ordered in the dataset such that almost adjacent examples tend to have more similar feature values (e.g. due to distributional drift, or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can be used to audit any multivariate numeric data as well as other data types (image, text, audio, etc.) that can be numerically represented, perhaps via model embeddings. Compared with existing methods to detect drift or auto-correlation, our approach is both applicable to more types of data and also able to detect a wider variety of IID violations in practice.
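
To make the idea concrete, the following is a minimal sketch of one way such a check could look, not the authors' exact procedure. It assumes the data are a list of numeric feature vectors in dataset order, finds each example's k nearest neighbors in feature space by brute force, and asks whether those neighbors sit closer together in the dataset ordering than chance would suggest. The permutation test used for the p-value and the names `knn_pairs`, `mean_index_gap`, and `ordering_pvalue` are illustrative assumptions, not part of the abstract.

```python
import math
import random


def knn_pairs(X, k):
    """Return (i, j) pairs where j is one of the k nearest neighbors of i (brute force)."""
    n = len(X)
    pairs = []
    for i in range(n):
        by_dist = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        pairs.extend((i, j) for _, j in by_dist[:k])
    return pairs


def mean_index_gap(pairs, position):
    """Average distance in dataset order between each example and its neighbors."""
    return sum(abs(position[i] - position[j]) for i, j in pairs) / len(pairs)


def ordering_pvalue(X, k=5, n_perm=200, seed=0):
    """Permutation p-value (an assumed choice of test): small values suggest that
    feature-space neighbors are unusually close together in the dataset ordering."""
    rng = random.Random(seed)
    pairs = knn_pairs(X, k)
    position = list(range(len(X)))  # position[i] = place of example i in the ordering
    observed = mean_index_gap(pairs, position)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(position)  # null hypothesis: ordering carries no information
        if mean_index_gap(pairs, position) <= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)
```

For non-numeric data such as images or text, `X` would hold model embeddings rather than raw features. A small p-value from `ordering_pvalue` only indicates that order and feature similarity are related; it does not by itself distinguish drift from other dependence between nearby examples.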

Author Information

Jesse Cummings (MIT)
Jonas Mueller (Cleanlab)
Elías Snorrason (Cleanlab)
