Position: Stop Reactively Patching Your Model Every Time and Start Proactive Test-Driven AI Development
Abstract
Many modern AI systems are designed to operate across diverse, open-ended use cases. To help deployed systems generalize, developers rely on a reactive AI flywheel that observes emerging feedback from user behavior (errors) and patches the model accordingly. However, most flywheels ignore the broader context of these errors within the system's objectives and fail to preempt future edge cases, which leads to unnecessary flywheel iterations. Moreover, collecting the remaining errors becomes statistically harder over time due to the long-tail nature of open-world use cases (Boneh and Hofri, 1997). This position paper argues that a proactive, test-driven flywheel is required to address the reactive flywheel's limitations and to approach a generalizable system. We advocate for constructing a ``test space'' that systematically maps feedback data to task objectives, evolving the flywheel from reactive to proactive. We support our position by mathematically proving that a proactive flywheel achieves better long-term scaling with fewer iterations than a reactive one.