Position: *Beyond Text*. The Text-Centric Bias in Foundation Models Must Be Revisited for a Speech-First Future
Abstract
This position paper argues that the machine learning community should prioritize speech-native architectures that treat audio as a first-class modality, anticipating the inevitable shift from text-dominated to speech-first data distributions. Text dominates human-computer interaction not because it is cognitively natural, but because decades of interface design conditioned users to express knowledge through keyboards and search boxes. Recent advances in speech recognition and multimodal foundation models have largely removed the technical barriers to voice-based interaction; what remains is primarily a problem of habit. As voice becomes habitual, the data ecosystem underlying machine learning will shift toward speech-native knowledge, with profound implications for model architecture, training efficiency, and evaluation paradigms. This paper examines the technical readiness of speech systems, identifies habit inertia as the primary adoption barrier, addresses alternative views that favor text-centric approaches, and outlines a research agenda for ML systems that anticipate speech-first data distributions.