Position: Scale is a False Promise for Endangered Languages
Abstract
As endangered languages disappear, machine learning (ML) increasingly frames their revitalization as a problem of scale, emphasizing more data, larger models, and broader coverage. We posit that scale is not the limiting constraint in endangered language revitalization. Evidence from Language Identification (LID), Optical Character Recognition (OCR), and synthetic data generation shows that benchmark-driven scaling produces brittle or culturally misaligned outcomes because evaluation and modeling lack epistemic fit. Progress instead requires methodological and evaluative reorientation: grounding evaluation in cultural fidelity, community trust, and situated use rather than abstract accuracy. The revitalization of endangered languages depends not on the universality of success but on the specificity of care afforded to each language and community.