MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Huu Nguyen ⋅ Victor May ⋅ Harsh Raj ⋅ Marianna Nezhurina ⋅ Yishan Wang ⋅ Yanqi Luo ⋅ Vu Chien ⋅ Taishi Nakamura ⋅ Ken Tsui ⋅ Van Nguyen ⋅ David Salinas ⋅ Aleksandra Krasnodębska ⋅ Christoph Schuhmann ⋅ Mats Richter ⋅ Xuan-Son Vu ⋅ Jenia Jitsev
Abstract
We present MixtureVitae, an open‑access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive‑first, risk‑mitigated sourcing strategy that combines public‑domain and permissively licensed text (e.g., CC‑BY/Apache) with carefully justified low‑risk additions (e.g., government works and EU TDM‑eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals that are typically introduced during post-training and are generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open‑sci‑ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M–1.7B parameters), models trained on MixtureVitae consistently outperform those trained on other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameter/300B-token setting, they match FineWeb‑Edu and approach DCLM, demonstrating that the large fraction of reasoning and instruction data does not come at the cost of general-purpose language understanding. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens outperforms models trained on all of the strong non-permissive reference datasets and matches or exceeds SmolLM2-Instruct, a strong 1.7B instruction‑tuned baseline, on GSM8K, HumanEval, and MBPP, despite using over 36$\times$ fewer tokens (300B vs. $\approx$11T).
Supported by a thorough decontamination analysis, these results show that permissive‑first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. The dataset, source code for reproducing the experiments, and pretrained models are available at https://github.com/ontocord/mixturevitae.