

Poster

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen ⋅ Victor May ⋅ Harsh Raj ⋅ Marianna Nezhurina ⋅ Yishan Wang ⋅ Yanqi Luo ⋅ Vu Chien ⋅ Taishi Nakamura ⋅ Ken Tsui ⋅ Van Nguyen ⋅ David Salinas ⋅ Aleksandra Krasnodębska ⋅ Christoph Schuhmann ⋅ Mats Richter ⋅ Xuan-Son Vu ⋅ Jenia Jitsev

Abstract
