Poster
in
Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
Open Artificial Knowledge
Vadim Borisov · Richard Schreiber
The success of chat-based systems like ChatGPT is largely due to large language models (LLMs), which require extensive training data. The performance of these models depends heavily on the volume and quality of that data. However, acquiring large, high-quality datasets is challenging due to cost and to constraints related to privacy, data diversity, and ethical considerations.

To mitigate these issues, we present with this preliminary work the Open Artificial Knowledge (OAK) dataset, a freely available collection comprising a vast amount of high-quality text generated by cutting-edge open-source LLMs. The OAK dataset was developed using Wikipedia's main categories for topic generation, with responses produced by state-of-the-art models such as GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, and Gemma-7B, ensuring broad knowledge coverage across domains.

Beyond addressing data scarcity, the OAK dataset also focuses on privacy and diversity in training material. The dataset currently includes 100 million tokens, with future expansions planned through ongoing work and community contributions. OAK will be available under the Apache License at [website] and on HuggingFace.
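The abstract's generation recipe (seed topics from Wikipedia's main categories, prompts expanded from each topic, responses produced by open LLMs) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the seed list, the prompt template, and the `generate` stub are all assumptions; a real implementation would call one of the named models (e.g. via an inference API or the `transformers` library).

```python
# Hedged sketch of an OAK-style synthetic-data pipeline.
# Everything below (seed categories, prompt wording, `generate`) is
# illustrative and assumed, not taken from the authors' code.

SEED_CATEGORIES = ["Science", "History", "Technology"]  # assumed subset of Wikipedia's main categories


def build_prompt(topic: str) -> str:
    """Expand a topic into an instruction prompt (template is assumed)."""
    return f"Write a detailed, factual article about: {topic}"


def generate(prompt: str) -> str:
    """Placeholder for a call to an open-source LLM.

    A real pipeline would route prompts to models such as LLaMa3-70B,
    Mixtral-8x7B, or Gemma-7B and collect their completions.
    """
    return f"[model output for: {prompt}]"


def build_dataset(categories: list[str]) -> list[dict]:
    """One record per topic: the topic, its prompt, and the generated text."""
    return [
        {"topic": t, "prompt": build_prompt(t), "text": generate(build_prompt(t))}
        for t in categories
    ]


dataset = build_dataset(SEED_CATEGORIES)
```

In a full-scale run, each topic would fan out into many prompts and be sent to several models, which is one way the abstract's stated breadth of domain coverage could be achieved.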