HuggingFaceTB

This is the home of synthetic datasets for pre-training, such as Cosmopedia v1 and v2. We scale synthetic data generation by curating diverse prompts that cover a wide range of topics, and by running generation efficiently on GPUs with tools like llm-swarm.
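As a rough illustration of the prompt-curation idea (not the actual Cosmopedia pipeline; the seed lists, model choice, and generation parameters below are placeholders), here is a minimal Python sketch that crosses a few seed topics with target audiences and styles to produce diverse prompts, then generates text from a hosted model via huggingface_hub's InferenceClient:

```python
from itertools import product
from huggingface_hub import InferenceClient

# Hypothetical seed lists; the real prompt curation behind Cosmopedia is far
# larger and is described in the blog post linked below.
TOPICS = ["photosynthesis", "the French Revolution", "binary search"]
AUDIENCES = ["young children", "college students", "professionals"]
STYLES = ["textbook chapter", "blog post", "story"]

# Any text-generation model hosted on the Hub works here.
client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")

def build_prompts():
    # Crossing topics with audiences and styles is one simple way to get
    # diverse prompts out of a small set of seeds.
    for topic, audience, style in product(TOPICS, AUDIENCES, STYLES):
        yield (f"Write a {style} about {topic} for {audience}. "
               f"Keep it factual and self-contained.")

if __name__ == "__main__":
    for prompt in build_prompts():
        text = client.text_generation(prompt, max_new_tokens=512)
        print(text[:200], "...")
```

Cosmopedia applies this idea at a much larger scale, with millions of curated prompts and generation parallelized across GPU clusters using llm-swarm.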

We released:

- Cosmopedia v1 and v2: large synthetic datasets of textbooks, blog posts, and stories for pre-training
- SmolLM: a family of small language models trained on this data

For more details, check out our blog posts: https://huggingface.co/blog/cosmopedia and https://huggingface.co/blog/smollm