What is RedPajama?
By Michael Spencer
Hey Everyone,
I’m not a developer, but the open-source movement in LLMs is gaining real momentum in the spring of 2023.
From Meta AI’s LLaMA, to UC Berkeley’s 7B OpenLLaMA model (an open-source alternative to Meta’s LLaMA language model), to MosaicML’s MPT-7B, things are starting to get interesting.
OpenLLaMA has been trained on 200 billion tokens of the RedPajama dataset, and its weights are available in both PyTorch and JAX. With this release, projects that previously had to build on LLaMA’s non-commercial weights now have a permissively licensed base they can re-train on.
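For the developers among you, here is a minimal sketch of what it means in practice that the weights ship in PyTorch: loading OpenLLaMA through Hugging Face Transformers. The checkpoint id is an assumption on my part, so check the openlm-research organization on the Hub for whatever the project currently publishes.

```python
# Minimal sketch of loading OpenLLaMA with Hugging Face Transformers.
# The checkpoint id is an assumption; check the openlm-research org on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "openlm-research/open_llama_7b"  # assumed id for the 7B weights

# use_fast=False sidesteps early fast-tokenizer conversion quirks for LLaMA-style models
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # half precision: ~14 GB of weights, within reach of high-end consumer GPUs
    device_map="auto",          # requires the `accelerate` package
)

prompt = "RedPajama is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```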
On April 17th, 2023, something incredible happened, something Google researchers had feared. As the announcement put it: “RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.”
A few weeks later, the open-source movement of A.I. researchers collaborating in a globally distributed manner appears to have real momentum in 2023.
You could make the argument that this was A.I.’s Linux moment, a Cambrian explosion of sorts, circa May 4th, 2023, just three weeks after RedPajama came out.
How so? New projects are coming into being at an accelerated rate in the open-source community, democratizing A.I. perhaps better than Microsoft or Google alone could. Meta A.I., certainly, is at least a bit more friendly to this movement.
RedPajama is an effort to produce a reproducible, fully-open, leading language model. It is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. RedPajama has three key components (a short data-loading sketch follows this list):
Pre-training data, which needs to be both high quality and have broad coverage.
Base models, which are trained at scale on this data.
Instruction tuning data and models, which improve the base model to make it usable and safe.
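To make the first of those components concrete, here is a minimal sketch of pulling a slice of the pre-training data with the Hugging Face `datasets` library. The dataset id and field names are assumptions on my part; verify them against the project’s own documentation.

```python
# Minimal sketch: streaming a slice of the RedPajama pre-training data.
# Dataset id and field names are assumptions; confirm them on the Hub
# (newer `datasets` releases may also require trust_remote_code=True).
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",  # assumed id for the small sample variant
    split="train",
    streaming=True,  # avoids downloading the multi-terabyte corpus up front
)

for i, example in enumerate(ds):
    print(example.get("text", "")[:200])  # each record is assumed to carry raw text plus source metadata
    if i == 2:
        break
```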
Our starting point is LLaMA, which is the leading suite of open base models for two reasons: First, LLaMA was trained on a very large (1.2 trillion tokens) dataset that was carefully filtered for quality. Second, the 7 billion parameter LLaMA model is trained for much longer, well beyond the Chinchilla-optimal point, to ensure the best quality at that model size. A 7 billion parameter model is particularly valuable for the open community as it can run on a wide variety of GPUs, including many consumer-grade GPUs. However, LLaMA and all its derivatives (including Alpaca, Vicuna, and Koala) are only available for non-commercial research purposes. We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research.
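To put “well beyond the Chinchilla-optimal point” in perspective: the Chinchilla result is usually summarized as roughly 20 training tokens per parameter. The back-of-the-envelope sketch below is my own illustration rather than part of the announcement; it shows how far a 1.2 trillion token dataset overshoots that budget for a 7 billion parameter model, and why such a model fits on consumer GPUs at reduced precision.

```python
# Back-of-the-envelope arithmetic (illustrative; ~20 tokens per parameter is
# the commonly cited rule of thumb from the Chinchilla paper).
params = 7e9                      # 7 billion parameters
chinchilla_tokens = 20 * params   # ~140B tokens would be roughly compute-optimal
dataset_tokens = 1.2e12           # the ~1.2T-token dataset the announcement cites

print(f"Chinchilla-optimal budget: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"RedPajama dataset size:    {dataset_tokens / 1e9:.0f}B tokens "
      f"(~{dataset_tokens / chinchilla_tokens:.0f}x the optimal point)")

# Rough memory footprint of the weights alone at common inference precisions:
for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{label}: ~{params * bytes_per_param / 1e9:.1f} GB of weight memory")
```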
A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.