This looks amazing @rasbt! Out of curiosity, is your primary goal to cultivate understanding and demystify, or to encourage people to build their own small models tailored to their needs?
I'd say my primary motivation is an educational goal, i.e., helping people understand how LLMs work by building one. LLMs are an important topic, and there are lots of hand-wavy videos and articles out there -- I think if one codes an LLM from the ground up, it will clarify lots of concepts.
Now, the secondary goal is, of course, also to help people with building their own LLMs if they need to. The book will code the whole pipeline, including pretraining and finetuning, but I will also show how to load pretrained weights because I don't think pretraining an LLM is feasible from a financial perspective. We are coding everything from scratch in this book using a GPT-2-like LLM (so that we can load the weights for models ranging from the 124M version that runs on a laptop to the 1558M version that runs on a small GPU). In practice, you probably want to use a framework like HF transformers or axolotl, but I hope this from-scratch approach will demystify the process so that these frameworks are less of a black box.
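(For anyone curious about the model sizes mentioned above: the pretrained GPT-2 checkpoints are publicly available, and one quick way to inspect them is via the HF transformers library named in the comment. This is just an illustrative sketch using the public `GPT2LMHeadModel` API, not the book's own loading code; the book implements the architecture from scratch and maps the weights in manually.)

```python
from transformers import GPT2LMHeadModel

# Download the smallest GPT-2 checkpoint (the "124M" model referenced above).
# The larger variants are "gpt2-medium" (355M), "gpt2-large" (774M),
# and "gpt2-xl" (1558M).
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Count trainable parameters to verify this is the ~124M model.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # roughly 124M
```

Running this confirms the size on the label, and swapping in `"gpt2-xl"` gives the 1558M model that needs a proper GPU.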
While pretraining a decent-sized LLM from scratch is not financially feasible for the average person, it is very much feasible for the average YC/VC backed startup (ignoring the fact that it's almost always easier to just use something like Mixtral or LLaMa 2 and fine-tune as necessary).
>Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k
Thanks for such a thoughtful response. I'm building with LLMs, and do feel uncomfortable with my admittedly hand-wavy understanding of the underlying transformer architecture. I've ordered your book and look forward to following along!
Hi Rasbt, thanks for writing the new guide and the upcoming book on LLMs, another must-buy book from Manning.
Just wondering, are you going to include any specific section or chapter in your LLM book on RAG? I think it would be a very welcome addition for the build-your-own-LLM crowd.
Semi-related, as long as we're requesting things: to @pr337h4m's point above, it would be interesting to have some rough guidance (even a sidebar or single paragraph) on when it makes sense to pre-train a new foundation model vs finetune vs pass in extra context (RAG). Clients of all sizes—from Fortune 100 to small businesses—are asking us this question.
That's a good point. I may briefly mention RAG-like systems and add some literature references on this, but I am a bit hesitant to give general advice because it's heavily project-dependent in my opinion. It usually also comes down to what form the client's data is in and whether referencing a database or documentation is desired or not. The focus of chapters 6+7 is also instruction-finetuning and alignment rather than finetuning for knowledge. The latter goal is best achieved via pretraining (as opposed to finetuning) imho.
In any case, I just read this interesting case study last week on Finetuning vs RAG that might come in handy: "RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture" (https://arxiv.org/abs/2401.08406)