This looks amazing @rasbt! Out of curiosity, is your primary goal to cultivate understanding and demystify, or to encourage people to build their own small models tailored to their needs?
I'd say my primary motivation is an educational goal, i.e., helping people understand how LLMs work by building one. LLMs are an important topic, and there are lots of hand-wavy videos and articles out there -- I think if one codes an LLM from the ground up, it will clarify lots of concepts.
Now, the secondary goal is, of course, also to help people with building their own LLMs if they need to. The book will code the whole pipeline, including pretraining and finetuning, but I will also show how to load pretrained weights because I don't think pretraining an LLM is feasible from a financial perspective. We are coding everything from scratch in this book using a GPT-2-like LLM (so that we can load the weights for models ranging from the 124M version that runs on a laptop to the 1558M version that runs on a small GPU). In practice, you probably want to use a framework like HF transformers or axolotl, but I hope this from-scratch approach will demystify the process so that these frameworks are less of a black box.
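(For anyone curious about the model sizes mentioned above: the pretrained GPT-2 checkpoints are publicly available, and one quick way to inspect them is via the HF transformers library named in the comment. This is just an illustrative sketch using the public `GPT2LMHeadModel` API, not the book's own loading code; the book implements the architecture from scratch and maps the weights in manually.)

```python
from transformers import GPT2LMHeadModel

# Download the smallest GPT-2 checkpoint (the "124M" model referenced above).
# The larger variants are "gpt2-medium" (355M), "gpt2-large" (774M),
# and "gpt2-xl" (1558M).
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Count trainable parameters to verify this is the ~124M model.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # roughly 124M
```

Running this confirms the size on the label, and swapping in `"gpt2-xl"` gives the 1558M model that needs a proper GPU.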
While pretraining a decent-sized LLM from scratch is not financially feasible for the average person, it is very much feasible for the average YC/VC backed startup (ignoring the fact that it's almost always easier to just use something like Mixtral or LLaMa 2 and fine-tune as necessary).
>Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k
Thanks for such a thoughtful response. I'm building with LLMs, and do feel uncomfortable with my admittedly hand-wavy understanding of the underlying transformer architecture. I've ordered your book and look forward to following along!
Hi Rasbt, thanks for writing the new guide and the upcoming book on LLMs, another must-buy book from Manning.
Just wondering, are you going to include any specific section or chapter in your LLM book on RAG? I think it would be a very welcome addition for the build-your-own-LLM crowd.
Semi-related, as long as we're requesting things: to @pr337h4m's point above, it would be interesting to have some rough guidance (even a sidebar or single paragraph) on when it makes sense to pre-train a new foundation model vs finetune vs pass in extra context (RAG). Clients of all sizes—from Fortune 100 to small businesses—are asking us this question.
That's a good point. I may briefly mention RAG-like systems and add some literature references on this, but I am a bit hesitant to give general advice because it's heavily project-dependent in my opinion. It usually also comes down to what form the client's data is in and whether referencing a database or documentation is desired or not. The focus of chapters 6+7 is also instruction-finetuning and alignment rather than finetuning for knowledge. The latter goal is best achieved via pretraining (as opposed to finetuning) imho.
In any case, I just read this interesting case study last week on Finetuning vs RAG that might come in handy: "RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture" (https://arxiv.org/abs/2401.08406)