I agree with the summary. When I first wanted to tackle a hard problem, I thought to reach for fine-tuning with lots of input/output pairs, but it wasn't needed.
Past few-shot and RAG, you can overcome context window limits if you find ways to break a single request into many, each with its own specific context, and then roll the results up somehow.
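The break-up-and-roll-up idea can be sketched as a simple map-reduce over chunks. This is just an illustration of the pattern; `call_llm` is a hypothetical placeholder for whatever API you actually use, and the chunk size is arbitrary.

```python
# Sketch: split one oversized request into many small ones, then roll up.
# `call_llm` is a hypothetical stand-in for a real API call.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with an actual API call (OpenAI, Anthropic, ...).
    return f"summary of {len(prompt)} chars"

def chunk(text: str, size: int = 4000) -> list[str]:
    # Naive fixed-size chunking; real code would split on section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def rollup_summarize(document: str) -> str:
    # Map: summarize each chunk with only its own context.
    partials = [call_llm(f"Summarize this section:\n\n{c}") for c in chunk(document)]
    # Reduce: roll the partial answers up into one final request.
    joined = "\n".join(partials)
    return call_llm(f"Combine these section summaries into one summary:\n\n{joined}")
```

Each sub-request stays well under the context limit, and only the short partial results flow into the final call.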
Claude 2 has a large context window, but if you are actually using that much space on prompt examples to cover tricky edge cases, I've found it's better to break things down into multiple steps.
And if you can break things up that way, and cost isn't an issue, GPT-4 with lots of few-shot examples and chain of thought seems to give me the best results.
Or at least, that's what I found writing a code translator for a language the LLM didn't know. I wrote it up in more detail here: https://earthly.dev/blog/build-transpose/
In my experience, Claude 2 is marvellously good at sucking in massive documents and accurately generating responses to questions about them. I gave it the entire Georgia indictment (minus a few irrelevant pages at the start, to reduce token count) and it wrote an article based on the indictment that compared favourably to the actual NYTimes piece summarizing the same. There were no factual errors in its output.
I imagine OpenAI is not far behind in expanding the context window of its models. The LLM companies have access to the same techniques and - in my estimation - are just choosing to focus on one aspect or another to address different market needs. For instance, Claude 2 clearly focuses on maximum context window size at the cost of speedy inference and, presumably, inference cost. By contrast, OpenAI seems to be focused on speed and low cost (GPT-3.5) and accuracy (GPT-4) rather than maximum token length.
I am just using GPT-4 8k and breaking tasks up. When I tested with Claude 1 (2 wasn't out yet), you could feed it lots of context, but it didn't always seem to pay attention to it all. It would go off the rails.
For example, if I put instructions first and then a lot of context, it would forget the task. So instead I put a small explanation, then the context, then the question, but it would still sometimes lose the plot.
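The ordering that worked better for me can be shown as a simple prompt template: a short task framing first, the long context in the middle, and the actual question at the end so the model doesn't lose it. All the strings here are illustrative, not any framework's API.

```python
# Sketch of an ordering that keeps the task visible: brief framing,
# then the long context, then the question last.

def build_prompt(context: str, question: str) -> str:
    return (
        "You will be given a document, then asked a question about it.\n\n"
        f"Document:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer based only on the document above."
    )
```

Restating the task after the context gives the model a second anchor, which in my experience helps with very long inputs.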
For feedback on prose, though, I found Claude very strong. It's nice to be able to give it half of a book in text format and have it guide you to important parts and summarize sections. Some things, like "what are the themes of this work?" or "what's a surprising finding?", don't work well with vector DBs.
Interesting, thanks for sharing. We also found[1] that context stuffing leads (generally) to worse quality results, as did a recent study.[2] But that's for question-answering use cases and not summarization.
Great work! Matches my findings but much more rigorous.
The tasks I'm doing might be uncommon. Here is a build script in some language you (the LLM) understand, and I want it translated to a language you don't. So the context is example conversions and documentation.
I've also had some luck with things that boil down to "Here is a very large style guide. Now how would you improve this code?", or "Here are a number of examples of feedback on writing to conform to a style. Now generate the same for this new input."
I found the large context windows and Claude to work quite well in those examples. But if it's possible, breaking things down into multiple steps with less context and using GPT-4 works even better (though it's more work).
Not OP, but I finally got access to GPT-4 32k, and I thought that for a task as simple as a summary it would be better than the smaller context-window model. But then I realized that while I now have (practically) infinite context in the input, what the LLM outputs is always between 1-1.5k tokens. That is, I think, because of the samples in its training, not because it can't produce a lengthier output.
In short, I think that multiple steps are better, for example in summarization.
I think it is more nuanced. This article, for example, contains results that suggest otherwise if you want to increase quality (which is a major concern when putting things in production):
We've got some additional resources for folks looking to better understand Retrieval Augmented Generation (RAG) and even see it in action - in this example we demonstrate a potentially very dangerous hallucination (that has to do with driving) and how to fix it using RAG: https://www.pinecone.io/learn/retrieval-augmented-generation...
If you're curious to actually try out the difference between an LLM without domain-specific context and an LLM that is using RAG, you can try our live demo here: https://pinecone-vercel-starter.vercel.app/
And if you'd like to fork and make your own tweaks to the above demo ^ chatbot, in order to, for example, swap in your own company logo and extend it for your purposes, you can find our Vercel template here: https://github.com/pinecone-io/pinecone-vercel-starter
In our opinion, RAG is indeed an effective technique partly because you don't need to be a machine learning expert in order to implement it in your Generative AI applications.
Completely agree. If you are in the hacker filter bubble you may get the impression that fine-tuning is super important and powerful. But in reality for most use cases it offers little advantage for _a lot_ of effort.
The future is likely > 90% of developers relying on the best frontier models and using the context to specialise, and 10% of specialised developers who have the expertise, budget, and time, customising LLMs for very specific use cases where there is no other option.
I tried fine-tuning the 13b LLaMA model to insert the knowledge from my own documents, but my experiments weren't successful. My conclusion is that you need billions of tokens to make an LLM reason based on your own dataset. And even if it did acquire those reasoning skills, it probably wouldn't beat GPT-4. And we are not even getting into the costs of self-hosting these LLMs. So why bother? Just use an API from these companies with powerful models and tweak it to deal with your own needs.
@rafaelero I'm working on a blogpost (https://colinharman.substack.com/) to demonstrate this fact since I get a lot of tiresome questions like "why don't you just train instead of retrieving"
Do you have any scripts you could share for the training/eval process? Would love to credit you in the post
This is the approach I am using right now. My intuition for trying to fine-tune was that for complex questions it would be better if the model could naturally deal with those intricacies instead of reading documents with concepts that are connected, but not very explicitly so. There is also the problem of the context window limit; sometimes I have to truncate the relevant documents, limiting the model's capacity to offer a good answer.
But I am very impressed with GPT-4's abilities in finding the correct answer just by reading those documents, so I think the only problem still enduring is the context window size.
> My intuition for trying to fine-tune was that for complex questions it would be better if the model could naturally deal with those intricacies instead of reading documents with concepts that are connected, but not very explicitly so.
That was my impression as well, which is why your comment was so interesting to me. Have you found tools/projects for the DAG approach that you'd recommend?
The need to train/tune a model, in this case LLMs, is assumed to rely on the requirement for grounding and running on the edge or offline. This need will vary by use case.
With log file analysis as an example, training a model may increase its ability to deal with outliers, for instance by writing regexes that are placed in the indexing pipeline. In this use, tuning a prompt isn't going to help much, given that the foundation model might have no idea how to parse a given field in a log line no matter how you put it to it in the prompt.
Tuning models also serves other purposes, such as removing guardrails introduced in the training data by others, and customizing the self referenced material the model "knows" about, such as its name, creators and the "personality" presented to the end user.
RAG is fundamentally different and there will always be a place for it.
A model's weights are inherently lossy and opaque. If a model asserts some fact, it is impossible to tell whether that fact was true or hallucinated just from the model, because the model has no notion of "fact" or "truth", it's just probabilities.
Generating an answer solely from model weights is like asking a random person to answer a question from memory. Sure, you might get the right answer, but there's no guarantee.
Using RAG is like handing them a book and asking them what the book says the answer is. With the benefit that LLMs can "read" much faster than a human.
My take is that LLMs are actually much better at "reading" than they are at "writing", and RAG plays to that strength.
Fine-tuning is coming soon for GPT-3.5-turbo and GPT-4 from both OpenAI and Azure. Still, I don't think many users will need it.
Fine-tuning is not a solution for getting fresh data as you would with RAG, unless you are planning to run your entire fine-tuning suite for every new document. It can help improve accuracy a bit when you need to specialise the model for a very specific domain or modality. In practice, this is rare and unnecessary for most use cases building on LLMs.
The counterargument to your assertion (my own opinion, not Microsoft's) is that the reason you hear so much about fine-tuning from everyone other than OpenAI / MS is that they offer less capable models that can't reliably produce the same quality of results without fine-tuning.
RAG is the ONLY way to make sure your models are keeping true to facts and source material. Fine-tuning a model before using RAG helps with shaping the style of the summary and gravitating towards the more important facts presented.
RAG doesn't give you this -- it gives you a higher probability that you're keeping true to facts and source material, but the model may still give you hallucinated responses.
There are other methods than RAG over vector database; you can use basic TF-IDF or even full-text search to find candidate paragraphs and put them into the context.
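A minimal TF-IDF retriever needs nothing beyond the standard library: score the candidate paragraphs against the query terms and stuff the top ones into the prompt. This is a sketch of the basic idea, not production search code (no stemming, no normalization).

```python
import math
from collections import Counter

# Minimal TF-IDF retriever sketch (no vector DB): rank candidate
# paragraphs against the query, then put the top k into the context.

def tfidf_top_k(paragraphs: list[str], query: str, k: int = 3) -> list[str]:
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    # Document frequency of each term; rarer terms get a higher idf weight.
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def score(doc: list[str]) -> float:
        tf = Counter(doc)
        return sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split())

    ranked = sorted(paragraphs, key=lambda p: score(p.lower().split()), reverse=True)
    return ranked[:k]
```

For many corpora this is a surprisingly strong baseline, and it sidesteps the embedding and vector-database machinery entirely.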
Of course. “Retrieval” in RAG doesn’t require a special kind of retriever. As long as the relevance is tuned for the top documents to seed the prompt context, it doesn’t matter what kind of search backend you use.
Agreed, though the R in RAG typically means vector search specifically. Not sure what this was called before vector DBs were popularized by LlamaIndex; probably just "retrieval systems with LLMs".
How does RAG not meet your expectations? When I use chatgpt I provide source material for my questions and typically get much higher quality responses.
We're using LLMs (OpenAI) to generate SQL queries to search customer data, and the current approach using chat API frequently generates queries using the wrong record/column names. I'm exploring use of fine tuning to improve accuracy on a customer/customer basis to train on their set of data, isn't that a good use case?
The question is how successful you'll be. Generally, fine-tuning means you're going to drop $10-500k on data labeling and compute costs, plus one science type for six months.
This means you're easily looking at a million-dollar project in order to be successful. And even once you're done, the odds of success are mixed, and Claude 3 may beat your fine-tuned model. These economics aren't hard for research shops, but startups are going to struggle with this approach.
I think this is something that sample biasing would work better for, which you could do with local LLMs. For example, with ad-llama[1] you would just use a sampler bias like so:
As the other commenter said, fine-tuning is a costly and time-consuming affair. You can use RAG and have a separate namespace[1] for each of your customers in the vector database, so that it only searches through a specified customer's data for relevant context.
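The per-customer namespace idea, sketched here with a plain in-memory store rather than a real vector database: each customer's vectors live under their own key, and a query only ever searches that customer's slice. The store layout and function names are illustrative, not any particular vendor's API.

```python
import math

# In-memory sketch of namespaced vector search: one slice per customer.
# namespace -> list of (embedding, text) pairs
store: dict[str, list[tuple[list[float], str]]] = {}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def upsert(namespace: str, vector: list[float], text: str) -> None:
    store.setdefault(namespace, []).append((vector, text))

def query(namespace: str, vector: list[float], top_k: int = 3) -> list[str]:
    # Only this customer's vectors are ever considered.
    candidates = store.get(namespace, [])
    ranked = sorted(candidates, key=lambda v: cosine(v[0], vector), reverse=True)
    return [text for _, text in ranked[:top_k]]
```

Hosted vector databases expose the same isolation as a first-class feature, so one index can safely serve many customers.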
I don't think so. You don't have enough data for fine-tuning. How would you even fine-tune? If you have fewer than several thousand examples, I wouldn't even think about it.
One of the main problems with LLMs today is that they are gigantic, and this is because they ship the entire compressed memory of their training data with them.
Future LLMs are likely to be much smaller, with outside long-term memory/training knowledge as well as working memory (à la the RAG approach).
But in compressing all that data, they seem to build up internal representations that let them generalize to data they haven't seen. That reasoning ability, if not the actual data, is what gives them their power.
- After seeing how merely quantizing a model can make it go berserk, I have very little confidence that I can fine-tune an LLM and expect similar performance on benchmarks.
- A RAG-empowered LLM can tell you where the knowledge used to answer a question came from.
Fine-tuning should never be the first step; it's slow, expensive, and indeterminate. Until you are maxing out that context window, you can just keep layering more information into the prompt.
I wonder if the approach may change a bit when OpenAI releases fine tuning for the chat models. I think it depends on how well it works. If they find some way to significantly decrease the amount of training data needed or someone creates a tool to easily generate lots of training examples (using OpenAI), the advice might change again.
What also matters is the size of the context window and how effectively the models can follow large amounts of instructions. So new models might change the advice again.
I love how clearly this article is written. The author uses a table with the columns "Initial Motivation for Fine-tuning" and "Why a Base LLM is Sufficient" - exactly what you need to learn why "You probably don't need to fine-tune LLMs", which is precisely the title. Using text to convey something that can be expressed as a table or a chart is just as bad as trying to do math without math notation. Stellar work!
Is there a method for this to be "Augmented" and not "Replacement"? E.g. in the example from the blog post, "retriever=vectorstore.as_retriever()", which I believe would return something like "I don't know" if the content is not in the vectorstore.
In humans, a person might say something like, "I'm not an expert, but X" and I think being able to default back to the underlying LLM would be useful.
I assume those frameworks have a way to specify a cutoff for similarity and a default. And they are flexible in terms of the granularity at which you use them, so you should be able to specify the prompt directly, in theory.
But using the OpenAI API directly is not that complicated. And neither is using something like pg_vector or just a plain cosine similarity calculation from Stack Overflow.
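The "augmented, not replacement" behavior can be had with a plain similarity cutoff and a fallback: if no retrieved chunk clears the threshold, build a prompt that lets the base model answer from general knowledge instead of refusing. The threshold value and prompt strings here are illustrative assumptions, not any framework's defaults.

```python
# Sketch: construct the prompt from retrieval hits, falling back to the
# base model's own knowledge when nothing relevant was retrieved.
# MIN_SIMILARITY is an arbitrary illustrative cutoff.

MIN_SIMILARITY = 0.75

def build_rag_prompt(question: str, hits: list[tuple[float, str]]) -> str:
    # hits: (similarity score, chunk text) pairs from whatever retriever you use
    relevant = [text for score, text in hits if score >= MIN_SIMILARITY]
    if relevant:
        context = "\n".join(relevant)
        return (
            f"Answer using only these documents:\n{context}\n\n"
            f"Question: {question}"
        )
    # Fallback: no good match, so ask the model to answer from general
    # knowledge while flagging that no source was found.
    return (
        "No relevant source documents were found. Answer from general "
        f"knowledge, and say so.\n\nQuestion: {question}"
    )
```

That gives you the human-style "I'm not an expert, but X" behavior: grounded answers when the store has the content, and a clearly flagged best-effort answer when it doesn't.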
For most use cases this article is right on target.
I have been self hosting a 16K context size model and there is a lot you can do with 16K or larger context.
There are also great use cases for fine tuning. For example, if you are writing a chatbot for your company’s products, it might make sense to fine tune on product data and then RAG it with specific customer data when setting up a chat session.
Fine-tuning is such a dangerous phrase bc it sounds perfect.
“We’ll just fine-tune based on our (we think valuable + special) data”
Fine-tuning doesn’t enhance the model w/ new “knowledge”; it trains it on a new, narrowly-defined task.
One other “cost” to consider: fine-tuning a 3rd-party model means that if that foundation model changes or goes away, the effort/cost needs to be repeated.
It is worth noting that all the current models offered by the OpenAI API are already fine-tuned, with supervised learning and reinforcement learning, to follow instructions and to follow them in a certain way.
OpenAI removed the base GPT-3.5 model a while ago and never made it available for GPT-4.
What are people even doing with fine tuned LLMs? I can never think of something that it can't do natively or that I have enough data for to be able to fine tune a task for. Just curious
Besides the three mentioned in the article (stringent accuracy requirements, fast edge inference, involved style transfer task) are there other good reasons to fine-tune?
I fine tune smaller LLMs with fanfic to generate stories in a genre. For instance, most models have no meaningful Minecraft content. By feeding lots of fanfic into a Lora and supplementing with RAG, I get pretty good results. RAG alone is garbage as there’s too much broad context on say Herobrine spanning many many stories, but the base models know almost nothing. Few shot doesn’t help because there’s not enough semantic support in the model weights. Etc.
That's really interesting! I've wanted to do something similar but assumed that the token count required to ingrain a meaningful amount of information would make it very difficult. How many tokens do you typically need to feed into these base models to get consistent content out? Does generated text ever end up leaving the specialized domain? (like League of Legends characters mixed into a Minecraft story)
I’ve not done anything particularly scientific to measure but my observation is if symbols are fairly rare it doesn’t require a huge amount. Herobrine for instance as a string isn’t very common, so it seems to be pretty good at keeping in the semantic region. Steve and Alex are popular Minecraft characters but are not very unique so you see more drift. But the more you mix in specific semantics, like creepers, or other tokens that are pretty specific to the genre, things improve. You see similar behavior with stable diffusion Lora.
I’ve never seen it wander between genres, but it can tell pretty generic stories, particularly if you’re pretty generic in your prompting.
How would guardrails be applied through fine-tuning?
I haven't heard of that, so genuinely curious. What I'm familiar with is guardrails applied on top of the LLM. For instance through prompting, managing the available data inside the vector database that could be used for context (if using RAG), or through something like NeMo[1].
The technique is called reinforcement learning from human feedback (RLHF). It's how, for example, Facebook can train Llama models to meet a certain safety score before releasing them into the wild.
For example, by using full fine-tuning it's possible to censor a LLaMA model.
Then people use the same dataset, but filter out the guardrails, to create an uncensored variant of that model.
Just google "llama wizard vicuna uncensored".