I agree with the summary. When I first wanted to tackle a hard problem, I thought to reach for fine-tuning with lots of input/output pairs, but it wasn't needed.
Past few-shot and RAG, you can overcome context window limits if you find ways to break a single request into many, each with its own specific context, and then roll the results up somehow.
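The break-up-and-roll-up idea can be sketched as a simple map-reduce over chunks. This is just an illustration of the pattern; `call_llm` is a hypothetical placeholder for whatever API you actually use, and the chunk size is arbitrary.

```python
# Sketch: split one oversized request into many small ones, then roll up.
# `call_llm` is a hypothetical stand-in for a real API call.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with an actual API call (OpenAI, Anthropic, ...).
    return f"summary of {len(prompt)} chars"

def chunk(text: str, size: int = 4000) -> list[str]:
    # Naive fixed-size chunking; real code would split on section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def rollup_summarize(document: str) -> str:
    # Map: summarize each chunk with only its own context.
    partials = [call_llm(f"Summarize this section:\n\n{c}") for c in chunk(document)]
    # Reduce: roll the partial answers up into one final request.
    joined = "\n".join(partials)
    return call_llm(f"Combine these section summaries into one summary:\n\n{joined}")
```

Each sub-request stays well under the context limit, and only the short partial results flow into the final call.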
Claude 2 has a large context window, but if you are actually using that much space on prompt examples to cover tricky edge cases, I've found it's better to break things down into multiple steps.
And if you can break things up that way, and cost isn't an issue, GPT-4 with lots of few-shot examples and chain of thought seems to give me the best results.
Or at least, that's what I found writing a code translator for a language the LLM didn't know. I wrote it up in more detail here: https://earthly.dev/blog/build-transpose/
In my experience, Claude 2 is marvellously good at sucking in massive documents and accurately generating responses to questions about them. I gave it the entire Georgia indictment (minus a few irrelevant pages at the start, to reduce token count) and it wrote an article based on the indictment that compared favourably to the actual NYTimes piece summarizing the same. There were no factual errors in its output.
I imagine OpenAI is not far behind in expanding the context window of its models. The LLM companies have access to the same techniques and - in my estimation - are just choosing to focus on one aspect or another to address different market needs. For instance, Claude 2 clearly focuses on maximum context window size at the cost of speedy inference and, presumably, inference cost. By contrast, OpenAI seems to be focused on speed and low cost (GPT-3.5) and accuracy (GPT-4) rather than maximum token length.
I am just using GPT-4 8k and breaking tasks up. When I tested with Claude 1 (2 wasn't out yet), you could feed it lots of context, but it didn't always seem to pay attention to it all. It would go off the rails.
For example, if I put instructions first and then a lot of context, it would forget the task. So instead I put a small explanation, then the context, then the question, but it would still sometimes lose the plot.
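The ordering that worked better for me can be shown as a simple prompt template: a short task framing first, the long context in the middle, and the actual question at the end so the model doesn't lose it. All the strings here are illustrative, not any framework's API.

```python
# Sketch of an ordering that keeps the task visible: brief framing,
# then the long context, then the question last.

def build_prompt(context: str, question: str) -> str:
    return (
        "You will be given a document, then asked a question about it.\n\n"
        f"Document:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer based only on the document above."
    )
```

Restating the task after the context gives the model a second anchor, which in my experience helps with very long inputs.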
For feedback on prose, though, I found Claude very strong. It's nice to be able to give it half of a book in text format and have it guide you to important parts and summarize sections. Some things, like "what are the themes of this work?" or "what's a surprising finding?", don't work well with vector DBs.
Interesting, thanks for sharing. We also found[1] that context stuffing leads (generally) to worse quality results, as did a recent study.[2] But that's for question-answering use cases and not summarization.
Great work! Matches my findings but much more rigorous.
The tasks I'm doing might be uncommon. Here is a build script in some language you (the LLM) understand, and I want it translated to a language you don't. So the context is example conversions and documentation.
I've also had some luck with things that boil down to "Here is a very large style guide. Now how would you improve this code?", or "Here are a number of examples of feedback on writing to conform to a style. Now generate the same for this new input."
I found the large context windows and Claude to work quite well in those examples. But if it's possible, breaking things down into multiple steps with less context and using GPT-4 works even better (though it's more work).
Not OP, but I finally got access to GPT-4 32k, and I thought that for a task as simple as a summary it would be better than the smaller context-window model. But then I realized that while I now have (practically) infinite context in the input, what the LLM outputs is always between 1-1.5k tokens. That is, I think, because of the samples in its training, not because it can't produce a lengthier output.
In short, I think that multiple steps are better, for example in summarization.
I think it is more nuanced. This article, for example, contains results that suggest otherwise if you want to increase quality (which is a major concern when putting things in production):
We've got some additional resources for folks looking to better understand Retrieval Augmented Generation (RAG) and even see it in action - in this example we demonstrate a potentially very dangerous hallucination (that has to do with driving) and how to fix it using RAG: https://www.pinecone.io/learn/retrieval-augmented-generation...
If you're curious to actually try out the difference between an LLM without domain-specific context and an LLM that is using RAG, you can try our live demo here: https://pinecone-vercel-starter.vercel.app/
And if you'd like to fork and make your own tweaks to the above demo ^ chatbot, in order to, for example, swap in your own company logo and extend it for your purposes, you can find our Vercel template here: https://github.com/pinecone-io/pinecone-vercel-starter
In our opinion, RAG is indeed an effective technique partly because you don't need to be a machine learning expert in order to implement it in your Generative AI applications.
Completely agree. If you are in the hacker filter bubble you may get the impression that fine-tuning is super important and powerful. But in reality for most use cases it offers little advantage for _a lot_ of effort.
The future is likely > 90% of developers relying on the best frontier models and using the context to specialise, and 10% of specialised developers who have the expertise, budget, and time, customising LLMs for very specific use cases where there is no other option.
I tried fine-tuning the 13b LLaMA model to insert the knowledge from my own documents, but my experiments weren't successful. My conclusion is that you need billions of tokens to make an LLM reason based on your own dataset. And even if it did acquire those reasoning skills, it probably wouldn't beat GPT-4. And we are not even getting into the costs of self-hosting these LLMs. So why bother? Just use an API from these companies with powerful models and tweak it to deal with your own needs.
@rafaelero I'm working on a blogpost (https://colinharman.substack.com/) to demonstrate this fact since I get a lot of tiresome questions like "why don't you just train instead of retrieving"
Do you have any scripts you could share for the training/eval process? Would love to credit you in the post
This is the approach I am using right now. My intuition for trying to fine-tune was that for complex questions it would be better if the model could naturally deal with those intricacies instead of reading documents with concepts that are connected, but not very explicitly so. There is also the problem of the context window limit; sometimes I have to truncate the relevant documents, limiting the model's capacity to offer a good answer.
But I am very impressed with GPT-4's abilities in finding the correct answer just by reading those documents, so I think the only problem still enduring is the context window size.
> My intuition for trying to fine-tune was that for complex questions it would be better if the model could naturally deal with those intricacies instead of reading documents with concepts that are connected, but not very explicitly so.
That was my impression as well, which is why your comment was so interesting to me. Have you found tools/projects for the DAG approach that you'd recommend?
The need to train/tune a model, in this case LLMs, is assumed to rely on the requirement for grounding and running on the edge or offline. This need will vary by use case.
With log file analysis as an example, training a model may increase its ability to deal with outliers, for instance by writing regexes that are placed in the indexing pipeline. In this use, tuning a prompt isn't going to help much, given that the foundation model might have no idea how to parse a given field in a log line no matter how you put it to it in the prompt.
Tuning models also serves other purposes, such as removing guardrails introduced in the training data by others, and customizing the self referenced material the model "knows" about, such as its name, creators and the "personality" presented to the end user.
RAG is fundamentally different and there will always be a place for it.
A model's weights are inherently lossy and opaque. If a model asserts some fact, it is impossible to tell whether that fact was true or hallucinated just from the model, because the model has no notion of "fact" or "truth", it's just probabilities.
Generating an answer solely from model weights is like asking a random person to answer a question from memory. Sure, you might get the right answer, but there's no guarantee.
Using RAG is like handing them a book and asking them what the book says the answer is. With the benefit that LLMs can "read" much faster than a human.
My take is that LLMs are actually much better at "reading" than they are at "writing", and RAG plays to that strength.
Fine-tuning is coming soon for GPT-3.5-turbo and GPT-4 from both OpenAI and Azure. Still, I don't think many users will need it.
Fine-tuning is not a solution for getting fresh data as you would with RAG, unless you are planning to run your entire fine-tuning suite for every new document. It can help improve accuracy a bit when you need to specialise the model for a very specific domain or modality. In practice, this is rare and unnecessary for most use cases building on LLMs.
The counterargument to your assertion (my own opinion, not Microsoft's) is that the reason you hear so much about fine-tuning from everyone other than OpenAI / MS is that they offer less capable models that can't reliably produce the same quality of results without fine-tuning.
RAG is the ONLY way to make sure your models are keeping true to facts and source material. Fine-tuning a model before using RAG helps with shaping the style of the summary and gravitating towards the more important facts presented.
RAG doesn't give you this -- it gives you a higher probability that you're keeping true to facts and source material, but the model may still give you hallucinated responses.
There are other methods than RAG over vector database; you can use basic TF-IDF or even full-text search to find candidate paragraphs and put them into the context.
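A minimal TF-IDF retriever needs nothing beyond the standard library: score the candidate paragraphs against the query terms and stuff the top ones into the prompt. This is a sketch of the basic idea, not production search code (no stemming, no normalization).

```python
import math
from collections import Counter

# Minimal TF-IDF retriever sketch (no vector DB): rank candidate
# paragraphs against the query, then put the top k into the context.

def tfidf_top_k(paragraphs: list[str], query: str, k: int = 3) -> list[str]:
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    # Document frequency of each term; rarer terms get a higher idf weight.
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def score(doc: list[str]) -> float:
        tf = Counter(doc)
        return sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split())

    ranked = sorted(paragraphs, key=lambda p: score(p.lower().split()), reverse=True)
    return ranked[:k]
```

For many corpora this is a surprisingly strong baseline, and it sidesteps the embedding and vector-database machinery entirely.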
Of course. “Retrieval” in RAG doesn’t require a special kind of retriever. As long as the relevance is tuned for the top documents to seed the prompt context, it doesn’t matter what kind of search backend you use.
Agreed, though the R in RAG typically means vector search specifically. Not sure what this was called before vector DBs were popularized by LlamaIndex; probably just "retrieval systems with LLMs".
How does RAG not meet your expectations? When I use chatgpt I provide source material for my questions and typically get much higher quality responses.
We're using LLMs (OpenAI) to generate SQL queries to search customer data, and the current approach using chat API frequently generates queries using the wrong record/column names. I'm exploring use of fine tuning to improve accuracy on a customer/customer basis to train on their set of data, isn't that a good use case?
The question is how successful you'll be. Generally, fine-tuning means you're going to drop $10-500k on data labeling and compute costs, plus one science type for six months.
This means you're easily looking at a million-dollar project in order to be successful. And even once you're done, the odds of success are mixed, and Claude 3 may beat your fine-tuned model. These economics aren't hard for research shops, but startups are going to struggle with this approach.
I think this is something that sample biasing would work better for, which you could do with local LLMs. For example, with ad-llama[1] you would just use a sampler bias like so:
As the other commenter said, fine-tuning is a costly and time-consuming affair. You can use RAG and have a separate namespace[1] for each of your customers in the vector database, so that it only searches through a specified customer's data for relevant context.
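The per-customer namespace idea, sketched here with a plain in-memory store rather than a real vector database: each customer's vectors live under their own key, and a query only ever searches that customer's slice. The store layout and function names are illustrative, not any particular vendor's API.

```python
import math

# In-memory sketch of namespaced vector search: one slice per customer.
# namespace -> list of (embedding, text) pairs
store: dict[str, list[tuple[list[float], str]]] = {}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def upsert(namespace: str, vector: list[float], text: str) -> None:
    store.setdefault(namespace, []).append((vector, text))

def query(namespace: str, vector: list[float], top_k: int = 3) -> list[str]:
    # Only this customer's vectors are ever considered.
    candidates = store.get(namespace, [])
    ranked = sorted(candidates, key=lambda v: cosine(v[0], vector), reverse=True)
    return [text for _, text in ranked[:top_k]]
```

Hosted vector databases expose the same isolation as a first-class feature, so one index can safely serve many customers.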
I don't think so. You don't have enough data for fine-tuning. How would you even fine-tune? If you have fewer than several thousand examples, I wouldn't even think about it.
One of the main problems with LLMs today is that they are gigantic, and this is because they ship the entire compressed memory of their training data with them.
Future LLMs are likely to be much smaller, with outside long-term memory/training knowledge as well as working memory (à la the RAG approach).
But in compressing all that data, they seem to build up internal representations that let them generalize to data they haven't seen. That reasoning ability, if not the actual data, is what gives them their power.
- After seeing how merely quantizing a model can make it go berserk, I have very little confidence that I can fine-tune an LLM and expect similar performance on benchmarks.
- A RAG-empowered LLM can tell you where the knowledge used to answer a question came from.
Fine-tuning should never be the first step; it's slow, expensive, and indeterminate. Until you are maxing out that context window, you can just keep layering more information into the prompt.
I wonder if the approach may change a bit when OpenAI releases fine tuning for the chat models. I think it depends on how well it works. If they find some way to significantly decrease the amount of training data needed or someone creates a tool to easily generate lots of training examples (using OpenAI), the advice might change again.
What also matters is the size of the context window and how effectively the models can follow large amounts of instructions. So new models might change the advice again.
I love how clearly this article is written. The author uses a table with the columns "Initial Motivation for Fine-tuning" and "Why a Base LLM is Sufficient" - exactly what you need to learn why "You probably don't need to fine-tune LLMs", which is precisely the title. Using text to convey something that can be expressed as a table or a chart is just as bad as trying to do math without math notation. Stellar work!
Is there a method for this to be "Augmented" and not "Replacement"? E.g. in the example from the blog post, "retriever=vectorstore.as_retriever()", which I believe would return something like "I don't know" if the content is not in the vectorstore.
In humans, a person might say something like, "I'm not an expert, but X" and I think being able to default back to the underlying LLM would be useful.
I assume those frameworks have a way to specify a cutoff for similarity and a default. And they are flexible in terms of the granularity at which you use them, so you should be able to specify the prompt directly, in theory.
But using the OpenAI API directly is not that complicated. And neither is using something like pg_vector or just a plain cosine similarity calculation from Stack Overflow.
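The "augmented, not replacement" behavior can be had with a plain similarity cutoff and a fallback: if no retrieved chunk clears the threshold, build a prompt that lets the base model answer from general knowledge instead of refusing. The threshold value and prompt strings here are illustrative assumptions, not any framework's defaults.

```python
# Sketch: construct the prompt from retrieval hits, falling back to the
# base model's own knowledge when nothing relevant was retrieved.
# MIN_SIMILARITY is an arbitrary illustrative cutoff.

MIN_SIMILARITY = 0.75

def build_rag_prompt(question: str, hits: list[tuple[float, str]]) -> str:
    # hits: (similarity score, chunk text) pairs from whatever retriever you use
    relevant = [text for score, text in hits if score >= MIN_SIMILARITY]
    if relevant:
        context = "\n".join(relevant)
        return (
            f"Answer using only these documents:\n{context}\n\n"
            f"Question: {question}"
        )
    # Fallback: no good match, so ask the model to answer from general
    # knowledge while flagging that no source was found.
    return (
        "No relevant source documents were found. Answer from general "
        f"knowledge, and say so.\n\nQuestion: {question}"
    )
```

That gives you the human-style "I'm not an expert, but X" behavior: grounded answers when the store has the content, and a clearly flagged best-effort answer when it doesn't.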
For most use cases this article is right on target.
I have been self hosting a 16K context size model and there is a lot you can do with 16K or larger context.
There are also great use cases for fine tuning. For example, if you are writing a chatbot for your company’s products, it might make sense to fine tune on product data and then RAG it with specific customer data when setting up a chat session.
Fine-tuning is such a dangerous phrase bc it sounds perfect.
“We’ll just fine-tune based on our (we think valuable + special) data”
Fine-tuning doesn’t enhance the model w/ new “knowledge”; it trains it on a new, narrowly-defined task.
One other “cost” to consider: fine-tuning a 3rd-party model means that if that foundation model changes or goes away, the effort/cost needs to be repeated.
It is worth noting that all the current models offered by the OpenAI API are already fine-tuned, with supervised learning and reinforcement learning, to follow instructions and to follow them in a certain way.
OpenAI removed the base GPT-3.5 model a while ago and never made it available for GPT-4.
What are people even doing with fine tuned LLMs? I can never think of something that it can't do natively or that I have enough data for to be able to fine tune a task for. Just curious
Besides the three mentioned in the article (stringent accuracy requirements, fast edge inference, involved style transfer task) are there other good reasons to fine-tune?
I fine tune smaller LLMs with fanfic to generate stories in a genre. For instance, most models have no meaningful Minecraft content. By feeding lots of fanfic into a Lora and supplementing with RAG, I get pretty good results. RAG alone is garbage as there’s too much broad context on say Herobrine spanning many many stories, but the base models know almost nothing. Few shot doesn’t help because there’s not enough semantic support in the model weights. Etc.
That's really interesting! I've wanted to do something similar but assumed that the token count required to ingrain a meaningful amount of information would make it very difficult. How many tokens do you typically need to feed into these base models to get consistent content out? Does generated text ever end up leaving the specialized domain? (like League of Legends characters mixed into a Minecraft story)
I’ve not done anything particularly scientific to measure but my observation is if symbols are fairly rare it doesn’t require a huge amount. Herobrine for instance as a string isn’t very common, so it seems to be pretty good at keeping in the semantic region. Steve and Alex are popular Minecraft characters but are not very unique so you see more drift. But the more you mix in specific semantics, like creepers, or other tokens that are pretty specific to the genre, things improve. You see similar behavior with stable diffusion Lora.
I’ve never seen it wander between genres, but it can tell pretty generic stories, particularly if you’re pretty generic in your prompting.
How would guardrails be applied through fine-tuning?
I haven't heard of that, so genuinely curious. What I'm familiar with is guardrails applied on top of the LLM. For instance through prompting, managing the available data inside the vector database that could be used for context (if using RAG), or through something like NeMo[1].
The technique is called reinforcement learning from human feedback (RLHF). It's how, for example, Facebook can train Llama models to meet a certain safety score before releasing them into the wild.
For example, by using full fine-tuning it's possible to censor a LLaMA model.
Then people use the same dataset, but filter out the guardrails, to create an uncensored variant of that model.
Just google "llama wizard vicuna uncensored".