Cookbook: Finetuning Llama 2 in your own cloud environment, privately

andrewmutz · on Aug 2, 2023

Does anyone know how to estimate the cost of inference using your own Llama2 model? This article talks about the cost of fine tuning it, but not what to expect when running it in production for inference.

In particular, it would be great to know how the inference cost compares to GPT-3.5 Turbo and GPT-4.

weichiang · on Aug 2, 2023

The cost depends on GPU type, serving system, and traffic pattern. Check out the throughput comparisons in vLLM's blog post: https://vllm.ai/. If you serve a 7B model on cost-optimized GPUs (A10G/L4) and keep them busy, it can be a lot cheaper than gpt-3.5 turbo. Though it's not a fair comparison, as 3.5's quality is still far better.

zhwu · on Aug 2, 2023

Great reference!

Just want to add about hosting your own LLM vs using ChatGPT. Cost is definitely a thing to consider, but it also depends on whether it is ok to share the requests to your product with OpenAI.

Also, something you cannot do with ChatGPT is customize it with your own data, such as internal documents. As shown in the blog, the model we trained ourselves can easily learn its own identity.

weichiang · on Aug 2, 2023

Say you use an A10G at ~$1.2/hr with full utilization on vLLM at 112 reqs/min: that comes to ~$0.00018 per request, versus $0.002 per 1k tokens for gpt-3.5 turbo.
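The arithmetic behind that estimate can be sketched as below. The GPU price, throughput, and gpt-3.5 turbo price are the figures quoted in this comment; the function name and the break-even calculation are illustrative additions, and since one side is priced per request and the other per 1k tokens, the comparison only holds for a given request length.

```python
# Back-of-the-envelope cost per request for a self-hosted model.
# All figures are the rough numbers quoted in the comment, not benchmarks.

def cost_per_request(gpu_dollars_per_hour: float, requests_per_minute: float) -> float:
    """Dollar cost of one request, assuming the GPU is kept fully busy."""
    requests_per_hour = requests_per_minute * 60
    return gpu_dollars_per_hour / requests_per_hour

# A10G at ~$1.2/hr serving 112 requests per minute.
a10g_cost = cost_per_request(gpu_dollars_per_hour=1.2, requests_per_minute=112)

# gpt-3.5 turbo is priced per 1k tokens, so the comparison depends on how
# many tokens a request uses. Below this many tokens per request, the API
# would be the cheaper option at the quoted price.
gpt35_dollars_per_1k_tokens = 0.002
break_even_tokens = a10g_cost / gpt35_dollars_per_1k_tokens * 1000

print(f"self-hosted: ${a10g_cost:.5f} per request")
print(f"break-even request size: ~{break_even_tokens:.0f} tokens")
```

Any idle time on the GPU raises the self-hosted cost per request proportionally, which is why the "keep it busy" condition matters.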

npsomaratna · on Aug 3, 2023

Quick question: what would you estimate the running cost of Llama 2 70B to be (on GPU, assuming maximum utilization)?

cpill · on Aug 3, 2023

yeah, that's the real question here

zhwu · on Aug 2, 2023

This is the operational guide behind the latest release of Vicuna-1.5: https://twitter.com/lmsysorg/status/1686794639469371393

ripvanwinkle · on Aug 2, 2023

Can fine tuning replace the retrieval step? I.e., is it possible to fine tune the model so it knows all the knowledge from my organization, and then skip the retrieval step during a chat about the data?

a5huynh · on Aug 2, 2023

A problem with fine tuning based on organization data is that if the underlying data changes, you'd need to fine-tune the model again on each change. This might be okay for one-off changes (such as the name of the model in the example), but if it costs $300 each time (not to mention the time spent) and you have 100s/1,000s of changes per month, it's not really viable.
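To put rough numbers on that, here is a minimal sketch that multiplies the $300 per-run figure from this comment by the change counts it mentions. It assumes one full fine-tuning run per change, which is the worst case described here; the function name is made up for illustration.

```python
# Monthly cost of re-running fine-tuning on every data change.
# Assumes one full run per change, at the $300 per run quoted in the comment.

def monthly_finetune_cost(dollars_per_run: float, changes_per_month: int) -> float:
    """Total monthly spend if every change triggers a new fine-tuning run."""
    return dollars_per_run * changes_per_month

for changes in (100, 1000):
    cost = monthly_finetune_cost(dollars_per_run=300, changes_per_month=changes)
    print(f"{changes} changes/month -> ${cost:,.0f}/month")
```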

zhwu · on Aug 2, 2023

Finetuning can tailor the model to have more customized knowledge, just like the knowledge of its own identity shown in the blog post. If you ask the original Llama model, it should know nothing about SkyPilot or Vicuña, as it is trained on older knowledge from the internet.

However, finetuning still cannot get rid of the hallucination problem that all chatbots suffer from. It depends on how accurate you expect the chatbot to be. Retrieval might be considered more accurate, as it will not make up solutions; in the worst case it just returns an irrelevant answer.

covi · on Aug 2, 2023

You could finetune and then add on a retrieval step, which has the advantage of citing sources. Jury's probably still out on which method, or which combination of methods, works best. It likely depends on the use case and data size.
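A toy sketch of what "add on a retrieval step that cites sources" could look like. Everything here is an assumption for illustration: the documents and file names are invented, the scoring is plain word overlap rather than the embedding search a real system would likely use, and the resulting prompt would be passed to whatever finetuned model you serve (not shown).

```python
from collections import Counter

# Hypothetical internal documents, keyed by a source id used for citation.
DOCS = {
    "handbook.md": "Vacation requests must be submitted two weeks in advance.",
    "it-faq.md": "Reset your password from the internal account portal.",
    "onboarding.md": "New hires receive a laptop on their first day.",
}

def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,?!").lower() for w in text.split()]

def retrieve(question, docs, top_k=1):
    """Return up to top_k source ids ranked by word overlap with the question."""
    q = Counter(tokenize(question))
    scored = []
    for source, text in docs.items():
        overlap = sum((q & Counter(tokenize(text))).values())
        scored.append((overlap, source))
    scored.sort(reverse=True)
    # Drop documents that share no words with the question.
    return [source for overlap, source in scored[:top_k] if overlap > 0]

def build_prompt(question, docs, top_k=1):
    """Prepend the retrieved passages, tagged with their source ids."""
    sources = retrieve(question, docs, top_k)
    context = "\n".join(f"[{s}] {docs[s]}" for s in sources)
    return (
        "Answer using only the context and cite the source in brackets.\n"
        f"{context}\n"
        f"Question: {question}"
    )

print(build_prompt("How do I reset my password?", DOCS))
```

Because the source id travels with each passage, the answer can point back to where the information came from, and updating a document only means changing the entry in the store, not retraining.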

dang · on Aug 2, 2023

Related ongoing thread:

Run Llama 2 uncensored locally - https://news.ycombinator.com/item?id=36973584 - Aug 2023 (148 comments)

bestcoder69 · on Aug 2, 2023

Looking for a Llama 2 fine-tune guide specific to Apple silicon, if anyone has one. I wanna see how big of a model I can tune on my 64GB Mac Studio.