With all the different open-weight models appearing, is there some way of figuri...

GodelNumbering · 2025-04-28T21:58:54 1745877534

There are a lot of variables here such as your hardware's memory bandwidth, speed at which at processes tensors etc.

A basic thing to remember: Any given dense model would require X GB of memory at 8-bit quantization, where X is the number of params (of course I am simplifying a little by not counting context size). Quantization is just 'precision' of the model, 8-bit generally works really well. Generally speaking, it's not worth even bothering with models that have more param size than your hardware's VRAM. Some people try to get around it by using 4-bit quant, trading some precision for half VRAM size. YMMV depending on use-case

refulgentis · 2025-04-28T22:09:51 1745878191

4 bit is absolutely fine.

I know this is crazy to here because the big iron folks still debate 16 vs 32 and 8 vs 16 is near verboten in public conversation.

I contribute to llama.cpp and have seen many many efforts to measure evaluation perf of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective object perplexity loss) Q4 is indistinguishable from the original.

brigade · 2025-04-28T23:55:37 1745884537

It's incredibly niche, but Gemma 3 27b can recognize a number of popular video game characters even in novel fanart (I was a little surprised at that when messing around with its vision). But the Q4 quants, even with QAT, are very likely to name a random wrong character from within the same franchise, even when Q8 quants name the correct character.

Niche of a niche, but just kind of interesting how the quantization jostles the name recall.

smallerize · 2025-04-29T00:05:20 1745885120

Vision models do degrade more with quantization. https://unsloth.ai/blog/dynamic-4bit

magicalhippo · 2025-04-29T01:53:31 1745891611

> 4 bit is absolutely fine.

For larger models.

For smaller models, about 12B and below, there is a very noticeable degradation.

At least that's my experience generating answers to the same questions across several local models like Llama 3.2, Granite 3.1, Gemma2 etc and comparing Q4 against Q8 for each.

The smaller Q4 variants can be quite useful, but they consistently struggle more with prompt adherence and recollection especially.

Like if you tell it to generate some code without explaining the generated code, a smaller Q4 is significantly more likely to explain the code regardless, compared to Q8 or better.

Grimblewald · 2025-04-29T03:07:05 1745896025

4 bit is fine conditional to the task. This condition is related to the level of nuance in understanding required for the response to be sensible.

All the models I have explored seem to capture nuance in understanding in the floats. It makes sense, as initially it will regress to the mean and slowly lock in lower and lower significance figures to capture subtleties and natural variance in things.

So, the further you stray from average conversation, the worse a model will do, as a function of it's quantisation.

So, if you don't need nuance, subtly, etc. say for a document summary bot for technical things, 4 bit might genuinely be fine. However, if you want something that can deal with highly subjective material where answers need to be tailored to a user, using in-context learning of user preferences etc. then 4 bit tends to struggle badly unless the user aligns closely with the training distribution's mean.

mmoskal · 2025-04-28T23:17:42 1745882262

Just for some callibration: approx. no one runs 32 bit for LLMs on any sort of iron, big or otherwise. Some models (eg DeepSeek V3, and derivatives like R1) are native FP8. FP8 was also common for llama3 405b serving.

whimsicalism · 2025-04-29T00:41:14 1745887274

> 8 vs 16 is near verboten in public conversation.

i mean, deepseek is fp8

CamperBob2 · 2025-04-29T04:25:29 1745900729

Not only that, but the 1.58 bit Unsloth dynamic quant is uncannily powerful.

PhilippGille · 2025-04-29T10:10:43 1745921443

Mozilla started LocalScore for exactly what you're looking for: https://www.localscore.ai/

sireat · 2025-04-29T11:40:24 1745926824

Fascinating that 5090 is often close but not quite as good as 4090 and RTX 6000 ADA. Perhaps it indicates that 5090 has those infamous missing computational units?

3090Ti seems to hold up quite well.

frainfreeze · 2025-04-28T22:13:12 1745878392

Bartowski quants on hugging face are excellent starting point in your case. Pretty much every upload he does has a note how to pick model vram wise. If you follow the recommendations you'll have good user experience. Then next step is localllama subreddit. Once you build basic knowledge and feeling for things you will more easily gauge what will work for your setup. There is no out of the box calculator

rahimnathwani · 2025-04-28T23:17:19 1745882239

With 8GB VRAM, I would try this one first:

https://ollama.com/library/qwen3:8b-q4_K_M

For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.

Spooky23 · 2025-04-28T22:20:25 1745878825

Depends what fast means.

I’ve run llama and gemma3 on a base MacMini and it’s pretty decent for text processing. It has 16GB ram though which is mostly used by the GPU with inference. You need more juice for image stuff.

My son’s gaming box has a 4070 and it’s about 25% faster the last time I compared.

The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.

estsauver · 2025-04-29T10:23:47 1745922227

I don't think this is all that well documented anywhere. I've had this problem too and I don't think anyone has tried to record something like a decent benchmark of token inference/speed for a few different models. I'm going to start doing it while playing around with settings a bit. Here's some results on my (big!) M4 Mac Pro with Gemma 3, I'm still downloading Qwen3 but will update when it lands.

https://gist.github.com/estsauver/a70c929398479f3166f3d69bce...

Here's a video of the second config run I ran so you can see both all of the parameters as I have them configured and a qualitative experience.

https://screen.studio/share/4VUt6r1c

hedgehog · 2025-04-29T03:58:16 1745899096

Fast enough depends what you are doing. Models down around 8B params will fit on the card, Ollama can spill out though so if you need more quality and can tolerate the latency bigger models like the 30B MoE might be good. I don't have much experience with Qwen3 but Qwen2.5 coder 7b and Gemma3 27b are examples of those two paths that I've used a fair amount.

yencabulator · 2025-04-29T19:57:31 1745956651

Well, deepseek-r1:7b on AMD CPU only is ~12 token/s, gemma3:27b-it-qat is ~2.2 token/s. That's pure CPU at about 0.1x of a $3,500 Apple laptop at about 0.1x of the price. It's more a question about your patience, use case, and budget.

For discrete GPUs, RAM size is a harder cutoff. You either can run a model, or you can't.

colechristensen · 2025-04-28T22:10:42 1745878242

>is there some way of figuring out what model would work with sensible speed (> X tok/s) on a standard desktop GPU ?

Not simply, no.

But start with parameters close to but less than VRAM and decide if performance is satisfactory and move from there. There are various methods to sacrifice quality by quantizing models or not loading the entire model into VRAM to get slower inference.

xiphias2 · 2025-04-28T23:27:00 1745882820

When I tested Qwen with different sizes / quants, generally the 8-bit quant versions had the best quality for the same speed.

4-bit was ,,fine'', but a smaller 8-bit version beat it in quality for the same speed

refulgentis · 2025-04-28T22:07:01 1745878021

i desperately want a method to approximate this and unfortunately it's intractable in practice.

Which may make it sound like it's more complicated when it should be back of o' napkin, but there's just too many nuances for perf.

Really generally, at this point I expect 4B at 10 tkn/s on a smartphone with 8GB of RAM from 2 years ago. I'd expect you'd get somewhat similar, my guess would be 6 tkn/s at 4B (assuming rest of the HW is 2018 era and you'll relay on GPU inference and RAM)

archerx · 2025-04-29T12:31:01 1745929861

On hugging face, if you tell them which GPU you have the models that will run decently will have a green icon.

wmf · 2025-04-28T21:57:25 1745877445

Speed should be proportional to the number of active parameters, so all 7B Q4 models will have similar performance.

jack_pp · 2025-04-28T21:57:39 1745877459

Use the free chatgpt to help you write a script to download them all and test speed

Fokamul · 2025-04-29T14:21:38 1745936498

8G VRAM for LLM, are you sure? I thought you need way more, 20GB++ Nvidia doesn't want peasants running own LLMs locally, 90% of their business is supporting AI bubble with a lot of GPU datacenters