
With quantization-aware training, q4 models land within 1% of their bf16 counterparts. And yes, if your use case hinges on the very latest and largest cloud-scale models, there are things they can do that local ones just can't. But having them spit tokens 24/7 for you would cost enough to pay off a whole enterprise-scale GPU in a few months, too.
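To make that concrete, here's a minimal sketch of loading the same checkpoint at both precisions (assumes transformers + bitsandbytes; the model id is a placeholder, and it uses post-training 4-bit quantization for illustration, since QAT checkpoints ship already quantized):

    # Sketch only: a bf16 load vs. a 4-bit (NF4) load of the same
    # placeholder checkpoint, to compare quality/memory side by side.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id

    bf16_model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.bfloat16, device_map="auto"
    )

    q4_model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
            bnb_4bit_compute_dtype=torch.bfloat16,  # compute stays in bf16
        ),
        device_map="auto",
    )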

If anyone has a gaming GPU with gobs of VRAM, I highly encourage them to experiment with creating long-running local-LLM apps. We need more independent tinkering in this space.
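To give "long-running" some shape, here's a hedged sketch of a background worker that polls a file and pushes it through a local llama.cpp server (or any OpenAI-compatible endpoint, e.g. Ollama's). The endpoint URL, model name, and the summarize-a-file task are all placeholders:

    # Assumes something like `llama-server -m model.gguf --port 8080`
    # is already running. Tokens are free locally, so poll liberally.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    while True:
        with open("inbox.txt") as f:  # hypothetical work queue
            text = f.read()
        if text.strip():
            resp = client.chat.completions.create(
                model="local",  # llama-server serves whatever it loaded
                messages=[{"role": "user",
                           "content": f"Summarize:\n\n{text}"}],
            )
            print(resp.choices[0].message.content)
        time.sleep(60)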



> But having them spit tokens 24/7 for you would cost enough to pay off a whole enterprise-scale GPU in a few months, too.

Again, what's the use case? What would make sense to run at high volume where output quality isn't much of a concern? I'm genuinely interested, because that question always seems to go unanswered.


Any sort of business that might want to serve from a customized LLM at scale and doesn't need the smartest model possible, or hobbyist/researcher experiments. If you can get an agentic framework to work on a problem with a local model, it'll almost certainly work just as well on a cloud model. Again, I'm speaking mostly to people who already have an xx90-class GPU sitting around. Smoke 'em if you've got 'em. If you don't have a 3090/4090/5090 already, and don't care about privacy, then just enjoy how improvements in local models are driving down the price per token of non-bleeding-edge cloud models.
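And the local-to-cloud portability is mostly a one-line swap, since both sides speak the same chat-completions protocol. Sketch (URLs and model names are examples, not any specific setup):

    # Same agent code, two backends: iterate locally for free, then flip
    # an env var to point the identical loop at a cloud API.
    import os
    from openai import OpenAI

    if os.getenv("USE_LOCAL", "1") == "1":
        client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
        model = "local-q4"  # whatever your local server loaded
    else:
        client = OpenAI()   # reads OPENAI_API_KEY from the environment
        model = "gpt-4o"    # example cloud model

    # ... agent loop calls client.chat.completions.create(model=model, ...)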


> If you can get an agentic framework to work on a problem with a local model, it'll almost certainly work just as well on a cloud model.

This is the exact opposite of what my tests show: it will almost certainly NOT work as well as the cloud models, as supported by every benchmark I've ever seen. I feel like I'm living in another AI universe here. I suppose it heavily depends on the use case.



