> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.
It has the wrong one in thinking. It did think longer than usual:
Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, but I'm curious to see if I can manage to track down some performance issues. (I've already seen a couple of issues get squashed since I first started testing this.)
I've heard that vLLM performs much better and scales particularly well in the multi-GPU case. The 4x B70 setup may actually be decent for the money given that, but it's probably worth waiting to see how the situation progresses rather than buying on a promise of potential.
A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.
But on my 3x 1080 Ti 1x TITAN V ghetto machine I learned that multi-GPU takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.
There are a lot of variables, but PCIe bus speed doesn't matter that much for inference. The internal memory bandwidth does, and you won't ever match that with PCIe.
To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.
A 3090 has 936.2 GB/s of (low-latency) internal bandwidth, while PCIe 5.0 x16 only has 504.12 Gbit/s (about 63 GB/s) per direction, and transfers may have to be copied through the CPU, take locks, use atomic operations, etc.
For LLM inference, the bottleneck is usually going to be memory bandwidth, which is why my 3090 is so close to the 5070 Ti above.
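To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It assumes PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding and ignores packet and protocol overhead, so real transfers will be slower. Note that the 504.12 figure comes out in Gbit/s, not GB/s; the function name is just for illustration.

```python
# Rough PCIe link bandwidth versus on-card VRAM bandwidth.
# Assumptions: PCIe 5.0 signals at 32 GT/s per lane with 128b/130b encoding;
# packet/protocol overhead is ignored, so real throughput is lower.

def pcie_gbit_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw one-direction link rate in Gbit/s after 128b/130b line encoding."""
    return gt_per_s * lanes * 128 / 130


pcie5_x16_gbit = pcie_gbit_per_s(32.0, 16)
pcie5_x16_gbyte = pcie5_x16_gbit / 8  # bits -> bytes
vram_3090_gbyte = 936.2  # GB/s, the figure quoted in the comment above

print(f"PCIe 5.0 x16: {pcie5_x16_gbit:.2f} Gbit/s = {pcie5_x16_gbyte:.1f} GB/s")
print(f"3090 VRAM vs. PCIe 5.0 x16: ~{vram_3090_gbyte / pcie5_x16_gbyte:.0f}x")
```

So anything that has to cross the bus on every token is working with roughly an order of magnitude less bandwidth than data that stays in VRAM.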
LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.
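A crude way to see the memory-bound argument: for a dense model, each generated token has to stream roughly all of the weights from VRAM once, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes a hypothetical 20 GB model, takes the 3090 figure from above, and uses 896 GB/s for the 5070 Ti (my assumption from the published spec). It ignores KV cache reads, batching, and compute, so treat it as an upper bound only.

```python
# Upper bound on single-stream decode speed when weights dominate memory traffic.
# Assumption: every token reads all model weights from VRAM exactly once.

def decode_tokens_per_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-limited ceiling on tokens/s for a dense model."""
    return bandwidth_gb_s / weights_gb


MODEL_GB = 20.0  # hypothetical quantized model size, for illustration only

cards = {
    "RTX 3090": 936.2,     # GB/s, quoted earlier in the thread
    "RTX 5070 Ti": 896.0,  # GB/s, assumed from the published spec
}

for name, bandwidth in cards.items():
    ceiling = decode_tokens_per_s_ceiling(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s for a {MODEL_GB:.0f} GB model")
```

Under these assumptions the two cards land within about 5% of each other, which matches them benchmarking so close together despite the generation gap.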
As I haven't used the larger Intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some NVLink-style RDMA support _unless_ your process is compute bound and not memory bound.
The default Qwen "quantization" is not "bad", it's "large".
Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.
Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.
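As a rough illustration of the size side of that trade-off, weight storage scales with parameter count times bits per weight. The sketch below uses a hypothetical 30B-parameter model and ignores per-block scales, metadata, and mixed-precision layers that real quant formats carry, so actual file sizes will differ somewhat.

```python
# Approximate weight storage at different precisions.
# Assumption: size = parameters * bits_per_weight / 8, with no format overhead.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 30e9  # hypothetical parameter count, for illustration only

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weights_gb(N_PARAMS, bits):.0f} GB")
```

That is the difference between needing multiple datacenter cards and fitting on a single 24 GB consumer GPU.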
What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.
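The hash-check half of that idea is straightforward. A minimal sketch, assuming the publisher provides a SHA-256 hex digest for the updated file (the function names here are hypothetical, and this does not parse or patch the GGUF metadata itself):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB model files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare the local file's digest against a published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

After patching locally, `matches_expected("model.gguf", published_digest)` would tell you whether the result is byte-identical to the re-uploaded file; both arguments here are placeholders.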
Sadly it's not always chat template fixes :( But yes we now split the first shard as pure metadata (10MB) for huge models - these include the chat template etc - so you only need to download that.
For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(
> Having spent six weeks or so using Gas Town across multiple simultaneous projects, I believe I can describe the shift concretely. The bottleneck migrates from coding speed to the rate at which you can generate ideas, write specifications, and validate outputs. You are no longer limited by how fast you can build. You are limited by how fast you can think.
Interesting:
> Kubernetes asks “Is it running?” Gas Town asks “Is it done?” Kubernetes optimizes for uptime. Gas Town optimizes for completion.
I’m not sure I find the testimony of a Bain & Company AI consultant (https://www.bain.com/our-team/eric-koziol/) to be compelling for anything outside of generating fees.
This sounds like every LLM workflow, which is 'you tell the LLM what you want'.
The real distinction is of scale - whether you want a REST endpoint or a fully functional word processor.
But real, actual, complex software is at least half spec (either explicit, or implicitly captured by its code). The question is, can LLMs specify software to that degree with Gas Town, so that you get something functioning?
You provided a quote from someone who seems to be an AI-boosting influencer who claimed to use it, but where's the output in the form of code we can look at, or in the form of an app someone can use today?
I'm not an AI-denier. I use LLMs and agentic coding. They increase my productivity.
...but there is still a very real problem with people claiming that some new way of using AI is earth-shattering and changes everything, based on vague anecdotes that don't involve a tangible released output that they can point to.
Yeah if this can truly just autonomously make great software, then where is all the new SaaS that is able to undercut incumbents by charging 10-20% of what they are charging?
try using codex-5.3-spark, it has much faster inference, might be able to keep up. and maybe a specialized different openrouter model for visual parsing.