More

dist-epoch · 2026-04-16T22:19:59 1776377999

Buy new Macs from where? There is a shortage of RAM, SSD, GPUs, and the CPU shortage just started.

dist-epoch · 2026-04-16T22:18:30 1776377910

Jensen just said that if the signal/commitments are there, ASML can scale in 2-3 years.

vessenes · 2026-04-17T02:14:49 1776392089

With Anthropic buying compute in dark alleys I’d assume that day is coming..

dist-epoch · 2026-04-16T22:17:35 1776377855

yes, and it will keep increasing

dist-epoch · 2026-04-16T16:35:01 1776357301

Interestingly, Gemma4-26B IQ4_XS gets it correct:

> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.

It has the wrong one in thinking. It did think longer than usual:

Direct answer: Walk.

Reasoning 1: Distance (50m is negligible).

Reasoning 2: Practicality/Efficiency (engine wear/fuel).

Reasoning 3: Time (walking is likely faster or equal when considering car prep).

...

Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.

...

Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.

Conclusion: You should drive the car.

dist-epoch · 2026-04-16T16:31:06 1776357066

There are really nice GUIs for LLMs - CherryStudio for example, can be used with local or cloud models.

There are also web-UIs - just like the labs ones.

And you can connect coding agents like Codex, Copilot or Pi to local coding agents - the support OpenAI compatible APIs.

It's literally a terminal command to start serving the model locally and you can connect various things to it, like Codex.

dist-epoch · 2026-04-16T16:24:31 1776356671

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

jchw · 2026-04-16T17:17:57 1776359877

Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.

nyrikki · 2026-04-17T00:05:25 1776384325

Parallelism can be tricky and always has a cost, but don't discount the 3090 which is more expensive these days in that price bracket.

3090 llama.cpp (container in VM)

    unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t/s
    unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t/s

Still slow compaired to the

    ggml-org/gpt-oss-20b-GGUF 206 t/s

But on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.

There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.

To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.

A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...

For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.

LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.

As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.

dist-epoch · 2026-04-16T16:20:49 1776356449

The default Qwen "quantization" is not "bad", it's "large".

Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.

Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.

danielhanchen · 2026-04-17T08:27:49 1776414469

Love the JPEG analogy :)

dist-epoch · 2026-04-16T16:18:37 1776356317

What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.

danielhanchen · 2026-04-16T16:58:05 1776358685

Sadly it's not always chat template fixes :( But yes we now split the first shard as pure metadata (10MB) for huge models - these include the chat template etc - so you only need to download that.

For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(

dist-epoch · 2026-04-14T20:37:12 1776199032

> Having spent six weeks or so using Gas Town across multiple simultaneous projects, I believe I can describe the shift concretely. The bottleneck migrates from coding speed to the rate at which you can generate ideas, write specifications, and validate outputs. You are no longer limited by how fast you can build. You are limited by how fast you can think.

Interesting:

> Kubernetes asks “Is it running?” Gas Town asks “Is it done?” Kubernetes optimizes for uptime. Gas Town optimizes for completion.

https://embracingenigmas.substack.com/p/exploring-gas-town

Zafira · 2026-04-14T20:42:52 1776199372

I’m not sure I find the testimony of a Bain & Company AI consultant (https://www.bain.com/our-team/eric-koziol/) to be compelling for anything outside of generating fees.

dist-epoch · 2026-04-14T20:44:49 1776199489

Does this mean you would avoid an article on PostgreSQL if it's from a company selling Postgres products or consultation?

Leszek · 2026-04-14T20:59:22 1776200362

It means they'd avoid an article on the benefits of smoking if it's posted by a company selling cigarettes.

mtlynch · 2026-04-14T20:44:29 1776199469

This seems to be an AI-generated post where the "author" never reveals building any successful product or even tangible project with Gas Town.

joezydeco · 2026-04-14T21:02:15 1776200535

It's like Web 4.0 zombo.com

coldtea · 2026-04-14T23:21:44 1776208904

"You can build anything with Gas Town! The only limit is yourself!"

edit: was "is your imagination". Changed to fully match https://genius.com/Zombo-zombocom-lyrics

selimthegrim · 2026-04-14T23:26:09 1776209169

Oh man, we can't even say yourself anymore.

ok_dad · 2026-04-15T02:42:29 1776220949

“Yourself” still has to pay for tokens!

torginus · 2026-04-14T21:02:09 1776200529

This sounds like every LLM workflow, which is 'you tell the LLM what you want'.

The real distinction is of scale - whether you want a REST endpoint or a fully functional word processor.

But real, actual, complex software is at least half spec (either explicit, or implicitly captured by its code), the question is, can LLMs specify software to the same degree with Gas Town, that you get something functioning?

bayarearefugee · 2026-04-14T20:47:30 1776199650

This doesn't really answer the question...?

You provided a quote from someone who seems to be an AI-boosting influencer who claimed to use it, but where's the output in the form of code we can look at, or in the form of an app someone can use today?

I'm not an AI-denier. I use LLMs and agentic coding. They increase my productivity.

...but there is still a very real problem with people claiming that some new way of using AI is earth shattering, and changes everything based on vague anecdotes that don't involve a tangible released output that they can point to.

tcoff91 · 2026-04-14T21:33:14 1776202394

Yeah if this can truly just autonomously make great software, then where is all the new SaaS that is able to undercut incumbents by charging 10-20% of what they are charging?

tom_ · 2026-04-14T21:29:14 1776202154

I don't use LLMs and I never use agentic coding. And I too am interested in an answer to this question.

coldtea · 2026-04-14T23:20:45 1776208845

>Kubernetes asks “Is it running?” Gas Town asks “Is it done?” Kubernetes optimizes for uptime. Gas Town optimizes for completion.

Sounds like the typical AI post slop.

dist-epoch · 2026-04-14T07:49:43 1776152983

try using codex-5.3-spark, it has much faster inference, might be able to keep up. and maybe a specialized different openrouter model for visual parsing.