
A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343



This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.


And shout-out to Qwen if they release 122b -- Jeff Barr's original Gemma 4 tweet said they'd release a ~122b, then it got redacted :(


122b would be awesome. It is the largest size you can kinda run with a beefy consumer PC. I wondered about gemma stopping in the 30b category, it is already very strong. 122b might have been too close to being really useful.


> not much sense in keeping them proprietary

Maybe for LLMs since everyone has their own competing LLM, but with video models, Wan 2.2 did a rug pull, left a huge gap for the community that built around Wan 2.2 too, and I don't think a single open video model has come close since. Wan is at 2.7 now, and it's been nearly a year since the last update.


Is there any source for these claims?


https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.


A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.


397A17B = 397B total weights, 17B per expert?


That's not how it works. Many people get confused by the “expert” naming, when in reality the key part of the original name “sparse mixture of experts” is sparse.

Experts are just chunks of each layer's MLP that are only partially activated by each token. There are thousands of “experts” in such a model (for Qwen3-30B-A3B, it was 48 layers x 128 “experts” per layer, with only 8 active at each token).
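
To make the counting concrete, here is a rough Python sketch using the figures quoted above (48 layers, 128 experts per layer, 8 active per token). The function names are made up for illustration, and the top-k selection is a simplified stand-in for a real router, not Qwen's actual implementation.

```python
# Figures quoted in the comment above; everything else is illustrative.
LAYERS = 48
EXPERTS_PER_LAYER = 128
ACTIVE_PER_TOKEN = 8


def expert_counts(layers, experts_per_layer, active_per_token):
    """Return (total experts, experts used by one token, fraction used)."""
    total = layers * experts_per_layer
    active = layers * active_per_token
    return total, active, active / total


def top_k_experts(router_scores, k):
    """Pick the k highest-scoring experts for one token in one layer.

    Simplified stand-in for a learned router: it only ranks the scores.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


total, active, fraction = expert_counts(LAYERS, EXPERTS_PER_LAYER, ACTIVE_PER_TOKEN)
print(total, active, fraction)

# Toy layer with 4 experts and 2 active per token.
chosen = top_k_experts([0.1, 0.9, 0.3, 0.7], k=2)
print(chosen)
```

So "thousands of experts" is literal here, and each token only ever touches a small fraction of them.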


17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
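
A back-of-the-envelope way to see the overlap effect: if you assume, purely for illustration, that each token picks its experts uniformly and independently (real routers do not), the expected number of distinct experts touched in one layer grows with the number of tokens in the batch and saturates at the layer's expert count. The 128/8 figures below are borrowed from the Qwen3-30B example elsewhere in this thread, not from the 397B model.

```python
def expected_distinct_experts(n_experts, k_active, n_tokens):
    """Expected number of distinct experts hit in one layer by n_tokens.

    Assumption (not how trained routers behave): every token selects
    k_active of n_experts uniformly at random, independently of the others.
    """
    # Probability that a given expert is missed by every token.
    p_missed_by_all = (1 - k_active / n_experts) ** n_tokens
    return n_experts * (1 - p_missed_by_all)


for tokens in (1, 4, 32, 512):
    hit = expected_distinct_experts(128, 8, tokens)
    print(f"{tokens:4d} tokens -> ~{hit:.1f} of 128 experts touched per layer")
```

With one token you touch exactly 8 experts per layer. With a long prompt you touch nearly all of them, which is why prefill pulls in far more than the "active" parameter count.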


397B params, 17B activated at the same time

Those 17B might be split among multiple experts that are activated simultaneously


How many people/hackernews can run a 397b param model at home? Probably like 20-30.


The point is that open weights puts inference on the open market, so if your model is actually good and providers want to serve it, it will drive costs down and inference speeds up. Like Cerebras running Qwen 3 235B Instruct at 1.4k tps for cheaper than Claude Haiku (let that tps number sink in for a second. For reference, Claude Opus runs ~30-40 tps, Claude Haiku at ~60. That's roughly a 20-50x difference). As a company developing models, it means you can't easily capture the inference margins even though I believe you get a small kickback from the providers.

So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.


You can rent a cloud H200 with 140GB VRAM in a server with 256GB system ram for $3-4/hr.


Can you tell me where? I used runpod before, but they don't have systems like that.


This is like saying that open source is not important because I don't have a machine to run it on right now. Of course it is important. We don't have any state-of-the-art language models that are open source, but some are still open weight. Better than nothing, and the only way to secure some type of privacy and control over your own AI use. It is my goal to run these large models locally eventually; if they all go away, that is not even a possibility...


I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.


Hacker News moment


$10k is well outside my budget for frivolous computer purchases.


It would be plenty in-budget if the software part of local AI was a bit more full-featured than it is at present. I want stuff like SSD offload for cold expert weights and/or for saved/cached KV-context, dynamic context sizing, NPU use for prefill, distributed inference over the network, etc. etc. to all be things that just work for most users, without them having to set anything up in an overly error-prone way. The system should not just explode when someone tries to run something slightly larger; it should undergo graceful degradation and let them figure out where the reasonable limits are.


But it's well within the budget of a small company that wants to run a model locally. There are plenty of reasons to run one locally even if it's not state of the art, such as for privacy, being able to do unlimited local experiments, or refining it to solve niche problems.


Yeah, but if you really really wanted to and/or your livelihood depended on it, you probably could afford it.


99.97% of HN users are nodding… :)


There are way too many good uses of these models for local that I fully expect a standard workstation 10 years from now to start at 128GB of RAM and have at least a workstation inference device.


Or, if you believe a lot of the HN crowd, we are in an AI bubble, and in 10 years inference will be dirt cheap when all of this crashes and we have all this hardware in data centers, so it won't make any sense to run monster workstations at home (I work on a 128GB M4 but don't run inference, just too many Electron apps running at the same time...) :)


> I work on a 128GB M4 but don't run inference, just too many Electron apps running at the same time.

This is somewhat depressing - needing a couple of thousand bucks worth of RAM just to run your chat app and code/text editor and API doco tool and forum app and notetaking app all at the same time...


Crucial (Micron) sold 128GB of DDR5-5600 in SODIMM form for $280 a year ago. It would be slower than the same amount on an M4 Mac, but still, I object to characterizing either as “a couple thousand bucks worth”.


I get that number by optioning up a Mac Studio to 128GB at the Apple Store.

(Admittedly, Apple should be facing criminal price gouging lawsuits for their RAM pricing.)


Inference will be dirt cheap for things like coding but you'll want much more compute for architectural planning, personal assistants with persistent real time "thinking / memory", as well as real time multimedia. I could put 10 M4s to work right now and it won't be enough for what I've been cooking.


That's kind of a specific percentage. What numbers did you use to get there?


Just have to reclassify it as non-frivolous then. $10k's not a lot for something as important as a car, if you live somewhere where one is required. Housing is typically gonna cost you more than $10k to own. I probably spend close to $10k for food for 1.5 years.

So if you just huff enough of the AI Kool-Aid, you too can own a Mac Studio. Or an M5 MacBook. Or a dual 3090 rig.


For some reason you were being downvoted but I enjoy hearing how people are running open weights models at home (NOT in the cloud), and what kind of hardware they need, even if it's out of my price range.


I'm running it on my Intel Xeon W5 with 256GB of DDR5 and Nvidia 72GB VRAM. Paid $7-8k for this system. Probably cost twice as much now.

Using UD-IQ4_NL quants.

Getting 13 t/s. Using it with thinking disabled.


I get 20 t/s on the UD-Q6_K_XL quant, Radeon 6800 XT.


Where I live, 10k USD is a little more than 3 years' worth of rent for a relatively new and convenient 2 bedroom apartment.


$277 a month for a two bedroom is literally 6-10% of what someone in the SF Bagholder Area pays.

You're either in Africa, Southeast Asia, or South/Central America.

How do you even afford internet?


Yes, I am in SEA. Home internet here costs $10 per month.

My point was: not every person browsing this site has a high standard of living, and the ability to spend 10k on computing is a privilege.


you have proved my point


I'm running it on dual DGX Sparks.


I'm interested in your experiences running dual


which exact model, and how many tokens per second for generation?


According to this blog (https://kaitchup.substack.com/p/lessons-from-gguf-evaluation...) the UD_IQ2_M quants are quite strong (rel. error to the base is very low), so it's around 120GB of RAM needed, while the experts can be loaded into VRAM and the rest offloaded into system RAM. It's a high-end consumer PC, sure, but not unaffordable. For example, I have an older rig with an RTX 6000 Ada (48GB VRAM), 128GB RAM and a Threadripper, which runs this quant offloaded at 20 tps.
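
For anyone who wants to sanity-check sizes like that, here is a rough Python sketch of the arithmetic: weight size is just parameter count times average bits per weight. It only uses the 397B and ~120GB figures from this comment. It ignores KV cache, activations and per-tensor overhead, and the helper names are made up for illustration.

```python
def weight_bytes(total_params, bits_per_weight):
    """Approximate size of the weights alone (no KV cache, no overhead)."""
    return total_params * bits_per_weight / 8


def implied_bits_per_weight(size_bytes, total_params):
    """Average bits per weight implied by a given file/RAM footprint."""
    return size_bytes * 8 / total_params


TOTAL_PARAMS = 397e9       # total parameter count discussed in the thread
QUOTED_SIZE_BYTES = 120e9  # "around 120GB" from the comment above

bpw = implied_bits_per_weight(QUOTED_SIZE_BYTES, TOTAL_PARAMS)
print(f"~{bpw:.2f} bits per weight on average")

# The same model at a few hypothetical average bit widths.
for bits in (4, 8, 16):
    print(f"{bits:2d} bits -> ~{weight_bytes(TOTAL_PARAMS, bits) / 1e9:.0f} GB")
```

In other words, ~120GB for 397B parameters works out to a bit under 2.5 bits per weight on average, versus roughly 800GB at 16 bits.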


I’ve mentioned this as an option in other discussions, but if you don’t care that much about tok/sec, 4x Xeon E7-8890 v4s with 1TB of DDR3 in a supermicro X10QBi will run a 397b model for <$2k (probably closer to $1500). Power use is pretty high per token but the entry price cannot be beat.

Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago; I know they’ve added more interesting distribution mechanisms recently that I haven’t had a chance to test yet.


It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.


OpenRouter.


Yeah, I think there are benefits to third-party providers being able to run the large models and have stronger guarantees about ZDR and knowing where they are hosted! So open weights for even the large models we can’t personally serve on our laptops are still useful.


If you're running it from OpenRouter, you might as well use Qwen3.6 Plus. You don't need to be picky about a particular model size of 3.6. If you just want the 397b version to save money, just pick a cheaper model like M2.7.


The 397B model can be run at home with the weights stored on an SSD (or on 2 SSDs, for double throughput).

Probably too slow for chat, but usable as a coding assistant.
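
A rough way to estimate how slow, as a Python sketch: if every active weight had to be read from the SSD for each generated token, decode speed is bounded by read bandwidth divided by bytes per token. The 17B active-parameter figure is from this thread. The 4-bit width, the bandwidth numbers and the cache-hit parameter are hypothetical placeholders, and the function name is made up.

```python
def ssd_bound_tokens_per_second(active_params, bits_per_weight,
                                read_bandwidth_bytes_per_s,
                                cache_hit_fraction=0.0):
    """Upper bound on decode speed when active weights stream from SSD.

    cache_hit_fraction is the (assumed) share of active weights already
    resident in RAM/VRAM, so they do not need to be read again.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    bytes_from_ssd = bytes_per_token * (1 - cache_hit_fraction)
    if bytes_from_ssd <= 0:
        return float("inf")  # nothing left to read from the SSD
    return read_bandwidth_bytes_per_s / bytes_from_ssd


ACTIVE_PARAMS = 17e9  # active parameters per token, per the thread
BITS = 4              # hypothetical quantization width

one_ssd = ssd_bound_tokens_per_second(ACTIVE_PARAMS, BITS, 7e9)   # assumed 7 GB/s
two_ssds = ssd_bound_tokens_per_second(ACTIVE_PARAMS, BITS, 14e9)  # assumed 2 x 7 GB/s
print(f"one SSD:  ~{one_ssd:.2f} tok/s")
print(f"two SSDs: ~{two_ssds:.2f} tok/s")
```

Under those assumptions you land around one token per second, and a second SSD or a warm cache for frequently used experts moves the bound up proportionally.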


I think you have that backwards. Agentic coding is way more demanding than simple chat. The request/response loops (tool calling) are much tighter and more numerous, and the context is waaaaay bigger in general.


In processing power, but chat is interactive. Agentic coding, you come up with a plan and sign off on it, and then just let it go for a while. It's the difference between speed and latency.


Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 tps prefill, 20 tps decode.

AMD Threadripper Pro 9965WX, 256GB DDR5-5600, RTX 4090.


It only has 17b active params, it's a mixture of experts model. So probably a lot more people than you realize!


I really wish they released qwen-image 2.0 as open weights.



