The benchmark results are so incredibly good they are hard to believe. A 30B model that's competitive with Gemini 2.5 Pro and way better than Gemma 27B?
Update: I tested "ollama run qwen3:30b" (the MoE) locally, and while it thought a lot, it wasn't that smart. After three follow-up questions it ended up in an infinite loop.
I just tried again, and it ended up in an infinite loop immediately, from a single prompt with no follow-up: "Write a Python script to build a Fitch parsimony tree by stepwise addition. Take a FASTA alignment as input and produce a nwk string as output."
Update 2: The dense one, "ollama run qwen3:32b", is much better (albeit slower, of course). It still keeps thinking for what feels like forever, until it misremembers the initial prompt.
Another thing you’re running into is the context window. Ollama sets a low context window by default, 4096 tokens IIRC. The reasoning process can easily exceed that, at which point the model forgets most of its reasoning and any prior messages, and it can get stuck in loops. The solution is to raise the context window to something reasonable, such as 32k.
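For reference, here's one way to do that with Ollama (a sketch; the exact model tag and whether your hardware has the memory for a 32k context are assumptions on my part). You can either set the parameter interactively in the REPL, or bake it into a Modelfile:

```shell
# Option 1: raise the context window for the current interactive session
ollama run qwen3:30b
# then, at the >>> prompt:
#   /set parameter num_ctx 32768

# Option 2: create a variant with the larger context baked in
cat > Modelfile <<'EOF'
FROM qwen3:30b
PARAMETER num_ctx 32768
EOF
ollama create qwen3-30b-32k -f Modelfile
ollama run qwen3-30b-32k
```

Note that raising num_ctx increases memory use, so it may push the model out of VRAM and slow things down; but for a reasoning model, a 4096-token window is almost never enough.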
Instead of this very high-latency remote debugging process with strangers on the internet, you could just try out properly configured models on the hosted Qwen Chat. Obviously the privacy implications are different, but running models locally is still a fiddly thing even if it is easier than it used to be, and configuration errors are often mistaken for bad model performance. If the models meet your expectations in a properly configured cloud environment, then you can put in the effort to figure out local model hosting.