After reading the technical report, make the effort to download the model and run it against a few prompts. Within 5 minutes you'll understand how broken LLM benchmarking is.
That's why I like giving it a real-world test. For example, take a podcast transcript and ask it to produce show notes and a summary. With a temperature of 0, different models will tackle the problem in different ways, and you can infer whether they really understood the transcript. The transcripts I give it usually come from about an hour of audio of two or more people talking. A minimal sketch of this kind of test is below.
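For anyone who wants to try the same thing, here's a rough sketch that sends a transcript to an OpenAI-compatible endpoint (Ollama exposes one at http://localhost:11434/v1) with temperature 0. The model tag, file path, and prompt wording are illustrative placeholders, not the commenter's actual setup.

```python
# Minimal sketch of the "show notes" test against an OpenAI-compatible endpoint.
# Assumes a local Ollama server; the model tag and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("podcast_transcript.txt") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="gemma3:27b",   # swap in whichever model you are evaluating
    temperature=0,        # deterministic-ish output makes models easier to compare
    messages=[
        {"role": "system", "content": "You write concise podcast show notes."},
        {"role": "user", "content": "Write show notes and a summary for this "
                                    "transcript:\n\n" + transcript},
    ],
)
print(response.choices[0].message.content)
```

Running the same transcript through a few models this way makes it fairly obvious which ones actually followed the conversation and which ones are just paraphrasing fragments.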
Unfortunately that wouldn't help as much as you'd think, since talented AI labs can just watch the public leaderboard, note which models move up and down, and deduce and target whatever the hidden benchmark is testing.
The model performs very poorly in practice, while the benchmarks show it at DeepSeek V3 level. It's not terrible, but in real use it's in a different league from the models it sits right next to (a bit better here, a bit worse there) in the benchmarks.
I’d recommend trying it on Google AI Studio (aistudio.google.com). I am getting exceptional results on a handful of novel problems that require deep domain knowledge and structured reasoning. I’m not able to replicate this performance with Ollama, so I suspect something is a bit off.
Same experience here: On AI Studio, this is easily one of the strongest models I have used, including when compared to proprietary LLMs.
But Ollama and Open WebUI performance is very bad, even when running the FP16 version. I also tried to mirror some of the AI Studio settings (temperature 1 and top-p 0.95) but couldn't get it to produce anything useful.
I suspect there's some bug in the Ollama releases (possibly wrong conversation delimiters?). If this gets fixed, I will definitely start using Gemma 3 27B as my main model.
Update: Unsloth recommends a temperature of 0.1, not 1.0, when using Ollama. I don't know why Ollama would need a 10x lower value, but it definitely helped. I've also read some speculation that there might be an issue with the tokenizer.
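In case it helps others experiment, here's a rough sketch of passing those sampler options straight to Ollama's REST API. The gemma3:27b tag and the prompt are just placeholders; temperature 0.1 follows the Unsloth recommendation above, and top-p 0.95 mirrors the AI Studio default.

```python
# Sketch: override Ollama's sampler settings per-request via the REST API.
# Assumes a local Ollama server with the gemma3:27b tag pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:27b",
        "stream": False,
        "options": {"temperature": 0.1, "top_p": 0.95},
        "messages": [
            {"role": "user", "content": "Summarize the plot of Hamlet in three sentences."}
        ],
    },
)
print(resp.json()["message"]["content"])
```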
Hello! I tried showing it Redis code that hasn't been released yet (llama.cpp 4-bit quants and the official web interface): V3 can reason about the design tradeoffs, but (very understandably) Gemma 3 can't. I also tried to make it write a simple tic-tac-toe Monte Carlo program, and it didn't account for ties, while SOTA models consistently do.
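For context on what "accounting for ties" means here, below is a rough sketch of a Monte Carlo move picker for tic-tac-toe. This is my own illustrative version, not the prompt or any model's output; the key detail is that a full board with no winner must be scored as a draw rather than as a win or a loss.

```python
import random

# Sketch of a Monte Carlo move picker for tic-tac-toe. The point the comment
# makes is that a correct version has to treat a full board with no winner
# as a tie, not silently fold it into the win/loss counts.

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_playout(board, player):
    """Play random moves until someone wins or the board fills up (a tie)."""
    board = board[:]
    while True:
        w = winner(board)
        if w:
            return w
        empty = [i for i, cell in enumerate(board) if cell == ' ']
        if not empty:          # board full, nobody won: this is the tie case
            return 'tie'
        board[random.choice(empty)] = player
        player = 'O' if player == 'X' else 'X'

def best_move(board, player, playouts=200):
    """Score each legal move by average playout result; ties count as 0."""
    opponent = 'O' if player == 'X' else 'X'
    scores = {}
    for move in [i for i, cell in enumerate(board) if cell == ' ']:
        total = 0
        for _ in range(playouts):
            b = board[:]
            b[move] = player
            result = random_playout(b, opponent)
            if result == player:
                total += 1
            elif result == opponent:
                total -= 1
            # a tie contributes 0, but it must not be mistaken for a win or loss
        scores[move] = total / playouts
    return max(scores, key=scores.get)

if __name__ == '__main__':
    board = ['X', 'O', 'X',
             ' ', 'O', ' ',
             ' ', ' ', ' ']
    print("Best move for X:", best_move(board, 'X'))
```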
Can you share all the recommended settings to run this LLM? It is clear that the performance is very good when running on AI Studio. If possible, I'd like to use all the same settings (temperature, top-k, top-p, etc.) in Ollama. AI Studio only shows temperature, top-p, and output length.
I really respect the work that you've done, but I am always surprised when people speak anecdotally about AI models as though it were established truth. It's as if everyone believes they're an expert now, but has nothing of substance to offer beyond gut feelings.
It's as if people don't realize that these models are used for many different purposes, and subjectively one person could find a model amazing while another finds it awful. I just wish we could back up statements like "The model performs very poorly in practice" with actual data, or at least some explanation of how it performed poorly.