After reading the technical report, make the effort to download the model and run it against a few prompts. Within 5 minutes you'll understand how broken LLM benchmarking is.
That's why I like giving it a real-world test. For example, take a podcast transcript and ask it to produce show notes and a summary. With a temperature of 0, different models will tackle the problem in different ways, and you can infer whether they really understood the transcript. The transcripts I give it usually come from about an hour of audio of two or more people talking. A minimal sketch of this kind of test is below.
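For anyone who wants to try the same thing, here's a rough sketch that sends a transcript to an OpenAI-compatible endpoint (Ollama exposes one at http://localhost:11434/v1) with temperature 0. The model tag, file path, and prompt wording are illustrative placeholders, not the commenter's actual setup.

```python
# Minimal sketch of the "show notes" test against an OpenAI-compatible endpoint.
# Assumes a local Ollama server; the model tag and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with open("podcast_transcript.txt") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="gemma3:27b",   # swap in whichever model you are evaluating
    temperature=0,        # deterministic-ish output makes models easier to compare
    messages=[
        {"role": "system", "content": "You write concise podcast show notes."},
        {"role": "user", "content": "Write show notes and a summary for this "
                                    "transcript:\n\n" + transcript},
    ],
)
print(response.choices[0].message.content)
```

Running the same transcript through a few models this way makes it fairly obvious which ones actually followed the conversation and which ones are just paraphrasing fragments.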
Unfortunately that wouldn't help as much as you'd think, since talented AI labs can just watch the public leaderboard, note which models move up and down, and deduce and target whatever the hidden benchmark is testing.
The model performs very poorly in practice, while the benchmarks show it at DeepSeek V3 level. It's not terrible, but in real use it's in a different league from the models it sits right next to (a bit better here, a bit worse there) in the benchmarks.
I’d recommend trying it on Google AI Studio (aistudio.google.com). I am getting exceptional results on a handful of novel problems that require deep domain knowledge and structured reasoning. I’m not able to replicate this performance with Ollama, so I suspect something is a bit off.
Same experience here: On AI Studio, this is easily one of the strongest models I have used, including when compared to proprietary LLMs.
But Ollama and Open WebUI performance is very bad, even when running the FP16 version. I also tried to mirror some of the AI Studio settings (temperature 1 and top-p 0.95) but couldn't get it to produce anything useful.
I suspect there's some bug in the Ollama releases (possibly wrong conversation delimiters?). If this gets fixed, I will definitely start using Gemma 3 27B as my main model.
Update: Unsloth recommends a temperature of 0.1, not 1.0, when using Ollama. I don't know why Ollama would need a 10x lower value, but it definitely helped. I've also read some speculation that there might be an issue with the tokenizer.
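In case it helps others experiment, here's a rough sketch of passing those sampler options straight to Ollama's REST API. The gemma3:27b tag and the prompt are just placeholders; temperature 0.1 follows the Unsloth recommendation above, and top-p 0.95 mirrors the AI Studio default.

```python
# Sketch: override Ollama's sampler settings per-request via the REST API.
# Assumes a local Ollama server with the gemma3:27b tag pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:27b",
        "stream": False,
        "options": {"temperature": 0.1, "top_p": 0.95},
        "messages": [
            {"role": "user", "content": "Summarize the plot of Hamlet in three sentences."}
        ],
    },
)
print(resp.json()["message"]["content"])
```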
Hello! I tried showing it Redis code that hasn't been released yet (llama.cpp 4-bit quants and the official web interface): V3 can reason about the design tradeoffs, but (very understandably) Gemma 3 can't. I also tried to make it write a simple tic-tac-toe Monte Carlo program, and it didn't account for ties, while SOTA models consistently do.
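For context on what "accounting for ties" means here, below is a rough sketch of a Monte Carlo move picker for tic-tac-toe. This is my own illustrative version, not the prompt or any model's output; the key detail is that a full board with no winner must be scored as a draw rather than as a win or a loss.

```python
import random

# Sketch of a Monte Carlo move picker for tic-tac-toe. The point the comment
# makes is that a correct version has to treat a full board with no winner
# as a tie, not silently fold it into the win/loss counts.

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_playout(board, player):
    """Play random moves until someone wins or the board fills up (a tie)."""
    board = board[:]
    while True:
        w = winner(board)
        if w:
            return w
        empty = [i for i, cell in enumerate(board) if cell == ' ']
        if not empty:          # board full, nobody won: this is the tie case
            return 'tie'
        board[random.choice(empty)] = player
        player = 'O' if player == 'X' else 'X'

def best_move(board, player, playouts=200):
    """Score each legal move by average playout result; ties count as 0."""
    opponent = 'O' if player == 'X' else 'X'
    scores = {}
    for move in [i for i, cell in enumerate(board) if cell == ' ']:
        total = 0
        for _ in range(playouts):
            b = board[:]
            b[move] = player
            result = random_playout(b, opponent)
            if result == player:
                total += 1
            elif result == opponent:
                total -= 1
            # a tie contributes 0, but it must not be mistaken for a win or loss
        scores[move] = total / playouts
    return max(scores, key=scores.get)

if __name__ == '__main__':
    board = ['X', 'O', 'X',
             ' ', 'O', ' ',
             ' ', ' ', ' ']
    print("Best move for X:", best_move(board, 'X'))
```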
Can you share all the recommended settings to run this LLM? It is clear that the performance is very good when running on AI Studio. If possible, I'd like to use all the same settings (temperature, top-k, top-p, etc.) in Ollama. AI Studio only shows temperature, top-p, and output length.
I really respect the work that you've done, but I am always surprised when people speak anecdotally about AI models as though it were established truth. It's as if everyone believes they're an expert now, but has nothing of substance to offer beyond gut feelings.
It's as if people don't realize that these models are used for many different purposes, and subjectively one person could find a model amazing while another finds it awful. I just wish we could back up statements like "The model performs very poorly in practice" with actual data, or at least some explanation of how it performed poorly.