CodeGemma has fewer parameters than Llama3, so it absolutely should not be slower. That sounds like a configuration issue.
Meta originally released Llama2 and CodeLlama, and CodeLlama vastly improved on Llama2 for coding tasks. Llama3-8B is okay at coding, but I think CodeGemma-1.1-7b-it is significantly better than Llama3-8B-Instruct, and possibly a little better than Llama3-70B-Instruct, so there is plenty of room for Meta to improve Llama3 in that regard.
I suppose it could be a quantization issue, but both quants are from lmstudio-community. Llama3 does have a different architecture and a bigger tokenizer, which might explain it.
You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
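If you want to measure this yourself rather than eyeball the streaming output, ollama's `/api/generate` endpoint reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), so tokens/s is just a division. A minimal sketch (the model names and prompt are placeholders; adjust to whatever tags you have pulled locally):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: float) -> float:
    """Compute generation speed from ollama's reported counters.

    eval_count: number of tokens generated.
    eval_duration_ns: time spent generating, in nanoseconds.
    """
    return eval_count / (eval_duration_ns / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Query a local ollama server and return tokens/s for one generation.

    Assumes ollama is running on its default port (11434).
    """
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    resp = json.load(urllib.request.urlopen(req))
    return tokens_per_second(resp["eval_count"], resp["eval_duration"])
```

Running `benchmark("llama3:8b-instruct-q8_0", ...)` and the same for a CodeGemma q8_0 tag on the same machine gives a cleaner apples-to-apples number than watching the console.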
> Was there anything official from Meta?
https://ai.meta.com/blog/meta-llama-3/
"The text-based models we are releasing today are the first in the Llama 3 collection of models."
Just a hint that they will be releasing more models in the same family, and CodeLlama3 seems like a given to me.