Strongly agree with the first part of your post :)
BTW, in addition to the weights, it's also interesting to consider the precision of accumulation. fp16 is just not enough for the large matrix sizes we are now seeing: with long reduction dimensions, the rounding error of a low-precision accumulator grows until small contributions vanish entirely.
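To make this concrete, here's a minimal sketch (not from gemma.cpp, just a NumPy toy) of why an fp16 accumulator fails on long reductions: once the running sum is large enough, each small addend falls below half an ulp and rounds away.

```python
import numpy as np

# Sum 100k copies of 0.01. The true sum is ~1000, well within fp16 range,
# but an fp16 ulp at 32 is 0.03125, so adding 0.01 rounds to nothing
# once the running sum reaches 32 and the accumulator stalls.
n = 100_000
x = np.full(n, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)  # fp16 accumulator: stalls near 32

acc32 = float(np.sum(x.astype(np.float32)))  # fp32 accumulator: ~1000

print(acc16, acc32)
```

This is exactly the regime a large matmul hits: the reduction length (hidden dimension) is in the thousands, so even if weights and activations are low precision, the accumulator needs fp32 (or chunked/compensated summation) to stay accurate.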
(Gemma.cpp TL here.) FYI, we are a research testbed, neither full-featured nor user-centric. Some interesting things there are the fp8 weights and an extremely fast matmul, especially on workstation CPUs, plus some attention to numerics.