I find it hard to trust post training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out because it should be the easiest thing to automatically run a suite of benchmarks
One thing to consider is that this version is a new architecture, so it’ll take time for Llama CPP to get updated. Similar to how it was with Qwen Next.
There are a bunch of 4bit quants in the GGUF link and the 0xSero has some smaller stuff too. Might still be too big and you'll need to ungpu poor yourself.
Except this is GLM 4.7 Flash which has 32B total params, 3B active. It should fit with a decent context window of 40k or so in 20GB of ram at 4b weights quantization and you can save even more by quantizing the activations and KV cache to 8bit.
yes, but the parrent link was to the big glm 4.7 that had a bunch of ggufs, the new one at the point of posting did not, nor does it now. im waiting for unsloth guys for the 4.7 flash
This user has also done a bunch of good quants:
https://huggingface.co/0xSero