Using ik_llama.cpp to run a 27B 4bpw quant on a RTX 3090, I get 1312 tok/s PP an... | Hacker News


throwdbaaway 77 days ago | on: How to run Qwen 3.5 locally

Using ik_llama.cpp to run a 27B 4bpw quant on an RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context. The 35B A3B is faster, but it didn't do too well in my limited testing.
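To put the figures quoted above in proportion, here is a minimal Python sketch that works out the relative slowdown between zero context and 40960 context. The helper name `relative_drop` and the variable names are made up for illustration; only the four throughput numbers come from the comment, and the sketch says nothing about how ik_llama.cpp measures them.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional slowdown going from `before` to `after` (both in tok/s)."""
    return (before - after) / before


# Figures quoted in the comment (27B 4bpw quant on an RTX 3090).
pp_zero, pp_40k = 1312.0, 1009.0  # prompt processing (PP), tok/s
tg_zero, tg_40k = 40.7, 36.2      # token generation (TG), tok/s

pp_drop = relative_drop(pp_zero, pp_40k)
tg_drop = relative_drop(tg_zero, tg_40k)

print(f"PP drop at 40960 context: {pp_drop:.1%}")
print(f"TG drop at 40960 context: {tg_drop:.1%}")
```

By this arithmetic, PP slows by roughly 23% and TG by roughly 11% at 40960 context, so generation speed holds up better than prompt processing as the context fills.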

ranger_danger 76 days ago

With regular llama.cpp on a 3070 Ti I get 60 tok/s TG with the 9B model. It's quite impressive.
