
menaerus on Feb 26, 2025 | on: DeepSeek Open Source FlashMLA – MLA Decoding Kerne...

True. I wrote some software that does these calculations for me, in addition to the ones I already had on paper. I confused two different graphs. So the final number would be ~0.6 GFLOPS (self-attention across heads) + ~0.15 GFLOPS (attention) + ~1 GFLOPS (ffwd), which in total is ~2 GFLOPS per layer, give or take.

Bandwidth-wise, the ~1GB number I previously gave was also wrong (llama3-70B has 8 KV heads). With more precise calculations, that figure is ~0.6 GB per layer.

So, at batch_size=1, FP8 precision, 1024 tokens, during the decode phase with KV-cache, we need ~2 GFLOPS of compute and ~0.6 GB of bandwidth per layer. Still looks compute-bound to me.

rfoo on March 4, 2025

> Still looks compute-bound to me.

H100 has 3.3TB/s HBM bandwidth on paper, and ~1000TFLOPS bf16 compute on paper. That's 1:300. 0.6GB vs ~2GFLOPS is 1:3. Tell me how this is compute bound?

(also, your number, even after accounting for GQA, is still off. You usually can't store kvcache in fp8.)
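
For readers who want to check the ratio argument above, here is a minimal roofline-style sketch. It takes the per-layer figures quoted in the thread (~2 GFLOP of work, ~0.6 GB of memory traffic) and the on-paper H100 peaks (~1000 TFLOPS bf16, 3.3 TB/s HBM) as given inputs rather than re-deriving them. The function names are hypothetical, and the model assumes the kernel could reach peak throughput on both axes, which real kernels do not.

```python
# Roofline-style check: compare the time a layer would need if it were limited
# only by compute against the time it would need if limited only by memory.
# All inputs are the approximate figures quoted in the thread (assumptions).

def arithmetic_intensity(flop: float, bytes_moved: float) -> float:
    """FLOP performed per byte moved to or from memory."""
    return flop / bytes_moved

def machine_balance(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOP per byte the hardware can sustain at peak on both axes."""
    return peak_flops / peak_bandwidth

def classify(flop: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> str:
    """Label the workload by whichever idealized time is larger."""
    compute_time = flop / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return "compute-bound" if compute_time > memory_time else "memory-bound"

# Per-layer decode figures from the parent comment (batch_size=1, 1024 tokens).
LAYER_FLOP = 2e9        # ~2 GFLOP of work
LAYER_BYTES = 0.6e9     # ~0.6 GB of memory traffic

# On-paper H100 peaks quoted in the reply.
H100_FLOPS = 1000e12    # ~1000 TFLOPS bf16
H100_BW = 3.3e12        # 3.3 TB/s HBM

intensity = arithmetic_intensity(LAYER_FLOP, LAYER_BYTES)   # roughly 3.3 FLOP/byte
balance = machine_balance(H100_FLOPS, H100_BW)              # roughly 300 FLOP/byte
compute_time_us = LAYER_FLOP / H100_FLOPS * 1e6             # idealized compute time
memory_time_us = LAYER_BYTES / H100_BW * 1e6                # idealized memory time
verdict = classify(LAYER_FLOP, LAYER_BYTES, H100_FLOPS, H100_BW)

print(f"intensity: {intensity:.2f} FLOP/byte, balance: {balance:.0f} FLOP/byte")
print(f"compute: {compute_time_us:.1f} us, memory: {memory_time_us:.1f} us")
print(verdict)
```

Under these inputs the workload's intensity sits about two orders of magnitude below the machine balance, so the memory term dominates. Raising the bytes moved (for example, a KV cache held at 16 bits instead of FP8) only pushes the result further in the same direction.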
