
If you look at the "metrics" section on Page 9, it says:

> 2) Metrics: We use three performance metrics: (i) latency, i.e., end-to-end output generation time for a batch of input prompts, (ii) token throughput, i.e., tokens-per-second processed, and (iii) compute throughput, i.e., TFLOPS per GPU.

This is somewhat confusing to me because, for a fixed model, at least two of these three metrics are the same quantity up to a constant factor, and I don't see any way to interpret their claim of ~74 teraflops achieved other than ~211 tokens/second of throughput.

Put another way, 18 tokens per second is ~2% FLOPS utilization, and bulk inference can obviously do better than that.
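To spell out the conversion: decoding one token with a dense transformer costs roughly 2 FLOPs per parameter, so achieved TFLOPS and tokens/sec are two views of the same number. A quick sketch, assuming a 175B-parameter model and the A100's 312 TFLOPS FP16 tensor-core peak (my assumptions, not figures from the paper):

```python
# Back-of-envelope conversion between achieved TFLOPS and tokens/sec.
# Assumes a dense 175B-parameter model (~2 FLOPs per parameter per
# decoded token) and an A100's 312 TFLOPS FP16 tensor-core peak.
N_PARAMS = 175e9
A100_PEAK_FLOPS = 312e12

def tok_per_s_from_tflops(tflops):
    # Tokens/sec implied by a sustained per-GPU TFLOPS figure.
    return tflops * 1e12 / (2 * N_PARAMS)

def flops_utilization(tok_per_s):
    # Fraction of A100 peak implied by a per-GPU tokens/sec figure.
    return tok_per_s * 2 * N_PARAMS / A100_PEAK_FLOPS

print(tok_per_s_from_tflops(74))  # ~211 tok/s
print(flops_utilization(18))      # ~0.02, i.e. ~2%
```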

3x is not huge in this space because just using a 4090 instead of an A100 is a 5x gain.



Maybe 74 TFLOPS is the best they've achieved, but not all 16 GPUs can consistently hit that number? Just guessing. The 211 tokens/sec throughput on a GPU is just insane; it's even better than what a TPU can do on PaLM 540B.


Well, LM-175B is 540/175 = 3.08x smaller, so it makes sense you would get better performance. Also, in Table D.4 it takes them 9.614s to process (128 input + 8 output tokens) = 136 tok * 256 batches = 34,816 tokens with 24 A100s, which is ~150 tok/s/A100. It feels totally plausible that they could hit 211 tok/s, given a 3x-bigger model already does ~150. I think 211 tok/s is in fact a pretty poor showing from them and you could do significantly better.
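A quick check of that arithmetic (note it must be 128 input tokens, since 128 + 8 = 136; the other figures are as quoted from Table D.4):

```python
# Per-GPU throughput implied by the Table D.4 numbers quoted above.
in_tok, out_tok = 128, 8
batches = 256
total_tokens = (in_tok + out_tok) * batches  # 136 * 256 = 34,816 tokens
latency_s = 9.614
n_gpus = 24

tok_per_s_per_gpu = total_tokens / latency_s / n_gpus
print(round(tok_per_s_per_gpu))  # ~151 tok/s/A100
```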


One possible explanation is that they hit the teraflops number during the prefill stage, where you process all prompt tokens at once and arithmetic intensity is much higher, so you can use more compute. Utilization usually drops during the token-generation stage. The TPU's utilization during token generation is 3% at batch size 16 (https://arxiv.org/pdf/2211.05102.pdf, table on the last page).
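The arithmetic-intensity gap is easy to see on the back of an envelope: in prefill, one sweep over the weights serves the whole prompt, while in decode it serves a single token per sequence. A sketch with illustrative numbers (175B fp16 weights assumed, ignoring KV-cache and activation traffic):

```python
# FLOPs per byte of weight traffic when `tokens_per_pass` tokens share
# one sweep over the weights: ~1 for decode, ~prompt_length for prefill.
n_params = 175e9
bytes_per_param = 2  # fp16 weights

def arithmetic_intensity(tokens_per_pass):
    flops = 2 * n_params * tokens_per_pass   # ~2 FLOPs per weight per token
    bytes_read = n_params * bytes_per_param  # every weight read once
    return flops / bytes_read

print(arithmetic_intensity(1))    # decode, batch 1: 1 FLOP/byte
print(arithmetic_intensity(512))  # prefill of a 512-token prompt: 512 FLOP/byte
```

An A100 needs on the order of 150 FLOPs per byte of HBM traffic to be compute-bound (~312 TFLOPS over ~2 TB/s), so decode at intensity ~1 sits deep in the bandwidth-bound regime unless you batch heavily.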


I mean maybe? This seems unlikely. I agree that decode is much more expensive and tok/s depends a lot on what your ratio of decode tokens to prefill tokens is.

This table was very helpful by the way; I hadn't seen it before. To me it clearly shows that 211 tok/s/A100 is very plausible, and in fact kind of a poor showing: if you look at Table D.4, and specifically the results for BS=256 PP3/TP8, they achieve ~150 tok/s/A100 on a model that's 3x larger than GPT-3.


The 4090 has half the memory bandwidth, so it could not get a 5x gain; it would actually run slower on a memory-bound LLM like this.


5x gain per dollar



