
If you look at the "metrics" section on Page 9, it says:

> 2) Metrics: We use three performance metrics: (i) latency, i.e., end-to-end output generation time for a batch of input prompts, (ii) token throughput, i.e., tokens-per-second processed, and (iii) compute throughput, i.e., TFLOPS per GPU.

This is somewhat confusing to me because, for a fixed model, at least two of these three metrics are the same quantity up to a constant factor, and I don't see any way to interpret their claim of ~74 teraflops achieved other than ~211 tokens/second of throughput.

Put another way, 18 tokens per second is ~2% FLOPS utilization, and bulk inference can obviously do better than that.
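To spell out the conversion: decoding one token with a dense transformer costs roughly 2 FLOPs per parameter, so achieved TFLOPS and tokens/sec are two views of the same number. A quick sketch, assuming a 175B-parameter model and the A100's 312 TFLOPS FP16 tensor-core peak (my assumptions, not figures from the paper):

```python
# Back-of-envelope conversion between achieved TFLOPS and tokens/sec.
# Assumes a dense 175B-parameter model (~2 FLOPs per parameter per
# decoded token) and an A100's 312 TFLOPS FP16 tensor-core peak.
N_PARAMS = 175e9
A100_PEAK_FLOPS = 312e12

def tok_per_s_from_tflops(tflops):
    # Tokens/sec implied by a sustained per-GPU TFLOPS figure.
    return tflops * 1e12 / (2 * N_PARAMS)

def flops_utilization(tok_per_s):
    # Fraction of A100 peak implied by a per-GPU tokens/sec figure.
    return tok_per_s * 2 * N_PARAMS / A100_PEAK_FLOPS

print(tok_per_s_from_tflops(74))  # ~211 tok/s
print(flops_utilization(18))      # ~0.02, i.e. ~2%
```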

3x is not huge in this space because just using a 4090 instead of an A100 is a 5x gain.



Maybe 74 TFLOPS is the best they've achieved, but not all 16 GPUs can consistently hit that number? Just guessing. The 211 tokens/sec throughput on a GPU is just insane; it's even better than what a TPU can do on PaLM 540B.


Well, LM-175B is 540/175 = 3.08x smaller, so it makes sense you would get better performance. Also, in Table D.4 it takes them 9.614s to process (128 input + 8 output tokens) = 136 tok * 256 batches = 34,816 tokens with 24 A100s, which is ~150 tok/s/A100. It feels totally plausible that they could hit 211 tok/s, given a 3x-bigger model already does ~150. I think 211 tok/s is in fact a pretty poor showing from them and you could do significantly better.
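A quick check of that arithmetic (note it must be 128 input tokens, since 128 + 8 = 136; the other figures are as quoted from Table D.4):

```python
# Per-GPU throughput implied by the Table D.4 numbers quoted above.
in_tok, out_tok = 128, 8
batches = 256
total_tokens = (in_tok + out_tok) * batches  # 136 * 256 = 34,816 tokens
latency_s = 9.614
n_gpus = 24

tok_per_s_per_gpu = total_tokens / latency_s / n_gpus
print(round(tok_per_s_per_gpu))  # ~151 tok/s/A100
```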


One possible explanation is that they hit the teraflops number during the prefill stage, where you process all prompt tokens at once and arithmetic intensity is much higher, so you can use more compute. Utilization usually drops during the token-generation stage. The TPU's utilization during token generation is 3% at batch size 16 (https://arxiv.org/pdf/2211.05102.pdf, table on the last page).
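The arithmetic-intensity gap is easy to see on the back of an envelope: in prefill, one sweep over the weights serves the whole prompt, while in decode it serves a single token per sequence. A sketch with illustrative numbers (175B fp16 weights assumed, ignoring KV-cache and activation traffic):

```python
# FLOPs per byte of weight traffic when `tokens_per_pass` tokens share
# one sweep over the weights: ~1 for decode, ~prompt_length for prefill.
n_params = 175e9
bytes_per_param = 2  # fp16 weights

def arithmetic_intensity(tokens_per_pass):
    flops = 2 * n_params * tokens_per_pass   # ~2 FLOPs per weight per token
    bytes_read = n_params * bytes_per_param  # every weight read once
    return flops / bytes_read

print(arithmetic_intensity(1))    # decode, batch 1: 1 FLOP/byte
print(arithmetic_intensity(512))  # prefill of a 512-token prompt: 512 FLOP/byte
```

An A100 needs on the order of 150 FLOPs per byte of HBM traffic to be compute-bound (~312 TFLOPS over ~2 TB/s), so decode at intensity ~1 sits deep in the bandwidth-bound regime unless you batch heavily.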


I mean maybe? This seems unlikely. I agree that decode is much more expensive and tok/s depends a lot on what your ratio of decode tokens to prefill tokens is.

This table was very helpful by the way; I hadn't seen it before. To me it clearly shows that 211 tok/s/A100 is very plausible, and in fact kind of a poor showing: if you look at Table D.4, and specifically the results for BS=256 PP3/TP8, they achieve ~150 tok/s/A100 on a model that's 3x larger than GPT-3.


The 4090 has half the memory bandwidth, so it could not get a 5x gain; it would actually run slower on a memory-bound LLM like this.


5x gain per dollar



