We've now partially replicated Reflection Llama 3.1 70B's eval claims

Lockal · on Sept 9, 2024

And the tweet is gone after public uproar.

Now they claim that the 70B model saw worse performance than Llama 3.1 70B (and obviously worse than closed-source alternatives)[1].

Outstanding questions:

- What exactly did they "partially replicate"?

- Why were Redditors able to identify all the details (wrapped Claude, wrapped GPT-4o, initial prompt, details of finetuned Llama 3.0, not 3.1) while ArtificialAnlys was not?

- Why, after the truth was revealed, do they still write "We are not clear", "We are not clear"?

[1] https://x.com/ArtificialAnlys/status/1832965630472995220