Now they claim that 70B performed worse than Llama 3.1 70B (and obviously worse than closed-source alternatives)[1].
Outstanding questions:
- What exactly did they "partially replicate"?
- Why were Redditors able to identify all the details (wrapped Claude, wrapped GPT-4o, initial prompt, details of a finetuned Llama 3.0, not 3.1) while ArtificialAnlys was not?
- Why, after the truth was revealed, do they still write "We are not clear", "We are not clear"?
[1] https://x.com/ArtificialAnlys/status/1832965630472995220