I haven't been able to find it again, but a few years ago I read a paper showing that certain prompts massively improved the performance of some LLMs on benchmarks, while those same prompts massively reduced the performance of other LLMs. I assume this is still true, though perhaps not as dramatically as before.