> It's about if the model will pay attention to it, in the Transformers sense, which it doesn't always do.

Right... which is why the "canary" idea doesn't make much sense. The fact that the model isn't paying attention to the canary instruction doesn't demonstrate that it has stopped paying attention to some other instruction that is relevant to the task; it proves nothing. If anything, a better-performing model should pay less attention to the canary, since the canary becomes less and less relevant as the context fills with task-relevant tokens.
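The dilution argument can be illustrated with a toy single-query softmax attention calculation. This is a hedged sketch with made-up scores (no real model assigns these numbers): a single canary token with a fixed score competes against a growing number of higher-scoring task tokens, and its share of the attention mass shrinks as the context grows.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def canary_attention(n_task_tokens, canary_score=1.0, task_score=2.0):
    # Attention weight assigned to a single canary token when
    # n_task_tokens more-relevant (higher-scoring) task tokens
    # are also in the context. Scores are illustrative only.
    scores = [canary_score] + [task_score] * n_task_tokens
    return softmax(scores)[0]

# As the context fills with task-relevant tokens, the canary's
# attention weight drops, even though its own score is unchanged:
# canary_attention(100) < canary_attention(10)
```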



> it proves nothing

Correct, but I'm not sure anyone actually claimed it proved anything. To be honest, I don't know what you're arguing for or against here.



