My understanding is that they trained a separate model specifically to estimate when it has enough context to begin translating, as a skilled translator would.
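For context, the classic baseline for this in simultaneous translation is a wait-k policy: lag a fixed k words behind the speaker before emitting anything. A minimal sketch of that simpler fixed-k variant (translate_next here is a toy stand-in, not their actual model; their learned estimator would replace the fixed k):

    # Wait-k sketch: emit a target word only once the source
    # runs k words ahead, then stay k words behind per step.

    def translate_next(source_prefix, n_emitted):
        # toy "model": echo the next source word
        # (a real system would run an NMT decoder here)
        return source_prefix[n_emitted]

    def wait_k(source_stream, k=3):
        source, emitted = [], 0
        for word in source_stream:          # words arrive one at a time
            source.append(word)
            if len(source) - emitted >= k:  # enough lookahead: emit one word
                yield translate_next(source, emitted)
                emitted += 1
        while emitted < len(source):        # speaker done: flush the rest
            yield translate_next(source, emitted)
            emitted += 1

    print(list(wait_k("the file is on the desk".split(), k=2)))

The point of learning the policy instead of fixing k is exactly the tradeoff discussed here: wait too little and you commit before the context arrives, wait too long and the conversation lags.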
My mom used to do English/French translation. Her favorite example was the word "file". That word has multiple translations in French depending on the context, and that context may simply be implied by who is speaking. You may not be able to figure it out from the conversation alone.
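To make that concrete, here's a toy illustration (the French senses are real; the context_hint lookup is a made-up placeholder for whatever signal disambiguates in practice):

    # "file" maps to several unrelated French words; only context decides.
    SENSES = {
        "computer": "fichier",  # a file on disk
        "folder":   "dossier",  # a case file / dossier
        "tool":     "lime",     # a nail file
        "queue":    "file",     # a line of people ("file d'attente")
    }

    def translate_file(context_hint):
        # placeholder: a real system infers the sense from surrounding
        # words, the speaker, and the situation -- none of which a
        # word-by-word translator sees
        return SENSES.get(context_hint, "???")

    print(translate_file("computer"))  # fichier
    print(translate_file(None))        # ??? ambiguous without context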
So then we'd need something like Neuralink to get the whole thought from one's brain first; the sentences could then be processed with full context and translated before the speech is delivered.
Are most thoughts in language? This doesn't reflect my experience. Language floats on top, but there are layers underneath. You can also feel it when you end up thinking in another language. It does not go through the first one but is a thing of its own.
Pretty sure there is nothing universal there, though, as you say.
So no matter what, conversation in your native language would have to be delayed before translation.