It doesn't have to sound human. Fluent, ok, but human?
But no matter what, it's unnecessary to make a tool that can copy anyone's voice. That just lowers the threshold for abuse, while adding almost nothing of value.
Listening to something that sounds non-human for a long period of time is fairly unpleasant; imagine trying to listen to an audiobook or podcast, or dialogue in an animated movie, when the voices are all obviously non-human/robotic. So TTS wouldn't be usable for a lot of cases where we might want it.
And with the way the models work, once you have a model that can sound human, it is unfortunately very easy for it to sound like any individual human as well.
It doesn't have to, and you can use ones that don't; they mostly get the info across. But it is jarring if it doesn't sound human: it comes across as a speech impediment or, as a more generous take, an accent.
The correct inflections, pauses, enunciation, etc. are all important to humans, especially for audiobooks and similar things that need to be immersive.
Otherwise you have to strain to listen, similar to listening to someone with a heavy accent.
Because a model that can imitate a voice is still not capable of that. There's no need to have a model that can do that. A robotic accent is best. Or perhaps you like to see your politicians make all kinds of bizarre statements on YouTube?