It doesn't have to sound human. Fluent, ok, but human? But no matter what, it's ...

moyix · on Jan 9, 2023

Listening to something that sounds non-human for a long period of time is fairly unpleasant; imagine trying to listen to an audiobook or podcast, or dialogue in an animated movie, when the voices are all obviously non-human/robotic. So TTS wouldn't be usable for a lot of cases where we might want it.

And with the way the models work, once you have a model that can sound human, it is unfortunately very easy for it to sound like any individual human as well.

MrOwnPut · on Jan 9, 2023

It doesn't have to, and you can use ones that don't; they mostly get the info across. But it is jarring if it doesn't sound human: it comes across as a speech impediment or, as a more generous take, an accent.

The correct inflections, pauses, enunciation, etc. are all important to humans, especially for audiobooks and similar things that need to be immersive.

Otherwise you have to strain to listen, similar to listening to someone with a heavy accent.

tgv · on Jan 9, 2023

> especially so for audio books

Perhaps real human readers can help?

> The correct inflections, pauses

Because a model that can imitate a voice is still not capable of that. There's no need to have a model that can do that. A robotic accent is best. Or perhaps you like to see your politicians make all kinds of bizarre statements on YouTube.

MrOwnPut · on Jan 9, 2023

> Perhaps real human readers can help?

Certainly, and they do, but computers can narrate much faster, without attrition, and at much lower cost.

Generally you'll have human recordings for what you can, and TTS for anything missing.

There are a lot of books, live streams, podcasts, articles, etc. in the world.

> There's no need to have a model that can do that.

I wouldn't say there's no need; otherwise we wouldn't talk that way. It's a human element, and the generated speech is for human ears.

> A robotic accent is best.

Best is subjective because that's a human preference. That being said, I'd say your preference is a far outlier.

Most people's "best" would be what they are used to hearing: the speech of native speakers.