I am getting tired of people implementing "deep learning to convert foo into bar" and staking a claim on the name "foo2bar".
It leads to "AI hallucination", where even if "foo2bar" doesn't work, people assume that it's the one right AI for turning foo into bar. When someone gets better at turning foo into bar, the typical response will be "is that just foo2bar?"
This happened absurdly backwards with doc2vec: after word2vec, everyone talked about it as if it were a real thing, until Radim Řehůřek finally made a reasonable implementation under that name.
I'm not sure people will interpret it that way. For example, seq2seq is really just a generic term for an entire class of networks that map sequences to other sequences.
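To make the "generic class" point concrete, here is the bare shape the term seq2seq covers. A minimal PyTorch sketch; all names are mine and nothing here is taken from any particular paper:

    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Any model of this shape 'maps sequences to sequences':
        the term names a whole architecture class, not one system."""
        def __init__(self, in_vocab, out_vocab, hidden=256):
            super().__init__()
            self.embed_in = nn.Embedding(in_vocab, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.embed_out = nn.Embedding(out_vocab, hidden)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, out_vocab)

        def forward(self, src, tgt):
            # Encode the input sequence into a summary state...
            _, state = self.encoder(self.embed_in(src))
            # ...then unroll the decoder from that state.
            out, _ = self.decoder(self.embed_out(tgt), state)
            return self.proj(out)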
"branding" a project that way might make the authors unhappy, but i can totally imagine someone saying in a couple of months "I built a char2wav net with GRUs and deconvolution" or something like that, ie using the word as a generic term...
AlphaGo is interesting because it's the world's best Go player, not because its input is a grid and its output is a move. A terrible Go player has the same format of inputs and outputs.
Describing everything as a "universal function approximator" is misleading if you never look into how good the approximation is. Char2wav, for example, is a neat trick, but clearly wouldn't be used for real speech synthesis.
On "...clearly wouldn't be used for real speech synthesis" - I think it depends what language you are looking at. For languages with many years of linguistic research (e.g. English), it will be hard to beat a good parametric TTS or even a well-engineered concatenative system that encodes years and years of linguistic knowledge/features, and also has the ability for engineers to add pronunciation rules for words which are wrong. But for languages which are not as focused on, there are some gains to be made in my opinion.
See some early work on Romanian (https://www.youtube.com/watch?v=cwnDjq33uMs), and compare to Google Translate TTS (the second-to-last, robotic example). The existing TTS systems in the research community are also pretty good (last example), but I think we are at least competitive, which is interesting given that we are far from TTS experts, especially in all the NLP processing that normally goes into building these systems (https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/qu... for example).
At the very least, the Edinburgh system (http://romaniantts.com/new/) should be the baseline across companies for Romanian. But one of the things we also want to show is that one architecture/approach can generalize easily across several languages. By and large the approach for all our languages is identical, including nearly all hyperparameters (we change one setting for the attention default step size, and I think that is it).
There are tons of languages with poor existing systems, and having something that basically amounts to "record a speaker for a while, write down whatever sentences they are saying, train a big model" and does pretty well could be very useful. We are still exploring which languages this approach works for, but have had a good success rate so far whenever we can find a good openly available dataset. It also opens the door to using existing sentence-level ASR datasets "in reverse", since we don't need timing/alignment information (see the sketch below).
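To illustrate the "in reverse" point: TTS training here only needs sentence-level (text, waveform) pairs, which is exactly what an ASR corpus provides. A minimal sketch, assuming a hypothetical metadata.csv layout (the file names and format are mine, not from any specific corpus):

    import csv
    from pathlib import Path

    def tts_pairs_from_asr(corpus_dir):
        """Read sentence-level (transcript, wav path) pairs out of an
        ASR-style corpus. No timing/alignment information is needed:
        the attention mechanism learns the alignment during training."""
        with open(Path(corpus_dir) / "metadata.csv", newline="") as f:
            for wav_name, transcript in csv.reader(f, delimiter="|"):
                # For ASR the pair is (audio -> text); for TTS we just
                # read the same pair "in reverse" as (text -> audio).
                yield transcript, Path(corpus_dir) / "wavs" / wav_name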
There is also a lot of potential for personalized sound/speakers using speaker interpolation (see the alternate speaker examples in the YouTube video) that we have not explored yet, as well as applications to related sequence generation tasks. I think some of our training tricks are useful for training sequence generation models in general.
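For readers unfamiliar with the idea, speaker interpolation usually means blending learned speaker embeddings. A toy sketch (names and shapes are mine, purely illustrative) of the basic operation:

    import torch

    def interpolate_speakers(emb_a, emb_b, alpha):
        """Linearly blend two learned speaker embeddings.
        alpha = 0.0 reproduces speaker A, 1.0 reproduces speaker B,
        and values in between give a mixed, 'personalized' voice."""
        return (1.0 - alpha) * emb_a + alpha * emb_b

    # Toy demo with random stand-in vectors; real embeddings would
    # come from a learned speaker-embedding table:
    a, b = torch.randn(64), torch.randn(64)
    blended = interpolate_speakers(a, b, 0.3)  # 30% of the way to B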
Some great videos that help put our work in context [1][2].
Many of the synth voices sound to my ear very similar to people who are either drunk or have a brain injury. I'm not complaining; it's an interesting parallel.
Hi, I'm one of the authors. In broad strokes, we pretrained one model (the "Reader") to learn to read text and output vocoder variables, and another model (SampleRNN) to go from these vocoder variables to an audio waveform. Then we finetuned both models together to go from text to speech, end-to-end. The "end product" is a text-to-speech system, but without the need to extract tons of hand-engineered features from the text to generate speech. We also expect that with more training this will be able to overcome the usual vocoder speech "unnaturalness" issues.
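A rough sketch of that two-stage pipeline in PyTorch; the module internals are placeholders of mine, not the actual code (the real Reader uses an attention mechanism, so the output length is not tied to the input length as it is here, and the real SampleRNN is hierarchical and runs at the waveform sample rate):

    import torch.nn as nn

    class ReaderSketch(nn.Module):
        """Stage 1: characters -> vocoder features (pretrained first).
        Placeholder: the real Reader is attention-based."""
        def __init__(self, n_chars, n_feats, hidden=512):
            super().__init__()
            self.embed = nn.Embedding(n_chars, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_feats)

        def forward(self, chars):
            h, _ = self.rnn(self.embed(chars))
            return self.out(h)

    class VocoderRNNSketch(nn.Module):
        """Stage 2 (stand-in for SampleRNN): vocoder features ->
        waveform sample logits. The real model is hierarchical and
        operates at the audio sample rate; this is a placeholder."""
        def __init__(self, n_feats, hidden=512):
            super().__init__()
            self.rnn = nn.GRU(n_feats, hidden, batch_first=True)
            self.sample = nn.Linear(hidden, 256)  # 8-bit mu-law logits

        def forward(self, feats):
            h, _ = self.rnn(feats)
            return self.sample(h)

    # Pretrain each stage against its own targets, then fine-tune the
    # composition end-to-end:
    #   wave_logits = vocoder(reader(char_ids))   # text -> speech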
I think the model just got tired of reading text and decided to mock us :) Just kidding. The attention mechanism got stuck somehow for this sample. This does not happen very often, though. It's important to note the samples we posted were not cherry-picked: they are just the first 10 sentences from our test set.
Regarding the truncation at the end, that was a bug in our sampling code that we just fixed. We will update the samples soon!
Is there any way to artificially induce that failure? I'm an artist and I've been trying to get a handle on ML stuff. Being able to feed speech through this to give it the flat affect of the phoneme-mode samples, or to insert attention failures at specific points, would be extremely useful for a number of projects I have in mind.
We will have a longer paper out soon with more details about the training process, but for now the code is there as well. There were a few small things in training that seem key to getting good results; we are analyzing those and honing the "recipe" now.
This is no joke something I have considered - do you have a source on a pairing of "read speech" and "transcript" for this? I could process the movies myself but that seems... tedious...
It leads to "AI hallucination", where even if "foo2bar" doesn't work, people assume that it's the one right AI for turning foo into bar. When someone gets better at turning foo into bar, the typical response will be "is that just foo2bar?"
This happened absurdly backwards with doc2vec, which after word2vec everyone talked about as if it were a real thing, until Radim Řehůřek finally made a reasonable implementation of it under that name.