I am getting tired of people implementing "deep learning to convert foo into bar" and staking a claim on the name "foo2bar".
It leads to "AI hallucination", where even if "foo2bar" doesn't work, people assume that it's the one right AI for turning foo into bar. When someone gets better at turning foo into bar, the typical response will be "is that just foo2bar?"
This happened absurdly backwards with doc2vec: after word2vec, everyone talked about it as if it were a real thing, until Radim Řehůřek finally made a reasonable implementation under that name.
I'm not sure people will interpret it that way. For example, seq2seq is really just a generic term for an entire class of networks that map sequences to other sequences.
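To make the "generic class" point concrete, here is the bare shape the term seq2seq covers. A minimal PyTorch sketch; all names are mine and nothing here is taken from any particular paper:

    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Any model of this shape 'maps sequences to sequences':
        the term names a whole architecture class, not one system."""
        def __init__(self, in_vocab, out_vocab, hidden=256):
            super().__init__()
            self.embed_in = nn.Embedding(in_vocab, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.embed_out = nn.Embedding(out_vocab, hidden)
            self.decoder = nn.GRU(hidden, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, out_vocab)

        def forward(self, src, tgt):
            # Encode the input sequence into a summary state...
            _, state = self.encoder(self.embed_in(src))
            # ...then unroll the decoder from that state.
            out, _ = self.decoder(self.embed_out(tgt), state)
            return self.proj(out)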
"branding" a project that way might make the authors unhappy, but i can totally imagine someone saying in a couple of months "I built a char2wav net with GRUs and deconvolution" or something like that, ie using the word as a generic term...
AlphaGo is interesting because it's the world's best Go player, not because its input is a grid and its output is a move. A terrible Go player has the same format of inputs and outputs.
Describing everything as a "universal function approximator" is misleading if you never look into how good the approximation is. Char2wav, for example, is a neat trick, but clearly wouldn't be used for real speech synthesis.
On "...clearly wouldn't be used for real speech synthesis" - I think it depends what language you are looking at. For languages with many years of linguistic research (e.g. English), it will be hard to beat a good parametric TTS or even a well-engineered concatenative system that encodes years and years of linguistic knowledge/features, and also has the ability for engineers to add pronunciation rules for words which are wrong. But for languages which are not as focused on, there are some gains to be made in my opinion.
See some early work on Romanian (https://www.youtube.com/watch?v=cwnDjq33uMs), and compare to Google Translate TTS (the second-to-last, robotic example). The existing TTS systems in the research community are also pretty good (last example), but I think we are at least competitive, which is interesting given that we are far from TTS experts, especially in all the NLP processing that normally goes into building these systems (https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/qu... for example).
At the very least, the Edinburgh system (http://romaniantts.com/new/) should be the baseline across companies for Romanian. But one of the things we also want to show is that one architecture/approach can generalize easily across several languages. By and large the approach for all our languages is identical, including nearly all hyperparameters (we change one setting for the attention default step size, and I think that is it).
There are tons of languages with poor existing systems, and having something that basically amounts to "record a speaker for a while, write down whatever sentences they are saying, train a big model" and does pretty well could be very useful. We are still exploring which languages this approach works for, but have had a good success rate so far whenever we can find a good openly available dataset. It also opens the door to using existing sentence-level ASR datasets "in reverse", since we don't need timing/alignment information (see the sketch below).
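To illustrate the "in reverse" point: TTS training here only needs sentence-level (text, waveform) pairs, which is exactly what an ASR corpus provides. A minimal sketch, assuming a hypothetical metadata.csv layout (the file names and format are mine, not from any specific corpus):

    import csv
    from pathlib import Path

    def tts_pairs_from_asr(corpus_dir):
        """Read sentence-level (transcript, wav path) pairs out of an
        ASR-style corpus. No timing/alignment information is needed:
        the attention mechanism learns the alignment during training."""
        with open(Path(corpus_dir) / "metadata.csv", newline="") as f:
            for wav_name, transcript in csv.reader(f, delimiter="|"):
                # For ASR the pair is (audio -> text); for TTS we just
                # read the same pair "in reverse" as (text -> audio).
                yield transcript, Path(corpus_dir) / "wavs" / wav_name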
There is also a lot of potential for personalized sound/speakers using speaker interpolation (see the alternate speaker examples in the YouTube video) that we have not explored yet, as well as applications to related sequence generation tasks. I think some of our training tricks are useful for training sequence generation models in general.
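For readers unfamiliar with the idea, speaker interpolation usually means blending learned speaker embeddings. A toy sketch (names and shapes are mine, purely illustrative) of the basic operation:

    import torch

    def interpolate_speakers(emb_a, emb_b, alpha):
        """Linearly blend two learned speaker embeddings.
        alpha = 0.0 reproduces speaker A, 1.0 reproduces speaker B,
        and values in between give a mixed, 'personalized' voice."""
        return (1.0 - alpha) * emb_a + alpha * emb_b

    # Toy demo with random stand-in vectors; real embeddings would
    # come from a learned speaker-embedding table:
    a, b = torch.randn(64), torch.randn(64)
    blended = interpolate_speakers(a, b, 0.3)  # 30% of the way to B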
Some great videos that help put our work in context [1][2].
Many of the synth voices sound to my ear very similar to people who are either drunk or have a brain injury. I'm not complaining; it's an interesting parallel.
Hi, I'm one of the authors. In broad strokes, we pretrained one model (the "Reader") to learn to read text and output vocoder variables, and another model (SampleRNN) to go from these vocoder variables to an audio waveform. Then we finetuned both models together to go from text to speech, end-to-end. The "end product" is a text-to-speech system, but without the need to extract tons of hand-engineered features from the text to generate speech. We also expect that with more training this will be able to overcome the usual vocoder speech "unnaturalness" issues.
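A rough sketch of that two-stage pipeline in PyTorch; the module internals are placeholders of mine, not the actual code (the real Reader uses an attention mechanism, so the output length is not tied to the input length as it is here, and the real SampleRNN is hierarchical and runs at the waveform sample rate):

    import torch.nn as nn

    class ReaderSketch(nn.Module):
        """Stage 1: characters -> vocoder features (pretrained first).
        Placeholder: the real Reader is attention-based."""
        def __init__(self, n_chars, n_feats, hidden=512):
            super().__init__()
            self.embed = nn.Embedding(n_chars, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_feats)

        def forward(self, chars):
            h, _ = self.rnn(self.embed(chars))
            return self.out(h)

    class VocoderRNNSketch(nn.Module):
        """Stage 2 (stand-in for SampleRNN): vocoder features ->
        waveform sample logits. The real model is hierarchical and
        operates at the audio sample rate; this is a placeholder."""
        def __init__(self, n_feats, hidden=512):
            super().__init__()
            self.rnn = nn.GRU(n_feats, hidden, batch_first=True)
            self.sample = nn.Linear(hidden, 256)  # 8-bit mu-law logits

        def forward(self, feats):
            h, _ = self.rnn(feats)
            return self.sample(h)

    # Pretrain each stage against its own targets, then fine-tune the
    # composition end-to-end:
    #   wave_logits = vocoder(reader(char_ids))   # text -> speech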
I think the model just got tired of reading text and decided to mock us :) Just kidding. The attention mechanism got stuck somehow for this sample. This does not happen very often, though. It's important to note the samples we posted were not cherry-picked: they are just the first 10 sentences from our test set.
Regarding the truncation at the end, that was a bug in our sampling code that we just fixed. We will update the samples soon!
Is there any way to artificially induce that failure? I'm an artist and I've been trying to get a handle on ML stuff. Being able to feed speech through this to give it the flat affect of the phoneme-mode samples, or to insert attention failures at specific points, would be extremely useful for a number of projects I have in mind.
We will have a longer paper out soon with more details about the training process, but for now the code is there as well. There were a few small things in training that seem key to getting good results; we are analyzing those and honing the "recipe" now.
This is no joke something I have considered - do you have a source on a pairing of "read speech" and "transcript" for this? I could process the movies myself but that seems... tedious...
It leads to "AI hallucination", where even if "foo2bar" doesn't work, people assume that it's the one right AI for turning foo into bar. When someone gets better at turning foo into bar, the typical response will be "is that just foo2bar?"
This happened absurdly backwards with doc2vec, which after word2vec everyone talked about as if it were a real thing, until Radim Řehůřek finally made a reasonable implementation of it under that name.