I don't think this level of computational power can be achieved on a modern CPU, or even a GPU! But GPUs are probably the closest analog to Google's absurdly parallel architecture.
To get a GPU working at maximum performance, you have to go with either OpenCL 2.0 or CUDA. Compared to OpenCL 1.2, OpenCL 2.0 has a better atomics model, dynamic parallelism (kernels that can launch kernels), shared virtual memory, and tons of other features.
NVidia of course supports those features in CUDA, but NVidia's OpenCL support is stuck at 1.2. So in effect, CUDA and OpenCL are in competition with each other.
Anyway, that's the current layout of the hardware available to consumers. I think it's reasonable to expect a graphics card in a modern machine, and even Intel's weak integrated GPUs have a parallel-computing advantage over a CPU.
So for high-parallelism tasks like audio analysis or image analysis, it only makes sense to target GPUs today.
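To make the "high-parallelism" point concrete: in image analysis, each output element of a filter depends only on a small neighborhood of inputs, so every element can in principle be computed at once. A minimal NumPy sketch of that pattern (CPU-only and purely illustrative; a GPU would run one thread per output pixel):

```python
import numpy as np

# Toy grayscale "image"
img = np.arange(25, dtype=np.float64).reshape(5, 5)

# 3x3 box blur over the interior: every output pixel is an independent
# average of its neighborhood -- an embarrassingly parallel pattern.
blur = sum(img[1 + dy : 4 + dy, 1 + dx : 4 + dx]
           for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

print(blur.shape)  # (3, 3)
```

Nothing in the computation of one output pixel depends on another, which is exactly the structure GPUs exploit.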
The TPU architecture isn't that weird: it's basically a hardware implementation of matrix multiplication. It also isn't a silver bullet for ASR, where neural networks are usually only used for part of the recognition process.
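To see why a matrix-multiply engine covers so much of a neural net's work: the dominant cost in a dense layer's forward pass is one matmul. A hedged NumPy sketch (shapes and values are illustrative, not any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected layer: activations = relu(x @ W + b).
# The x @ W matrix multiply dominates the cost, which is why
# dedicated matmul hardware accelerates most of NN inference.
x = rng.standard_normal((1, 128))   # batch of one input vector
W = rng.standard_normal((128, 64))  # learned weights
b = rng.standard_normal(64)         # learned bias

activations = np.maximum(x @ W + b, 0.0)  # ReLU nonlinearity
print(activations.shape)  # (1, 64)
```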
It's "weird" in the ways that matter: there's no commodity hardware in existence that replicates what a TPU does. The only place to get TPUs is through Google's cloud services.
CPUs are basically von Neumann architectures. GPUs (NVidia and AMD) are basically SIMD/SIMT systems.
Google's TPU is just something dramatically different: optimized, yes, for matrix multiplication, but it's not something you can buy and use offline.
> but its not something you can buy and use offline.
But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.
> But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.
As far as I can tell, they put a microphone on your phone and then relay your voice to Google's servers for analysis.
Or Amazon's servers, in the case of the Echo.
I don't see any near-term future where Google's TPUs become widely available for consumers: be it on a phone or desktop. And I'm not aware of any product from the major hardware manufacturers that even attempt to replicate Google's TPU architecture.
NVidia and AMD are sorta going the opposite direction: they're making their GPUs more and more flexible (which will be useful in a wider variety of problems), while Google's TPUs specialize further and further into low-precision matrix multiplications.
Is that the point? I ask because the "weird" in the TPU is mostly its scale. It's not like you can't do matrix multiplies with the vector units on a CPU or with a GPU. It's really the scale; by that I mean it has more compute elements than existing hardware, but it's also lower precision, appears less flexible, and is bolted to a heavyweight memory subsystem.
So, in that regard it's no more "weird" than other common accelerators/coprocessors for things like compression.
So, in the end, what would show up in a phone wouldn't really look anything like a TPU. I would expect maybe a lightweight piece of matrix-acceleration hardware, which due to power constraints isn't going to match what a "desktop" FPGA or GPU is capable of, much less a full-blown TPU.
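To make "lower precision" concrete: the first-generation TPU multiplies 8-bit integers rather than floats. A rough NumPy sketch of that style of quantized matmul (the per-tensor scaling scheme here is a simplification for illustration, not Google's actual pipeline):

```python
import numpy as np

def quantize(x, bits=8):
    """Map floats to signed integers with a per-tensor scale factor."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((8, 4))

qa, sa = quantize(a)
qb, sb = quantize(b)

# Integer matmul with a wide accumulator, then rescale back to float.
approx = (qa @ qb) * (sa * sb)
exact = a @ b

print(np.abs(approx - exact).max())  # small quantization error
```

The integer products are cheap in silicon, which is where the power and density advantage comes from, at the cost of a small approximation error.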
Neural networks are used for nearly all of ASR now. Last I heard, only the spectral components were still calculated without a neural net, and text-to-speech is now entirely neural network (i.e. you feed text in and get audio samples out). I'd be surprised if they don't do that for ASR soon, if they haven't already.
Although some models are end-to-end neural nets, most of the ones in production (and all of the ones that get state of the art results) only use a neural net for one part of the process. Lots of people are as surprised as you, but that's the way it is.
Edit: I should say that state-of-the-art systems tend to have multiple components, including multiple neural nets and the tricky "decode graph" that gok and I are talking about. These are trained separately and then stuck together, as opposed to being trained in an end-to-end fashion.
Separating acoustic model and decoding graph search makes sense since you would need a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they used 125,000 hours (after filtering out the badly transcribed ones from the original 500,000 hours of transcribed speech) for training an end-to-end acoustic-to-word model. Good "old-school" DNN acoustic models can already be trained with orders of magnitude less training data (hundreds to thousands of hours).
AFAIK state-of-the-art models are hybrids of HMM/GMM and CNN for phoneme classification. There are exotic CTC/RNN-based architectures for end-to-end recognition, but they aren't state of the art.
You're right that TPUs allow Google to train on very large datasets faster and using less power. But I think a reasonable ASR model should be trainable with GPUs alone.
The issue previously has been a lack of large enough high-quality annotated datasets, and open-source ASR libraries being a bit behind or not well integrated with cutting-edge deep learning. I think that's changing now, though. I hope it won't take too long until pre-trained, reasonably sized, high-accuracy TensorFlow/Kaldi models for many languages are common.
This level of computation can be achieved just fine with a GPU or some coprocessors. What the TPU excels at is performing forward inference very efficiently. So you can't train on it, or perform arbitrary computation that well, but if you have a pre-trained neural net, you can run it very fast and with little power.