I don't think this level of computational power can be achieved on a modern CPU, or even a GPU! But GPUs are probably the closest analog to Google's absurdly parallel architecture.
To get a GPU working at maximum performance, you have to go with either OpenCL 2.0 or CUDA. Compared to OpenCL 1.2, OpenCL 2.0 has a better atomics model, dynamic parallelism (kernels that can launch kernels), shared virtual memory, and tons of other features.
NVidia of course supports those features in CUDA, but NVidia's OpenCL support is stuck at 1.2. So in effect, CUDA and OpenCL are in competition with each other.
Anyway, that's the current layout of the hardware available to consumers. I think it's reasonable to expect a graphics card in a modern machine, and even Intel's weak integrated GPUs have a parallel-computing advantage over a CPU.
So for high-parallelism tasks like audio analysis or image analysis, it only makes sense to target GPUs today.
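To make the "high-parallelism" point concrete: in image analysis, each output element of a filter depends only on a small neighborhood of inputs, so every element can in principle be computed at once. A minimal NumPy sketch of that pattern (CPU-only and purely illustrative; a GPU would run one thread per output pixel):

```python
import numpy as np

# Toy grayscale "image"
img = np.arange(25, dtype=np.float64).reshape(5, 5)

# 3x3 box blur over the interior: every output pixel is an independent
# average of its neighborhood -- an embarrassingly parallel pattern.
blur = sum(img[1 + dy : 4 + dy, 1 + dx : 4 + dx]
           for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

print(blur.shape)  # (3, 3)
```

Nothing in the computation of one output pixel depends on another, which is exactly the structure GPUs exploit.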
The TPU architecture isn't that weird: it's basically a hardware implementation of matrix multiplication. It also isn't a silver bullet for ASR, where neural networks are usually only used for part of the recognition process.
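To see why a matrix-multiply engine covers so much of a neural net's work: the dominant cost in a dense layer's forward pass is one matmul. A hedged NumPy sketch (shapes and values are illustrative, not any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected layer: activations = relu(x @ W + b).
# The x @ W matrix multiply dominates the cost, which is why
# dedicated matmul hardware accelerates most of NN inference.
x = rng.standard_normal((1, 128))   # batch of one input vector
W = rng.standard_normal((128, 64))  # learned weights
b = rng.standard_normal(64)         # learned bias

activations = np.maximum(x @ W + b, 0.0)  # ReLU nonlinearity
print(activations.shape)  # (1, 64)
```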
It's "weird" in the ways that matter: there's no commodity hardware in existence that replicates what a TPU does. The only place to get TPUs is through Google's cloud services.
CPUs are basically von Neumann architectures. GPUs (NVidia and AMD) are basically SIMD/SIMT systems.
Google's TPU is just something dramatically different: optimized, yes, for matrix multiplication, but it's not something you can buy and use offline.
> but its not something you can buy and use offline.
But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.
> But you will. The entire point is to put this in a phone, so you can distribute a trained neural net in a way that people can actually use without a desktop and $500-$4,000 GPU.
As far as I can tell, they put a microphone on your phone and then relay your voice to Google's servers for analysis.
Or Amazon's servers, in the case of the Echo.
I don't see any near-term future where Google's TPUs become widely available for consumers: be it on a phone or desktop. And I'm not aware of any product from the major hardware manufacturers that even attempt to replicate Google's TPU architecture.
NVidia and AMD are sorta going the opposite direction: they're making their GPUs more and more flexible (which will be useful in a wider variety of problems), while Google's TPUs specialize further and further into low-precision matrix multiplications.
Is that the point? I ask because the "weird" in the TPU is mostly its scale. It's not like you can't do matrix multiplies with the vector units on a CPU or with a GPU. It's really the scale; by that I mean it has more compute elements than existing hardware, but it's also lower precision, appears less flexible, and is bolted to a heavyweight memory subsystem.
So, in that regard it's no more "weird" than other common accelerators/coprocessors for things like compression.
So, in the end, what would show up in a phone wouldn't really look anything like a TPU. I would expect maybe a lightweight piece of matrix-acceleration hardware, which due to power constraints isn't going to match what a "desktop" FPGA or GPU is capable of, much less a full-blown TPU.
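To make "lower precision" concrete: the first-generation TPU multiplies 8-bit integers rather than floats. A rough NumPy sketch of that style of quantized matmul (the per-tensor scaling scheme here is a simplification for illustration, not Google's actual pipeline):

```python
import numpy as np

def quantize(x, bits=8):
    """Map floats to signed integers with a per-tensor scale factor."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((8, 4))

qa, sa = quantize(a)
qb, sb = quantize(b)

# Integer matmul with a wide accumulator, then rescale back to float.
approx = (qa @ qb) * (sa * sb)
exact = a @ b

print(np.abs(approx - exact).max())  # small quantization error
```

The integer products are cheap in silicon, which is where the power and density advantage comes from, at the cost of a small approximation error.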
Neural networks are used for nearly all of ASR now. Last I heard, only the spectral components were still calculated without a neural net, and text-to-speech is now entirely neural network (i.e. you feed text in and get audio samples out). I'd be surprised if they don't do that for ASR soon, if they haven't already.
Although some models are end-to-end neural nets, most of the ones in production (and all of the ones that get state of the art results) only use a neural net for one part of the process. Lots of people are as surprised as you, but that's the way it is.
Edit: I should say that state-of-the-art systems tend to have multiple components, including multiple neural nets and the tricky "decode graph" that gok and I are talking about. These are trained separately and then stuck together, as opposed to being trained in an end-to-end fashion.
Separating acoustic model and decoding graph search makes sense since you would need a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they used 125,000 hours (after filtering out the badly transcribed ones from the original 500,000 hours of transcribed speech) for training an end-to-end acoustic-to-word model. Good "old-school" DNN acoustic models can already be trained with orders of magnitude less training data (hundreds to thousands of hours).
AFAIK state-of-the-art models are hybrids of HMM/GMM and CNN for phoneme classification. There are exotic CTC/RNN-based architectures for end-to-end recognition, but they aren't state of the art.
You're right that TPUs allow Google to train on very large datasets faster and using less power. But I think a reasonable ASR model should be trainable with GPUs alone.
The issue previously has been a lack of large enough high-quality annotated datasets, and open-source ASR libraries being a bit behind or not well integrated with cutting-edge deep learning. I think that's changing now, though. I hope it won't take too long until pre-trained, reasonably sized, high-accuracy TensorFlow/Kaldi models for many languages are common.
This level of computation can be achieved just fine with a GPU or some coprocessors. What the TPU excels at is performing forward inference very efficiently. So you can't train on it, or perform arbitrary computation that well, but if you have a pre-trained neural net, you can run it very fast and with little power.