This design is utter computational tripe that completely ignores Amdahl's Law or any notion of data-parallelism.
His 1% error comes from relying on 7-bit logarithmic floating point used in a task-parallel manner. This is a non-starter for HPC. While most theoretical models certainly have 1% or greater underlying errors in accuracy, a 1% or worse error in precision is going to doom these algorithms to numerical instability without Herculean additional effort that would obliterate the computational advantage here.
Neural networks? See The Vanishing Gradient Problem.
Molecular dynamics? It's numerically unstable without 48-bit or better force accumulation as proven by D.E. Shaw.
NVIDIA, AMD, and Intel have invested a huge sum in manycore processors that are already too hard to program for most engineers. These processors are a cakewalk in comparison to what's proposed here.
Finally, even if you did find a task amenable to this architecture (and I'd admit there may be some computer vision tasks that might work here), where's the data bus that could keep it fed? We're already communication-limited with GPUs for a lot of tasks. Why do we even need such a wacky architecture?
Computational approximation has been shipped en masse very successfully.
You mention GPUs but did you know that GPUs already do a lot of approximate math, for example, fast reciprocal and fast reciprocal square root?
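For a concrete flavor of that trade-off, here is the classic software bit-trick version of fast reciprocal square root (the well-known Quake-style approximation — a sketch of the general technique, not NVIDIA's actual hardware implementation, which is its own design):

```python
import struct

def fast_rsqrt(x: float) -> float:
    # Reinterpret the float's bits as an integer and shift/subtract to get a
    # crude initial guess (the famous 0x5F3759DF trick, 32-bit form).
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson step roughly doubles the number of correct bits.
    y = y * (1.5 - 0.5 * x * y * y)
    return y

print(fast_rsqrt(4.0))  # ~0.4992 vs the exact 0.5 -- about 0.17% off
```

One refinement step gets you into the tenths-of-a-percent range; a second step would push the error down toward full single precision.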
You mention how approximation must be impossible in all these applications (because REASONS), but all methods that numerically integrate some desired function are doing refined approximation anyway. If you have another source of error that lives inside the integration step, it may be fine so long as your refinement can still bring the error to zero as the number of steps increases.
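A toy version of that argument: a midpoint-rule integrator where every single function evaluation carries zero-mean 1% relative noise (the noise model is an assumption for illustration, not anything from the talk). The discretization error shrinks like 1/n², and the noise contribution averages out like 1/√n, so total error still falls as steps increase:

```python
import random

def noisy_midpoint(f, a, b, n, noise=0.01, rng=None):
    # Midpoint rule where each f(x) is corrupted by zero-mean relative
    # noise -- a stand-in for inexact low-precision arithmetic.
    rng = rng or random.Random(0)
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += f(x) * (1.0 + rng.gauss(0.0, noise))
    return total * h

# Integrate x^2 on [0, 1]; exact answer is 1/3.
exact = 1.0 / 3.0
for n in (10, 1000, 100000):
    err = abs(noisy_midpoint(lambda x: x * x, 0.0, 1.0, n) - exact)
    print(n, err)
```

The catch, of course, is that this only works when the error is unbiased; a systematic 1% bias in one direction does not integrate away.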
Your diagnosis of "utter computational tripe" and the accompanying vitriol seem completely inappropriate.
Really? I don't know about GPUs? That's news to me! Did you know that the precision of the fast reciprocal square root on NVIDIA GPUs is 1 ulp out of 23 bits? That's a world of difference away from 1 ulp out of 7 bits. I wouldn't touch a 7-bit floating point processor. Life is too damned short for that.
And that's because I have spent days chasing and correcting dynamic range errors that doomed HPC applications that tried to dump 64-bit double-precision for 32-bit floating point. It turns out in the end that while you can do this, you often need to accumulate 32-bit quantities into a 64-bit accumulator. Technically, D.E. Shaw demonstrated you can do it with 48 bits, but who makes 48-bit double precision units?
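The failure mode is easy to demonstrate without any HPC code at all. Once a float32 running sum reaches 2²⁴, adding 1.0 does literally nothing (the addend rounds away), while a 64-bit accumulator over the same float32 inputs is exact:

```python
import struct

def as_f32(x: float) -> float:
    # Round a 64-bit Python float to the nearest 32-bit float.
    return struct.unpack('<f', struct.pack('<f', x))[0]

big = as_f32(2.0 ** 24)  # 16777216: here the float32 spacing becomes 2.0

# float32 accumulator: 16777216 + 1 rounds (ties-to-even) back to 16777216,
# so every one of these additions is silently lost.
acc32 = big
for _ in range(1000):
    acc32 = as_f32(acc32 + 1.0)

# Same addends, 64-bit accumulator: nothing is lost.
acc64 = float(big)
for _ in range(1000):
    acc64 += 1.0

print(acc32, acc64)  # 16777216.0 16778216.0
```

This is exactly the accumulate-narrow-into-wide pattern described above, and it is why the accumulator width matters far more than the width of the individual operands.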
I stand by the computational tripe definition (with the caveat that Hershel has now posted an app where this architecture is possibly optimal). My objections are to the broad, extraordinary claims made in the presentation above.
And hey, you're a game developer, let me give you an analogy: would you develop a software renderer these days if you were 100% constrained to relying on mathematical operations on signed chars? It's doable, but would you bother? Start with Chris Hecker's texture mapper from Game Developer back in the 1990s; I'm guessing madness would ensue shortly thereafter. Evidence: HPC apps on GPUs that rely entirely on 9-bit subtexel precision to get crazy 1000x speedups over traditional CPU interpolation do not generally produce the same results as the CPU. If the result is visual, it's usually OK. If it's quantitative, no way.
Snark aside, I agree broadly with the points you're making here. This isn't especially groundbreaking: it uses the fact that logarithmic number representations don't require much area to implement if you don't need high accuracy and are willing to trade latency for throughput (something FPGA programmers have been taking advantage of since forever), and then goes shopping for algorithms that can still run correctly in such an environment.
>Neural networks? See The Vanishing Gradient Problem.
For what it's worth, Geoff Hinton and others have had a lot of success in the last few years by intentionally injecting noise into their neural network computations. The best-known example is [0], but he's also had some success just adding noise to the calculation of the gradient in good old sigmoid MLPs--even reducing the information content of the gradient below a single bit in some cases.
Going back further, stochastic gradient descent and other algorithms have shown for decades that trading accuracy for speed can be very desirable for neural network computations.
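A minimal sketch of that decades-old observation — gradient descent on a simple quadratic still converges when every gradient is corrupted by zero-mean noise (the 5% noise level and the 1/t step schedule are arbitrary choices for illustration):

```python
import random

rng = random.Random(42)

# Minimize (w - 3)^2 using a gradient with 5% zero-mean relative noise --
# the basic mechanism that makes SGD tolerant of inexact arithmetic.
w = 0.0
for t in range(1, 5001):
    grad = 2.0 * (w - 3.0)
    grad *= 1.0 + rng.gauss(0.0, 0.05)  # noisy gradient
    w -= (0.5 / t) * grad               # decaying step size
print(w)  # ends up very close to the minimizer w = 3.0
```

The decaying step size is what lets the iterates average the noise away; with a fixed step size you instead converge to a noise ball around the optimum.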
Except the degree of error introduced by techniques like "dropout" and "maxout" is a hyperparameter under user control, as are the many possible implementations of stochastic gradient descent.
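For reference, this is roughly what "inverted dropout" looks like (a generic sketch, not any particular framework's implementation); the point is that the drop probability p is an explicit, tunable knob, not a property baked into the silicon:

```python
import random

def dropout(activations, p, rng):
    # Zero each unit with probability p and rescale the survivors by
    # 1/(1-p) so the expected activation is unchanged ("inverted dropout").
    # p is a hyperparameter the practitioner tunes per layer.
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
```

At inference time you simply skip the mask; nothing about the hardware changes.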
What's being suggested here seems equivalent to saying that because you love the crunchy torched top of creme brulee, why not go ahead and torch the whole thing?
As for contrastive divergence with Restricted Boltzmann Machines overcoming the vanishing gradient problem: that's true, but has anyone even tried doing this with 7-bit floating point and demonstrated that it works? I'm assuming recurrent neural networks relied on at least 32-bit floating point (correct me if I'm wrong). A reply to me flickered on here briefly indicating they haven't built a processor based on this yet, which the author then deleted.
I think this would be a much more interesting architecture if it stopped trying to reinvent floating point. Or, put another way: "if it ain't broke, don't 'fix' it."
>As for contrastive divergence with Restricted Boltzmann Machines overcoming the vanishing gradient problem. That's true, but has anyone even tried doing this with 7-bit floating point and demonstrated it even works?
I wasn't referring to RBMs here, although it does look like I misremembered the details. See slides 74-76 of this presentation for a brief sketch [0]. In these feed-forward MLPs, the information content from each neuron is capped at one bit (which is quite a bit less than seven).
I seem to recall him saying something similar (and with more detail) in a Google Tech Talk a few months later [1], although I'm not sure.
Indeed, but I quote: "The pre-training uses stochastic binary units. After pre-training we cheat and use backpropagation by pretending that they are deterministic units that send the REAL-VALUED OUTPUTS of logistics."
So, without trying to sound like a pedant, the manner in which the error is introduced is effectively a hyperparameter, no? The skills involved in doing this correctly will land you the big bucks these days.
My problem is that I think the programming experience of an entire hardware stack based on 7-bit floating point would be abysmal and that companies that tried to port their software stack to it would die horribly (to be fair they'd deserve it though). OTOH If they'd just stick to the limited domains where this sort of thing is applicable, my perception of it would flip 180 degrees - it's probably a pretty cool image preprocessor. What it's not is the replacement for traditional CPUs.
But the people behind this seem to have utter contempt for the engineers tasked with programming their magical doohickey, or putting this in TLDR terms:
1. Scads of 7-bit floating point processors
2. ???
3. PROFIT!!!
>Neural networks? See The Vanishing Gradient Problem.
Only on deep networks. And as another comment mentioned, NNs can withstand and even benefit from noise.
There are numerous ways around it as well. You could just run the network several times (probably in parallel) and average the results together. Same with most of the other problems. It does defeat the purpose a little bit, but it would still retain most of the speedup. You could train multiple networks and average them together (also very successful even in deterministic nets). Perhaps the network itself could be made more redundant or error tolerant.
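The run-it-several-times idea is just the usual 1/√n variance reduction. A sketch under an assumed zero-mean 1% noise model (the noise model and the "10.0" target are made up for illustration):

```python
import random
import statistics

rng = random.Random(1)
true_value = 10.0

def noisy_run():
    # One pass through hypothetical 1%-error hardware.
    return true_value * (1.0 + rng.gauss(0.0, 0.01))

# Averaging k independent runs shrinks the noise roughly by sqrt(k).
for k in (1, 16, 256):
    avg = statistics.fmean(noisy_run() for _ in range(k))
    print(k, abs(avg - true_value))
```

Note the same caveat as always: averaging only kills unbiased noise. A systematic bias in the arithmetic survives any number of repeated runs.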
Hell if the thing is as fast as promised, you could even do genetic algorithms and get reasonable performance.
This is the sort of broad statement I put in the same fool file as "We used simulated annealing because it's guaranteed to converge." Bonus points for lobbing the genetic algorithm grenade into the mix(1).
In the former case, it's the judicious use of noise rather than basing the entire application on an architecture designed to inject 1% error into every calculation (because that seems to me to be a really good definition of programmer hell). In the latter case, simulated annealing only converges assuming you run it for INFINITE TIME. But that doesn't stop people from saying this, or even Jonathan Blow (someone I otherwise respect for his indie game dev creds and success therein) from making an equally uninformed statement such as equating 1% computational error to 0.00001% error.
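For the record, those two error levels are about five orders of magnitude apart, which is exactly the 23-bit-versus-7-bit mantissa gap discussed above (treating a 1-ulp error as roughly 2^-bits relative error; the precise figure depends on rounding details):

```python
# Approximate worst-case relative error of a 1-ulp result at two mantissa
# widths: prints ~0.00001% for 23 bits and 0.78125% for 7 bits.
for bits in (23, 7):
    rel = 2.0 ** -bits
    print(f"{bits}-bit mantissa: ~{rel * 100:.5f}% relative error")
```

So "1% error" and "0.00001% error" are not interchangeable figures of speech; they are different regimes.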
Getting useful information out of Neural Networks requires extreme attention to detail. If it didn't, Google wouldn't be paying up to $10M a head for people who can do this - the DeepMind acquihire.
Imagine the alternative here: I train a cluster of GPUs or FPGAs and I get useful results months before the engineer working with 1% error everywhere gets a network implementation that even converges. I then design and tape out custom ASICs to do this 10x faster within a year. See bitcoin mining for a successful example of this approach.
1. My doctorate made extensive use of genetic algorithms, I'm very familiar with their strengths and weaknesses.
I never claimed anything about simulated annealing, and I know the noise is less than ideal, but adjustments can be made. Really, my point stands: there are plenty of ways around it, and with the supposed 10,000x speedup it would be a massive advantage.
So now we're down to ~5x better performance in exchange for that programming nightmare.
And just to be thorough, if you're hellbent on nontraditional arithmetic, here's an error-free 32-bit integer ALU made with 1696 transistors, that would be >3x more efficient than the architecture here (assuming all you care about is throughput):
So now we're talking at best 1/3 to 1/4 the performance of what they could get if they dropped logarithmic math and went to something more predictable. Someone call Vinod Khosla, we're gonna be rrrriccchhhh...