Summary: the author implemented a sophisticated compiler algorithm (loop header recognition) in a straightforward, canonical way in C++, Java, Scala, and Go, and measured performance and memory usage. Then (the most interesting part of the paper) he had colleagues who were experts in each language write highly tuned versions, and reported on what it took to optimize in each language.
So C++ wins the performance test by a large margin. Scala follows (3.6x slower), then 64-bit Java (5.8x), Go (7x), and finally 32-bit Java (12.6x).
Scala without any optimization is still better than Java with all the ninja-skills black magic applied. Optimized Scala is pretty good: "just" 2.5x slower than C++.
Actually, it's odd. They say that in the Java tuning they made some simple optimizations that got it on par with the original C++ version, but then refused to do anything further, noting instead that the same C++ optimizations would have applied to Java as well.
Does that mean the Java version would have been just as fast? In general, though, it seems that Java is just as fast as C++ unless you are an ultimate C++ expert.
Regarding the Java version: "Note that Jeremy deliberately refused to optimize the code further, many of the C++ optimizations would apply to the Java version as well"
From a performance perspective, the VM is what dominates, and Java code sits closer to the VM's assumptions about what it's going to run.
If you read the paper, you'll note that the Scala author significantly changed the structure of the algorithm to conform with the way Scala does things (recursion, etc.). So it's sort of an apples-to-oranges comparison as far as Scala's concerned. Too bad they couldn't write ugly Scala code that would give a better comparison.
I only spent an hour straightening out his use of Go. Had I known he was going to publish it externally as "Go Pro", I would have tried to turn it into a real Go program, and perhaps even optimized it somewhat. Overall, I think the Java and C++ code got the most attention from language experts. Ah well.