Not to take anything away from this (it's great that such a tool is available), but Microsoft had this kind of technology 20 years ago, known as BBT. It's still used in some places, but most teams moved to the profile-guided optimization built into Visual C++, an overall improvement over BBT. BBT mostly focuses on block/function placement to reduce paging and separate hot from cold code. Some info here: https://blogs.msdn.microsoft.com/reiley/2011/08/06/microsoft....
Seems odd to have "data center" pitched here, when, as far as I can tell, it is really suitable for any good old large applications. Or does "data center" have some further implications that I'm missing?
Having our laptops finish a workload in 15% less time is nice. Running Facebook on 15% fewer computers is the power consumption of a midsized country. For them, a 1% improvement is a great day.
15% is winning the lottery. Every week. ;-)
It's worth saying that when tons of deployed clients are 15% faster, that adds up to huge aggregate energy and time savings too. But yes, companies tend not to care about that as long as their users' machines can handle the burden of something less optimized. Good enough is good enough. For them it makes economic sense, but logically it doesn't make sense to treat client-side optimizations as less valuable at similar scales.
It's all about economies of scale. If you save 15% on a process on a standalone server, that saving will likely go unnoticed. However, if you have a thousand servers running the same application, you've just saved yourself 150 servers. Scale that up again to Facebook's / Google's / etc. scale and the savings are enormous.
This is the same reason why Facebook invested into their own PHP compiler and Google created Go.
One of the inputs here is profile samples from numerous production hosts. It’s something that you might only get from a large deployment. Similar to Google’s GWP (the 2nd and 9th references in this paper).
The new optimizations are about code layout and, it seems, are based on real-world perf data fed back into the compilation process. The hot code paths wouldn't necessarily be apparent to an initial compile, but with that profiling information present you can arrange your binary to maximize hits in the code cache.
Note that "feedback optimization" using data from actual real-world perf data has been a standard part of compilers for decades. SPECcpu limits what you're allowed to do with it to avoid over-fitting.
GCC supports the hot and cold attributes so you can manually instruct the compiler how to arrange compilation sections. They basically set the PGO counts to +/- infinity for those functions.
While nowhere near as nice as real profile guided optimization (and likely to grow stale over time), these attributes are much easier to insert compared to fighting with whatever build toolchain you’re using.
If the program takes any time at all then the compiler was not optimal :-)
But really, people don’t appreciate that icache and iTLB misses are absolutely crucial to real-world performance. The cache is everything. Code layout makes a massive difference (regarding other comments I’ve made here lately, this is another reason to hate shared libraries).