Hacker News
Bolt: A Practical Binary Optimizer for Data Centers and Beyond (arxiv.org)
89 points by matt_d on July 22, 2018 | hide | past | favorite | 26 comments


Not to take anything away from this (it's great that such a tool is available), but Microsoft had this kind of technology 20 years ago, known as BBT. It's still used in some places, but most teams moved to the profile-guided optimization built into Visual C++ (an overall improvement over BBT). BBT mostly focuses on block/function placement to reduce paging and to separate hot and cold code. Some info here: https://blogs.msdn.microsoft.com/reiley/2011/08/06/microsoft....




Seems odd to have "data center" pitched here, when, as far as I can tell, it is really suitable for any good old large applications. Or does "data center" have some further implications that I'm missing?


Having our laptops finish a workload in 15% less time is nice. Running Facebook on 15% fewer computers is the power consumption of a midsized country. For them, a 1% improvement is a great day. 15% is winning the lottery. Every week. ;-)


It's worth saying that when tons of deployed clients are 15% faster, that adds up to high aggregate energy and time savings, too. But yes, companies tend not to care about that as long as their users' machines can handle the burden of something less optimized. Good enough is good enough. It makes economic sense for them, but logically it doesn't make sense to treat client-side optimizations as less valuable at similar scales.


It's all about economies of scale. If you save 15% on a process on a standalone server, that saving will likely go unnoticed. However, if you have a thousand servers running the same application, you've just saved yourself 150 servers. Scale that up again to Facebook's or Google's size and the savings are enormous.

This is the same reason why Facebook invested into their own PHP compiler and Google created Go.


One of the inputs here is profile samples from numerous production hosts. It’s something that you might only get from a large deployment. Similar to Google’s GWP (the 2nd and 9th references in this paper).


> We have also applied BOLT to GCC and Clang binaries, and our evaluation shows that BOLT speeds up these binaries by up to 15.3%

How is this even possible? Doesn't it mean there are probably just some inefficient or suboptimal compiler settings in play?


The new optimizations are about code layout and, it seems, are based on real-world perf data being fed back into the compilation process. The hot code path wouldn't necessarily be apparent during an initial compile, but with that profiling information present you can arrange the binary to maximize hits in the instruction cache.


Note that "feedback optimization" using real-world profile data has been a standard part of compilers for decades. SPECcpu limits what you're allowed to do with it to avoid over-fitting.


This sounds a lot like JIT compilation techniques, but applied to an already-compiled binary.


I would think it's closer to branch prediction but on a larger scale than just instruction pipelines.

I'm curious if it modifies the original binary.


GCC supports the hot and cold attributes so you can manually instruct the compiler how to arrange compilation sections. They basically set the PGO counts to +/- infinity for those functions.

While nowhere near as nice as real profile guided optimization (and likely to grow stale over time), these attributes are much easier to insert compared to fighting with whatever build toolchain you’re using.


Functions marked cold are also optimized for size, as if with -Os.



If the program takes any time at all then the compiler was not optimal :-)

But really, people don’t appreciate that icache and iTLB misses are absolutely crucial to real-world performance. The cache is everything. Code layout makes a massive difference (regarding other comments I’ve made here lately, this is another reason to hate shared libraries).
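For reference, one way to see those misses is via hardware counters in Linux perf. A sketch only: the event names below are perf's generic aliases and vary by CPU, and ./your-binary is a placeholder for whatever you want to measure:

```shell
# Count icache and iTLB misses alongside total instructions retired,
# to gauge how much front-end stalls hurt this workload.
perf stat -e L1-icache-load-misses,iTLB-load-misses,instructions \
    ./your-binary
```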


Wouldn't shared libraries actually group together instructions that are used at similar times?


Finally. When the tool was first announced, it was only described in a press release on Facebook’s website: no GitHub, no arxiv.

Now it’s properly released.


So facebook runs Non-PIE binaries in production?


Are there any Linux/x86-64 binaries somewhere? (Apparently my 16GB laptop is not enough to compile LLVM...)


Are you not using ninja? Both of my private machines have 16GB and I can compile LLVM without a problem.


I used ninja on my big desktop machine, not my laptop, and half of the link targets were killed by the Linux OOM killer. Linux really became a joke.

Only ninja -j1 helped. My other llvm builds always succeeded the old way: cmake && make -s -j16
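For what it's worth, linking is usually the memory-hungry step, and LLVM's CMake setup has a knob to serialize just the link jobs while compile jobs stay parallel. A sketch, assuming the usual out-of-tree build directory next to ../llvm:

```shell
# LLVM_PARALLEL_LINK_JOBS caps only link steps (requires the Ninja
# generator); compilation still uses every core.
cmake -G Ninja ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_PARALLEL_LINK_JOBS=1 \
    -DLLVM_USE_LINKER=gold   # gold (or lld) needs far less RAM than BFD ld
ninja
```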


Does anybody know if this can also help improve the performance of language interpreters such as CPython?


Sure


If this supported non-PIE I'd try to use it on Anaconda Distribution in a flash.



