Not to take anything away from this (it's great that such a tool is available), but Microsoft had this kind of technology 20 years ago, known as BBT. It's still used in some places, but most teams moved to the profile-guided optimization built into Visual C++, an overall improvement over BBT. BBT mostly focuses on block/function placement to reduce paging and separate hot from cold code. Some info here: https://blogs.msdn.microsoft.com/reiley/2011/08/06/microsoft....
Seems odd to have "data center" pitched here, when, as far as I can tell, it is really suitable for any good old large applications. Or does "data center" have some further implications that I'm missing?
Having our laptops finish a workload in 15% less time is nice. Running Facebook on 15% fewer computers is the power consumption of a midsized country. For them, a 1% improvement is a great day.
15% is winning the lottery. Every week. ;-)
It's worth saying that when tons of deployed clients are 15% faster, that adds up to huge aggregate energy and time savings too. But yes, companies tend not to care about that as long as their users' machines can handle the burden of something less optimized. Good enough is good enough. For them it makes economic sense, but logically it doesn't make sense to treat client-side optimizations as less valuable at similar scales.
It's all about economies of scale. If you save 15% on a process on a standalone server, that saving will likely go unnoticed. However, if you have a thousand servers running the same application, you've just saved yourself 150 servers. Scale that up again to Facebook's / Google's / etc. scale and the savings are enormous.
This is the same reason why Facebook invested into their own PHP compiler and Google created Go.
One of the inputs here is profile samples from numerous production hosts. It’s something that you might only get from a large deployment. Similar to Google’s GWP (the 2nd and 9th references in this paper).
The new optimizations are about code layout and, it seems, are based on real-world perf data fed back into the compilation process. The hot code paths wouldn't necessarily be apparent to an initial compile, but with that profiling information present you can arrange your binary to maximize hits in the code cache.
Note that "feedback optimization" using data from actual real-world perf data has been a standard part of compilers for decades. SPECcpu limits what you're allowed to do with it to avoid over-fitting.
GCC supports the hot and cold attributes so you can manually instruct the compiler how to arrange compilation sections. They basically set the PGO counts to +/- infinity for those functions.
While nowhere near as nice as real profile guided optimization (and likely to grow stale over time), these attributes are much easier to insert compared to fighting with whatever build toolchain you’re using.
If the program takes any time at all then the compiler was not optimal :-)
But really, people don’t appreciate that icache and iTLB misses are absolutely crucial to real-world performance. The cache is everything. Code layout makes a massive difference (regarding other comments I’ve made here lately, this is another reason to hate shared libraries).