Precisely what compilers do for a given flag varies by compiler and by version. Certain optimizations can cause real problems: unfortunately, the OpenJDK VM does have undefined behavior (although we're attempting to gradually reduce it), and the chance of a real problem due to undefined behavior certainly increases with the optimization level, as does the probability of hitting a compiler bug. There have been such issues in the past.
So while building OpenJDK with particular optimizations for your particular hardware might be worthwhile in some cases, it's not for the faint of heart, and it should be done with care and extensive testing. If you intend to deploy such a custom-built OpenJDK in production, you should strongly consider getting the JCK [1] (the Java TCK) and testing your build for conformance (for all the VM configurations you'll use: GC choice, compiler choice etc.).
A safer and easier way to get "free" performance speedups is to use the most recent JDK (currently 13).
> unfortunately, the OpenJDK VM does have undefined behavior, although we're attempting to gradually reduce it
Is it a goal to remove undefined behavior completely? Because I seem to recall certain architectural decisions being completely reliant on undefined behavior, such as signal handlers as implicit null checks…
Does the JVM use assembly for this? I’d expect writing all of the VM code that handles Java objects in assembly for each platform would be a bit difficult, and I’m not sure I saw enough assembly files in the project for this…
I wonder whether Gentoo might have an edge here (even in general, beyond just the JDK), since nearly everything on the system is built with the local CPU as the optimisation target? It certainly reduces the need to not be 'faint of heart' to build things, as it's part of business-as-usual. Whether Gentoo itself requires one to not be faint of heart is definitely an open question!
The intro to the article suggests compiling for your architecture, but then it does several other things besides.
The article benchmarks -Ofast, which is poorly named. It's really -Obroken-by-design. It'll be "faster" but completely break applications.
It also suggests using -fomit-frame-pointer, which destroys debuggability.
-march and -mtune are the parts that the article title and intro actually suggest. While it's possible this matters, I see no evidence that it does. As I understand it, the arch that Java is compiled with is not the same as the one that gets used for JIT compiling.
> [-Ofast will] be "faster" but completely break applications
Except for the jvm, apparently. And everything else I've tried it with.
> It also suggests using -fomit-frame-pointer, which destroys debuggability.
Which is completely useless except for jvm developers.
> As I understand it the arch that Java is compiled with is not the same as the one that gets used for JIT compiling.
The performance of the compiler itself matters, not just the performance of the generated code, because, since it's a JIT, compiler code continues to run.
Right, but the OpenJDK VM (HotSpot) uses three JITs -- C1, C2 and Graal -- and two of them are written in C++, so C++ compiler flags could affect the performance of the JIT compilers, although not of the code they generate. Because the performance of the emitted code is far more important than the performance of the compiler, I doubt that will make a difference, but there are other important parts of the OpenJDK VM that are written in C++ and whose performance might be affected, most notably the GCs.
That's great for the Java code being JITed, but does nothing for the JIT compiler itself, or the garbage collectors, or any other runtime components that are written in C or C++ and compiled (and optimized) ahead of time.
Yeah, I suspect a good chunk of the performance improvement is related to GC performance improvement. The netty benchmark with pooled buffers compared to unpooled buffers hints at that (although I don't know for sure).
Based on `perf`? If so, try passing `--call-graph dwarf` in your `perf record` invocation. Then things should work fine despite lack of frame pointers.
Wait, -fomit-frame-pointer has been the default on GCC x86_64 for a while now, even in debug builds. The debugger is supposed to use unwind tables to traverse the stack. Am I missing something?
It can completely break applications. Particularly those that require a particular level of floating point precision (in which case a fixed-precision library is actually a better choice) and those that use abundant aliasing (which plenty of modern static analyzers can help with). But for the most part I've been using -Ofast for years in a wide variety of applications with no issues.
> It also suggests using -fomit-frame-pointer, which destroys debuggability.
Well, yeah, it's an article about performance optimisations, and disabling this debug mechanism increases performance (or maybe it does; I haven't measured it myself).
If you need to debug, don't optimise for performance this aggressively. Seems a reasonable tradeoff?
Interesting that `-Ofast` & `-O3` were chosen to bench and not `-Os`. In some real-world cases, `-Os` can beat `-O3` because the smaller code size puts less pressure on the caches. This has interesting effects when the workload is highly parallel/threaded rather than sequential. It also reduces the side effects of trying to create fast, possibly unrolled loop code.
This is the wisdom from the old days, and it's still true in many cases.
But I believe compiler developers are trying to keep -O3/-Os working as advertised by their literal meaning, especially when combined with -march and -mtune, so the cases of performance degradation should be fewer by now. For example, by using knowledge of the subarchitecture, compilers can optimize the code for Intel's micro-op fusion rather than performing useless loop unrolling that actually degrades performance.
Phoronix benchmarks since GCC 4.9 have consistently shown -Os to be almost always slower than -O2 and -O3.
The Linux kernel used to prefer -Os everywhere, but it now offers -O2 and -O3 as well. I think there is a measurable performance improvement in some benchmarks.
I'm currently writing a path tracer as a side project, and -Ofast is sickeningly faster than -O2. Like 4-5 times faster. -O3 duplicates all the loops into two versions: one with AVX instructions chunking 8 iterations at a time, and a second scalar version that does the final 1-7 iterations. -Ofast is a little bit faster than -O3 because it generates the approximate rsqrtps and rcpps instructions.
However this is a special case. Most code isn't heavy on crunching massive quantities of floats, and much of the code that does isn't written in a way that gives the compiler the freedom to autovectorize your loops. And a surprisingly large amount of code is still compiled with MSVC which won't vectorize at all.
In gcc, -O2 is generally fastest for general purpose code, both faster than -Os and -O3. As an example of why -O2 is faster than -Os, -O2 will optimize signed integer division by a constant power of two into a handful of bitwise instructions, which are larger than a single (but much, much slower) division instruction. (signed integer division can't be replaced with a single bitshift because negative integers work differently. unsigned integer division can be replaced by a single bitshift.) People think this is a universal optimization, but it's a size tradeoff so -Os specifically eschews it.
You might be misremembering, but strength reduction is part of ‘-O1’ [0].
However, my point is that for a general-purpose language system like OpenJDK, which favors multithreaded, multiprocess, or server workloads, each core/thread and process shares the I-cache/L1 cache. Cache lines are still 64 bytes. Code for these systems is rarely expected to run in isolation. I tend to want to be a good neighbor and reduce code size when I can.
The -Os compilation is some bookkeeping and an idiv instruction; the -O2 compilation is bookkeeping, two shifts, and an and instruction. I don't know exactly what they do because I'm drunk, sorry not sorry, but the one that does the idiv instruction is both slower and smaller, and that's on purpose. (You can tell the idiv one is smaller by clicking the 11010 button to display offsets: the starting instructions of both functions have the same offset, but the final instruction of the -Os compilation is at a substantially smaller offset than the -O2 compilation's.)
The description of -Os which you linked is telling. It says it's -O2 without certain optimizations, and then it also says:
> It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size.
Somewhere buried in that tuning for code size rather than execution speed and further optimizations designed to reduce code size is an optimization that will replace bitwise magic with division statements. -Os does what it says on the tin. It makes your code small. It makes your code slow. It does so on purpose.
People think that -Os is a superset of -O1 and a subset of -O2. It is neither: there are speed optimizations that -Os adds beyond -O1, and there are size optimizations that -Os adds that appear in neither -O1 nor -O2.
The point, I think, of -Os, is for embedded. If you have a size n PROM, and your code compiles to size n+1 with -O2, and if you apply -Os and it compiles to size n, that's a feature. -Os, in my opinion, ought to be uncompromising towards that goal. For better or for worse.
I'll be sober in the morning and can engage with you better then. Sorry.
-Ofast enables unsafe floating-point optimizations that can give a performance boost to applications that don't need strict compliance. For example, FP operations normally cannot be reassociated because of rounding concerns, but -Ofast enables -fassociative-math and allows such optimization. If your program just computes a bunch of numbers and doesn't rely on uncommon features such as NaN or signed zero, it would typically only change the last few (usually irrelevant) significant digits of the result, while greatly speeding up your program.
But the general rule-of-thumb is that -Ofast should not be used, unless you know what the program is doing and how the optimization affects it.
A more meaningful comparison is -O2 vs. -O3 vs. -O3 -march=native -mtune=broadwell. Or run the OpenJDK test suite with -Ofast and see whether there are failed tests.
It's not about what level of optimization is used, so much as what particular CPU generation the target is being optimized toward.
The public binary distributions have to limit themselves to what X86-64 looked like when it first came out in 2003, which means they can't take advantage of any new instructions that were introduced in the past 16 years.
> -march=native and -mtune=broadwell tell the compiler to optimize for your architecture. One would think given the compiler documentation that march implies mtune, but this is apparently not the case.
That sounds like a bug to me which should be reported.
[1]: http://openjdk.java.net/groups/conformance/JckAccess/