Precisely what compilers do for a given flag varies by compiler and by version. Certain optimizations can cause real problems: unfortunately, the OpenJDK VM does have undefined behavior (although we're attempting to gradually reduce it), and the chance of a real problem due to undefined behavior certainly increases with the optimization level, as does the probability of hitting a compiler bug. There have been such issues in the past.
So while building OpenJDK with particular optimizations for your particular hardware might be worthwhile in some cases, it's not for the faint of heart, and it should be done with care and extensive testing. If you intend to deploy such a custom-built OpenJDK in production, you should strongly consider getting the JCK [1] (the Java TCK) and testing your build for conformance (for all the VM configurations you'll use: GC choice, compiler choice etc.).
A safer and easier way to get "free" performance speedups is to use the most recent JDK (currently 13).
> unfortunately, the OpenJDK VM does have undefined behavior, although we're attempting to gradually reduce it
Is it a goal to remove undefined behavior completely? Because I seem to recall certain architectural decisions being completely reliant on undefined behavior, such as signal handlers as implicit null checks…
Does the JVM use assembly for this? I’d expect writing all of the VM code that handles Java objects in assembly for each platform would be a bit difficult, and I’m not sure I saw enough assembly files in the project for this…
I wonder whether Gentoo might have an edge here (even in general, beyond just the JDK), since nearly everything on the system is built with the local CPU as the optimisation target? It certainly reduces the need to not be 'faint of heart' to build things, as it's part of business-as-usual. Whether Gentoo itself requires one to not be faint of heart is definitely an open question!
The intro to the article suggests compiling for your architecture, but then it does several other things besides.
The article benchmarks -Ofast, which is poorly named. It's really -Obroken-by-design. It'll be "faster" but completely break applications.
It also suggests using -fomit-frame-pointer, which destroys debuggability.
-march and -mtune are the parts that the article title and intro actually suggest. While it's possible this matters, I see no evidence that it does. As I understand it, the arch that Java is compiled with is not the same as the one that gets used for JIT compiling.
> [-Ofast will] be "faster" but completely break applications
Except for the jvm, apparently. And everything else I've tried it with.
> It also suggests using -fomit-frame-pointer, which destroys debuggability.
Which is completely useless except for jvm developers.
> As I understand it the arch that Java is compiled with is not the same as the one that gets used for JIT compiling.
The performance of the compiler itself matters, not just the performance of the generated code, because, since it's a JIT, compiler code continues to run.
Right, but the OpenJDK VM (HotSpot) uses three JITs -- C1, C2 and Graal -- and two of them are written in C++, so C++ compiler flags could affect the performance of the JIT compilers, although not of the code they generate. Because the performance of the emitted code is far more important than the performance of the compiler, I doubt that will make a difference, but there are other important parts of the OpenJDK VM that are written in C++ and whose performance might be affected, most notably the GCs.
That's great for the Java code being JITed, but does nothing for the JIT compiler itself, or the garbage collectors, or any other runtime components that are written in C or C++ and compiled (and optimized) ahead of time.
Yeah, I suspect a good chunk of the performance improvement is related to GC performance improvement. The netty benchmark with pooled buffers compared to unpooled buffers hints at that (although I don't know for sure).
Based on `perf`? If so, try passing `--call-graph dwarf` in your `perf record` invocation. Then things should work fine despite lack of frame pointers.
Wait, -fomit-frame-pointer has been the default on GCC x86_64 for a while now, even in debug builds. The debugger is supposed to use unwind tables to traverse the stack. Am I missing something?
It can completely break applications. Particularly those that require a particular level of floating point precision (in which case a fixed-precision library is actually a better choice) and those that use abundant aliasing (which plenty of modern static analyzers can help with). But for the most part I've been using -Ofast for years in a wide variety of applications with no issues.
> It also suggests using -fomit-frame-pointer, which destroys debuggability.
Well, yeah, it's an article about performance optimisations, and disabling this debug mechanism increases performance (or maybe it does; I haven't measured it myself).
If you need to debug, don't optimise for performance this aggressively. Seems a reasonable tradeoff?
Interesting that `-Ofast` & `-O3` were chosen to bench and not `-Os`. In some real-world cases, `-Os` can beat `-O3` because the smaller code size puts less pressure on the caches. This has interesting effects when the workload is highly parallel/threaded rather than sequential. It also reduces the side effects of trying to create fast, possibly unrolled loop code.
This is the wisdom from the old days, and it's still true in many cases.
But I believe compiler developers are trying to keep -O3/-Os working as advertised by their literal meaning, especially when combined with -march and -mtune, so the cases of performance degradation should be fewer by now. For example, by using knowledge of the subarchitecture, compilers can optimize the code for Intel's micro-op fusion rather than performing useless loop unrolling that actually degrades performance.
Phoronix benchmarks since GCC 4.9 have consistently shown -Os to be almost always slower than -O2 and -O3.
The Linux kernel used to prefer -Os everywhere, but it now offers -O2 and -O3 as well. I think there is a measurable performance improvement in some benchmarks.
I'm currently writing a path tracer as a side project, and -Ofast is sickeningly faster than -O2. Like 4-5 times faster. -O3 duplicates all the loops into two versions: one with AVX instructions chunking 8 iterations at a time, and a second scalar version that does the final 1-7 iterations. -Ofast is a little bit faster than -O3 because it generates the approximate rsqrtps and rcpps instructions.
However this is a special case. Most code isn't heavy on crunching massive quantities of floats, and much of the code that does isn't written in a way that gives the compiler the freedom to autovectorize your loops. And a surprisingly large amount of code is still compiled with MSVC which won't vectorize at all.
In gcc, -O2 is generally fastest for general purpose code, both faster than -Os and -O3. As an example of why -O2 is faster than -Os, -O2 will optimize signed integer division by a constant power of two into a handful of bitwise instructions, which are larger than a single (but much, much slower) division instruction. (signed integer division can't be replaced with a single bitshift because negative integers work differently. unsigned integer division can be replaced by a single bitshift.) People think this is a universal optimization, but it's a size tradeoff so -Os specifically eschews it.
You might be misremembering, but strength reduction is part of ‘-O1’ [0].
However, my point is that for a general-purpose language system like OpenJDK, which favors multithreaded, multiprocess, or server workloads, each core/thread and process shares the I-cache/L1 cache. Cache lines are still 64 bytes. Code for these systems is rarely expected to run in isolation. I tend to want to be a good neighbor and reduce code size when I can.
The -Os compilation is some bookkeeping and an idiv instruction; the -O2 compilation is bookkeeping, two shifts, and an and instruction. I don't know exactly what they do because I'm drunk, sorry not sorry, but the one that does the idiv instruction is both slower and smaller, and that's on purpose. (You can tell the idiv one is smaller by clicking the 11010 button to display offsets: the starting instructions of both functions have the same offset, but the final instruction of the -Os compilation is at a substantially smaller offset than the -O2 compilation's.)
The description of -Os which you linked is telling. It says it's -O2 without certain optimizations, and then it also says:
> It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size.
Somewhere buried in that tuning for code size rather than execution speed and further optimizations designed to reduce code size is an optimization that will replace bitwise magic with division statements. -Os does what it says on the tin. It makes your code small. It makes your code slow. It does so on purpose.
People think that -Os is a superset of -O1 and a subset of -O2. It is neither: there are speed optimizations that -Os adds beyond -O1, and there are size optimizations that -Os adds that appear in neither -O1 nor -O2.
The point, I think, of -Os, is for embedded. If you have a size n PROM, and your code compiles to size n+1 with -O2, and if you apply -Os and it compiles to size n, that's a feature. -Os, in my opinion, ought to be uncompromising towards that goal. For better or for worse.
I'll be sober in the morning and can engage with you better then. Sorry.
-Ofast enables unsafe floating-point optimizations that can give a performance boost to applications that don't need strict compliance. For example, FP operations normally cannot be reassociated because of rounding concerns, but -Ofast enables -fassociative-math and allows such optimization. If your program just computes a bunch of numbers and doesn't rely on uncommon features such as NaN or signed zero, it would typically only change the last few (usually irrelevant) significant digits of the result, while greatly speeding up your program.
But the general rule-of-thumb is that -Ofast should not be used, unless you know what the program is doing and how the optimization affects it.
A more meaningful comparison is -O2 vs. -O3 vs. -O3 -march=native -mtune=broadwell. Or run the OpenJDK test suite with -Ofast and see whether there are failed tests.
It's not about what level of optimization is used, so much as what particular CPU generation the target is being optimized toward.
The public binary distributions have to limit themselves to what X86-64 looked like when it first came out in 2003, which means they can't take advantage of any new instructions that were introduced in the past 16 years.
> -march=native and -mtune=broadwell tell the compiler to optimize for your architecture. One would think given the compiler documentation that march implies mtune, but this is apparently not the case.
That sounds like a bug to me which should be reported.
[1]: http://openjdk.java.net/groups/conformance/JckAccess/