The compiler is always right (blog.mozilla.org)
88 points by nkurz on May 11, 2014 | 47 comments


The compiler is always right... except when it isn't.

Really, the compiler is usually right. Fair enough. But I've run into compiler bugs, and that's just in internships! (And I'm not even counting the compiler I helped write...)

And sometimes the issue isn't even a bug but a design deficiency. Those are hard to deal with because they're real problems, but also exist for real reasons. I ran into one of these when compiling OCaml to JavaScript and getting stack overflows—turns out the compiler couldn't handle mutual recursion properly. And that's because it's a hard problem! And yet, it also meant the compiler was wrong.

Thinking about it, perhaps the title could be read as "the compiler is always right, even when it's wrong". Just like the ultimate arbiter in a car crash is momentum, not the rules of traffic, if you hit a compiler bug, "fixing" your program is probably much easier than fixing the compiler—even if you're in the right! That's what we ended up doing with js_of_ocaml, by choosing a different backend for our parser generator that didn't create mutually recursive code.


To be fair, the original author does acknowledge at the end that the title is hyperbole.

Still, given that the compiler almost always seems to be right in these cases, it’s surprising just how often even our trusty C compilers do have bugs. There is at least one well established project trying to formally verify a compiler to prevent that problem, apparently with considerable success already:

http://compcert.inria.fr/

Having once spent a lot of time tracking down a bug in some floating point code, only to find that the optimizer had tried to use one more floating point register than it actually had available, something like ten levels down the call stack from where the error was being observed, I wish such projects the best of luck and look forward to a time when all our development tools are similarly trustworthy.


I think in this context the author is referring to mainstream (GNU, MS, Intel) compilers, which are very mature and have thousands of tests that they must pass to be eligible for release.

Writing a compiler is tricky, and it is almost an absolute certainty that it will have a bug :)


Nope, even mature compilers are buggy.

I find it amusing how often compiler authors have to work around bugs in other compilers. See: http://search.gmane.org/?query=MSVC+2012&author=&group=gmane...

I don't mean to pick on MSVC, and many of those are not-yet-implemented issues; it is just the easiest compiler to search for. I've found (and even fixed a few) enough bugs in LLVM/Clang to realize that no compiler is perfect or even close! =)


Well, of course they can be buggy, they are written by humans, after all :)

But in my time as a programmer, I think I've seen only one bug that was a genuine compiler bug, and it was in floating point arithmetic optimisations.

As for LLVM/Clang, well, I'd hesitate to call them mature (even though Apple insists on doing so).


I think that js_of_ocaml 2.0 added support for compiling mutually recursive functions using trampolines. So that compiler is back to being right again.
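For anyone unfamiliar with the technique, here is a minimal sketch of the trampoline idea in C (purely illustrative, with made-up names; js_of_ocaml's actual scheme operates on its own intermediate representation). Instead of two functions tail-calling each other and growing the stack, each returns a description of the next call, and a driver loop performs the calls one at a time:

    #include <stdbool.h>
    #include <stdio.h>

    /* A "bounce" describes the next call to make instead of making it directly. */
    enum target { EVEN, ODD, DONE };
    struct bounce { enum target next; long n; bool result; };

    static struct bounce is_even(long n) {
        if (n == 0) return (struct bounce){ DONE, 0, true };
        return (struct bounce){ ODD, n - 1, false };   /* defer the "tail call" */
    }

    static struct bounce is_odd(long n) {
        if (n == 0) return (struct bounce){ DONE, 0, false };
        return (struct bounce){ EVEN, n - 1, false };
    }

    /* The trampoline: bounce between the two functions in a loop,
     * so the native stack never grows with the recursion depth. */
    static bool trampoline(struct bounce b) {
        while (b.next != DONE)
            b = (b.next == EVEN) ? is_even(b.n) : is_odd(b.n);
        return b.result;
    }

    int main(void) {
        printf("%d\n", trampoline(is_even(1000000)));  /* prints 1, no stack overflow */
        return 0;
    }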


Having recently done research on finding miscompilations in production compilers[1], hunting down a couple hundred bugs in GCC and LLVM, I feel much more skeptical about this matter these days.

Modern compilers are big, complex systems, and naturally have bugs (to be fair, given the complexity and aggressiveness of the optimizations, the quality of GCC is extremely admirable).

[1]: http://mehrdadafshari.com/emi/paper.pdf (check out the example bugs in the paper. They are amusing.)

P.S. The "it's your code, not the compiler" mindset does not generally apply to compilers targeting embedded platforms.


Oh wow, those example compiler bugs are really interesting. It turns out that aggressive removal of code that invokes undefined behaviour can combine with common, safe optimisations to cause misoptimisation: even though the code output by the earlier optimisation passes was safe due to implementation details, a subsequent pass decided that, because the optimised code's behaviour wasn't defined by the C standard, it could assume the code never executed and remove chunks of it. Nasty.
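A toy sketch of that general mechanism (my own example with made-up names, not one of the bugs from the paper): dereferencing a NULL pointer is undefined behaviour, so once a dereference precedes a NULL check, an optimizer is free to conclude the check can never fire and delete it.

    #include <stdio.h>

    struct conf { int flags; };

    static int get_flags(struct conf *c) {
        int flags = c->flags;    /* undefined behaviour if c == NULL */
        if (c == NULL)           /* an optimizer may delete this check:
                                    "c was already dereferenced, so it
                                    cannot be NULL here" */
            return -1;
        return flags;
    }

    int main(void) {
        struct conf ok = { 42 };
        printf("%d\n", get_flags(&ok));   /* 42 */
        printf("%d\n", get_flags(NULL));  /* crashes or returns garbage;
                                             the author intended -1 */
        return 0;
    }

On a platform where address 0 happens to be readable (an implementation detail), the unoptimized code still returns -1 for a NULL pointer as intended; the optimized code silently returns whatever it read from address 0 instead.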


Anyone who has participated in algorithm competitions like TopCoder or Codeforces knows this: compiler optimizers have bugs pretty often.

In algorithmic programming competitions, the hard part of a task is usually coming up with a solution with good asymptotic complexity; the solution itself is fairly easy to implement and the code is rather short (usually 70-300 LOC written in under an hour).

Given that high ratio of program complexity to program size, the competitive programming community manages to find bugs in the GCC optimizer several times a year.

Sometimes compilers do have bugs :) Not very often though.

http://codeforces.ru/blog/entry/1059?locale=en

http://codeforces.ru/blog/entry/11450?locale=en

http://codeforces.ru/blog/entry/2068?locale=en

http://codeforces.ru/blog/entry/1993#comment-40700

http://codeforces.ru/blog/entry/1840?locale=en

http://codeforces.ru/blog/entry/3742

etc


For beginners, it is probably best to assume the compiler is right. For the rest of us, the compiler is usually right, except when it's not. See the extensive work of John Regehr on automated compiler bug discovery and related compiler work: http://blog.regehr.org/archives/category/compilers


My favorite part is where his process found a non-trivial number of bugs in a research compiler that was proved correct.


Which part is that? I had a look through the archives but couldn't spot a post which looked like that.


Here is one paper that describes it: http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf It is the CompCert compiler, which is formally verified. It looks as if the errors found were in the unverified portion. Those errors have since been fixed and the verified surface has been expanded.


Absolutely. I utilised John Regehr's work on this at my previous job: it uncovered a bunch of bugs ranging from compiler crashes to inconsistent output between optimisation levels and other compiler switches. I wish I had got the chance to do more with it and to provide him with the detailed feedback he requested on the nature of the bugs, but I never did :(


Regehr came to the Bay Area Rust Meetup a few days ago and talked to us about all this: https://air.mozilla.org/rust-meetup-may-2014/


In my experience, if I've eliminated every other possibility, it's worth actually reading the assembly generated by the compiler. Most compilers are pretty good, but they do contain bugs, and if I completely rule out the possibility that the compiler is generating bad code, I can waste days searching for the problem. Some examples:

1. When compiling certain highly-optimized routines in Quake II, Microsoft Visual C++ generated tail calls that popped the current stack frame before recursing, even though a pointer to a stack variable had already escaped into a global variable. Now, I'm sure there's some technical reason why a C compiler is allowed to generate garbage code here (there usually is—Google "nasal demons"), but it's still easier to debug if you assume the compiler is untrustworthy. (A sketch of this pattern appears after this list.)

2. The Rubinius FFI had problems passing doubles to external functions: https://github.com/rubinius/rubinius/commit/1122c1c26b81c969...

3. I've broken quite a few in-house and experimental compilers. My favorite error message came from a Lisp compiler: "Lost value of variable 't' in anonymous lambda in anonymous lambda in top-level form."
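Here is a rough sketch of the pattern in item 1 (hypothetical code, not the actual Quake II source): a pointer to a stack variable escapes into a global before a tail call, so popping the caller's frame to make the call leaves the global dangling. Whether a compiler is technically allowed to do this is the debatable part; the shape of the code is roughly this:

    int *g_escaped;

    static int consume(int depth) {
        if (depth == 0)
            return *g_escaped;   /* expects setup()'s 'local' to still be alive */
        return consume(depth - 1);
    }

    static int setup(int depth) {
        int local = 42;
        g_escaped = &local;      /* address of a stack variable escapes */
        return consume(depth);   /* if this tail call pops setup()'s frame,
                                    g_escaped now points into reused stack */
    }

    int main(void) { return setup(10) == 42 ? 0 : 1; }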

Now, this sort of thing is rare. If you're using a mature production compiler in CS class, you can assume with near-100% certainty that any problems are your fault. But if you're maintaining a 250K-line program, you upgrade your compiler, and you see some mysterious new backtraces showing up in the crash reporter, then it's entirely possible your compiler is buggy.

As a handy rule of thumb, if you've been staring at a function in a debugger for more than an hour, and you know it's correct, and something weird is still happening, it's time to choose "Show Assembly" and see what's actually going on. It may still be your fault, but it's time to stop trusting the abstractions provided by the compiler and debug your real program—the one the computer actually runs.


> you upgrade your compiler, and you see some mysterious new backtraces showing up in the crash reporter, then it's entirely possible your compiler is buggy.

IME, there are two more likely causes of this, both rooted in the same underlying issue:

1. Your code depended on a compiler bug to function, the bug is fixed, and now your code is broken

2. Your code depended on an undefined part of the language semantics, the compiler changed from doing one undefined thing to another, and now your code is broken
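Case 2 in particular often shows up with signed integer overflow. A minimal hypothetical sketch (my example, not from this thread): the check below "worked" for years on compilers that simply wrapped around, until a newer optimizer started exploiting the fact that signed overflow is undefined.

    #include <limits.h>
    #include <stdio.h>

    /* Intended as "x + 1, but saturate instead of overflowing".
     * x + 1 is undefined behaviour exactly when x == INT_MAX, so a
     * modern compiler may fold the check to "false" and delete it,
     * even though older compilers happened to wrap around. */
    static int saturating_inc(int x) {
        int y = x + 1;
        if (y < x)              /* may be optimized away at -O2 */
            return INT_MAX;
        return y;
    }

    int main(void) {
        printf("%d\n", saturating_inc(INT_MAX));  /* hoped-for INT_MAX; may print INT_MIN */
        return 0;
    }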


> you upgrade your compiler, and you see some mysterious new backtraces showing up in the crash reporter, then it's entirely possible your compiler is buggy.

Or it's not the compiler but the linker, now sticking 64-bit code above 0x100000000 for shits and giggles. Fun times, ask me how I know :)


The assumption any dev should have is:

"The bug is yours"

That isn't a true statement, as there are exceptions. It's just that you are unlikely to be the exception.

Whether you're relying on a compiler, network, some other application, a library, an API... the bug is nearly always yours and you should more rigorously check your code without making any assumptions, until you fully understand the root cause of a bug.

If you actually have got to the root cause of a bug, you'll find in nearly all cases the problem was with your code.


Yes. Similar to the hyperbole in the article title, I say "All bugs are logic bugs" (in your code). Now, it's not strictly always true--sometimes it's environmental or in a library you depend on, or it's a garbage collection bug with the language, or what have you.

But some developers (myself included, sometimes) reach for those excuses first because their code couldn't possibly be exhibiting that strange behavior on its own. After all, it worked perfectly in (insert situations).

No, it's almost certainly a logic bug: a race condition, a parallel execution problem, something doesn't get set correctly in a very uncommon case, etc.


Came to say the same thing. "Innocent until proven guilty" when it comes to others' codebases. If you can strip an issue down to the bare minimum, and then prove that the issue is not within your own code, then it's time to start digging elsewhere. Rarely is that the right place to start, though.


Is it a poor debugger that the author uses? The code looked wrong to me until I understood that, instead of writing the label inside the instruction line, the line contains a value of 0 and the unconventional decoding, and the label then follows on the next line, e.g.

    2:	mov    0x0(%rip),%rax
    5:  R_X86_64_GOTPCREL NSS_ERROR_NOT_FOUND+0xfffffffffffffffc
was probably supposed to mean something like

    mov  addr NSS_ERROR_NOT_FOUND, %rax
That is, the next line is actually the content of part of the previous instruction, which is shown incomplete, with 0 in place of the value. If the value on the second line came after the end of the instruction on the first line, then during execution the second line would be executed as an instruction of its own; that this is not the case is visible from the offsets (5 vs. 2 for the start of the instruction). I'm more used to Intel than AT&T notation, but I don't believe this is an effect of the notation. Does anybody know more about what produces such strange-looking code?

This weirdness

    2:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax
	5: R_X86_64_GOTPCREL	NSS_ERROR_NOT_FOUND+0xfffffffffffffffc
     9:	8b 38                	mov    (%rax),%edi
     b:	e8 00 00 00 00       	callq  10 <NSSTrustDomain_GenerateSymmetricKeyFromPassword+0x10>
	c: R_X86_64_PLT32	nss_SetError+0xfffffffffffffffc
probably just means

     mov addr NSS_ERROR_NOT_FOUND, rax
     mov (%rax),%edi
     call nss_SetError
So what produces the longer and stranger form, and why?


That's objdump's default format when outputting relocations together with the disassembly. As you have already found out, the code is compiled with dummy values for the actual addresses being accessed or jumped to, and the relocation table instructs the linker to overwrite these dummy values with the actual addresses, once they are known.

R_X86_64_GOTPCREL is a constant defined in /usr/include/elf.h (I think it's from libbfd, the library dealing with the different file formats binutils understands).

     #define R_X86_64_GOTPCREL       9       /* 32 bit signed PC relative
                                                offset to GOT entry */
Here's another example, with the invocation of objdump:

    $ cat hackernews.c
    int
    doit(int a)
    {
       return blah(a);
    }
    $ cc -Os -c hackernews.c
    $ objdump -r -S hackernews.o
    hackernews.o:     file format elf64-x86-64
    Disassembly of section .text:
    0000000000000000 <doit>:
       0:       31 c0                   xor    %eax,%eax
       2:       e9 00 00 00 00          jmpq   7 <doit+0x7>
                        3: R_X86_64_PC32        blah-0x4
On ARM (Raspberry Pi), the relocation looks a little bit different:

    $ objdump -r -S hackernews.o
    hackernews.o:     file format elf32-littlearm
    Disassembly of section .text:
    00000000 <doit>:
       0:       eafffffe        b       0 <blah>
                        0: R_ARM_JUMP24 blah

    #define R_ARM_JUMP24                29      /* PC relative 24 bit
                                           (B, BL<cond>).  */
To only see the relocation table, use "objdump --relocs":

    hackernews.o:     file format elf32-littlearm
    RELOCATION RECORDS FOR [.text]:
    OFFSET   TYPE              VALUE
    00000000 R_ARM_JUMP24      blah


By the way, I believe the zeros are addends and not actually dummy values. This is a bit strange because ELF has existing provisions for relocations with addends, but there it is.


Thanks! So it's objdump producing something that nobody who isn't debugging the linker's relocation code needs. Is there an option for "make sane output"?

I guess gcc's -S option would produce something prettier?


Those are just ordinary relocations in position-independent code. You'll see this kind of information if you run, say, objdump -rd foo.o on an x86-64 object file that has been compiled as PIC.

Note that your code fragment is not compiled as PIC, so it does not reproduce this situation.


The relocation mechanism is used for both PIC and non-PIC code. If parts of your code reside in different compilation units (.c files), there needs to be a way to tell the linker how to jump from one to the other, even if the final program will run at a well-known static position in the address space.

Of course, when compiling with -fpic versus -fno-pic, different types of relocation information will be produced, and the compiler might produce different instructions for branching. And unresolved symbols will remain in your library to be fixed up by the dynamic linker. But relocation entries will be produced in either case.


I was referring to the PIC-specific 0x0(%rip) construct as well as the relocation. Sorry if that was unclear.


Thanks. That explains the "why" part for me.

Still, I believe even PIC code is not written in that form as input to the assembler? So I can imagine a nicer objdump that would produce something more readable (or even something immediately usable by the assembler).


More or less, yeah. You would still need to use RIP-relative addressing since that is how PIC functions, but you could have nice symbolic addresses.

objdump is a very simple tool. If you want more featureful disassembly tools you should probably look at something like IDA.


What a silly premise. I find bugs in compilers all the time. I don't like these absolute statements; the industry loves them, but I don't think they are useful.

Also, the hardware is not always right. I used to write drivers and, if anything, the hardware is always wrong. Hardware is full of bugs, drivers work hard to hide these bugs from the user.


So at the bottom of the article the author points to bugzillas for GCC and LLVM, and points out that compilers actually do have bugs. It feels like an admission that the title is really just for attracting clicks, and that the first para of the article could've just been ignored or dropped.


Do not think the compiler is wrong. That's impossible. Instead... try to realize the truth. What truth? There is no compiler. (for non-compiled languages)

For compiler authors, it's the other compiler that's right.

I have to admit the only time I ever thought there was a compiler bug, there actually was a compiler bug (javac). Same for hardware bug (on aligned memory). And a standard library (in C++). I think it's because when I go through my code I can tell if my understanding is clear enough - and don't look elsewhere until it is.


In many years of programming, I don't think I've ever run into a genuine compiler bug. It's a fallacy to assume they don't exist, but they tend to be so rare and esoteric that you probably have a better chance of winning the lottery than finding one.

I have, however, run into numerous bugs in standard libraries. These are definitely much more common in proprietary languages like ActionScript (Flash) or Lingo (Director), or in rapidly-changing (I'll be nice and not say poorly designed) languages like PHP, than in, say, C, C++ or Java. Platform-specific bugs in cross-platform code also seem to be the most common.

I agree with the general premise that "it's probably your own fault". I can probably count more times that I've suspected a compiler/OS/stdlib bug and found after extensive testing that it was my own fault than I can count genuine library bugs. On average, I probably hit one genuine language bug every two years at most.

The trick is to start with the assumption that it is a compiler/OS/stdlib bug. Next, go create a minimal proof-of-concept to demonstrate the bug you believe exists. In doing so, more than 50% of the time you'll figure out what was really wrong with your code as you are doing this. The other times, you have a nice minimal test case you can submit to the language maintainer's mailing list.

It's also surprising how much you can learn about how a language or feature works by trying to methodically prove that it is broken. The process of doing so forces you to think about all the edge cases you don't normally consider, but most of the time, the language designer already did.


When I was writing the code generation part of a compiler in a previous lifetime, there were a number of conversations of the following sort early on in the production life of the compiler:

Programmer: The compiler is producing the wrong code for this case.
Compiler team: Ok, let's look at the code and step through it.
Programmer: So it is putting this quantity in the register pair here...
Compiler team: Right.
Programmer: ...and it is taking out the high byte and putting it there...
Compiler team: Right.
Programmer: ...and then it is... Oh. Wow, that is a weird way to do this.
Compiler team: Right. Saves two bytes of code.

The programming team had previously used assembler and had developed conventions of code patterns. Our boss pointed out that "A compiler should produce assembly code that an assembly programmer would be fired for writing."

Not that we didn't have bugs, but the previous conversation was more common than the code generation bugs we had.

And, Team Compiler, if you think that the compiler is always right, John Regehr (http://blog.regehr.org/) has some news for you.


The post somewhat misses the point of the title (or I overthought it). The anecdote is interesting, but not exactly an example of an arguable compiler bug. On the other hand, I think that "the compiler is always right" is a rather good general guideline for programming.

There are three models [1] involved in programming: the one in the head of the programmer, the one in the language specification, and the one in the compiler. It would be nice if the first one were right, but of course it is pretty much by definition wrong (until we finally have the tools to program via telepathic link). The second, the language specification, should be the authoritative one. And the one in the compiler, for pragmatic reasons, wins. This is of course just a complicated way of saying that I am usually more interested in working code than in the standard.

[1] Here understood as the correspondence between source code and program.


That's only true when you work with established, trusted compilers. Every worthy Unity developer is familiar with the AOT bugs on iOS, which render one of the best instruments in the hands of a C# programmer, LINQ, completely unusable.


While I appreciate the general principle that many others have also observed, that one should assume bugs are in one's own code before blaming the underlying system, I find that this is often an excuse to avoid understanding the problem and applying the solution where it belongs. The operating system, the network, remote APIs, and other infrastructure often are the problem, and if you don't invest in building and understanding tools that can truly explain what's going on, you often end up building crappy workarounds for shortcomings in the underlying infrastructure.


> you often end up building crappy workarounds for shortcomings in the underlying infrastructure.

Which, to be fair, you would often have to build regardless.


See also an older proverb to the same effect: "SELECT isn't broken"

http://blog.codinghorror.com/the-first-rule-of-programming-i... http://pragmatictips.com/26


The compiler is mostly right for tested configurations. If you are using less frequently used configurations, all bets are off.

For example, unusual target architectures, using optimize options not enabled by default, using non-default modes of compilation such as PGO and LTO, etc.


Here's one absurd bug in Java that really surprised me:

    Math.abs(-2147483647); // 2147483647
    Math.abs(-2147483648); // -2147483648

When you know this behavior, it's kinda obvious why (-2147483648 is Integer.MIN_VALUE, and its positive counterpart 2147483648 doesn't fit in a 32-bit two's complement int, so the negation overflows back to itself), but still...


Well, not just kind of obvious why, but kind of obvious it's not a bug in Java! If you're dealing with quantities that don't fit in a 32-bit two's complement integer, you need to use a larger datatype.


Well, I'd only consider that a bug if Java defined that integers use at least as many bits as needed to store their value.

I'd guess this is not the case (but I don't know for sure).


And sometimes a few lines of extra code are needed to help the compiler figure out the programmer's intent. The simplest example is extra code that will be optimized out in the final product; another is code that keeps the compiler from optimizing away certain aliasing. The compiler is your friend who sometimes disagrees, sometimes misbehaves, and is sometimes right, but all along is a good friend.


Sometimes you need to rebuild.


Indeed. Visual Studio will often link together a severely broken executable, which will be fixed upon a full rebuild.



