The weirdest bug I've ever encountered

fargle · on Nov 13, 2021

Way before QNX went shared/open source, they were as they are now. But they had just released Neutrino a major update which basically took their QNX4 microkernel and mated it with a POSIX/GNU/UNIX-like userland. And it was for a while, free as in beer, for R&D type use.

So that all sounded great and we tried it, this was around 20 years ago. It was a disaster.

The same kind of shoddy work, race-condition, type bugs all over the place.

- malloc broken under heavy multi-threaded load. Replaced with dl-malloc.

- serial driver (16550) broke under heavy load. Wrote our own replacement.

- ethernet driver broke under heavy load. Had to replace intel cards with 3-com (or vice-versa) to use a different driver.

- had to write/rewrite several other drivers. Interestingly the only think I liked about QNX is that user mode drivers are very nice and easy to write.

- compiler, QNX bespoke port of GCC, crashed and broke on large codebases (could be because of malloc). Had to re-port GCC (which was easy) and build our own toolchains.

When we got done trying it our adamant report was GARBAGE NEVER USE. Got overrode by management. because we succeeded in making it work. Sigh. Started using it for products, so we bought commercial not-free license AND support. Reported all of the above bugs and 20 more, with reproducible test cases. QNX "promised to fix them in the next release". This was 20 years ago. Finally after many years we completed excising QNX from our products.

We never had a problem with ps, but it was probably broken. This company was producing garbage in the late nineties and didn't care. They produce garbage now. They didn't have enough or experienced enough developers to be trying to pull together a whole OS, so it was about how you'd expect a university or hobby project.

The only place this author got it wrong is that on QNX, if you assume the OS or compiler is at fault, you are probably right.

fwip · on Nov 14, 2021

QNX is the only compiler I've ever used that removing a comment broke my code at runtime.

If I remember right, the comment was:

// printf("debug 2");

I added in an additional comment directly above that one.

// load-bearing comment, do not remove

dotancohen · on Nov 14, 2021

  > // load-bearing comment, do not remove

I've actually seen a "load bearing comment" with reflection in a PHP app that I had inherited. And some Python packages uses them for defining tests.

But I love the term "load bearing comment". I'm definitely adding this to my vocabulary.

bostik · on Nov 14, 2021

Fortran compiler (f77 maybe?) on a Sun system had reserved filenames. In Finland, back in the late 90's first-year chemistry students were still exposed to Fortran as their entry to programming.

I was helping another student debug their programming assignment, which blew up the compiler. Over the shoulder I couldn't figure out what it supposedly wrong, so we copied it (via /scratch) to my home directory. I just shortened the name a bit. Tried to compile on my account, it worked. No errors. Confirmed it was the same compiler being invoked.

So we copied the file to a new name on the other student's account and tried. Worked.

Same file content, two different file names. One compiles fine, another hits internal compiler error paths. Introduction to programming indeed.

bradrn · on Nov 14, 2021

…how‽ What is the explanation for this?

mst · on Nov 14, 2021

No idea, but I do know there was a gcc port to NCR UNIX bootstrapping off their own compiler that stalled for a while until one of the developers worked out that it unrolled switch statements to if+goto ... and used the label 'dflt:' for the default case of the switch.

Sucked to be you if you decided to name one of your own labels that.

fwip · on Nov 14, 2021

Honestly, I'm not sure. In hindsight, I think that there must have been a race condition somewhere else in my code (school project) that this was somehow exposing. Removing the comment reliably resulted in an executable that always did the wrong behavior, and leaving the comment in, the program always worked.

zoomablemind · on Nov 14, 2021

printf() is an old-school fuzzer, so to speak. It may be just a line, but underneath, the call to it brings in a ton of branching and memory alloc and access.

So quite often any latent memory leaks or code issues could lead to such odd behavior. Sometimes it's the opposite - "works" with printf(), but segfaults in some odd place or infiniloops without it.

At times like this, -Wall, valgrind, and critical thinking are the ways to "cherchez le 'bug'"... Multithreaded it is? Good luck!

Arnavion · on Nov 14, 2021

The line in question is a comment, not a call to printf.

dotancohen · on Nov 14, 2021

You don't know what the compiler is doing behind the scenes before it figures out that is a comment. For instance, it might parse the entire function body for printf() and if it finds it then loads some library, only then getting around to actually parsing the function line by line. Think of Javascript hoisting - only dumber.

Jenda_ · on Nov 14, 2021

There was a case, which I cannot find now, where the compiler decided whether the function is small enough to inline based on its source code size, including comments.

Edit: found it https://medium.com/@c2c/nodejs-a-quick-optimization-advice-7...

https://news.ycombinator.com/item?id=18170499

Edit 2: the medium.com URL redirects to a non-existent domain when I'm not signed into a Medium account (wtf). http://web.archive.org/web/20151222072514/https://top.fse.gu...

Arnavion · on Nov 14, 2021

That may well be the case, but it is not what the comment I replied to was talking about.

btbuilder · on Nov 13, 2021

This reminded me of a story about the launch of WebTV by Microsoft back in the 90s.

My memory is blurry but essentially the bug was in the signup or login code. When displaying an error to the user, the memory address to read the message from was pointing to an array of banned words. This lead to the user seeing no details of the error but instead just the word “f*k”.

Unfortunately the only reference I can find to it online is someone also trying to find it and failing: https://dotclue.org/archive/2006/12/002664/

lanewinfield · on Nov 13, 2021

Well, there looks to be a comment on that very link you sent with the story:

https://fadden.com/tech/webtv-anecdotes.html

(and the dialog box: https://fadden.com/tech/images/fkdialog.jpg)

And really laughed at it. Thanks for sharing.

kristjansson · on Nov 14, 2021

The worlds first onomatopoeic bug.

salamanderman · on Nov 14, 2021

That made my day. Wow.

rincebrain · on Nov 14, 2021

Once upon a time, a local computer club chapter I was in bought a bunch of 1Gb PCI/PCI-X Intel e1000 NICs to upgrade our systems, and a shiny used GbE switch.

We installed them, and a curious thing happened - on _a few_ systems, eventually the NIC would effectively hang, flooding dmesg with "Rx Unit Hang" or "Tx Unit Hang", and since the systems were NFS+LDAP based, people without local account access were SOL, and a reboot was required to get the NIC back into a good state.

When swapping NICs with a system (also using an e1000) where this _never_ happened, the problem did not follow the NIC, it stayed on the original system - same statement if you duplicated the OS install from the working one onto the broken one.

So we reached out to Intel's Linux support (which, at the time, took posts on their sourceforge page), and they replied basically "hey, yeah, we've heard a few reports of that, but we haven't been able to repro it on our testbeds yet".

"We have two identical systems which both do it, would you like to borrow one?"

Their reply was a shipping label, so we shipped it to them.

A little while later, I got a phone call, and the first words out of the engineer's mouth were "Do you have the BIOS that was on this motherboard?"

"...what do you mean, was?"

Apparently they plugged in a PCI bus analyzer that made assumptions that were true on Intel x86 and not AMD x86, and we had shipped them a Sempron, so comedy ensued.

After a number of rabbit holes, the root cause came back - our affected systems were _so_ cheap and terrible, _they lied about bus DMAs succeeding sometimes_, so sometimes the "hey I put/cleared this buffer on the NIC" messages were getting lost but not reporting they were lost, and eventually, you run out of buffers...

They couldn't reproduce it because none of their testbeds were bargain-basement AMD boards. XD

They asked us to keep the system for their test set, and shipped us a brand spanking new Core 2 Quad with all the bells and whistles.

rincebrain · on Nov 14, 2021

Another strange one: Once, when testing a whole bunch of various-grade SSDs, a curious thing happened.

We had a lot of Windows servers at a prior job for a variety of reasons, so we did a lot with Windows, in addition to Linux.

On Windows only, when using a SAS HBA only, a certain vendor's certain line of SSDs[1] would not show up as disks. (They showed up as "Unknown device" if you dug through Device Manager, under "Other devices", not disk drives.)

If you used the onboard SATA, it was good. If you used Linux or Solaris, it worked either way. But on Windows, with a SAS HBA, nope.

It turns out, the SSDs left a mandatory field blank in their identify response, and Windows cared when the field was empty on a SATL-encapsulated disk versus just normal AHCI.

Once the underlying cause was identified, they cut a FW update real fast (and updated the firmware on the site without bumping the FW number, as I recall), because it also affected their enterprise drives with the same controller, and that was causing them issues...

[1] - OCZ Vertex 3s

rincebrain · on Nov 14, 2021

Help I can't stop.

When we were testing ZFS (at the same prior job) while I convinced my bosses to go with it for a big setup on an upcoming cluster, we bought a couple hundred of a consumer disk to play with (my workplace was obsessed with the idea of using consumer disks over enterprise ones and just using increased redundancy+backups to deal with the gap), threw them in a couple of disk enclosures, and went to play.

Set up a pretty vanilla OpenIndiana system, build a big pool out of dozens of disks with raidz3 vdevs, and start writing bunches of data to them and then scrubbing, and a curious thing occurred...

...every time we scrubbed, the checksum error count on not just one or two, but _all_ of the drives would go up sometimes - never any read or write errors or even a blip out of any disk or controller, no SMART counter bump, nothing.

So after a lot of experimenting (not the CPU, ECC RAM is fine, controller is fine...), it turned out...

...Samsung managed to manufacture disks that made SMART requests dangerous. [1]

You see, if you issued a SMART command while there was data in the drive's write cache (which it has, remember, already told the OS that it wrote safely), it would just...drop it.

So my background monitoring of the counters as I did all this IO was making writes get lost before they reached the platters, and so on scrub, then they would naturally be noticed as wrong...

...and if SMART lined up properly there too, they'd still have some on scrubbing again...

The "best" part was that Samsung did cut a firmware update to fix this, eventually...but it _doesn't change the firmware version reported_, so you can't know if your drive is fixed just from looking. (We kept a spreadsheet of them and updated it as we flashed them.)

[1] - https://www.smartmontools.org/wiki/SamsungF4EGBadBlocks

rincebrain · on Nov 15, 2021

One more bonus, for anyone who yet comes upon this.

I bought a monitor, once upon a time, a Dell UP2414Q, as I recall. It was a nice monitor, when it worked.

Unfortunately, there's a terrible secret in high-framerate high-resolution stream transport - for at least some protocols (hello DisplayPort), they do this by sending two video streams for halves of the screen and the monitor muxes them.

Unfortunately, on this monitor, it would go...wrong, sometimes. Sometimes, it would think that the picture on the "right" half should be stretched to fit the entire width, and then the left side displayed over that on the left (or vice-versa). Sometimes, it would just display nothing for one side. Most excitingly, sometimes it would just do what I would describe as "crash" - if the monitor was allowed to go to sleep, it would sometimes just...stop responding to any inputs on the physical buttons or commands from the computer, and require a hard power removal and reinsertion of the power cable before behaving again. So you got to leave this monitor non-asleep the entire time or endure suffering.

You can see a fun comparison of what a screenshot thinks is onscreen versus a photo of it in [1][2].

And all of those things could all happen any time the resolution shifted - switching to/from fullscreen in something, for example.

If you look, you can find references to a revision of the monitor which ships with a firmware that lacks these problems, but there exists no documented path to update from one to the other. I could not, when I looked, find someone who had the access and desire to try and extract the firmware from the board without any sort of provided tooling to do it, and I was not going to buy monitors with that model and hope I found one of the right rev. Dell support might have been willing to replace it under warranty, but I had bought it refurbished, so "ha ha".

Ultimately, I gave the monitor to someone as a gift when theirs was woefully insufficient (with complete disclosure of the flaws in it) before I tried flashing any exciting payloads. Probably the least satisfying story ending of these, but true.

[1] - https://i.imgur.com/NAG5hEU.jpg [2] - https://i.imgur.com/9R7TWQE.jpg

amichal · on Nov 14, 2021

I once spent nearly 2 weeks as a team of three chasing a intermittent bug in some multithreaded Windows NT codebase (around 1999). We were sure it was some subtle issue with multiple CPUs since it only happened on the one quad cpu machine we had and took hours of 100% cpu between failures. Feeling increasingly foolish and relatively new to multithreaded code I was sure I was missing something obvious. The guy with a EE/hardware background comes in super early on a Monday (like 3am) and gets the longest run in yet without a crash and then notices that the crash happens increasing often throughout the day. He opens the case of our rented beast of a machine none of us could afford to risk damage to and reseats the CPUs. Test loop ran at full 100% CPU load for 72 hours straight.

herpderperator · on Nov 14, 2021

What is the significance of the 3am early morning and later realisation that it happens more frequently throughout the day? How did that lead to him wanting to reseat the CPUs?

ninjanomnom · on Nov 14, 2021

I can at least answer why reseating the cpu was an early response. Heat is one of the few things that would persist across full reboots and gradual heat buildup could easily cause these symptoms. There are a lot of places to check still if that's all you have to go on but the cpu is a decent first choice as it's not uncommon for the thermal past to be applied poorly.

jfrunyon · on Nov 14, 2021

If you run into an issue that takes longer to occur after a cold start, and gets more frequent with use, odds are very good it's heat-related.

blakesterz · on Nov 13, 2021

Everytime I read a story like this I'm reminded of "The case of the 500-mile email"

https://www.ibiblio.org/harris/500milemail.html

"...But then I tried to send an email to Memphis (600 miles). It failed. Boston, failed. Detroit, failed. I got out my address book and started trying to narrow this down. New York (420 miles) worked, but Providence (580 miles) failed...."

lifthrasiir · on Nov 13, 2021

> Make sure your loops terminate. Bounded loop variants that increase or decrease strictly monotonically guarantee loop termination. Don't be too clever with loops!

More generally, isolate any non-simple loop into a dedicated function that calls a callback and make use of such functions throughout the API. Not just because of testability but also because an inlined loop is very hard to track as is. It can even be said that async functions exist partly in order to eliminate an event loop tucked in a state machine. Alas, this observation is not really actionable in C.

kazinator · on Nov 14, 2021

Loops that depend on some graph (say, mutated in place) to reach a fixed point can present a challenge here. You have to reason about the algorithm that changes the values at the graph nodes, and whether it guarantees forward progress toward a fixed point.

Floating point convergence is another risky thing in this category. If you're iterating to convergence, the test has to nail it, and if there is any risk of it going sideways, there has to be some exit strategy baked in.

I think more than once in the past I've added a maximum count to some loop that might not otherwise terminate, more than generous enough not to trigger as a false positive.

I seem to recall that Donald Knuth's TeX famously has some infinite looping detection in some of its formatting code. I can't find any references to this via web searching, though.

Another category are mutually tail-recursive functions. The termination of a tail-recursive network of functions could be as hard to prove as a goto graph. Introducing a maximum iteration count could mean adding a count parameter to many of the functions, passing count -1 and checking for dropping to or below zero in more than one place.

LorenPechtel · on Nov 14, 2021

Yeah, more than once I've coded suicide counters into indeterminate loops to catch corrupted graphs. And recently I had a fairly simple user-created graph that I forgot to put a suicide counter on. Of course I found out the omission the hard way.

What did you expect it to do when you said A waits on B, B waits on C, C waits on D and D waits on A??? (Actually, the main part worked so it didn't immediately smack him in the face. However, there was a routine to find the completion time, if an item didn't exist use the completion time of it's parent(s).)

hinkley · on Nov 13, 2021

You can always go through and clever up the two or three complex loops that will actually create a substantial performance benefit. But leave the rest of them dead boring.

I think a lot of times we write code as if this is the final form it's going to be in. Five features later, we're hardly going to recognize it, and each change is a wedge that can open up a chink in the logic and introduce a nasty bug. If you start clever, there's nowhere to go but down.

couchand · on Nov 14, 2021

Unfortunately we're terrible at prediction. The code you spent all night getting clever with gets rewritten next week, but the code you just threw together survives until the new platform, when it gets ported directly because nobody remembers why it was written like that.

dmingod666 · on Nov 13, 2021

Weirdest bug I saw was OTRS getting slow at the end of the month. It was weird, no pressure on the DB or the server, it would get slow with barely any users and then suddenly it would work really fast with no change.

After a wild goose chase I saw one of the servers took too long to authenticate SSH than the other when the problem happend, turns out, the only difference between the 2 was one of them had a reference to a domain and the other did not.

From this clue I checked there was an internal AD server that was getting called by its domain name on every page call of OTRS. When that DNS server got slow, each page got slow, switching to direct IP fixed it. DNS getting slow on an internal system is wayy down on most people's lists if it even exists..

kazinator · on Nov 14, 2021

A loop not handling all the cases that can occur in system call (errors or other situations), resulting in infinite execution at 100% CPU, isn't weird at all. It's almost an everyday problem.

Almost every "thread using 100% CPU" bug I've ever debugged was due to blowing through some function that was expected to return the happy case, possibly after blocking for a while, but was instead failing.

yissp · on Nov 14, 2021

This was an a well-written and interesting post, but the bug doesn't seem all that weird to me. I'm sure there are numerous race conditions lurking in decade+ old code that are waiting for someone to trip on them.

spurgu · on Nov 14, 2021

I find it impressive that QNX is still around, I remember trying it out in the late 90's, booting it off a floppy disk!

rawoke083600 · on Nov 14, 2021

Great post ! Well written and to the point and damn interesting :)

amichal · on Nov 14, 2021

I once chased