The hardest part is debugging concurrency and weird scheduling bug. It's hard to...

grishka · on June 27, 2021

As someone who wrote and debugged a lot of concurrent code, here's another advice: log everything. Log as much as you could in the part where you think the bug is. Log every line that's run if you have to. You'll then skim through the log file looking for any unexpected patterns.

This approach works better than using a debugger, even on a single-core system, because these kinds of bugs tend to be hard to reproduce and take many iterations. You don't want it to hit a breakpoint a zillion times before it finally shows itself.

And another one, tangential to what you said. Read your code line by line and ask yourself "what would break if a context switch happens right here" for each line.

halfer53 · on June 27, 2021

That's a very smart way of debugging concurrency. I did a lot of logging as well when I was doing debugging concurrency. Overtime once you got familar with the projects, I started to develope instincts on how everything fits together. That's when reading code line by line, making educated gusses start to become viable way of debugging concurrency.

But for complex projects, reading code or relying on instincts may not work alone as brain power run out of capacity. That's why logging helps a lot

sn41 · on June 27, 2021

You are right. Great work!

One somewhat related theoretical observation about concurrency is in some article by Dijkstra (I don't remember the reference right now): he says that debugging using traces (essentially printf) does not work for concurrency, since it is projecting multidimensional data (data present at the same time in multiple processes) unnecessarily linearized onto a single dimension (a sequence of printfs) and then trying to make sense of what is happening. It may not work, even if you print timestamps.

His view was to promote theoretical proofs of correctness of concurrent code, rather than debugging, but to me at least, this is much more difficult.

aquark · on June 27, 2021

Very true, and not only do print statements inherently serialized threads, they also change the timing so significantly that probably your bug disappears anyway.

erwincoumans · on June 27, 2021

I would expect blazing fast in-memory logging with thread-id and time stamps, so timings aren't affected (much).

lowbloodsugar · on June 27, 2021

Back on the N64, I updated the bit of code that swapped threads to write, to a ring buffer, the outgoing/incoming PCs, thread IDs and clock. Found tons of unexpected issues. In another thread you can print that or save it to disk or whatever. Or just wait till it crashes and read memory for it. Found the last crash bug with it. Meanwhile, a colleague took it, and drew color coded bars on the screen so we could see exactly what was taking the time. Those were the days. =)

pjerem · on June 27, 2021

If you don’t mean to reveal it, what were you working at on the N64 ?

lowbloodsugar · on June 27, 2021

Ironically, trying to make a name for myself.

inglor_cz · on June 27, 2021

My tip: if you can pinpoint the place where the bug occurs, trigger a SIGSEGV there and run the entire thing under Valgrind. It shows you a lot of interesting data.