> We have achieved this by incorporating hundreds of micro-optimizations. Each micro-optimization might improve the performance by as little as 0.05%. If we get one that improves performance by 0.25%, that is considered a huge win. Each of these optimizations is unmeasurable on a real-world system (we have to use cachegrind to get repeatable run-times) but if you do enough of them, they add up.
Of course, since processors are out-of-order and superscalar, reducing instruction count isn't always a worthy goal either. You might end up with fewer instructions but less happening in parallel. And then you have to balance this against the cache efficiency of executing fewer instructions to complete a task.
Basically, optimization at this level is really hard.
And yet, in practice, I have found that optimizing for instruction counts works really well. I've gotten way more mileage out of Cachegrind's instruction counts than I ever have out of its cache or branch prediction simulations.
I was tempted to respond after your first comment, but this follow-up provoked me to action. When you wrote your post several years ago, modelling with Cachegrind may still have been a defensible approach. But the gap between its generic processor simulation and real-world performance has only continued to widen. As you note, this divergence first became apparent with the cache and branch-prediction simulations. Today, I'd argue your time is almost always better spent reading the CPU's built-in performance counters than running Cachegrind. For the cases where Cachegrind used to be useful, 'perf' is a joy!
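For instance, a sketch of the equivalent measurement with perf (`./myprog` is a stand-in for whatever you're benchmarking, and the event list is just a starting point):

```shell
# Read the real hardware counters instead of simulating them.
# -r 5 repeats the run and reports the mean and run-to-run variance,
# making the noise that Cachegrind sidesteps visible rather than hidden.
perf stat -r 5 -e instructions,cycles,branches,branch-misses ./myprog
```

The `instructions` counter gives you roughly the same signal as Cachegrind's Ir, but from the actual machine and at a fraction of the runtime cost.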
I'm not sure that's a strong argument for your case. Similarly to the way that Usain Bolt could probably beat me in the 100m even if he was in a wheelchair with flat tires, I'm sure Richard Hipp could probably do a decent job of optimizing SQLite with just a stub of pencil and a scrap of napkin. I just think he'd be more effective with better tools.
Interesting! That reminds me of my own experience in a different context (from https://blog.mozilla.org/nnethercote/2011/07/01/faster-javas...):
> Cachegrind does event-based profiling, i.e. it counts instructions, memory accesses, etc, rather than time. When making a lot of very small improvements, noise variations often swamp the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.