The cycle-counts returned by cachegrind are repeatable to 7 or 8 significant figures. That means I can make a small change, rerun the test, and know whether the change helped or hurt, even if the difference is only 0.01%. I don't think perf is quite so repeatable, is it?
Also, the cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code, which is invaluable in tracking down hotspots in need of work. If perf provides such a tool, I am unaware of it.
Remember that I'm not trying to optimize for a specific CPU. SQLite is cross-platform. I want to do optimizations that help on all CPUs using all compilers. I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code. But the optimizations themselves should usually apply across all CPUs and all compilers and all compiler optimization settings.
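The workflow described above can be sketched as follows (the benchmark name `speedtest1` comes from this thread; the exact source files and linker flags are illustrative):

```shell
# Build with -Os and debug info so counts map cleanly back to source lines
gcc -Os -g -o speedtest1 speedtest1.c sqlite3.c -lpthread -ldl

# Run on cachegrind's simulated CPU; the event counts are deterministic
valgrind --tool=cachegrind --cachegrind-out-file=cg.out ./speedtest1

# Produce a source listing with per-line instruction and cache-event counts
cg_annotate cg.out > annotated.txt
```

The determinism comes from cachegrind simulating a fixed CPU model rather than sampling real hardware, which is why two runs of the same binary produce essentially identical counts.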
Nkruz is, of course, welcome to use any tool he likes to optimize his projects. But, at least for the moment, I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.
> The cycle-counts returned by cachegrind are repeatable, to 7 or 8 significant figures. ... I don't think perf is quite so repeatable
It can come close. I just tried, and for 'speedtest1' I seemed to be getting 3 significant digits for the cycle counts, and 4 for the instruction count. You'd probably gain another one or two if you were to measure computation only and remove the printf() and other I/O statements. The underlying performance counters are pretty much cycle accurate.
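For reference, this is roughly the invocation that produces those totals, with a repeat count added so perf itself reports how repeatable the numbers are (the repeat count of 5 is arbitrary):

```shell
# Run the benchmark 5 times; perf prints the mean and +/- spread
# for each counter, which directly shows the measurement noise
perf stat -r 5 -e cycles,instructions ./speedtest1
```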
> cg_annotate utility gives me a complete program listing showing me the cycle counts spent on each line of code... If perf provides such a tool, I am unaware of it.
Yes, that record/report combination I quoted above does just this, with insignificant runtime overhead. Unlike the total counts, this one is sampled, so you might get a little more variation. It's definitely good enough for quickly finding hotspots, and there are other (harder to use) tools that can use the precise "PEBS" events if you need complete counts. These even allow you to do nifty things like track the number of times each branch statement is mispredicted.
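Concretely, that looks something like the following; the `:pp` suffix is perf's standard modifier requesting precise (PEBS-backed) sampling, and the branch-misprediction example at the end is the nifty trick mentioned above:

```shell
# Record cycle samples, then browse a per-line annotated source view
perf record -e cycles ./speedtest1
perf report            # interactive; select a symbol and annotate it per line

# Precise sampling: attribute mispredicted branches to source lines
perf record -e branch-misses:pp ./speedtest1
perf annotate
```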
> I'm measuring the performance on the "cachegrind virtual CPU" of a binary prepared using GCC and -Os because that combination gives repeatable measurements that are easy to map into specific lines of source code.
Absolutely, this is the right way to view cachegrind. My question would be whether it is the right generic CPU to be using, and whether optimizations made on it translate well to other modern CPUs. Many of them will, but I think you'd have faster turnaround time and even better success with an approach that uses a real CPU and its performance counters.
> I'm finding cachegrind to be a better tool to help with implementing micro-optimizations.
Please realize I have the utmost respect for your work on SQLite. It's my most frequent answer when asked for an example of C code to study, learn from, and pattern after. I'm certain you will manage to optimize it with any tool you choose, but having spent many hours with gprof and cachegrind myself, I (with the zeal of a recent convert) think you'll be amazed by some of the things that are now possible with performance counters.