The advantage of 'rep scasb' is that some future Intel processor might be clever enough to handle a word per cycle.
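For anyone who hasn't seen it, this is roughly what the instruction-level version looks like. A minimal sketch only, using GCC-style inline assembly on x86-64: scasb_strlen is a made-up name, it isn't glibc's code, and note the prefix you actually need for "scan until we hit the zero byte" is repne rather than rep.

    #include <stddef.h>

    /* Sketch: strlen via repne scasb. AL holds the byte to look for (0),
       RDI walks the string, RCX counts down from an "unlimited" budget. */
    static size_t scasb_strlen(const char *s)
    {
        const char *p = s;
        size_t remaining = (size_t)-1;
        __asm__ volatile(
            "repne scasb"            /* scan [RDI] forward until it equals AL */
            : "+D"(p), "+c"(remaining)
            : "a"((unsigned char)0)
            : "cc", "memory");
        /* Bytes scanned = initial RCX - final RCX; drop the NUL itself. */
        return ((size_t)-1 - remaining) - 1;
    }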
An advantage of optimizing the hell out of the library version is that nobody will be tempted to roll their own string compare in their application code. Slow APIs are terrible because they force application developers to work around them. So the answer isn't just to write something slow and then measure: you'll find performance doesn't matter, because everyone has already avoided using it.
For short strings (the common case), scasb performs the same as a smarter C loop, reduces code size, and saves branches (which is good for caching), which is why most compilers compile strlen to it automatically. But this is kind of a silly point, since performant code doesn't call strlen often.
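For concreteness, the baseline everyone is implicitly comparing against is just the plain byte loop below (naive_strlen is an illustrative name, not a real library function). For short strings, the per-call overhead dominates whichever variant you pick.

    #include <stddef.h>

    /* The dumb byte-at-a-time loop: one load, one compare, one increment per byte. */
    static size_t naive_strlen(const char *s)
    {
        const char *p = s;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }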
My overall point was that it makes little sense for developers to come up with similar hacks in their own pet projects or apps. This is a linear-time operation either way, and the constant-factor speedup here is trivial. Obviously, if you're maintaining GNU's libc, you probably know more about this than I do and are in a better position to decide whether it's needed, but to Joe Developer, it's just a side-track that obscures the end goals.
The speed-up I was referring to, however, wasn't "strlen speedup" but "your entire app running with a naive strlen implementation, vs. your entire app with a clever strlen."
I also wasn't saying this kind of optimization has no place. I was trying to add that these hacks aren't usually what makes your app run twice as fast or feel more 'snappy', unless string processing is the bread and butter of your app (in which case there are algorithms with better running time, not just low-level hacks).
I love these as much as the next person, but early and misapplied optimization, in my opinion, is suboptimal as a practice.
I think you're making a good point, but it's misguided here. We're talking about a 2x speedup for strlen on pretty much every Linux system and probably other systems as well. That's millions of machines. This optimization has probably saved untold amounts of computing time.
I'd really like to retire this argument, but I don't exactly agree.
My point is basically as follows:
1) Yes, it probably makes sense for GNU libc to use the optimized implementation.
2) Yes, it's really interesting to dissect when you're looking for clever implementations and hacks.
3) These "2x" speed-up numbers are nice metrics, I feel, but in the end they don't amount to much. Outside of this argument, I feel people misunderstand how long things actually take inside a computer. The time taken by a network or hard-drive read, a memory read, a CPU instruction-cache miss, and a dumb single-instruction comparison of bytes are each orders of magnitude apart.
So let's say you're writing your http caching server and you're using the new strlen algorithm. The amount of time your code will spend fetching the item from memory and putting it on the wire will completely eclipse the speedup you get from this fancy strlen. Not to mention if you're writing this in a high-level language, the nanoseconds you save on a linear algorithm will simply not matter.
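To put rough numbers on that: the figures below are commonly cited ballpark latencies, not measurements, and the per-call strlen saving is a pure assumption, but even so the arithmetic makes the point.

    #include <stdio.h>

    /* Back-of-envelope only: ballpark latencies, not measured data. */
    int main(void)
    {
        const double strlen_saving_ns = 10.0;   /* assumed saving per call on a short header */
        const double ssd_read_ns      = 100e3;  /* ~100 microseconds per SSD read */
        const double network_rtt_ns   = 10e6;   /* ~10 milliseconds per round trip */

        printf("strlen saving vs one SSD read:            %.6f%%\n",
               100.0 * strlen_saving_ns / ssd_read_ns);
        printf("strlen saving vs one network round trip:  %.6f%%\n",
               100.0 * strlen_saving_ns / network_rtt_ns);
        return 0;
    }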
So you can make the argument that, over the last few years, the total time and energy saved across all Linux machines by the new algorithm is worth it. I don't exactly buy it, because most of a modern computer's life is dictated by waiting for input, processing it in a burst, and then more waiting. Most CPUs around the world are sitting at single-digit utilization. If we had all loaded up a big batch of work back in 1990 and the world's CPU power had spent the time since crunching it, we might arrive at an answer a few hours or days sooner. But in realistic terms, all it means is that your Linux box will arrive at an answer a few nanoseconds sooner and finally get to start waiting for its next batch sooner (whether that entails serving HTTP requests or waiting for your next keystroke).
I'm all for optimization, but I think it has to be appropriate and measured. It probably makes sense to spend time on nano-optimizations if you maintain one of the most-used libraries in the world, but all I'm trying to say is that for most readers of HN, time is better spent on algorithmic run-time optimization or caching policies than on getting side-tracked by strlen implementations.
As I pointed out below, I think we can assume "measure before optimize" in this community.
But per your main point, I think you're wrong in assuming that where your time is best spent is true for most HN readers. My research project is a compiler that generates code for the Cell. This kind of optimization - which in general is a vectorization - is directly applicable to what I do. And, yes, in the kinds of applications I target, the difference when this kind of optimization is applied is measurable and significant.
HN has different kinds of hackers. Something that is outside of your scope might be in someone else's scope.
Of course, that's Knuth's warning that "Premature optimization is the root of all evil." I assumed we've all heard that before, and we're only thinking about applying optimizations after measurement.
Unfortunately, though, this is slower than comparing 4 bytes at a time yourself.
Working on bytes one at a time is pretty expensive. Four at a time is great on a 32-bit CPU (as long as they're aligned).
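To make the "4 at a time" idea concrete, here's a sketch of the usual word-at-a-time trick applied to strlen. word_strlen is a made-up name, and real implementations (glibc's included) are more careful about aliasing and portability; the aliasing-violating cast here is just for brevity.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: find the NUL a 32-bit word at a time.
       (x - 0x01010101) & ~x & 0x80808080 is nonzero iff some byte of x is zero. */
    static size_t word_strlen(const char *s)
    {
        const char *p = s;

        /* Walk byte-by-byte until p is 4-byte aligned. */
        while (((uintptr_t)p & 3) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* Test whole aligned words; an aligned load never crosses a page boundary. */
        const uint32_t *w = (const uint32_t *)p;
        for (;;) {
            uint32_t x = *w;
            if ((x - 0x01010101u) & ~x & 0x80808080u)
                break;              /* this word contains a zero byte */
            w++;
        }

        /* The NUL is somewhere in *w; locate it byte-by-byte. */
        p = (const char *)w;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }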