1 million strlens on the same random 100 byte string:
Pentium 4 3.0GHz, gcc 4.3.2 (Debian Lenny)
glibc: 0.52ns/char
libc: 0.60ns/char ???
easy: 0.94ns/char
obsd: 0.94/char
I would expect libc and glibc to match since this is a Debian machine, and they did match on the Atom 330. I can't account for the difference here; perhaps the difference between the Debian library's compilation flags and my -O3 makes a difference on this processor.
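For reference, a harness along these lines is all it takes to reproduce numbers like the above (a hypothetical sketch; exact flags, the string contents, and the era's need for -lrt with clock_gettime will shift the results):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        enum { LEN = 100, ITERS = 1000000 };
        char buf[LEN + 1];
        srand(42);
        for (int i = 0; i < LEN; i++)
            buf[i] = (char)(rand() % 255 + 1);   /* random nonzero bytes */
        buf[LEN] = '\0';

        /* volatile pointer discourages the compiler from hoisting the
           strlen call out of the loop entirely */
        char *volatile p = buf;
        volatile size_t sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += strlen(p);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/char\n", ns / ((double)ITERS * LEN));
        (void)sink;
        return 0;
    }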
Don't tell the Gentoo crowd, you'll only encourage them.
I'd give it 5 also. It's a ridiculously clever fun book.
I agree with some of the reviewers though; the title probably means fewer people manage to find it than if it were called "Bit manipulation bible" or something.
That makes the x86 assembly version look straightforward: http://www.int80h.org/strlen/. I think it is just a fallback implementation that is used when no optimized version has been created for the target processor. Every time I've looked, my compiler has used a hand-optimized version for every platform. Quite possibly, this is 100% dead code.
That looks like a pretty straightforward translation of the basic while (*d++ = *s++); code. It uses higher-level instructions, but those probably get microcoded to the same thing anyway. (Note: my information on the inner workings of x86 CPUs may be a bit dated.)
HN has a strange fascination with strlen implementations. For anyone who missed it a few months ago, cperciva put up an interesting article about strlen for UTF-8 (variable character width) strings.
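The gist of counting characters in UTF-8 (a naive byte-loop sketch, not cperciva's version) is that continuation bytes all look like 10xxxxxx, so you simply don't count those:

    #include <stddef.h>

    /* Count UTF-8 code points: every byte except continuation bytes
       (0x80..0xBF) starts a character. Assumes well-formed UTF-8. */
    size_t utf8_strlen(const char *s)
    {
        size_t n = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }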
And this is why I read (free|open)bsd's cvsweb repo when I'm curious as to how things are implemented. I used to visit that repo in university when I was bored just to read some clear, concise, well-documented code.
I used to use and love FreeBSD (since switched to OS X as my desktop OS, still run FreeBSD servers), but OpenBSD source code just looked more approachable.
McKusick's wonderful book and video course helped.
The architecture-independent versions aside, I'm afraid a lot of people will take from this comment that the BSD version is "cleaner" and therefore "better." While it is cleaner, the BSD version requires many more jumps (4 times as many, if I remember correctly; I haven't looked at glibc in quite a while).
The glibc version is written in such a way as to minimize the jumps in the assembly and therefore produce faster code. Besides reading 4 bytes at a time, it does some clever (and very nontrivial) "magic." But again, all to reduce the number of jumps.
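For anyone curious, the "magic" boils down to a zero-in-a-word test like this (a minimal sketch, not the actual glibc source; it assumes a 4-byte-aligned string, and glibc peels off any unaligned head bytes first):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_word(const char *s)
    {
        /* Type-punning bytes as words is exactly the nonportable trick
           discussed elsewhere in this thread. */
        const uint32_t *w = (const uint32_t *)s;
        for (;;) {
            uint32_t v = *w++;
            /* (v - 0x01010101) borrows into the top bit of any byte that
               was zero; & ~v discards bytes whose own top bit was set.
               The test is exact: nonzero iff v contains a zero byte. */
            if ((v - 0x01010101u) & ~v & 0x80808080u) {
                const char *p = (const char *)(w - 1);
                while (*p)
                    p++;                 /* locate the exact byte */
                return (size_t)(p - s);
            }
        }
    }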
Don't in any circumstances refer to Unix source code for or during your work on GNU! (Or to any other proprietary programs.)
If you have a vague recollection of the internals of a Unix program, this does not absolutely mean you can't write an imitation of it, but do try to organize the imitation internally along different lines, because this is likely to make the details of the Unix version irrelevant and dissimilar to your results.
For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in memory and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).
Possibly the original non-free version was similar to the OpenBSD one, or perhaps the glibc developers applied this rule even where not strictly necessary. Don't know; just offering a potential clue as to why the GNU version is so different.
When they say "Unix," I think they mean the original, proprietary implementation. I don't think this is the reason why they differ. The BSD implementation is the obvious one. The GNU library version is optimized for a 32-bit processor. Given their domains, that makes sense to me.
"change amd64's MACHINE_ARCH from x86_64 to amd64. There are many many
reasons for this, quite a few of them technical, and not all of them
in response to Intel's broken ia32e crud. The gcc toolchain stays at
x86_64 for now."
I'm worried nobody's going to mod this up because it doesn't talk about bit-twiddling hacks or assembly cycle counts, but it is in fact the real answer to this problem. Don't use ASCIIZ when string processing is a bottleneck.
Indeed, in college I was interviewing for a job, and the interviewer asked how I would determine the length of a null terminated string, and I thought I nailed it by giving a typical "while(*(str++)) c++;" implementation.
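(That one-liner, fleshed out into a complete function:)

    #include <stddef.h>

    size_t my_strlen(const char *str)
    {
        size_t c = 0;
        while (*(str++))
            c++;
        return c;
    }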
I had a job interview once where the question was "find all descendants of a DOM node with a certain tag name, and return them in document order." I thought for about 30 seconds, and then wrote up on the board:
element.getElementsByTagName(tagName);
That was what he was looking for. ;-) Of course, then he had me write it out as if getElementsByTagName didn't exist.
Will I employ a similar technique in my own code? Absolutely not. It hinders readability, and byte-comparison instructions on consecutive memory addresses are so stupidly fast that I'll probably save less CPU time combined across all executions of my program than the total amount of time it took me to think this hack up.
If you're a maintainer of glibc though (which is known for its exceptionally clear and straightforward code), then I might consider accepting a patch from someone who thought it up.
edit: I thought similarly of DJB's loop unrolling when I saw it in qmail. It's cute, but will it make any noticeable difference today? Probably not.
I agree, but I'm sure a gigantic portion of developers never have to worry about this. Also, most architectures have efficient SIMD instructions to handle things like strlen.
I'm just curious if the speed-up of this hack was ever benchmarked. Or if anyone spent a while tracking down a source of bottlenecks, and found it to be an inefficient strlen implementation.
The advantage of 'rep scasb' is that some future version of the Intel processor will be more clever and handle a word per cycle.
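For concreteness, here's roughly what a scasb-based strlen looks like with GCC-style inline assembly (a sketch; note it's really `repne scasb` — scan while not equal — when hunting for the terminator):

    #include <stddef.h>

    size_t strlen_scasb(const char *s)
    {
        const char *p = s;
        size_t count = (size_t)-1;       /* rcx: maximum scan length */
        __asm__ volatile (
            "repne scasb"                /* advance rdi until [rdi] == al */
            : "+D" (p), "+c" (count)
            : "a" (0)
            : "cc", "memory"
        );
        return (size_t)(p - s) - 1;      /* rdi ends one past the NUL */
    }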
An advantage of optimizing the hell out of the library version is that nobody will be tempted to roll their own string compare in their application code. Slow APIs are terrible because they force application developers to work around them. So the answer isn't just to write something slow, then measure. You'll find performance doesn't matter because everyone has avoided using it.
For short strings (the common case), scasb performs the same as a smarter C loop, reduces code size, and saves branches (which is good for caching), which is why most compilers compile strlen to it automatically. But this is kind of a silly point, since performant code doesn't call strlen often.
My overall point was that it makes little sense for developers to come up with similar hacks in their own pet projects or apps. This is a linear-runtime issue, and the efficiency speedup here is trivial. Obviously, if you're maintaining GNU's libc, you probably know more about this than I do and are in a better position to decide whether it's needed, but to Joe Developer, it's just a side-track that obscures the end goals.
The speedup I was referring to, however, wasn't "strlen speedup" but "your entire app running with the naive strlen implementation vs. your entire app with the clever strlen."
I also wasn't saying this has no place at all. I was trying to add that these hacks aren't usually what makes your app execute twice as fast or feel more 'snappy', unless the bread and butter of your app is string processing (in which case there are better running-time algorithms, not just low-level hacks).
I love these as much as the next person, but early and misapplied optimization, in my opinion, is suboptimal as a practice.
I think you're making a good point, but it's misguided here. We're talking about a 2x speedup for strlen on pretty much every Linux system and probably other systems as well. That's millions of machines. This optimization has probably saved untold amounts of computing time.
I'd really like to retire this argument, but I don't exactly agree.
My point is basically as follows:
1) Yes, it probably makes sense for GNU libc to use the optimized implementation.
2) Yes, it's really interesting to dissect when you're looking for clever implementations and hacks.
3) These "2x" speed-up numbers are nice metrics, I feel, but in the end they don't amount to much. Outside of this argument, I feel people misunderstand how long things take in a computer. The relative times of a network or hard drive read, a memory read, a CPU instruction-cache miss, and a dumb single-instruction byte compare are each orders of magnitude apart.
So let's say you're writing your http caching server and you're using the new strlen algorithm. The amount of time your code will spend fetching the item from memory and putting it on the wire will completely eclipse the speedup you get from this fancy strlen. Not to mention if you're writing this in a high-level language, the nanoseconds you save on a linear algorithm will simply not matter.
So you can make the argument that over the last few years, the total time and energy saved by all Linux machines using the new algorithm is worth it. I don't exactly buy it, because most of a modern computer's life is dictated by waiting for input, processing it in a burst, and then more waiting. Most CPUs around the world are sitting at single-digit utilization. If we had all loaded up a bunch of work in 1990 and the world's CPU power spent its time crunching it, we might arrive at an answer a few hours or days sooner. But all it means in realistic terms is that your Linux box will arrive at an answer a few nanoseconds sooner and finally get to start waiting for its next batch sooner (whether that entails serving HTTP requests or waiting for your next keystroke).
I'm all for optimization, but I think it has to be appropriate and measured. It probably makes sense to spend time on nano-optimizations if you maintain one of the most-used libraries in the world, but all I'm trying to say is that, for most readers of HN, time is better spent on algorithmic run-time optimization or caching policies than on getting side-tracked by strlen implementations.
As I pointed out below, I think we can assume "measure before optimize" in this community.
But per your main point, I think you're wrong in assuming that where your time is best spent is true for most HN readers. My research project is a compiler that generates code for the Cell. This kind of optimization - which in general is a vectorization - is directly applicable to what I do. And, yes, in the kinds of applications I target, the difference when this kind of optimization is applied is measurable and significant.
HN has different kinds of hackers. Something that is outside of your scope might be in someone else's scope.
Of course, that's Knuth's warning that "Premature optimization is the root of all evil." I assumed we've all heard that before, and we're only thinking about applying optimizations after measurement.
Sorry for asking such a stupid question, but isn't that code horribly broken? It assumes that the next 3 bytes after a string are readable. What if I malloc'ed memory for the string in such a way that the \0 is the last byte of a memory page, and the next byte after the \0 is on the next page, which is not mapped into my VM space? In that case the CPU will throw a page fault and the process will die with SIGSEGV or SIGBUS. Or is the glibc version of malloc padding all allocations to have at least 3 bytes beyond the last byte?
The glibc strlen begins by checking the first few bytes, if necessary, so that it can continue with the assumption that the string is 4-byte or 8-byte aligned. So no, it's not horribly broken: all the reads are aligned, and you're not going to get a page boundary in the middle of the word.
(Also, you're not going to be reading off the end of a malloc'ed block, because essentially all malloc implementations, including the one in glibc, return blocks whose start address and size are multiples of the architecture's word size.)
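In sketch form, the head-alignment argument looks like this (hypothetical code, not glibc's; the point is that an aligned word read can never straddle a page boundary):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_aligned(const char *s)
    {
        const char *p = s;
        /* Head: go byte-by-byte until p is word aligned. */
        while ((uintptr_t)p % sizeof(uint32_t) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
        /* Body: each aligned 4-byte read lies entirely within one page,
           so it can only touch pages the string itself occupies. */
        const uint32_t *w = (const uint32_t *)p;
        uint32_t v;
        while (!(((v = *w) - 0x01010101u) & ~v & 0x80808080u))
            w++;
        for (p = (const char *)w; *p; p++)
            ;
        return (size_t)(p - s);
    }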
A cool page is Paul Hsieh's; there you will find some good discussion of the trick used: http://www.azillionmonkeys.com/qed/asmexample.html
Look for part 5: "A fast implementation of strlen()".
Interesting, but this is somewhat pointless if you ask me.
It relies on certain assumptions that are not part of the C standard, so while this code works on the majority of platforms, it is not portable C code. In that case, going all the way down to the assembly level makes more sense, especially considering there are typically dedicated CPU instructions for the exact purpose of searching for a zero in a contiguous block of memory. Something like "rep scasb" on x86.
Go have the argument with glibc then (and all the other people who use optimizations like this).
"friendlier to the micro-architecture" doesn't even make sense. Check the chip timings for rep scasb. It's not friendly.
We're not really discussing whether you should be using strlen on large strings or not, but even if it's used, say, a million times on strings of length 80 or so, you'd see an improvement worth having.
Check any assembly language forum, book, etc and there will be discussion on why rep scasb/movsb/cmpsb are lame.
Would you implement string copy with rep movsb as well?
Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
You know that VC++ does implement copies with movsd/movsb, right?
Sorry, I don't read a lot of books and forums on assembly programming. Just the PRM. I'm just stuck reading/writing a lot of assembly on projects.
>> Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
If it takes fewer clock cycles to do branching and comparing by dword (which it does for medium to long strings) than to do rep scasb, then what else matters...?
>> You know that VC++ does implement copies with movsd/movsb, right?
I've stepped through VC++ string copy code in softice many a time thanks.
Notice how I was asking about 'movsb', and you replied with 'movsd/movsb'. Notice the difference?
I have no idea what points you're trying to make here.
You can trade per-byte cycle counts for lower cost to invoke the routine, and for not evicting cache and BTB entries.
On your second point, I assumed it was the "rep" part of the instruction that you were railing against. Apparently it's the "not knowing the difference between a byte and a dword" part. That's awesome. You can have the last word, if you'd like.
rep movsb/rep movsd works well for moving data. However, you obviously can't use that approach for searching for a 0. That's why the code is optimized as it was. My point is that using rep scasb is suboptimal.
Don't know what you're talking about with "lower cost to invoke the routine", and the cache/BTB effects would be negligible for a small routine like this.
You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.
The glibc strlen is something like twice as fast as the naive implementation, but there is something else out there that knocks its socks off.
Secondary conclusion would be: remember not to compare GHz across different processors.
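The "something else" isn't named above, but if it's a vectorized version (SSE2 and friends are usually what knock the word-at-a-time loop's socks off), the shape of it is something like this sketch — the same page-safety argument as before, just with 16-byte aligned loads, and `__builtin_ctz` assumes GCC/Clang:

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_sse2(const char *s)
    {
        /* Round down to a 16-byte boundary; aligned loads stay in-page. */
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        const __m128i zero = _mm_setzero_si128();

        unsigned mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        mask &= ~0u << (s - p);          /* ignore bytes before s */

        while (mask == 0) {
            p += 16;
            mask = (unsigned)_mm_movemask_epi8(
                _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        }
        return (size_t)(p + __builtin_ctz(mask) - s);
    }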