1 million strlens on the same random 100 byte string:
Pentium 4 3.0GHz, gcc 4.3.2 (Debian Lenny)
glibc: 0.52ns/char
libc: 0.60ns/char ???
easy: 0.94ns/char
obsd: 0.94/char
I would expect libc and glibc to match since this is a Debian machine, and they did match on the Atom 330. I can't account for the difference here; perhaps the difference between the Debian library's compilation flags and my -O3 makes a difference on this processor.
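For reference, a harness along these lines is all it takes to reproduce numbers like the above (a hypothetical sketch; exact flags, the string contents, and the era's need for -lrt with clock_gettime will shift the results):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        enum { LEN = 100, ITERS = 1000000 };
        char buf[LEN + 1];
        srand(42);
        for (int i = 0; i < LEN; i++)
            buf[i] = (char)(rand() % 255 + 1);   /* random nonzero bytes */
        buf[LEN] = '\0';

        /* volatile pointer discourages the compiler from hoisting the
           strlen call out of the loop entirely */
        char *volatile p = buf;
        volatile size_t sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            sink += strlen(p);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/char\n", ns / ((double)ITERS * LEN));
        (void)sink;
        return 0;
    }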
Don't tell the Gentoo crowd, you'll only encourage them.
I'd give it 5 also. It's a ridiculously clever fun book.
I agree with some of the reviewers though; the title probably means fewer people manage to find it than if it were called "Bit manipulation bible" or something.
That makes the x86 assembly version look straightforward: http://www.int80h.org/strlen/. I think it is just a fallback implementation that is used when no optimized version has been created for the target processor. Every time I've looked, my compiler has used a hand-optimized version for every platform. Quite possibly, this is 100% dead code.
That looks like a pretty straightforward translation of the basic while (*d++ = *s++); code. It uses higher-level instructions, but those probably get microcoded to the same thing anyway. (Note: my information on the inner workings of x86 CPUs may be a bit dated.)
HN has a strange fascination with strlen implementations. For anyone who missed it a few months ago, cperciva put up an interesting article about strlen for UTF-8 (variable character width) strings.
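The gist of counting characters in UTF-8 (a naive byte-loop sketch, not cperciva's version) is that continuation bytes all look like 10xxxxxx, so you simply don't count those:

    #include <stddef.h>

    /* Count UTF-8 code points: every byte except continuation bytes
       (0x80..0xBF) starts a character. Assumes well-formed UTF-8. */
    size_t utf8_strlen(const char *s)
    {
        size_t n = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }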
And this is why I read (free|open)bsd's cvsweb repo when I'm curious as to how things are implemented. I used to visit that repo in university when I was bored just to read some clear, concise, well-documented code.
I used to use and love FreeBSD (since switched to OS X as my desktop OS, still run FreeBSD servers), but OpenBSD source code just looked more approachable.
McKusick's wonderful book and video course helped.
The architecture-independent versions aside, I'm afraid a lot of people will take from this comment that the BSD version is "cleaner" and therefore "better." While it is cleaner, the BSD version requires many more jumps (4 times as many, if I remember correctly; I haven't looked at glibc in quite a while).
The glibc version is written in such a way as to minimize the jumps in the assembly and therefore produce faster code. Besides reading 4 bytes at a time, it does some clever (and very nontrivial) "magic." But again, all to reduce the number of jumps.
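For anyone curious, the "magic" boils down to a zero-in-a-word test like this (a minimal sketch, not the actual glibc source; it assumes a 4-byte-aligned string, and glibc peels off any unaligned head bytes first):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_word(const char *s)
    {
        /* Type-punning bytes as words is exactly the nonportable trick
           discussed elsewhere in this thread. */
        const uint32_t *w = (const uint32_t *)s;
        for (;;) {
            uint32_t v = *w++;
            /* (v - 0x01010101) borrows into the top bit of any byte that
               was zero; & ~v discards bytes whose own top bit was set.
               The test is exact: nonzero iff v contains a zero byte. */
            if ((v - 0x01010101u) & ~v & 0x80808080u) {
                const char *p = (const char *)(w - 1);
                while (*p)
                    p++;                 /* locate the exact byte */
                return (size_t)(p - s);
            }
        }
    }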
Don't in any circumstances refer to Unix source code for or during your work on GNU! (Or to any other proprietary programs.)
If you have a vague recollection of the internals of a Unix program, this does not absolutely mean you can't write an imitation of it, but do try to organize the imitation internally along different lines, because this is likely to make the details of the Unix version irrelevant and dissimilar to your results.
For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in memory and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).
Possibly the original non-free version was similar to the OpenBSD one, or perhaps the glibc developers applied this rule even where not strictly necessary. Don't know; just offering a potential clue as to why the GNU version is so different.
When they say "Unix," I think they mean the original, proprietary implementation. I don't think this is the reason why they differ. The BSD implementation is the obvious one. The GNU library version is optimized for a 32-bit processor. Given their domains, that makes sense to me.
"change amd64's MACHINE_ARCH from x86_64 to amd64. There are many many
reasons for this, quite a few of them technical, and not all of them
in response to Intel's broken ia32e crud. The gcc toolchain stays at
x86_64 for now."
I'm worried nobody's going to mod this up because it doesn't talk about bit-twiddling hacks or assembly cycle counts, but it is in fact the real answer to this problem. Don't use ASCIIZ when string processing is a bottleneck.
Indeed, in college I was interviewing for a job, and the interviewer asked how I would determine the length of a null terminated string, and I thought I nailed it by giving a typical "while(*(str++)) c++;" implementation.
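(That one-liner, fleshed out into a complete function:)

    #include <stddef.h>

    size_t my_strlen(const char *str)
    {
        size_t c = 0;
        while (*(str++))
            c++;
        return c;
    }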
I had a job interview once where the question was "find all descendants of a DOM node with a certain tag name, and return them in document order." I thought for about 30 seconds, and then wrote up on the board:
element.getElementsByTagName(tagName);
That was what he was looking for. ;-) Of course, then he had me write it out as if getElementsByTagName didn't exist.
Will I employ a similar technique in my own code? Absolutely not. It hinders readability, and byte-comparison instructions on consecutive memory addresses are so stupidly fast that I'll probably save less CPU time combined across all executions of my program than the total amount of time it took me to think this hack up.
If you're a maintainer of glibc though (which is known for its exceptionally clear and straightforward code), then I might consider accepting a patch from someone who thought it up.
edit: I thought similarly of DJB's loop unrolling when I saw it in qmail. It's cute, but will it make any noticeable difference today? Probably not.
I agree, but I'm sure a gigantic portion of developers never have to worry about this. Also, most architectures have efficient SIMD instructions to handle things like strlen.
I'm just curious if the speed-up of this hack was ever benchmarked. Or if anyone spent a while tracking down a source of bottlenecks, and found it to be an inefficient strlen implementation.
The advantage of 'rep scasb' is that some future version of the Intel processor will be more clever and handle a word per cycle.
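For concreteness, here's roughly what a scasb-based strlen looks like with GCC-style inline assembly (a sketch; note it's really `repne scasb` — scan while not equal — when hunting for the terminator):

    #include <stddef.h>

    size_t strlen_scasb(const char *s)
    {
        const char *p = s;
        size_t count = (size_t)-1;       /* rcx: maximum scan length */
        __asm__ volatile (
            "repne scasb"                /* advance rdi until [rdi] == al */
            : "+D" (p), "+c" (count)
            : "a" (0)
            : "cc", "memory"
        );
        return (size_t)(p - s) - 1;      /* rdi ends one past the NUL */
    }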
An advantage of optimizing the hell out of the library version is that nobody will be tempted to roll their own string compare in their application code. Slow APIs are terrible because they force application developers to work around them. So the answer isn't just to write something slow, then measure. You'll find performance doesn't matter because everyone has avoided using it.
For short strings (the common case), scasb performs the same as a smarter C loop, reduces code size, and saves branches (which is good for caching), which is why most compilers compile strlen to it automatically. But this is kind of a silly point, since performant code doesn't call strlen often.
My overall point was that it makes little sense for developers to come up with similar hacks in their own pet projects or apps. This is a linear-runtime issue, and the efficiency speedup here is trivial. Obviously, if you're maintaining GNU's libc, you probably know more about this than I do and are in a better position to decide whether it's needed, but to Joe Developer, it's just a side-track that obscures the end goals.
The speedup I was referring to, however, wasn't "strlen speedup" but "your entire app running with the naive strlen implementation vs. your entire app with the clever strlen."
I also wasn't saying this has no place at all. I was trying to add that these hacks aren't usually what makes your app execute twice as fast or feel more 'snappy', unless the bread and butter of your app is string processing (in which case there are better running-time algorithms, not just low-level hacks).
I love these as much as the next person, but early and misapplied optimization, in my opinion, is suboptimal as a practice.
I think you're making a good point, but it's misguided here. We're talking about a 2x speedup for strlen on pretty much every Linux system and probably other systems as well. That's millions of machines. This optimization has probably saved untold amounts of computing time.
I'd really like to retire this argument, but I don't exactly agree.
My point is basically as follows:
1) Yes, it probably makes sense for GNU libc to use the optimized implementation.
2) Yes, it's really interesting to dissect when you're looking for clever implementations and hacks.
3) These "2x" speed-up numbers are nice metrics, I feel, but in the end they don't amount to much. Outside of this argument, I feel people misunderstand how long things take in a computer. The relative times of a network or hard drive read, a memory read, a CPU instruction-cache miss, and a dumb single-instruction byte compare are each orders of magnitude apart.
So let's say you're writing your http caching server and you're using the new strlen algorithm. The amount of time your code will spend fetching the item from memory and putting it on the wire will completely eclipse the speedup you get from this fancy strlen. Not to mention if you're writing this in a high-level language, the nanoseconds you save on a linear algorithm will simply not matter.
So you can make the argument that over the last few years, the total time and energy saved by all Linux machines using the new algorithm is worth it. I don't exactly buy it, because most of a modern computer's life is dictated by waiting for input, processing it in a burst, and then more waiting. Most CPUs around the world are sitting at single-digit utilization. If we had all loaded up a bunch of work in 1990 and the world's CPU power spent its time crunching it, we might arrive at an answer a few hours or days sooner. But all it means in realistic terms is that your Linux box will arrive at an answer a few nanoseconds sooner and finally get to start waiting for its next batch sooner (whether that entails serving HTTP requests or waiting for your next keystroke).
I'm all for optimization, but I think it has to be appropriate and measured. It probably makes sense to spend time on nano-optimizations if you maintain one of the most-used libraries in the world, but all I'm trying to say is that, for most readers of HN, time is better spent on algorithmic run-time optimization or caching policies than on getting side-tracked by strlen implementations.
As I pointed out below, I think we can assume "measure before optimize" in this community.
But per your main point, I think you're wrong in assuming that where your time is best spent is true for most HN readers. My research project is a compiler that generates code for the Cell. This kind of optimization - which in general is a vectorization - is directly applicable to what I do. And, yes, in the kinds of applications I target, the difference when this kind of optimization is applied is measurable and significant.
HN has different kinds of hackers. Something that is outside of your scope might be in someone else's scope.
Of course, that's Knuth's warning that "Premature optimization is the root of all evil." I assumed we've all heard that before, and we're only thinking about applying optimizations after measurement.
Sorry for asking such a stupid question, but isn't that code horribly broken? It assumes that the next 3 bytes after a string are readable. What if I malloc'ed memory for the string in such a way that the \0 is the last byte of a memory page, and the next byte after the \0 is on the next page, which is not mapped into my VM space? In that case the CPU will throw a page fault and the process will die with SIGSEGV or SIGBUS. Or is the glibc version of malloc padding all allocations to have at least 3 bytes beyond the last byte?
The glibc strlen begins by checking the first few bytes, if necessary, so that it can continue with the assumption that the string is 4-byte or 8-byte aligned. So no, it's not horribly broken: all the reads are aligned, and you're not going to get a page boundary in the middle of the word.
(Also, you're not going to be reading off the end of a malloc'ed block, because essentially all malloc implementations, including the one in glibc, return blocks whose start address and size are multiples of the architecture's word size.)
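In sketch form, the head-alignment argument looks like this (hypothetical code, not glibc's; the point is that an aligned word read can never straddle a page boundary):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_aligned(const char *s)
    {
        const char *p = s;
        /* Head: go byte-by-byte until p is word aligned. */
        while ((uintptr_t)p % sizeof(uint32_t) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
        /* Body: each aligned 4-byte read lies entirely within one page,
           so it can only touch pages the string itself occupies. */
        const uint32_t *w = (const uint32_t *)p;
        uint32_t v;
        while (!(((v = *w) - 0x01010101u) & ~v & 0x80808080u))
            w++;
        for (p = (const char *)w; *p; p++)
            ;
        return (size_t)(p - s);
    }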
A cool page is Paul Hsieh's; there you will find some good discussion of the trick used: http://www.azillionmonkeys.com/qed/asmexample.html
Look for part 5: "A fast implementation of strlen()".
Interesting, but this is somewhat pointless if you ask me.
It relies on certain assumptions that are not part of the C standard, so while this code works on the majority of platforms, it is not portable C code. In that case, going all the way down to the assembly level makes more sense, especially considering there are typically dedicated CPU instructions for the exact purpose of searching for a zero in a contiguous block of memory. Something like "rep scasb" on x86.
Go have the argument with glibc then (and all the other people who use optimizations like this).
"friendlier to the micro-architecture" doesn't even make sense. Check the chip timings for rep scasb. It's not friendly.
We're not really discussing whether you should be using strlen on large strings or not, but even if it's used, say, a million times on strings of length 80 or so, you'd see an improvement worth having.
Check any assembly language forum, book, etc and there will be discussion on why rep scasb/movsb/cmpsb are lame.
Would you implement string copy with rep movsb as well?
Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
You know that VC++ does implement copies with movsd/movsb, right?
Sorry, I don't read a lot of books and forums on assembly programming. Just the PRM. I'm just stuck reading/writing a lot of assembly on projects.
>> Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
If it takes fewer clock cycles to do branching and comparing by dword (which it does for medium to long strings) than to do rep scasb, then what else matters...?
>> You know that VC++ does implement copies with movsd/movsb, right?
I've stepped through VC++ string copy code in softice many a time thanks.
Notice how I was asking about 'movsb', and you replied with 'movsd/movsb'. Notice the difference?
I have no idea what points you're trying to make here.
You can trade per-byte cycle counts for lower cost to invoke the routine, and for not evicting cache and BTB entries.
On your second point, I assumed it was the "rep" part of the instruction that you were railing against. Apparently it's the "not knowing the difference between a byte and a dword" part. That's awesome. You can have the last word, if you'd like.
rep movsb/rep movsd works well for moving data. However, you obviously can't use that approach for searching for a 0. That's why the code is optimized as it was. My point is that using rep scasb is suboptimal.
Don't know what you're talking about with "lower cost to invoke the routine", and the cache/BTB effects would be negligible for a small routine like this.
You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.
The glibc strlen is something like twice as fast as the naive implementation, but there is something else out there that knocks its socks off.
Secondary conclusion would be: remember not to compare GHz across different processors.
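The "something else" isn't named above, but if it's a vectorized version (SSE2 and friends are usually what knock the word-at-a-time loop's socks off), the shape of it is something like this sketch — the same page-safety argument as before, just with 16-byte aligned loads, and `__builtin_ctz` assumes GCC/Clang:

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_sse2(const char *s)
    {
        /* Round down to a 16-byte boundary; aligned loads stay in-page. */
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        const __m128i zero = _mm_setzero_si128();

        unsigned mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        mask &= ~0u << (s - p);          /* ignore bytes before s */

        while (mask == 0) {
            p += 16;
            mask = (unsigned)_mm_movemask_epi8(
                _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        }
        return (size_t)(p + __builtin_ctz(mask) - s);
    }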