I love Daniel’s optimization posts, but this one seems a little too esoteric: it’s for when autovectorization isn’t fast enough (edit: or isn’t available) but you can’t or don’t want to use SIMD; it gives you a solution solidly better than the former but worse than the latter, and more opaque than either.
In all cases, the real magic is obviously in the 64-bit word comparison routine (which doesn’t get more than a passing mention); converting the former problem into the latter is extremely straightforward and obvious, if you know that optimized solutions for the latter exist. I personally didn’t know of this bit-twiddling hack (aside from doing it via SIMD, obviously) and am grateful to learn about the approach in Muła’s code.
Are there many cases where plain ASCII is okay anymore, anyway? Wouldn't proper UTF-whatever comparisons mandate doing it the slow way, if you wanted counts versus just "same/not the same"?
The actual killer application of character count matching is bioinformatics. You only need four distinct characters to encode DNA, which, of course, means in reality you only need two bits, but I don't know if anyone has bothered to develop a two-bit character encoding system and instructions that can operate on it. They have come up with extremely optimized implementations that compare byte streams, though, utilizing ASCII GCTA.
Bioinformatician here. Yes, people have come up with many encoding systems: 2-bit encodings, 3-bit encodings (to handle N's), various schemes that allow long runs of N's to be stored at little memory cost, and even reduced amino acid alphabets, e.g. 4-bit amino acid sequences.
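For a flavor of what 2-bit packing looks like, here is a minimal sketch in C (the helper names are made up for illustration; real tools use more elaborate schemes):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical 2-bit alphabet: A=00, C=01, G=10, T=11 (no N handling). */
    static inline uint8_t base_to_2bit(char b) {
        switch (b) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3; /* 'T' */
        }
    }

    /* Pack 4 bases per byte; out must have room for (len + 3) / 4 bytes. */
    static void pack_dna(const char *seq, size_t len, uint8_t *out) {
        for (size_t i = 0; i < len; i++) {
            size_t byte = i / 4, shift = (i % 4) * 2;
            if (shift == 0) out[byte] = 0;
            out[byte] |= (uint8_t)(base_to_2bit(seq[i]) << shift);
        }
    }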
In my experience, the performance bottlenecks in bioinformatics are usually IO, parsing, de/compression and needlessly slow algorithms. You rarely get meaningful speed improvements by internally encoding the sequences one way or another (storing small fixed-size sequences in machine integers is an exception). There can be other advantages, though, like reduced memory consumption.
Partly for that reason, most DNA data is just stored as gzipped text files.
That would be more interesting, something that could handle a UTF-8 string in a similar situation (I guess you'd compare the bytes and then do a walk to see if it was part of a multi-byte character? Maybe the high-bit would help?).
For UTF-8 I believe the question is ill-posed. The "position" of a "character" in unicode is very hard to determine and depends largely on what you take to be a character.
Then UTF-8 should not be operated on directly. When reading from a file etc., first decode the input into an internal string format that has straightforward linear access to each grapheme cluster (gc) boundary.
Yeah, that's what the normalization forms are for. But still, almost no question about how to do things in Unicode can be answered in a single sentence without disregarding 99% of the complexity.
Yes! Plain ASCII is great, even when you need to properly handle non-ASCII. When >90% of your strings are ASCII in practice, then having an "is ASCII" bit in your string headers can save a ton of memory, and gain more performance than you lose in flag checks and extra code.
Heck, if your usual encoding is UTF-8, the flag bit can be purely advisory and ignored when it doesn't matter.
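As a rough sketch of what that header might look like (str_t and scan_is_ascii are names I'm making up for illustration, not from any particular library):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical string header with an advisory "is ASCII" flag. */
    typedef struct {
        size_t      len;
        bool        is_ascii;  /* set once at construction, checked on hot paths */
        const char *data;      /* UTF-8 bytes; plain bytes whenever is_ascii is true */
    } str_t;

    /* One pass over the bytes sets the flag: any byte with the high bit set
       means the string is not pure ASCII. */
    static bool scan_is_ascii(const char *s, size_t len) {
        for (size_t i = 0; i < len; i++)
            if ((unsigned char)s[i] & 0x80)
                return false;
        return true;
    }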
If you like this sort of weird bit operation, you should check out HAKMEM, a huge list of hacks created at MIT in 1972.
"Here is some little known data which may be of interest to
computer hackers. The items and examples are so sketchy that to decipher them may require more sincerity and curiosity than a
non-hacker can muster. Doubtless, little of this is new, but
nowadays it's hard to tell."
I love the curiosity/hacking of the early days of computing. Hooking up your registers to an audio amplifier? Sure!
"""
This is a display hack (that is, it makes pretty patterns) with the low 9 bits = Y and the 9 next higher a X; also, it makes interesting, related noises with a stereo amplifier hooked to the X and Y signals.
I'd still take the traditional solution (https://godbolt.org/z/K5hWqezMP) over the latter one (https://godbolt.org/z/jaWMM491M). The latter solution requires a traditional check at the tail end of it, so the compiler appears to include the traditional implementation in the latter's solution anyway.
Unless this is a significant bottleneck for your application, the latter does not seem practical, and a lot of profiling would have to go into justifying the latter code (which would only be worthwhile if, as I said, this were a significant bottleneck).
I did not know the trick of using memcpy to get the compiler to load a register from an arbitrary char* pointer. Given that the architecture supports it, can you assume it will be optimized like that? I assume this is done to stay away from undefined behavior.
Poking around godbolt seems to indicate ARM and x86 do it, but web assembly doesn't: https://www.godbolt.org/z/r86f9nr1q which I guess makes sense (but web assembly has to become actual code at some point, so maybe the optimization is lost entirely?).
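For reference, the pattern being discussed is roughly this (load_u64 is just a name I picked for the sketch):

    #include <stdint.h>
    #include <string.h>

    /* Load 8 bytes from an arbitrary, possibly unaligned char pointer without
       undefined behavior; optimizing compilers typically lower the memcpy to a
       single load instruction on targets that allow unaligned access. */
    static inline uint64_t load_u64(const char *p) {
        uint64_t w;
        memcpy(&w, p, sizeof w);
        return w;
    }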
For a large codebase, this would be encapsulated and heavily commented, etc. It's the kind of thing you pull out when you need to make something "go to eleven".
Plenty of library functions are already going to be optimised in this type of way - just because they're not visible in your codebase doesn't mean they are not there.
So if your big codebase were profiled, and the 99th percentile case happened to spend 80% of its time in this function, you would still stick to the old function unwaveringly?
Even if your system was latency sensitive, and your 99 percentile spikes were far from acceptable?
Sure, there are many cases where this is a premature optimization. This blog never said otherwise. But that doesn't mean there is never a time for optimization. And if profiling points at this, then this speedup can be valuable.
There was an old library routine in VB6 with unoptimized behavior, and the usual argument was that optimizing it wasn't necessary.
I had to write a wrapper for it which was optimized in my code. Was it harder to read? Yes.
Did it produce a significant performance improvement? Also yes.
Software development is about tradeoffs. Demoing one way to improve a simple function helps others think about how they write code, and more importantly why they write specific code.
For those wondering how this works: if two bytes are identical, then the XOR will yield 0x00 and the negation will yield 0xFF. Masking out the seven least significant bits with 0x80 will just yield 0x80 again. Masking out the most significant bit with 0x7F will just yield 0x7F again and adding 0x01 will yield 0x80. Finally combining 0x80 and 0x80 with AND just yields 0x80, i.e. exactly the most significant bit set.
On the other hand if two bytes are not identical, then at least one of the two following things happens. The two bytes differ in the most significant bit and therefore the most significant bit of the XNOR is zero. Then masking out the seven least significant bits with 0x80 yields 0x00 and therefore the final AND yields 0x00.
Or the two bytes differ in at least one of the seven least significant bits and therefore at least one of the seven least significant bits of the XNOR is zero. Then masking out the most significant bit with 0x7F yields something smaller than 0x7F and adding 0x01 yields something smaller than 0x80, i.e. the most significant bit is zero, and therefore the final AND again yields 0x00 because the first operand is either 0x00 or 0x80.
The remaining work is to count the number of set bits, either with popcount or some other construction like the one in the article.
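Put into C, the steps above look roughly like this (my own sketch, assuming the GCC/Clang popcount builtin; c1, c2 and c3 are the repeated 0x80, 0x7F and 0x01 byte constants):

    #include <stdint.h>

    /* Count how many of the 8 byte positions in a and b hold identical bytes. */
    static inline int matching_bytes(uint64_t a, uint64_t b) {
        const uint64_t c1 = 0x8080808080808080ULL; /* MSB of each byte        */
        const uint64_t c2 = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits of each byte */
        const uint64_t c3 = 0x0101010101010101ULL; /* +1 per byte             */
        uint64_t x  = a ^ b;                       /* 0x00 in bytes that match          */
        uint64_t nx = ~x;                          /* negation (XNOR): 0xFF where equal */
        uint64_t m  = (nx & c1) & ((nx & c2) + c3); /* 0x80 in matching bytes, else 0x00 */
        return __builtin_popcountll(m);            /* one set bit per matching byte     */
    }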
Except this is suboptimal, as is the code in the original post: it specifies ASCII, so (e & c1) is guaranteed to be zero. In your code, that means you could just do
return popcount((e & c2) + c3);
(Thanks for writing this out, as otherwise I wouldn't have spotted it.)
The author certainly meant extended ASCII, and therefore you cannot make this change. Sure, it may be technically incorrect, but ASCII is so commonly used to refer to extended ASCII that this should probably be the default interpretation unless it is clear from the context, say a discussion of the history of character encodings, that it is meant in its original sense.
but the real question is, can you write the most optimal version on the spot in under a minute, while your interviewer is babbling and distracting you, and use a binary search?