I love Daniel’s optimization posts, but this one seems a little too esoteric: it’s for when autovectorization isn’t fast enough (edit: or isn’t available) but you can’t or don’t want to use SIMD; it gives you a solution solidly better than the former but worse than the latter, and more opaque than either.
In all cases, the real magic is obviously in the 64-bit word comparison routine (which gets no more than a passing mention); converting the former problem into the latter is straightforward, provided you know that optimized solutions for the latter exist. I personally didn't know of this bit-twiddling hack (aside from doing it via SIMD, obviously) and am grateful to learn about the approach in Muła's code.
Are there many cases where plain ASCII is okay anymore, anyway? Wouldn't proper UTF-whatever comparisons mandate doing it the slow way, if you wanted counts versus just "same/not the same"?
The actual killer application of character-count matching is bioinformatics. You only need four distinct characters to encode DNA, which, of course, means you really only need two bits, but I don't know whether anyone has bothered to develop a two-bit character encoding and instructions that can operate on it. People have come up with extremely optimized implementations that compare byte streams, though, using ASCII GCTA.
Bioinformatician here. Yes, people have come up with many encoding systems: 2-bit encodings, 3-bit encodings (for N's), various schemes that allow long runs of N's to be stored at little memory cost, and even reduced amino acid alphabets, e.g. 4-bit amino acid sequences.
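For what it's worth, the 2-bit packing itself is only a few lines. A minimal sketch in C, assuming the common A=0, C=1, G=2, T=3 convention (the helper name is hypothetical, and real code would need to handle N's, lowercase, etc.):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack up to 32 DNA bases into one uint64_t, 2 bits per base,
 * lowest-order bits holding the first base. Illustration only. */
static uint64_t pack2bit(const char *seq, size_t n)
{
    uint64_t w = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t code;
        switch (seq[i]) {
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        case 'T': code = 3; break;
        default:  code = 0; break;  /* real code must handle N etc. */
        }
        w |= code << (2 * i);
    }
    return w;
}
```

With this layout, `pack2bit("ACGT", 4)` yields `0xE4`, and fixed-size k-mers fit in a machine word, which is exactly the "small fixed-size sequences in machine integers" exception mentioned above.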
In my experience, the performance bottlenecks in bioinformatics are usually IO, parsing, de/compression, and needlessly slow algorithms. You rarely get meaningful speed improvements by internally encoding the sequences one way or another (storing small fixed-size sequences in machine integers is an exception). There can be other advantages, though, like reduced memory consumption.
Partly for that reason, most DNA data is just stored as gzipped text files.
That would be more interesting, something that could handle a UTF-8 string in a similar situation (I guess you'd compare the bytes and then do a walk to see if it was part of a multi-byte character? Maybe the high-bit would help?).
For UTF-8 I believe the question is ill-posed. The "position" of a "character" in unicode is very hard to determine and depends largely on what you take to be a character.
Then UTF-8 should not be operated on directly. When reading from a file etc., first decode the input into an internal string format that has straightforward linear access to each grapheme cluster boundary.
Yeah, that's what the normalization forms are for. But still, almost no question about how to do things in Unicode can be answered in a single sentence without disregarding 99% of the complexity.
Yes! Plain ASCII is great, even when you need to properly handle non-ASCII. When >90% of your strings are ASCII in practice, then having an "is ASCII" bit in your string headers can save a ton of memory, and gain more performance than you lose in flag checks and extra code.
Heck, if your usual encoding is UTF-8, the flag bit can be purely advisory and ignored when it doesn't matter.