I love Daniel’s optimization posts, but this one seems a little too esoteric: it’s for when autovectorization isn’t fast enough (edit: or isn’t available) but you can’t or don’t want to use SIMD; it gives you a solution solidly better than the former but worse than the latter, and more opaque than either.
In all cases, the real magic is obviously in the 64-bit word comparison routine (which gets no more than a passing mention); converting the former problem into the latter is straightforward, provided you know that optimized solutions for the latter exist. I personally didn't know of this bit-twiddling hack (aside from doing it via SIMD, obviously) and am grateful to learn about the approach in Muła's code.
Are there many cases where plain ASCII is okay anymore, anyway? Wouldn't proper UTF-whatever comparisons mandate doing it the slow way, if you wanted counts versus just "same/not the same"?
The actual killer application of character-count matching is bioinformatics. You only need four distinct characters to encode DNA, which, of course, means you really only need two bits, but I don't know whether anyone has bothered to develop a two-bit character encoding and instructions that can operate on it. People have come up with extremely optimized implementations that compare byte streams, though, using ASCII GCTA.
Bioinformatician here. Yes, people have come up with many encoding systems: 2-bit encodings, 3-bit encodings (for N's), various schemes that allow long runs of N's to be stored at little memory cost, and even reduced amino acid alphabets, e.g. 4-bit amino acid sequences.
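For what it's worth, the 2-bit packing itself is only a few lines. A minimal sketch in C, assuming the common A=0, C=1, G=2, T=3 convention (the helper name is hypothetical, and real code would need to handle N's, lowercase, etc.):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack up to 32 DNA bases into one uint64_t, 2 bits per base,
 * lowest-order bits holding the first base. Illustration only. */
static uint64_t pack2bit(const char *seq, size_t n)
{
    uint64_t w = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t code;
        switch (seq[i]) {
        case 'A': code = 0; break;
        case 'C': code = 1; break;
        case 'G': code = 2; break;
        case 'T': code = 3; break;
        default:  code = 0; break;  /* real code must handle N etc. */
        }
        w |= code << (2 * i);
    }
    return w;
}
```

With this layout, `pack2bit("ACGT", 4)` yields `0xE4`, and fixed-size k-mers fit in a machine word, which is exactly the "small fixed-size sequences in machine integers" exception mentioned above.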
In my experience, the performance bottlenecks in bioinformatics are usually IO, parsing, de/compression, and needlessly slow algorithms. You rarely get meaningful speed improvements by internally encoding the sequences one way or another (storing small fixed-size sequences in machine integers is an exception). There can be other advantages, though, like reduced memory consumption.
Partly for that reason, most DNA data is just stored as gzipped text files.
That would be more interesting, something that could handle a UTF-8 string in a similar situation (I guess you'd compare the bytes and then do a walk to see if it was part of a multi-byte character? Maybe the high-bit would help?).
For UTF-8 I believe the question is ill-posed. The "position" of a "character" in unicode is very hard to determine and depends largely on what you take to be a character.
Then UTF-8 should not be operated on directly. When reading from a file etc., first decode the input into an internal string format that has straightforward linear access to each grapheme cluster boundary.
Yeah, that's what the normalization forms are for. But still, almost no question about how to do things in Unicode can be answered in a single sentence without disregarding 99% of the complexity.
Yes! Plain ASCII is great, even when you need to properly handle non-ASCII. When >90% of your strings are ASCII in practice, then having an "is ASCII" bit in your string headers can save a ton of memory, and gain more performance than you lose in flag checks and extra code.
Heck, if your usual encoding is UTF-8, the flag bit can be purely advisory and ignored when it doesn't matter.