Hacker News

I love Daniel’s optimization posts, but this one seems a little too esoteric: it’s for when autovectorization isn’t fast enough (edit: or isn’t available) but you can’t or don’t want to use SIMD; it gives you a solution solidly better than the former but worse than the latter, and more opaque than either.

In all cases, obviously the real magic is in the 64-bit word comparison routine (which doesn’t get more than a passing mention); converting the former problem into the latter is extremely straightforward and obvious - if you know that optimized solutions for the latter exist (I personally didn’t know of this bit twiddling hack (aside from doing it via SIMD, obviously) and am grateful to learn about the approach in Muła’s code).



Are there many cases where plain ASCII is okay anymore, anyway? Wouldn't proper UTF-whatever comparisons mandate doing it the slow way, if you wanted counts versus just "same/not the same"?


The actual killer application of character count matching is bioinformatics. You only need four distinct characters to encode DNA, which, of course, means you really only need two bits, but I don't know if anyone has bothered to develop a two-bit character encoding system and instructions that can operate on it. People have come up with extremely optimized implementations that compare byte streams, though, using ASCII GCTA.


Bioinformatician here. Yes, people have come up with many encoding systems: 2-bit encoding systems, 3-bit encoding systems (for N's), various encoding schemes that allow long runs of N's to be stored at little memory cost, and even reduced amino acid alphabets for e.g. 4-bit amino acid sequences.

In my experience, the performance bottlenecks in bioinformatics are usually I/O, parsing, de/compression, and needlessly slow algorithms. You rarely get meaningful speed improvements by internally encoding the sequences one way or another (storing small fixed-size sequences in machine integers is an exception). There can be other advantages, though, like reduced memory consumption.

Partly for that reason, most DNA data is just stored as gzipped text files.


Could be an area to use FPGA-based arrays for far greater throughput. Some recent RISC-V designs include FPGA blocks.


except amino acids are encoded by three bases (a codon), meaning that the actual character encoding is 6 bits.


Proper comparison would have to be even slower as you’d need to deal with normalisation & equivalence.

The "slow way" of the post would also count matching code units, which is utterly useless, really.


That would be more interesting, something that could handle a UTF-8 string in a similar situation (I guess you'd compare the bytes and then do a walk to see if it was part of a multi-byte character? Maybe the high-bit would help?).


For UTF-8 I believe the question is ill-posed. The "position" of a "character" in unicode is very hard to determine and depends largely on what you take to be a character.


Then UTF-8 should not be operated on directly. When reading from a file etc., first decode the input into an internal string format that has straightforward linear access to each gc boundary.


Even then, https://en.wikipedia.org/wiki/Unicode_equivalence . What does "matching character" mean given combining characters?


Probably pick a canonical encoding and compare there.


Yeah, that's what the normalization forms are for. But still, almost no question about how to do things in Unicode can be answered in a single sentence without disregarding 99% of the complexity.


HTTP headers.


hex, base64 encoded strings, etc


That makes sense for "same/not-same" comparisons to me. I don't know why you would want counts of different characters though.


"Password must be different by at least 6 characters" perhaps, though that's the inverse, really.

Fuzzy matching is another aspect that comes to mind.


Yes! Plain ASCII is great, even when you need to properly handle non-ASCII. When >90% of your strings are ASCII in practice, then having an "is ASCII" bit in your string headers can save a ton of memory, and gain more performance than you lose in flag checks and extra code.

Heck, if your usual encoding is UTF-8, the flag bit can be purely advisory and ignored when it doesn't matter.


If the optimizing compiler can figure out to use SIMD shouldn’t it also be able to degrade to this when SIMD isn’t available?

And if we’re already down the “write it understandable and trust the compiler” path isn’t that the better way to go?

It’s interesting to me that C is now effectively a “higher level” language than it may appear on the surface because of optimization by compilers.


> I personally didn’t know of this bit twiddling hack and am grateful to learn about the approach in Muła’s code

Yeah the last line had me scratching my head for a bit. Cool stuff.



