I love Daniel’s optimization posts, but this one seems a little too esoteric: it’s for when autovectorization isn’t fast enough (edit: or isn’t available) but you can’t or don’t want to use SIMD; it gives you a solution solidly better than the former but worse than the latter, and more opaque than either.
In all cases, the real magic is obviously in the 64-bit word comparison routine (which doesn’t get more than a passing mention); converting the former problem into the latter is extremely straightforward and obvious, if you know that optimized solutions for the latter exist. I personally didn’t know of this bit-twiddling hack (aside from doing it via SIMD, obviously) and am grateful to learn about the approach in Muła’s code.
Are there many cases where plain ASCII is okay anymore, anyway? Wouldn't proper UTF-whatever comparisons mandate doing it the slow way, if you wanted counts versus just "same/not the same"?
The actual killer application of character count matching is bioinformatics. You only need four distinct characters to encode DNA, which, of course, means in reality you only need two bits, but I don't know if anyone has bothered to develop a two-bit character encoding system and instructions that can operate on it. They have come up with extremely optimized implementations that compare byte streams, though, utilizing ASCII GCTA.
Bioinformatician here. Yes, people have come up with many encoding systems: 2-bit encodings, 3-bit encodings (to handle N's), various schemes that allow long runs of N's to be stored at little memory cost, and even reduced amino acid alphabets, e.g. 4-bit amino acid sequences.
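For a flavor of what 2-bit packing looks like, here is a minimal sketch in C (the helper names are made up for illustration; real tools use more elaborate schemes):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical 2-bit alphabet: A=00, C=01, G=10, T=11 (no N handling). */
    static inline uint8_t base_to_2bit(char b) {
        switch (b) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3; /* 'T' */
        }
    }

    /* Pack 4 bases per byte; out must have room for (len + 3) / 4 bytes. */
    static void pack_dna(const char *seq, size_t len, uint8_t *out) {
        for (size_t i = 0; i < len; i++) {
            size_t byte = i / 4, shift = (i % 4) * 2;
            if (shift == 0) out[byte] = 0;
            out[byte] |= (uint8_t)(base_to_2bit(seq[i]) << shift);
        }
    }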
In my experience, the performance bottlenecks in bioinformatics are usually IO, parsing, de/compression and needlessly slow algorithms. You rarely get meaningful speed improvements by internally encoding the sequences one way or another (storing small fixed-size sequences in machine integers is an exception). There can be other advantages, though, like reduced memory consumption.
Partly for that reason, most DNA data is just stored as gzipped text files.
That would be more interesting, something that could handle a UTF-8 string in a similar situation (I guess you'd compare the bytes and then do a walk to see if it was part of a multi-byte character? Maybe the high-bit would help?).
For UTF-8 I believe the question is ill-posed. The "position" of a "character" in unicode is very hard to determine and depends largely on what you take to be a character.
Then UTF-8 should not be operated on directly. When reading from a file etc., first decode the input into an internal string format that has straightforward linear access to each grapheme cluster (gc) boundary.
Yeah, that's what the normalization forms are for. But still, almost no question about how to do things in Unicode can be answered in a single sentence without disregarding 99% of the complexity.
Yes! Plain ASCII is great, even when you need to properly handle non-ASCII. When >90% of your strings are ASCII in practice, then having an "is ASCII" bit in your string headers can save a ton of memory, and gain more performance than you lose in flag checks and extra code.
Heck, if your usual encoding is UTF-8, the flag bit can be purely advisory and ignored when it doesn't matter.
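As a rough sketch of what that header might look like (str_t and scan_is_ascii are names I'm making up for illustration, not from any particular library):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical string header with an advisory "is ASCII" flag. */
    typedef struct {
        size_t      len;
        bool        is_ascii;  /* set once at construction, checked on hot paths */
        const char *data;      /* UTF-8 bytes; plain bytes whenever is_ascii is true */
    } str_t;

    /* One pass over the bytes sets the flag: any byte with the high bit set
       means the string is not pure ASCII. */
    static bool scan_is_ascii(const char *s, size_t len) {
        for (size_t i = 0; i < len; i++)
            if ((unsigned char)s[i] & 0x80)
                return false;
        return true;
    }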
If you like this sort of weird bit operation, you should check out HAKMEM, a huge list of hacks created at MIT in 1972.
"Here is some little known data which may be of interest to
computer hackers. The items and examples are so sketchy that to decipher them may require more sincerity and curiosity than a
non-hacker can muster. Doubtless, little of this is new, but
nowadays it's hard to tell."
I love the curiosity/hacking of the early days of computing. Hooking up your registers to an audio amplifier? Sure!
"""
This is a display hack (that is, it makes pretty patterns) with the low 9 bits = Y and the 9 next higher a X; also, it makes interesting, related noises with a stereo amplifier hooked to the X and Y signals.
I'd still take the traditional solution (https://godbolt.org/z/K5hWqezMP) over the latter one (https://godbolt.org/z/jaWMM491M). The latter solution requires a traditional check at the tail end of it, so the compiler appears to include the traditional implementation in the latter's solution anyway.
Unless this is a significant bottleneck for your application, the latter does not seem practical, and a lot of profiling would have to go into justifying the latter code (which would only be worthwhile if, as I said, this were a significant bottleneck).
I did not know the trick of using memcpy to get the compiler to load a register from an arbitrary char* pointer. Given that the architecture supports it, can you assume it will be optimized like that? I assume this is done to stay away from undefined behavior.
Poking around godbolt seems to indicate ARM and x86 do it, but web assembly doesn't: https://www.godbolt.org/z/r86f9nr1q which I guess makes sense (but web assembly has to become actual code at some point, so maybe the optimization is lost entirely?).
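For reference, the pattern being discussed is roughly this (load_u64 is just a name I picked for the sketch):

    #include <stdint.h>
    #include <string.h>

    /* Load 8 bytes from an arbitrary, possibly unaligned char pointer without
       undefined behavior; optimizing compilers typically lower the memcpy to a
       single load instruction on targets that allow unaligned access. */
    static inline uint64_t load_u64(const char *p) {
        uint64_t w;
        memcpy(&w, p, sizeof w);
        return w;
    }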
For a large codebase, this would be encapsulated and heavily commented, etc. It's the kind of thing you pull out when you need to make something "go to eleven".
Plenty of library functions are already going to be optimised in this type of way - just because they're not visible in your codebase doesn't mean they are not there.
So if your big codebase were profiled, and the 99th percentile case happened to spend 80% of its time in this function, you would still stick to the old function unwaveringly?
Even if your system was latency sensitive, and your 99 percentile spikes were far from acceptable?
Sure, there are many cases where this is a premature optimization. This blog never said otherwise. But that doesn't mean there is never a time for optimization. And if profiling points at this, then this speedup can be valuable.
There was an old library routine in VB6 with unoptimized behavior, and the usual argument was that optimizing it wasn't necessary.
I had to write a wrapper for it which was optimized in my code. Was it harder to read? Yes.
Did it produce a significant performance improvement? Also yes.
Software development is about tradeoffs. Demoing one way to improve a simple function helps others think about how they write code, and more importantly why they write specific code.
For those wondering how this works: if two bytes are identical, then the XOR will yield 0x00 and the negation will yield 0xFF. Masking out the seven least significant bits with 0x80 will just yield 0x80 again. Masking out the most significant bit with 0x7F will just yield 0x7F again and adding 0x01 will yield 0x80. Finally combining 0x80 and 0x80 with AND just yields 0x80, i.e. exactly the most significant bit set.
On the other hand if two bytes are not identical, then at least one of the two following things happens. The two bytes differ in the most significant bit and therefore the most significant bit of the XNOR is zero. Then masking out the seven least significant bits with 0x80 yields 0x00 and therefore the final AND yields 0x00.
Or the two bytes differ in at least one of the seven least significant bits and therefore at least one of the seven least significant bits of the XNOR is zero. Then masking out the most significant bit with 0x7F yields something smaller than 0x7F and adding 0x01 yields something smaller than 0x80, i.e. the most significant bit is zero, and therefore the final AND again yields 0x00 because the first operand is either 0x00 or 0x80.
The remaining work is to count the number of set bits, either with popcount or some other construction like the one in the article.
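Put into C, the steps above look roughly like this (my own sketch, assuming the GCC/Clang popcount builtin; c1, c2 and c3 are the repeated 0x80, 0x7F and 0x01 byte constants):

    #include <stdint.h>

    /* Count how many of the 8 byte positions in a and b hold identical bytes. */
    static inline int matching_bytes(uint64_t a, uint64_t b) {
        const uint64_t c1 = 0x8080808080808080ULL; /* MSB of each byte        */
        const uint64_t c2 = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits of each byte */
        const uint64_t c3 = 0x0101010101010101ULL; /* +1 per byte             */
        uint64_t x  = a ^ b;                       /* 0x00 in bytes that match          */
        uint64_t nx = ~x;                          /* negation (XNOR): 0xFF where equal */
        uint64_t m  = (nx & c1) & ((nx & c2) + c3); /* 0x80 in matching bytes, else 0x00 */
        return __builtin_popcountll(m);            /* one set bit per matching byte     */
    }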
Except this is suboptimal, as is the code in the original post: it specifies ASCII, so (e & c1) is guaranteed to be zero. In your code, that means you could just do
return popcount((e & c2) + c3);
(Thanks for writing this out, as otherwise I wouldn't have spotted it.)
The author certainly meant extended ASCII, and therefore you cannot make this change. Sure, it may be technically incorrect, but ASCII is so commonly used to refer to extended ASCII that this should probably be the default interpretation unless it is clear from the context, say a discussion of the history of character encodings, that it is meant in its original sense.
but the real question is, can you write the most optimal version on the spot in under a minute, while your interviewer is babbling and distracting you, and use a binary search?