Every time I see people trying to eliminate branches, I wonder, do we realize that having long pipelines where a branch misprediction stalls the pipeline is not actually a necessary part of architecture?
The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.
This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that you have a branch.
After all, if you think about it, all operations are branches. Down to bitwise operations and summing two numbers. It's branching in the ALU, carry bits and what not. You can't compute anything at all without looking at the state of one or more bits, and making a decision that changes the outcome. But the reason those branches don't hurt performance is that they're not branches on the main pipeline. These local branches have a very short (or non-existent) "pipeline". And the main pipeline is therefore not affected, even though the actual state of the system is.
Is that you Dave? :-) There was a paper[1] comparing superscalar CISC approaches to uniscalar RISC in terms of work done over time versus instructions per clock. I recall telling srk at the time that it was a good treatise on how choosing the metric (IPC vs throughput) can influence one's thinking about what is "good" and what is "bad."
The IPC crowd reasons that if they can get a higher IPC value, then the process guys can crank the clock and everybody wins. The throughput crowd takes the more pragmatic approach that Moore's law is dead and making silicon clock faster melts it, so being clever in the ISA wins the day. Both camps have had successes and setbacks over the last 20 years.
I am really enjoying the RISC-V stuff which is getting back into this level of question with regard to CPU architectures. Also a good place to look if you want to catch up on what "modern" superscalar ideas are being added with support of instruction set flexibility. My guess is that wins the game in the long run.
[1] This was in the early oughts so could have been HotChips, Microprocessor Forum, or HPC workshops.
Transmeta's translation did not somehow eliminate branch costs.
In fact I distinctly remember a comp.arch thread and post from Linus himself while he worked at Transmeta that, paraphrased, was: "the cpu's job is to generate cache misses as fast as possible."
Compulsory misses are a thing. No form of JIT is capable of eliminating them. Capacity misses are inevitable in the real world, even with the monster caches we have now.
Itanium thought it could eliminate branch costs with static analysis. How did that work out?
I really wish programmers would actually read a book about computer architecture before they so confidently conclude that it's simple to build things better than state of the art processors. You are underestimating the scale of smart that has gone into current processors by at least 7 orders of magnitude in my opinion.
Hennessy and Patterson's Computer Architecture: A Quantitative Approach is the definitive textbook. They also have a 2nd textbook Computer Organization and Design. The former is more focused on people who might go after a Computer Engineering degree while the latter is more an exploration of microprocessor architecture relevant to a wide audience of programmers.
Previous editions are available free online and just as good for learning the big picture imo. Actually the older editions might be better for learning microarchitecture specifically because the more recent editions have cut some material on that in order to cover mobile and cloud computing.
It might be stateless, but it depends on many things unknown at compile time.
One of them is the input data being processed. Binary search is exactly that: the compiler doesn't know at which position the result will be found.
Another one is micro-architecture, most notably cache hierarchy, and the composition of EUs inside cores. If you switch to an ISA with instructions resembling the micro-ops of current CPUs, you'll have to re-compile for every micro-architecture. However, this one is technically solvable with a JIT compiler in OSes, like current GPUs where programs ship in byte code formats (DXBC, SPIR-V, NVPTX) and then the user-mode half of the GPU driver recompiles into actual hardware instructions.
Another huge one: other CPU threads are running unknown code. Even if you drop hyper-threading, making cores independent, there are still resources shared across the complete chip: L3 cache, off-chip memory, I/O bandwidth, and electricity, i.e. thermals.
> It might be stateless, but it depends on many things unknown at compile time.
> One of them is the input data being processed. Binary search is exactly that: the compiler doesn't know at which position the result will be found.
Are you and 3cats-in-a-coat talking about the same "it"? I think they're talking about the work that requires such a long pipeline [1], as they've stated that doing more ahead of time would reduce the length of the pipeline and thus the importance of accurate branch prediction. You are talking about...the branch predictor? the application's algorithm being executed (binary search in this case)?
[1] my CPU microarchitectural knowledge isn't that good; I have only the vaguest idea of what that is. Register renaming / data dependency analysis / instruction scheduling, maybe? translation to micro-ops? other things? don't know!
> After all, if you think about it, all operations are branches
Isn't this definition the crux of it? If you redefine everything as a Branch™, even things that are not branches, then you can definitely precompute some Branch™es.
But the sort of branch that branch elimination is talking about is the non-Branch™ kind: an actual branching of the computation route in code due to an if/else statement or similar, is it not?
That would still be useful to do in your world of Branch™es, just only for the Branch™es that try to compute the results of more than one future at the same time (i.e. actual branches).
You could rephrase that as: the pipeline is long because there is a lot of independent work that you can do in a processor. For every operation that can be done independently, you can potentially do that much work at the same time.
And I don't just mean fetch, decode, and execute. If your computer has independent ALU and Shifting units, you can potentially be shifting something as you are also adding. Have a dedicated Adder and a Multiplication unit? No reason you can't try to do both at the same time.
This directly leads to wanting multiple instructions in flight at the same time, which means you have to be able to fetch and decode instructions faster than you are processing them. It also naturally leads to situations where you want to reorder, so that N Add instructions don't keep you from seeing an independent Shift that you could do in the same time.
You can think that things are more complicated than they need to be. And, you may not even be wrong. But a ton of engineering is going into making what we have. If you think you can make something a ton faster by not doing it this way, probably a good thing to dig into how accurate that claim is.
> The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.
You're going down the wrong path, again, as Intel did with Itanium.
We have pipelines because CPUs are performing Tomasulo's algorithm at runtime: there are about 10 pipelines in practice (for Intel systems), and all can run in parallel. A multiply can be issued on pipeline0, pipeline1, or pipeline5.
If there are 3 multiply instructions, the three pipelines (p0, p1, and p5) each get issued a multiply on _THIS_ clock tick.
----------
The analysis must be dynamic because instructions such as "mov" can take a variable amount of time: loading data from L1 cache is 4 clock ticks, L2 cache is 40 clock ticks, L3 cache is ~150 clock ticks, and RAM is 400+ clock ticks.
That means a "lazy" strategy, where you analyze the state of the pipeline "as late as possible" before making a decision, wins out. If you do pre-analysis (ie: what Intel's compiler was supposed to do with Itanium), you pretty much lose out on all variable-time instructions (ex: cache vs RAM).
If you know for certain that you have no memory-operations, then you probably should use a GPU instead of a CPU.
---------
theLoop:
mov ebx, eax[ecx] ; y = array[x]
add ebx, edx ; y += z
add ecx, 4
cmp ecx, max
jnz theLoop
When should you schedule the "add" instruction in the above assembly code? If you have a dynamic system just analyzing your pipelines and "lazily" issuing the instructions, you perform it when it's ready (ie: after the mov instruction is complete, as you need ebx / variable-y to have finished reading from RAM before progressing).
"add ecx, 4" can actually be done right now in parallel. You split ecx into two registers (ecx-old and ecx-new). You issue "mov ebx, eax[ecx-old]", and then you can execute future instructions with ecx-new. If the next loop iteration is branch predicted (as would be the case in binary search, since you almost always loop), the pipeline/branch prediction stuff works together and you can execute mov ebx, eax[ecx0], mov ebx, eax[ecx1], mov ebx, eax[ecx2], mov ebx, eax[ecx3]... all in parallel (!!!!), as ecx3 gets branch predicted and multiple loop iterations start running in parallel.
It's not hard, it's not very difficult analysis, and it's low power. Everyone does this parallel computation now. If you go static / compiled ahead-of-time, you can't do this kind of thing anymore. So you lose out in speed compared to regular CPUs that have OoO analysis going on.
----------
Furthermore, most code with branches can seemingly be replaced with branchless code (ex: cmov instructions, min instruction, max instruction, etc. etc.)
So instead of magic compilers trying to solve an unsolvable problem (how do you schedule memory load/stores to keep the processor busy even though you don't know how long any memory operation takes...)... you write a magic compiler that solves a very easy to solve problem. (IE: convert the branchy-if statement into a branchless max or branchless cmov instruction instead).
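For concreteness, here is a minimal sketch of that kind of rewrite (my own toy example, not from the comment above): the same "clamp to a lower bound" written with a branch and in the form gcc/clang typically lower to a cmov.

// Branchy form: may compile to a conditional jump.
int clamp_branchy(int x, int lo) {
    if (x < lo)
        return lo;
    return x;
}

// Branchless form: selects between two already-computed values; compilers
// usually emit cmov here. Equivalently: return std::max(x, lo);
int clamp_branchless(int x, int lo) {
    return x < lo ? lo : x;
}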
--------
EDIT: I added "theLoop:" to the assembly above, because I realized that "branches" can be beneficial in the face of OoO / Tomasulo's algorithm. "Executing future loops" before the 1st loop is done is very beneficial.
To add to this, the amount of speculation modern CPUs do is insane.
In the "theLoop" example, say the first load misses L1 and L2 cache and needs 40 cycles to load.
AMD's Zen 4 has enough resources to issue about 64 iterations of "theLoop:" all before that first load needs to complete. With the micro op cache, it can probably issue those 64 loop iterations in about 35 cycles.
Zen 4 has enough resources to start all 64 loads early, so they run in parallel.
Who cares if there is a branch mispredict and you actually needed to do only 16 iterations of "theLoop"? In this example, the loop length isn't dependent on any of the loads, so the CPU can actually finish executing the "add ecx, 4; cmp ecx, max; jnz theLoop;" instructions of the first 16 iterations of "theLoop", then detect the mispredict, flush the pipeline and load the correct branch location, all before the 40 cycles it takes for that first load to complete.
------
It's just absolutely insane. No short pipeline without a branch predictor could possibly compete.
Yeah, it's not so much a game of "don't use the branch predictor". It's more a game of "definitely use the branch predictor, but eliminate unnecessary branches so that the branch predictor can be used on more important parts of your code".
I don't think people realize how much the branch predictor + Out-of-Order + superscalar execution build on and combine with each other in modern CPUs. It's done this way for a reason.
I have little experience in hardware related performance optimizations, so excuse me if some of the things you wrote went over my head. But this sentence:
> Everyone does this parallel computation now. If you go static / compiled ahead-of-time, you can't do this kind of thing anymore. So you lose out in speed compared to regular CPUs that have OoO analysis going on.
Are you implying that in JIT-compiled languages I need to care less about making my code branchless, since both the CPU and the JIT compiler will optimize to a branchless version on the fly wherever possible, whereas in AoT-compiled languages such as C++ I lose this edge?
Are JIT compiled languages just better when it comes to utilizing the branch predictor?
JIT is a completely separate issue. I'm not talking about C++ vs Java here. I'm talking about the designs of different assembly languages.
I'm talking about Intel Itanium (ia64) vs AMD x86-64. Or VLIW (compiler-driven parallelism) vs Out-of-Order pipelined processors.
Intel Itanium had "premade bundles". Without getting too far into the weeds, *the compiler* was responsible for discovering parallelism. The compiler then bundled instructions together.
So think of this code:
theLoop:
mov ebx, eax[ecx] ; y = array[x]
add ebx, edx ; y += z
add ecx, 4
cmp ecx, max
jnz theLoop
The Itanium Assembly language would allow the compiler to emit:
mov ebx, eax[ecx] ;
add ebx, edx : add ecx, 4; // This line executed in parallel
cmp ecx, max;
jnz theLoop
The compiler discovers that "add ebx, edx : add ecx, 4" are both independent, and therefore "bundles" them together. Intel Itanium then ran faster and without as much need for a decoder to rediscover this information at runtime.
But look at how few optimizations are available to Itanium or its compiler!! The amount of parallelism in practice (for classic x86 code) was much bigger, especially when you consider Tomasulo's Algorithm.
JITs can do a lot of things, but in practice it's a better mental model to just assume that the equivalent AOT code is faster and the JIT is merely recouping performance lost to the inefficiencies of interpretation or weird semantics (javascript being the implication).
> If only there was a clean fast bare-metal language to write all this in..
The author includes footnotes for "BUT RUST.." and "BUT ZIG..", but how about Nim? Looks like there is a native library implementation of `lowerBound` https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/al...
Yes, it's not a "bare-metal" language, but it compiles to one (or two), so it would be interesting to see what it compiles to here.
How much faster is the branchless version in practice? I make heavy use of binary search in some code I maintain, and I wonder if it would make much of a difference to switch to a more efficient version of the function.
It’s bounded precision, but Rust limits the max size of an object/array to isize’s max[1], not usize’s max. So adding two isize::MAX values using usize will never overflow.
Such an overflow could still be problematic for slices of zero-sized types, which can contain up to usize::MAX elements, since the total size in bytes is always 0. But (left + right) / 2 doesn't actually occur anywhere in the code, only left + size / 2, which clearly can't overflow as long as size is sane.
It's only size that is being divided by 2, which is the size of the segment still under consideration (which is initially the whole array). left is the starting index of segment still under consideration (which is initially 0).
Left and right are both usizes, which are 64-bit (pointer-sized) integers. You will need to work on an array of 2^63 elements before you have to worry about integer overflow issues.
This array would not fit in any kind of memory for the foreseeable future :)
Ah yes, fair point. It makes it a bit more subtle, in that the cases where you have to worry about integer overflows on the pointer addition, are cases where you have an array of 2^((64 or 32) - 1) bools... which seems rather silly to do a binary search on?
No, it uses 2’s complement and is well defined in release mode. From [1]:
> When you’re compiling in release mode with the --release flag, Rust does not include checks for integer overflow that cause panics. Instead, if overflow occurs, Rust performs two’s complement wrapping.
What is the practicality of actually using CompCert? And what does it actually defend against?
All I know is that it tries to ensure that the behavior of the program is the same as the input, meaning that the compiler itself does not have bugs. But how does that actually interact with undefined behavior in the language? What happens to an array out of bounds?
And what are the limitations imposed, since it only works on a subset of the C language?
It seems unlikely that I could just switch compilers and suddenly have a safe program written in C.
It looks like CompCert explicitly does not attempt to prevent UB from resulting in unpredictable behavior; all it guarantees is that "the observable behavior of [the compiled code] C improves on one of the allowed observable behaviors of [the source program] S", where "improving on" allows for arbitrary behavior on UB [0]:
> Third, the compiler is allowed to improve the behavior of the source program. Here, to improve means to convert a run-time error (such as crashing on an integer division by zero) into a more defined behavior. This can happen if the run-time error (e.g. division by zero) was optimized away (e.g. removed because the result of the division is unused). However, if the source program is known to be free of run-time errors, perhaps because it was verified using static analyzers or deductive program provers, improvement as described above never takes place, and the generated code behaves exactly as one of the allowed behaviors of the source program.
So you still have to prove your program's correctness by some other means.
You have to scroll down a bit, but see here[1]. Also searching [site:regehr.org compcert] will find more. CompCert C makes it impossible to reproduce at least some UB bugs:
For CompCert, the high-level correctness theorems I proved are all of the first kind above: if the source program goes wrong, nothing is guaranteed about the behavior of the compiled code. It happens, however, that the stepwise simulation lemmas I build on actually prove property 2 above: the compiled code cannot crash "earlier" than the source code, and will produce the same pre-crash observables, then possibly more. So maybe I should strengthen my high-level correctness theorems...
That was over a decade ago, so perhaps it's been done. I confess I haven't closely followed CompCert in some time, due to having priorities elsewhere. As you can see elsewhere in that reference CompCert C will fail to generate the example UB code.
But yes, mechanically proving additional properties with some kind of SAT solver is also an open area of research. Dafny is the example I’m familiar with. Its conditions and invariants allow things like automatically proved binary search[2]. Converting CompCert C statements to SMTLibv2 assertions (which is, to an extremely rough first approximation what Dafny does) would certainly be considerably easier than it would be for most other languages. It would still be a serious effort of course.
Apart from compiler bugs, this issue with "time-traveling UB" shouldn't really occur in practice with any side effects other than volatile accesses. As noted by Martin Uecker [0],
> In portable C code I/O is performed using function calls of the standard library. A compiler can generally not assume that a function returns to the caller, because the function could call ‘exit’, ‘abort’, ‘longjmp’, or enter an infinite loop. For this reason, it is never allowed for a compiler to use any information derived from an operation with potential UB that comes after a function call in a way that could affect observable behavior before the function call. Consequently, observable behavior that is accessed via function calls (i.e. all except volatile accesses) can not be affected by time-traveling optimizations even when using the abstract interpretation. In principle, a compiler could use special knowledge about C library functions and use the fact those functions always return to the caller, but we are not aware of any compiler which does this (and it hardly seems worthwhile).
I believe LLVM has a "willreturn" attribute expressing this, but as noted, no standard C library messes with any such attributes for functions that cause observable behavior.
So time-traveling UB really only affects programs which use volatile accesses for MMIO, and it might eventually be banned entirely from the C standard if the linked proposal N3128 goes through. I'd say that even "some C bugs" is an overstatement outside of embedded programming, given the rarity of MMIO in a hosted environment.
> I'd say that even "some C bugs" is an overstatement outside of embedded programming, given the rarity of MMIO in a hosted environment.
I don’t concede your claims, but I’m enough out of the loop that I won’t attempt to refute them either. However, for what it’s worth, embedded programming is exactly where I most want that.
Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
> Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
Well, if you ask the question that narrowly, then I suppose not, unless CakeML's GC is lightweight enough to be stuffed onto some board or another.
But 99% of the time (i.e., when I am not doing something intentionally strange to test the language's limits), the correctness of the program I'm writing, or at least of its dependencies (including the standard library), is far more questionable than the correctness of the compiler; after all, the output of, say, GCC and LLVM has been tested by oodles of real-world software. Even if I'm really paranoid, I can disable all optimizations and reenable any less-questionable ones individually.
Thus the GP's complaint, that C is filled to the brim with undefined behavior that will silently cause corruption in production code instead of failing loudly, since all the runtime sanitizers (AFAIK) swear up and down that they aren't very secure.
I do hope that formal verification on the program's side becomes more accessible in the future, though, either via dedicated systems like Dafny or via "gradual verification" as permitted by things like Why3 frontends for different languages. (In particular, I've been meaning to look into the Creusot project for Rust.) But short of formal verification, robust runtime assertions and guardrails are generally the best bet, and C as a language simply doesn't provide very good tools for those.
> Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
You can automatically extract low level code out of certain high level, provable languages. I think some of them might offer the guarantee you asked for (assuming you eg use a certified C compiler for the last leg).
Is that still lower_bound? Maybe I am misreading the code but it looks like this returns any match, not the earliest match (when there are dupes).
It’s common to have multiple matches even in a unique list if the comparison function is say looking for a certain string prefix to do autocomplete, but we want the earliest in the list.
Or, less contentiously - if you know you don't care/don't need to disambiguate dups, look how much efficiency you're losing on a case that isn't important.
The most common argument against optimization is "just use the existing code", but the "existing code" _always_ attempts to handle special edge cases that probably don't apply in your case. We programmers waste a lot of end-user time saving a bit of programmer time.
On the other hand, assuming it's OK to ignore the special edge cases (or not even thinking of them) can come back to bite you (or your users) when eventually one of them does show up in a situation you didn't anticipate.
Anecdotally this happens to me every time I omit an edge case. Now I try to always write a unit test for the special edge case being deliberately omitted. I just wish I could catch everything that way.
I am possibly confused, getting more so the more I read the code. As it is ignoring the sign of the result of compare, I don't even see how it would work at all. The sign is what tells you which half to throw away as you search.
I wish all blog posts started the way his does: "You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:"
Could you explain the metaphor? Are you asking me if I'm a sport fisherman who wants the largest fish, or a commercial fisherman who wants a large haul?
> There weren’t many fish to be had in the small desert lake nearby. That didn’t worry Mr Boone. As any true fisherman knows, catching fish has nothing to do with fishing. The catching is an adjunct, a corollary to the actual art of fishing, which consists of killing time on a small boat as simply as possible, utilizing only the minimal amount of energy necessary to maintain life while simultaneously consuming as much cold beer and snacks as the body will tolerate.
I don't get it. The problem with binary search and branches is not the branches themselves, it's the fact that until you have done the comparison, you don't know which memory location in the array to fetch next. It doesn't matter if you use branches or anything else, the question is what do you want the processor to do?
There is a data dependency: until I read the middle index, I can't tell if I want to search the data in the upper section or the lower section. I can speculate and issue reads to both. That would fix the dependency, but create more memory traffic. Is this the right trade-off?
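As a rough sketch of what "issue reads to both" could look like in software (my code, not from the thread; it uses gcc/clang's __builtin_prefetch and the same loop shape as the article's sb_lower_bound, which is pasted further down in the thread):

#include <cstddef>

// Prefetch the midpoints of both possible next halves before we know which
// half we will descend into: extra memory traffic in exchange for hiding
// some of the load latency.
const int* lower_bound_prefetch(const int* first, const int* last, int value) {
    std::size_t length = last - first;
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        __builtin_prefetch(&first[length / 2]);                // "stay" probe
        __builtin_prefetch(&first[length + rem + length / 2]); // "advance" probe
        if (first[length] < value)
            first += length + rem;
    }
    return first;
}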
Binary search reduces the search space exponentially as it proceeds, so actually quite a lot of the total comparisons can hit L1d cache. (Maybe half of them for a ~250GB dataset.)
Of course, you could keep a cacheable partial index of your huge dataset to accelerate the early part of your search as well.
The numbers are nanoseconds/bsearch on a 100 MB uint32 array. Seems to me that clang (15.0.7) is just much worse at optimizing this particular piece of code than gcc (13.2.1). You can see the assembly here https://godbolt.org/z/cbx5Kdjs6. The gcc assembly looks way cleaner to me.
100 MB is large enough that the branchy version turns out to have a slight advantage, more due to quirks of x86 (speculative execution) than to being better.
Entries are in millions and times in ns per bsearch call. Prefetching makes all the difference, but perhaps not for the right reason. On my machine (Broadwell) the two prefetches that you suggested make gcc emit the cmovb that clang with -cmov uses. The second one is enough to make it prefer cmovb, but not the first one. Maybe a hand-hacked assembly loop based on the code gcc emits but without the prefetches would run even faster.
Does anyone know where the "BUT RUST" link was supposed to lead? It seems to be already out of date due to being unversioned, I can't tell whether it's supposed to lead to the middle of the `starts_with` doc comment or not.
Looking at the archive.org captures just before [1] and just after [2] the article was published, it looks like it was meant to be this line of code, now on line 2779 [3]:
So this is basically a *safe* optimization, since the index will always be valid and there's no need for the compiler to do a check and unwrap, panic.
This is how unsafe {} should be used. Sometimes some things are true but the compiler can't know that. And here the unsafe {} means that it dereferences a raw pointer (the index that we know is valid). If the unsafe {} safety condition is valid, unsafe {} is, well, safe.
Furthermore, it's an optional optimization (you could just copy the code and replace the unsafe access with a safe one, if you're paranoid) and it's not like if you write it in C++ it will be any safer than Rust's unsafe?!
Interesting that the results don't hold up with a more complicated comp comparison function:
> For somewhat realistic scenarios of binary searching with a slower comp() function I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.
> ...
> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.
I believe this happens because branch prediction lets you pipeline multiple simultaneous comparisons, and rewind whenever the branch predictor is wrong (about half the time for truly random data and inputs). The CMOV approach blocks after each comparison function due to the data dependency. On average, you're doing two comparisons at a time with branches, and one with CMOV, so when comparison time is greater than branch prediction penalty, you would expect a crossover to occur.
Switching to an N-way search (for N>2) could help extract the lost parallelism. At the cost of touching more cachelines though, so it is not necessarily a win.
You're also wasting a lot of work doing that. If you go two levels deep, you now do 2 comparisons per iteration (caching the result from the previous level), but you are guaranteed to waste one of those. The CPU with branch predictor is doing about 1.5 comparisons per iteration (worst case) thanks to the 50% mispredict rate, although this depends a lot on how long the comparison takes: a very long comparison function will converge to 1 comparison per iteration. Only if you replicate the speculation for yourself - which will necessarily have to be a lot worse than a CPU branch predictor to be fast and branchless - do you have any hope of an efficiency gain.
You do additional comparisons at each recursion depth, but you have a shallower recursion depth. I think you break even already at 4-way search.
I think that only helps if you assume exactly 1.5 comparisons per level with a branch predictor, and I think that will be hard to hit: you basically need the comparison function to be very short to have that (shorter than the mispredict penalty). Then, you are doing 3 comparisons per two levels vs 3 comparisons for 2 levels. If you were to extend this logic to do 3 levels at a time, you would be doing 7 comparisons to do the same amount of work that the branch predictor does in expected <4.5 comparisons.
> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.
I will say that if this is the case, there are probably much better versions of binary search out there for primitive types. I made one just screwing around with SIMD that's 3x faster than std::lower_bound until becoming memory bound:
I can’t find any assertion in the post about the content of any of the input data sets or search keys, other than that they’re “unpredictable”.
I assume purely random, but if these 8-byte strings were anything but pure information, modern branch predictors can sniff their way to better performance than cmov pretty easily.
>gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it does nothing (tested v10). ↩
I wonder what could possibly be the use of this builtin. Branch prediction varies enough between different processors that it seems unlikely that anything useful could be done with a fine-grained probability estimate.
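For context, this is roughly where the footnote's call sits; the loop is my reconstruction from the sb_lower_bound code pasted elsewhere in this thread, not code from the post. GCC 9 and later have the builtin; as the footnote says, it had no visible effect.

// Hint that the condition is false with probability 0.5, hoping the
// compiler prefers a cmov over a conditional jump.
template <class It, class T, class Compare>
It sb_lower_bound_hinted(It first, It last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (__builtin_expect_with_probability(comp(first[length], value), 0, 0.5))
            first += length + rem;
    }
    return first;
}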
Presumably it’s not about matching the dynamic branch prediction estimate but rather exploiting static (or profile-provided) information to assist other optimizations.
Inlining seems like a natural application: Inlining is perhaps the most critical optimization to get right, and getting it right is a complex balance of frequency of use vs. added code bloat and associated further optimization cost. Having more than one bit of branch-likelihood as input to these decisions could easily be helpful.
Good point. That's the only plausible explanation in this thread, I think. However, if you're at the level of performance tuning where you want to optimize inlining beyond the compiler defaults, I would have thought that it would make more sense to just experiment with forcing certain functions to be inlined or not.
I think it is used for code layout: typically you want to make the most likely path the fall-through, as that can be slightly faster even when the branch is perfectly predicted.
I also tried (and failed) to use expect with probability to generate a cmov, but in retrospect it is obvious: the parameter is not the prediction probability, but the taken/not-taken probability, and even a branch that goes either way with 50% probability can still be predicted perfectly (for example a branch that switches every other iteration), so the parameter can't be used to decide between predication and prediction.
edit: what we really need is an [[unpredictable]] attribute.
At least when I last read it a couple years ago, Intel's optimization manual recommends certain code layouts. For example in the absence of better information a conditional jump backwards would be predicted as true (because that's what loops look like) while a jump forward is seen as less likely to happen (error conditions and else branches look like that).
Not sure how consistent this is across architectures, and how useful it still is. But at least it used to be a thing
There are ARM processors where bits in the branch instructions encode whether they should be predicted as taken or not. So on such architectures such hints can be useful. On x86-64, however, they are afaik completely useless (from an optimization perspective).
Nothing in that link shows how a compiler could make use of a fine-grained probability estimate in a practical way to guide optimizations. I'm perfectly aware of the general concept of branch prediction and the annotations that certain architectures have in their instruction sets.
This is not a valid drop-in replacement for lower_bound. Accessing iterators as front[index] only works for random access iterators, like vector has. The author may have realized this if they had benchmarked on other containers. A forward iterator must be advanced and then dereferenced.
The C++ standard measures the time complexity of lower_bound and binary_search by the number of comparisons performed. So yes, it's possible on e.g., a linked list like std::list (by this definition)
Observe that you can restrict your code to only call std::advance on the iterator. So definitely you could do binary search without random access, and you would actually iterate over all elements, but you just skip the comparison function for the majority of elements.
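For illustration, this is roughly what a lower_bound restricted to std::advance/std::distance looks like (my sketch; it is close to how the standard library implementations handle forward iterators):

#include <iterator>

template <class ForwardIt, class T, class Compare>
ForwardIt lower_bound_fwd(ForwardIt first, ForwardIt last, const T& value,
                          Compare comp) {
    auto length = std::distance(first, last);  // O(n) walk for forward iterators
    while (length > 0) {
        auto half = length / 2;
        ForwardIt middle = first;
        std::advance(middle, half);            // step forward, no operator[]
        if (comp(*middle, value)) {
            first = ++middle;                  // answer is in the upper part
            length -= half + 1;
        } else {
            length = half;                     // answer is in the lower part
        }
    }
    return first;
}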
Good point. I don't think anyone actually uses it for containers that only support forward iteration. But the author did declare the signature with ForwardIterator.
For a proper/sane forward iterator, front[index] is the same thing as *(front + index) and front + index achieves the same as std::advance(front, index) (just not in place).
Or are you pointing out how first[length] and first += length + rem would technically result in advancing a forward iterator through to first + length twice? (Trivially fixed with a middle = first + length temporary variable btw).
Every post like this makes me update my junior engineer watchlist. Then I have to go patiently explain why mucking about in the code to save one instruction because you read a blog post is a horrible idea. Easily half of all silly junior code is pointless optimization.
Oh, C++, the language where templates are extremely powerful but their usability is so poor that code like this is kind of idiomatic and even compiles:
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
ForwardIt first, ForwardIt last, const T& value, Compare comp) {
auto length = last - first;
This looks generic, it compiles by itself, it contains a fancy word “ForwardIt”, and it works if you try it with a vector or an array. But you can’t actually subtract forward iterators, and binary searching a range of forward iterators is quite silly.
Rust would have required using the correct trait. Maybe C++ will improve the situation in the long run.
Pro tip, the footer link https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520 is not a permalink because it doesn't include a commit ID. It looks like it's already linking to the wrong thing now (and the article is only two months old), and I don't know what the right thing was meant to be. I suggest using GitHub permalinks in future, pinned to specific commits.
I took a stab at the same problem a while ago. Since the upper bound of iterations is based on the input length, if you write your search in a way that extra iterations don't change the result, you can use a switch fallthrough to "unroll" the loop and not have to branch.
That particular branch is well-predicted for long arrays. You're probably only going to have a single misprediction for the complete binary search, so the performance consequence of that branch is negligible.
People (and compilers, too) don’t write branchless code just for the sake of it, they do because it helps with performance.
More branchless, more better?
Short answer: no. This section can be skipped but here’s the long answer: For n elements where 2^k <= n < 2^(k+1), sb_lower_bound will always do k+1 comparisons. On a 64-bit machine that means at most 64 iterations of the while (length > 0) loop. So it’s possible to write a “fully branchless” version that doesn’t have the length check by using a switch with intentional fall-through.
size_t length = last - first;
size_t rem;
switch (std::bit_width(length)) {
case 64:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
case 63:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
// ...
case 1:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
}
return first;
If you’re not familiar with switch, think of it as a jump into code. In our case to the exact place from which there are exactly the right number of comparisons left to do.
“Is it any faster?” No. Modern x86 processors handle loop conditions well as they’re predictable; we’re very likely to remain in the loop. And that’s good, especially because it saves us from writing templates or macros, or copy-paste-editing the 64 cases.
(Maybe people will RTFA if I paste it into the comments.)
Because that's a branch that's "always" taken, except for once - which, speed-wise, is very close to always taken, i.e. not a branch. The author counts the branching inside the loop, which is often taken and often not taken.
the article covers that right at the top: “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.
Wouldn't it be faster to do length & 1 instead of length % 2, and length >> 1 instead of length / 2? But maybe the compiler does this behind the scenes?
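For what it's worth (my check, not from the thread): since length is an unsigned size_t, the two spellings are exactly equivalent, and gcc/clang emit the same shr/and instructions for both, which is easy to confirm on godbolt.org.

#include <cstddef>

std::size_t halve_div(std::size_t length)   { return length / 2; }  // shr
std::size_t halve_shift(std::size_t length) { return length >> 1; } // shr
std::size_t parity_mod(std::size_t length)  { return length % 2; }  // and 1
std::size_t parity_and(std::size_t length)  { return length & 1; }  // and 1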
If the compiler doesn't optimize the if(cmp(...)) first += length + rem; perhaps a sign operation could be used when cmp returns a number. Something along the lines of:
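One guess at what that could look like (my reconstruction, not the commenter's code; it assumes a three-way comp() that returns a negative number when the element compares less than value, e.g. memcmp on fixed-size keys):

#include <cstddef>
#include <cstdint>

template <class T, class Compare3>
const T* sign_lower_bound(const T* first, const T* last, const T& value,
                          Compare3 comp) {
    std::size_t length = last - first;
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        std::int32_t c = comp(first[length], value);
        // Sign bit of c: 1 when the element is less than value, else 0.
        std::size_t step = static_cast<std::uint32_t>(c) >> 31;
        first += step * (length + rem);
    }
    return first;
}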
Branches in the comparison function? Doesn't count, as simple functions end up as a conditional move. Branches on the loop back edge? Doesn't count, as there's a branch predictor.
I was hoping for a binary search without branches in it. Not really what we got.
> Same function interface as std::lower_bound, but 2x faster, and shorter. “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.
Assembly programmers did "fastest branchless binary searches" using cmov decades ago and we didn't need to think about the compiler at all. I know, because I was one of them. Optimizing branches away was and still is a nice pastime. This is not new, or original, and I don't understand what's noteworthy about it.
"Elitist", you say? Well ... yeah! At least I know what I'm doing, unlike people who solely rely on and blindly trust the compiler to do their work for them.
lower_bound and upper_bound are typically implemented in terms of partition_point, which is much more general than this version of lower_bound taking an element.