Every time I see people trying to eliminate branches, I wonder, do we realize that having long pipelines where a branch misprediction stalls the pipeline is not actually a necessary part of architecture?
The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.
This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that you have a branch.
After all, if you think about it, all operations are branches. Down to bitwise operations and summing two numbers. It's branching in the ALU, carry bits and what not. You can't compute anything at all without looking at the state of one or more bits, and making a decision that changes the outcome. But the reason those branches don't hurt performance is that they're not branches on the main pipeline. These local branches have a very short (or non-existent) "pipeline". And the main pipeline is therefore not affected, even though the actual state of the system is.
Is that you Dave? :-) There was a paper[1] comparing superscalar CISC approaches to uniscalar RISC in terms of work done over time versus instructions per clock. I recall telling srk at the time that it was a good treatise on how choosing the metric (IPC vs throughput) can influence one's thinking about what is "good" and what is "bad."
The IPC crowd reasons that if they can get a higher IPC value, then the process guys can crank the clock and everybody wins. The throughput crowd takes the more pragmatic approach that Moore's law is dead and making silicon clock faster melts it, so being clever in the ISA wins the day. Both camps have had successes and setbacks over the last 20 years.
I am really enjoying the RISC-V stuff which is getting back into this level of question with regard to CPU architectures. Also a good place to look if you want to catch up on what "modern" superscalar ideas are being added with support of instruction set flexibility. My guess is that wins the game in the long run.
[1] This was in the early oughts so could have been HotChips, Microprocessor Forum, or HPC workshops.
Transmeta's translation did not somehow eliminate branch costs.
In fact I distinctly remember a comp.arch thread and post from Linus himself while he worked at Transmeta that, paraphrased, was: "the cpu's job is to generate cache misses as fast as possible."
Compulsory misses are a thing. No form of JIT is capable of eliminating them. Capacity misses are inevitable in the real world, even with the monster caches we have now.
Itanium thought it could eliminate branch costs with static analysis. How did that work out?
I really wish programmers would actually read a book about computer architecture before they so confidently conclude that it's simple to build things better than state of the art processors. You are underestimating the scale of smart that has gone into current processors by at least 7 orders of magnitude in my opinion.
Hennessy and Patterson's Computer Architecture: A Quantitative Approach is the definitive textbook. They also have a 2nd textbook Computer Organization and Design. The former is more focused on people who might go after a Computer Engineering degree while the latter is more an exploration of microprocessor architecture relevant to a wide audience of programmers.
Previous editions are available free online and just as good for learning the big picture imo. Actually the older editions might be better for learning microarchitecture specifically because the more recent editions have cut some material on that in order to cover mobile and cloud computing.
It might be stateless, but it depends on many things unknown at compile time.
One of them is the input data being processed. Binary search is exactly that: the compiler doesn't know at which position the result will be found.
Another one is micro-architecture, most notably cache hierarchy, and the composition of EUs inside cores. If you switch to an ISA with instructions resembling the micro-ops of current CPUs, you'll have to re-compile for every micro-architecture. However, this one is technically solvable with a JIT compiler in OSes, like current GPUs where programs ship in byte code formats (DXBC, SPIR-V, NVPTX) and then the user-mode half of the GPU driver recompiles into actual hardware instructions.
Another huge one: other CPU threads are running unknown code. Even if you drop hyper-threading, making cores independent, there are still resources shared across the complete chip: L3 cache, off-chip memory, I/O bandwidth, and electricity, i.e. thermals.
> It might be stateless, but it depends on many things unknown at compile time.
> One of them is the input data being processed. Binary search is exactly that: the compiler doesn't know at which position the result will be found.
Are you and 3cats-in-a-coat talking about the same "it"? I think they're talking about the work that requires such a long pipeline [1], as they've stated that doing more ahead of time would reduce the length of the pipeline and thus the importance of accurate branch prediction. You are talking about...the branch predictor? the application's algorithm being executed (binary search in this case)?
[1] my CPU microarchitectural knowledge isn't that good; I have only the vaguest idea of what that is. Register renaming / data dependency analysis / instruction scheduling, maybe? translation to micro-ops? other things? don't know!
> After all, if you think about it, all operations are branches
Isn't this definition the crux of it? If you redefine everything as a Branch™, even things that are not branches, then you can definitely precompute some Branch™es.
But the sort of branch that branch elimination is talking about is the non-Branch™ kind: an actual branching of the computation route in code due to an if/else statement or similar, is it not?
That would still be useful to do in your world of Branch™es, just only for the Branch™es that try to compute the results of more than one future at the same time (i.e. actual branches).
You could rephrase that as: the pipeline is long because there is a lot of independent work that you can do in a processor. For every operation that can be done independently, you can potentially do that much work at the same time.
And I don't just mean fetch, decode, and execute. If your computer has independent ALU and Shifting units, you can potentially be shifting something as you are also adding. Have a dedicated Adder and a Multiplication unit? No reason you can't try to do both at the same time.
This directly leads to wanting multiple instructions in flight at the same time, which means you have to be able to fetch and decode instructions faster than you are processing them. It also naturally leads to situations where you want to reorder, so that N Add instructions don't keep you from seeing an independent Shift that you could do in the same time.
You can think that things are more complicated than they need to be. And, you may not even be wrong. But a ton of engineering is going into making what we have. If you think you can make something a ton faster by not doing it this way, probably a good thing to dig into how accurate that claim is.
> The pipeline is long because we do lots of analysis and translation on the fly, just in time, which could easily be done in most cases ahead of time, as it's not a very stateful algorithm.
You're going down the wrong path, again, as Intel did with Itanium.
We have pipelines because CPUs are performing Tomasulo's algorithm at runtime: there are about 10 pipelines in practice (for Intel systems), and all can run in parallel. A multiply can be issued on pipeline0, pipeline1, or pipeline5.
If there are 3 multiply instructions, the three pipelines (p0, p1, and p5) each get issued a multiply on _THIS_ clock tick.
----------
The analysis must be dynamic because instructions such as "mov" can take a variable amount of time: loading data from L1 cache is 4 clock ticks, L2 cache is 40 clock ticks, L3 cache is ~150 clock ticks, and RAM is 400+ clock ticks.
That means a "lazy" strategy, where you analyze the state of the pipeline "as late as possible" before making a decision, wins out. If you do pre-analysis (ie: what Intel's compiler was supposed to do with Itanium), you pretty much lose out on all variable-time instructions (ex: cache vs RAM).
If you know for certain that you have no memory-operations, then you probably should use a GPU instead of a CPU.
---------
theLoop:
mov ebx, eax[ecx] ; y = array[x]
add ebx, edx ; y += z
add ecx, 4
cmp ecx, max
jnz theLoop
When should you schedule the "add" instruction in the above assembly code? If you have a dynamic system just analyzing your pipelines and "lazily" issuing the instructions, you perform it when it's ready (ie: after the mov instruction is complete, as you need ebx / variable-y to have finished reading from RAM before progressing).
"add ecx, 4" can actually be done right now in parallel. You split ecx into two registers (ecx-old and ecx-new). You issue "mov ebx, eax[ecx-old]", and then you can execute future instructions with ecx-new. If the next loop iteration is branch predicted (as would be the case in binary search, since you almost always loop), the pipeline/branch prediction stuff works together and you can execute mov ebx, eax[ecx0], mov ebx, eax[ecx1], mov ebx, eax[ecx2], mov ebx, eax[ecx3]... all in parallel (!!!!), as ecx3 gets branch predicted and multiple loop iterations start running in parallel.
It's not hard, it's not very difficult analysis, and it's low power. Everyone does this parallel computation now. If you go static / compiled ahead-of-time, you can't do this kind of thing anymore. So you lose out in speed compared to regular CPUs that have OoO analysis going on.
----------
Furthermore, most code with branches can seemingly be replaced with branchless code (ex: cmov instructions, min instruction, max instruction, etc. etc.)
So instead of magic compilers trying to solve an unsolvable problem (how do you schedule memory load/stores to keep the processor busy even though you don't know how long any memory operation takes...)... you write a magic compiler that solves a very easy to solve problem. (IE: convert the branchy-if statement into a branchless max or branchless cmov instruction instead).
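For concreteness, here is a minimal sketch of that kind of rewrite (my own toy example, not from the comment above): the same "clamp to a lower bound" written with a branch and in the form gcc/clang typically lower to a cmov.

// Branchy form: may compile to a conditional jump.
int clamp_branchy(int x, int lo) {
    if (x < lo)
        return lo;
    return x;
}

// Branchless form: selects between two already-computed values; compilers
// usually emit cmov here. Equivalently: return std::max(x, lo);
int clamp_branchless(int x, int lo) {
    return x < lo ? lo : x;
}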
--------
EDIT: I added "theLoop:" to the assembly above, because I realized that "branches" can be beneficial in the face of OoO / Tomasulo's algorithm. "Executing future loops" before the 1st loop is done is very beneficial.
To add to this, the amount of speculation modern CPUs do is insane.
In the "theLoop" example, say the first load misses L1 and L2 cache and needs 40 cycles to load.
AMD's Zen 4 has enough resources to issue about 64 iterations of "theLoop:" all before that first load needs to complete. With the micro op cache, it can probably issue those 64 loop iterations in about 35 cycles.
Zen 4 has enough resources to start all 64 loads early, so they run in parallel.
Who cares if there is a branch mispredict and you actually needed to do only 16 iterations of "theLoop"? In this example, the loop length isn't dependent on any of the loads, so the CPU can actually finish executing the "add ecx, 4; cmp ecx, max; jnz theLoop;" instructions of the first 16 iterations of "theLoop", then detect the mispredict, flush the pipeline and load the correct branch location, all before the 40 cycles it takes for that first load to complete.
------
It's just absolutely insane. No short pipeline without a branch predictor could possibly compete.
Yeah, it's not so much a game of "don't use the branch predictor". It's more a game of "definitely use the branch predictor, but eliminate unnecessary branches so that the branch predictor can be used on more important parts of your code".
I don't think people realize how much the branch predictor + Out-of-Order + superscalar execution build on and combine with each other in modern CPUs. It's done this way for a reason.
I have little experience in hardware related performance optimizations, so excuse me if some of the things you wrote went over my head. But this sentence:
> Everyone does this parallel computation now. If you go static / compiled ahead-of-time, you can't do this kind of thing anymore. So you lose out in speed compared to regular CPUs that have OoO analysis going on.
Are you implying that in JIT-compiled languages I need to care less about making my code branchless, since both the CPU and the JIT compiler will optimize to a branchless version on the fly wherever possible, whereas in AoT-compiled languages such as C++ I lose this edge?
Are JIT compiled languages just better when it comes to utilizing the branch predictor?
JIT is a completely separate issue. I'm not talking about C++ vs Java here. I'm talking about the designs of different assembly languages.
I'm talking about Intel Itanium (ia64) vs AMD x86-64. Or VLIW (compiler-driven parallelism) vs Out-of-Order pipelined processors.
Intel Itanium had "premade bundles". Without getting too far into the weeds, *the compiler* was responsible for discovering parallelism. The compiler then bundled instructions together.
So think of this code:
theLoop:
mov ebx, eax[ecx] ; y = array[x]
add ebx, edx ; y += z
add ecx, 4
cmp ecx, max
jnz theLoop
The Itanium Assembly language would allow the compiler to emit:
mov ebx, eax[ecx] ;
add ebx, edx : add ecx, 4; // This line executed in parallel
cmp ecx, max;
jnz theLoop
The compiler discovers that "add ebx, edx : add ecx, 4" are both independent, and therefore "bundles" them together. Intel Itanium then ran faster and without as much need for a decoder to rediscover this information at runtime.
But look at how few optimizations are available to Itanium or its compiler!! The amount of parallelism in practice (for classic x86 code) was much bigger, especially when you consider Tomasulo's Algorithm.
JITs can do a lot of things, but in practice it's a better mental model to just assume that the equivalent AOT code is faster and the JIT is merely recouping performance lost to the inefficiencies of interpretation or weird semantics (javascript being the implication).
> If only there was a clean fast bare-metal language to write all this in..
The author includes footnotes for "BUT RUST.." and "BUT ZIG..", but how about Nim? Looks like there is a native library implementation of `lowerBound` https://github.com/nim-lang/Nim/blob/version-2-0/lib/pure/al...
Yes, it's not a "bare-metal" language, but it compiles to one (or two), so it would be interesting to see what it compiles to here.
How much faster is the branchless version in practice? I make heavy use of binary search in some code I maintain, and I wonder if it would make much of a difference to switch to a more efficient version of the function.
It’s bounded precision, but Rust limits the max size of an object/array to isize’s max[1], not usize’s max. So adding two isize::MAX values using usize will never overflow.
Such an overflow could still be problematic for slices of zero-sized types, which can contain up to usize::MAX elements, since the total size in bytes is always 0. But (left + right) / 2 doesn't actually occur anywhere in the code, only left + size / 2, which clearly can't overflow as long as size is sane.
It's only size that is being divided by 2, which is the size of the segment still under consideration (which is initially the whole array). left is the starting index of segment still under consideration (which is initially 0).
Left and right are both usizes, which are 64-bit (pointer-sized) integers. You will need to work on an array of 2^63 elements before you have to worry about integer overflow issues.
This array would not fit in any kind of memory for the foreseeable future :)
Ah yes, fair point. It makes it a bit more subtle, in that the cases where you have to worry about integer overflows on the pointer addition, are cases where you have an array of 2^((64 or 32) - 1) bools... which seems rather silly to do a binary search on?
No, it uses 2’s complement and is well defined in release mode. From [1]:
> When you’re compiling in release mode with the --release flag, Rust does not include checks for integer overflow that cause panics. Instead, if overflow occurs, Rust performs two’s complement wrapping.
What is the practicality of actually using CompCert? And what does it actually defend against?
All I know is that it tries to ensure that the behavior of the program is the same as the input, meaning that the compiler itself does not have bugs. But how does that actually interact with undefined behavior in the language? What happens to an array out of bounds?
And what are the limitations imposed, since it only works on a subset of the C language?
It seems unlikely that I could just switch compilers and suddenly have a safe program written in C.
It looks like CompCert explicitly does not attempt to prevent UB from resulting in unpredictable behavior; all it guarantees is that "the observable behavior of [the compiled code] C improves on one of the allowed observable behaviors of [the source program] S", where "improving on" allows for arbitrary behavior on UB [0]:
> Third, the compiler is allowed to improve the behavior of the source program. Here, to improve means to convert a run-time error (such as crashing on an integer division by zero) into a more defined behavior. This can happen if the run-time error (e.g. division by zero) was optimized away (e.g. removed because the result of the division is unused). However, if the source program is known to be free of run-time errors, perhaps because it was verified using static analyzers or deductive program provers, improvement as described above never takes place, and the generated code behaves exactly as one of the allowed behaviors of the source program.
So you still have to prove your program's correctness by some other means.
You have to scroll down a bit, but see here[1]. Also searching [site:regehr.org compcert] will find more. CompCert C makes it impossible to reproduce at least some UB bugs:
For CompCert, the high-level correctness theorems I proved are all of the first kind above: if the source program goes wrong, nothing is guaranteed about the behavior of the compiled code. It happens, however, that the stepwise simulation lemmas I build on actually prove property 2 above: the compiled code cannot crash "earlier" than the source code, and will produce the same pre-crash observables, then possibly more. So maybe I should strengthen my high-level correctness theorems...
That was over a decade ago, so perhaps it's been done. I confess I haven't closely followed CompCert in some time, due to having priorities elsewhere. As you can see elsewhere in that reference CompCert C will fail to generate the example UB code.
But yes, mechanically proving additional properties with some kind of SAT solver is also an open area of research. Dafny is the example I’m familiar with. Its conditions and invariants allow things like automatically proved binary search[2]. Converting CompCert C statements to SMTLibv2 assertions (which is, to an extremely rough first approximation what Dafny does) would certainly be considerably easier than it would be for most other languages. It would still be a serious effort of course.
Apart from compiler bugs, this issue with "time-traveling UB" shouldn't really occur in practice with any side effects other than volatile accesses. As noted by Martin Uecker [0],
> In portable C code I/O is performed using function calls of the standard library. A compiler can generally not assume that a function returns to the caller, because the function could call ‘exit’, ‘abort’, ‘longjmp’, or enter an infinite loop. For this reason, it is never allowed for a compiler to use any information derived from an operation with potential UB that comes after a function call in a way that could affect observable behavior before the function call. Consequently, observable behavior that is accessed via function calls (i.e. all except volatile accesses) can not be affected by time-traveling optimizations even when using the abstract interpretation. In principle, a compiler could use special knowledge about C library functions and use the fact those functions always return to the caller, but we are not aware of any compiler which does this (and it hardly seems worthwhile).
I believe LLVM has a "willreturn" attribute expressing this, but as noted, no standard C library messes with any such attributes for functions that cause observable behavior.
So time-traveling UB really only affects programs which use volatile accesses for MMIO, and it might eventually be banned entirely from the C standard if the linked proposal N3128 goes through. I'd say that even "some C bugs" is an overstatement outside of embedded programming, given the rarity of MMIO in a hosted environment.
> I'd say that even "some C bugs" is an overstatement outside of embedded programming, given the rarity of MMIO in a hosted environment.
I don’t concede your claims, but I’m enough out of the loop that I won’t attempt to refute them either. However, for what it’s worth, embedded programming is exactly where I most want that.
Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
> Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
Well, if you ask the question that narrowly, then I suppose not, unless CakeML's GC is lightweight enough to be stuffed onto some board or another.
But 99% of the time (i.e., when I am not doing something intentionally strange to test the language's limits), the correctness of the program I'm writing, or at least of its dependencies (including the standard library), is far more questionable than the correctness of the compiler; after all, the output of, say, GCC and LLVM has been tested by oodles of real-world software. Even if I'm really paranoid, I can disable all optimizations and reenable any less-questionable ones individually.
Thus the GP's complaint, that C is filled to the brim with undefined behavior that will silently cause corruption in production code instead of failing loudly, since all the runtime sanitizers (AFAIK) swear up and down that they aren't very secure.
I do hope that formal verification on the program's side becomes more accessible in the future, though, either via dedicated systems like Dafny or via "gradual verification" as permitted by things like Why3 frontends for different languages. (In particular, I've been meaning to look into the Creusot project for Rust.) But short of formal verification, robust runtime assertions and guardrails are generally the best bet, and C as a language simply doesn't provide very good tools for those.
> Furthermore, I don’t know of any other popular bare metal language that’s even close to having a compiler offering mechanically provable assertions about its compiled behavior. Do you?
You can automatically extract low level code out of certain high level, provable languages. I think some of them might offer the guarantee you asked for (assuming you eg use a certified C compiler for the last leg).
Is that still lower_bound? Maybe I am misreading the code but it looks like this returns any match, not the earliest match (when there are dupes).
It’s common to have multiple matches even in a unique list if the comparison function is say looking for a certain string prefix to do autocomplete, but we want the earliest in the list.
Or, less contentiously - if you know you don't care/don't need to disambiguate dups, look how much efficiency you're losing on a case that isn't important.
The most common argument against optimization is "just use the existing code", but the "existing code" _always_ attempts to handle special edge cases that probably don't apply in your case. We programmers waste a lot of end-user time saving a bit of programmer time.
On the other hand, assuming it's OK to ignore the special edge cases (or not even thinking of them) can come back to bite you (or your users) when eventually one of them does show up in a situation you didn't anticipate.
Anecdotally this happens to me every time I omit an edge case. Now I try to always write a unit test for the special edge case being deliberately omitted. I just wish I could catch everything that way.
I am possibly confused, getting more so the more I read the code. As it is ignoring the sign of the result of compare, I don't even see how it would work at all. The sign is what tells you which half to throw away as you search.
I wish all blog posts started the way his does: "You’re a busy person so I’ll first jump right to it. Here it is, the fastest general (and simple) binary search C++ implementation:"
Could you explain the metaphor? Are you asking me if I'm a sport fisherman who wants the largest fish, or a commercial fisherman who wants a large haul?
> There weren’t many fish to be had in the small desert lake nearby. That didn’t worry Mr Boone. As any true fisherman knows, catching fish has nothing to do with fishing. The catching is an adjunct, a corollary to the actual art of fishing, which consists of killing time on a small boat as simply as possible, utilizing only the minimal amount of energy necessary to maintain life while simultaneously consuming as much cold beer and snacks as the body will tolerate.
I don't get it. The problem with binary search and branches is not the branches themselves, it's the fact that until you have done the comparison, you don't know which memory location in the array to fetch next. It doesn't matter if you use branches or anything else, the question is what do you want the processor to do?
There is a data dependency: until I read the middle index, I can't tell if I want to search the data in the upper section or the lower section. I can speculate and issue reads to both. That would fix the dependency, but create more memory traffic. Is this the right trade-off?
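As a rough sketch of what "issue reads to both" could look like in software (my code, not from the thread; it uses gcc/clang's __builtin_prefetch and the same loop shape as the article's sb_lower_bound, which is pasted further down in the thread):

#include <cstddef>

// Prefetch the midpoints of both possible next halves before we know which
// half we will descend into: extra memory traffic in exchange for hiding
// some of the load latency.
const int* lower_bound_prefetch(const int* first, const int* last, int value) {
    std::size_t length = last - first;
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        __builtin_prefetch(&first[length / 2]);                // "stay" probe
        __builtin_prefetch(&first[length + rem + length / 2]); // "advance" probe
        if (first[length] < value)
            first += length + rem;
    }
    return first;
}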
Binary search reduces the search space exponentially as it proceeds, so actually quite a lot of the total comparisons can hit L1d cache. (Maybe half of them for a ~250GB dataset.)
Of course, you could keep a cacheable partial index of your huge dataset to accelerate the early part of your search as well.
The numbers are nanoseconds/bsearch on a 100 MB uint32 array. Seems to me that clang (15.0.7) is just much worse at optimizing this particular piece of code than gcc (13.2.1). You can see the assembly here https://godbolt.org/z/cbx5Kdjs6. The gcc assembly looks way cleaner to me.
100 MB is large enough that the branchy version turns out to have a slight advantage, more due to quirks of x86 (speculative execution) than to being better.
Entries are in millions and times in ns per bsearch call. Prefetching makes all the difference, but perhaps not for the right reason. On my machine (Broadwell) the two prefetches that you suggested make gcc emit the cmovb that clang with -cmov uses. The second one is enough to make it prefer cmovb, but not the first one. Maybe a hand-hacked assembly loop based on the code gcc emits but without the prefetches would run even faster.
Does anyone know where the "BUT RUST" link was supposed to lead? It seems to be already out of date due to being unversioned, I can't tell whether it's supposed to lead to the middle of the `starts_with` doc comment or not.
Looking at the archive.org captures just before [1] and just after [2] the article was published, it looks like it was meant to be this line of code, now on line 2779 [3]:
So this is basically a *safe* optimization, since the index will always be valid and there's no need for the compiler to do a check and unwrap, panic.
This is how unsafe {} should be used. Sometimes some things are true but the compiler can't know that. And here the unsafe {} means that it dereferences a raw pointer (the index that we know is valid). If the unsafe {} safety condition is valid, unsafe {} is, well, safe.
Furthermore, it's an optional optimization (you could just copy the code and replace the unsafe access with a safe one, if you're paranoid) and it's not like if you write it in C++ it will be any safer than Rust's unsafe?!
Interesting that the results don't hold up with a more complicated comp comparison function:
> For somewhat realistic scenarios of binary searching with a slower comp() function I’ve thought of searching through ids, phone numbers, accounts and keywords. I’ve thus settled on testing searching 8-byte strings.
> ...
> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.
I believe this happens because branch prediction lets you pipeline multiple simultaneous comparisons, and rewind whenever the branch predictor is wrong (about half the time for truly random data and inputs). The CMOV approach blocks after each comparison function due to the data dependency. On average, you're doing two comparisons at a time with branches, and one with CMOV, so when comparison time is greater than branch prediction penalty, you would expect a crossover to occur.
Switching to an N-way search (for N>2) could help extract the lost parallelism. At the cost of touching more cachelines though, so it is not necessarily a win.
You're also wasting a lot of work doing that. If you go two levels deep, you now do 2 comparisons per iteration (caching the result from the previous level), but you are guaranteed to waste one of those. The CPU with branch predictor is doing about 1.5 comparisons per iteration (worst case) thanks to the 50% mispredict rate, although this depends a lot on how long the comparison takes: a very long comparison function will converge to 1 comparison per iteration. Only if you replicate the speculation for yourself - which will necessarily have to be a lot worse than a CPU branch predictor to be fast and branchless - do you have any hope of an efficiency gain.
You do additional comparisons at each recursion depth, but you have a shallower recursion depth. I think you break even already at 4-way search.
I think that only helps if you assume exactly 1.5 comparisons per level with a branch predictor, and I think that will be hard to hit: you basically need the comparison function to be very short to have that (shorter than the mispredict penalty). Then, you are doing 3 comparisons per two levels vs 3 comparisons for 2 levels. If you were to extend this logic to do 3 levels at a time, you would be doing 7 comparisons to do the same amount of work that the branch predictor does in expected <4.5 comparisons.
> In this case std::lower_bound is very slightly but consistently faster than sb_lower_bound. To always get the best performance it is possible for libraries to use sb_lower_bound whenever directly working on primitive types and std::lower_bound otherwise.
I will say that if this is the case, there are probably much better versions of binary search out there for primitive types. I made one just screwing around with SIMD that's 3x faster than std::lower_bound until becoming memory bound:
I can’t find any assertion in the post about the content of any of the input data sets or search keys, other than that they’re “unpredictable”.
I assume purely random, but if these 8-byte strings were anything but pure information, modern branch predictors can sniff their way to better performance than cmov pretty easily.
>gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it does nothing (tested v10). ↩
I wonder what could possibly be the use of this builtin. Branch prediction varies enough between different processors that it seems unlikely that anything useful could be done with a fine-grained probability estimate.
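For context, this is roughly where the footnote's call sits; the loop is my reconstruction from the sb_lower_bound code pasted elsewhere in this thread, not code from the post. GCC 9 and later have the builtin; as the footnote says, it had no visible effect.

// Hint that the condition is false with probability 0.5, hoping the
// compiler prefers a cmov over a conditional jump.
template <class It, class T, class Compare>
It sb_lower_bound_hinted(It first, It last, const T& value, Compare comp) {
    auto length = last - first;
    while (length > 0) {
        auto rem = length % 2;
        length /= 2;
        if (__builtin_expect_with_probability(comp(first[length], value), 0, 0.5))
            first += length + rem;
    }
    return first;
}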
Presumably it’s not about matching the dynamic branch prediction estimate but rather exploiting static (or profile-provided) information to assist other optimizations.
Inlining seems like a natural application: Inlining is perhaps the most critical optimization to get right, and getting it right is a complex balance of frequency of use vs. added code bloat and associated further optimization cost. Having more than one bit of branch-likelihood as input to these decisions could easily be helpful.
Good point. That's the only plausible explanation in this thread, I think. However, if you're at the level of performance tuning where you want to optimize inlining beyond the compiler defaults, I would have thought that it would make more sense to just experiment with forcing certain functions to be inlined or not.
I think it is used for code layout: typically you want to make the most likely path the fall-through, as that can be slightly faster even when the branch is perfectly predicted.
I also tried (and failed) to use expect with probability to generate a cmov, but in retrospect it is obvious: the parameter is not the prediction probability, but the taken/not-taken probability, and even a branch that goes either way with 50% probability can still be predicted perfectly (for example a branch that switches every other iteration), so the parameter can't be used to decide between predication and prediction.
edit: what we really need is an [[unpredictable]] attribute.
At least when I last read it a couple years ago, Intel's optimization manual recommends certain code layouts. For example in the absence of better information a conditional jump backwards would be predicted as true (because that's what loops look like) while a jump forward is seen as less likely to happen (error conditions and else branches look like that).
Not sure how consistent this is across architectures, and how useful it still is. But at least it used to be a thing
There are ARM processors where bits in the branch instructions encode whether they should be predicted as taken or not. So on such architectures such hints can be useful. On x86-64, however, they are afaik completely useless (from an optimization perspective).
Nothing in that link shows how a compiler could make use of a fine-grained probability estimate in a practical way to guide optimizations. I'm perfectly aware of the general concept of branch prediction and the annotations that certain architectures have in their instruction sets.
This is not a valid drop-in replacement for lower_bound. Accessing iterators as front[index] only works for random access iterators, like vector has. The author may have realized this if they had benchmarked on other containers. A forward iterator must be advanced and then dereferenced.
The C++ standard measures the time complexity of lower_bound and binary_search by the number of comparisons performed. So yes, it's possible on e.g., a linked list like std::list (by this definition)
Observe that you can restrict your code to only call std::advance on the iterator. So definitely you could do binary search without random access, and you would actually iterate over all elements, but you just skip the comparison function for the majority of elements.
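For illustration, this is roughly what a lower_bound restricted to std::advance/std::distance looks like (my sketch; it is close to how the standard library implementations handle forward iterators):

#include <iterator>

template <class ForwardIt, class T, class Compare>
ForwardIt lower_bound_fwd(ForwardIt first, ForwardIt last, const T& value,
                          Compare comp) {
    auto length = std::distance(first, last);  // O(n) walk for forward iterators
    while (length > 0) {
        auto half = length / 2;
        ForwardIt middle = first;
        std::advance(middle, half);            // step forward, no operator[]
        if (comp(*middle, value)) {
            first = ++middle;                  // answer is in the upper part
            length -= half + 1;
        } else {
            length = half;                     // answer is in the lower part
        }
    }
    return first;
}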
Good point. I don't think anyone actually uses it for containers that only support forward iteration. But the author did declare the signature with ForwardIterator.
For a proper/sane forward iterator, front[index] is the same thing as *(front + index) and front + index achieves the same as std::advance(front, index) (just not in place).
Or are you pointing out how first[length] and first += length + rem would technically result in advancing a forward iterator through to first + length twice? (Trivially fixed with a middle = first + length temporary variable btw).
Every post like this makes me update my junior engineer watchlist. Then I have to go patiently explain why mucking about in the code to save one instruction because you read a blog post is a horrible idea. Easily half of all silly junior code is pointless optimization.
Oh, C++, the language where templates are extremely powerful but their usability is so poor that code like this is kind of idiomatic and even compiles:
template <class ForwardIt, class T, class Compare>
constexpr ForwardIt sb_lower_bound(
ForwardIt first, ForwardIt last, const T& value, Compare comp) {
auto length = last - first;
This looks generic, it compiles by itself, it contains a fancy word “ForwardIt”, and it works if you try it with a vector or an array. But you can’t actually subtract forward iterators, and binary searching a range of forward iterators is quite silly.
Rust would have required using the correct trait. Maybe C++ will improve the situation in the long run.
Pro tip, the footer link https://doc.rust-lang.org/src/core/slice/mod.rs.html#2520 is not a permalink because it doesn't include a commit ID. It looks like it's already linking to the wrong thing now (and the article is only two months old), and I don't know what the right thing was meant to be. I suggest using GitHub permalinks in future, pinned to specific commits.
I took a stab at the same problem a while ago. Since the upper bound of iterations is based on the input length, if you write your search in a way that extra iterations don't change the result, you can use a switch fallthrough to "unroll" the loop and not have to branch.
That particular branch is well-predicted for long arrays. You're probably only going to have a single misprediction for the complete binary search, so the performance consequence of that branch is negligible.
People (and compilers, too) don’t write branchless code just for the sake of it, they do because it helps with performance.
More branchless, more better?
Short answer: no. This section can be skipped but here’s the long answer: For n elements where 2^k <= n < 2^(k+1), sb_lower_bound will always do k+1 comparisons. On a 64-bit machine that means at most 64 iterations of the while (length > 0) loop. So it’s possible to write a “fully branchless” version that doesn’t have the length check by using a switch with intentional fall-through.
size_t length = last - first;
size_t rem;
switch (std::bit_width(length)) {
case 64:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
case 63:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
// ...
case 1:
    rem = length % 2;
    length /= 2;
    first += comp(first[length], value) * (length + rem);
}
return first;
If you’re not familiar with switch, think of it as a jump into code. In our case to the exact place from which there are exactly the right number of comparisons left to do.
“Is it any faster?” No. Modern x86 processors handle loop conditions well as they’re predictable; we’re very likely to remain in the loop. And that’s good, especially because it saves us from writing templates or macros, or copy-paste-editing the 64 cases.
(Maybe people will RTFA if I paste it into the comments.)
Because that's a branch that's "always" taken, except for once - which, speed-wise, is very close to always taken, i.e. not a branch. The author counts the branching inside the loop, which is often taken and often not taken.
the article covers that right at the top: “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.
Wouldn't it be faster to do length & 1 instead of length % 2, and length >> 1 instead of length / 2? But maybe the compiler does this behind the scenes?
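For what it's worth (my check, not from the thread): since length is an unsigned size_t, the two spellings are exactly equivalent, and gcc/clang emit the same shr/and instructions for both, which is easy to confirm on godbolt.org.

#include <cstddef>

std::size_t halve_div(std::size_t length)   { return length / 2; }  // shr
std::size_t halve_shift(std::size_t length) { return length >> 1; } // shr
std::size_t parity_mod(std::size_t length)  { return length % 2; }  // and 1
std::size_t parity_and(std::size_t length)  { return length & 1; }  // and 1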
If the compiler doesn't optimize the if(cmp(...)) first += length + rem; perhaps a sign operation could be used when cmp returns a number. Something along the lines of:
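One guess at what that could look like (my reconstruction, not the commenter's code; it assumes a three-way comp() that returns a negative number when the element compares less than value, e.g. memcmp on fixed-size keys):

#include <cstddef>
#include <cstdint>

template <class T, class Compare3>
const T* sign_lower_bound(const T* first, const T* last, const T& value,
                          Compare3 comp) {
    std::size_t length = last - first;
    while (length > 0) {
        std::size_t rem = length % 2;
        length /= 2;
        std::int32_t c = comp(first[length], value);
        // Sign bit of c: 1 when the element is less than value, else 0.
        std::size_t step = static_cast<std::uint32_t>(c) >> 31;
        first += step * (length + rem);
    }
    return first;
}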
Branches in the comparison function? Doesn't count, as simple functions end up as a conditional move. Branches on the loop back edge? Doesn't count, as there's a branch predictor.
I was hoping for a binary search without branches in it. Not really what we got.
> Same function interface as std::lower_bound, but 2x faster, and shorter. “branchless” because the if compiles down to a conditional move instruction rather than a branch/conditional jump.
Assembly programmers did "fastest branchless binary searches" using cmov decades ago and we didn't need to think about the compiler at all. I know, because I was one of them. Optimizing branches away was and still is a nice pastime. This is not new, or original, and I don't understand what's noteworthy about it.
"Elitist", you say? Well ... yeah! At least I know what I'm doing, unlike people who solely rely on and blindly trust the compiler to do their work for them.
lower_bound and upper_bound are typically implemented in terms of partition_point, which is much more general than this version of lower_bound taking an element.