RISC-V Conditional Moves

kouteiheika · 2025-10-03T02:42:45 1759459365

Yeah, RISC-V has the `Zicond` extension, but it's not a "proper" conditional move in the traditional sense. For the usual situations where the compiler would use a single conditional move on any other ISA it now needs multiple instructions to get around the fact that `Zicond` will set the destination register to zero if the conditional move isn't made. This totally sucks for performance if you don't have a "sufficiently advanced magic core" which can macro-fuse `Zicond` instruction sequences.
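
To make the "multiple instructions" point concrete, here is a minimal C++ model of the two Zicond instructions and of a general select built from them. The helper names are made up for illustration; only the `czero.eqz`/`czero.nez` semantics come from the extension itself.

```cpp
#include <cstdint>

// czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1
constexpr uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) { return rs2 == 0 ? 0 : rs1; }

// czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1
constexpr uint64_t czero_nez(uint64_t rs1, uint64_t rs2) { return rs2 != 0 ? 0 : rs1; }

// General "cond ? a : b" needs three instructions with Zicond:
//   czero.eqz t0, a, cond   // t0 = cond ? a : 0
//   czero.nez t1, b, cond   // t1 = cond ? 0 : b
//   or        rd, t0, t1
// (select_zicond is a hypothetical name, not an instruction.)
constexpr uint64_t select_zicond(uint64_t cond, uint64_t a, uint64_t b) {
    return czero_eqz(a, cond) | czero_nez(b, cond);
}
```

A traditional cmov would do the same select in one instruction, at the cost of reading three registers (condition, new value, old destination).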

That said, RISC-V does have a proper conditional move instruction. And the funny part: it has multiple! `xtheadcondmov` and `xmipscmove` both implement "real" conditional moves. The catch is that those are vendor-specific extensions; contrary to the official narrative that "it doesn't fit the design of RISC-V", apparently the actual hardware vendors see the value of adding real cmovs to their hardware. I wonder how many more vendor-specific extensions it will take before a common cross-vendor extension is standardized, if ever?

(And yes, I'm perfectly aware of why `Zicond` was designed the way it was. I don't really want to get into a discussion whether that's the right design or not long-term.)

camel-cdr · 2025-10-03T06:50:52 1759474252

> For the usual situations where the compiler would use a single conditional move on any other ISA it now needs multiple instructions

Only if you need the full properties of cmove. In many cases it just generates a single Zicond.
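
As a rough sketch of the cases where a single Zicond instruction is enough (the function names are hypothetical, and what a given compiler actually emits is not guaranteed):

```cpp
#include <cstdint>

// Model of czero.eqz: yields 0 when cond == 0, otherwise value.
constexpr int64_t czero_eqz_i64(int64_t value, int64_t cond) { return cond == 0 ? 0 : value; }

// "cond ? x : 0" maps onto exactly one czero.eqz.
constexpr int64_t value_or_zero(int64_t cond, int64_t x) { return czero_eqz_i64(x, cond); }

// "cond ? a + b : a" is czero.eqz + add: zero out b when the condition
// is false, then add unconditionally. A cmov-based version would also
// need two instructions (add, then cmov), so nothing is lost here.
constexpr int64_t conditional_add(int64_t cond, int64_t a, int64_t b) {
    return a + czero_eqz_i64(b, cond);
}
```

The same trick works for any operation with an identity element (add, sub, or, xor, shifts), which covers a lot of the conditional code that shows up in practice.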

While some companies implement a 3R1W integer pipeline and use fusion, others keep the integer side 2R1W. If you use 2R1W you can get wider issue for the same area: if you have a four-issue integer pipeline, you may be able to add a fifth integer execution unit for cheaper than moving it to 3R1W, which may give you a higher performance gain.

dzaima · 2025-10-03T21:38:20 1759527500

"3R1W integer pipeline" is kinda ambiguous; I think it'd be extremely-stupid for any core to have all their ALUs be 3R. Much more sane is having ~half be such (if even that), and the rest at 2R.

Or, better yet, have the 3R extra port come from some of the 2R being split up; e.g. for a block of 3×2R1W ALUs, be able to split one up for its read ports, reusing it as 2×3R1W when needed, thereby being able to do 3R1W at 66% the throughput of 2R1W without any extra register ports (i.e. 1.3x throughput benefit of 3R1W over two 2R1W instrs). Probably has some extra costs from scheduling & co needing to handle 3R though.

chc4 · 2025-10-03T06:09:54 1759471794

"sufficiently advanced magic core" is a fairly funny term when a lot of other RISC-V behavior basically assumes any real processor will have a macro-op-fusing frontend for specific series of instructions, and even provides recommendations for what fusions cores should implement.

chasil · 2025-10-03T03:08:46 1759460926

When you say "xmips" was there a difference of technique between MIPS and ARM?

kouteiheika · 2025-10-03T06:34:03 1759473243

The `xmipscmove` is just the name of the extension. The 'mips' here means MIPS-the-company, and not MIPS-the-ISA. It's supported by the MIPS P8700 CPU which, counterintuitively, is a RISC-V CPU and not MIPS (it's named "MIPS" because the company which designed it is called "MIPS", not because it uses the MIPS architecture).

wren6991 · 2025-10-02T22:47:54 1759445274

Surprised an article published on September 28, 2025 does not include any mention of the P (packed integer SIMD) extension.

P adds instructions like integer multiply-accumulate, which have a third register read (for rd). So, they're taking the opportunity to add a few forms of 3-register select instructions:

   MVM     Move Masked
           for each bit i:  X(rd)[i] = X(rs2)[i] ? X(rs1)[i] : X(rd)[i]

   MVMN    Move Masked Not
           for each bit i:  X(rd)[i] = X(rs2)[i] ? X(rd)[i]  : X(rs1)[i]

   MERGE   Merge
           for each bit i:  X(rd)[i] = X(rd)[i]  ? X(rs2)[i] : X(rs1)[i]

Actually I say I'm surprised but given the way the spec is currently spread around different parts of the internet, it's easy to miss if you're not following the mailing lists!
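
The three per-bit definitions above reduce to ordinary bitwise expressions. A minimal C++ model, assuming 32-bit registers and taking the old `rd` value as an explicit argument (the function signatures are for illustration only, not the draft spec's notation):

```cpp
#include <cstdint>

// MVM: take rs1 bits where the rs2 mask is 1, keep old rd bits elsewhere.
constexpr uint32_t mvm(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (rs1 & rs2) | (rd & ~rs2);
}

// MVMN: keep old rd bits where the rs2 mask is 1, take rs1 bits elsewhere.
constexpr uint32_t mvmn(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (rd & rs2) | (rs1 & ~rs2);
}

// MERGE: old rd is the mask; take rs2 bits where it is 1, rs1 bits elsewhere.
constexpr uint32_t merge(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (rs2 & rd) | (rs1 & ~rd);
}
```

With an all-ones or all-zeros mask, `mvm` behaves like a full-register conditional move, which is why these are relevant to the cmov discussion.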

sylware · 2025-09-29T13:08:55 1759151335

This is implemented with instruction fusion. Just need to document properly and publish properly what will end up "standard instruction fusion patterns" (like the div/rem one).

Adding more instructions is kind of non productive for a R(educed)ISC ISA. It has to be weighed with extreme care. Compressed instructions went thru for the sake of code density (marketing vs arm thumb instructions).

In the end, programs will want probably to stay conservative and will implement only the core ISA, at best giving some love to some instruction fusion patterns and that's it, unless being built knowingly for a specific risc-v hardware implementation.

mort96 · 2025-10-02T22:03:07 1759442587

> In the end, programs will want probably to stay conservative and will implement only the core ISA

This is probably not the case. The core ISA doesn't include floating point, it doesn't include integer multiply or divide, it doesn't include atomic and fence instructions.

What has happened is that most compilers and programs for "normal desktop/laptop/server/phone class systems" all have some baseline set of extensions. Today, this is more or less what we call the "G" extension collection (which is short-hand for IMAFD_Zicsr_Zifencei). Though what we consider "baseline" in "normal systems" will obviously evolve over time (just like how SSE is considered a part of "baseline amd64" these days but was once a new and exotic extension).

Then lower power use cases like MCUs will have fewer instructions. There will be lots of MCUs without stuff like hardware floating point support that won't run binaries compiled for the G extension collection. In MCU use cases, you typically know at the time of compiling exactly what MCU your code will be running on, so passing the right flags to the compiler to make sure it generates only the supported instructions is not an issue.
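
For the "pass the right flags" case, code can check at compile time which extensions the compiler was told to target (for example via `-march=rv32i` versus `-march=rv64gc`). A small sketch, assuming the `__riscv_<ext>` test macros from the RISC-V C API conventions; the function name is made up, and on a non-RISC-V host it simply reports that:

```cpp
#include <string>

// Lists the extensions visible through predefined macros at compile time.
// This reflects the -march the code was built for, not what the CPU
// executing it actually supports.
inline std::string targeted_extensions() {
#if !defined(__riscv)
    return "not RISC-V";
#else
    std::string s = "RISC-V:";
#if defined(__riscv_m)
    s += " M";       // integer multiply/divide
#endif
#if defined(__riscv_a)
    s += " A";       // atomics
#endif
#if defined(__riscv_f)
    s += " F";       // single-precision floating point
#endif
#if defined(__riscv_d)
    s += " D";       // double-precision floating point
#endif
#if defined(__riscv_c)
    s += " C";       // compressed instructions
#endif
#if defined(__riscv_zicond)
    s += " Zicond";  // conditional zero
#endif
    return s;
#endif
}
```

This only covers the build-time side; runtime detection is a separate, OS-specific mechanism.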

And then HPC use cases will probably assume more exotic extensions.

And normal "desktop/phone/laptop/server" style use cases will have runtime detection of things like vector instructions in some situations, just like in aarch64/amd64.

int_19h · 2025-10-03T08:56:31 1759481791

Was there ever a time when SSE was not a part of baseline amd64? Just going off the dates, SSE showed up in Pentium 3, and if I remember correctly AMD picked it up in 32-bit Athlons already.

mort96 · 2025-10-03T10:50:36 1759488636

I think you're right. I should've said x86 (or maybe IA-32?), not amd64.

dzaima · 2025-10-04T13:51:03 1759585863

gcc, via -m32, still defaults to no SSE, i.e. pre-1999, i.e. ≥26 years and counting for increasing default.

clang, as far back as Compiler Explorer goes (i.e. clang 3.0.0, i.e. 2011), always assumes SSE for -m32; presumably because there's nothing to be backwards-compatible to, unlike gcc.

Doesn't look particularly good for "default will just change at some point", though we can hope.

panick21_ · 2025-10-02T23:47:54 1759448874

It's not known as "G". The standards targeted by the software ecosystem are RVA20, RVA22, RVA23.

https://riscv.org/ecosystem-news/2025/04/risc-v-rva23-a-majo...

mort96 · 2025-10-03T09:57:48 1759485468

Thanks, seems I'm out of date (or just wrong). G is indeed IMAFD_Zicsr_Zifencei and I've always viewed it as a "reasonable baseline for most normal code", I wasn't up to date on the RVA/B/C stuff.

panick21_ · 2025-10-04T17:24:59 1759598699

Generally RV64GC was the original target and was renamed and is basically RVA20.

sylware · 2025-10-03T09:50:51 1759485051

What??

Ofc, if your program uses floating point calculations you will want to use the hardware machine instructions for that.

Here, we were talking about all those machine instructions which do not bring much more on top of the core ISA. Those would be implemented using fusion, appropriate for R(educed)ISC silicon. The trade-off is code density, and code density on modern silicon, probably in very specific niches, but there, program machine instructions would be generated (BTW, probably written instead of generated for those niches...) with those very specific niches in mind.

And RISC-V hardware implementations, with proper publishing of most common, and pertinent, machine instruction fusion patterns, will be able to "improve" step by step, targeting what they actually run and what would make real difference. Sure, this will require a bit of coordination to agree on machine instruction fusion patterns.

mort96 · 2025-10-03T09:54:22 1759485262

You said "programs will want probably to stay conservative and will implement only the core ISA". I'm saying that the core ISA is very very limited and most programs will want to use more than the core ISA.

sylware · 2025-10-03T10:01:36 1759485696

What???

Re-read my post, please.

The problem is those machine instructions not bringing much more than the core ISA which do not require an ISA extension.

mort96 · 2025-10-03T10:35:56 1759487756

Integer multiply requires an ISA extension. The core ISA does not have integer multiply.

sylware · 2025-10-03T12:55:48 1759496148

All right, now this is ridiculous.

Stop using AIs and/or trolling, thx.

mort96 · 2025-10-03T13:00:24 1759496424

I genuinely do not understand what part of my comments you take issue with. You said that programs will assume the core RISC-V ISA. I said that no, most programs will assume the existence of some extensions, including integer multiply/divide and floating point.

There are two possibilities here:

* Either I'm misunderstanding what you're saying, and you did not mean that most programs will use only the core ISA.

* Or you're trying to say that integer multiply/divide and floating point is part of the core ISA.

Which one is it?

If it's the first one, could you try to clarify? Because I can't see another way to interpret the phrase "programs will want probably to stay conservative and will implement only the core ISA".

cestith · 2025-10-03T14:28:58 1759501738

Okay, I’m neither party in this back and forth and I don’t know either of you. I have an idea what the misunderstanding might be, but I could be entirely wrong.

I think sylware doesn’t mean the core ISA exactly, but the core with the standard extensions rather than manufacturer-specific extensions.

sylware · 2025-10-04T09:28:58 1759570138

It is sort of obvious and 101: with a heavy technical context "not spoken explicitly", LLMs fail hard and they end up trolling. Usually they completely miss the point, several times in a row, like here.

Let's start over for microsoft GPT-6.

It all depends on the program: if it does not need more than a conservative use of the ISA to run at a reasonable speed on targeted hardware, it should not use anything else. Those people tend to forget that large implementations of RISC-V will probably be heavy on machine instruction fusion.

In the end, adding 'new machine instructions' is only to be thought about after proper machine instruction fusion investigation.

They are jumping the gun way too easily on 'adding new machine instructions', forgetting completely about machine instruction fusion.

dzaima · 2025-10-04T13:56:49 1759586209

There's not much sign that RISC-V will be extremely-fusion-focused; indeed it'd be good for the base ISA, but Zba, Zbb, Zicond add a bunch of common patterns as distinct instructions, and things often fused on other architectures (compare + branch) are a single instruction in even the base RV64I. That largely leaves fusing constant computation as a fusable thing, and.. that's kinda it, to achieve what current x86 & ARM cores do. (there's then of course doing crazier things like fusing multiple bitwise/arith ops together, but at that point having a too-minimal base ISA comes back to bite you again, meaning that some should-be-cheap fusions would actually need to fuse ≥3 instrs instead of just two)

In any case, "force hardware to do an extremely-stupid amount of fusion to get back the performance lost from intentionally not adding/using useful instructions" isn't a sane thing to target in any universe no matter one's goals; you're just wasting silicon & hardware development time that would be better spent actually doing useful things. Fusion is neat (esp. for fixing past mistakes or working around fixed-size instructions (i.e. all x86 & ARM use fusion for, but a from-scratch designed ISA with variable-length instrs (e.g. RISC-V) should need neither)), but it's still very unquestionably strictly worse than just having more actual instructions and using them.

sylware · 2025-10-05T11:26:31 1759663591

There is a rationale for (compare + branch) in one instruction, if I recall properly: no status flags register, which makes out-of-order CPU design much easier, and more.

Again, the bulk of the programs out there don't need those extensions to be reasonably performant on modern silicon hardware. In other words, all programs out there will want to stick to a conservative usage of the ISA anyway ("core-ish").

Programs requiring floating point hardware in order to be "usable" will mandate probably a cache line vector ISA extension silicon block (they won't even use the FPU ISA extension). Who would even use a FPU silicon block nowadays for floating point calculations (unless niche and small hardware implementation)?

(x86 and arm are out: they have strong IP locks in many places in the world, they are not to be considered for any sane future. Those are just legacy burden and full of "marketing" instructions)

dzaima · 2025-10-05T13:21:53 1759670513

Avoiding flags is indeed a decision backed by reason; but to do so, you don't necessarily need to have `beq a0, a1, label`, you can just do `xor t0, a0, a1; beqz t0, label`. Having full `beq` instead of just `beqz` is exactly as unnecessary as `sh3add` from Zba, except some mild difference in frequency of those, depending on codebase. Having just beqz would even have the benefit that the label could be 17-bit instead of 12-bit!
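
The equivalence being relied on here is just that two registers are equal exactly when their XOR is zero. A tiny C++ model (the function names are invented for illustration):

```cpp
#include <cstdint>

// Models `beq a0, a1, label`: the branch is taken when the registers are equal.
constexpr bool beq_taken(uint64_t a0, uint64_t a1) { return a0 == a1; }

// Models the two-instruction replacement:
//   xor  t0, a0, a1
//   beqz t0, label
constexpr bool xor_beqz_taken(uint64_t a0, uint64_t a1) {
    const uint64_t t0 = a0 ^ a1;  // zero exactly when every bit matches
    return t0 == 0;
}
```

The extra offset bits mentioned above would come from the 5-bit `rs2` field that a `beqz`-only encoding would no longer need.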

Indeed, most sane software doesn't need most extensions to be "reasonably performant"; in fact, most sane software is reasonably-performant even on two decades old hardware!

But, unfortunately, there's a ton of software doing things quite inefficiently, and it will continue to exist forever unless something crazy happens like a non-insignificant amount of humans starting to care (impossible) or LLMs becoming functional enough to rewrite entire codebases (more possible than humans caring, at least).

You're extremely-heavily underestimating software doing random garbage in floating point (using it to compute a square root or multiplying an integer by 0.4 or something; ad-hoc game logic/physics that isn't written in a vectorizable way; doing a bunch of things where integers would do in FP (esp. languages which expose floats as the main datatype, esp. JavaScript))

It may be neat to dream about a hypothetical world where none of that garbage exists, but that dream isn't coming true today, nor is there any sign that it will at any point in the future. Basing architecture/compiler/configuration decisions around this hypothetical is just purely entirely stupid.

And even in that dream world a lot of code would benefit from sh1add/sh2add/sh3add from Zba, Zbb's min/max is useful in a ton of places, memory managers might want clz for computing bucket from size, anything doing bitwise stuff would benefit from andn and much of Zbs. And of course ideally the vast majority of code would be running in RVV instead of scalar code.
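
Two of those idioms, sketched in portable C++. The `sh3add` and `clz` semantics are those of the Zba/Zbb instructions; the size-class scheme in `bucket_for_size` is a hypothetical allocator design chosen only to show where `clz` comes in.

```cpp
#include <cstdint>

// sh3add rd, rs1, rs2: rd = rs2 + (rs1 << 3).
// Typical use: address of element `index` in an array of 8-byte items.
constexpr uint64_t sh3add(uint64_t index, uint64_t base) { return base + (index << 3); }

// Portable stand-in for Zbb's clz on a 64-bit register (returns 64 for 0).
constexpr unsigned clz64(uint64_t x) {
    unsigned n = 0;
    for (uint64_t bit = uint64_t{1} << 63; bit != 0 && (x & bit) == 0; bit >>= 1) ++n;
    return n;
}

// Hypothetical power-of-two size classes: bucket k serves requests of up
// to 2^k bytes. With a hardware clz this is a subtract, a clz, and another
// subtract instead of a loop.
constexpr unsigned bucket_for_size(uint64_t size) {
    return size <= 1 ? 0 : 64 - clz64(size - 1);
}
```

Without Zba, the `sh3add` line costs a shift plus an add; without Zbb, the `clz` costs a loop or a handful of shift-and-compare steps.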

sylware · 2025-10-05T14:48:54 1759675734

If what I read was right, the flags decision was made because RISC-V designers knew it is an awful pain for large out-of-order implementations. It is basically feedback from experience. I guess there are many more advantages than only that.

It seems to be also why the core ISA has only 32-bit instructions: because smaller instructions hardly bring anything, based on the same feedback from experience. Maybe only on super small ultra tiny embedded micro-controllers with a very old silicon process... This smells more like aggressive marketing using super niche or broken programs to justify itself.

Of course, there is no perfect REDUCED ISA: trade-offs were made based on the designers' experience. Expect arm people to press hard on the bad side of those trade-offs (a trade-off has good sides and bad sides, by definition), because risc-v is a death sentence for them (and they are making a push right now on HN, I can tell you...). Yep, arm and x86_64 have strong IP locks all around the globe... RISC-V, none, free for all to implement.

Nowadays, programs requiring floating point hardware acceleration for reasonable performance use vector machine instructions. I think this is a mistake of RVA2x: the FPU extension should not be there. Only cache line size vector machine instructions should be there. The FPU extension would be for niche/specialized/small hardware. And a scalar is a vector with one used dimension... and the "synchronous"/"inline" handling of floating point operations... yummy.

I have a lot of doubts on compressed instructions, because I don't see code density being that much of a killer feature (it sounds more like arm marketing to me), and I recall reading numbers going in this way and not the other way for the general case.

What I am very sure of: nobody wants to design a clean and modern ISA to handle the "bad" programs, come on, and in the worst case scenario that will fit only some "bads" not all of them anyway, choices will have to be made on the "accelerated bads"... sane? nope.

All that said, I am coding RISC-V and x86_64 assembly, and did a little bit of arm64: for the code I wrote, arm and risc-v were nearly the same.

What I am keeping an eye on is the memory reservation/Zacas stuff though. Because hart (and io) synchronization in a world of (hart read/write queues) and cache memory coherency seems to be critical for "normal" performance and very quickly.

And another thing people tend to forget: RISC-V is standard across vendors/implementors, namely it is appropriate and reasonable to write fast code path variants in assembly... and that could change A LOT of things, well at least in the "system/kernel area" (extremely hard to do planned obsolescence is a killer feature...).

dzaima · 2025-10-05T15:26:50 1759678010

> Nowadays, programs requiring floating point hardware acceleration for reasonable performance use vector machine instructions.

While some very-important software like video codecs, and various sporadic projects where some drive-by open-source dev decided to add an optimized path will use vector, that's, like, on the order of 0.001% (number out of my ass) of all software that runs slowly enough to be noticeable, and the remaining 99.999% remains slow. Much as I like working with SIMD/vector, it's a very tiny minority of people that do.

RVV does also actually make good use of scalar FP, with .vf instruction variants which take one operand from the scalar registers, allowing storing constants in the scalar registers instead of wasting vector registers (and with LMUL it's very easy to exhaust the entire RVV register file). Especially important in matmul kernels.

> What I am very sure of: nobody wants to design a clean and modern ISA to handle the "bad" programs

And yet that's what RVA23 & co basically have to be, and are. And as such they have scalar FP, vector FP, and basically everything else that's potentially useful (other than 3-source-operand instructions).

I do wonder how much compressed actually benefits perf-wise, but it's very clearly true that, at least icache-wise, reducing code size by, say, 20%, is equivalent to adding 20% more icache; and 20% of a typical L1 icache is quite a lot of area to save.

sylware · 2025-10-07T23:49:51 1759880991

dav1d is AV1 decoding with C code just for posture: nearly everything is assembly using vector machine instructions from arm64 to x86_64 avxNNN, and of course risc-v.

I don't even mention ffmpeg.

RVAXX looks like more a grab bag to match x86_64 and aarch64, feature wise, and it includes bad features: this is very probably to ease porting only.

In a risc-v world, there would be much more assembly of code path variants (cross-vendor neutral-ish), and high level languages with assembly written interpreters.

No more C42+ , only ultra-stable-in-time core-ish ISA assembly... and Big Tech hates that because planned obsolescence is excruciatingly harder to do.

dzaima · 2025-10-10T13:23:30 1760102610

dav1d is still in the "some very-important software" group; the vast vast majority of software doesn't and will not write everything in assembly. If you think RISC-V is gonna in any way change that, you're.. just trivially simply plain wrong, 99.999% of people will not bother learning assembly regardless of how good of an idea you think that'd be. (never mind that even if more people learned assembly, 99.999999% of said assembly would be full of system-killing bugs)

RVA23 doesn't in any way ease porting, no clue what you're on about there; besides vectorizable code, where you necessarily do simply just need RVV or similar to get good performance, base RV64G does just cover everything needed for software to be able to run. All RVA23 does is just provide a baseline with extensions to be able to achieve good performance, and cheap instructions that software should've been already utilizing for decades but hasn't generally been able to due to legacy hardware not supporting them (importantly clz/ctz/cpop, but also to a smaller extent min/max, zicond, bit rotates).

Granted, RVA23 does have some more questionable (though still actually useful) inclusions like the cache block ones that force 64-byte cache lines to not be dysfunctional, but that also brings up the massive wart of mandatory 4K pages in base RISC-V, not even RVA23, that's explicitly chosen for ease-of-portability.

Maybe RV64 is the end-of-line for CPUs, but, for all we know, RV64 today might be what Intel 8086 was in 1978, and RV64 is to be extended and grown and eventually just replaced in the future.

mort96 · 2025-10-04T13:05:10 1759583110

To be clear, I am not using and have never used language models or other forms of "AI" in writing online comments. Not that you'll believe me, but that's the truth.

In an effort to show that I'm sincere and that this topic genuinely interests me, let me show you my RISC-V CPU implemented in Logisim: https://github.com/mortie/rv32i-logisim-cpu. For this project, I did actually only implement (most of) the core ISA; so in order to run C programs compiled with clang, I actually had to tell clang to generate code for the core RV32I. That means integer multiplication and division in the C source code was turned into loops which used addition, subtraction, shifts and branches to implement multiplication and division.
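
For readers who have not seen what multiplication and division look like without the M extension, here is a minimal sketch of the classic shift-and-add and shift-and-subtract loops. This is an illustration of the technique, not the actual libgcc or compiler-rt routines, and the function names are made up.

```cpp
#include <cstdint>

// Shift-and-add multiplication: only add, shift, and branch are needed.
// The result wraps modulo 2^32, like a hardware `mul` on RV32.
constexpr uint32_t mul_shift_add(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1) result += a;  // add the current multiple of a if this bit of b is set
        a <<= 1;
        b >>= 1;
    }
    return result;
}

// Restoring (shift-and-subtract) unsigned division.
// Assumption: the caller guarantees d != 0.
constexpr uint32_t udiv_shift_sub(uint32_t n, uint32_t d) {
    uint32_t q = 0;
    uint32_t r = 0;
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1);  // bring down the next bit of the dividend
        if (r >= d) {
            r -= d;
            q |= uint32_t{1} << i;
        }
    }
    return q;
}
```

Each operation takes up to 32 loop iterations, which is the cost that a single `mul` or `divu` instruction removes.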

> It all depends on the program: if it does not need more than a conservative use of the ISA to run at a reasonable speed on targeted hardware, it should not use anything else.

Essentially all programs will benefit significantly from at the very least integer multiply and divide. And every single CPU that's even capable of running anything like a mainstream "phone/laptop/desktop/server class" operating system has the integer multiply and divide extension.

So to say that most programs will use the core ISA and not extensions is wild. Only a tiny minority of executables compiled for the absolute tiniest of RISC-V MCUs (or, y'know, my own Logisim RV32I CPU) will be compiled for the core RISC-V ISA.

sylware · 2025-10-05T10:40:19 1759660819

You are still ignoring what I say.

Stop using AI, thx.

mort96 · 2025-10-05T11:52:20 1759665140

No, you're the one ignoring what I say. I asked a very clear question in a good-faith attempt to clear up confusion. You ignored it.

Honestly you're acting like an LLM instructed to produce antagonistic, bad-faith arguments. You're certainly not acting like a human who has any idea what he's talking about.

sylware · 2025-10-05T13:42:15 1759671735

Well, stop missing the point from light years away like LLMs each time there is a strong non-explicit technical context.

mort96 · 2025-10-05T14:50:54 1759675854

I gave you ample opportunity to make yourself clear. I will give you one more. Please answer the question this time, or don't bother responding at all.

* Either I'm misunderstanding what you're saying, and you did not mean that most programs will use only the core ISA.

* Or you're trying to say that integer multiply/divide and floating point is part of the core ISA.

Which one is it?

sylware · 2025-10-05T15:07:43 1759676863

It seems microsoft GPT oX still using its bullet points output, still completely missing the point without explicit technical context (here the technical context is heavy and implicit).

mort96 · 2025-10-05T15:27:19 1759678039

Okay, I give up. I have given you plenty of chances. You're stuck in a loop in your dialog tree. This conversation is over, and I will not comment further.

sylware · 2025-10-06T11:49:34 1759751374

Please, be my guest.

Pet_Ant · 2025-10-02T19:16:15 1759432575

Compressed instructions are also for microcontroller use. RISC-V, rightly or wrongly, is trying to be an ISA that can handle the whole stack from embedded microcontrollers to a top-end server.

As such, there are compromises for both aims.

sylware · 2025-10-03T09:54:05 1759485245

"sweet spot"

vardump · 2025-10-02T19:26:20 1759433180

Instruction fusion still means lower code density. You can go overboard, but the newer ARM instruction set(s) are pretty good.

duskwuff · 2025-10-02T21:39:56 1759441196

As an aside: it's only relevant on microcontrollers nowadays, but ARM T32 (Thumb) code density is really good. Most instructions are 2 bytes, and it's got some clever ways to represent commonly used 32-bit values in 12 bits:

https://developer.arm.com/documentation/ddi0403/d/Applicatio...

wren6991 · 2025-10-03T10:35:05 1759487705

RISC-V code density is pretty good these days with Zcmp (push, pop, compressed double move) and Zcb (compressed mul, sign/zero-extend, byte load/store). There is also Zcmt but it's kind of cursed. Hopefully density will keep improving once mainstream compilers have full support for Zilsd/Zclsd (load/store pair for RV32).

T32 is a pretty good encoding but far from perfect. If they had the chance to redo it I doubt they would spend a full 1/32nd of the encoding space on asrs, for example.

Findecanor · 2025-10-02T19:53:41 1759434821

Not necessarily lower density. On ARM you would often need cmp and csel, which are two instructions, eight bytes.

RISC-V has cmp-and-branch in a single instruction, which with c.mv normally makes six bytes. If the cmp-and-branch instruction tests one of x8..x15 against zero then that could also be a compressed instruction: making four bytes in total.

astrange · 2025-10-02T20:27:22 1759436842

ARMv8.7 added some new instructions for int min/max to replace cmp+csel. (I'm surprised it took them so long to add popcnt.)

https://www.corsix.org/content/arm-cssc

sylware · 2025-10-03T09:36:12 1759484172

Compressed instructions only matter for niches (and even in such niches, nowadays, I guess it is very probably very questionable); here you would not use compressed instructions, just the right instruction pattern for fusion, like div/rem.

sylware · 2025-10-03T09:32:54 1759483974

RISC-V instructions are pretty good, without any IP lock like ARM or x86_64.

mshockwave · 2025-10-03T05:38:16 1759469896

> In the end, programs will want probably to stay conservative and will implement only the core ISA

Unlikely; as pointed out in sibling comments, the core ISA is too limited. What might prevail is profiles, specifically profiles for application processors like RVA22U64 and RVA23U64, of which the latter makes a lot more sense IMHO.

sylware · 2025-10-04T09:37:48 1759570668

Come on, what was to be understood is to 'stick to the core ISA' as much as possible.

I had to clarify the obvious: if a program does not need more than a conservative usage of the ISA to run at reasonable speed, no hardcore change to the hardware should be investigated.

Additionally, the 'adding new machine instructions' fan boys tend to forget about machine instruction fusion (they probably want their names in the extension specifications) which has to be investigated first, and often in such niche cases, it may be not the CPU to think about, but specialized ASIC blocks and/or FPGA.

wren6991 · 2025-10-02T22:56:19 1759445779

> publish properly what will end up "standard instruction fusion patterns" (like the div/rem one).

The div/rem one is odd because I saw it suggested in the ISA manual, but I have yet to ever see that pattern crop up in compiled code. Usually it's just in library functions like C stdlib `div()` which returns a quotient and remainder, but why on earth are you calling that library function on a processor that has a divide instruction?

cpgxiii · 2025-10-03T01:44:31 1759455871

> but why on earth are you calling that library function on a processor that has a divide instruction?

Because they rightfully expect that div() compiles down to the fastest div/rem idiom for the target hardware. Mainstream compilers go to great lengths to optimize calls to the core C math functions.

wren6991 · 2025-10-03T10:27:19 1759487239

You still have the overhead of a function call. If you just use / % operators then you'll get a call inserted to the libgcc or compiler-rt routine if you don't have the M extension, and those routines are div or mod only. Using stdlib for integer division seems like an odd choice.

If stdlib div() were promoted to a builtin one day (it currently is not in GCC afaict), and its implementation were inlined, then the compiler would recognise the common case of one side of the struct being dead, and you'd still end up with a single div/rem instruction.

cpgxiii · 2025-10-03T20:05:57 1759521957

Interesting, this is a case where GCC and Clang are "dumb" and MSVC does a better job. For code

  #include <cstdlib>
  #include <cstdint>
  #include <utility>
  
  std::pair<int64_t, int64_t> LibDivWithRemainder(int64_t numerator, int64_t denominator) {
    const auto res = std::div(numerator, denominator);
    return std::make_pair(res.quot, res.rem);
  }
  
  std::pair<int64_t, int64_t> ManDivWithRemainder(int64_t numerator, int64_t denominator) {
    const int64_t quot = numerator / denominator;
    const int64_t rem = numerator % denominator;
    return std::make_pair(quot, rem);
  }

GCC (x86-64 trunk @ -O2) produces

  "LibDivWithRemainder(long, long)":
    sub     rsp, 8
    call    "ldiv"
    add     rsp, 8
    ret
  "ManDivWithRemainder(long, long)":
    mov     rax, rdi
    cqo
    idiv    rsi
    ret

Clang (x86-64 @ -O2) produces

  LibDivWithRemainder(long, long):
    jmp     ldiv@PLT
  ManDivWithRemainder(long, long):
    mov     rax, rdi
    mov     rcx, rdi
    or      rcx, rsi
    shr     rcx, 32
    je      .LBB1_1
    cqo
    idiv    rsi
    ret
  .LBB1_1:
    xor     edx, edx
    div     esi
    ret

while MSVC (x64 @ /O2) produces

  mov     rax, rdx
  cqo
  idiv    r8
  mov     QWORD PTR [rcx], rax
  mov     rax, rcx
  mov     QWORD PTR [rcx+8], rdx
  ret     0

for both

monocasa · 2025-10-02T23:45:51 1759448751

One piece I would find interesting to see data on, but don't really know how to get meaningful information on against modern, non-academic cores, is the fact that reorder buffers I've seen aren't just arrays of single instructions (and dependency metadata), but instead look more like rows of small traces of a few alu/ldst/etc instructions, and one control flow instruction per row. It kind of ends up looking a lot like a modern CISC microcode word with a triad/quad/etc of operations and a sequencing op (but in this case the sequencing op is for the main store, not the microprogram store). That means in some cases, you have to fill the ROB entries with NOPs (that to be fair don't actually hit the backend) in order to account for a control flow op in an inopportune place.

The conventional wisdom is that conditional moves mainly benefit in-order pipelines, but I feel like there could be a benefit to increased ROB residency on OoOE cores as well with the right architecture.

But like I said, I don't have a good way to prove that or not.

phire · 2025-10-03T02:44:09 1759459449

The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency; control-flow dependencies are way more expensive, as they might lead to a mispredict and pipeline flush.

But to understand modern Massively-Out-of-Order cores, you really need to get in the mindset of "The branch predictor is stupidly accurate", and actually optimise for the cases when it was right.

If the data is anything other than completely random, the branch predictor will guess correctly (at least some of the time) and the dependency is now invisible to the backend. The dependency chain is broken and the execution units can execute both segments in parallel.
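
A minimal sketch of the two shapes being compared, written in C++ for illustration. The function names are hypothetical, and a compiler is free to turn either form into either a branch or a conditional move; the point is only where the dependency on `cond` sits in each source form.

```cpp
#include <cstdint>

// Branchy form: if this is compiled to a conditional branch and the branch is
// predicted correctly, users of the result can start before `cond` is known.
std::int64_t select_branchy(bool cond, std::int64_t a, std::int64_t b) {
  if (cond) {
    return a;
  }
  return b;
}

// Branch-free form: the result is computed from cond, a and b together, so
// anything that consumes it has a data dependency on all three inputs.
std::int64_t select_branchless(bool cond, std::int64_t a, std::int64_t b) {
  // mask is all ones when cond is true, and zero otherwise.
  const std::uint64_t mask = std::uint64_t{0} - static_cast<std::uint64_t>(cond);
  const std::uint64_t bits = (static_cast<std::uint64_t>(a) & mask) |
                             (static_cast<std::uint64_t>(b) & ~mask);
  return static_cast<std::int64_t>(bits);
}
```

Both functions return the same value for every input; which one runs faster depends on how predictable `cond` is and on what the compiler actually emits.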

--------

So while more CMOVs might help with ROB residency, I'm really not sure that would translate to overall improved performance.

But this does make me wonder if it might be worthwhile designing a μarch that could dynamically switch between executing a CMOV-style instruction as a branch or as a conditional move. If the CMOV is predictable, insert a fake branch into the branch predictor and handle it that way from now on.

imtringued · 2025-10-03T10:51:25 1759488685

>The main argument against using CMOV on OoO cores is that it creates data dependencies. The backend can't start executing downstream dependent μops until after the cmov executes.

That doesn't sound like a very well thought out argument. The moment you are conditional with respect to two independent conditions, you can run both conditional moves in parallel.

>At first glance, it might seem insane to replace a simple data dependency with a control-flow dependency; control-flow dependencies are way more expensive, as they might lead to a mispredict and pipeline flush.

The moment you have N parallel branches such as from unrolling a data parallel loop, you have a combinatorial explosion of 2^N possible paths to take. You have to successfully predict through all of them and you certainly can't execute them in parallel anymore.

Also, you're saying there is a mispredict and a pipeline flush, but those are concepts that relate to prefetching instructions and are completely irrelevant to conditional moves that do not change the instruction pointer. If you have nothing to execute, because you're waiting for a dependent instruction that is currently executing, then you're stalling the pipeline, which is equivalent to executing a NOP instruction. It's a waste of a cycle (not really, because you're waiting for a good reason), but it can't be more expensive than that.

adgjlsfhk1 · 2025-10-03T03:52:11 1759463531

Doing the translation dynamically would be really interesting. Compilers (with the partial exception of PGO builds) do a really bad job of figuring out whether to go branchless or branched, and programmers only get it right ~half the time.

yvdriess · 2025-10-03T09:11:48 1759482708

Totally agree. I have experienced 'ideal' circumstances of 33% taken/untaken branches where you will be hard pressed to make cmov perform better on real life workloads. Pass along other data inputs that do predict better and your cmov becomes a liability.

It's pretty hard to make modern compilers reliably emit cmovs in my experience. I had to resort to inline asm.

phkahler · 2025-10-03T02:46:47 1759459607

I've always thought that instead of compare-and-branch, they should have just made it compare, or a better name would be "if": if r1<r2, execute the next instruction. This should have worked like a 16-bit prefix to whatever instructions are supported. RISC-V would initially have only supported it on jump, jalr, and branch. Then, as they realized the importance of conditional instructions, they could have just changed the spec to allow "if" to be combined with load, store, add, etc...

IMHO this approach seems to fit modern CPU designs reasonably well. There is no explicit flag or predicate register, but it does require fusing 2 instructions with possibly different operands. But restricting which instructions can use it might help (even better if it's completely orthogonal).

phire · 2025-10-03T02:55:17 1759460117

This is what ARM's Thumb-2 has with its various If-Then-Else instructions. One instruction can skip up to four subsequent instructions if the condition fails.

It can also do else clauses, instructions that get executed only when the condition fails.

I'm not sure how well this approach would work on modern CPUs; these days, Thumb-2 is generally only used on small microprocessors, and it's notable that ARM64 didn't carry that feature forwards.

phkahler · 2025-10-08T19:33:12 1759951992

I think the else part of that seems excessive. You don't want predication floating around beyond the next instruction IMHO.

brucehoult · 2025-09-29T20:36:48 1759178208

> some SiFive cores implement exactly this fusion.

I was not able to open the given link, but it's not true, at least for the U74.

Fusion means that one or more instructions are converted to one internal instruction (µop).

SiFive's optimisation [1] of a short forward conditional branch over exactly one instruction has both instructions executing as normal, the branch in pipe A and the other instruction simultaneously in pipe B. At the final stage if the branch turns out to be taken then it is not in fact physically taken, but is instead implemented by suppressing the register write-back of the 2nd instruction.

There is only a limited set of instructions that can be the 2nd instruction in this optimisation, and loads and stores do not qualify. Only simple register-register or register-immediate ALU operations are allowed, including `lui` and `auipc` as well as C aliases such as `c.mv` and `c.li`.

> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right. I wish to cast doubt on this validity: it is true that the two instruction sequences compute the same thing, but details of the RISC-V memory consistency model mean that the two sequences are very much not equivalent, and therefore a core cannot blindly turn one into the other.

The presented code ...

      mv rd, x0
      beq rs2, x0, skip_next
      mv rd, rs1
    skip_next:

... vs ...

    czero.eqz rd, rs1, rs2

... requires that not only rd != rs2 (as stated) but also that rd != rs1. A better implementation is ...

      mv rd, rs1 // safe even if they are the same register
      bne rs2, x0, skip
      mv rd, x0
    skip:

The RISC-V memory consistency model does not come into it, because there are no loads or stores.
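
The register-aliasing point above can be checked with a small model. This is a sketch under stated assumptions, not anything from the article or from a real core: `Regs`, `czero_eqz`, `zero_first`, and `move_first` are hypothetical names, registers are modelled as a plain array, and `rd` is assumed to be in the range 1..31 (never `x0`).

```cpp
#include <array>
#include <cstdint>

// Hypothetical register file model: x[0]..x[31], with x[0] left at zero.
using Regs = std::array<std::int64_t, 32>;

// Reference semantics of `czero.eqz rd, rs1, rs2`:
// rd = (rs2 == 0) ? 0 : rs1, with both sources read before rd is written.
void czero_eqz(Regs& x, int rd, int rs1, int rs2) {
  const std::int64_t result = (x[rs2] == 0) ? 0 : x[rs1];
  x[rd] = result;
}

// mv rd, x0 ; beq rs2, x0, skip ; mv rd, rs1 ; skip:
void zero_first(Regs& x, int rd, int rs1, int rs2) {
  x[rd] = 0;                 // clobbers rs1 or rs2 if either aliases rd
  if (x[rs2] == 0) return;   // beq rs2, x0, skip
  x[rd] = x[rs1];            // mv rd, rs1
}

// mv rd, rs1 ; bne rs2, x0, skip ; mv rd, x0 ; skip:
void move_first(Regs& x, int rd, int rs1, int rs2) {
  x[rd] = x[rs1];            // a no-op when rd and rs1 are the same register
  if (x[rs2] != 0) return;   // bne rs2, x0, skip
  x[rd] = 0;                 // mv rd, x0
}
```

With `rd` and `rs1` both set to register 5 (holding 7) and a nonzero condition in register 6, `czero_eqz` and `move_first` leave 7 in register 5, while `zero_first` leaves 0. With `rd` aliasing `rs2` instead, `move_first` also diverges from `czero_eqz`, so the `rd != rs2` restriction still applies to it.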

Then switching to code involving loads and stores is completely irrelevant:

      lw x1, 0(x2)
      bne x1, x0, next
    next:
      sw x3, 0(x4)

First of all, this code is completely crazy because the `bne` is a fancy kind of `nop` and a core could convert it to a canonical `nop` (or simply drop it).

Even putting the `sw` between the `bne` and the label is ludicrous. There is no branch-free code that does the same thing -- not only in RISC-V but also in arm64 or amd64. SiFive's optimisation will not trigger with a store in that position.

[1] SiFive materials consistently describe it as an optimisation, not as fusion, e.g. in the description of the chicken bits CSR in the U74 core complex manual.

sxzygz · 2025-09-30T03:36:13 1759203373

Thanks for your input. I didn’t know what to make of the article.

brucehoult · 2025-10-01T01:06:39 1759280799

Having taken a second look, this article does in fact have a point, but it is actually nothing at all to do with conditional moves in the RISC-V instruction set Zicond extension -- or amd64 or arm64 style conditional moves either, if they were added at some point.

It is not even about RISC-V but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86. I'm not as familiar with the AArch64 memory model, but I think this probably also applies to it.

The point here is that if an aggressive implementation wants to implement instruction fusion that removes conditional branches (or indirect branches) to make a branch-free µop -- for example, to turn a conditional branch over a move into something similar to the `czero` instruction -- then in order to maintain memory ordering AS SEEN BY A DIFFERENT CORE the fused µop has to also have `fence r,w` properties.

That is all.

It is irrelevant to this whether the actual RISC-V instruction set has a conditional move instruction, or the properties it has if it exists.

It is irrelevant to the situation where a human programmer or a compiler might choose to transform branchy code into branch-free code. They have a more global view of the program and can make sure things make sense. A CPU core implementing fusion has only a local view.

Finally, I'll note that instruction fusion is at present hypothetical in RISC-V processors that you can buy today while it has been used in both x86 and Arm chips for a long time.

Intel's "Core" µarch had fusion of e.g. `cmp;bCC` sequences in 2006, while AMD added it with Bulldozer in 2011. Arm introduced a limited capability -- `CMP r0, #0; BEQ label` is given as an example -- in A53 in 2012 and A57, A72 etc expanded the generality.

Upcoming RISC-V cores from companies such as Ventana and Tenstorrent are believed to implement instruction fusion for some cases.

Just for completeness, I'll again repeat that SiFive's U74 optimises execution of a conditional branch and a following simple ALU instruction that execute simultaneously in two pipelines, but this is NOT fusion into a single µop.

phire · 2025-10-03T02:04:40 1759457080

> but about instruction fusion in general in any ISA with a memory model at least as strong as RVWMO -- which includes x86

No... It's kind of an artefact of RISC-V's memory model being weak. x86 side-steps the issue because it insists that stores always occur in program order, allowing it to fuse away conditional branches without issue.

(Note: the actual hardware implementation of x86 cpus issues the stores anyway, and then rewinds if it later detects a memory ordering violation)

RISC-V ran into this corner case because it wanted the best of both worlds: a weak memory model, but still with strong ordering across branches.

Looks like ARM avoided this issue because its memory model is weaker: branches don't force any ordering, which means the ARM compiler might need to insert a few extra memory barrier instructions.

---------

TBH, I don't think this fusing instructions edge case is a big deal. For smaller RISC-V cores, you aren't reordering memory operations in the first place.

And for larger RISC-V cores, you already need a complex mechanism for dealing with store order violations, so you just throw your fused instruction at it. Your core already needs to deal with sync points that aren't proper branches, because non-taken branches also enforce ordering.

wren6991 · 2025-10-02T22:52:52 1759445572

I feel like the author might be slightly missing the point here:

> The whole premise of fusion is predicated on the idea that it is valid for a core to transform code similar to the branchy code on the left into code similar to the branch-free code on the right.

The idea of Zicond afaict is that the compiler transforms select sequences into (usually multiple) Zicond instructions, and cores with more register ports available can fuse Zicond compounds into more complex select macro-ops. It's a 2R1W vocabulary for describing selects which require more than 2 read ports.
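
As a rough illustration of that "2R1W vocabulary", here is a C++ model of how a general select decomposes into Zicond operations. The semantics of `czero.eqz` and `czero.nez` follow the Zicond definition; the C++ function names and the helper `select_or_zero` are made up for this sketch.

```cpp
#include <cstdint>

// czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1
std::int64_t czero_eqz(std::int64_t rs1, std::int64_t rs2) {
  return rs2 == 0 ? 0 : rs1;
}

// czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1
std::int64_t czero_nez(std::int64_t rs1, std::int64_t rs2) {
  return rs2 != 0 ? 0 : rs1;
}

// General select (cond ? a : b) as three operations, each reading at most
// two registers and writing one:
//   czero.eqz t0, a, cond
//   czero.nez t1, b, cond
//   or        rd, t0, t1
std::int64_t select_zicond(std::int64_t cond, std::int64_t a, std::int64_t b) {
  const std::int64_t t0 = czero_eqz(a, cond);  // a if cond != 0, else 0
  const std::int64_t t1 = czero_nez(b, cond);  // b if cond == 0, else 0
  return t0 | t1;                              // exactly one side is non-masked
}

// When the "false" value is zero, a single czero.eqz is the whole select.
std::int64_t select_or_zero(std::int64_t cond, std::int64_t a) {
  return czero_eqz(a, cond);
}
```

A core with a third read port could in principle treat the three-instruction sequence as one select operation, while a 2R1W core executes the instructions as written.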

As an aside, I evaluated Zicond on my scalar 3-stage implementation and found that at 1 CPI for ALU ops and 2-cycle taken branch cost, the branchless sequences GCC produced for Zicond were never better and sometimes worse than the equivalent branching sequence. It really does seem to be targeting bigger cores, or constant-time execution.

Dwedit · 2025-10-02T20:06:42 1759435602

32-bit ARM had literally every instruction be conditional.

phire · 2025-10-03T02:48:10 1759459690

And that design is generally regarded to have been a mistake; it results in very bad code density.

ARM went out of their way to remove it. Multiple times, with both AArch64 and the various implementations of Thumb.

chasil · 2025-10-03T03:29:56 1759462196

I can't imagine the pressure at Acorn, with the Olivetti acquisition impending, on Furber and Wilson to deliver their design unoptimized for the required task.

A CPU for this century is not the one for last?

throwaway81523 · 2025-10-03T07:22:28 1759476148

RISC-V has always seemed like a 20th century processor to me, heh.

pjmlp · 2025-10-03T09:29:46 1759483786

Given its origin from MIPS, it kind of is.