
So the 1024-bit number is the number of vector output bits per cycle, i.e. 2×FMA+2×FADD = (2+2)×256-bit? Is the term "datapath width" used for that anywhere else? (I guess you've prefixed that with "total " in some places, which makes much more sense)


"Datapath width" is somewhat ambiguous.

For most operations, an ALU has a width measured in 1-bit subunits, e.g. adders, and the number of subunits equals the width in bit lines of the output path and of each of the 2 input paths used by most operands. Some operations use only one input path, while others, like FMA or bit select, may need 3 input paths.

The width of the datapath is normally taken to be the number of 1-bit subunits of the execution units, which is equal to the width in bit lines of the output path.

Depending on the implemented instruction set, the number of input paths having the same width as the output path may vary, e.g. either 2 or 3. In reality this is even more complicated, e.g. for 4 execution units you may have 10 input paths, whose connections can be changed dynamically, so they may provide 3 input paths for some execution units and 2 input paths for others, depending on which micro-operations happen to be executed there during a clock cycle. Moreover there may be many additional bypass operand paths.

Therefore, if you say that the datapath width for a single execution unit is 256 bits, because it has 256 x 1-bit ALU subunits and 256 bit lines for output, that does not completely determine the complexity of the execution unit, because it may have a total input path width varying e.g. between 512 and 1024 bit lines or even more (which are selected with multiplexers).
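To make that input-vs-output asymmetry concrete, here is a toy calculation (the function name and operand counts are mine, purely illustrative, not from any CPU manual):

```python
def input_path_bits(output_bits, operand_count):
    # Total input bit lines for one execution unit, assuming every
    # source operand path is as wide as the output path.
    return output_bits * operand_count

# A 256-bit-wide unit:
two_source = input_path_bits(256, 2)    # e.g. ADD: 512 input bit lines
three_source = input_path_bits(256, 3)  # e.g. FMA a*b+c: 768 input bit lines
```

With extra bypass paths on top, the 1024-bit figure mentioned above is easy to reach even though the "datapath width" is nominally 256 bits.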

The datapath width of a single execution unit matters very little for the performance of a CPU or GPU. What matters is the total datapath width, summed over all available execution units, which determines the CPU throughput when executing a program.
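As a back-of-the-envelope model of why the total matters (illustrative code, my own function names, with a hypothetical core that has four 256-bit vector pipes):

```python
def total_datapath_bits(unit_widths):
    # unit_widths: per-unit output width in bits, one entry per
    # vector execution unit; the total is what bounds throughput.
    return sum(unit_widths)

def elements_per_cycle(total_bits, element_bits):
    # Peak vector elements all units together can produce per clock.
    return total_bits // element_bits

total = total_datapath_bits([256, 256, 256, 256])  # 1024 bits
fp64_rate = elements_per_cycle(total, 64)          # 16 doubles/cycle peak
```

Whether those 1024 bits come from four 256-bit units or two 512-bit units is invisible at this level; only the sum shows up in the throughput bound.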

For AVX programs, starting with Zen 2 the AMD CPUs had a total datapath width of 1024 bits vs. 768 bits for Intel, which is why they easily beat the Intel CPUs in AVX benchmarks.

For 512-bit AVX-512 instructions, Zen 4 and the Intel Xeon CPUs with P-cores have the same total datapath width for instructions other than FMUL/FMA/LD/ST, which has resulted in the same throughput per clock cycle for programs that do not depend heavily on floating-point multiplications. Because Zen 4 had higher clock frequencies in power-limited conditions, it has typically beaten the Xeons in AVX-512 benchmarks, except for programs that can use the AMX instruction set, which is not yet implemented by AMD.

The "double-pumped" term used about Zen 4 has created a lot of confusion, because it does not refer to the datapath width, but only to the number of available floating-point multipliers, which is half that of the top Intel Xeon models, so FP multiplications require twice as many clock cycles on Zen 4.

The term "double-pumped" is actually true for many models of AMD Radeon GPUs, where e.g. a 2048-bit instruction (wavefront size 64) is executed in 2 clock cycles as 2 x 1024-bit micro-operations (wavefront size 32).
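The Radeon case is just the wavefront size divided by the SIMD width, which a one-line sketch makes obvious (illustrative, my own function name):

```python
import math

def cycles_for_wavefront(wavefront_size, simd_lanes):
    # A wavefront wider than the SIMD hardware is issued over
    # several consecutive cycles on the same unit.
    return math.ceil(wavefront_size / simd_lanes)

wave64_on_32_lanes = cycles_for_wavefront(64, 32)  # 2 cycles: double-pumped
wave32_on_32_lanes = cycles_for_wavefront(32, 32)  # 1 cycle
```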

On Zen 4, it is not at all certain that this is how the 512-bit instructions are executed, because unlike on Radeon, on Zen 4 there are 2 parallel execution units that can execute the instruction halves simultaneously, which results in the same throughput as when the execution is "double-pumped" in a single execution unit.
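The reason the two execution models are indistinguishable from the outside is that throughput depends only on how many instruction halves the units retire per cycle, not on which unit retires them. A toy model (my own naming, purely illustrative):

```python
def insns_per_cycle(num_units, halves_per_insn):
    # Each unit retires one half-width micro-operation per cycle.
    # Double-pumped: one unit runs both halves back to back.
    # Split: the two halves go to two units in the same cycle.
    # Either way only the total halves/cycle matters.
    return num_units / halves_per_insn

two_units = insns_per_cycle(2, 2)  # 1.0 512-bit insn/cycle sustained
one_unit = insns_per_cycle(1, 2)   # 0.5 insn/cycle
```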



