JIT is a completely separate issue. I'm not talking about C++ vs Java here. I'm talking about the designs of different assembly languages.

I'm talking about Intel Itanium (ia64) vs AMD x86-64. Or VLIW (compiler-driven parallelism) vs Out-of-Order pipelined processors.

Intel Itanium had "premade bundles". Without getting too far into the weeds, *the compiler* was responsible for discovering parallelism. The compiler then bundled independent instructions together so the hardware could issue them at the same time.

So think of this code:

    theLoop:
    mov ebx, eax[ecx] ; y = array[x]
    add ebx, edx ; y += z
    add ecx, 4
    cmp ecx, 
    jnz theLoop

Sticking with x86-style notation for readability (real IA-64 syntax looks different), the Itanium assembly language would allow the compiler to emit:

    mov ebx, eax[ecx] ; y = array[x]
    add ebx, edx : add ecx, 4 ; this line executes in parallel
    cmp ecx; 
    jnz theLoop

The compiler discovers that "add ebx, edx" and "add ecx, 4" are independent of each other, and therefore "bundles" them together. Itanium could then run them in parallel without needing as much decoder hardware to rediscover this information at run time.
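To make the bundling step concrete, here is a minimal Python sketch of the kind of check a compiler would do. It is not how any real Itanium compiler works; it only shows the idea. Everything in it is my own illustration: the `Instr` tuple, the `independent` and `bundle` helpers, the 2-wide bundle size, and the `<bound>` placeholder for the comparison operand that the listing above leaves out. It also ignores the flags written by the two adds, which real x86 would not.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    reads: frozenset
    writes: frozenset

def independent(a: Instr, b: Instr) -> bool:
    """True if b (later in program order) has no hazard against a."""
    return not (a.writes & b.reads       # read-after-write
                or a.reads & b.writes    # write-after-read
                or a.writes & b.writes)  # write-after-write

def bundle(program, width=2):
    """Greedily pack consecutive independent instructions, up to `width` each."""
    bundles = []
    for ins in program:
        cur = bundles[-1] if bundles else None
        if cur is not None and len(cur) < width and all(independent(p, ins) for p in cur):
            cur.append(ins)
        else:
            bundles.append([ins])
    return bundles

# The loop body from the listing. Assumption: only cmp writes "flags".
LOOP_BODY = [
    Instr("mov ebx, eax[ecx]", frozenset({"eax", "ecx"}), frozenset({"ebx"})),
    Instr("add ebx, edx",      frozenset({"ebx", "edx"}), frozenset({"ebx"})),
    Instr("add ecx, 4",        frozenset({"ecx"}),        frozenset({"ecx"})),
    Instr("cmp ecx, <bound>",  frozenset({"ecx"}),        frozenset({"flags"})),
    Instr("jnz theLoop",       frozenset({"flags"}),      frozenset()),
]

if __name__ == "__main__":
    for group in bundle(LOOP_BODY):
        print(" : ".join(i.text for i in group))
```

Under those assumptions the two adds end up in one bundle and everything else stays alone, which matches the four lines in the Itanium listing.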

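But look at how few optimizations are available to Itanium or its compiler!! The bundles are fixed at compile time, so you only get the parallelism the compiler could prove ahead of time. The amount of parallelism an out-of-order core finds in practice (even on classic x86 code) was much bigger, especially when you consider Tomasulo's Algorithm: with register renaming and a predicted branch, the load for the next loop iteration can start before the current iteration's "add ebx, edx" has finished.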
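Here is a rough back-of-the-envelope model of that gap, again in Python. It is not an implementation of Tomasulo's Algorithm, just an idealized dataflow count: every instruction takes one cycle, branches are predicted perfectly, there are unlimited execution units, and renaming means an instruction only waits on true (read-after-write) dependencies. The bundled side assumes the compiler emits exactly the four bundles from the listing, one per cycle, with no unrolling or software pipelining. The function names and the `<bound>` placeholder are mine.

```python
# Loop body as (text, registers read, registers written).
# Assumptions: flags from the adds are ignored, and the cmp bound
# (left out of the listing) is treated as an immediate.
BODY = [
    ("mov ebx, eax[ecx]", ("eax", "ecx"), ("ebx",)),
    ("add ebx, edx",      ("ebx", "edx"), ("ebx",)),
    ("add ecx, 4",        ("ecx",),       ("ecx",)),
    ("cmp ecx, <bound>",  ("ecx",),       ("flags",)),
    ("jnz theLoop",       ("flags",),     ()),
]

BUNDLES_PER_ITERATION = 4  # the four lines of the bundled listing

def bundled_cycles(iterations):
    """In-order issue of the fixed bundles, one bundle per cycle."""
    return BUNDLES_PER_ITERATION * iterations

def dataflow_cycles(iterations):
    """Cycles when only read-after-write dependencies delay an instruction."""
    ready = {}   # register -> cycle its newest renamed value becomes available
    finish = 0
    for _ in range(iterations):
        for _text, reads, writes in BODY:
            start = max((ready.get(r, 0) for r in reads), default=0)
            done = start + 1        # assumed unit latency
            for w in writes:
                ready[w] = done     # renaming: a new write never waits on old readers
            finish = max(finish, done)
    return finish

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, bundled_cycles(n), dataflow_cycles(n))
```

In this toy model the only loop-carried dependency is the chain of "add ecx, 4" instructions, so the out-of-order side settles at about one iteration per cycle while the fixed bundles stay at four cycles per iteration. Real hardware on either side would land somewhere less extreme.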