
So I spent too much time benchmarking:

    | Entries | gcc | gcc+pref. | clang | clang+pref. | clang -cmov | clang -cmov+pref. |
    |---------|-----|-----------|-------|-------------|-------------|-------------------|
    | 0.5     | 210 | 88        | 213   | 191         | 109         | 93                |
    | 1.0     | 231 | 107       | 235   | 211         | 134         | 112               |
    | 2.0     | 289 | 168       | 306   | 275         | 231         | 179               |
    | 5.0     | 369 | 231       | 389   | 343         | 338         | 239               |
    | 10      | 413 | 268       | 437   | 384         | 410         | 276               |
    | 25      | 469 | 311       | 490   | 435         | 506         | 318               |
    | 50      | 515 | 346       | 537   | 478         | 586         | 356               |
    | 100     | 564 | 387       | 588   | 522         | 670         | 399               |
Entries are in millions and times in ns per bsearch call. Prefetching makes all the difference, but perhaps not for the right reason. On my machine (Broadwell) the two prefetches that you suggested make gcc emit the cmovb that clang with -cmov uses. The second prefetch alone is enough to make it prefer cmovb, but the first one alone is not. Maybe a hand-hacked assembly loop based on the code gcc emits, but without the prefetches, would run even faster.
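
For anyone curious, a rough sketch of the kind of loop being discussed (a branch-free lower_bound that prefetches both candidate midpoints); this is illustrative, not my exact benchmark code:

    #include <stddef.h>

    /* Rough sketch only: branch-free lower_bound over a sorted int
       array, prefetching the midpoints of both halves so the next
       load is in flight whichever way we descend. */
    static size_t lower_bound_prefetch(const int *a, size_t n, int key)
    {
        const int *base = a;
        if (n == 0)
            return 0;
        while (n > 1) {
            size_t half = n / 2;
            /* One of these two is the element we compare against on the
               next iteration; prefetch both so the miss is overlapped. */
            __builtin_prefetch(&base[half / 2]);
            __builtin_prefetch(&base[half + half / 2]);
            /* Compilers may lower this ternary to cmovb or to a branch. */
            base = (base[half] < key) ? &base[half] : base;
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }

The point of prefetching both candidates is that the next load address is known before the comparison resolves, so the cache miss overlaps with the (unpredictable) decision.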

