Hacker News

Because the cost of that unnecessary mov is very small, so the win from human assembly is very small.

But rules of thumb are like this. If you know enough to question the rule of thumb, go ahead. Hand assembly in hot code can be worth the cost.

It's also possible the value in ecx is used again outside the snippet?



In that context, it's not very small, it's 20%: one extra instruction out of five, and all of them are register-to-register instructions, so they all have the same weight. It's huge.

Yes, there's the possibility that ecx is used elsewhere, and in that case my second comment is irrelevant, because I was answering the suggestion that such a big wart is to be expected from compilers because they crop up regularly.

But then again, it's unlikely that it's used elsewhere, because eax has the return value of the C snippet, there's nothing else to do, the function can return. So the original question remains: did this come from a C compiler? If yes, it's crappy code.


> In that context, it's not very small, it's 20% (all instructions are register-to-register instructions, so they all have the same weight). It's huge.

Huge in space sure. Not in execution time.


It's 20% of the execution time. All these instructions use the same number of cycles.


Do they? I put together two quick-and-dirty nonsense test programs; this is option2:

   int main (void) {
       for (int i = 0; i < 1000000000; ++i) {
            asm volatile (
                ".intel_syntax noprefix\n"
                "mov eax, edi\n"
                "sar eax, 31\n"
                "add edi, eax\n"
                "xor eax, edi\n"
                ".att_syntax prefix\n"
            ::: "eax", "edi");
       }
       return 0;
   }
option1 additionally has the extraneous mov ecx, eax, and then does the add through ecx.

I confirmed with objdump -d that the assembly hadn't been touched and that the loops were the same. On my otherwise mostly idle dual L5640 system and pinned to a single cpu (just in case), option1 consistently runs in 3.14 seconds and option2 consistently runs in 3.15 seconds.

Adding an extra zero, both option1 and option2 run in 30.94-30.95 user seconds. The extraneous move doesn't seem to cost any actual time.


Microbenchmarks don't usually tell the whole story. Once the bloat adds up the cache misses and macro-scale benchmarks will show a difference.


I'm sure the size penalty adds up in some cases.

But if you look at your program that must go faster, and you see unnecessary moves in the hot section(s), go ahead and remove them, but don't be surprised if it doesn't change much.

If you went and did your whole program by hand, the debloating might also not change much. That's why there's a rule of thumb.

If you have the skill to make a change to the compiler so it can output a better sequence of instructions (I suspect that's pretty difficult), it may make enough of a difference over a large number of programs to be worthwhile.



