Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Great analysis. The next step would be to see if the code to implement the function can be shorter than 12 bytes (and array of bytes holding the number of days in the month such that f(x) = days_of_month[x]; is longer than the computation. Of course from a pure cycle counting advantage doing it as a calculation will always be faster as memory is slow to access.


How about 15 bytes? (x86, month in ecx, result in eax)

  // 28 + ((0x3bbeecc>>m>>m)&3);
  b8 cc ee bb 03    mov  eax,0x3bbeecc
  d3 f8             sar  eax,cl
  d3 f8             sar  eax,cl
  83 e0 03          and  eax,0x3
  83 c0 1c          add  eax,0x1c


I think you can save a byte on the add with

    04 1c             add al, 0x1c
(but i've never written any x86 assembly)


You are right. I was lazy, I compiled it with gcc -O2 and extracted the assembler with objdump -Mintel -d.


In fact you can save 2 bytes if you only care about the result in AL, bringing it down to 13 bytes:

    b8 cc ee bb 03    mov  eax,0x3bbeecc
    d3 f8             sar  eax,cl
    d3 f8             sar  eax,cl
    24 03             and  al, 3
    04 1c             add  al, 28


I feel like you should be able to get away with a 2-byte constant of 0x3bec, but I haven't been able to simplify the bitshift amount enough, from (n >> 2 << 2) + 2 * (n % 2).

ETA: I think the simplest I can get it is:

  days(n) = 28 + (0x1df6 << (1^n&3) >> n) & 3
My intuition is that the 1^n&3 isn't going to be cheaper.


If you want to trade bytes of instructions for bytes of constants, you could store the number of days - 28, so you only need two bits of constant per month.


Once you do that, you can trade your array for a 32-bit integer:

f(n) = 28 + ((0x3bbeecc >> n >> n) & 3)


I think this has to win for fewest/cheapest instructions and genuinely clever, practical programming. I've been doing a lot of cache optimization lately, so I'm a fan of anything that reduces the amount of bits required to store information.


Ohh I like that a lot.


I mean, you have to load the instructions, too. That being said those have pretty intelligent pre-fetching baked in, but if it's the computational bottleneck, you can get the values into registers for the lookup.

I'm not sure this post is meant to be a model for good implementation advice, though, performance-oriented or otherwise. Just an interesting tidbit.

EDIT: I'm trying to figure out how to write code using registers instead of indexes into memory and it ends up with lots of conditionals instead of basically one instruction. The conversation about speed is... tricky. Can some people who are more expert help out?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: