Great analysis. The next step would be to see if the code to implement the function can be shorter than 12 bytes (and array of bytes holding the number of days in the month such that f(x) = days_of_month[x]; is longer than the computation. Of course from a pure cycle counting advantage doing it as a calculation will always be faster as memory is slow to access.
I feel like you should be able to get away with a 2-byte constant of 0x3bec, but I haven't been able to simplify the bitshift amount enough, from (n >> 2 << 2) + 2 * (n % 2).
ETA:
I think the simplest I can get it is:
days(n) = 28 + (0x1df6 << (1^n&3) >> n) & 3
My intuition is that the 1^n&3 isn't going to be cheaper.
If you want to trade bytes of instructions for bytes of constants, you could store the number of days - 28, so you only need two bits of constant per month.
I think this has to win for fewest/cheapest instructions and genuinely clever, practical programming. I've been doing a lot of cache optimization lately, so I'm a fan of anything that reduces the amount of bits required to store information.
I mean, you have to load the instructions, too. That being said those have pretty intelligent pre-fetching baked in, but if it's the computational bottleneck, you can get the values into registers for the lookup.
I'm not sure this post is meant to be a model for good implementation advice, though, performance-oriented or otherwise. Just an interesting tidbit.
EDIT: I'm trying to figure out how to write code using registers instead of indexes into memory and it ends up with lots of conditionals instead of basically one instruction. The conversation about speed is... tricky. Can some people who are more expert help out?