The energy requirement for a quire is very high unless you are talking about posits of less than 12 bits or so (in which case the strategy actually comes out ahead from what I've seen, because you skip the rounding step back to a float/posit). Holding >500 flip-flops of state, or a >500-bit RAM, for a (32, 2) posit quire uses a crapload of power compared to just a 32-bit register.
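For a rough sense of where the >500 bits come from, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the padding rule (round up to the next power of two to leave room for carry guard bits) is my assumption, not something taken from a spec. It only counts the fixed-point positions needed to hold any product of two posits exactly.

```python
def quire_min_bits(n: int, es: int) -> int:
    """Minimum fixed-point width that holds any posit(n, es) product exactly.

    Hypothetical helper for illustration; excludes carry guard bits.
    """
    max_exp = (n - 2) * (1 << es)      # log2(maxpos); minpos is 2**-max_exp
    # Products span bit weights 2**(-2*max_exp) .. 2**(2*max_exp).
    magnitude_bits = 4 * max_exp + 1
    return magnitude_bits + 1          # plus one sign bit


def quire_padded_bits(n: int, es: int) -> int:
    """Assumption: pad up to the next power of two to leave carry headroom."""
    return 1 << (quire_min_bits(n, es) - 1).bit_length()


print(quire_min_bits(32, 2))     # 482
print(quire_padded_bits(32, 2))  # 512
```

So even before any carry headroom, a (32, 2) quire is already about 15x the state of a 32-bit register.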
Pipelining becomes difficult too: a carry can potentially touch every bit held in the quire, so the naive implementation has to wait for the latency of a >500-bit adder.
You can partially solve this by bucketing on where in the quire a posit value (expanded to fixed point) should be added. Fixed-point addition is associative (ignoring overflow), so within a single cycle you don't have to wait for everything and can reorder as needed, but a carry can still potentially propagate across the entire quire. This also means additional flip-flop/RAM state and more bookkeeping, which burns even more energy. Maybe you can get more exotic with asynchronous logic (like Nvidia did in a recent paper on handling logarithmic addition), but good luck verifying timing and everything for that.
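To make the bucketing idea concrete, here is a minimal software sketch, not a hardware design. The class name, the 64-bit limb size, and the non-negative-only restriction are all assumptions for illustration. Each add touches only the limbs it overlaps and lets them overflow locally (standing in for the extra per-bucket carry state), and the full-width carry chain only runs when the result is read out.

```python
LIMB_BITS = 64                      # assumed bucket width
LIMB_MASK = (1 << LIMB_BITS) - 1


class BucketedAccumulator:
    """Wide fixed-point accumulator split into limbs with deferred carries.

    Illustrative only: non-negative terms, overflow past the top limb ignored.
    """

    def __init__(self, total_bits: int = 512):
        self.limbs = [0] * (total_bits // LIMB_BITS)

    def add(self, significand: int, shift: int) -> None:
        """Add significand << shift, touching only the limbs it overlaps."""
        value = significand << (shift % LIMB_BITS)
        i = shift // LIMB_BITS
        while value:
            self.limbs[i] += value & LIMB_MASK   # no carry propagation here
            value >>= LIMB_BITS
            i += 1

    def resolve(self) -> int:
        """Run the deferred carry chain across all limbs and return the value."""
        total, carry = 0, 0
        for i, limb in enumerate(self.limbs):
            limb += carry
            total |= (limb & LIMB_MASK) << (i * LIMB_BITS)
            carry = limb >> LIMB_BITS
        return total


terms = [(3, 0), (LIMB_MASK, 0), (7, 63), (5, 100)]
acc = BucketedAccumulator()
for significand, shift in reversed(terms):   # order does not matter
    acc.add(significand, shift)
assert acc.resolve() == sum(s << sh for s, sh in terms)
```

Note that `resolve` still walks every limb, which is the point above: bucketing defers the long carry chain, it doesn't remove it, and the per-limb overflow bits are exactly the extra state that costs energy.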
It's been quite a while since I've run numbers on this, but I wouldn't be surprised if you could get 4-10+ non-quire posit FMAs in the same power budget as a single quire-based posit FMA, and fit a lot more of them onto a chip.
Don't get me wrong, I think the posit is a wonderful idea (make the bits you are storing in memory more meaningful), but I think a lot of Gustafson's proposals hinge on the quire being available too ("the end of error"), which is a strong ask save for the few applications that actually need it.