This is way too complex; a purely combinatorial version would never meet timing in a design where you cared enough to put it in an FPGA. The question doesn't make sense on its face. You want a pipelined version.
I asked "why" to get to the bottom of what he's really asking.
You could use multicycle paths so that this logic doesn't limit your global timing. Then you would save the pipeline registers. This is much more useful for ASIC design, because on an FPGA the registers are there whether you use them or not.
However, total latency will be lower than with a pipelined version, which is nice if that's what you're optimizing for.
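To put rough numbers on the latency point, here is a back-of-the-envelope sketch in Python. The figures (18 ns of logic, 1 ns of register overhead per stage, a 4 ns clock) are made up for illustration, and the pipelined case assumes the logic splits into perfectly balanced stages, which real designs rarely manage.

```python
import math

def multicycle_latency_ns(logic_ns, overhead_ns, clk_ns):
    # One launch and one capture register: the register overhead is paid once,
    # and the logic is given however many whole cycles it needs.
    cycles = math.ceil((logic_ns + overhead_ns) / clk_ns)
    return cycles * clk_ns

def pipelined_latency_ns(logic_ns, overhead_ns, clk_ns):
    # Every stage pays the register overhead, so only (clk - overhead)
    # of each cycle is available for useful logic.
    stages = math.ceil(logic_ns / (clk_ns - overhead_ns))
    return stages * clk_ns

# Hypothetical numbers, not taken from any real device.
print(multicycle_latency_ns(18, 1, 4))  # 20
print(pipelined_latency_ns(18, 1, 4))   # 24
```

The catch is throughput: the pipelined version can take a new input every clock, while the multicycle version has to hold its inputs stable for the whole multi-cycle window.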