How to divide the register set's power consumption by about 5

The latest source code archive contains the enhanced decoder for the register set, including 3 strategies:

Straight (fast)
Latching: update only the meaningful control lines
Instruction-sensitive: update only the meaningful control lines, and only when the related field is used

I provide a pseudo-randomised test to compare these strategies and the outcome is great:

[yg@Host-001 R7]$ ./test.sh 
Testing R7:
  straight decoder:R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 702273 toggles
  latching decoder:R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 301068 toggles
  Instr-sensitive :R7_tb_dec.vhdl:165:5:(report note): 100000 iterations, 160231 toggles
R7: OK
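
To put the three counts on the same scale, the per-instruction averages and the reduction factors can be recomputed from the figures in the log. This is only a small Python sketch (the project itself is VHDL), and the names `toggles`, `per_instruction` and `reduction_factor` are made up for the illustration.

```python
# Figures copied from the test run; nothing here is measured again.
ITERATIONS = 100_000
toggles = {
    "straight": 702_273,
    "latching": 301_068,
    "instr-sensitive": 160_231,
}

def per_instruction(count, iterations=ITERATIONS):
    """Average number of control-line toggles per decoded instruction."""
    return count / iterations

def reduction_factor(baseline, improved):
    """How many times fewer toggles `improved` produces than `baseline`."""
    return toggles[baseline] / toggles[improved]

for name, count in toggles.items():
    print(f"{name:16s} {per_instruction(count):.2f} toggles/instruction")

print(f"straight vs latching:        {reduction_factor('straight', 'latching'):.2f}x")
print(f"straight vs instr-sensitive: {reduction_factor('straight', 'instr-sensitive'):.2f}x")
```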

The third result is about 4.4 times lower than the first (702273 vs. 160231 toggles), a ratio between 1/4 and 1/5, which I explain below:

Given that the probability of one bit being set is pretty close to 1/2, it makes sense that the first "straight" decoder toggles each output bit every other time on average. There are 14 control lines to drive, so with a 1/2 probability about 7 lines change per instruction, which matches the measured 7.02.
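
The "every other time" argument is easy to check with a few lines of Python. This is a simplified stand-in, not the VHDL testbench: it assumes each of the 14 control lines simply receives a fresh, independent random bit on every instruction, and `straight_toggles` is a hypothetical name.

```python
import random

N_LINES = 14  # control lines driven by the straight decoder (from the text)

def straight_toggles(iterations, seed=1):
    """Average number of control lines that flip per instruction."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    lines = [0] * N_LINES
    total = 0
    for _ in range(iterations):
        # Assumption: every line gets an uncorrelated random bit each time.
        new = [rng.getrandbits(1) for _ in range(N_LINES)]
        total += sum(old != cur for old, cur in zip(lines, new))
        lines = new
    return total / iterations

print(straight_toggles(100_000))
```

With uncorrelated bits the average settles close to 7, half of the 14 lines.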

The next method gives a better result that you can understand using similar logic: we get 3 toggles per instruction, which makes total sense. Each decoder updates only 3 of its 7 control lines because the other 4 give results that will not be used. There are 2 decoders, but each updated line has only a 1/2 chance of changing, so the total is that of one decoder: 2 × 3 × 1/2 = 3. So far, so good, no surprise at all.
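
The same reasoning can be sketched in Python. The model below is an assumption, not a description of the actual VHDL: it treats each decoder as steering an 8-to-1 binary multiplexer tree with 4 + 2 + 1 = 7 select lines, of which only the 3 on the path to the addressed register matter. The names `path_lines` and `latching_toggles` are hypothetical.

```python
import random

def path_lines(addr):
    """Return (line, value) pairs for the 3 select lines on the path to `addr`.

    Lines are numbered like a binary heap: 0 is the root mux, and the
    children of line n are 2n+1 and 2n+2.
    """
    node, path = 0, []
    for level in (2, 1, 0):          # most significant address bit first
        bit = (addr >> level) & 1
        path.append((node, bit))
        node = 2 * node + 1 + bit    # descend to the selected child
    return path

def latching_toggles(iterations, seed=1):
    """Average toggles per instruction for two latching decoders."""
    rng = random.Random(seed)
    decoders = [[0] * 7 for _ in range(2)]   # 2 decoders x 7 control lines
    total = 0
    for _ in range(iterations):
        for lines in decoders:
            addr = rng.getrandbits(3)        # uncorrelated 3-bit register number
            for line, bit in path_lines(addr):
                total += lines[line] != bit  # count a toggle if the value changes
                lines[line] = bit            # the other 4 lines keep their old value
    return total / iterations

print(latching_toggles(100_000))
```

Under these assumptions the average lands close to 3 toggles per instruction, in line with the measured 3.01.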

The last method gives an average toggle rate of 1.6 per instruction. This is about one half of the previous result and, though it should be taken with a lot of caution, the benefit is clear. Some instructions (about 1/4) don't use the SND field, and the SRI field is not used when the Imm8 or Imm4 fields are used, giving a further significant reduction of toggles.
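
A back-of-the-envelope model makes the effect of unused fields visible. It assumes one decoder per field (SND and SRI), 3 meaningful lines per decoder, a 1/2 toggle chance per updated line, and no update at all when the field is unused. The 0.75 usage rate for SND comes from the "about 1/4" above; the SRI usage rate is not given, so the sketch only back-solves the value this simple model would need to reproduce the measured average. Treat that number as a model artefact, not as a property of the instruction set. All names are hypothetical.

```python
ITERATIONS = 100_000
MEASURED_TOGGLES = 160_231   # "Instr-sensitive" line of the test run
LINES_PER_DECODER = 3        # meaningful control lines per decoder (from the text)
P_TOGGLE = 0.5               # chance that an updated line changes value

def expected_toggles(p_snd_used, p_sri_used):
    """Expected toggles per instruction when a decoder updates only if its field is used."""
    per_decoder = LINES_PER_DECODER * P_TOGGLE
    return per_decoder * (p_snd_used + p_sri_used)

def implied_sri_usage(measured_avg, p_snd_used):
    """SRI usage rate that would make the model match a measured average."""
    per_decoder = LINES_PER_DECODER * P_TOGGLE
    return measured_avg / per_decoder - p_snd_used

p_snd = 0.75                 # about 1/4 of the instructions skip SND
p_sri = implied_sri_usage(MEASURED_TOGGLES / ITERATIONS, p_snd)

print(f"both fields always used: {expected_toggles(1.0, 1.0):.2f} toggles/instruction")
print(f"implied SRI usage rate:  {p_sri:.2f}")
```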

Of course, these numbers are NOT representative of real use cases. I used pretty uncorrelated bits as sources, while real workloads follow some sort of pattern. The numbers will certainly increase or decrease, depending on each program.

There is a compromise for each situation and the 3 methods are provided in the source code, so you can choose the best trade-off between latency and consumption. The numbers are pretty good and I think I have reached the point of diminishing returns. Any "enhancement" will increase the logic complexity with insignificant gains...
