The previously described method for balancing a tree of MUX relays reminded me of the difficulty of creating a similar dual-MUX tree in an FPGA where there are no memory blocks. I encountered this problem while designing the YASEP in VHDL for Actel's A3P family, and the buffering was pretty complicated.
In an ASIC or FPGA, though, the partition constraints are relaxed, and slight imbalances can help the place-and-route tool put gates further apart or closer together. Furthermore, the synchronicity of the arrival of all the signals ensures a more consistent timing. The elimination of the huge final fanout also brings a better overall operating speed.
20170317
The previous research dealt with the relay version of the register set and memory decoder, which I now realize are specific cases of the Knapsack problem. I'm now diving into the realm of combinatorial optimization!
With an ASIC or FPGA, we can relax some of the constraints, and a little imbalance (10% or less) is not critical. We don't even need a power-of-two number of sinks, because the gate inputs are not in series and tied in the middle. As a consequence, for an ASIC or FPGA, the original approach of rotation is easy and makes sense!
Let's examine the example of a 16-address register set, such as the one used by the YASEP (Yet Another Small Embedded Processor) or the F-CPU (v2). There are 4 address lines, and let's consider a 16-bit wide register (wider registers are just duplicates). The total fan-in is 16+32+64+128=240 for the 16×MUX16. The problematic fan-in of 128 (the last stage of the MUX) is reduced to four fan-ins of 240/4=60. This is slightly faster (because there are fewer buffers to traverse; I estimate the gain to be equivalent to 1 gate propagation time, due to the propagation in forward and reverse directions) and makes a more regular structure.
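As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes each MUX16 is built as a binary tree of 2-input multiplexers, so that an address line drives 1, 2, 4 or 8 select inputs per data bit depending on the tree level it controls. The function name select_loads is only illustrative.

```python
def select_loads(address_lines, width):
    """Loads seen by each address line when one line drives the same
    tree level for every data bit (no rotation).

    Assumption: a binary tree of MUX2, so the levels hold
    1, 2, 4, ... 2**(address_lines-1) multiplexers per data bit.
    """
    return [width * 2**level for level in range(address_lines)]

loads = select_loads(address_lines=4, width=16)
total = sum(loads)
print(loads)                 # [16, 32, 64, 128]
print(total)                 # 240
print(total // len(loads))   # 60 loads per line once balanced
```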
Actually the structure can be decomposed into groups of bits that are as wide as the number of address lines. For our register set with 4 address lines, we can create groups of 4 bits, where each line gets a fan-in of (1+2+4+8)=15 per group. For a full-custom design, these nibbles can be hand-optimised then duplicated as required, but it's also possible to just let the synthesizer and the place & route tools decide how to allocate the 4 buffer lines. I should write some VHDL for this...
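The nibble decomposition can be checked with a small Python sketch. It assumes the simplest rotation scheme (bit i shifts the line-to-level assignment by i mod 4) and the same binary MUX2 tree as before; LEVEL_LOADS and line_loads are names made up for this illustration.

```python
LEVEL_LOADS = [8, 4, 2, 1]  # MUX2 select inputs per tree level, for one data bit

def line_loads(width, lines=4):
    """Total load on each address line when bit i rotates the
    line-to-level assignment by i mod lines (assumed scheme)."""
    loads = [0] * lines
    for bit in range(width):
        for level, load in enumerate(LEVEL_LOADS):
            line = (level + bit) % lines  # rotated assignment
            loads[line] += load
    return loads

print(line_loads(4))    # one nibble: [15, 15, 15, 15]
print(line_loads(16))   # 16-bit register: [60, 60, 60, 60]
```

Any width that is a multiple of 4 stays perfectly balanced with this scheme; other widths leave a small imbalance, which the text above considers tolerable for an ASIC or FPGA.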
20170320
Interleaving.
There is more than one way to interleave the control lines. For example, simple rotations have at least 3 versions:
    step=4    step=2    step=1
0   a b c d   a b c d   a b c d
1   a b c d   a b c d   b c d a
2   a b c d   b c d a   c d a b
3   a b c d   b c d a   d a b c
4   b c d a   c d a b   a b c d
5   b c d a   c d a b   b c d a
6   b c d a   d a b c   c d a b
7   b c d a   d a b c   d a b c
8   c d a b   a b c d   a b c d
9   c d a b   a b c d   b c d a
A   c d a b   b c d a   c d a b
B   c d a b   b c d a   d a b c
C   d a b c   c d a b   a b c d
D   d a b c   c d a b   b c d a
E   d a b c   d a b c   c d a b
F   d a b c   d a b c   d a b c
I don't even mention the direction (negative steps) or mirroring (dcba instead of abcd).
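The three rotation variants listed above follow one rule: the line order is rotated left by one position every `step` rows. A minimal Python sketch that regenerates them (the function name rotated is just for illustration):

```python
LINES = "abcd"

def rotated(row, step, lines=LINES):
    """Line order for a given row: rotate left by one every `step` rows."""
    r = (row // step) % len(lines)
    return lines[r:] + lines[:r]

# Print the rows 0..F for step=4, step=2 and step=1, side by side
for row in range(16):
    cols = [" ".join(rotated(row, step)) for step in (4, 2, 1)]
    print(f"{row:X}   " + "   ".join(cols))
```

Negative steps or mirroring could be explored by changing the sign of r or by passing "dcba" as lines.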
Then there is the matter of "reversal" (using a different symmetry)...
0 a b c d
1 b c d a
2 c d a b
3 d a b c
4 d a b c
5 c d a b
6 b c d a
7 a b c d
8 a b c d
9 b c d a
A c d a b
B d a b c
C d a b c
D c d a b
E b c d a
F a b c d
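The "reversal" pattern above can be described as a rotation amount that counts up 0, 1, 2, 3 and then back down 3, 2, 1, 0. A short Python sketch of that reading (reversed_rotation is a hypothetical name):

```python
def reversed_rotation(row, lines="abcd"):
    """Rotation that sweeps up then back down (0,1,2,3,3,2,1,0,...)."""
    n = len(lines)
    phase = row % (2 * n)
    r = phase if phase < n else 2 * n - 1 - phase  # mirror the second half
    return lines[r:] + lines[:r]

for row in range(16):
    print(f"{row:X} " + " ".join(reversed_rotation(row)))
```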
All these combinations should be tried and tested to see which yields the best speed, considering that each FPGA or gate array has its own characteristics, fanout trees, etc.