Scheduling

The Y8 design works nicely with a standard, classic synchronous clock. It was indeed designed this way: each clock cycle would be a new instruction (with a few exceptions). This design is simple though a previous log (61. Making Y8 more energy-efficient with a deglitcher) shows that a more power-efficient (yet slower) method exists, using SR latches. Similar systems exist with transparent latches and the logical conclusion is to use a multi-phases clock.

In fact, a wide range of choices is available but I have barely scratched the surface, only looking for specific enhancements at specific places. For example I have only latched data at the instruction decoder level but other places would also be good targets : the output of the register set (with the IMM/REG MUX) and the result bus come to mind. Clock gating is nice but data gating is nice too :-)

A 4-phases clock (with many local gates) sounds like the best approach though it wouldn't account for the individual latencies of the individual stages. In fact, it is very similar to an unbalanced pipeline... This situation already appeared with the YASEP where a configurable pipeline was explored.

There are already several options for the decoder (either fast or latched) but I have to get ready for more options (and combinations). The clock generator's design will also follow from the chosen options. Fortunately the units themselves are not affected, only the connections between them.

In the current design iteration of the Y8, the instruction cycle is easily split into about 4 parts.

The fetch/decode step : when the main clock goes up, the NPC value is latched by the memory block, the RAM is read and the value is fed to the decoding logic (approx 3 or 4 tiles). Given an access time of about 3ns and 1ns per tile, that first phase is around 7ns on baseline A3P.
The operand read/select phase : lookup the register set and mix/select the immediate operand, going through approx. 5 tiles (or 5ns)
The "execution" phase is at least 7 tiles deep, given the results of the ALU. Some headroom might be needed.
The writeback evaluates the condition, checks the parity/sign/nullity of the result, MUXes the late units (and eventually latches the result). The result is fanout to the register set, which is subject to sample&hold constraints. That would be about 5 tiles long.

Using the worst case of 7ns per phase, the "minor clock" would be around 140MHz and the instruction cycle 35MHz. This could be bumped to about 45MHz if the writeback step overlaps the fetch/decode step. This is not very far from the estimated 49MHz of the unoptimised/unpipelined synthesis (13. First synthesis).

I'd like to offer at least the "simple/fast" and the "phased" options because one promotes speed while the other draws less current but most importantly is easier/smaller to implement in ASIC, since latches use less room that flip-flops (which are really just 2 transparent latches back to back).

T-latches are typically implemented by Actel with a MUX2 looped back to itself. ASICs prefer pass-gates, which is more or less the same. Sprinkling T-latches all over a chip is an old practice, because

there is less load on the clock network
2 latches make a DFF
T-latches use less surface and the register set would use 1/2 the surface than with DFF.

OTOH, DFF come for free with FPGA...

GHDL in a docker container

Scheduling (2)

Discussions

Become a Hackaday.io Member