Second sketch

I have the drawing somewhere but it's not very different from the first one. There are two significant changes though.

The first change is the "instruction buffer", ex-Instruction L0, now called "explicit branch target buffer" (eBTB). It's a big change that also affects the instructions so I'll cover it later in detail.

The second change is the instruction decoder that is now meant to decode 3 instructions per cycle, instead of two. It uses an additional third slot in parallel with the 2 other pipelines (each inside one glob) to process the control stack, the jumps, the calls, the returns, and other instructions of the sort.

I couldn't... I just couldn't resolve myself to make a choice of which pipeline had to be sacrificed or modified to handle the jumps. And the nature of the operation is so different that it makes no sense to mix the control flow logic with the computations.

Ideally, the instructions are fed in this order : Glob1-Glob2-Stack. Any "slot" can be absent, so the realignment is going to be a bit tricky, and it increases the necessary bandwidth for the memories (FSB and caches). But the ILP can reach 3 in some cases, and is better than if I mixed the stack decoder with the Glob2 decoder, as this would create slot allocation contentions.

So the overall effect is a reasonable increase in ILP, the glob decoders remain simple, a special dedicated datapath can be carved out for the stack's operation and the "control" slot only communicates with the globs through status bits.

...

The other change, as alluded before, is with the instruction buffers that cache the L1, which are now explicitly addressed (at least at a first level). Most jumps work with a prefetch instruction that targets one of the lines, then a "switch" instruction that selects one of the lines for execution (and may stall if absent yet).

I have allocated 3 bits for the line ID. That makes 8 lines for prefetching upcoming instructions for a branch target. More lines (4?) are also dedicated-linked to the control stack.

Another line is the PC, which is actually implemented as a pair of line for double-buffering.

A direct Jump instruction acts as a direct prefetch to the double buffer, but it will stall the pipelines.

The targets for function calls, loops, switch/case, IF, and others are easy to statically schedule a few cycles in advance most of the time, leading to an efficient parallelism of all the units. Then a variety of branch instructions will conditionally select a different eBTB line ID, which gets virtually copied into Line #0.

So the eBTB has 2+8+4 lines working as L0 with specific ties to other units. More are possible using indexed/relative addressing of the eBTB.

This scheme mimics what is already happening with the data accesses : the core is decoupled from the memory system and communicates through registers. Instead, here, there is no register even though the lines virtually count as 8 address registers. But you can't read them back or alter them.

A given target line can only be set with a prefetch instruction.
A target line's address can be saved on the control stack (SPILL) and restored (UNSPILL), under program's control.
Target lines are invalidated/hidden/forbidden across IPC/IPE/IPR because the change of the module makes the address irrelevant. However the addresses remain in cache and the stack keeps some extra metadata to keep the return smooth.

So the Y32 behaves as a 3-pipeline core, with one register set per pipeline, where the 2 short data pipelines (read-decode / ALU / writeback) can communicate with each other and memory, but the control/jump pipeline can't readback or move outside of its unit, to preserve safety/security/speed. The only way to steer execution of the control unit is:

with explicit instructions containing immediate data,
or by reading flags and status bits from the other pipelines.

The recent reduction of the PC width (now 24 bits) changes a lot of things compared to a classical/canonical RISC CPU and this enables certains instructions and combinations, while also reducing the computation's critical datapath.

Compared to YASEP32, the core is larger but has higher ILP and can reach higher operating frequencies due to finer pipelining, Y32 is also more capable of keeping a high throughput because it is designed to better handle cache memories from the ground up.

First sketch (and discussion)

eBTB v2.0

Discussions

Become a Hackaday.io Member