Decode Instructions into StackOps?

I'm hacking on the Verilog for the CPU, trying to implement a simple five-stage pipeline (fetch, decode, execute, memory, and write-back). However, this is proving significantly more complex than I'd anticipated, and the complexity just keeps going up.

I'm trying to think of alternative methods of achieving good performance.

One approach is to do what Intel CPUs do and break individual CPU instructions into smaller units (Intel calls them micro-ops), which I'm going to call stack-ops. The idea is that a single RISC-V instruction is decoded into a packet of MISC (minimal instruction set computer) instructions. That packet is then interpreted sequentially, one stack-op per clock cycle.

For example,

addi    x1, x0, 1

would be decoded into:

GETREG(0) LIT(1) ADD SETREG(1) NEXT

This would take 5 clock cycles to complete, one per stack-op, which matches the latency of an instruction passing through a 5-stage pipeline:

GETREG(0) pushes 0 onto the evaluation stack (since x0 is hardwired to 0).
LIT(1) pushes 1 onto the evaluation stack.
ADD pops the two numbers off the evaluation stack and pushes their sum.
SETREG(1) pops the sum off the evaluation stack and stores it into x1.
NEXT pops the next instruction word off the pre-decoded instruction queue and restarts evaluation.
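
The steps above can be sketched as a small behavioral model in Python rather than Verilog. This is only an illustration under assumptions of my own: the tuple representation of a stack-op, the function names decode_addi and run, and the 64-bit register width (XLEN) are all hypothetical, and only ADDI is decoded. The bit positions follow the standard RISC-V I-type layout.

```python
from collections import deque

XLEN = 64                      # assumed register width; not stated in the log
MASK = (1 << XLEN) - 1

def decode_addi(word):
    """Decode a 32-bit RISC-V ADDI word into a packet of stack-ops."""
    opcode = word & 0x7F
    funct3 = (word >> 12) & 0x7
    if opcode != 0x13 or funct3 != 0:
        raise ValueError("this sketch only decodes ADDI")
    rd = (word >> 7) & 0x1F
    rs1 = (word >> 15) & 0x1F
    imm = (word >> 20) & 0xFFF
    if imm & 0x800:            # sign-extend the 12-bit immediate
        imm -= 0x1000
    return [("GETREG", rs1), ("LIT", imm), ("ADD", None),
            ("SETREG", rd), ("NEXT", None)]

def run(queue, regs):
    """Interpret queued packets, one stack-op per clock; return the cycle count."""
    stack = []
    cycles = 0
    packet = queue.popleft()
    index = 0
    while True:
        op, arg = packet[index]
        index += 1
        cycles += 1
        if op == "GETREG":
            stack.append(regs[arg])
        elif op == "LIT":
            stack.append(arg & MASK)
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append((a + b) & MASK)
        elif op == "SETREG":
            value = stack.pop()
            if arg != 0:       # x0 stays hardwired to 0
                regs[arg] = value
        elif op == "NEXT":
            if not queue:      # nothing left to fetch; stop the model
                return cycles
            packet = queue.popleft()
            index = 0

regs = [0] * 32
queue = deque([decode_addi(0x00100093)])   # addi x1, x0, 1
cycles = run(queue, regs)
```

With the single packet above, the model spends one cycle on each of the five stack-ops and leaves 1 in x1.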

To restore the desirable characteristic of 1 CPI, I'd need five such execution units, and an instruction queue at least 5 deep to keep all five of them busy. I'd need to coordinate access to the register file, of course, so as to avoid multiple concurrent writes. The register file would also need five read ports to satisfy all five executors concurrently; otherwise, executors would have to block on register file access.
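
To get a rough feel for the throughput claim, here is a toy cycle-count model in Python. It rests on simplifying assumptions that are mine, not part of the design: the queue hands at most one packet per clock to an idle executor, every packet takes exactly five cycles, and data hazards and register-file port conflicts are ignored entirely. The function name simulate and its parameters are hypothetical.

```python
def simulate(num_instructions, num_executors, ops_per_packet=5):
    """Return the clocks needed to finish num_instructions packets."""
    busy = [0] * num_executors         # cycles of work left in each executor
    remaining = num_instructions
    cycles = 0
    while remaining > 0 or any(busy):
        # Assumption: the queue issues at most one packet per clock.
        if remaining > 0:
            for i in range(num_executors):
                if busy[i] == 0:
                    busy[i] = ops_per_packet
                    remaining -= 1
                    break
        # Advance every executor by one clock.
        busy = [max(0, b - 1) for b in busy]
        cycles += 1
    return cycles

five_wide = simulate(1000, 5)   # 1004 cycles, so CPI is about 1.004
one_wide = simulate(1000, 1)    # 5000 cycles, so CPI is 5
```

Under these assumptions, five executors approach 1 CPI once the initial fill is amortized, while a single executor stays at 5 CPI. Real hazards and port contention would only make the five-wide number worse.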

Whether this ends up being simpler or not requires additional study.

