Homebrew soft core

During the last week I tried to create my own RISC-V soft core for this contest. The contest is now over, and I was not able to meet the minimum requirements (passing 100% of the RV32I compliance tests and being able to run the Zephyr RTOS), but I got really close: my current version passes 54 out of 55 tests (through Verilator), including misaligned load/store exceptions and a number of control and status registers with atomic reading/writing (all of that takes 3.4K LUTs). See https://gitlab.com/shaos/retro-v

For now I have decided that exceptions and extra registers are overkill for this particular project, so I rolled back a little and stayed with a straightforward RISC-V implementation (only 2.2K LUTs of the iCE40UP5K FPGA plus some BRAMs for the 32 registers) that covers most user-level instructions and passes the most relevant RV32I tests:

Check         I-ADD-01 ... OK
Check        I-ADDI-01 ... OK
Check         I-AND-01 ... OK
Check        I-ANDI-01 ... OK
Check       I-AUIPC-01 ... OK
Check         I-BEQ-01 ... OK
Check         I-BGE-01 ... OK
Check        I-BGEU-01 ... OK
Check         I-BLT-01 ... OK
Check        I-BLTU-01 ... OK
Check         I-BNE-01 ... OK
Check       I-CSRRC-01 ... FAIL
Check      I-CSRRCI-01 ... FAIL
Check       I-CSRRS-01 ... FAIL
Check      I-CSRRSI-01 ... FAIL
Check       I-CSRRW-01 ... FAIL
Check      I-CSRRWI-01 ... FAIL
Check I-DELAY_SLOTS-01 ... OK
Check      I-EBREAK-01 ... FAIL
Check       I-ECALL-01 ... FAIL
Check   I-ENDIANESS-01 ... OK
Check     I-FENCE.I-01 ... OK
Check             I-IO ... OK
Check         I-JAL-01 ... OK
Check        I-JALR-01 ... OK
Check          I-LB-01 ... OK
Check         I-LBU-01 ... OK
Check          I-LH-01 ... OK
Check         I-LHU-01 ... OK
Check         I-LUI-01 ... OK
Check          I-LW-01 ... OK
Check I-MISALIGN_JMP-01 ... FAIL
Check I-MISALIGN_LDST-01 ... FAIL
Check         I-NOP-01 ... OK
Check          I-OR-01 ... OK
Check         I-ORI-01 ... OK
Check     I-RF_size-01 ... OK
Check    I-RF_width-01 ... OK
Check       I-RF_x0-01 ... OK
Check          I-SB-01 ... OK
Check          I-SH-01 ... OK
Check         I-SLL-01 ... OK
Check        I-SLLI-01 ... OK
Check         I-SLT-01 ... OK
Check        I-SLTI-01 ... OK
Check       I-SLTIU-01 ... OK
Check        I-SLTU-01 ... OK
Check         I-SRA-01 ... OK
Check        I-SRAI-01 ... OK
Check         I-SRL-01 ... OK
Check        I-SRLI-01 ... OK
Check         I-SUB-01 ... OK
Check          I-SW-01 ... OK
Check         I-XOR-01 ... OK
Check        I-XORI-01 ... OK
--------------------------------
FAIL: 10/55

About the actual design: my idea was to get a standalone 32-bit CPU kind of thing (a small FPGA board with the soft core flashed in) that uses EXTERNAL memory with an 8-bit data bus, so that it looks somewhat RETRO but has GCC support. An 8-bit data bus means that every 32-bit instruction is loaded in at least 4 steps, and I figured out how to decode and execute those instructions at the same time as they are being loaded. I called this design Retro-V, and its version number is 1.0.0. Now more details.

The Retro-V soft core has a 2-stage pipeline (or, more precisely, a 1.5-stage pipeline ;) with 4 cycles per stage, so on average every instruction takes 4 cycles (with a 40 MHz clock that is 10 million instructions per second max):

Cycle 1 - Fetch 1st byte of the instruction (lowest one that actually has opcode in it)
Cycle 2 - Fetch 2nd byte of the instruction, determine destination register (rd) and check if instruction is valid
Cycle 3 - Fetch 3rd byte of the instruction, read 1st argument from register file (if needed)
Cycle 4 - Fetch 4th byte of the instruction (highest one), read 2nd argument from register file (if needed), decode immediate value (if needed)
Cycle 5 (overlaps with Cycle 1 of the next instruction) - Execute complete instruction (with optional write back in case of branching)
Cycle 6 (overlaps with Cycle 2 of the next instruction) - Write back to register file if destination register is not x0 (that is always 0)
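
The fetch order above works because of where the fields sit in a little-endian RV32I instruction word: the opcode is entirely in the 1st byte, rd is complete once the 2nd byte arrives, rs1 once the 3rd arrives, and rs2 only with the 4th. Here is a minimal Python sketch of that idea. It is only an illustration, not the actual Retro-V Verilog; the function name `decode_as_bytes_arrive` is made up for this example, and the bit positions are the standard RV32I R-type ones.

```python
def decode_as_bytes_arrive(instr_bytes):
    """Return a list of (cycle, fields known so far) for one instruction.

    instr_bytes: the 4 bytes of an RV32I instruction, lowest byte first.
    Hypothetical helper for illustration only.
    """
    word = 0
    known = {}
    steps = []
    for cycle, byte in enumerate(instr_bytes, start=1):
        # Each fetch cycle adds 8 more bits on top of what we already have.
        word |= byte << (8 * (cycle - 1))
        if cycle == 1:
            known["opcode"] = word & 0x7F          # bits 6:0
        elif cycle == 2:
            known["rd"] = (word >> 7) & 0x1F       # bits 11:7
            known["funct3"] = (word >> 12) & 0x7   # bits 14:12
        elif cycle == 3:
            known["rs1"] = (word >> 15) & 0x1F     # bits 19:15
        elif cycle == 4:
            known["rs2"] = (word >> 20) & 0x1F     # bits 24:20
            known["funct7"] = (word >> 25) & 0x7F  # bits 31:25
        steps.append((cycle, dict(known)))
    return steps


if __name__ == "__main__":
    # ADD x3, x1, x2 encodes as 0x002081B3, stored in memory as B3 81 20 00.
    for cycle, fields in decode_as_bytes_arrive(bytes.fromhex("b3812000")):
        print(cycle, fields)
```

The index of rs1 is known in cycle 3 and the index of rs2 in cycle 4, which lines up with the cycles in which the two register file reads are listed above.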

As you can see, the Retro-V core reads from the register file in cycles 3 and 4 and writes to it in cycles 1 and 2 (which are cycles 5 and 6 of the 2nd pipeline stage). Because reads and writes always happen at different moments in time, the register file can be implemented with block memory inside the FPGA. This design also has no hazard problem when one instruction writes a register and the next one reads it: the next instruction reads its 1st argument in cycle 3, and the write back from the previous instruction has already happened one cycle earlier.

In case of a jump (JAL, JALR or BRANCH instructions), the next instruction in the pipeline has already performed its 1st cycle, so it stops right there and the next cycle is the 1st one from the new address, effectively re-initializing the pipeline (so the branch penalty is only 1 cycle).

In case of memory access (LOAD or STORE instructions), the state machine stays in cycle 4 for a while (loading or storing bytes from/to memory one by one, which costs from 1 to 5 extra cycles), and the next instruction in the pipeline is frozen between cycle 1 and cycle 2 during that time.
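
To double-check the timing argument, here is a small Python sketch that puts back-to-back instructions on an absolute cycle axis. It assumes every instruction is a plain 4-cycle one (no jumps, loads or stores), and the `schedule` helper is a hypothetical name used only for this illustration.

```python
def schedule(n, visible=4):
    """Absolute cycle numbers (starting at 1) for the n-th instruction in a row.

    Assumes every instruction takes `visible` cycles before the next one
    starts, as in the 4-cycle case described above. Hypothetical helper.
    """
    start = 1 + n * visible              # this instruction's cycle 1
    return {
        "reg_read": (start + 2, start + 3),  # cycles 3 and 4
        "execute": start + 4,                # cycle 5
        "write_back": start + 5,             # cycle 6
    }


if __name__ == "__main__":
    prev, nxt = schedule(0), schedule(1)
    print("previous instruction writes back in cycle", prev["write_back"])
    print("next instruction reads registers in cycles", nxt["reg_read"])
```

Under this assumption the previous instruction writes back in absolute cycle 6 and the next one reads in cycles 7 and 8, so the write always lands before the read and the register file never sees a read and a write in the same cycle.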

If we count only "visible" cycles (from the beginning of one instruction to the beginning of the next one), then:

JAL/JALR take 5 cycles always (because of jump)
BEQ/BNE/BLT/BGE/BLTU/BGEU take 4 cycles if condition is false (no jump) or 5 cycles if true
LB/LBU take 5 cycles (because of 1 extra cycle to read 1 byte from memory)
LH/LHU take 6 cycles (because of 2 extra cycles to read 2 bytes from memory)
LW takes 8 cycles (because of 4 extra cycles to read 4 bytes from memory)
SB takes 6 cycles (because of 1 extra cycle to write 1 byte to memory and 1 preparatory cycle)
SH takes 7 cycles (because of 2 extra cycles to write 2 bytes to memory and 1 preparatory cycle)
SW takes 9 cycles (because of 4 extra cycles to write 4 bytes to memory and 1 preparatory cycle)
Everything else takes 4 cycles (plus 2 hidden cycles on the 2nd stage of the pipeline)
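
The list above can be turned into a tiny cycle estimator. The Python sketch below just encodes those numbers; the names `visible_cycles` and `estimate_mips` and the example trace are made up for illustration, and the result ignores anything not covered by the list (for example, slower external memory).

```python
# Visible cycle counts taken from the list above; anything missing costs 4.
FIXED_CYCLES = {
    "JAL": 5, "JALR": 5,
    "LB": 5, "LBU": 5, "LH": 6, "LHU": 6, "LW": 8,
    "SB": 6, "SH": 7, "SW": 9,
}
BRANCHES = {"BEQ", "BNE", "BLT", "BGE", "BLTU", "BGEU"}


def visible_cycles(mnemonic, taken=False):
    """Visible cycles for one instruction; `taken` only matters for branches."""
    if mnemonic in BRANCHES:
        return 5 if taken else 4
    return FIXED_CYCLES.get(mnemonic, 4)


def estimate_mips(trace, clock_mhz=40.0):
    """Average millions of instructions per second for a list of
    (mnemonic, taken) pairs at the given clock frequency."""
    total = sum(visible_cycles(m, t) for m, t in trace)
    return clock_mhz * len(trace) / total


if __name__ == "__main__":
    # Hypothetical 4-instruction loop body: load, add, store, taken branch.
    trace = [("LW", False), ("ADD", False), ("SW", False), ("BNE", True)]
    print(sum(visible_cycles(m, t) for m, t in trace), "cycles")
    print(round(estimate_mips(trace), 2), "MIPS at 40 MHz")
```

For this made-up trace the total is 8 + 4 + 9 + 5 = 26 cycles for 4 instructions, so memory-heavy code runs noticeably below the 10 MIPS peak.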

The address bus is 16 bits wide (even though internally addresses are still 32-bit), so external memory can be up to 64 KB. Technically it's configurable, so if the FPGA has spare signal lines the address bus can be wider, up to all 32 bits.
