Polaris CPU Needs Serious Redesign

A project log for Kestrel Computer Project

The Kestrel project is all about freedom of computing and the freedom of learning using a completely open hardware and software design.

Samuel A. Falvo II, 07/08/2016 at 06:49, 4 Comments

Tonight, I completed the memory stage of the 5-stage pipeline for the Redtail microarchitecture used by the Polaris CPU. I then decided to run it through Yosys to get a ballpark idea of how many logic cells it would use. Pleasantly, I found it needed only 165. Not bad!

Then I figured, "Let's look at the register file."

What I found was that the 32-deep, 64-bit wide, 2-read-1-write register file that you would typically find in just about any RISC consumes 5666 logic cells in an iCE40-class device. This immediately rules out the 4K FPGAs that I was planning on using for the Backbone implementation of the Kestrel-3. By my math, I was banking on it needing around 2500 cells. 5666 is just too much. I'm guessing it needs this many logic cells because the chip doesn't have sufficient routing resources, and Yosys had to resort to using LUTs as wires. Not only that, but for a chip capable of handling 500MHz clocks off its PLLs, the top speed that the register file could work at is only 42MHz (according to the estimate provided by the icetime command). This, alone, means that the Backbone bus clock will need to drop to 25MHz.

Ouch!!

I next figured, let's look at the ALU. This required a gob-smacking 1726 logic cells. Again, I didn't expect it to be ultra-tiny, but this is just too big. That's roughly half of the logic cells found in an iCE40HX-4K. And this is just the ALU; this doesn't even consider the magnitude comparators needed in the execute stage for conditional branching support.

Thankfully, instruction fetch takes only 65 logic cells, and decode needs only 212. At least I can be proud of something.
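To put those numbers side by side, here is a rough tally in Python. The cell counts are the ones quoted in this post; the 3520-cell capacity for the iCE40HX-4K is an assumption taken from the vendor's datasheet rather than from anything measured here, and the names `reported_cells` and `report` are just for illustration.

```python
# Logic-cell counts reported by Yosys in this log entry.
reported_cells = {
    "instruction fetch": 65,
    "decode": 212,
    "ALU": 1726,
    "memory stage": 165,
    "register file": 5666,
}

# Assumption: the iCE40HX-4K offers 3520 logic cells (datasheet figure,
# not stated in this post).
HX4K_LOGIC_CELLS = 3520

total = sum(reported_cells.values())


def report():
    """Return one line per block plus a total, each as a share of the device."""
    lines = []
    for name, cells in reported_cells.items():
        share = 100.0 * cells / HX4K_LOGIC_CELLS
        lines.append(f"{name:18s} {cells:5d} cells  {share:6.1f}% of device")
    total_share = 100.0 * total / HX4K_LOGIC_CELLS
    lines.append(f"{'total':18s} {total:5d} cells  {total_share:6.1f}% of device")
    return "\n".join(lines)


print(report())
```

The tally leaves out anything not yet synthesized, such as the magnitude comparators and the write-back stage, so the real total would be higher still.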

So, the writing is on the wall; if I want a CPU that fits in a 4K device, and I do, then my previous estimates about its performance are right out the window. Gone. Kaput.

I simply must architect it like the 6502, using hardwired state-machines or vertical microcode in place of a 5-stage pipeline. The register file must be single-ported to eliminate as many wires as I can (at the expense of two extra cycles per instruction: one to fetch an operand, and one to write the result back). The ALU and magnitude comparators have to be written at the gate and/or LUT4 level; I cannot trust the Verilog compiler or Yosys to produce the most efficient logic possible. Finally, the entire microarchitecture has to be tailored just for the 16-bit Backbone bus it'll connect to. This may mean I have to shrink the ALU to just 16 bits. Yuck.
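To make the cost of a 16-bit ALU concrete, here is a minimal Python sketch of how a 64-bit addition could be sequenced as four 16-bit steps with a carry passed between them. This is only a software model of the idea, not the Polaris design; `add16` and `add64_in_slices` are hypothetical names, and a real microsequencer would also spend cycles fetching operands and writing results back.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in):
    """One pass through a hypothetical 16-bit adder: returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add64_in_slices(x, y):
    """Add two 64-bit values as four 16-bit steps, lowest slice first."""
    result = 0
    carry = 0
    for step in range(4):
        shift = 16 * step
        # Each step handles one 16-bit slice and forwards its carry.
        s, carry = add16(x >> shift, y >> shift, carry)
        result |= s << shift
    return result, carry


value, carry_out = add64_in_slices(0x0000_FFFF_FFFF_FFFF, 1)
print(hex(value), carry_out)
```

Each loop iteration stands in for one trip through the narrow ALU, which is why a 64-bit operation on a 16-bit datapath costs at least four ALU cycles.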

I was really, really, really hoping I didn't have to go this route. With a CPU that delivers around 1.5 to 3 MIPS performance (thanks to the lack of a pipeline), the computer will feel about one quarter to one half as fast as a Commodore 64, which I think is unacceptable for a general purpose hacking computer. But, I don't know what else I can do. :( :( :( :( :(
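As a back-of-the-envelope check on those MIPS figures, the sketch below converts between instruction rate and clocks per instruction. It assumes the core would run at the 25MHz Backbone clock mentioned above, which this post does not actually state; the function names are hypothetical.

```python
# Assumption: the CPU core runs at the 25 MHz Backbone clock mentioned earlier.
CLOCK_HZ = 25_000_000


def cycles_per_instruction(mips, clock_hz=CLOCK_HZ):
    """Average clocks per instruction implied by a given MIPS rating."""
    return clock_hz / (mips * 1_000_000)


def mips_at(cpi, clock_hz=CLOCK_HZ):
    """MIPS rating implied by an average clocks-per-instruction figure."""
    return clock_hz / cpi / 1_000_000


for mips in (1.5, 3.0):
    cpi = cycles_per_instruction(mips)
    print(f"{mips} MIPS -> about {cpi:.1f} clocks per instruction")
```

Under that assumption, 1.5 to 3 MIPS corresponds to somewhere between roughly 8 and 17 clocks per instruction on average.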

Discussions

Will Long wrote 07/10/2016 at 17:00

For the register file, can you use the embedded block RAM on the iCE instead of logic cells? After glancing at the product guide entry for the iCE40HX-4K, it looks like it has 80 kbit of embedded memory. Maybe this document is helpful in using those modules (link below). I guess it's possible that Yosys might not support the EBR though.

http://www.latticesemi.com/view_document?document_id=47775

Samuel A. Falvo II wrote 07/10/2016 at 17:12

Yosys does not infer the EBR as far as I can see. You have to instantiate it manually.

The reason I did not want to use the EBR originally is that I wanted a 2-read-1-write port memory for the register file; that really is the only way you can make a typical, 1 CPI pipeline. EBRs are only 1-read-1-write ported. Even worse, the read ports are synchronous only, adding to instruction latency.

The new design, which will be micro-sequenced, will use 16 kbit of EBR for the register file (since an EBR can only be 16 bits wide at most, I have to use four EBRs). This has some nasty implications: all instructions will take at least four cycles to execute, and control flow instructions will all take 8 cycles to complete. :(

Ed S wrote 07/09/2016 at 07:37

Well, it worked for the Z80 - a 4-bit ALU in a byte-wide machine!

Yann Guidon / YGDES wrote 07/09/2016 at 11:55

Or the MC68000, with an internal 16-bit datapath for a 32-bit architecture :-)
