-
CPU Work Continues...
07/07/2016 at 15:03
Over my vacation, I've managed to get a large portion of the CGIA components written, and have also started implementing the CPU pipeline stages. Per my recent log, I've decided on the classic Hennessy & Patterson "5-stage pipeline" design for my RISC-V implementation.
Work on that pipeline design was coming along quite nicely until I tried to implement the JAL/JALR instructions. The problem is that I didn't want a bunch of special-purpose buses running everywhere, since they use up valuable resources inside the FPGA. If these buses were only going to be used for one or two instructions out of a complete set of about 80, I didn't feel they were justified. However, that was the point of view of working through the pipeline from entry to exit (that is, designing the pipeline from instruction fetch to writeback).
Working in the opposite direction, though, is enlightening. Many of those special-purpose buses turn out to be required if you want proper pipeline behavior and want to retain the desirable characteristic of one-clock-per-instruction throughput. But it does something else which turns out to be more important, I think: it makes obvious which signals you actually need (versus which you think you need) from the previous pipeline stage(s). This actually simplifies the design of the pipeline as a whole. Whether or not it offsets the resource consumption of these weird, one-off buses, I couldn't tell you. My hunch is that it doesn't. But at least the operation should be correct, which at this stage of development is more important.
I also discovered why the Bcc and JAL/JALR instructions are so damn annoying to work on. Bcc instructions operate with four inputs, not your typical two. They are:
- Left-hand comparison register value
- Right-hand comparison register value
- The branch instruction's address
- The branch displacement (used only if the branch is taken)
Thankfully, nothing is stored in any destination register. But, the JAL/JALR instructions require six inputs, assuming we want to just reuse the conditional branch hardware for unconditional purposes (I treat JAL(R) instructions as special cases for BEQ):
- Left-hand comparison value (can be any value; doesn't matter)
- Right-hand comparison value (must be the same as left-hand)
- The jump instruction's address
- The jump displacement
- The destination register
- The constant "4", used to add to the instruction's address to calculate the return address to store in the destination
Yikes! Fixed adders, magnitude comparators, and the ALU all operating concurrently and asynchronously from each other, with the ALU and comparators requiring independent inputs from the decode stage, will conspire to make the execute stage the limiting factor in determining how fast the CPU can run, as well as the second-largest consumer of LUTs in the FPGA (the register file will be the largest by far).
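To keep all of this straight in my head, here's a minimal sketch of how those inputs might feed the execute stage's branch/jump resolution. This is an illustration only: the module and signal names are mine, not the actual Kestrel RTL, and JALR would additionally need its target base muxed in from rs1 rather than the PC.

// Sketch only: hypothetical branch/jump resolution in the execute stage.
// All names are mine, not the project's RTL.
module ex_branch_jump #(
    parameter XLEN = 64
)(
    input  wire [XLEN-1:0] lhs,           // left-hand comparison value
    input  wire [XLEN-1:0] rhs,           // right-hand comparison value
    input  wire [XLEN-1:0] pc,            // branch/jump instruction's address
    input  wire [XLEN-1:0] disp,          // sign-extended displacement
    input  wire            unconditional, // 1 for JAL/JALR, which always "take"
    output wire [XLEN-1:0] target,        // next fetch address if taken
    output wire [XLEN-1:0] link,          // PC+4, written to the destination register
    output wire            taken
);
    // Dedicated adders running concurrently with the ALU and the comparator.
    assign target = pc + disp;
    assign link   = pc + 4;
    assign taken  = unconditional | (lhs == rhs);   // BEQ-style comparison
endmodule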
So far, I have the instruction fetch, a prototypical decode, and the writeback stages written. I'm working on the memory stage now, and will hopefully get these designs to meet in the middle once I start working on the execute stage. However, requirements will back-propagate: the needs of subsequent stages will dictate the needs of previous stages, and that implies I could end up rewriting some of the stages I've already written.
-
CPU Update: Sticking with 5-stage Pipeline
07/01/2016 at 00:20
After some consideration, some calculations, and a few thought experiments, I've decided to just stick with a 5-stage pipeline design. Decoding RISC-V into MISC instructions is certainly possible, and might even use comparable numbers of LUTs and DFFs; however, it would require a 5-read/5-write register file design, which is more complex than your usual 2-read/1-write design for single-issue, in-order pipelines.
Oh well. It was worth some thought!
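For comparison, the conventional 2-read/1-write register file that the 5-stage design assumes looks roughly like the sketch below. Names, widths, and parameters are my own assumptions, not the project's actual RTL:

// Sketch only: a conventional 2-read/1-write register file for a
// single-issue, in-order pipeline. X0 always reads as zero.
module regfile_2r1w #(
    parameter XLEN = 64
)(
    input  wire            clk,
    input  wire [4:0]      ra1, ra2,  // read addresses (rs1, rs2)
    output wire [XLEN-1:0] rd1, rd2,  // read data
    input  wire            we,        // write enable from the writeback stage
    input  wire [4:0]      wa,        // write address (rd)
    input  wire [XLEN-1:0] wd         // write data
);
    reg [XLEN-1:0] regs [31:0];

    // X0 is hardwired to zero; everything else reads the array.
    assign rd1 = (ra1 == 5'd0) ? {XLEN{1'b0}} : regs[ra1];
    assign rd2 = (ra2 == 5'd0) ? {XLEN{1'b0}} : regs[ra2];

    // Writes to X0 are silently dropped.
    always @(posedge clk)
        if (we && wa != 5'd0)
            regs[wa] <= wd;
endmodule

The stack-op approach would need something like five copies of each read port, which is where the LUT and routing cost piles up.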
That said, on the flight home last night, I managed to get a significant portion of the RV32I instruction set decoded in the "decode stage" of a 5-stage pipeline. This includes conditional branches, loads, stores, and register- and immediate-type ALU operations. Nothing else is decoded yet, and illegal instructions are not detected.
The operating assumption is that the "execute" stage is just the ALU, and the "writeback" stage is just the register file. I have simple implementations for these stages already, as a part of an earlier design attempt at making a 5-stage pipe.
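To give a flavor of what that decode stage does, here's a minimal sketch of the RV32I field extraction involved. The module, port, and wire names are my own; this is an illustration, not the project's actual RTL:

// Sketch only: RV32I field extraction for a decode stage.
module rv32_fields (
    input  wire [31:0] insn,
    output wire [6:0]  opcode,
    output wire [4:0]  rd, rs1, rs2,
    output wire [2:0]  funct3,
    output wire [6:0]  funct7,
    output wire [31:0] imm_i, imm_s, imm_b
);
    // Register and function fields sit at fixed bit positions.
    assign opcode = insn[6:0];
    assign rd     = insn[11:7];
    assign funct3 = insn[14:12];
    assign rs1    = insn[19:15];
    assign rs2    = insn[24:20];
    assign funct7 = insn[31:25];

    // Sign-extended immediates for the formats decoded so far.
    assign imm_i  = {{20{insn[31]}}, insn[31:20]};              // loads, ALU-immediate
    assign imm_s  = {{20{insn[31]}}, insn[31:25], insn[11:7]};  // stores
    assign imm_b  = {{19{insn[31]}}, insn[31], insn[7],
                     insn[30:25], insn[11:8], 1'b0};            // conditional branches
endmodule

Everything falls out of fixed bit positions; only the immediates need per-format reassembly.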
Things yet to get working include, but are not limited to, the following:
- The entire Memory stage. This couples the CPU to the data memory, and stalls the pipeline if the addressed slave isn't fast enough to keep up.
- Exceptions and interrupts. This includes external interrupts as well as the ability to execute ECALL, EBREAK, and illegal instructions.
- RV64I extensions.
- Extending the ALU's operation to support SEQ, SNE, SGE, and SGEU comparison operations.
- LUI, AUIPC, JAL, JALR, FENCE, CSRRW, CSRRWI, CSRRS, CSRRSI, CSRRC, CSRRCI, WFI, and ERET instructions.
- Result forwarding to reduce unnecessary pipeline bubbles.
- Pipeline stall logic.
- Pipeline flush logic to support FENCE and conditional branches.
This seems like an awful lot of work, and it is admittedly non-trivial. However, only a few items on this list scare me. Most of this work seems like it ought to be easy to get working.
-
Decode Instructions into StackOps?
06/28/2016 at 05:24
I'm hacking on Verilog for the CPU, and trying to implement a simple, five-stage pipeline (fetch, decode, execute, memory, and write-back). However, this is proving significantly more complex than I'd anticipated. It seems like complexity just keeps going up and up and up and up.
I'm trying to think of alternative methods of achieving good performance.
One approach I'm considering is to do what Intel CPUs do: break individual CPU instructions into smaller units, which I'm going to call stack-ops. The idea is that a single RISC-V instruction is decoded into a MISC instruction packet. That packet is then interpreted sequentially.
For example,
addi x1, x0, 1
would be decoded into:
GETREG(0) LIT(1) ADD SETREG(1) NEXT
This would take 5 clock cycles to complete, just like it would with a 5-stage pipeline:
- GETREG(0) pushes 0 onto the evaluation stack (since X0 is hardwired 0).
- LIT(1) pushes 1 onto the evaluation stack.
- ADD pops the top two numbers off the evaluation stack and pushes their sum.
- SETREG(1) pops the evaluation stack and stores the sum into X1.
- NEXT pops the next instruction word off the pre-decoded instruction queue and restarts evaluation.
To restore the desirable characteristic of 1 CPI, you'd need five such execution units, and an instruction queue at least 5 deep to keep all five busy. I'd need to coordinate access to the register file (so as to avoid multiple concurrent writes), of course. The register file would also need five read ports (to satisfy all five executors concurrently; otherwise, we'd block on register file access).
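To make the idea concrete, here is a minimal sketch of what one such execution unit might look like, assuming a two-deep evaluation stack. The opcode encodings, names, and widths are all my own assumptions, and the register-file addressing and NEXT sequencing are left out:

// Sketch only: one stack-op execution unit with a two-deep evaluation stack.
// Opcode encodings, widths, and names are hypothetical.
module stackop_unit #(
    parameter XLEN = 64
)(
    input  wire            clk,
    input  wire [2:0]      op,        // which stack-op to execute this cycle
    input  wire [XLEN-1:0] operand,   // literal value for LIT; don't-care otherwise
    input  wire [XLEN-1:0] regfile_q, // register value read for GETREG
    output reg  [XLEN-1:0] tos, nos,  // top / next of the evaluation stack
    output reg             wr_en,     // request a register-file write...
    output reg  [XLEN-1:0] wr_data    // ...of this value (for SETREG)
);
    localparam GETREG = 3'd0, LIT = 3'd1, ADD = 3'd2, SETREG = 3'd3, NEXT = 3'd4;

    always @(posedge clk) begin
        wr_en <= 1'b0;                                             // default: no write
        case (op)
            GETREG: begin nos <= tos; tos <= regfile_q;           end // push register value
            LIT:    begin nos <= tos; tos <= operand;             end // push literal
            ADD:    begin tos <= tos + nos;                       end // pop two, push sum
            SETREG: begin wr_en <= 1'b1; wr_data <= tos; tos <= nos; end // pop into rd
            default: ;                                                 // NEXT handled by the sequencer
        endcase
    end
endmodule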
Whether this ends up being simpler or not requires additional study.
-
Possible CPU Development Plan
06/27/2016 at 14:20
One of the simplest practical processors I know of is my own S16X4, built on the Steamer-16 architecture by Myron Plichota. (I hope I spelled that right. Typing from memory here.) RISC-V offers a substantially more complex ISA to decode. In the interests of agile methodology, I should, theoretically, always have a shippable product. Considering the scope of the project, how do I reconcile continuous testing with having a minimum viable processor?
I'm thinking of just replicating the 12 instructions of the original S16X4; the idea is that this instruction set defines the minimum viable RISC-V subset. To recap the S16X4 instructions:
- NOP
- LIT (loads a word-width literal into the data stack)
- FWM (fetch word from memory)
- SWM (store word to memory)
- ADD
- AND
- XOR
- ZGO (branch if top of stack is zero)
- GO (unconditional branch)
- NZGO (branch if top of stack is not zero)
- FBM (fetch byte from memory)
- SBM (store byte to memory)
The closest analogs in the RISC-V are as follows:
NOP --> ADDI X0, X0, 0, so we need to support at least the ADDI instruction.
LIT --> ADDI Xn, X0, imm12, possibly followed by further SLLI Xn, Xn, shamt / ADDI Xn, Xn, imm12 pairs as appropriate to load the full register width with the desired data. So, besides ADDI, we also need SLLI.
FWM/FBM --> LBU, LHU, LWU, LD. These are simple enough to implement as-is, since the funct3 bits of the instruction can be exposed directly on the external bus to determine the transfer size.
SWM/SBM --> SB, SH, SW, SD.
ADD, AND, XOR --> These involve register-register operations, so these three instructions translate naturally to their RISC-V counterparts. And, honestly, their immediate versions are easy enough that we might as well implement ANDI and XORI too. In fact, XORI will be required if we want to perform two's complement negation on a register.
ZGO, NZGO, GO --> BEQ, BNE, JAL. Easy peasy.
And that's it, I think. So the absolute minimum RISC-V instruction subset to support includes:
- ADDI
- SLLI
- ADD
- AND
- XOR
- LBU
- LHU
- LWU
- LD
- SB
- SH
- SW
- SD
- BEQ
- BNE
- JAL
Wow, not bad! Only 16 instructions! Supporting more instructions can be done by fleshing out the rest of the instruction decoder and/or pipeline incrementally from here. The only instruction format not supported is U-format, used only by LUI and AUIPC; however, given the rest of the implementation already works, these are pretty trivial instructions to add.
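For my own notes, the major-opcode groupings for that subset would look something like the sketch below (the module and wire names are mine, not actual project RTL):

// Sketch only: major-opcode decode for the minimal instruction subset.
module subset_decode (
    input  wire [31:0] insn,
    output wire is_op_imm, is_op, is_load, is_store, is_branch, is_jal
);
    wire [6:0] opcode = insn[6:0];

    assign is_op_imm = (opcode == 7'b0010011);  // ADDI, SLLI (plus ANDI, XORI)
    assign is_op     = (opcode == 7'b0110011);  // ADD, AND, XOR
    assign is_load   = (opcode == 7'b0000011);  // LBU, LHU, LWU, LD
    assign is_store  = (opcode == 7'b0100011);  // SB, SH, SW, SD
    assign is_branch = (opcode == 7'b1100011);  // BEQ, BNE
    assign is_jal    = (opcode == 7'b1101111);  // JAL
endmodule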
Things not covered are CSRRW, CSRRS, CSRRC, ECALL, EBREAK, ERET, WFI, and FENCE instructions. With the exception of ECALL, EBREAK, and ERET, these instructions are actually somewhat more complicated to implement anyway, and should be implemented only after we have a minimal instruction set to debug with. I think.
-
Software-Managed Caching for Future Kestrel-3??
06/17/2016 at 21:28
The current vision for the Kestrel-3 targets a 6 MIPS processor, primarily because it'll be fetching instructions over a 12.5MHz 16-bit bus and will not have any cache hardware on-board. This bottleneck exists because it (for now at least) makes for a very simple implementation. But, eventually, I'd like to drive the CPU to faster speeds. Ideally, to 50MHz.
Presently, the CPU addresses all of external memory physically. That is, if I tell the CPU to read a byte from $0E00000000000001, it will read from $0E00000000000001 on the external bus. This will get chopped down to $0E000001 on the Backbone bus (since I only expose 32 bits of address space there), and from there, external circuitry will respond to this transaction in the usual way you're familiar with from Z-80 or 6502 circuits. No surprises there.
However, this also means things are pretty slow. 32-bit accesses need two bus transactions at a minimum, so software will run fastest if everything can be kept to 8 or 16 bits. Even then, you're incurring wait states like crazy, since all RISC-V instructions are 32 bits wide. I will need to fetch instructions and load and store data from an internal scratchpad RAM if I want to run faster than 12.5MHz with no wait-states. The problem is, the FPGA I am planning on purchasing has only 10KiB of internal block memory. Thus, some mechanism is required to map an address like $0E00000000000001 to something narrower, like $0C81.
Traditionally, this is done with a memory management unit (MMU), most frequently using a technique called paging. General-purpose computing architectures today all seem to agree on 4KiB pages. Indeed, the MMU specified in v1.7 of the RISC-V supervisor specification currently sets the page size to 4KiB. Obviously, with such a small amount of physically addressable memory visible to the CPU, we'd want something smaller; hence caches use 16-, 32-, or 64-byte "cache lines." Besides, burst transactions to external RAM are much, much faster than waiting for blocks of data from a hard drive, so these smaller transfer sizes do not cost that much. However, these block sizes all assume a hardware cache controller; the only reason I'd ever consider 128- or 256-byte transfer sizes is to help amortize the additional overhead of emulating a cache controller in software.
The Lattice iCE40HX4K FPGA comes with 10KiB of RAM on-board, which isn't much; however, it seems pretty useful as local cache memory (the 68020 only had 256 bytes back in the day, and it made a measurable difference!). However, hardware cache controllers are insane to get right, and the off-the-shelf cores I've seen seem to use up a ton of logic. Maybe I'm looking in the wrong places; but, with LUTs at a premium on the iCE40 FPGAs, anything I can do to avoid a massive hardware investment is of interest to me.
So, I'm thinking of implementing caching in software by using a fine-grained memory protection unit, something that protects memory down to, I dunno, say 64 to 256 bytes.
Here's how I see things working.
First, I'd expose the 10KiB block of memory as another peripheral in the Kestrel's I/O allotment, along with a set of control registers. Further, it would only be accessible when the CPU is running in machine-mode. When the computer first boots, or whenever the CPU is running in machine-mode, the MMU is turned off (as you'd expect), meaning the CPU will encounter a ton of wait-states as it attempts to fetch instructions from external memory. I expect the CPU to maintain close to 6 MIPS performance during this cold-boot phase.
One of the steps taken by the system firmware would be to initialize the "line" fault handler. Once this is done, the MMU is enabled by de-escalating to supervisor mode.
At this point, the next instruction fetch will cause a cache line miss (since only the cache line handler has been initialized, not the actual cache state). The handler temporarily takes over (running in machine-mode), looks at the fault address, and uses this information to load a base address and length into a DMA engine. The DMA engine then reads the required cache line into local memory. Meanwhile, the CPU can update its metadata: access and dirty bits, cache line mapping registers, and so on. When the metadata update and the DMA have both completed (ideally concurrently), we return from the exception handler. Back in supervisor- or user-mode, the processor state is now set up so that the faulting instruction restarts, and it should proceed without issue and at maximum bandwidth.
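One possible shape for the hardware side of that check is sketched below: a simple valid+tag comparison for the local line an access maps to, with a miss raising the "line fault" trap. The line size, field widths, and names here are my own assumptions, not a design decision:

// Sketch only: per-access hit check for a software-managed cache with
// 128-byte lines and 64 lines (8 KiB) of local RAM. A miss would raise
// the line-fault trap handled by machine-mode software.
module line_hit #(
    parameter LINE_BITS = 7,   // 128-byte lines
    parameter IDX_BITS  = 6    // 64 lines
)(
    input  wire [63:0]         vaddr,    // address the CPU is accessing
    input  wire [63:0]         tag_q,    // tag RAM word for vaddr's line index
    input  wire                valid_q,  // valid bit for that line
    output wire [IDX_BITS-1:0] index,    // which local line to consult
    output wire                hit       // 0 => trap to the line fault handler
);
    assign index = vaddr[LINE_BITS+IDX_BITS-1:LINE_BITS];
    assign hit   = valid_q &&
                   (tag_q[63:LINE_BITS+IDX_BITS] == vaddr[63:LINE_BITS+IDX_BITS]);
endmodule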
With 10KiB at my disposal, I can actually go one step further and pre-load 2KiB of the system firmware into the cache and "lock" it. This would make exception handling much faster, and would leave 8KiB for "normal" cache purposes. Remembering that the 68040 had a total of 8KiB of cache (albeit split into separate 4K chunks for data and code), it seems reasonable that this arrangement would bring the CPU to levels of performance on par with the MC68040. According to Wikipedia, I can reasonably expect the CPU to execute about 100 instructions on average before incurring a cache miss, so if the handler overhead plus the time taken to execute 100 instructions from a hot cache comes in under about 16.7 microseconds (100 instructions at a 6 MIPS rate), it's a net win. I would be happy with this outcome if it meant I could get by with a reduced hardware investment.
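Stated as a break-even condition (my own back-of-the-envelope framing of the figures above):

$$ t_{\text{handler}} + t_{\text{100 hot}} \;<\; \frac{100\ \text{instructions}}{6\ \text{MIPS}} \approx 16.7\ \mu\text{s} $$

If the left-hand side stays under that bound, the software-managed cache beats simply running uncached at 6 MIPS.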
I figure this would be a worthy, and low-cost, experiment to play with. It's premature to try now, of course, but I figure this would be a project for much later down the road.
-
Blog Article on CGIA Now Live
06/16/2016 at 21:38
I posted a long-form blog article on my "actual" Kestrel blog (I'm long overdue for an article there anyway), specifically on the topic of how the CGIA is going to work under the hood. If you're interested, you can read it here.
-
CGIA Feeder Circuit Written
06/16/2016 at 16:53
The feeder circuit for the CGIA shifter has been written, and so far as I can tell, it seems to work well. Of course, I will need to verify that the line buffers, the feeder, the shifter, and a degenerate CRTC all work together. Maybe I'll work on that next.
-
CGIA Dot Shifter Verilog Almost Done.
06/15/2016 at 15:11
Since I'm still operating in a different timezone from everyone else around me, I decided to use the three-hour sleep schedule difference to work some more on the CGIA.
First, I decided to split the CGIA into its own repository. When it comes time to integrate it into an actual chip, I'll just refer to it by git submodule. I know that it's a bit of a burden to do so, but this buys me two benefits: first, it makes it more reusable to people in general, and second, it lets me package the core up into something that the FuseSoC package manager can use. Note: as I write this article, I'm doing all my work in my personal fork, so don't expect to find much in the official repo yet. I'll land my WIP as soon as I'm done with the dot shifter implementation.
Which brings me to the second point, where I'm happy to say that I've implemented a video shift register capable of shifting by 1, 2, 4, or 8 bits at a time. This register is a hard requirement to support 2, 4, 16, or 256 color depths. Note that these are "chunky" video modes, not planar.
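For the curious, the behavior boils down to something like the sketch below. This is a simplified illustration with my own names and a 16-bit register width (matching the external bus width); it is not the actual CGIA source:

// Sketch only: a video shift register that shifts by 1, 2, 4, or 8 bits
// per dot clock, supporting 2/4/16/256-colour chunky modes.
module dot_shifter (
    input  wire        clk,
    input  wire        load,      // load a new 16-bit word from the feeder
    input  wire [15:0] d,
    input  wire [1:0]  depth,     // 0=1bpp, 1=2bpp, 2=4bpp, 3=8bpp
    output wire [7:0]  index      // colour palette index for the current dot
);
    reg [15:0] sr;

    always @(posedge clk) begin
        if (load)            sr <= d;
        else case (depth)
            2'd0: sr <= {sr[14:0], 1'b0};   // shift by 1
            2'd1: sr <= {sr[13:0], 2'b0};   // shift by 2
            2'd2: sr <= {sr[11:0], 4'b0};   // shift by 4
            2'd3: sr <= {sr[7:0],  8'b0};   // shift by 8
        endcase
    end

    // Expose only as many bits as the current colour depth calls for.
    assign index = (depth == 2'd0) ? {7'b0, sr[15]}    :
                   (depth == 2'd1) ? {6'b0, sr[15:14]} :
                   (depth == 2'd2) ? {4'b0, sr[15:12]} :
                                     sr[15:8];
endmodule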
Using the shift register is a module called the "shifter" (named after the Atari ST's video chip), which is responsible for taking the output of the shift register above and putting the appropriate color palette index out onto a "color bus." Though, I'm probably going to rename "color bus" to "index bus" to avoid confusion after inserting the palette registers.
Things that remain to be supported include an index XOR mask, which would let lower color depths make use of all 256 palette registers; also, I need to add logic to the shifter to actually read the contents of the line buffer. I think I'll call the shifter done-for-now after I finish these two things.
-
CGIA Fetcher Logic Written
06/14/2016 at 13:59
While on the flight yesterday, I wrote the Verilog implementing the Wishbone bus master interface for fetching a scanline's worth of data (the "video fetcher"). As of this writing, the cgia branch is not merged to master.
I'm thinking of moving it into its own repository though, especially since it can be useful for any Wishbone compatible system, and not just the Kestrel-3. I'll post a link here when that happens.
-
Backbone Backplane Boards are on order!
06/11/2016 at 23:30
I managed to fix the last hardware bugs that I could find, and so I placed an order for a set of three Backbone boards. The price was pretty reasonable too, considering prototype pricing.
Now to place the order for parts, and I can go on vacation with a clear slate.