-
DX-Forth for Kestrel-2DX Now Prints Numbers
01/08/2018 at 17:55 • 0 comments
DX-Forth now prints numbers. Oh, you don't know about DX-Forth? You should probably read up on the progress of the Kestrel-2DX then.
-
Kestrel-2DX Update: TIM/V
09/24/2017 at 18:19 • 0 comments
In case you missed the good news, you might be interested in reading my recent announcement of TIM/V for the Kestrel-2DX.
-
Kestrel-2DX to be submitted for 2017 Hackaday Prize.
09/05/2017 at 03:00 • 0 comments
For grins, I've decided to place the Kestrel-2DX up for the 2017 Hackaday Prize, 5th category. To make this easier for Hackaday folks, I've created a new project dedicated to the Kestrel-2DX. I will continue logging my progress with it on that project page. So folks who are interested in the Kestrel, please do follow the Kestrel-2DX project page too! Thanks!
-
Kestrel-2DX Runs its First C Program
09/02/2017 at 03:56 • 4 comments
So, I decided to try getting my RISC-V GCC compiler working with my Kestrel-2DX again, and this time, for whatever reason, my mental model just "clicked" and things worked. It was an iterative process to get this far; I'll describe the steps in roughly the order I accomplished them.
Here's the C program I wanted to run:
/* Invert the contents of the video frame buffer on cold boot. */

unsigned long long *scrp, *endp;

void _start(void) {
    scrp = (unsigned long long *)0x10000;
    endp = scrp + 2000;   /* 640x200 pixels at 1 bpp = 16,000 bytes = 2,000 64-bit words */
    while(scrp < endp) {
        *scrp ^= 0xFFFFFFFFFFFFFFFF;
        scrp++;
    }
    while(1)
        ;
}
I was able to compile this into a statically-linked ELF binary with the following commands:
./riscv64-unknown-elf-gcc -O1 -c floop.c -o floop.o -march=RV64I
./riscv64-unknown-elf-ld floop.ld floop.o -o floop.exe -nostdlib
You'll notice that I have a custom linker script, which looks like this:
ENTRY(_start)

MEMORY {
    ROM (rx)  : ORIGIN = 0x00000, LENGTH = 0x8000
    RAM (rwx) : ORIGIN = 0x14000, LENGTH = 0x8000
}

SECTIONS {
    .text : {
        . = ALIGN(4);
        *(.text)
        *(.text*)
        . = ALIGN(4);
    } >ROM

    .rodata : {
        . = ALIGN(8);
        *(.rodata)
        *(.rodata*)
        . = ALIGN(8);
    } >ROM

    .data : {
        . = ALIGN(8);
        *(.data);
        *(.data*);
        . = ALIGN(8);
    } >RAM

    .bss : {
        . = ALIGN(8);
        *(.bss)
        *(.bss*)
        . = ALIGN(8);
    } >RAM
}
RAM technically starts at address 0x10000; however, the MGIA fetches its video frame from that location, so we configure the linker to place global data variables 16KB away.
Then, to pull out just the code and constant data, I used the following:
./riscv64-unknown-elf-objcopy -j .text -j .rodata floop.exe -O binary floop.bin
At this point, I have a raw binary image. A problem remains, however. The constant data precedes the code; thus, I cannot just have the processor reset directly into _start.
$ xxd floop.bin   # note data at offset $00, not $38 or something.
0000000: dec0 ad0b ea1d ad0b 1000 0000 0000 0000  ................
0000010: b707 0100 3747 0100 1307 07e8 83b6 0700  ....7G..........
0000020: 93c6 f6ff 23b0 d700 9387 8700 e398 e7fe  ....#...........
0000030: b747 0100 9387 07e8 3707 0100 2330 f700  .G......7...#0..
0000040: 6f00 0000                                o...
Thus, I need a bootstrap of some kind, and I need to place the C code somewhere away from address 0.
So, I write a simple bootstrap routine in raw assembly to scan ROM for a special "resident" structure (an idea I learned from coding directly on and for AmigaOS); nothing fancy, just something that would let me find the address of _start.
include "regs.i" addi a0, x0, $200 ; Start at address 512 L0: auipc a1, 0 ld a1, chkdword-L0(a1) lui a2, $8000 L1: ld a3, 0(a0) ; Did we find the checkword? beq a3, a1, yup addi a0, a0, 8 blt a0, a2, L1 addi a0, x0, -1 ; Deadlock with all LEDs lit if not found. lui a1, $20000 sh a0, 2(a1) jal x0, * yup: ld a3, 8(a0) ; Get startup procedure's *offset* add a3, a3, a0 ; Get startup procedure's *address* lui sp, $14000 ; Set up C stack pointer. jalr x0, 0(a3) ; Let there be C. align 8 chkdword: dword $0BAD1DEA0BADC0DE adv $8000, $CC
I then altered the C code to include the following at the very start:
struct Resident {
    unsigned long long r_matchWord;
    void (*r_fn)();
};

void _start(void);

const struct Resident R = { 0x0BAD1DEA0BADC0DE, &_start };

static unsigned long long *scrp, *endp;

// ...etc...
After recompiling as above, I now needed to embed the C code into the binary file my personal assembler produced.
dd if=floop.bin of=rom.bin bs=512 seek=1
Whoops, this has the effect of truncating the file; I have to re-pad it to 32KB before making the Verilog module with the ROM's contents.
dd if=/dev/zero of=rom.bin bs=1 count=1 seek=32767
There, I now have a completed 32KB image. I rebuild the Verilog ROM module:
make rtl/rom.v
then edit the resulting Verilog file, because Xilinx's flavor of Verilog won't accept the sane syntax that Yosys, Icarus, *and* Verilator all accept. One of these days, I'll fix my tooling to automate this.
Anyway, after that was done, I resynthesized the design, and lo and behold, I had a white video display! That means _start was discovered and dispatched to, and code generated by the C compiler was running! Woooo!!
-
Kestrel-2DX
08/31/2017 at 16:13 • 0 comments
For those wondering what I've been up to in my ever-so-copious amounts of spare time, I've been hacking anew on the Kestrel-2 using my Nexys-2 FPGA board. I didn't want to announce it until I was confident that I could "complete" (for some definition of complete) the build. I'm now at that stage where I'm confident I can complete it in a reasonable time period.
Specifications
- KCP53000 CPU at 25MHz. This provides an RV64I instruction set architecture for you to play with. (WORKS)
- MGIA provides 640x200 bitmapped, monochrome graphics. (WORKS)
- 32KiB of actual ROM. This means more space available for OS and programs loaded from SD card. (WORKS)
- 48KiB of block RAM. Two bits of the GPIA's output register now let you select which 16KiB page of memory to use as the MGIA frame buffer; see the sketch after this list. (NOTE: This, unfortunately, needs to drop to 24KiB if I were to port the design to the Terasic DE-1 board that I have. Its FPGA just doesn't have the block RAM resources that the Nexys-2 does. Sorry.) (WORKS)
- GPIA for, among other things, talking to the SD card, detecting VSYNC from the MGIA, and controlling the Nexys-2's 4-digit, 7-segment LEDs (though, if I can't afford it any more, this last feature is the first to go away). (WORKS)
- Two PMODs reserved for SD card interfaces (versus Kestrel-2's one). (Planned)
- KIA cores for using a PS/2 keyboard, and perhaps in a later release, the mouse. Strongly considering getting another PS/2 port for the board, so I can have both keyboard and PS/2-compatible mouse. (Planned)
- 16MB of "expansion RAM" allocated for experimentation with accessing external static, pseudo-synchronous, or synchronous RAM resources. (Honestly, it's really however much memory space you need; the "CPU" module only exposes a 25-bit address by default, but with some editing of the Verilog files, it can be widened as far as you need.)
Purpose/Mission
Frankly, to explore how to talk to external RAM chips reliably, thus opening up the opportunity to realize my ideal Kestrel-3 concept again. Nothing more than that; as such, I figure 32KB of ROM and 48KB of RAM ought to be way, way more than enough for my needs. Hence the DX part of the 2DX badge: it's a Developer Architecture.
Indeed, the Kestrel-2DX is the "test mule" I've always wanted, but was never able to get off the ground. Except for some reason, I'm now able to get it off the ground, and I'm frankly quite ecstatic about it. Maybe this is a sign of things to come with respect to the Kestrel-3.
Where is the Repository?
I'm maintaining development in a Fossil repository independent of my mainline Github account, on account of its still extremely experimental nature. I didn't want to get any hopes up by announcing, "Oh, hey, look what I'm doing!", only to get distracted with life, and have it fall into disarray, disuse, and eventually be removed in one of my famous fits of fury. Also, I find working with Fossil far easier and more productive than with the Git + Github combination.
When I'm happy with the result, I'll merge everything back into the official Github repository as an official version of the Kestrel-2 lineage.
Why Fossil?
Considering that Fossil is a single binary, compiled using little more than a plain-vanilla dialect of C and some POSIX libraries, I'm seriously thinking of moving all of my Kestrel-related material into Fossil instead of Git. My reasoning is as follows:
- To get Git working, you need to port a C compiler, then Perl, Python, and Bash, plus a litany of dependencies for each of those, and then the Git-specific dependencies on top. It's probably easier to port Linux and its userspace wholesale than it would be to port Git to a foreign, brand-new platform like the Kestrel. Meanwhile, the footprint needed to get Fossil ported to the Kestrel-3 remains daunting, but it's substantially smaller than Git's: a C compiler and a POSIX environment. Thus, it should be markedly easier to get Fossil working on a future Kestrel than it would be to port Git, which would directly enable continued Kestrel development on Kestrel platforms. Eat your own dogfood, as they say.
- I enjoy working with Fossil far more than I enjoy working with the Git + Github dichotomy. No, installing Gitlab or Gitorious will not help; the dichotomy remains even with these self-hosted options. Having the ticket system and wiki tightly integrated into the tool is wonderfully liberating to me.
- Fossil has auto-sync features built-in. With Git, I must constantly remember to push and pull, particularly between the forks that I maintain.
I could go on, but my other reasons for liking Fossil tend to be more contentious, as the bi-annual resurgence of Fossil-vs-Git articles on Reddit and elsewhere demonstrates.
-
Reminder: My Current Plan of Attack
08/20/2017 at 04:44 • 0 comments
This log is more for me than it is for you; still, you might find it somewhat informative. ;) I need to remember this for posterity, especially considering I'm taking forever to make real progress from an external point of view.
To reaffirm, my immediate goal is to make a circuit inside a Lattice iCE40HX8K-compatible FPGA that can:
- Accept addresses and data from my host PC, and populate static RAM and/or I/O registers accordingly.
- Accept addresses from my host PC and report back the contents of memory and/or I/O.
That literally is it. No Turing completeness. No asynchronous behavior of any kind. However, I do want it to do these things using RISC-V instructions. Here's how I intend to make this work.
The IPA
The Initial Program Adapter (IPA) is a core I've developed which causes the reading bus master to block until the next 16-bit halfword arrives over a serial interconnect. That bus master could be a Z-80, a 6502, a 68060, or your choice of RISC-V hardware. It exposes no addressable registers; rather, it's intended to sit across the address space where you'd normally place ROM.
This core is already done.
The "Processor"
The KCP53010 is not a self-standing processor at the moment. In fact, it's only now starting to resemble a proper five-stage pipeline. The pipeline is still naked; to feed it instructions, you must drive its control signals directly. The pipeline currently implements STORE instructions (8, 16, 32, and 64-bit), as well as all OP-IMM instructions. This is sufficient for populating RAM, but not for inspecting its contents. For example, to load a code or data image into static RAM, you might want to send the following instruction stream to the IPA:
ADDI X1, X0, 0      ; x1 points into RAM, where code image is to go.
ADDI X2, X0, aaa    ; x2 is first byte
SB   X2, 0(X1)
ADDI X2, X0, bbb    ; x2 is second byte
SB   X2, 1(X1)
ADDI X2, X0, ccc    ; If x2 happens to get valid halfword,
SH   X2, 2(X1)      ; then we can optimize and store it.
... etc ...
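To give a concrete idea of what the host side of this looks like, here's a small sketch of a C helper that encodes the ADDI/SB/SH words for such a stream. The instruction encodings follow the standard RISC-V base formats; how the resulting bytes get framed for the IPA's serial link is deliberately left out, since that detail isn't specified here.

/* Sketch: encode the RISC-V instructions used in a poke stream like the
 * one above.  Only the base I-type (ADDI) and S-type (SB/SH) formats
 * are needed. */
#include <stdint.h>
#include <stdio.h>

static uint32_t enc_addi(unsigned rd, unsigned rs1, int32_t imm) {
    /* imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011 */
    return ((uint32_t)(imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13;
}

static uint32_t enc_store(unsigned funct3, unsigned rs2, unsigned rs1, int32_t imm) {
    /* imm[11:5] | rs2 | rs1 | funct3 (SB=000, SH=001) | imm[4:0] | opcode=0100011 */
    return ((uint32_t)((imm >> 5) & 0x7F) << 25) | (rs2 << 20) | (rs1 << 15)
         | (funct3 << 12) | ((uint32_t)(imm & 0x1F) << 7) | 0x23;
}

int main(void) {
    uint32_t prog[] = {
        enc_addi(1, 0, 0),      /* ADDI X1, X0, 0   -- base pointer into RAM */
        enc_addi(2, 0, 0xAA),   /* ADDI X2, X0, $AA -- example data byte     */
        enc_store(0, 2, 1, 0),  /* SB   X2, 0(X1)                            */
    };
    /* Emit as little-endian bytes; the host-side framing for the serial
     * link is not shown. */
    for (unsigned i = 0; i < sizeof prog / sizeof prog[0]; i++)
        for (unsigned b = 0; b < 4; b++)
            putchar((int)((prog[i] >> (8 * b)) & 0xFF));
    return 0;
}

As a sanity check, enc_addi(1, 0, 0) yields 0x00000093 and enc_store(0, 2, 1, 0) yields 0x00208023, which match the standard encodings of ADDI X1, X0, 0 and SB X2, 0(X1).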
Before adding an "instruction fetch" stage to the pipeline, though, I need to complete the LOAD class of instructions. This will enable me to support inspection of memory as well, so I can implement, for example, a memory test routine on my host PC.
After LOAD and the instruction fetch stage are complete, I intend to implement the JAL and JALR instructions. This should make the KCP53010 Turing complete, although still not a complete RISC-V implementation. It should be good enough to support basic "Hello world, what is your name?" / "Oh, hello Foo!" type programs.
The SIA
The Serial Interface Adapter (SIA) core implements the serial interface used to communicate with the host PC using a basic 4-wire synchronous serial interface. This core is already done.
The Host PC Interface
I dread this step the most.
This is actually an ESP8266 device which I received courtesy of Dr. Ting at an SVFIG meeting. It will adapt the USB interface from the host PC to the 4-wire synchronous serial that the IPA and SIA implement. Programming the device will be challenging, since I will need to bit-bang the interface to the FPGA. Absolute yuck. It's not even very clear to me how to go about testing this module to know it's working properly.
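Conceptually, the bit-banging boils down to: drive RXD with each bit, toggle RXC as the forwarded clock, and wrap every byte in the 8N1 frame the IPA expects. The sketch below is just that concept written out in C; the pin numbers, the gpio_write()/delay_us() helpers, and the clock edge the IPA actually samples on are assumptions on my part, not verified facts.

/* Conceptual sketch of the ESP8266-side bit-bang: send one byte as an
 * 8N1 frame (start bit, 8 data bits LSB first, stop bit) on RXD, with a
 * forwarded clock on RXC.  Pin numbers, helpers, and the sampling edge
 * are placeholders. */
#define PIN_RXD 4                              /* placeholder pin numbers */
#define PIN_RXC 5

extern void gpio_write(int pin, int level);    /* assumed GPIO helper  */
extern void delay_us(unsigned us);             /* assumed delay helper */

static void send_bit(int level)
{
    gpio_write(PIN_RXD, level);
    delay_us(1);
    gpio_write(PIN_RXC, 1);    /* assuming the IPA samples RXD on the rising RXC edge */
    delay_us(1);
    gpio_write(PIN_RXC, 0);
}

static void send_byte_8n1(unsigned char b)
{
    int i;
    send_bit(0);                     /* start bit */
    for (i = 0; i < 8; i++)
        send_bit((b >> i) & 1);      /* data bits, LSB first */
    send_bit(1);                     /* stop bit */
}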
Summary
Things were much, much easier with the Kestrel-2, where I could rely on block RAM, a significantly simpler processor model, and a working video interface from day one. Bootstrapping the Kestrel-3 is significantly more complex, and I'm not too happy with it. However, as my log history shows, every attempt to go a simpler route has so far failed. I truly wish I had a better way of bringing a system up quickly.
-
KCP53010 Pipeline Register Bypass Works
08/14/2017 at 05:51 • 0 comments
I just finished the register bypass logic for the KCP53010 core. This allows the execute and memory pipeline stages to feed their destination register contents back to the decode stage before register write-back actually completes, preventing a pipeline stall when, for example, the destination of one instruction is used as the source for the next instruction or two.
I'm sure things are still buggy; however, my tests so far seem to indicate everything is working.
Prior to this logic being added, you had to manually 'pad' instructions in the pipeline. E.g., to add a constant to a register, you'd need to feed the pipeline with five instructions, like so:
ADDI X1, X0, 256
NOP             ; or, ADDI X0, X0, 0
NOP
NOP
NOP
; At this point, the value will be stored in X1.
; We can now use it.
ADDI X1, X1, 256
NOP
...etc...
So, as you can imagine, if you wanted to execute something like:
ADDI X1, X0, 256    ; X1 := 256
ADDI X1, X1, 256    ; X1 := X1 + 256 = 512
SD   X1, 1(X2)
you would consume something on the order of 15 clock cycles. With the feedback logic, we no longer have to pad instructions out like this, since we can execute read-after-write instructions immediately:
ADDI X1, X0, 256    ; [1]
ADDI X1, X1, 256
SD   X1, 1(X2)
NOP                 ; decode SD
NOP                 ; effective address calculation [2]
NOP                 ; store bits 63..48
NOP                 ; store bits 47..32
NOP                 ; store bits 31..16
NOP                 ; store bits 15..0
The value for X1 in instruction [1] above actually gets written into the register file at point [2] in the instruction stream. But, thanks to forwarding/feedback logic, we can use that value (and its subsequent replacement!) in intervening cycles.
This reduces the instruction stream's latency to just 9 clock cycles. The bulk of the time is consumed by the lengthy store operation.
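If it helps to picture what the bypass logic is doing, here's a rough software model of the operand-selection rule, following the usual textbook behavior: prefer the youngest in-flight result whose destination matches the source register, never forward X0, and otherwise fall back to the register file. This is a conceptual sketch of the idea, not a transcription of the actual Verilog.

#include <stdbool.h>
#include <stdint.h>

/* One in-flight result, as seen by the decode stage. */
struct fwd_source {
    bool     valid;   /* stage holds a result that will be written back */
    unsigned rd;      /* destination register of that result            */
    uint64_t value;   /* the result itself                              */
};

/* Pick the operand for source register rs in the decode stage.
 * ex is younger than mem, so it wins when both match. */
static uint64_t read_operand(unsigned rs,
                             const uint64_t regfile[32],
                             struct fwd_source ex,
                             struct fwd_source mem)
{
    if (rs == 0)
        return 0;              /* X0 is hard-wired to zero; never forwarded */
    if (ex.valid && ex.rd == rs)
        return ex.value;       /* bypass from the execute stage             */
    if (mem.valid && mem.rd == rs)
        return mem.value;      /* bypass from the memory stage              */
    return regfile[rs];        /* no hazard; use the architectural value    */
}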
There are opportunities for speeding this up further; but, I'm going to leave it as-is for now. I still need to implement pipeline stall logic, so that the pipeline stalls while a memory fetch or store operation is in-progress.
An obvious opportunity for performance enhancement is to perform memory writes in the background (ZipCPU does this, for example); however, this optimization may not always work for memory reads (you'd have to be careful about choosing your registers wisely to avoid blocking). The logic to detect when to stall in this case is pretty tricky, so for now, I'd like to keep things simple. 9 cycles for a 64-bit, 3-instruction write sequence is not horrible.
-
Quick Update Before Work...
07/28/2017 at 15:12 • 2 comments
Just a quick update before I rush into the office.
I've been working on the KCP53010 CPU's pipeline stages on and off over the last couple of weekends. Overall, I'm happy with the results so far. You might say that the pipeline fully recognizes STORE and OP-IMM instructions, although supporting more than these (especially LOAD, OP, OP-IMM-32, and OP-32) is quite easily done with only a handful of lines of Verilog.
Last night, I wrote the very first lines of code that integrate these different stages into a real pipeline. It does not work yet, but its current behavior is very promising indeed. I haven't had the time to implement a real integration test for it yet, so I just relied on the RESET behavior and how it pipes a NOP (ADDI X0, X0, 0) instruction through the queue. After looking at the waveforms manually, I'm pleased with the results so far.
Some things which need to be done include (but aren't limited to):
- Clock the register file's source addresses on the falling clock edge, instead of on the rising edge. ALTERNATIVELY, take the source addresses from the instruction register's inputs instead of from the instruction register itself. Either one of these approaches allows me to deliver the contents of the register file concurrently with the instruction decoder outputs, thus letting me keep a 5-stage pipeline. Otherwise, I'd need to introduce a separate "register fetch" pipeline stage.
- Move SEL_O signal generation into the memory stage (load/store unit). Right now, it's an explicit input; however, since its value depends on the output of the ALU in the execute stage, there's no way to precompute it at any earlier stage.
- Make use of BUSY_O and related pipeline stall signals to control instruction flow through the pipeline. Right now, instructions just flow synchronously with the clock.
- Implement register bypass/feedback logic. This would prevent pipeline stalls or erroneous computations when the source of an instruction I comes from the destination of the previous instruction I-1.
There's a lot of work that needs to happen yet; but, I think I can swing it. I just need to take this slowly, one step at a time.
-
Commencing Third Pipeline Stage
07/02/2017 at 23:27 • 0 comments
Since my last update, I've made many small and incremental improvements to the load/store unit and the register "write-back" side of the X-Register Set modules. To a reasonably good approximation, I think this completes 90% of my work on these stages. I think there are some small artifacts that need to be added still, but these will depend upon the cooperation of other units not yet written, so will have to wait.
With that said, I think it's time to start on the Integer Execute stage of the pipeline. This is the stage that basically encapsulates the ALU I've already written for the KCP53000.
LSU Features
The KCP53010's front-side bus will conform to the Wishbone B.4 Pipelined Mode specification. This new direction solves several problems I was having before with the KCP53000, allowing me to collapse several support modules into the core of the CPU effortlessly.
The B.3/B.4 Standard Mode/Furcula bus ties the master and slave side of the bus inextricably together, which required more sophisticated state machines when adapting to other buses. The 64-bit to 16-bit bridge (KCP53003) added a significant amount of overhead to the circuit, as did all the other bridges that were required to interface the KCP53000 to the Kestrel-2 hardware. It worked; but, it was very slow, and only just barely met timing requirements for a working computer.
The B.4 Pipelined operation greatly reduces the complexity involved with bridging different bus widths. Supporting 64-bit, 32-bit, 16-bit, and 8-bit transfers over a 16-bit external bus came surprisingly easily once I realized that the command and response (or master and slave, as referenced in the Verilog sources) sides of the bus can be cleanly divorced from each other. I'm banking on this simplification to reduce layout pressure as well as bump the CPU's operating frequency to a more comfortable rate.
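To make that "divorced sides" point concrete, here's a toy cycle-by-cycle model of how I think about a B.4 pipelined master: one counter for requests issued (gated only by STALL), one counter for ACKs received, and the burst is done when they match. It's a conceptual sketch of the handshake, not the actual bus logic.

#include <stdbool.h>

/* What the slave drives back to the master on a given clock cycle. */
struct wb_slave_outputs {
    bool stall;   /* STALL_O: the slave cannot accept a new request this cycle */
    bool ack;     /* ACK_O: one previously accepted request has completed      */
};

/* Toy model: run a burst of `total` pipelined transfers.  The command
 * side (issuing requests) and the response side (counting ACKs) only
 * meet through the two counters. */
static void wb_pipelined_burst(unsigned total,
                               struct wb_slave_outputs (*clock_tick)(bool cyc, bool stb))
{
    unsigned issued = 0, acked = 0;

    while (acked < total) {
        bool cyc = true;                 /* CYC_O held for the whole burst */
        bool stb = (issued < total);     /* STB_O while requests remain    */

        struct wb_slave_outputs s = clock_tick(cyc, stb);

        if (stb && !s.stall)             /* command side: request accepted */
            issued++;
        if (s.ack)                       /* response side: completion seen */
            acked++;
    }
    /* CYC_O/STB_O drop here; the burst is complete. */
}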
Disadvantages
Because I now natively support Wishbone, the CPU is directly responsible for handling address misalignment and data path routing. Right now, the LSU doesn't take misalignment into consideration. This is a known bug, but it will be addressed later. However, I'm thinking the hardware to detect and respond to this (and similar) condition(s) will still result in a net reduction in complexity.
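For the record, the detection half of that is cheap: an access of 2^n bytes is naturally aligned when its low n address bits are zero. A trivial sketch of that check follows; the response half, routing data onto the 16-bit bus, is the part that actually costs logic.

#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes (1, 2, 4, or 8) at `addr` is
 * naturally aligned; anything else needs special handling. */
static bool naturally_aligned(uint64_t addr, unsigned size)
{
    return (addr & (uint64_t)(size - 1)) == 0;
}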
-
Mega Progress Update
06/04/2017 at 19:34 • 0 comments
I could have sworn that I'd posted an update already, but looking at my logs feed, I clearly have not.
Topics covered below include:
- Serial Interface Adapter Core Completed
- Initial Program Adapter Core
- KCP53010: Successor to KCP53000 CPU
Serial Interface Adapter Core Completed
Not much more to say than that. It's done. It's not as small as I'd like, but on the other hand, it's also more flexible than your typical UART design. It allows you to send and receive serial data streams (LSB first only), with or without start bits, stop bits, etc. Frame checking is up to the software using it. It supports configurable FIFO depths and widths (up to 16 bits wide), allowing you to tune the core for your needs. Those who have programmed the Commodore-Amiga's internal UART will be right at home with how this adapter works. A nice, wide divisor allows for data rates from as low as hundreds of bits per second to as high as tens of megabits per second.
Data is sent over a pair of wires, TXD and TXC, forming data and forwarded clock, respectively. Data is received on RXD and RXC. Note that the receiver can synchronize on RXD, RXC, or both. For lower-speed applications, RXD is sufficient. For higher speeds, you probably want to ignore RXD and focus just on RXC. The choice is yours.
This core provides a 16-bit Wishbone B.4 Pipelined Mode slave interface; it should be easily usable with 8-bit devices as well.
New Initial Program Adapter Core
The Kestrel-3 code-base now includes a new core, currently with the name "IPA". This core has one mission: to facilitate loading the initial bootstrap code into RAM on a ROM-less computer design. From the processor's perspective, it looks exactly like ROM memory, and sits where ROM normally would; however, on the back-end, it parasitically feeds off the RXD and RXC pins of the SIA core. The idea is simple: when the processor reads a half-word from anywhere in ROM's address space, it blocks until the IPA receives two bytes. The bytes must be sent in PC-standard 8N1 serial format. The IPA is synchronized on the RXC input, so you'll need either a proper USART or a microcontroller to drive it. Since I have two Arduinos and an ESP8266 microcontroller at my disposal, this is not a blocking drawback.
The idea is you spoon-feed the computer an instruction stream designed to explicitly store data into memory, like so:
; X1 = pointer into RAM
; X2 = value to store (byte)

ADDI X1, X0, 0
ADDI X2, X0, $03
SB   X2, 0(X1)
ADDI X2, X0, $7F
SB   X2, 1(X1)
; ...etc...
and so on until you have loaded 1KB to 2KB worth of code into RAM. If you need more than this, you'll need to manually reset X1 somehow, and continue loading your data. This approach is slow, of course; however, it saves me the hassle of needing to implement a DMAC just for the serial port. LUTs are precious in these smaller FPGAs, so this is a pretty big win for me. Besides, this only has to happen exactly once upon system reset, and the bootstrapper doesn't need to be terribly large (4KB seems like an awfully large bootstrapper to me). When the initial program is loaded, you kick it off by sending a JAL X0, 0(X0) instruction.
The IPA exposes a Wishbone B.4 Pipelined slave interface, and only supports 16-bit half-words. Attempting to read or write bytes in this space will fail in unpredictable ways. Don't do it. Thankfully, when the CPU fetches instructions, it fetches them 16 bits at a time.
This is not the first ROM-less Kestrel computer I've made. Indeed, my very first, the W65C816-based proof-of-concept Kestrel-1, only connected to SRAM and a single VIA chip for I/O. The architectures of the Kestrel-1 and the iCE40-targeting Kestrel-3 designs have much in common.
A quick comparison:
- CPU: W65C816P-14, 4MHz (Kestrel 1p4) vs. KCP530x0, 25MHz (Kestrel-3)
- Performance: 2 MIPS max. vs. 6 MIPS max. (KCP53000), 12 MIPS est. max. (KCP53010)
- Word Width: 8/16 vs. 16/64
- RAM: 32KB max. vs. 256KB min., 512KB typ., 2^60 B max.
- I/O: 1 VIA with 16-bit parallel I/O vs. 1 SIA, V.4 compatible serial, 110bps to 12.5Mbps possible
- IPL Mechanism: bus-mastering IPL port, driven from PC parallel port, vs. ROM emulation via IPA core, shared with SIA core
New KCP53010 CPU Design Coming
You may have noticed that I'm adopting the Wishbone B.4 Pipelined Mode interface going forward in most everything I do these days. The KCP53000, however, uses a Wishbone B.3 master interface. It also requires bus arbiters and 64b/16b bridges to talk to FPGA-accessible resources. Coupling B.3 to B.4 peripherals will require yet more logic to be added to the design. This is unsustainable, and requires that I rework the CPU's memory data paths. I'm also dropping Furcula bus support. Not that I don't think it's a good idea, but it didn't pull its weight like I expected it to.
Since B.4 supports pipelining, I'm once again attempting to work towards a 5-stage pipelined, RISC-V processor architecture. My initial attempt ended in flames, melting flesh, pestilence, and was probably partly responsible for Batman Vs. Superman. OK, mostly hyperbole, but no one who knew me or what I was experiencing at the time would argue that it didn't end up a categorical failure in every sense of the word. I learned a lot of what not to do, but not a whole lot of what to do.
The KCP53010 design aims to implement a text-book, five-stage pipeline: instruction fetch, decode and register-fetch, execute, memory access, and write-back. (I may have to split the decode and register-fetch steps because of how the block RAM works on iCE40 FPGAs; I'll cross that bridge when I get there.) This time, instead of working top-down, I'm working bottom-up. Or, more precisely, inside-out, and back to front.
This means I implement the register write-back stage first. This stage appears to work currently, although I know it's far from finished. That's OK though; it's always safe to revisit the design later as I learn new things, provided tests are maintained in synchrony. Keeping the tests updated is the key.
I'm currently working on the memory access stage as I type this. Experience gained working on this stage has already taught me several things:
- Wishbone B.4 Pipelined Mode is surprisingly simple to implement when you have the right implementation strategy, and insanely complicated if you don't. Thankfully, I've settled on a design which makes quite a bit of sense to me; in fact, I'm now thinking it's even simpler to implement than Wishbone B.3's quasi-asynchronous handshaking protocol, since I can think about commanding the bus and accepting responses in isolation from each other, instead of having to conflate the two. This leads to better testability and superior isolation of concerns in the Verilog source code.
- Some aspects of the data flow through a RISC-V processor which I thought belonged in the ALU more rightly belong at the head end of the write-back stage (e.g., zero- and sign-extension; a quick sketch of that step follows this list). This should reduce the ALU's complexity compared to the KCP53000's ALU, which I believe will let me hit higher clock speeds, even if only slightly. The ALU is *the* limiting factor of the KCP53000 design, so this is a welcome insight.
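As a concrete example of what moves out of the ALU, here's the zero-/sign-extension applied to a load result, selected by the load's funct3 field as defined in the RV64I base ISA. This is a minimal software sketch of that head-of-write-back step, not the actual Verilog.

#include <stdint.h>

/* Extend a raw 64-bit load result according to the RV64I load funct3:
 * 000=LB, 001=LH, 010=LW, 011=LD, 100=LBU, 101=LHU, 110=LWU. */
static uint64_t extend_load(uint64_t raw, unsigned funct3)
{
    switch (funct3) {
    case 0:  return (uint64_t)(int64_t)(int8_t) raw;   /* LB : sign-extend byte     */
    case 1:  return (uint64_t)(int64_t)(int16_t)raw;   /* LH : sign-extend halfword */
    case 2:  return (uint64_t)(int64_t)(int32_t)raw;   /* LW : sign-extend word     */
    case 3:  return raw;                               /* LD : full 64 bits         */
    case 4:  return raw & 0xFFull;                     /* LBU: zero-extend byte     */
    case 5:  return raw & 0xFFFFull;                   /* LHU: zero-extend halfword */
    case 6:  return raw & 0xFFFFFFFFull;               /* LWU: zero-extend word     */
    default: return raw;                               /* reserved encoding         */
    }
}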
As I type this, Yosys/Arachne-PnR report that the load/store circuitry consumes no more than 310 LUTs. Even if I use two of these (one for instruction fetch and one for data access), this results in a significantly smaller memory interface than the KCP53000+arbiters+bridges approach I've been using (estimated to be half the size!), with a whole lot less combinatorial logic in the hot path. I anticipate this will contribute towards an improvement in top clock speed supported. In fact, this is small enough that I can probably use three of these units, two for memory, and one for CSRs, greatly reducing the burden of making CSR access as fast as possible, and still have a design smaller than the memory access hierarchy that the KCP53000 required!
I expect to have a blog article written up on my current design soon. I'll post a link here when I finish it. I'll also be giving a talk on the design at the June SVFIG meeting.
My immediate goal is to create a dumb pipeline that takes pre-decoded RISC-V instructions and executes them. This forces the design to remain testable throughout the whole design and implementation process. Control circuitry and required state machines won't go into the circuit until much, much later; I can probably reuse a lot of this from the KCP53000, honestly. Consuming pre-decoded instructions lets me focus on the essentials of the data-flow and error reporting circuits, again making things that much easier to implement and test.
Conclusion
That's it for now; I'll have more as time progresses of course. Progress is very slow due to work obligations, but I try to do what I can when I can. Until next time...