-
Vacation Time!
06/11/2016 at 00:24 • 0 comments

I have some vacation time coming up, so I'll be taking a break from the hardware aspects of the Kestrel-3. I was hoping to have the Backbone Bus board layout completed by now. I'll make a last-ditch effort to finish it this weekend and get some boards ordered. Hopefully, I'll get the parts and the boards back by the time I return home. But, even if not, that's OK; there's still some Verilog to get working.
OK, a lot of Verilog.
While away from home, I'm planning on spending some quality time with Icarus Verilog to work on bits of the CPU and/or the CGIA. I have no idea how far I'll get in the next two weeks or so, this being a vacation and all. However, as I accomplish things, I'll be sure to keep folks updated with some posts here.
For those wondering, I'm going to participate in a new Star Trek New Voyages film shoot (I usually work digital ingest, which is what allows me to work on Kestrel things). However, if I end up working something other than digital ingest, like grip or audio, I'm sure that'll distract me from Kestrel work. It's all good though; a vacation is supposed to be fun and relaxing.
Eagerly looking forward to (up to) two weeks of Kestrel development time! Since this is a vacation, though, no promises on any kind of useful productivity.
-
Brief Update
06/09/2016 at 14:51 • 0 comments

For those wondering what I've been doing of late, check out my Backbone Bus project. It'll be the backplane bus for my development version of the Kestrel-3.
It has been recommended that I space the slots on 0.8" centers for use with existing card cage assemblies. Pretty sure that's not going to happen this time around. But, if there's any actual interest in this, I may consider it for a future project.
-
Design Review for Kestrel-3 Backplane
05/26/2016 at 15:01 • 0 comments

On the 6502 forums, I posted a solicitation for a design review of the Kestrel-3's backplane. If anyone here wants to comment on it, I'd be happy to entertain your thoughts here as well. Thanks.
-
Kestrel-3 Hardware Design Approach
05/24/2016 at 20:23 • 0 comments

Ideally, I'd like to synthesize the Kestrel-3 in an FPGA board with a price/performance tradeoff in the same ballpark as the venerable Digilent Nexys-2 board (no longer manufactured). However, contemporary FPGA development boards are either too simple, and thus would require me to spend more on attachments to make up for the lack of on-board features; or they're way overboard, costing 2x to 3x what my Nexys-2 did, while providing features which I'm in no position to exploit right now. Furthermore, almost everything out there uses an FPGA from a vendor whose tooling I simply have not been successful getting to run on my local workstation.
So, in a fit of not-so-quiet desperation, I've decided I should try to build my own kit that directly supports synthesizing a Kestrel-3. This is a *big* risk for me; I've never built anything with an FPGA on my own before, let alone several. Some may remember that my initial goal was to use my Nexys-2 as a proof-of-concept (since the Nexys-2 was already a working FPGA board I could depend upon) and then work on a real motherboard, meehhh, about a year or so afterwards. However, my repeated failures in getting any FPGA development IDEs working have resulted in a change of priorities. Since I'm able to get IceStorm and related tools working quite trivially, the development kit will be based around one or more iCE40HX4K-TQ144 units (primarily because of the gull-wing package, which makes it possible for me to solder, albeit with some difficulty). The disadvantage of these FPGAs, though, is a somewhat limited logic density combined with a reduced number of I/O pins. I need multiple FPGAs to realize a Kestrel-3 using this configuration of hardware; at least two, in fact: one for the CPU, and one for "everything else" that makes up the Kestrel-3 motherboard logic. I may even need three.
Since I don't want to exceed the net cost of a Nexys-2, I decided that my success criterion will be a working Kestrel-3 in less than $250 of parts; labor is not included, since I'll be building it myself.
To facilitate incremental development of individual pieces, since I have no idea what I'm getting myself into, I decided on building the FPGA development board around a custom, passive backplane using DIN 41612 connectors. Each connector would provide somewhere in the vicinity of 64 (or more) uncommitted, shared I/O pins. I figure with a 50MHz oscillator driving the backplane bus, I'd need at least 32 pins reserved for Vdd or ground. I'm planning on placing the grounds and Vdd signals on pins B1 through B32 (right down the middle of the connector). That should leave plenty of pins to provide an externalized Wishbone bus with 28 ADR pins, 16 DAT pins, 2 SEL pins, plus pins for write-enable, cycle request, cycle grant, strobe, and acknowledge. Adding clock, reset, and FPGA CDONE signals, this comes to a total of 54 pins, so I still have a reasonable amount of room for future growth. This also occupies about one half of the available I/O pins of one of the FPGAs, so if I dedicate one FPGA to just the CPU, it poses no problem.
However, when I consider the other half of the Kestrel system, the MGIA/CGIA, GPIA, KIA, and other peripherals, I need to use up more I/Os for talking with peripherals than what I can afford to devote to the backplane bus. I wager 17 pins for 32K-color VGA, 4 pins for PS/2 ports (input only), 8 pins for a PMOD port to attach my SD card reader, 23 address pins to external RAM, 16 data pins, 2 pins for OE# and WE#, plus another 2 pins for upper and lower data strobes. Adding all that up, I get a total of 72 pins. Again, taken on its own, it looks like the FPGA can handle things well enough.
The problem is when trying to tackle all those I/O responsibilities at the same time as interfacing to the backplane bus. The backplane plus I/O load results in 72 + 54 (optimistic) = 126 pins, exceeding the iCE40HX4K-TQ144's 107 available I/O pins (and that's assuming no limitations on those available I/Os). So, one of two things needs to happen:
1. I use a narrower bus in exchange for increasing the number of clock cycles required to accomplish a bus transaction. Because the backplane is passive, this interconnect must be strictly master/slave.
2. I use multiple FPGAs, each dedicated to one specific task. This means I have dedicated cards for CPU, memory (RAM and ROM), and I/O. But since the CPU and I/O cards will need access to memory, the interconnect will need to support multiple bus masters.
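As a sanity check on the pin arithmetic above, here's a quick tally in Python. The signal groupings follow the text; the dictionary keys are my own shorthand, purely illustrative.

```python
# Pin tally per card, following the groupings in the text (names are mine).
backplane = {
    "ADR": 28, "DAT": 16, "SEL": 2,
    "WE": 1, "CYC_REQ": 1, "CYC_GNT": 1, "STB": 1, "ACK": 1,
    "CLK": 1, "RST": 1, "CDONE": 1,
}
io_card = {
    "VGA (32K color)": 17, "PS/2 x2": 4, "PMOD (SD card)": 8,
    "RAM address": 23, "RAM data": 16, "OE#/WE#": 2, "data strobes": 2,
}

bus_pins = sum(backplane.values())            # 54
io_pins = sum(io_card.values())               # 72
print(bus_pins, io_pins, bus_pins + io_pins)  # 54 72 126 > 107 available I/Os
```

Either option above attacks that 126-pin total from a different side: option 1 shrinks the backplane group, while option 2 splits the I/O group across dedicated cards.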
My (amortized!) cost of a backplane board is expected to come out to $50 or so, including the DIN sockets. That's double that of the RC2014 computer on Tindie, which makes me sad; but, then again, I'm considering prototype prices, and RC2014 looks like it's in regular production, so I'm sure that affects prices. Anyway, to fit a DIN plug, each adapter card will need to be at least three inches long. I can't imagine any reason why a board would need to be wider than two inches, so conservatively, let's say that both the CPU and the I/O plugin boards would come to six square inches. Using OSHPark and Digikey prices, a crude (amortized!) estimate for a plug-in card equipped with one DIN plug ($7 est.) and one FPGA ($7 est.) comes to around $24 for the CPU card, to maybe $44 for the I/O card.
So, if I have a two-card approach interfaced with a narrower interconnect, I'm looking at $50+$24+$44 = $118 total investment best-case. If I need a third card for hosting memory, which might afford me the simpler, yet wider, Wishbone bus, that would go up to $118 + $44 (worst case!) = $162. Even with prototype pricing, I have lots of headroom left over on my budget, so I think I'm doing this right no matter which route I take.
I've spent the past several days reviewing RapidIO, SD card protocols, and SPI flash memory protocols for inspiration with the narrower bus approach. However, I think the right choice is actually the externalized, multi-drop Wishbone interconnect. After all, premature optimization is the root of all evil, right? This includes cost optimization. As long as I am meeting my budget, I think I'm going to do OK.
Of course, the disadvantage of ordering from some place like OSHPark is that I have a minimum order of 3 boards for each of what I need. So, for these amortized prices to work out, I need to get things right the first time, every time. I'd love to hear from folks who have worked with FPGAs before, specifically with respect to rework expenses.
-
Digilent Nexys-2 Retired
05/21/2016 at 20:40 • 7 comments

Well, I'm sad to say that I'm going to have to retire my Digilent Nexys-2 FPGA development board even before committing the first bitstream for the Kestrel-3 to it. It seems Xilinx WebPACK ISE refuses to run its license manager on 64-bit Linux, and their Vivado monstrosity does not support the FPGA that's on my board. So, unless the XC3S1000E somehow magically gets reverse-engineered tomorrow, I have a very expensive, yet not terribly effective, paperweight.
It looks like I'll be building a Kestrel-3 motherboard around iCE40HX-family parts sooner rather than later. I was not looking forward to this. I'm almost certain to miss my deadline for realizing a working Kestrel-3 in hardware by January of next year, unless someone knows of a development board with two or three iCE40HX-8Ks on it, a 50MHz oscillator, PS/2 port(s) for keyboard (and, optionally, mouse), a VGA port capable of 512 or more colors, 16MB of RAM, and at least 1MB of flash ROM to store the Kestrel's firmware in.
And for those of you about to suggest Altera FPGAs, please don't. Simply trying to download a working binary from their website resulted not in bits, but depression and anger. If you think Xilinx's Byzantine licensing and registration hoops are bad, you haven't played with Altera's website. Moreover, when I researched their parts, they wanted as much for a programming cable as they did for a typical Terasic development board.
-
Video Before Processor
05/17/2016 at 19:01 • 5 comments

It has occurred to me that I need to implement the video interface before I implement the CPU. I'm still quite used to the architecture of the Kestrel-2, which used internal block RAMs, and so I could easily afford to just focus on the CPU first. However, there are two important reasons to focus on the video interface first:
- First and foremost, the CPU must go through a RAM controller of some kind. Since the video controller needs to know not only how much video bandwidth exists, but also how the bus is organized, in order to maintain accurate timing, it makes sense for the video interface to be responsible for bridging a higher-level bus to the nitty-gritty details of video RAM.
- Second, the video interface serves as a kind of replacement for your typical front panel. Getting the video interface working first means that basic "hello world" tests like ld x1, negative1 ; sd x1, framebufferAddr ; jal x0, * will be much easier to get working. (Note the use of x1: in RISC-V, x0 is hardwired to zero, so a writable register is needed to hold the -1 pattern being stored to the framebuffer.)
From the RAM's perspective, there will be (at least) four or so bus masters:
- Sprite Y coordinate fetch,
- Sprite X coordinate, color, and data fetch,
- Background video fetch, and perhaps not least,
- the host processor bus itself.
Each of these would appear, to the RAM, as a 16-bit-wide bus master. Most of the video circuitry will continue to use this data width; however, the host bus will be bridged with something that groks 64-bit data widths and sequences wide loads and stores accordingly. Only when this logic is in place can I even consider putting a CPU on the host bus.
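To make the bridging idea concrete, here's a toy sketch of how a 64-bit store might be sequenced into four 16-bit bus cycles. This is purely illustrative; the function name, little-endian ordering, and cycle format are my assumptions, not the actual bridge design.

```python
def sequence_store64(addr, value):
    """Split one 64-bit store into four 16-bit bus transactions.

    Assumes little-endian half-word ordering; yields (address, data)
    pairs, one per 16-bit bus cycle. Illustrative only.
    """
    for i in range(4):
        half = (value >> (16 * i)) & 0xFFFF
        yield (addr + 2 * i, half)

# Example: storing -1 to a hypothetical framebuffer base address
# produces four cycles of 0xFFFF at consecutive half-word addresses.
print(list(sequence_store64(0x10000, 0xFFFFFFFFFFFFFFFF)))
```

A 64-bit load would run the same sequence in reverse: issue four reads, then reassemble the half-words.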
-
Polaris CPU Design Incompatible With Current RV64S
05/16/2016 at 17:49 • 0 comments

From the time I started working on the Kestrel-3 development tools and emulator, I used the currently documented Draft Privileged ISA Specification V1.7 as my guide for supporting interrupts, managing the mtohost and mfromhost CSRs (which I used to talk with the emulator directly), and other supervisory aspects of the machine.
One e-mail I got back from the HW-DEV group brought to my attention differences between the Spike ISA simulator and the V1.7 privileged ISA specs. While some changes were to be expected, I did not expect them to be so radical as to basically require a complete redesign of the privileged mode altogether.
Some changes that I'm personally aware of, along with some inspired speculations:
- No more mtime and mtimecmp registers; responsibility for timekeeping is offloaded to a peripheral. I have mixed feelings over this removal, but I think it's a good move in the long term. It certainly greatly simplifies timekeeping in a multi-hart configuration.
- The mstatus register is completely different in layout. The privilege stack that was previously maintained in mstatus is gone. Unfortunately, I'm not sure how it works right now. While interrupt enables exist for user, supervisor, hypervisor, and machine modes, the removal of ERET suggests to me that a stack is no longer implemented. (Note that EBREAK and ECALL are still supported.) There are also MPP, HPP, SPP, and UPP fields, whose semantics are completely opaque to me at the moment. Also, it's not clear how to tell which operating mode the CPU is currently in. With discussions of a new "debug mode," it's entirely possible that mstatus is no longer the source of truth for determining current operating mode.
- Speaking of which, the ERET instruction is no longer supported. Instead, there are URET, SRET, HRET, and MRET instructions. Whether these instructions "return to" their eponymous mode of operation or if these instructions can only be *used* in their eponymous mode is unknown to me. The Git diffs I've seen do not make the semantics clear. This change was apparently needed to work around a virtualization escape exploit.
- Polaris is configured to boot in high memory, specifically at address $FFFFFFFFFFFFFF00. This differs markedly from where Spike assumes a processor will boot from. However, this keeps changing throughout the project's history. I remember when $2000 was the official bootstrap point. Then it became $200. Then in the V1.7 privileged specs, it was configurable based on the default MTVEC register (which is what Polaris does). Then it became address zero (like Intel 8080). Now that people are talking about a debug mode of operation, it's currently at $1000. Where will it be tomorrow? Nobody knows. But, one thing is for sure: it's prudent to treat the RISC-V architecture just like the Motorola 68K series of CPUs, which had a rock-solid user-mode environment, but in any other privilege mode, anything goes.
- The nature of mtvec has changed. In V1.7, mtvec pointed to a set of five trap handlers. These handlers were intended to handle traps generated in user, supervisor, hypervisor, and machine modes of operation, in that order. There was an additional non-maskable interrupt entry point as well. Note that htvec and stvec (for hypervisor and supervisor trap handlers) pointed right at their respective handlers. This elevated machine-mode software above other modes in a way which violated Popek and Goldberg virtualization requirements. Going forward, it seems that mtvec, htvec, stvec, and a newly introduced utvec register all point directly at their handlers equally, enabling code that once sat at machine-mode to be seamlessly run at lower privilege levels. This is a change I actually agree with, although I will miss the convenience of having separate handlers for different privilege modes.
I'm positive there are others; these are just the changes I've been able to glean from reading the git commit history.
While I look forward to seeing a finished privileged ISA specification, I also do not want to wait forever to get the Kestrel-3 working in a real FPGA. I'm going to stick with the ISA specifications as they're currently defined. This means that the Kestrel-3 will be officially incompatible with any future RISC-V Privileged ISA specifications.
I think to bring the Kestrel Project back into RISC-V compliance, I will need to release a Kestrel-4 (one which is not my April Fool's joke), which is a Kestrel-3 in every way, but with a properly compliant CPU at its core.
-
CGIA/VDP Thoughts
05/15/2016 at 06:08 • 0 comments

While I let the CPU requirements bake a little bit, I guess I'll spend the remainder of my weekend thinking on the CGIA. When I asked if I should pursue VDP or CGIA here on Hackaday roughly three days ago, I got no response at all. When I asked the same question in a Twitter poll, I received exactly one vote for VDP and one vote for CGIA. So, with everyone literally divided equally or simply not caring, I guess I'll just have to play with each design to see which I like better and go forward with that.
After some thought on the subject, I think I have an idea of how to add VDP-like sprite pre-processing to the CGIA.
According to VGA timing specifications, a 640-pixel display actually has 800 pixels edge to edge. (The unused pixels are border and horizontal sync times.) These pixels are clocked at 25MHz in the Nexys-2 version of the Kestrel-3. The Nexys-2 RAM has a 14MHz bandwidth limitation, so accessing it at 12.5MHz is the best we can do without bizarro clocking or PLL tricks. We still maintain 25MBps throughput, though, since the path to RAM is 16 bits wide. For this reason, every access to external RAM (a "transfer") takes the same amount of time as two pixels on the screen. (Ironically, this is exactly the case for TI's VDP as well!) Put another way, one horizontal scanline on the monitor corresponds to 400 transfers to external memory.
Best Case: Monochrome Without Sprites
To display a line of monochrome video, we need to fetch 640 / 16 = 40 half-words of memory from RAM. This leaves a total of 400 - 40 = 360 transfer slots available for the host CPU or other bus masters I may add later on. This is how the MGIA works today.
Worst Case: 256-color Display Without Sprites
Since monochrome requires only 40 of the 400 available transfers, there's enough headroom for a 640-pixel-wide, 256-color display. That would require 320 transfers, leaving 80 transfer slots available to the CPU or other devices.
As with the Commodore-Amiga computers, the deeper your color depth, the more drag will exist on the CPU.
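The arithmetic above generalizes to any color depth. Here's a short Python tabulation using the figures from this post; the constant and function names are just for illustration.

```python
PIXEL_CLOCK_MHZ = 25.0
RAM_CLOCK_MHZ = 12.5                  # 16-bit transfers at half the pixel clock
TOTAL_PIXELS_PER_LINE = 800           # visible + borders + hsync

# One transfer takes two pixel clocks, so 400 transfer slots per scanline.
TRANSFERS_PER_LINE = int(TOTAL_PIXELS_PER_LINE * RAM_CLOCK_MHZ / PIXEL_CLOCK_MHZ)

def background_transfers(bits_per_pixel, visible_pixels=640):
    """Half-words (16-bit transfers) fetched per scanline for the backdrop."""
    return visible_pixels * bits_per_pixel // 16

for bpp in (1, 2, 4, 8):
    used = background_transfers(bpp)
    print(f"{2 ** bpp}-color: {used} transfers used, "
          f"{TRANSFERS_PER_LINE - used} left for the CPU and other masters")
# 2-color: 40 used, 360 left ... 256-color: 320 used, 80 left
```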
Adding Sprites
I think it's prudent to allow for at least two sprites even when the video bandwidth is maximally consumed. These sprites would be used for a mouse pointer and a text cursor, respectively. Let's further simplify the problem and say these are monochromatic sprites, just like the TMS9918A VDP and VIC-II have (when their sprites are not in multicolor mode).
Assuming we have 80 transfers left on a horizontal line, and assuming our sprites are 16-pixels wide, and we have two sprites on the same line presently, it follows that fetching the video data for these sprites will require two additional transfer slots. Our worst-case budget is now down to 78 transfers.
Of course, to even decide whether or not a sprite is visible on this scanline, the CGIA would need to fetch a set of sprite Y coordinates from memory. This activity would also consume two transfer slots, thus leaving us at 76 transfers left.
We need to know where on the line the sprite appears, and its preferred color. To accomplish this, we need two additional transfers per sprite, leaving us with 72 transfers left.
Or, to put it a different way, each sprite that is visible on a scanline would require four transfers: one for the Y coordinate, one for the X coordinate, one for the color to show the sprite in, and one for the sprite's raster data. With a worst-case budget of 80 transfer slots available, it follows that we can actually allow up to 20 sprites on a single horizontal line at once.
Obviously, if we don't drive the display with a 256-color screen, that leaves a whole lot more time slots available. For example, with a 16-color backdrop, we're left with 400 - 4*40 = 240 transfers for sprite handling. That's enough room to show up to 60 sprites on a single scanline.
Coordinating Bus Accesses
The CGIA fetches data in bursts. For the backdrop, it fetches an entire line's worth of data in a single request for the bus. The length of this transfer is configurable; let's say N half-words for now. After reading N half-words, it stores its findings in an internal line buffer.
Sprites would have to work the same way. The VDP works by interleaving different kinds of transfers. That will not work for the CGIA: interleaving individual memory accesses would require very sophisticated state machines. Treating each phase of the display as a separate bus master is much, much easier.
After the background fetch, the next step is sprite pre-processing. This phase involves reading M half-words from memory, each corresponding to a sprite's Y coordinate on the display. Like N, M would also be configurable by the user. To turn off sprite processing entirely, one would simply set M to zero. For each sprite whose Y coordinate potentially makes that sprite visible on this scanline, we queue the sprite number for later processing.
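As a sketch of that pre-processing step, assuming 16-pixel-tall sprites (my assumption; the real height is a design parameter), the queueing logic amounts to:

```python
SPRITE_HEIGHT = 16   # assumed sprite height, purely for illustration

def visible_sprites(y_coords, scanline):
    """Return the sprite numbers whose Y coordinate places them on this
    scanline; y_coords holds the M half-words fetched during pre-processing."""
    return [i for i, y in enumerate(y_coords)
            if y <= scanline < y + SPRITE_HEIGHT]

# Example: with M = 4 sprites at these Y positions, scanline 20 shows two.
print(visible_sprites([10, 18, 100, 200], 20))   # -> [0, 1]
```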
So far, we've used N+M time slots. If no sprites are visible on this scanline, then that's all the time we actually use. Otherwise, for each sprite that sits in the queue, we need to now load the sprite data, color, and horizontal position registers. This step is not under programmer control, since the number of cycles consumed in this phase will be determined by how many visible sprites were discovered above.
If 0 <= V <= M, then the total number of time slots taken is N+M+3*V.
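In code form, a minimal budget checker (the names are mine, for illustration):

```python
def slots_used(n, m, v):
    """Transfer slots consumed on one scanline: N background fetches,
    M sprite-Y fetches, and 3 more (X, color, data) per visible sprite."""
    assert 0 <= v <= m
    return n + m + 3 * v

# Worst case from above: 256-color backdrop (N=320), 20 sprites, all visible.
assert slots_used(320, 20, 20) == 400   # exactly the 400-slot line budget
```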
Conclusion
I've shown how one computes time budgets for video display subsystems. This knowledge gives the programmer the know-how to decide ideal color depth, resolution (although I didn't explain it), and number of sprites on the screen at once. As long as N+M+3*V <= 400, the CGIA should have no problems displaying your images.
Support for pattern graphics is not considered. The proper way of supporting pattern graphics without a lot of undue overhead remains elusive. Maybe after a good night's rest, I'll come up with some ideas. Otherwise, if worst comes to worst, there's always the VDP approach to doing things. :)
-
CPU Requirements Preview
05/15/2016 at 05:14 • 0 comments

I finally finished what I think is a good first draft of the requirements for the Kestrel-3's hardware CPU, and I made it available for review and commenting. (I'll move it into a more appropriate Github repository later.) Let me know what you think, especially if you find anything confusing. Whatever is confusing will need more documentation.
-
Instead of CGIA, go with a super-VDP instead?
05/11/2016 at 23:52 • 3 comments

After stumbling upon a fairly sizable collection of design notes and interviews from the inventor of the Texas Instruments TMS9918A VDP (video display processor), I began to entertain the possibility that maybe the Kestrel-3 should use a VDP-like video core to replace the MGIA, instead of my original CGIA (Configurable Graphics Interface Adapter) idea.
Recap: CGIA
I haven't disclosed many details about the CGIA anywhere, because I never had a need to; but this is what I've been thinking. Take the MGIA logic as-is, and expose some of its guts to the programmer. In particular, expose the DMA engine to the programmer and make it programmable. Video data would be streamed from a single buffer, using a single DMA fetch pointer. Every HSYNC, the CGIA would fetch the next N bytes from the video buffer into a waiting scanline buffer. N would be configurable via a control register. The fetch logic would not make any attempt to interpret the meaning of the bits it reads. Then, on the next HSYNC, the pending scanline buffer becomes the currently displayed buffer (leaving the formerly displayed buffer as the new pending buffer for DMA purposes). Here, the contents of the buffer would be clocked out at a configurable rate, with the bits routed to a palette register bank in a configurable manner. In this way, the programmer would have complete control over video bandwidth/CPU bandwidth tradeoffs.
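A toy model of that double-buffering scheme, in Python rather than Verilog for brevity; the class and method names are mine, and this glosses over all real-world timing:

```python
class CGIALineBuffers:
    """Double-buffered scanline fetch: while one buffer is displayed,
    the other is filled by DMA. Illustrative model only."""

    def __init__(self, n_bytes):
        self.n = n_bytes                    # programmable fetch length (N)
        self.pending = bytearray(n_bytes)   # being filled by DMA
        self.display = bytearray(n_bytes)   # being clocked out to the screen

    def hsync(self, ram, fetch_ptr):
        # Swap roles: last line's DMA target becomes this line's display data.
        self.pending, self.display = self.display, self.pending
        # Fetch the next N bytes verbatim; no interpretation of the bits.
        self.pending[:] = ram[fetch_ptr:fetch_ptr + self.n]
        return fetch_ptr + self.n           # advance the DMA fetch pointer
```

In hardware this would presumably be two block RAMs with a toggling select bit; the producer/consumer shape is the point.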
Recap: VDP and its Progeny
The TMS9918A exposes only a minimal amount of configuration to its user; coming out of reset, it is basically configured in a 32x24 character display (what they call "pattern graphics"). The configuration it does offer is, for example, whether or not you're using 4K or 16K dynamic RAMs, whether or not background video is genlocked, and where the various display tables will appear in video RAM. The number of colors supported and the monitor synchronization timing are all hard-wired to support NTSC television, not unlike the Kestrel-3's MGIA being hardwired to support IBM VGA. A completely separate chip had to be made a few years later to address the PAL market.
Later generations of the VDP, such as the V9938, V9958, and V9990, all added higher bandwidth paths to memory when they became available, support for higher color depths, planar as well as patterned graphics, etc.; but, they otherwise retained their hardwired video timing parameters. An open source clone of the TMS9918A that borrows some V9938 features is also available online.
Finally, there is the Gameduino 1.0 device, which is perhaps the first VDP-like video interface to actually drive SVGA monitors (800x600 at 72Hz, to be precise). Its feature list is most impressive; however, it is limited in its resolution (400x300 addressable pixels) and available on-screen colors (you can show lots of colors, but your palette selections are limited and highly optimized for tile-based graphics, such as you'd find on NES-like gaming consoles). It fills its niche quite well, but I don't think it's appropriate for the Kestrel-3.
My Thoughts
If I were to go the VDP-clone route (and it is indeed appealing for a number of reasons), I don't currently see any reason why I couldn't add the programmable video timing parameters I'd like to see. Moreover, I can have it bootstrap into a planar graphics video mode that is backward compatible with the MGIA, thus allowing existing system software to work with it. No need to recompile any existing graphics drivers except where color, sprites, or different resolutions are needed.
Benefits of my original CGIA concept:
- Programmable resolutions and HSYNC/VSYNC polarities to drive SVGA monitors.
- Exposes DMA timing to the programmer to let programmer decide tradeoff between color depth and CPU performance when accessing video memory.
- Single, flat plane of bitmapped graphics.
- Dot shifter (which interprets the data fetched from the above flat plane of graphics) would be configurable to use 1, 2, 4, 8, or 16 bits per pixel, thus affecting the number of colors visible on the screen at once. (A sketch of this follows the list below.)
- Significantly simpler implementation.
- Effortless vertical smooth scrolling by simply controlling the DMA pointer at the start of every frame.
- Extremely modular design; classic producer/consumer problem in computer science. Higher cohesion, lower coupling (at least at the Verilog level).
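To illustrate the dot-shifter idea, here's a minimal Python sketch of unpacking one 16-bit fetch into palette indices at a configurable depth; MSB-first pixel ordering is my assumption.

```python
def unpack_halfword(halfword, bpp):
    """Split a 16-bit half-word into palette indices at bpp bits per pixel.

    Assumes the leftmost pixel lives in the most significant bits.
    """
    assert bpp in (1, 2, 4, 8, 16)
    mask = (1 << bpp) - 1
    return [(halfword >> shift) & mask
            for shift in range(16 - bpp, -1, -bpp)]

print(unpack_halfword(0b1010101010101010, 1))   # 16 one-bit pixels: 1,0,1,0,...
print(unpack_halfword(0x1234, 4))               # 4 four-bit pixels: [1, 2, 3, 4]
```

Each index would then select a palette register, per the configurable routing described above.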
Detriments of my original CGIA concept:
- No sprites. They'd need to be emulated in software. For high color-depth screens, this would be a performance bottleneck.
- No pattern graphics modes, a.k.a. no character display modes. This means you cannot poke a single byte into video memory and expect a complete character to be displayed. You have to write console drivers which place glyphs onto the display. (This is already done in the Kestrel-3's firmware now, but it costs runtime performance.)
- No horizontal smooth scrolling. You have to emulate this entirely in software. The CGIA is not optimized for game-play, but rather for productivity.
Benefits of adopting a VDP-like architecture:
- I can add programmable HSYNC, VSYNC timing and polarity control easily enough.
- I can, albeit with somewhat less flexibility, still expose CPU/video bandwidth tradeoffs to the programmer.
- Support for planar or chunky graphics can be added easily enough. V9958 proved that. This implies effortless vertical smooth scrolling as well.
- Support for pattern graphics/character modes.
- Sprites. Potentially lots of them. On a 640x480 VGA display, and if my math is right, I can probably pull off between 16 and 20 (monochrome) sprites on a single scan-line at any given time. With a 1024-pixel-wide, monochrome display, probably closer to 40. This won't win many awards from arcade game aficionados, but it is more than adequate for supporting WIMPy user interfaces that support effortless drag-n-drop of icons.
Detriments of adopting a VDP-like architecture:
- Significant complexity compared to my original CGIA concept.
- The internal design will require a greater number of internal dependencies.
- Still no obvious way to support horizontal smooth scrolling.
I can't think of any further disadvantages. I should point out that the lack of horizontal smooth scrolling stems from the same root cause in both the VDP and CGIA designs; any solution I come up with for one would apply equally to the other.
So I'm curious about your thoughts; should I try to go with a VDP architecture? Would this make the Kestrel-3 more appealing for others to use or program for? Should I stick with a simpler graphics architecture and rely on improving CPU performance to make up for sluggish frame rates in the future?