-
We're not dead yet!
04/17/2020 at 17:49 • 0 comments
It's... been a while since I last added a log. The great news is that I once again have DOOM running on the platform, this time using the DE10-Standard. I also made substantial improvements to all the interconnects and modules:
- Wishbone-compliant interfaces, including many that support pipelined and block transfers
- Updated code style to reflect some additional experience on my part
- Updated to use SystemVerilog, at least to the degree I know it
- Switched the system from Princeton to Harvard architecture
- More implicit synthesis of Intel's IP, to make things like FIFOs and dual port RAM more portable
- Some of the core elements (mostly the CPU) have Verilator tests and an easy to use test suite
I also took the opportunity to aggressively cut down sections of the DOOM code that aren't relevant to my port (at least right now). That included all of the sound support, argument parsing, and anything related to the game mode - it's hardcoded to retail. Removing all of the parsing, strings and conditional logic stripped down the binary by a fairly large amount.
I can post another video, but at the moment it's not very different from the old one.
What this really gives me is a better baseline to experiment with things that will improve performance of the game. In particular, I'm looking at a few options right now fairly carefully:
- Modifying gcc to pass at least some arguments in registers instead of always on the stack. Memory access is expensive, even with a cache.
- Finishing the pipelined version of the CPU. It's mostly there, but it fails certain regression tests related to exceptions and I need to find the deadlocks.
- Adding a DMA controller that will allow for fast block transfers between regular and video memory without the CPU.
- Profiling the game to see where most of the time is spent. I've already done this in a crude way, and not surprisingly, it seems like most of the time is spent recursively evaluating the player's visual field to determine what needs to be rendered (thus the arg handling above). There may be other places that jump out if I do more profiling.
So that's about where we are right now. If you've been watching this project for a while, thanks! Let me know if there's anything in particular you are interested in seeing, or if you have any ideas on where to go from here.
-
Improved Interconnect
06/29/2017 at 19:13 • 1 comment
To go beyond the performance I have right now, I think I'm going to have to spend some time improving memory access. It just takes too many cycles to access memory, even with the cache helping a fair bit.
One option is to implement pipelined memory access and build a new CPU opcode that will support block memory moves. Another option would be to do another run of compacting the ISA so that there isn't as much dead space. That would have a tradeoff in decoding complexity, but I don't really think there's much of an issue there.
For the purposes of DOOM, I think the real problem though is in making the compiler smarter about using registers vs memory locations. I think that's likely the best place to start, and should mean the least amount of complexity as well. Along similar lines, I should likely try to start passing at least a few of the function parameters in registers instead of forcing it all on the stack.
Let me know your thoughts, and if there's one area you find more interesting or useful than another, tell me why.
-
More Speed
12/11/2016 at 02:00 • 0 comments
As with all good hobbies, I went on a couple of tangents recently. While I was waiting for the new daughterboard, I decided to spend some time on speed improvements.
The first step was the easiest: I just doubled the clock speed from 50MHz to 100MHz. The external memory (SDRAM and SSRAM) had plenty of headroom, and I took the opportunity to parameterize all of the modules that use their own clock (SPI, I2C, UART, etc). That helped, but not enough. The memory operations were just too inefficient.
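As a rough illustration of why parameterizing those modules matters: a peripheral that derives its own clock from the system clock needs a different divider once the system clock doubles. The sketch below is Python rather than the project's Verilog, and the 115200 baud UART rate and the names `clock_divider` and `rate_error` are assumptions chosen for the example.

```python
def clock_divider(sys_clk_hz: int, target_hz: int) -> int:
    """Return the integer divider whose output rate is closest to target_hz."""
    if sys_clk_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    return max(1, round(sys_clk_hz / target_hz))


def rate_error(sys_clk_hz: int, target_hz: int) -> float:
    """Relative error between the achievable rate and the requested rate."""
    achieved = sys_clk_hz / clock_divider(sys_clk_hz, target_hz)
    return abs(achieved - target_hz) / target_hz


if __name__ == "__main__":
    BAUD = 115_200  # hypothetical UART rate, not taken from the project
    for sys_clk in (50_000_000, 100_000_000):
        print(sys_clk, clock_divider(sys_clk, BAUD), rate_error(sys_clk, BAUD))
```

With the divider computed from a parameter instead of hardcoded, the same module source works at either system clock.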
The next step was to add simple write FIFOs for memory modules. The idea here is that writes can return immediately because the CPU isn't expecting a response. If the CPU makes a read request, that read will stall until the write FIFO is empty, ensuring that there is consistency. I put this FIFO between the CPU and the memory controllers so that it would do the most good. It helped, but wasn't a game changer.
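The behavior of such a write FIFO can be sketched as a small software model. This is only an illustration in Python, not the actual Verilog: the class name `PostedWriteFifo`, the depth of 4, and the `tick` method standing in for one memory write cycle are all assumptions.

```python
from collections import deque


class PostedWriteFifo:
    """Toy model of a posted-write FIFO in front of a slow memory."""

    def __init__(self, memory, depth=4):
        self.memory = memory      # dict of addr -> value, stands in for the memory
        self.depth = depth        # hypothetical FIFO depth
        self.pending = deque()    # queued (addr, value) writes
        self.stall_cycles = 0     # cycles the CPU spent waiting

    def tick(self):
        """One memory write cycle: retire the oldest queued write, if any."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def write(self, addr, value):
        """CPU write: returns at once unless the FIFO is already full."""
        while len(self.pending) >= self.depth:
            self.tick()
            self.stall_cycles += 1
        self.pending.append((addr, value))

    def read(self, addr):
        """CPU read: stalls until every queued write has reached memory."""
        while self.pending:
            self.tick()
            self.stall_cycles += 1
        return self.memory.get(addr, 0)
```

Draining the whole queue before a read is the simplest way to keep reads consistent with earlier writes. It costs more stall cycles than checking the queue for a matching address would, which may be part of why the gain was modest.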
As most of you probably already know, SDRAM isn't particularly efficient at single word operations. You need to "open" a row of memory, do some operations, and then "close" it again. The overhead of the open and close operations is huge when you only do a single read or write, which is what I was doing up until this point.
By adjusting the cache memory system and the SDRAM controller, I was able to enable pipelined operations, making memory access a lot more efficient. Since I already had the cache controller handling 4 words in a cache line, it wasn't particularly hard to enable a 4 word pipeline. The nice thing is that all of this is abstracted away from the CPU - it can still use single word operations and gain some of the benefits simply because the content gets into the cache memory more quickly. This and increasing the cache size to 4k words (from 1k) made a substantial difference in performance. I'm starting to use the Doom load time as my benchmark for these things, and I've now got it down to about 1m20s from program load to menu.
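To put rough numbers on the open/close overhead, here is a back-of-the-envelope cycle count in Python. The timing values (3 cycles each for row activate, CAS latency, and precharge) are hypothetical placeholders, not figures from the SDRAM part on this board, and the model ignores refresh and rows that stay open.

```python
def cycles_for_words(n_words, burst_len, t_activate=3, t_cas=3, t_precharge=3):
    """Cycles to read n_words when every access opens a row, bursts, and closes it."""
    if n_words < 0 or burst_len < 1:
        raise ValueError("n_words must be >= 0 and burst_len >= 1")
    bursts = -(-n_words // burst_len)  # ceiling division
    per_burst = t_activate + t_cas + burst_len + t_precharge
    return bursts * per_burst


if __name__ == "__main__":
    single = cycles_for_words(4, burst_len=1)  # four separate single-word reads
    burst = cycles_for_words(4, burst_len=4)   # one 4-word pipelined read
    print(single, burst)
```

Under these assumed timings, filling a 4 word cache line takes 40 cycles as four single reads and 13 cycles as one burst, which shows why matching the burst length to the cache line pays off.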
There are so many additional optimizations that I can still apply:
- Block memory transfer CPU instructions
- Improving the SDRAM controller to keep rows open longer and increase the time between refresh cycles
- Improving the SSRAM controller to handle burst operations (should help with frame rate)
I may tackle some of these in the near future.
-
New stuff
11/17/2016 at 21:13 • 0 comments
I know it's been a while since I updated this project. I'm better at working on the projects than documenting them. I've been focused on some other projects for a while, but now I've got some time and renewed interest, so expect updates soon.
I just finished spinning a new IO board for this project. It's a simple one that incorporates an RTC chip, a codec with line in/out and headphone out, PS/2 keyboard, and an aux output for the LED matrix. I've been working on an I2C master instance for my SoC, which I can then use to program the codec. Ultimately, I want to use the codec to serve as an output for the PCM/WAV audio for the game. So the next update will talk about the audio interface and progress on integrating the audio into Doom.
-
Cache and new codec
01/03/2016 at 20:57 • 0 comments
I picked up the Adafruit Codec module recently, and I'm working to integrate it into the system design. It's SPI based and understands how to process both MIDI and WAV, and so I'm hopeful that I'll be able to set this thing up to play sound effects and music from Doom without a lot of pain.
I decided to ditch the joystick for a PS/2 keyboard interface. I've added a second interface for mouse input for good measure. More details on those if there's interest, but there are a lot of examples of how to do this on the net. There is a status register that tells the program how many events are in the queue, and reading the output register reduces the queue by one. I'm passing a slightly modified version of the standard scan codes to the application at this point, since there are no device drivers or OS at the moment.
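The status/output register pair described above leads to a simple polling loop on the software side. The sketch below models it in Python; the class `KeyboardPort`, its method names, the queue depth of 16, and the drop-when-full policy are assumptions for illustration, not the project's actual register map.

```python
from collections import deque


class KeyboardPort:
    """Toy model of a keyboard peripheral with a status and an output register."""

    def __init__(self, depth=16):
        self.depth = depth    # hypothetical queue depth
        self.queue = deque()

    def push_scan_code(self, code):
        """Hardware side: queue a scan code, dropping it if the queue is full."""
        if len(self.queue) < self.depth:
            self.queue.append(code)

    def read_status(self):
        """Status register: number of events waiting in the queue."""
        return len(self.queue)

    def read_output(self):
        """Output register: return the oldest event and shrink the queue by one."""
        return self.queue.popleft() if self.queue else 0


def drain_events(port):
    """Software side: poll the status register until the queue is empty."""
    events = []
    while port.read_status() > 0:
        events.append(port.read_output())
    return events
```

A game loop would call something like `drain_events` once per frame and translate the codes into its own key events.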
I'm also working on a cache interface for the SDRAM. I've actually added a new parent module for the cache and SDRAM, and by setting a control signal, I can enable or disable the cache. While this isn't something I would do randomly, it does allow me to test behavior and performance very easily.
The cache design is very simple right now. There are 256 cache slots, each of which can hold 4 words. Cache fill is done 4 words at a time, and if a flush is needed, there are separate dirty bits for each word so that only the changed values need to be written out. Right now this will likely be a slight reduction in performance, but when I get the bugs worked out I intend to enable pipelined SDRAM reads, which coincidentally will allow 4 words to be clocked out in one shot. When that's done things should be a lot faster.
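A software model makes the per-word dirty bits easier to picture. This Python sketch uses the 256 slots and 4 word lines from the description, but it assumes a direct-mapped, write-back organization with word addressing, which the text does not state. The class and field names are made up for the example.

```python
LINE_WORDS = 4   # words per cache line (from the description)
NUM_SLOTS = 256  # number of cache slots (from the description)


class WordDirtyCache:
    """Toy direct-mapped write-back cache with one dirty bit per word."""

    def __init__(self, backing):
        self.backing = backing  # list of words standing in for the SDRAM
        self.tags = [None] * NUM_SLOTS
        self.data = [[0] * LINE_WORDS for _ in range(NUM_SLOTS)]
        self.dirty = [[False] * LINE_WORDS for _ in range(NUM_SLOTS)]
        self.words_written_back = 0

    def _flush(self, slot):
        """Write back only the words whose dirty bit is set."""
        if self.tags[slot] is None:
            return
        base = (self.tags[slot] * NUM_SLOTS + slot) * LINE_WORDS
        for i in range(LINE_WORDS):
            if self.dirty[slot][i]:
                self.backing[base + i] = self.data[slot][i]
                self.words_written_back += 1

    def _locate(self, addr):
        """Return (slot, offset) for a word address, filling the line on a miss."""
        line, offset = divmod(addr, LINE_WORDS)
        tag, slot = divmod(line, NUM_SLOTS)
        if self.tags[slot] != tag:
            self._flush(slot)
            base = line * LINE_WORDS
            self.data[slot] = list(self.backing[base:base + LINE_WORDS])  # 4 word fill
            self.dirty[slot] = [False] * LINE_WORDS
            self.tags[slot] = tag
        return slot, offset

    def read(self, addr):
        slot, offset = self._locate(addr)
        return self.data[slot][offset]

    def write(self, addr, value):
        slot, offset = self._locate(addr)
        self.data[slot][offset] = value
        self.dirty[slot][offset] = True
```

When a line with a single modified word is evicted, only that one word goes back to memory instead of all four.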
-
Video Demo
12/11/2015 at 03:22 • 1 comment
-
Detailed VGA controller description
12/11/2015 at 03:19 • 2 comments
As I mentioned earlier, the VGA controller was an interesting part of the design for me. While not perfect, it addresses my immediate needs, and there are several opportunities to tweak the design. For example, right now the 320x240 double pixel mode is hardcoded, but the mode could easily be moved into a control register to allow the CPU to change the video mode as needed.
Here's a block diagram of all the major parts:
There are three major components: the VGA out, and the two bus interfaces.
Bus Slave Interface
This is the simplest of the three. This allows the CPU to make updates to the four palette maps. For reasons explained in the section below, these four contain the same content and are quite small - 256 24-bit words. In the future I could make these slightly larger and have the ability to palette swap.
Bus Master Interface
This one is a little more complex. The model is to fetch a scanline of data at a time from the video RAM and store it locally. The state machine uses the change of the vertical row as the trigger to start the load process. This gives the memory fetch a bit of a head start, since we have the whole horizontal blanking interval before the VGA interface starts to use the data.
Since this is 8-bit pseudocolor, each 32-bit word from video memory represents 4 pixels. The values are used as an index into the palette map, and the resulting 24-bit color is put into the scanline memory. So that we can do all of these lookups in as few clock cycles as possible, we use 4 parallel palette maps, and 4 separate scanline memories - each responsible for 1/4th of the scanline. The addressing for each of these is again driven by the state machine.
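The unpack-and-lookup step can be modeled in a few lines of Python. This is a sketch of the idea rather than the real state machine: the function name `expand_scanline` is invented, and placing the first pixel in the low byte of each word is an assumption about byte order.

```python
def expand_scanline(words, palette):
    """Unpack 32-bit words into 4 pixel indices and store the looked-up
    24-bit colors in 4 separate lane memories (one lane per byte position)."""
    lanes = [[], [], [], []]
    for word in words:
        for lane in range(4):
            index = (word >> (8 * lane)) & 0xFF  # assumption: pixel 0 is the low byte
            lanes[lane].append(palette[index])   # hardware does the 4 lookups in parallel
    return lanes


if __name__ == "__main__":
    # Hypothetical grayscale palette: entry i maps to the 24-bit color 0xiiiiii.
    gray = [i * 0x010101 for i in range(256)]
    print(expand_scanline([0x03020100, 0x07060504], gray))
```

In software the inner loop runs four times per word. Four identical palette maps in hardware let all four lookups happen in the same clock cycle, which is why the maps are duplicated.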
VGA Interface
This component uses a fairly standard set of counters to determine the relevant portions of the VGA control signals such as the horizontal and vertical sync, and the column position is used to select the correct values from the scanline memories. The upper bits select the word address from all of the memories, and the lower two bits select which of the memories to use via a mux.
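The address split on the read side is small enough to show directly. In this Python sketch, `lanes` stands for the four scanline memories and `column` for the pixel column within the stored scanline; both names are assumptions made for the example.

```python
def pixel_at(lanes, column):
    """Select one pixel: upper bits address all four memories, low two bits pick one."""
    word_addr = column >> 2  # same address presented to every scanline memory
    lane = column & 0x3      # mux select
    return lanes[lane][word_addr]


if __name__ == "__main__":
    # Four tiny lane memories holding pixels 0..7, interleaved by column.
    lanes = [[0, 4], [1, 5], [2, 6], [3, 7]]
    print([pixel_at(lanes, c) for c in range(8)])
```

Reading columns 0 through 7 in order returns the pixels in their original sequence, so the interleaving is invisible to the VGA output stage.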
Another thing to note is that you have two clock domains in this module, represented by the dotted line in the diagram. I attempted to minimize the number of signals that needed to cross that boundary, since they all need to be synchronized.
-
System Block Diagram
12/11/2015 at 02:55 • 0 comments
Here's a high level view of the overall system architecture:
The interesting thing to note here is that both the VGA controller and the CPU are bus masters. However, since the only device/bus that the VGA controller cares about is the SSRAM, the SSRAM bus is separate from the primary system bus. Since the CPU is the only writer to the memory, this also means that we can simplify the data paths. The bus arbiter and the SSRAM datapaths are shown on one side of the CPU, with the rest of the devices on the primary system bus on the other.
Note also that the VGA controller is also a slave on the CPU bus. This is because the CPU needs to be able to update the palette maps. More on that in the next update.
-
Custom Video (or the power of FPGA)
12/09/2015 at 16:27 • 3 comments
There are a lot of howto articles about building VGA clocks in FPGAs. My favorite is the Pong Game. More complex for me was how to take that VGA clock and use it to build a true framebuffer for a CPU. This introduces two new challenges related to memory bandwidth and multiple clock domains. I'll describe how I implemented my framebuffer in Verilog in a later article, but here I thought I'd share one of the interesting things I did when porting DOOM that was made trivial on an FPGA platform.
The native DOOM code uses a 320x200 display in 8-bit pseudocolor. What this means is that there is a palette of 256 slots that have the actual 24-bit RGB values in them, and the framebuffer memory only stores the index into that palette map. The framebuffer uses that index to determine what color to display.
My initial framebuffer design was to have a 640x480 display, and that's still what is presented on the VGA port. The challenge I ran into with DOOM was that when it wrote to the framebuffer, the output wouldn't look right because it was drawing two lines of output to every line on the screen.
The solution in the DOOM code is that they have logic to do pixel doubling - they in effect copy the pixel value for all even numbered pixels (x,y) to (x+1,y), (x,y+1), and (x+1, y+1). That's great when you have really fast memory, but in effect you are sending 4x the amount of data you need to the frame buffer, which is pretty inefficient.
The solution was to add a new video mode for the framebuffer so that it did the pixel doubling in hardware. It was a fairly easy change, essentially just ignoring the low two bits of the cursor position when pulling data from the framebuffer. This made screen updates much faster, and the net result was exactly the same.
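Hardware pixel doubling amounts to a change in address mapping. The Python sketch below reads "the low two bits of the cursor position" as the low bit of the column plus the low bit of the row, which is an interpretation and not something the text confirms. The 320 pixel source width matches the double pixel mode, and the function name is made up.

```python
SRC_WIDTH = 320  # framebuffer width in the double pixel mode


def framebuffer_index(screen_x, screen_y):
    """Map a 640x480 screen position to a pixel index in the 320-wide framebuffer."""
    # Dropping the low bit of each coordinate makes every 2x2 block of
    # screen pixels read the same framebuffer entry.
    return (screen_y >> 1) * SRC_WIDTH + (screen_x >> 1)


if __name__ == "__main__":
    block = [(10, 20), (11, 20), (10, 21), (11, 21)]
    print({framebuffer_index(x, y) for x, y in block})
```

The CPU writes each pixel once and the display logic repeats it, so the software copy to (x+1,y), (x,y+1), and (x+1,y+1) is no longer needed.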
I'll post a video about the VGA framebuffer and do a walkthrough of the code soon. The relevant code is vga_master.
-
It Works!
12/09/2015 at 04:25 • 0 comments
I have the basics down - the program will load, load the WAD file, and render the player views. Controls are crude, but I was able to map some of the keyboard commands to the analog joystick and pushbuttons to get things tested. I have a Doom on FPGA video up on YouTube.
Some next steps:
- Improve performance. We're getting only about 1 FPS right now, and most of that is due to memory operations being so expensive. I will work on more advanced things like cache memory in the core and pipelining, but for now I think I'm going to look at ways to speed up the CPU clock. Right now it's running at 50 MHz, and with some tuning I should be able to improve that substantially.
- Add sound. I believe that the raw sound in the WAD files is actually MIDI, which is ideal. I can either push that out a serial port to a MIDI interface, or I can build a peripheral that will interface MIDI with an external codec.
- Add keyboard. I have a PS/2 keyboard IP core I can add, I just need to integrate it into the FPGA design and decide how I want to pass events to the CPU. This will allow me to drop the hack I currently have for IO.
- Add network. This might be a bit of a stretch. I'd like to get the integrated Ethernet PHY working on the DE2i-150 board, but for this project I might be better off using one of the WizNet chips and offloading the majority of the TCP stack work. Making newlib library stubs for the TCP stack stuff could work, but will likely require more work on the library and hardware side. Specifically, I think it will require at minimum that I implement a working timer system as well as an interrupt handler. With those two, I have the basics that would allow me to handle external events without a polling loop explicitly in the Doom code.