PSRAM Challenges
The PSRAM presented numerous challenges throughout development. One issue arose from the SSI clock divider, which could only be set to even integers. When the RP2040s were originally clocked at 266MHz to meet N64 SRAM timings and the divider was set to 2 (resulting in a 133MHz QSPI), games experienced erratic errors and often failed to boot.
A 266/4 configuration (66MHz QSPI) was insufficient for achieving "stock" N64 bus speeds. On the PicoCart-Lite ("v1"), this wasn't a problem, as the 266/2 setting (133MHz QSPI) with short data lines to flash proved reliable and allowed for stock speeds.
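The divider arithmetic behind those numbers can be sketched as a small helper. This is only an illustration of the constraint described above (even dividers only); the function names `ssi_divider_is_valid` and `qspi_clock_khz` are made up for this example and are not part of the firmware or the Pico SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per the text above: the SSI clock divider can only be an even integer. */
static bool ssi_divider_is_valid(uint32_t divider) {
    return divider >= 2 && (divider % 2) == 0;
}

/* QSPI clock in kHz for a system clock (kHz) and divider.
 * Returns 0 when the divider is not a legal (even) value. */
static uint32_t qspi_clock_khz(uint32_t sys_clock_khz, uint32_t divider) {
    if (!ssi_divider_is_valid(divider)) {
        return 0;
    }
    return sys_clock_khz / divider;
}
```

For example, `qspi_clock_khz(266000, 2)` gives 133000 (133MHz) and `qspi_clock_khz(266000, 4)` gives 66500, with nothing available in between because a divider of 3 is rejected.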
This project marked a turning point in my learning journey, as my background in software engineering had previously led me to treat hardware as a weekend project. In the past, I had the luxury of "ignoring air resistance," but this endeavor demanded a thorough consideration of such variables.
Reflections, line impedance, terminating resistors, and topology all became familiar terms as a fellow Discord member and I grappled with hardware gremlins. This helpful individual even went so far as to reroute the PSRAM data lines, add terminating resistors, and assist me in resolving hardware issues over several months.
After attempting to overclock the RP2040s to a 360/4 setting (90MHz QSPI), I achieved stock speeds with mostly stable data reliability. Another hardware revision incorporating termination resistors and the 360/4 configuration appeared to be the solution. However, when testing additional boards from my batch, I discovered that at least one of them failed to operate at these frequencies.
The Quest For Data
Some notes on how the N64 bus works. The N64 sends a 32-bit address in two halves: the upper 16 bits are sampled on the falling edge of ALEH (address latch enable high), then the lower 16 bits are sampled on the falling edge of ALEL (address latch enable low). There is then a delay before the read line goes low, which is the cart's cue to have 16 bits of data ready on the data lines.
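As a rough sketch of the address phase, the two 16-bit halves can be combined like this. The helper names are hypothetical, and how the real firmware splits this work between PIO and the CPU is not shown here; the ROM range bounds are the ones used in the read loop later in this post.

```c
#include <stdint.h>

/* Combine the half latched on ALEH (upper 16 bits) with the half
 * latched on ALEL (lower 16 bits) into one 32-bit bus address. */
static uint32_t n64_assemble_address(uint16_t high_half, uint16_t low_half) {
    return ((uint32_t)high_half << 16) | low_half;
}

/* Cartridge ROM range check (Domain 1, Address 2). */
static int n64_is_cart_rom(uint32_t addr) {
    return addr >= 0x10000000u && addr <= 0x1FBFFFFFu;
}
```

With halves `0x1000` and `0x0040`, the assembled address is `0x10000040`, which falls inside the cartridge ROM range.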
The time between ALEL going low and the read line going low depends on the N64 bus speed, but can be as short as 1us. This gives us some time to "prefetch" a half-word of data in anticipation of the read line going low.
How long the read line stays low, and the pulse that follows to latch the read data, are also affected by the N64 bus speed.
- Stock speeds are 0x12 = 18 n64 cycles @ 62.5MHz.
- The read line is low for roughly 300ns
- The n64 fetches 32bits, the read line pulses for about 60ns then goes low again.
- Reads are finished once ALEH is asserted.
- So at the stock speeds, this should give us about 300ns to fetch and get data ready.
- The QSPI hardware fetches 32 bits of data.
- For the PSRAM chips this means 22 clocks per fetch: 8-bit command, 24-bit address, 6 wait cycles, 32 bits of data.
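The 22-clock figure can be reproduced with a short calculation. This sketch assumes every phase of the transfer (command, address, and data) moves 4 bits per clock on the quad bus, which is the only reading that adds up to 22; the function name `qspi_read_clocks` is invented for the example.

```c
#include <stdint.h>

/* Assumption: all phases run in quad mode, 4 bits per clock. */
enum { QSPI_BITS_PER_CLOCK = 4 };

/* Total QSPI clocks for one PSRAM read transaction. */
static uint32_t qspi_read_clocks(uint32_t cmd_bits, uint32_t addr_bits,
                                 uint32_t wait_cycles, uint32_t data_bits) {
    return cmd_bits / QSPI_BITS_PER_CLOCK    /* 8-bit command  -> 2 clocks */
         + addr_bits / QSPI_BITS_PER_CLOCK   /* 24-bit address -> 6 clocks */
         + wait_cycles                       /* 6 wait cycles              */
         + data_bits / QSPI_BITS_PER_CLOCK;  /* 32 data bits   -> 8 clocks */
}
```

Calling `qspi_read_clocks(8, 24, 6, 32)` returns 22. Note that only 8 of those clocks carry data, so most of each fetch is command, address, and wait overhead.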
Once we have the address we have 1us to prefetch the first half-word. In that 1us time:
- Set the right chip to access via the demux
- Setup the DMA to read from the xip pointer address at the appropriate transformed location
- Wait for the read to complete
We then wait for the read line to go low:
- Put the data from the dma buffer into the PIO tx fifo
- Start the DMA read for the next address. We assume there will be another read, as the N64 can read up to 256 words before sending a new address, although it can also be fewer.
Here is what that code looks like:

    if (last_addr >= 0x10000000 && last_addr <= 0x1FBFFFFF) {
        // Domain 1, Address 2 Cartridge ROM

        // Change the banked memory chip if needed
        tempChip = ((last_addr >> 23) & 0x7) + 1; // psram_addr_to_chip(last_addr);
        if (tempChip != g_currentMemoryArrayChip) {
            g_currentMemoryArrayChip = tempChip;
            // Set the new chip
            psram_set_cs(g_currentMemoryArrayChip);
        }

        // Set the correct read address (writing the trigger alias starts the DMA)
        dma_hw->ch[dma_chan].al3_read_addr_trig = (uintptr_t)(ptr16 + (((last_addr - g_addressModifierTable[g_currentMemoryArrayChip]) & 0xFFFFFF) >> 1));

        do {
            // Wait for value from psram
            while (dma_hw->ch[dma_chan].al1_ctrl & DMA_CH0_CTRL_TRIG_BUSY_BITS) { tight_loop_contents(); }

            // Move the value out of the buffer so we can kick off the next fetch
            next_word = dmaValue;

            // Kick off next value fetch in the background
            dma_hw->multi_channel_trigger = 1u << dma_chan;

            // Wait for pio to see read line go low or ALEH happened
            // (0x100 = RX FIFO empty flag for state machine 0)
            while ((pio->fstat & 0x100) != 0) { tight_loop_contents(); }

            addr = pio->rxf[0];
            if (addr == 0) { // if read line was low
                // READ
                pio->txf[0] = next_word;
                last_addr += 2;
            } else if (addr & 0x00000001) {
                // WRITE
                // Ignore data since we're asked to write to the ROM.
                last_addr += 2;
            } else {
                // New address, ALEH is asserted
                break;
            }
        } while (1);
    }
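The chip-select expression in that loop picks one chip per 8MiB of address space, using bits 23 to 25 of the N64 address. The sketch below isolates that mapping so it can be checked on a PC. The offset helper is a simplification I am assuming for illustration: the real code subtracts an entry from `g_addressModifierTable`, whose contents are not shown in this post, so `psram_halfword_index` is hypothetical and may not match it.

```c
#include <stdint.h>

/* Chip index 1..8, same expression as in the read loop:
 * bits 23..25 of the address select one chip per 8 MiB. */
static uint32_t psram_chip_for_address(uint32_t n64_addr) {
    return ((n64_addr >> 23) & 0x7u) + 1u;
}

/* ASSUMPTION (not the author's table lookup): treat the low 23 bits as the
 * byte offset inside the chip, then convert to a 16-bit half-word index. */
static uint32_t psram_halfword_index(uint32_t n64_addr) {
    return (n64_addr & 0x7FFFFFu) >> 1;
}
```

For example, `0x10000000` maps to chip 1 and `0x10800000` (8MiB further in) maps to chip 2.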
While this process seems simple enough, it was difficult to pin down when to make the next DMA fetch to maximize the N64's bus speed (i.e. get as close to 0x12 as possible).
For slower RP2040 clock speeds, where a smaller divider can be used and the QSPI bus is therefore faster (e.g. 200/2), I found that timings were better if the next word was fetched AFTER setting `pio->txf[0] = next_word`. The code posted is for 266/4 and comfortably hits 0x20 timings.
Here are my notes while I was testing clock/divider settings and finding the tightest timings that allowed games to be played.
- 300/4 -> boots 0x1540 (336ns), (Moved DMA fetch) -> 0x1C40 (448ns) (112ns diff), 22 * qclk = 293.333ns, 13 * pclk = 42ns
- 210/2 -> boots 0x2040 (512ns), (Moved DMA fetch) -> 0x1C40 (448ns) (64ns diff), 22 * qclk = 209.524ns, 64 * pclk = 302ns
- 180/2 -> boots 0x2E40 (736ns), (Moved DMA fetch) -> 0x2240 (544ns) (192ns diff), 22 * qclk = 245ns, 89 * pclk = 491ns
- 160/2 -> boots 0x3D40 (976ns), (Moved DMA fetch) -> 0x2740 (624ns) (352ns diff), 22 * qclk = 275ns, 113 * pclk = 701ns
- 140/2 -> boots 0x4D40 (1232ns), (Moved DMA fetch) -> 0x3340 (816ns) (416ns diff), 22 * qclk = 315ns, 129 * pclk = 917ns
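Two of the columns in these notes can be recomputed with a few lines of C. This is a sketch with made-up helper names. It assumes the nanosecond figure next to each bus value is the value's high byte counted in 16ns cycles (62.5MHz), which is inferred from the numbers lining up (0x15 = 21 cycles = 336ns) rather than stated anywhere.

```c
#include <stdint.h>

/* Time in picoseconds for `clocks` QSPI clocks when the QSPI clock is
 * sys_mhz / divider. One clock lasts 1e6 * divider / sys_mhz picoseconds. */
static uint64_t qspi_time_ps(uint32_t sys_mhz, uint32_t divider, uint32_t clocks) {
    return (uint64_t)clocks * 1000000u * divider / sys_mhz;
}

/* ASSUMPTION: the high byte of values like 0x1540 is a cycle count at
 * 62.5MHz, i.e. 16ns per cycle. */
static uint32_t bus_value_to_ns(uint16_t value) {
    return (uint32_t)(value >> 8) * 16u;
}
```

`qspi_time_ps(300, 4, 22)` comes out at 293333ps, matching the 293.333ns in the first entry, and `bus_value_to_ns(0x1540)` gives 336. The 180/2 and 140/2 entries compute to roughly 244.4ns and 314.3ns, so the notes appear to round those up.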
I still haven't figured out exactly where all my clock cycles are being spent when cases like 336/4 and even 330/4 should theoretically have enough time to make the latches. The pclk calculations are guesses based on the known time to fetch data from the psram chips and the tightest n64 bus timings.
I tried DMA'ing into an array using full-word transfers instead of 16-bit DMA reads, consuming the array as the DMA wrote to it. That resulted in even slower bus patch speeds, likely due to memory contention.
When attempting to allow for increased SRAM read/write timings, I discovered it takes the RP2040 37ns at 360MHz to read from a statically allocated array. I wrote a small test function that read from the array 1 million times in a loop (`word = array[0];`), timed using `time_us_32()` at the start of the loop with the difference taken once the loop finished. This seems like a very long time to read data from an array.
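For reference, a measurement of that kind might look like the sketch below. It is not the original test function. The timer is passed in as a function pointer so the same code can run on a PC; on the RP2040 you would pass `time_us_32`. The `volatile` qualifier is there so the compiler cannot hoist the read out of the loop, and `fake_now_us` is a host-only stand-in invented for this example.

```c
#include <stdint.h>

typedef uint32_t (*now_us_fn)(void);

/* Average time per loop iteration in nanoseconds. The result includes loop
 * overhead (increment, compare, branch), not just the array read itself. */
static uint32_t average_read_ns(const volatile uint32_t *array,
                                uint32_t iterations, now_us_fn now_us) {
    if (iterations == 0) {
        return 0;
    }
    uint32_t sink = 0;
    uint32_t start = now_us();
    for (uint32_t i = 0; i < iterations; i++) {
        sink += array[0];  /* volatile read, performed on every iteration */
    }
    uint32_t elapsed_us = now_us() - start;
    (void)sink;
    return (uint32_t)(((uint64_t)elapsed_us * 1000u) / iterations);
}

/* Host-only stand-in for a microsecond timer: reports 0us on the first
 * call and 37000us on every later call. */
static uint32_t fake_time_calls;
static uint32_t fake_now_us(void) {
    return (fake_time_calls++ == 0) ? 0u : 37000u;
}
```

At 360MHz, 37ns works out to about 13 clock cycles per iteration. Since a loop like this also pays for the counter update and branch on every pass, the figure is an upper bound on the cost of the read itself.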