This marks a major milestone: a working(-ish) framebuffer with software 3D rendering. It is extremely slow. The screenshot indicates 36 ms per frame, but this is actually 36 seconds, as the poor 32-bit nanosecond-resolution timer had to be slowed down so that it would not overflow during a single frame!
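To put numbers on the timer problem, here is a small sketch of the wrap-around arithmetic. The helper names wrap_seconds and elapsed_ticks are made up for illustration and are not part of the project, and the 1000x slowdown is an assumption inferred from the ms-versus-seconds mismatch above.

```c
#include <stdint.h>

/* Longest interval, in whole seconds, that a 32-bit counter can measure
   before wrapping, given the duration of one tick in nanoseconds. */
uint64_t wrap_seconds(uint64_t ns_per_tick)
{
    const uint64_t ticks = (uint64_t)1 << 32;   /* 2^32 counter states */
    return ticks * ns_per_tick / 1000000000ull;
}

/* Elapsed ticks between two readings. Unsigned subtraction stays correct
   across a single wrap of the counter, but not across more than one. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}
```

At one tick per nanosecond, wrap_seconds(1) gives 4, so the counter wraps several times within a 36-second frame. With the assumed 1000x slowdown, wrap_seconds(1000) gives 4294, which leaves plenty of headroom.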
The system diagram illustrates some of the current limitations:
Stupidly slow system clock. This should be bumped up to at least 50 MHz for the CPU and 100-125 MHz for the DRAM controller (presenting a virtual 32-bit bus to the CPU to hide some of the access latency).
Lack of any burst transfers between CPU and SDRAM. A series of eight 32-bit reads (a data cache line fill) executes as 16 individual 16-bit accesses, each with a lot of handshaking overhead.
Overloaded "Memory Control" module which should be split up into more maintainable units.
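The cost of the missing burst support can be sketched with a toy cycle model. The line size and bus width come from the list above; the function names and the idea of a fixed per-access handshake cost are assumptions for illustration only, not measurements of the actual controller.

```c
#include <stdint.h>

enum {
    LINE_WORDS = 8,   /* 32-bit words per data cache line (from the text) */
    WORD_BITS  = 32,  /* CPU word width */
    BUS_BITS   = 16   /* width of one SDRAM access (from the text) */
};

/* Number of 16-bit accesses needed to fetch one cache line. */
unsigned line_fill_accesses(void)
{
    return LINE_WORDS * (WORD_BITS / BUS_BITS);
}

/* Assumed model: every access pays its own handshake. */
unsigned cycles_individual(unsigned handshake_cycles, unsigned data_cycles)
{
    return line_fill_accesses() * (handshake_cycles + data_cycles);
}

/* Assumed model: one handshake, then back-to-back data beats. */
unsigned cycles_burst(unsigned handshake_cycles, unsigned data_cycles)
{
    return handshake_cycles + line_fill_accesses() * data_cycles;
}
```

With a purely hypothetical 4-cycle handshake and 1 cycle per data beat, the individual accesses cost 80 cycles per line fill against 20 for a single burst. The real figures depend on the controller and SDRAM timing.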
I am using the public domain sdram_pnru controller, hacked up to support burst transfers, which are needed for framebuffer output. This controller was originally written with generic clock domain crossing support, which adds extra synchronization cycles to every transaction. The plan is to write a new controller that better fits the platform's needs.
The CPU is also built without hardware multiply support, among other missing features. There is a lot of work ahead.
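Without a hardware multiplier, every multiplication in the renderer turns into a software routine, normally supplied by the compiler's runtime library. The following is a minimal sketch of the classic shift-and-add approach. The name soft_mul32 is hypothetical, and this is not necessarily the routine the toolchain actually links in.

```c
#include <stdint.h>

/* Multiply two 32-bit values using only shifts and adds.
   The result is truncated to 32 bits, like a normal C unsigned multiply. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;

    while (b != 0) {
        if (b & 1u)          /* lowest bit of b set: add the shifted a */
            product += a;
        a <<= 1;             /* next bit of b is worth twice as much */
        b >>= 1;
    }
    return product;
}
```

The loop runs once per significant bit of b, so a single multiply can take up to 32 iterations, which adds up quickly in a software 3D pipeline.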