The project is in an early state, but it can already render 3D scenes with transformation, clipping, lighting (flat shading only), texturing, blending and so on. In a simulation, I was able to run tuxracer on it. On MCUs, you will probably render simpler scenes because of the weaker hardware.
On the iCE40UP, the accelerator runs at 24 MHz and has a theoretical maximum fill rate of 12 megapixels per second (two clocks per pixel). On FPGAs with enough dual-port RAM, a theoretical fill rate of 24 megapixels per second would be possible (one pixel every clock cycle).
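The fill rate figures follow directly from the clock frequency and the number of clocks spent per pixel. The following is only a small sketch of that arithmetic; the function names and the 320x240 display size used in the comments are assumptions made for illustration and are not part of the project.

```c
#include <stdint.h>

/* Theoretical fill rate in pixels per second.
 * clock_hz:         accelerator clock, e.g. 24000000 for 24 MHz
 * clocks_per_pixel: 2 on the iCE40UP, 1 with enough dual-port RAM
 * (hypothetical helper, not part of the project's library) */
uint32_t fill_rate_pps(uint32_t clock_hz, uint32_t clocks_per_pixel)
{
    if (clocks_per_pixel == 0u) {
        return 0u; /* avoid division by zero on invalid input */
    }
    return clock_hz / clocks_per_pixel;
}

/* How often a display of the given size could be filled completely
 * per second at the given fill rate (rounded down).
 * The display size is an assumption, e.g. 320x240. */
uint32_t full_screen_fills_per_second(uint32_t fill_rate,
                                      uint32_t width, uint32_t height)
{
    uint32_t pixels = width * height;
    if (pixels == 0u) {
        return 0u;
    }
    return fill_rate / pixels;
}
```

These are upper bounds only. Real scenes will stay below them, for example because of overdraw and the time the MCU needs to prepare and transfer the triangles.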
To maintain this fill rate, the rasterizer implements an edge walking algorithm, so it does not waste clock cycles checking pixels outside of the triangle. In general, it can check one pixel per clock. The fragment calculation is also fully pipelined and can process one pixel per clock (perspective correction, depth testing, blending, texturing ...).
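To illustrate the idea behind edge walking, here is a minimal software sketch in C. It is not the hardware implementation: it is a plain scanline walker that follows the left and right edges of the triangle and only visits the pixels between them, instead of testing every pixel of the bounding box. All names (`Vec2i`, `PixelFn`, `walk_triangle`) are hypothetical, and the fill rule (left/top inclusive, right/bottom exclusive, integer rounding toward zero) is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t x, y; } Vec2i;

/* Called once for every pixel inside the triangle (may be NULL). */
typedef void (*PixelFn)(int32_t x, int32_t y, void *user);

static void swap_vec(Vec2i *a, Vec2i *b)
{
    Vec2i t = *a; *a = *b; *b = t;
}

/* x position of the edge a->b on scanline y. Requires a.y != b.y. */
static int32_t edge_x(Vec2i a, Vec2i b, int32_t y)
{
    return a.x + (b.x - a.x) * (y - a.y) / (b.y - a.y);
}

/* Walks the triangle scanline by scanline and returns the number of
 * pixels visited. Only pixels between the two active edges are touched. */
uint32_t walk_triangle(Vec2i v0, Vec2i v1, Vec2i v2, PixelFn emit, void *user)
{
    uint32_t count = 0;

    /* Sort the vertices by y so that v0 is the top and v2 the bottom. */
    if (v1.y < v0.y) swap_vec(&v0, &v1);
    if (v2.y < v0.y) swap_vec(&v0, &v2);
    if (v2.y < v1.y) swap_vec(&v1, &v2);

    if (v0.y == v2.y) {
        return 0; /* degenerate triangle without height */
    }

    for (int32_t y = v0.y; y < v2.y; ++y) {
        /* One side is always the long edge v0->v2, the other side is
         * v0->v1 above the middle vertex and v1->v2 below it. */
        int32_t xa = edge_x(v0, v2, y);
        int32_t xb = (y < v1.y) ? edge_x(v0, v1, y) : edge_x(v1, v2, y);

        if (xa > xb) {
            int32_t t = xa; xa = xb; xb = t;
        }

        for (int32_t x = xa; x < xb; ++x) {
            if (emit != NULL) {
                emit(x, y, user);
            }
            ++count;
        }
    }
    return count;
}
```

For the triangle (0,0), (4,0), (0,4) this sketch visits 10 pixels, while a test over the full 4x4 bounding box would have to check 16.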
Because the library uses floating-point arithmetic, an MCU with an FPU is recommended. An STM32L5 can process up to 14k polygons. An RP2040 only manages around 3k.
There are still a lot of features missing, like smooth shading, texture filtering, stencil buffer, logic ops and whatnot, but it was not possible to implement them in such a small FPGA. I guess I am now at the point where I need a bigger FPGA.
On GitHub, I have an example that runs on Arduinos with enough RAM. Have fun trying it out!