• Atari SFP004 Validation complete on ARM/FPGA

    Matthew Pearce · 03/24/2026 at 18:46

    The FPU now works with EmuTOS and the SFP004 libraries. It delivers an 8x performance improvement over the SOFT-FP code, using an unmodified set of utilities from 1991. I obtained the utilities from here: https://www.atari-forum.com/viewtopic.php?p=478222#p478222

    The test board is an Alinx AXU3EG, with Musashi running on the ARM PS and communicating over AXI with the FPU in the PL (FPGA fabric).

    The next stage is to build a hardware PCB using a 68SEC000 for true validation. PCBWay have kindly sponsored this hardware due to its unique design. I will provide more details when my board arrives from them.

  • 50 MHz 68882 Compatibility added

    Matthew Pearce · 03/24/2026 at 08:58

    Just merged a major release. It meets timing at 50 MHz with the full 68882 feature set, including two concurrent operations. It has been tested with a custom EmuTOS build on an ALINX ZynqMP board. More details to follow. The LITE 68040 subset is also now available at 50 MHz.

    MC68881/82 FPGA — CIR Coprocessor Interface Update

    What's New

    This update brings the FPGA's Coprocessor Interface Register (CIR) protocol in line with the real MC68881/82 hardware
    encoding, fixes several bus interface bugs, and adds a comprehensive hardware diagnostic test.
    It has also been tested in hardware using a custom EmuTOS build running existing FPU .prg tests. The next release will add SFP004 compatibility.

    MC68881 Native Opcode Encoding

    The CIR command word now uses Motorola's native opcode encoding directly — the same encoding a real MC68020/030 CPU
    sends over the coprocessor bus. Previously the FPGA used an internal numbering scheme that required a software
    translation layer. This means:

    - FMOVE = $00, FSQRT = $04, FSIN = $0E, FADD = $22, FMUL = $23, FDIV = $20, FSUB = $28 — exactly as documented in the
    MC68881 User Manual
    - The translation switch in the F-line handler has been eliminated
    - Software that talks directly to the CIR registers (like Atari TT FPU test utilities) can use standard Motorola
    opcodes without any remapping
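
    As a rough illustration of the native encoding, the sketch below packs a general-operation command word from its fields. The bit layout (R/M in bit 14, source specifier in bits 12-10, destination register in bits 9-7, opcode in bits 6-0) follows the MC68881 command-word format. The helper and constant names are hypothetical and are not taken from the project's firmware.

```c
#include <assert.h>
#include <stdint.h>

/* Native MC68881 opcodes quoted in this update (low 7 bits of the command word). */
enum {
    OP_FMOVE = 0x00, OP_FSQRT = 0x04, OP_FSIN = 0x0E, OP_FDIV = 0x20,
    OP_FADD  = 0x22, OP_FMUL  = 0x23, OP_FSUB = 0x28
};

/* Source format specifiers used when rm = 1 (only the two needed here). */
enum { FMT_LONG = 0, FMT_SINGLE = 1 };

/*
 * Hypothetical helper: build a general-operation command word.
 *   rm     : 0 = source is an FP register, 1 = source comes from the CPU/memory
 *   src    : source FP register (rm = 0) or source format (rm = 1)
 *   dst    : destination FP register, 0..7
 *   opcode : 7-bit operation code
 */
static uint16_t fpu_command_word(unsigned rm, unsigned src,
                                 unsigned dst, unsigned opcode)
{
    assert(rm <= 1 && src <= 7 && dst <= 7 && opcode <= 0x7F);
    return (uint16_t)((rm << 14) | (src << 10) | (dst << 7) | opcode);
}
```

    With these definitions, fpu_command_word(1, FMT_LONG, 0, OP_FADD) gives $4022, fpu_command_word(1, FMT_SINGLE, 1, OP_FADD) gives $44A2, and fpu_command_word(0, 0, 1, OP_FADD) gives $00A2. These are the same second words the monitor's line assembler emits for FADD.L #1,FP0, FADD.S #2.35,FP1 and FADD FP0,FP1 in the "Fully Working FPGA MC68881" log.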

    CIR Bus Interface Fixes

    - 32-bit operand transfers: Fixed a bug where long-word writes to the CIR operand register were split into two 16-bit
    writes by the emulation layer, causing the FPU to receive zero instead of the actual value. Operand transfers now go
    through as single 32-bit AXI transactions, matching real hardware behaviour.
    - 68882 pending instruction pipeline: Fixed a race condition where back-to-back instructions could lose the second
    operation if its OpWord arrived while the first was completing. The pending instruction flags are now only cleared
    when actually consumed.
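
    The pending-instruction fix can be pictured with a small software model. This is only a sketch of the invariant described above (the pending flag is cleared at the moment the queued OpWord is consumed, never as a side effect of the first instruction completing). The real logic is VHDL, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-deep model of the 68882-style pending instruction slot. */
typedef struct {
    bool     busy;           /* an instruction is currently executing   */
    bool     pending_valid;  /* a second OpWord is latched and waiting  */
    uint16_t current;        /* OpWord of the executing instruction     */
    uint16_t pending;        /* OpWord waiting in the pending slot      */
} cir_model_t;

/* Accept an OpWord; returns false if both slots are already occupied. */
static bool cir_write_opword(cir_model_t *m, uint16_t opword)
{
    assert(m != 0);
    if (!m->busy) {
        m->current = opword;
        m->busy = true;
        return true;
    }
    if (!m->pending_valid) {
        m->pending = opword;
        m->pending_valid = true;
        return true;
    }
    return false;
}

/* Called when the executing instruction finishes. */
static void cir_complete(cir_model_t *m)
{
    assert(m != 0);
    if (m->pending_valid) {
        /* Consume the queued OpWord; this is the only place the flag clears. */
        m->current = m->pending;
        m->pending_valid = false;
    } else {
        m->busy = false;
    }
}
```

    In this model, writing two OpWords back to back and then completing the first leaves the second one executing rather than dropped. A sequential C model cannot reproduce the same-cycle race itself; it only shows the intended end state.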

    Hardware Diagnostic (cirtest)

    New standalone 68K test program (cirtest.c) that exercises the full CIR dialog protocol from the Merlin2 monitor. Runs
    12 tests covering:

    - Integer load/store round-trips (positive and negative)
    - Memory-to-register arithmetic: FADD, FMUL, FDIV, FSUB
    - Register-to-register arithmetic: FADD FPn,FPm
    - Unary operations: FSQRT, FNEG, FABS
    - Double-precision transcendentals: FSIN(1.0), FSQRT(2.0)

    All 12 tests pass on hardware. All 13 GHDL simulation testbenches pass.

    Known Issue: SFP004 Peripheral Protocol

    Disassembly of FPU_HARD.PRG (Quidnunc 1991, commonly used on Atari ST/SFP004 boards) revealed it uses a different
    register protocol than the MC68020 CIR standard — it writes commands without an OpWord and expects different response
    encodings. This will be addressed in the next update to support the SFP004 peripheral access pattern alongside the
    standard CIR protocol.

    Files Changed

    19 files across VHDL RTL, testbenches, C firmware, and documentation. No changes to the ALU computation logic — only
    the bus interface encoding and protocol handling.

  • FPU Particle Animation demo

    Matthew Pearce · 03/19/2026 at 10:35

    Fireworks demo using the FPU on Xilinx hardware, with the 68000 processor emulated on ARM.

    #include <stdio.h>
    #include <stdint.h>   /* uint32_t */
    #include <math.h>
    #include "../lib/merlin2_gfx.h"
    #include "../lib/merlin2_rand.h"
    
    /* 640x480 viewport centred on 1280x720 framebuffer */
    #define VP_W    640
    #define VP_H    480
    #define VP_X    320
    #define VP_Y    120
    _Static_assert(VP_X + VP_W <= GFX_SCREEN_W, "Viewport exceeds screen width");
    _Static_assert(VP_Y + VP_H <= GFX_SCREEN_H, "Viewport exceeds screen height");
    
    /* Physics */
    #define GRAVITY     0.06f
    #define PARTICLES_PER_BURST 32
    
    /* Rocket states */
    #define ROCKET_DEAD    0
    #define ROCKET_RISING  1
    #define ROCKET_BURST   2
    
    typedef struct {
        float x, y;
        float dy;
        float target_y;
        uint32_t colour;
        int state;
    } rocket_t;
    
    typedef struct {
        float x, y;
        float old_x, old_y;
        float vx, vy;
        uint32_t colour;
        int life;
        int max_life;
    } particle_t;
    
    static rocket_t rocket;
    static particle_t particles[PARTICLES_PER_BURST];
    
    static const uint32_t palette[] = {
        0xFFFF2020,  /* red */
        0xFFFFD700,  /* gold */
        0xFF20FF20,  /* green */
        0xFF4080FF,  /* blue */
        0xFFFFFFFF,  /* white */
        0xFFFF40FF,  /* magenta */
    };
    #define PALETTE_SIZE (sizeof(palette) / sizeof(palette[0]))
    
    static inline void put_pixel(int x, int y, uint32_t argb)
    {
        if ((unsigned)x < VP_W && (unsigned)y < VP_H)
            *gfx_fb_ptr(x + VP_X, y + VP_Y) = argb;
    }
    
    static uint32_t dim_colour(uint32_t colour, int life, int max_life)
    {
        if (max_life <= 0 || life <= 0)
            return 0xFF000000;
        unsigned r = (colour >> 16) & 0xFF;
        unsigned g = (colour >> 8) & 0xFF;
        unsigned b = colour & 0xFF;
        r = r * (unsigned)life / (unsigned)max_life;
        g = g * (unsigned)life / (unsigned)max_life;
        b = b * (unsigned)life / (unsigned)max_life;
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }
    
    static void spawn_rocket(void)
    {
        rocket.x = (float)rand_range(120, VP_W - 120);
        rocket.y = (float)(VP_H - 1);
        rocket.dy = -4.0f - (float)rand_range(0, 15) * 0.1f;
        rocket.target_y = (float)rand_range(80, VP_H / 3);
        rocket.colour = palette[rand_range(0, (int)PALETTE_SIZE - 1)];
        rocket.state = ROCKET_RISING;
    }
    
    static void burst_rocket(void)
    {
        float speed_base = 1.5f;
        int i;
        for (i = 0; i < PARTICLES_PER_BURST; i++) {
            particle_t *p = &particles[i];
            float angle = (float)i * (2.0f * (float)M_PI / (float)PARTICLES_PER_BURST);
            float speed = speed_base + (float)rand_range(0, 10) * 0.1f;
            p->x = rocket.x;
            p->y = rocket.y;
            p->old_x = rocket.x;
            p->old_y = rocket.y;
            p->vx = cosf(angle) * speed;
            p->vy = sinf(angle) * speed;
            p->colour = rocket.colour;
            p->max_life = rand_range(30, 50);
            p->life = p->max_life;
        }
        rocket.state = ROCKET_BURST;
    }
    
    static int update_and_draw(void)
    {
        int alive = 0;
        int i;
    
        /* Update rocket */
        if (rocket.state == ROCKET_RISING) {
            /* Erase old position (rocket.y still holds the last drawn row) */
            put_pixel((int)rocket.x, (int)rocket.y, 0xFF000000);
            rocket.y += rocket.dy;
            /* Draw rocket */
            put_pixel((int)rocket.x, (int)rocket.y, 0xFFFFFFFF);
            if (rocket.y <= rocket.target_y)
                burst_rocket();
        }
    
        /* Update particles */
        for (i = 0; i < PARTICLES_PER_BURST; i++) {
            particle_t *p = &particles[i];
            if (p->life <= 0)
                continue;
    
            /* Erase at old position */
            put_pixel((int)p->old_x, (int)p->old_y, 0xFF000000);
    
            /* Physics */
            p->vy += GRAVITY;
            p->old_x = p->x;
            p->old_y = p->y;
            p->x += p->vx;
            p->y += p->vy;
            p->life--;
    
            /* Bounds check */
            if ((int)p->x < 0 || (int)p->x >= VP_W ||
                (int)p->y < 0 || (int)p->y >= VP_H) {
                p->life = 0;
                continue;
            }
    
            /* Draw at new position with fading colour */
            uint32_t c = dim_colour(p->colour, p->life, p->max_life);
            put_pixel((int)p->x, (int)p->y, c);
            alive++;
        }
    
        return alive;
    }
    
    int main(void)
    {
        int i;
    
        printf("Fireworks demo - press any key to exit\n");
    
        rand_seed(gfx_get_time());
        gfx_set_mode(1);
        gfx_clear(0xFF000000);
    
        rocket.state = ROCKET_DEAD;
        for (i = 0; i < PARTICLES_PER_BURST; i++)
            particles[i].life = 0;
    
        spawn_rocket();
    
        while (!gfx_char_ready()) {
            int alive = update_and_draw();
    
            /* Spawn next rocket when current burst dies */
            if (rocket.state != ROCKET_RISING...

  • Mandelbrot demo

    Matthew Pearce · 03/16/2026 at 12:52


    Mandelbrot set generated using assembler on hardware.

    *------------------------------------------------------------------------
    * Mandelbrot Set Renderer
    *
    * Renders a 640x640 pixel Mandelbrot set using MC68881 FPU F-line
    * instructions.  Centred on the 1280x720 display at offset (320, 40).
    *
    * View window:  real [-2.0, +0.5], imag [-1.25, +1.25]
    * Scale:        2.5 / 640 = 0.00390625
    * Max iters:    32
    * Palette:      16 colours, cycled via ((iter - 1) & 15)
    *
    * Usage: Load via S-record (L command), execute with G 2000.
    *
    * Assemble:
    *   vasmm68k_mot -Fsrec -m68000 -m68881 -o mandelbrot.srec mandelbrot.s
    *------------------------------------------------------------------------
    
    IMG_W       EQU  640
    IMG_H       EQU  640
    SCREEN_W    EQU  1280
    FB_BASE     EQU  $800000
    OFF_X       EQU  320           * horizontal offset on 1280-wide display
    OFF_Y       EQU  40            * vertical offset on 720-high display
    MAX_ITER    EQU  32
    ROW_BYTES   EQU  SCREEN_W*4   * 5120 bytes per display row
    IMG_ROW     EQU  IMG_W*4       * 2560 bytes per image row
    ROW_SKIP    EQU  ROW_BYTES-IMG_ROW  * 2560 bytes to skip between rows
    
                ORG     $2000
    
    *------------------------------------------------------------------------
    * Entry point
    *------------------------------------------------------------------------
    START
                LEA     msgTitle,A1
                MOVEQ   #13,D0
                TRAP    #15
    
    * Switch to graphics mode
                MOVEQ   #17,D0
                MOVEQ   #1,D1
                TRAP    #15
    
    * Clear to black
                MOVEQ   #18,D0
                MOVE.L  #$FF000000,D1
                TRAP    #15
    
                LEA     msgRender,A1
                MOVEQ   #13,D0
                TRAP    #15
    
    * Record start time
                MOVEQ   #8,D0
                TRAP    #15
                MOVE.L  D1,START_TIME
    
    * Set up FP constants
    * FP7 = scale = 0.00390625 (2.5/640)
                FMOVE.S #$3B800000,FP7     * 0.00390625
    
    * Compute framebuffer start address
    * FB_START = FB_BASE + OFF_Y * ROW_BYTES + OFF_X * 4
                LEA     FB_BASE,A3
                ADDA.L  #OFF_Y*ROW_BYTES+OFF_X*4,A3
    
    * Outer loop: Y pixels (D5 = 0..639)
                CLR.W   D5                 * D5 = pixel_y
    
    YLOOP
    * ci = -1.25 + pixel_y * scale
                FMOVE.W D5,FP3
                FMUL    FP7,FP3            * FP3 = pixel_y * scale
                FSUB.S  #$3FA00000,FP3     * FP3 -= 1.25  =>  ci = y*scale - 1.25
    
    * Progress output every 64 rows
                MOVE.W  D5,D0
                ANDI.W  #63,D0
                BNE.S   NOPROG
                LEA     msgRow,A1
                MOVEQ   #14,D0
                TRAP    #15
                CLR.L   D1
                MOVE.W  D5,D1
                MOVEQ   #10,D2
                MOVEQ   #15,D0
                TRAP    #15
                LEA     msgOf,A1
                MOVEQ   #14,D0
                TRAP    #15
                MOVE.L  #IMG_H,D1
                MOVEQ   #10,D2
                MOVEQ   #15,D0
                TRAP    #15
                LEA     msgNewline,A1
                MOVEQ   #13,D0
                TRAP    #15
    
    NOPROG
    * Inner loop: X pixels (D4 = 0..639)
                CLR.W   D4                 * D4 = pixel_x
    
    XLOOP
    * cr = -2.0 + pixel_x * scale
                FMOVE.W D4,FP2
                FMUL    FP7,FP2            * FP2 = pixel_x * scale
                FSUB.S  #$40000000,FP2     * FP2 -= 2.0  =>  cr = x*scale - 2.0
    
    * z = 0 + 0i
                FMOVE.L #0,FP0              * FP0 = zr = 0
                FMOVE.L #0,FP1              * FP1 = zi = 0
    
    * Iteration loop (D6 = iteration counter)
                MOVEQ   #0,D6
    
    ITERLOOP
    * zr_sq = zr * zr
                FMOVE   FP0,FP4
                FMUL    FP0,FP4            * FP4 = zr^2
    
    * zi_sq = zi * zi
                FMOVE   FP1,FP5
                FMUL    FP1,FP5            * FP5 = zi^2
    
    * Check escape: zr^2 + zi^2 > 4.0 ?
                FMOVE   FP4,FP6
                FADD    FP5,FP6            * FP6 = zr^2 + zi^2
                FCMP.S  #$40800000,FP6     * compare with 4.0
                FBGT    ESCAPED
    
    * zi_new = 2 * zr * zi + ci  (compute before zr, since we need old zr)
                FMUL    FP0,FP1            * FP1 = zr * zi  (old zi destroyed, but zi^2 safe in FP5)
                FADD    FP1,FP1            * FP1 = 2 * zr * zi
                FADD    FP3,FP1            * FP1 = 2*zr*zi + ci  (new zi)
    
    * zr_new = zr^2 - zi^2 + cr
                FMOVE   FP4,FP0            * FP0 = zr^2
                FSUB    FP5,FP0            * FP0 = zr^2 - zi^2
                FADD    FP2,FP0            * FP0 = zr^2 - zi^2 + cr  (new zr)
    
                ADDQ.W  #1,D6
                CMP.W   #MAX_ITER,D6
                BLT.S   ITERLOOP
    
    * Reached max iterations — pixel is in the set (black)
                MOVE.L  #$FF000000,(A3)+   * opaque black (ARGB)
                BRA.S   NEXTX
    
    ESCAPED
    * Pick colour from palette: index = (iter - 1) & 15
                MOVE.W  D6,D0
                SUBQ.W  #1,D0
                ANDI.W  #15,D0
                ASL.W   #2,D0              * D0 = palette offset (4 bytes each)
                LEA     PALETTE,A0
                MOVE.L  (A0,D0.W),(A3)+    * write pixel colour
    
    NEXTX
                ADDQ.W  #1,D4
                CMP.W   #IMG_W,D4
                BLT     XLOOP
    
    * End of row — advance A3 past the unused portion of the display row
                ADDA.L  #ROW_SKIP,A3
    
                ADDQ.W  #1,D5
                CMP.W   #IMG_H,D5
                BLT     YLOOP
    
    *------------------------------------------------------------------------
    * Done — flush, print timing, wait for keypress
    *------------------------------------------------------------------------
     MOVE.B #2,$FD0041...
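
    For cross-checking the renderer on a host machine, the per-pixel loop above can be mirrored in C. This is a sketch, not part of the project: the function names are hypothetical, and it uses double precision where the FPU works in extended precision, so pixels right on the set boundary may differ.

```c
#include <assert.h>

#define IMG_W    640
#define IMG_H    640
#define MAX_ITER 32

/*
 * Mirror of the assembly's per-pixel loop.
 * Returns the iteration count at which |z|^2 first exceeds 4.0,
 * or MAX_ITER if the point never escapes (drawn black).
 */
static int mandel_iter(int px, int py)
{
    const double scale = 0.00390625;          /* 2.5 / 640 */
    double cr = (double)px * scale - 2.0;     /* real part of c */
    double ci = (double)py * scale - 1.25;    /* imaginary part of c */
    double zr = 0.0, zi = 0.0;
    int iter;

    assert(px >= 0 && px < IMG_W && py >= 0 && py < IMG_H);

    for (iter = 0; iter < MAX_ITER; iter++) {
        double zr2 = zr * zr;
        double zi2 = zi * zi;
        if (zr2 + zi2 > 4.0)
            return iter;                      /* escaped */
        zi = 2.0 * zr * zi + ci;              /* uses the old zr */
        zr = zr2 - zi2 + cr;
    }
    return MAX_ITER;                          /* treated as inside the set */
}

/* Palette slot used by the ESCAPED path: (iter - 1) & 15. */
static int palette_index(int iter)
{
    return (iter - 1) & 15;
}
```

    The top-left pixel (0, 0) maps to c = -2.0 - 1.25i and escapes on the first check after z becomes c, so mandel_iter(0, 0) is 1 and it takes palette slot 0. Pixel (512, 320) maps to c = 0 and never escapes.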

  • First Benchmarks

    Matthew Pearce · 03/15/2026 at 10:42

    === Whetstone Benchmark ===
    M2 (array)... OK
    M3 (proc array)... OK
    M4 (conditionals)... OK
    M6 (log/exp/sqrt)... OK
    M7 (proc calls)... OK
    M8 (trig)... OK
    Passes: 10
    Elapsed: 2191 ms
    KWIPS: 4564
    Whetstone complete.

    These are the first benchmarks, run on a Xilinx-based Alinx AXU3EG, using Musashi on ARM with the FPU running on the PL fabric.
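
    The reported KWIPS figure is consistent with counting each pass as 1,000 kilo-Whetstone instructions. That scaling is an assumption inferred from the numbers above, not something taken from the benchmark source. A minimal sketch of the arithmetic under that assumption, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: one pass = 1,000 kilo-Whetstone instructions. */
#define KWI_PER_PASS 1000u

/* KWIPS = (passes * KWI_PER_PASS) / elapsed seconds, in integer arithmetic. */
static uint32_t kwips(uint32_t passes, uint32_t elapsed_ms)
{
    assert(elapsed_ms > 0);
    return (uint32_t)(((uint64_t)passes * KWI_PER_PASS * 1000u) / elapsed_ms);
}
```

    With the values from the run, kwips(10, 2191) evaluates to 4564.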

  • Fully Working FPGA MC68881

    Matthew Pearce · 03/13/2026 at 19:50

    ver 2.1 build 001  MATT PEARCE 2024-2026  |  MC68000 + MC68881 FPGA
    >a 1000
    001000    00000000             OR.B    #0,D0  >FADD.L #1,FP0
    001000    F23C402200000001     FADD.L  #1,FP0
    001008    00000000             OR.B    #0,D0  >FADD.S #2.35,FP1
    001008    F23C44A240166666     FADD.S  #2.35,FP1
    001010    00000000             OR.B    #0,D0  >FADD FP0,FP1
    001010    F20000A2             FADD    FP0,FP1
    001014    00000000             OR.B    #0,D0  >RTS
    001014    4E75                 RTS
    001016    00000000             OR.B    #0,D0  >X
    MC68901 Multifunction Peripheral Initialized
    
    ================================================================================
     888b     d888                  888 d8b            .d8888b.
     8888b   d8888                  888 Y8P           d88P  Y88b
     88888b.d88888                  888               888    888
     888Y88888P888  .d88b.  888d888 888 888 88888b.        d88P
     888 Y888P 888 d8P  Y8b 888P"   888 888 888 "88b   .od888P"
     888  Y8P  888 88888888 888     888 888 888  888  d88P"
     888   "   888 Y8b.     888     888 888 888  888 888"
     888       888  "Y8888  888     888 888 888  888 888888888
                                                      FPU
    ================================================================================
    ver 2.1 build 001  MATT PEARCE 2024-2026  |  MC68000 + MC68881 FPGA
    >G 1000
    >R
    D0=00001000 D1=0000FFFF D2=00000000 D3=00000000
    D4=00000030 D5=0000002C D6=00000004 D7=000001FD
    A0=00001000 A1=00FE0070 A2=00000830 A3=00000000
    A4=00001016 A5=000005BF A6=000005EE   SP=00000FF0
      PC=00001000 SR=2700
     FP0=3FFF0000 80000000 00000000
     FP1=40000000 D6666600 00000000
     FP2=00000000 00000000 00000000
     FP3=00000000 00000000 00000000
     FP4=00000000 00000000 00000000
     FP5=00000000 00000000 00000000
     FP6=00000000 00000000 00000000
     FP7=00000000 00000000 00000000
    >



    The FPU has now been fully validated on an AXU3EG board, with Musashi running on ARM and the FPU running on the PL fabric. See the GitHub README for details.
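
    The FP register dump in the transcript prints each register as three 32-bit words in the 96-bit extended layout: sign and 15-bit exponent in the top half of the first word, then a 64-bit mantissa with an explicit integer bit. The sketch below decodes such a dump into a double for quick sanity checks. The function names are hypothetical, and infinities and NaNs (exponent $7FFF) are deliberately left undecoded.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply v by 2^e with repeated doubling or halving (no libm needed). */
static double scale_pow2(double v, int e)
{
    while (e > 0) { v *= 2.0; e--; }
    while (e < 0) { v *= 0.5; e++; }
    return v;
}

/*
 * Decode the three words printed for one FP register:
 *   w0    : sign (bit 31), biased exponent (bits 30-16), low 16 bits unused
 *   w1:w2 : 64-bit mantissa, integer bit in bit 63
 * Stores the value in *out and returns 1 for zero or finite inputs.
 * Returns 0 for exponent 0x7FFF (infinity or NaN), which is not decoded here.
 */
static int ext96_to_double(uint32_t w0, uint32_t w1, uint32_t w2, double *out)
{
    int sign = (int)(w0 >> 31);
    int exp  = (int)((w0 >> 16) & 0x7FFF);
    uint64_t mant = ((uint64_t)w1 << 32) | w2;
    double v;

    assert(out != 0);
    if (exp == 0x7FFF)
        return 0;

    /* value = mant * 2^(exp - 16383 - 63); the cast rounds to 53 bits. */
    v = scale_pow2((double)mant, exp - 16383 - 63);
    *out = sign ? -v : v;
    return 1;
}
```

    Applied to the dump above, FP0 (3FFF0000 80000000 00000000) decodes to 1.0 and FP1 (40000000 D6666600 00000000) decodes to roughly 3.35, which is the single-precision 2.35 plus the 1.0 added from FP0.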

  • Nearly Complete

    Matthew Pearce · 03/05/2026 at 15:33

    LUT usage now fits comfortably on an Artix-7 100T (A7-100T), with room for a small processor alongside the FPU.


    Overview

    A VHDL-2008 implementation of a Motorola MC68881-compatible floating-point coprocessor targeting Xilinx 7-series FPGAs. The design implements the full MC68881 instruction set including all arithmetic, transcendental, program-control, system-control, and packed-decimal operations. It uses DSP-pipelined sequential FP units for the core arithmetic datapath with multi-cycle path constraints for timing closure.

    The current plan and progress tracking live in docs/fpu-progress-checklist.md.

    Features

    • Full instruction set: FADD, FSUB, FMUL, FDIV, FSQRT, FMOD, FREM, FSCALE, FSGLDIV, FSGLMUL, FABS, FNEG, FINT, FINTRZ, FGETEXP, FGETMAN, FTST, FCMP.
    • Transcendental engine: FSIN, FCOS, FTAN, FSINCOS, FASIN, FACOS, FATAN, FATANH, FSINH, FCOSH, FTANH, FETOX, FETOXM1, FTWOTOX, FTENTOX, FLOGN, FLOGNP1, FLOG2, FLOG10. BRAM-based seed tables with Taylor/CORDIC iteration.
    • Data movement: FMOVE (all formats including packed decimal .P), FMOVEM (register lists and control registers), FMOVECR (ROM constants).
    • Program control: FScc, FBcc, FDBcc, FTRAPcc, FNOP with BSUN trap gating.
    • System control: FSAVE/FRESTORE with Null/Idle/Busy frame support (45-word Busy frame with full sub-unit save/restore hierarchy).
    • IEEE 754 compliance: NaN propagation (SNaN/QNaN discrimination, payload preservation), infinity handling, signed zero, gradual underflow, all four rounding modes (nearest, zero, +inf, -inf), single/double/extended precision.
    • Exception handling: Per-operation FPSR exception policies, FPCR trap enable, accrued exception accumulation.
    • Peripheral interface: Register-mapped bus interface with DSACK handshake, suitable for M68000/M68010 peripheral-mode operation.

    Utilization (Xilinx Artix-7 200T, post-place)

    Resource      Used      Available   Util %
    Slice LUTs    52,361    133,800     39.13
    Registers     13,131    267,600      4.91
    Block RAM     5 tiles   365          1.37
    DSP48E1       33        740          4.46

    Non-incremental synthesis + implementation, Vivado 2025.2, xc7a200tfbg676-1. Date: 2026-03-05. Includes Section 7 CIR coprocessor interface with FSAVE/FRESTORE Busy frame support and full exception dialog paths; see "CIR feature gating" below.

    Timing

    • Target clock: 10 MHz (100 ns period) — matches MC68881 bus timing.
    • Multi-cycle path constraints on sequential FP units (mul: 4 cycles, addsub: 6 cycles, div: 6 cycles) and trig engine hold states.
    • Post-route WNS: +16.631 ns (about 17% slack at the 100 ns period; the critical path uses about 83% of the period, for an effective Fmax of ~12 MHz).
    • WHS (hold): no violations.
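
    The Fmax estimate follows from the period and the worst negative slack: the critical path is the period minus WNS, and Fmax is its reciprocal. A minimal sketch of that arithmetic, with hypothetical function names; it ignores the multi-cycle constraints, so treat the result as a rough figure only.

```c
#include <assert.h>

/* Estimated maximum clock in MHz from the constrained period and WNS (both in ns). */
static double fmax_mhz(double period_ns, double wns_ns)
{
    double critical_ns = period_ns - wns_ns;   /* longest constrained path */
    assert(critical_ns > 0.0);
    return 1000.0 / critical_ns;               /* 1000 / ns = MHz */
}

/* Slack as a percentage of the clock period. */
static double slack_percent(double period_ns, double wns_ns)
{
    assert(period_ns > 0.0);
    return 100.0 * wns_ns / period_ns;
}
```

    For the figures above, a 100 ns period with +16.631 ns of slack gives a critical path of 83.369 ns, which works out to just under 12 MHz with about 16.6% of the period as slack.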

    Target device compatibility

    The design fits on several FPGA families. With CIR disabled (ENABLE_CIR_g => false), the core is roughly 46K LUTs, going by the percentages in the table, and fits comfortably on smaller devices:

    Device                          LUTs           DSPs   Fit (full)?   Fit (no CIR)?
    Xilinx Artix-7 200T             134,600        740    Yes (39%)     Yes (34%)
    Xilinx Artix-7 100T             63,400         240    Yes (~83%)    Yes (~72%)
    Xilinx Zynq UltraScale+ ZU3EG   ~71,000        360    Yes (~74%)    Yes (~64%)
    Intel Cyclone V 5CEBA7          150,720 ALMs   156    Yes           Yes

    All RTL is vendor-portable (inferred DSP/BRAM, no Xilinx IP cores). Porting to Intel/Quartus requires XDC-to-SDC constraint conversion and minor DSP inference adjustments.