This log presents the first attempt at implementing a full computation chain based on the Residue Number System (RNS) on an FPGA—specifically, a Cyclone IV. Division is systematically avoided in favor of additions, bit shifts, and lookup tables. The complete pipeline, from conversion to reconstruction, is justified step by step. The code is available on my GitHub: https://github.com/Bertrand-selvasystems/RNS-on-FPGA/.
Hardware and development environment
The hardware target is a Cyclone IV FPGA (EP4CE family). Physical programming will be performed later, as soon as the programmer arrives (ordered from AliExpress; it will take a few more days). At this stage, the entire approach is based on software simulation, using Quartus Prime Lite for synthesis and Questa Intel FPGA (Starter Edition) for timing and functional simulation. This is the FPGA software suite that seems the most accessible and best documented. However, you must request a .dat license file from Intel, even for the free "starter" version.
Some demonstrations for RNS
I wanted to review the calculations needed to obtain the residues for the different moduli. This becomes very simple once you know the following properties, which can be rigorously demonstrated from the very definition of the remainder in Euclidean division.
Let a and b be two integers, and q a modulus:
a = na × q + ra, with 0 ≤ ra < q
b = nb × q + rb, with 0 ≤ rb < q
Addition:
(a + b) = (na × q + ra) + (nb × q + rb) = (na + nb) × q + (ra + rb)
The remainder modulo q is thus (a + b) mod q = (ra + rb) mod q.
Multiplication:
(a × b) = (na × q + ra) × (nb × q + rb) = (na × nb × q + na × rb + nb × ra) × q + ra × rb
Modulo q, only the ra × rb term remains: (a × b) mod q = (ra × rb) mod q.
Subtraction:
(a − b) = (na × q + ra) − (nb × q + rb) = (na − nb) × q + (ra − rb)
The remainder is (ra − rb + q) mod q, with the addition of q ensuring a non-negative result.
These relations are used to compute the residues for the different moduli; below, the residue of x with respect to a modulus mi is written ri (i.e. x mod mi).
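A quick numerical check with arbitrarily chosen values a = 23, b = 14, q = 5 (so ra = 3 and rb = 4):
Addition: 23 + 14 = 37, and 37 mod 5 = 2 = (3 + 4) mod 5
Multiplication: 23 × 14 = 322, and 322 mod 5 = 2 = (3 × 4) mod 5
Subtraction: 23 − 14 = 9, and 9 mod 5 = 4 = (3 − 4 + 5) mod 5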
Binary → RNS conversion without divisions
Use of nibbles: motivation and generalization
A byte is naturally structured in two 4-bit parts, called “nibbles”: the high nibble (H = x[7:4]) and the low nibble (L = x[3:0]).
This binary representation is standard in digital architectures, as it allows the byte to be factored as:
x = 16 × H + L
The advantage of this decomposition is that it simplifies arithmetic operations, in particular modular reduction, because 16 has a simple residue with respect to the most common moduli. The use of nibbles, universal in digital electronics, thus keeps the combinational logic small.
Modulo 5 conversion
We first express 16 modulo 5:
16 = 3 × 5 + 1, so 16 ≡ 1 (mod 5)
It follows that:
x mod 5 = (16H + L) mod 5
x mod 5 = (((H mod 5) × (16 mod 5)) + (L mod 5)) mod 5
x mod 5 = ((H mod 5) + (L mod 5)) mod 5, since 16 mod 5 = 1
x mod 5 = (r_H + r_L) mod 5, using (a + b) mod q = (ra + rb) mod q
Thus:
x mod 5 = (H + L) mod 5
In other words, the remainder of the division by 5 is obtained simply by summing the two nibbles and applying a reduction modulo 5. This is performed with a short addition (max 5 bits) followed by a lookup table (LUT) to bring the sum into the 0..4 interval.
function [2:0] mod5_nibble;
input [3:0] h, l;
begin
mod5_nibble = lut_sum_mod5(h + l); // LUT over [0..30]
end
endfunction
Modulo 7 conversion
We use the relation: 16 = 2 × 7 + 2, so 16 ≡ 2 (mod 7)
So: x mod 7 = (2H + L) mod 7.
The calculation of 2H is done by a simple left shift (bit-shift), then the low nibble is added and a modulo 7 reduction is applied via a LUT.
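For symmetry with the other two moduli, here is the corresponding function, reproduced from the full listing at the end of this log:
function [2:0] mod7_nibble;
input [3:0] h, l;
begin
mod7_nibble = lut_sum_mod7((h << 1) + l); // s = 2H + L, LUT over [0..45]
end
endfunction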
Modulo 9 conversion
Here, 16 = 1 × 9 + 7, so 16 ≡ 7 (mod 9).
Thus: x mod 9 = (7H + L) mod 9.
Long multiplication is avoided: 7H = (4H + 2H + H), i.e., two left shifts and an addition.
function [3:0] mod9_nibble;
input [3:0] h, l;
reg [6:0] s;
begin
s = (h << 2) + (h << 1) + h + l;
mod9_nibble = lut_sum_mod9(s);
end
endfunction
Optimization note:
In all cases, the conversion consists of adding small-width values, then using a small lookup table.
On FPGA, this translates to:
Additions with small fan-in, no global carry propagation (fast, low-cost logic)
Bit shifts (almost free)
Very small LUTs perfectly matched to FPGA resources
Thus, the binary→RNS conversion does not penalize the operating frequency and uses few logic resources, at least for a small number of moduli.
Operation in RNS domain
In the RNS system, each operation is performed locally, for each residue.
In this experiment, the operation chosen is the addition of 2 to each residue, that is, for each modulus m: r → (r + 2) mod m.
This choice may seem a bit silly (a lot of machinery just to perform a +2), but it allows controlled verification of the entire pipeline, including the validity of the round-trip conversions and information preservation. It's a good starting point.
function [2:0] add5_2; input [2:0] a; reg [3:0] t; begin
t = {1'b0,a} + 4'd2; if (t >= 4'd5) t = t - 4'd5; add5_2 = t[2:0];
end endfunction
// Same scheme for mod 7 and mod 9.
The benefit: it validates the correctness of the entire pipeline, from binary to RNS and back, with an operation whose result is predictable and controllable. See the last section for the simulation results and code: it works!
RNS → binary (CRT) without divisions
CRT statement
For coprime moduli {m1, m2, ..., mk} with total product M = ∏ mi,
the Chinese Remainder Theorem (CRT) states that any system of residues (r1, r2, ..., rk) corresponds to a unique integer x ∈ [0, M-1] such that:
x ≡ ri (mod mi) for all i
The explicit reconstruction formula is:
x = (r₁ × M₁ × y₁ + r₂ × M₂ × y₂ + ... + rₖ × Mₖ × yₖ) mod M
with Mi = M / mi, and yi the multiplicative inverse of Mi modulo mi.
In our example:
{5, 7, 9}, M = 315
M₅ = 63, M₇ = 45, M₉ = 35
y₅ = 2, y₇ = 5, y₉ = 8
We use the products Ki = Mi × yi as coefficients.
S = K5 * r5 + K7 * r7 + K9 * r9;
Numerical values: K₅ = 126, K₇ = 225, K₉ = 280.
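A quick sanity check with an arbitrarily chosen value, x = 100:
100 mod 5 = 0, 100 mod 7 = 2, 100 mod 9 = 1
S = 0 × 126 + 2 × 225 + 1 × 280 = 730
730 mod 315 = 730 − 630 = 100
so the constants indeed reconstruct the original value.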
Here is the code corresponding to the LUT definitions:
// mini-ROM Ki*ri (M=315)
function [11:0] T5; input [2:0] r; begin
case (r)
3'd0: T5=12'd0; 3'd1: T5=12'd126; 3'd2: T5=12'd252;
3'd3: T5=12'd378; 3'd4: T5=12'd504; default: T5=12'd0;
endcase
end endfunction
function [11:0] T7; input [2:0] r; begin
case (r)
3'd0: T7=12'd0; 3'd1: T7=12'd225; 3'd2: T7=12'd450;
3'd3: T7=12'd675; 3'd4: T7=12'd900; 3'd5: T7=12'd1125;
3'd6: T7=12'd1350; default: T7=12'd0;
endcase
end endfunction
function [11:0] T9; input [3:0] r; begin
case (r)
4'd0: T9=12'd0; 4'd1: T9=12'd280; 4'd2: T9=12'd560;
4'd3: T9=12'd840; 4'd4: T9=12'd1120;4'd5: T9=12'd1400;
4'd6: T9=12'd1680;4'd7: T9=12'd1960;4'd8: T9=12'd2240;
default: T9=12'd0;
endcase
end endfunction
Modulo 315 reduction without division
To obtain a result bounded in [0,314], the reduction modulo 315 is performed without division, by conditional subtractions:
Proceed as follows:
If the intermediate value is greater than or equal to 2520 (8 × 315), subtract 2520
If the result remains greater than or equal to 1260 (4 × 315), subtract 1260
If still greater than or equal to 630 (2 × 315), subtract 630
If greater than or equal to 315, subtract 315
The remaining value then belongs to [0, 314].
Each subtraction removes a block of 315 (or a power-of-two multiple of it). Since the maximum possible sum is 504 + 1350 + 2240 = 4094 < 2 × 2520, a single conditional subtraction at each threshold is enough. This purely combinatorial strategy avoids the high cost of integer division, fits FPGA architectures well, and generalizes to larger moduli.
Ref.: Piestrak, S. J., “Design of Residue Generators and Multi-Operand Modular Adders Using Carry-Save Adders.”
function [8:0] reduce315;
input [11:0] val;
reg [11:0] t;
begin
t = val;
if (t >= 12'd2520) t = t - 12'd2520;
if (t >= 12'd1260) t = t - 12'd1260;
if (t >= 12'd630 ) t = t - 12'd630;
if (t >= 12'd315 ) t = t - 12'd315;
reduce315 = t[8:0];
end endfunction
Synthesis and perspectives
The presented architecture leverages the granularity of the FPGA: each addition, shift, or LUT is mapped onto an elementary logic unit, with minimal pipeline depth. The maximum achievable frequency then depends directly on the propagation delay in these logics.
For simple operations (addition of small integers, LUT), these propagation delays are particularly short, allowing high operating frequencies, often limited only by the critical path.
On FPGA, the general optimization strategy is to favor the use of local combinatorial resources, to limit carry propagation, and to break down operations into well-synchronized sequential phases. This philosophy allows an optimal compromise between frequency, logic occupancy, and power consumption.
Applying RNS to this type of pipeline demonstrates that even conversion operations that would nominally require division can be broken down into simpler, independent sub-operations.
Furthermore, RNS allows the core computation to be performed on each modulus independently. For example, our current phase 3 (+2) is executed on each modulus independently. All calculations can therefore be performed in parallel without interdependence, paving the way for large-scale processing on simple, efficient hardware architectures. It is somewhat as if we could rigorously process, for a large number, the units, tens, hundreds, thousands, etc., completely independently...
Project perspectives
This is a first stone thrown into the pond. I will consolidate this initial code as follows in the coming logs:
Extension to 16 bits: by increasing the data width and the set of moduli (for example {7, 11, 13, 15, 17, 19, 23, 29, 31}), while maintaining the simplicity of conversions thanks to the nibble + LUT architecture.
Complete FIR pipeline: replacing the elementary operation (+2) by a complete FIR chain (finite impulse response filter) entirely in RNS allows the exploration of massively parallel processing capabilities—a domain where RNS truly shines (and the reason it was used in the Duga K340 computer).
Accurate resource and frequency measurements: the goal is to objectively quantify logic occupancy, maximum frequency, and actual consumption, to support comparisons and guide implementation choices.
Real deployment via a fast interface: integration between the FPGA and a microcontroller (ESP32), for example via the ESP32's I2S peripheral in parallel "LCD" mode, allows validation of real-world operation and evaluation of the system's sustainable throughput.
Your comments and technical feedback are welcome: this work is very exploratory and any help or opinion from a pro is much appreciated...
Project summary
Simulation results
The pipeline operates exactly as expected!
Each clock edge triggers the orderly propagation of operands through the intermediate registers (r5, r7, r9, etc.).
The signals out_valid and out_byte stabilize after two cycles, and the +2 operation works as intended.
A small victory already...
Code
`timescale 1ns/1ps
`default_nettype none
module adder8 (
input wire clk,
input wire rst_n,
// Input bus
input wire [7:0] in_byte,
input wire in_valid,
output wire in_ready,
// Output bus
output reg [7:0] out_byte,
output reg out_valid,
input wire out_ready
);
// ====================================================
// Phase 1 — Input capture
// ====================================================
reg [7:0] x0;
reg v0;
wire s1_ready = 1'b1;
assign in_ready = (~v0) | s1_ready;
always @(posedge clk) begin
if (!rst_n) begin
v0 <= 1'b0;
x0 <= 8'd0;
end else begin
if (in_ready) begin
v0 <= in_valid;
if (in_valid) x0 <= in_byte;
end
end
end
// ====================================================
// Phase 2 — Binary -> RNS (5,7,9) via nibbles + LUT
// ====================================================
reg v1;
reg [2:0] r5;
reg [2:0] r7;
reg [3:0] r9;
wire [3:0] H = x0[7:4];
wire [3:0] L = x0[3:0];
// --- MOD 5 : (H+L) mod 5 ---
// S = H+L fits in 5 bits (0..30)
function [2:0] lut_sum_mod5;
input [4:0] s;
begin
case (s)
5'd0: lut_sum_mod5=3'd0; 5'd1: lut_sum_mod5=3'd1; 5'd2: lut_sum_mod5=3'd2; 5'd3: lut_sum_mod5=3'd3; 5'd4: lut_sum_mod5=3'd4;
5'd5: lut_sum_mod5=3'd0; 5'd6: lut_sum_mod5=3'd1; 5'd7: lut_sum_mod5=3'd2; 5'd8: lut_sum_mod5=3'd3; 5'd9: lut_sum_mod5=3'd4;
5'd10: lut_sum_mod5=3'd0; 5'd11: lut_sum_mod5=3'd1; 5'd12: lut_sum_mod5=3'd2; 5'd13: lut_sum_mod5=3'd3; 5'd14: lut_sum_mod5=3'd4;
5'd15: lut_sum_mod5=3'd0; 5'd16: lut_sum_mod5=3'd1; 5'd17: lut_sum_mod5=3'd2; 5'd18: lut_sum_mod5=3'd3; 5'd19: lut_sum_mod5=3'd4;
5'd20: lut_sum_mod5=3'd0; 5'd21: lut_sum_mod5=3'd1; 5'd22: lut_sum_mod5=3'd2; 5'd23: lut_sum_mod5=3'd3; 5'd24: lut_sum_mod5=3'd4;
5'd25: lut_sum_mod5=3'd0; 5'd26: lut_sum_mod5=3'd1; 5'd27: lut_sum_mod5=3'd2; 5'd28: lut_sum_mod5=3'd3; 5'd29: lut_sum_mod5=3'd4;
5'd30: lut_sum_mod5=3'd0;
default: lut_sum_mod5=3'd0; // s = h + l never exceeds 30; default kept for safety, like the other LUTs
endcase
end
endfunction
function [2:0] mod5_nibble;
input [3:0] h,l;
begin
mod5_nibble = lut_sum_mod5(h + l); // direct LUT
end
endfunction
// --- MOD 7 direct : (2H+L) % 7
// S = 2H + L ranges from 0 to 45
function [2:0] lut_sum_mod7;
input [5:0] s;
begin
case (s)
6'd0: lut_sum_mod7=3'd0; 6'd1: lut_sum_mod7=3'd1; 6'd2: lut_sum_mod7=3'd2; 6'd3: lut_sum_mod7=3'd3;
6'd4: lut_sum_mod7=3'd4; 6'd5: lut_sum_mod7=3'd5; 6'd6: lut_sum_mod7=3'd6; 6'd7: lut_sum_mod7=3'd0;
6'd8: lut_sum_mod7=3'd1; 6'd9: lut_sum_mod7=3'd2; 6'd10:lut_sum_mod7=3'd3; 6'd11:lut_sum_mod7=3'd4;
6'd12:lut_sum_mod7=3'd5; 6'd13:lut_sum_mod7=3'd6; 6'd14:lut_sum_mod7=3'd0; 6'd15:lut_sum_mod7=3'd1;
6'd16:lut_sum_mod7=3'd2; 6'd17:lut_sum_mod7=3'd3; 6'd18:lut_sum_mod7=3'd4; 6'd19:lut_sum_mod7=3'd5;
6'd20:lut_sum_mod7=3'd6; 6'd21:lut_sum_mod7=3'd0; 6'd22:lut_sum_mod7=3'd1; 6'd23:lut_sum_mod7=3'd2;
6'd24:lut_sum_mod7=3'd3; 6'd25:lut_sum_mod7=3'd4; 6'd26:lut_sum_mod7=3'd5; 6'd27:lut_sum_mod7=3'd6;
6'd28:lut_sum_mod7=3'd0; 6'd29:lut_sum_mod7=3'd1; 6'd30:lut_sum_mod7=3'd2; 6'd31:lut_sum_mod7=3'd3;
6'd32:lut_sum_mod7=3'd4; 6'd33:lut_sum_mod7=3'd5; 6'd34:lut_sum_mod7=3'd6; 6'd35:lut_sum_mod7=3'd0;
6'd36:lut_sum_mod7=3'd1; 6'd37:lut_sum_mod7=3'd2; 6'd38:lut_sum_mod7=3'd3; 6'd39:lut_sum_mod7=3'd4;
6'd40:lut_sum_mod7=3'd5; 6'd41:lut_sum_mod7=3'd6; 6'd42:lut_sum_mod7=3'd0; 6'd43:lut_sum_mod7=3'd1;
6'd44:lut_sum_mod7=3'd2; 6'd45:lut_sum_mod7=3'd3;
default: lut_sum_mod7=3'd0;
endcase
end
endfunction
function [2:0] mod7_nibble;
input [3:0] h,l;
begin
mod7_nibble = lut_sum_mod7((h << 1) + l); // s = 2H + L
end
endfunction
// --- MOD 9 : (7H + L) % 9 ---
// s = 7H + L ∈ [0..120]. Direct LUT on s.
function [3:0] lut_sum_mod9;
input [6:0] s;
begin
case (s)
// s % 9 == 0
7'd0, 7'd9, 7'd18, 7'd27, 7'd36, 7'd45, 7'd54, 7'd63, 7'd72, 7'd81, 7'd90, 7'd99, 7'd108, 7'd117:
lut_sum_mod9 = 4'd0;
// s % 9 == 1
7'd1, 7'd10, 7'd19, 7'd28, 7'd37, 7'd46, 7'd55, 7'd64, 7'd73, 7'd82, 7'd91, 7'd100, 7'd109, 7'd118:
lut_sum_mod9 = 4'd1;
// s % 9 == 2
7'd2, 7'd11, 7'd20, 7'd29, 7'd38, 7'd47, 7'd56, 7'd65, 7'd74, 7'd83, 7'd92, 7'd101, 7'd110, 7'd119:
lut_sum_mod9 = 4'd2;
// s % 9 == 3
7'd3, 7'd12, 7'd21, 7'd30, 7'd39, 7'd48, 7'd57, 7'd66, 7'd75, 7'd84, 7'd93, 7'd102, 7'd111, 7'd120:
lut_sum_mod9 = 4'd3;
// s % 9 == 4
7'd4, 7'd13, 7'd22, 7'd31, 7'd40, 7'd49, 7'd58, 7'd67, 7'd76, 7'd85, 7'd94, 7'd103, 7'd112:
lut_sum_mod9 = 4'd4;
// s % 9 == 5
7'd5, 7'd14, 7'd23, 7'd32, 7'd41, 7'd50, 7'd59, 7'd68, 7'd77, 7'd86, 7'd95, 7'd104, 7'd113:
lut_sum_mod9 = 4'd5;
// s % 9 == 6
7'd6, 7'd15, 7'd24, 7'd33, 7'd42, 7'd51, 7'd60, 7'd69, 7'd78, 7'd87, 7'd96, 7'd105, 7'd114:
lut_sum_mod9 = 4'd6;
// s % 9 == 7
7'd7, 7'd16, 7'd25, 7'd34, 7'd43, 7'd52, 7'd61, 7'd70, 7'd79, 7'd88, 7'd97, 7'd106, 7'd115:
lut_sum_mod9 = 4'd7;
// s % 9 == 8
7'd8, 7'd17, 7'd26, 7'd35, 7'd44, 7'd53, 7'd62, 7'd71, 7'd80, 7'd89, 7'd98, 7'd107, 7'd116:
lut_sum_mod9 = 4'd8;
default: lut_sum_mod9 = 4'd0; // safety default
endcase
end
endfunction
function [3:0] mod9_nibble;
input [3:0] h, l;
reg [6:0] s; // 0..120
begin
// 7H = 4H + 2H + H (no multiplier)
s = (h << 2) + (h << 1) + h + l;
mod9_nibble = lut_sum_mod9(s);
end
endfunction
always @(posedge clk) begin
if (!rst_n) begin
v1 <= 1'b0; r5 <= 3'd0; r7 <= 3'd0; r9 <= 4'd0;
end else begin
v1 <= v0;
if (v0) begin
r5 <= mod5_nibble(H, L);
r7 <= mod7_nibble(H, L);
r9 <= mod9_nibble(H, L);
end
end
end
// ====================================================
// Phase 3 — (+2) operation in RNS
// ====================================================
reg v2;
reg [2:0] r5_op, r7_op;
reg [3:0] r9_op;
function [2:0] add5_2; input [2:0] a; reg [3:0] t; begin
t = {1'b0,a} + 4'd2; if (t >= 4'd5) t = t - 4'd5; add5_2 = t[2:0];
end endfunction
function [2:0] add7_2; input [2:0] a; reg [3:0] t; begin
t = {1'b0,a} + 4'd2; if (t >= 4'd7) t = t - 4'd7; add7_2 = t[2:0];
end endfunction
function [3:0] add9_2; input [3:0] a; reg [4:0] t; begin
t = {1'b0,a} + 5'd2; if (t >= 5'd9) t = t - 5'd9; add9_2 = t[3:0];
end endfunction
always @(posedge clk) begin
if (!rst_n) begin
v2 <= 1'b0; r5_op <= 3'd0; r7_op <= 3'd0; r9_op <= 4'd0;
end else begin
v2 <= v1;
if (v1) begin
r5_op <= add5_2(r5);
r7_op <= add7_2(r7);
r9_op <= add9_2(r9);
end
end
end
// ====================================================
// Phase 4 — CRT (RNS -> binary)
// ====================================================
// mini-ROM Ki*ri (M=315)
function [11:0] T5; input [2:0] r; begin
case (r)
3'd0: T5=12'd0; 3'd1: T5=12'd126; 3'd2: T5=12'd252;
3'd3: T5=12'd378; 3'd4: T5=12'd504; default: T5=12'd0;
endcase
end endfunction
function [11:0] T7; input [2:0] r; begin
case (r)
3'd0: T7=12'd0; 3'd1: T7=12'd225; 3'd2: T7=12'd450;
3'd3: T7=12'd675; 3'd4: T7=12'd900; 3'd5: T7=12'd1125;
3'd6: T7=12'd1350; default: T7=12'd0;
endcase
end endfunction
function [11:0] T9; input [3:0] r; begin
case (r)
4'd0: T9=12'd0; 4'd1: T9=12'd280; 4'd2: T9=12'd560;
4'd3: T9=12'd840; 4'd4: T9=12'd1120;4'd5: T9=12'd1400;
4'd6: T9=12'd1680;4'd7: T9=12'd1960;4'd8: T9=12'd2240;
default: T9=12'd0;
endcase
end endfunction
reg v3, v4;
reg [11:0] S;
reg [8:0] X; // debug
reg [7:0] y4;
function [8:0] reduce315;
input [11:0] val;
reg [11:0] t;
begin
t = val;
if (t >= 12'd2520) t = t - 12'd2520;
if (t >= 12'd1260) t = t - 12'd1260;
if (t >= 12'd630 ) t = t - 12'd630;
if (t >= 12'd315 ) t = t - 12'd315;
reduce315 = t[8:0];
end endfunction
wire [8:0] S_red = reduce315(S);
always @(posedge clk) begin
if (!rst_n) begin
v3 <= 1'b0; v4 <= 1'b0; S <= 12'd0; X <= 9'd0; y4 <= 8'd0;
end else begin
v3 <= v2;
if (v2) S <= T5(r5_op) + T7(r7_op) + T9(r9_op);
v4 <= v3;
if (v3) begin
X <= S_red; // debug
y4 <= S_red[7:0]; // aligned with v4
end
end
end
// ====================================================
// Phase 5 — Output buffer
// ====================================================
reg [7:0] y5;
reg v5;
wire s5_ready = (~v5) | out_ready;
always @(posedge clk) begin
if (!rst_n) begin
v5 <= 1'b0; y5 <= 8'd0; out_valid <= 1'b0; out_byte <= 8'd0;
end else begin
if (s5_ready) begin
v5 <= v4;
if (v4) y5 <= y4;
end
out_valid <= v5;
out_byte <= y5;
end
end
endmodule
`default_nettype wire
The Residue Number System (RNS) provides a modular alternative to classical binary arithmetic in embedded systems. Instead of representing integers with sequences of bits and managing carry propagation, numbers are described as a vector of modular remainders, each computed with respect to a different modulus. This parallelism eliminates the carry chain that often limits speed in FPGA designs, especially for operations such as FIR filtering, convolution, or correlation, where pipeline depth, regularity, and maximum clock frequency are critical.
In RNS, each channel—or modular path—can operate independently, enabling parallel arithmetic units, high throughput, and a highly regular architecture, even on low-cost or resource-constrained FPGAs.
The Choice of Moduli — A Question of Dynamic Range, Simplicity, and Practicality
In an RNS system, each number is not represented by its bits, but by its remainders modulo several integers. For example, using the base {7, 11, 13}, the number x = 123 is written as:
123 mod 7 = 4
123 mod 11 = 2
123 mod 13 = 6
Thus, 123 is encoded by the "vector of remainders" (4, 2, 6).
The total dynamic range of the system is given by the product of the moduli:
M = 7 × 11 × 13 = 1001
So, here, it is possible to represent, without ambiguity, all integers from 0 to 1000.
On FPGA, to process a FIR filter (16 bits input × 16 bits coefficients, and L accumulation taps), a much larger dynamic range is required. In general, a minimum of 40 bits of dynamic range is targeted, that is:
M ≥ 2⁴⁰ ≈ 1,100,000,000,000
This requirement imposes the use of at least 9 or 10 small moduli (typically less than 100) to stay compatible with modest FPGA resources, while still covering the dynamic range of a substantial audio FIR (e.g., 256 taps, Q15 signal).
For a 256-tap FIR with Q15 data and coefficients, the worst-case sum is L × (2¹⁵ − 1)² = 256 × 32767² = 274,861,129,984, i.e. just under 2³⁸. This is consistent with the 40-bit dynamic-range target, which leaves some margin.
The smaller the moduli, the faster the conversion and processing... but more moduli are required to cover the desired dynamic range.
Chinese Remainder Theorem (CRT): Statement and Detailed Explanation
If you select several integers m₁, m₂, ..., mₖ that are pairwise coprime (meaning every pair has no common factor greater than 1), then the Chinese Remainder Theorem ensures a powerful property:
For any tuple of remainders (a₁, a₂, ..., aₖ), with each aᵢ in [0, mᵢ−1], there exists one and only one integer x in the interval [0, M−1], where:
M = m₁ × m₂ × ... × mₖ
such that for every i in [1, k]:
x ≡ aᵢ (mod mᵢ)
In other words, the vector of remainders is a complete and unique encoding of an integer x in [0, M−1]. No two different x share the same tuple of residues, and all x in that range are possible. This is why, in an RNS, the system is both lossless and unambiguous, provided all moduli are pairwise coprime.
Explicit Construction — How to Recover x
To reconstruct x from its residues, the CRT gives a formula that is both practical and easily automatable:
For each modulus mi, compute Mi = M / mi (i.e., the product of all other moduli).
Find Ni, the modular inverse of Mi modulo mi, i.e.:
Ni × Mi ≡ 1 mod mi
Compute the sum:
x = [Σ (ai × Mi × Ni)] mod M
All terms are integers and all computations are explicit: there is no ambiguity, no trial and error. Everything is constructive, and every step is suitable for hardware implementation.
Example
Take moduli m₁ = 5, m₂ = 7, m₃ = 11. These are all primes, so pairwise coprime. The dynamic range is:
M = 5 × 7 × 11 = 385
Suppose the RNS tuple is (a₁, a₂, a₃) = (3, 5, 1). What is the integer x?
M₁ = 385 / 5 = 77, 77 mod 5 = 2, and the inverse of 2 modulo 5 is N₁ = 3
M₂ = 385 / 7 = 55, 55 mod 7 = 6, and the inverse of 6 modulo 7 is N₂ = 6
M₃ = 385 / 11 = 35, 35 mod 11 = 2, and the inverse of 2 modulo 11 is N₃ = 6
x = (3 × 77 × 3 + 5 × 55 × 6 + 1 × 35 × 6) mod 385 = (693 + 1650 + 210) mod 385 = 2553 mod 385 = 243
Thus, the number encoded by (3, 5, 1) in base {5, 7, 11} is x = 243.
Key point: This method always reconstructs x correctly and efficiently as long as the moduli are pairwise coprime. If you wish to handle signed integers, simply reinterpret x in the interval [−M/2, M/2) (i.e., if x ≥ M/2, use x−M).
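For instance, with M = 385, a reconstructed value of x = 380 would be read as 380 − 385 = −5.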
On a microcontroller or FPGA, if each remainder is calculated with the modulo “%” operator, every division is expensive (long integer division). On FPGA, this step would be far more costly than the rest of the pipeline and would destroy all the benefit of using RNS.
Trick: Exploiting Binary Structure (Folding)
Many small moduli are close to a power of 2 (such as 15, 31, 63...) and are “friendly” to manipulate. For a modulus m = 2ᵏ − 1, split the value into k-bit blocks (from the LSB), sum these blocks, and repeat the process until a single value less than m is left; this is the residue.
Detailed Example: Modulo 31
31 = 2⁵ − 1. Any 16-bit integer can be split into 5-bit blocks (starting from the LSB), and these blocks are summed.
Note: If the number of bits is not a multiple of the block size, pad on the left with zeros. This ensures all blocks are the same size, and the sum is meaningful.
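As an illustration, here is a minimal combinational sketch of this folding for a 16-bit value modulo 31. It is not part of the project code; the function name and the exact reduction steps are my own choices, following the scheme described above:
function [4:0] mod31_fold;
input [15:0] x;
reg [6:0] s1; // sum of the 5-bit blocks, at most 31 + 31 + 31 + 1 = 94
reg [5:0] s2; // after a second fold, at most 31 + 2 = 33
begin
// 2^5 = 32 ≡ 1 (mod 31), so each 5-bit block simply adds up
s1 = x[4:0] + x[9:5] + x[14:10] + x[15];
// fold the carry bits back once more
s2 = s1[4:0] + s1[6:5];
// final conditional subtraction brings the result into [0..30]
if (s2 >= 6'd31) s2 = s2 - 6'd31;
mod31_fold = s2[4:0];
end
endfunction
For example, 12345 gives blocks 25 + 1 + 12 + 0 = 38, then 6 + 1 = 7, which is indeed 12345 mod 31.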
When the Modulo is Not "Clean": The Mini-LUT
Now consider a modulus that is not a power of 2 ±1, e.g., 13.
Suppose x = 0b1010 1100 0011 0111 = 44087.
Split into two bytes: high b1 = 0b10101100 = 172, low b0 = 0b00110111 = 55.
Precompute a LUT: For each possible value of b1, store
T13[b1] = (b1 × (256 mod 13)) mod 13
Since 256 mod 13 = 9, for b1 = 172,
172 × 9 = 1548, and 1548 mod 13 = 1
Finally,
r = (b0 + T13[b1]) mod 13 = (55 + 1) mod 13 = 56 mod 13 = 4
Verification:
44087 mod 13 = 4
In summary:
A LUT (lookup table) with 256 entries × 4 bits = 1 kbit: negligible on FPGA. And the computation time is a single cycle, instant lookup.
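A minimal sketch of this scheme in Verilog (module and signal names are hypothetical, not from the project repository). The table is filled at elaboration time, and the final mod-13 reduction is done by conditional subtractions, as elsewhere in this log:
module mod13_bytes (
input wire [15:0] x,
output wire [3:0] r
);
// T13[b1] = (b1 × (256 mod 13)) mod 13 = (b1 × 9) mod 13, precomputed as a small ROM
reg [3:0] T13 [0:255];
integer i;
initial begin
for (i = 0; i < 256; i = i + 1) T13[i] = (i * 9) % 13;
end
wire [7:0] b1 = x[15:8]; // high byte
wire [7:0] b0 = x[7:0];  // low byte
// b0 + T13[b1] is at most 255 + 12 = 267, so a short chain of conditional subtractions finishes the reduction
wire [8:0] s  = b0 + T13[b1];
wire [8:0] s1 = (s  >= 9'd208) ? s  - 9'd208 : s;  // 16 × 13
wire [8:0] s2 = (s1 >= 9'd104) ? s1 - 9'd104 : s1; // 8 × 13
wire [8:0] s3 = (s2 >= 9'd52 ) ? s2 - 9'd52  : s2; // 4 × 13
wire [8:0] s4 = (s3 >= 9'd26 ) ? s3 - 9'd26  : s3; // 2 × 13
wire [8:0] s5 = (s4 >= 9'd13 ) ? s4 - 9'd13  : s4; // 13
assign r = s5[3:0];
endmodule
On the example above (x = 44087), b1 = 172, T13[172] = 1, s = 56, and the reduction chain returns 4, as expected.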
RNS to Binary Conversion: Full Example
Typical FPGA Scheme:
For each modulo, a mini-ROM (sometimes called “mini-LUT CRT”) precalculates all possible values of
ri × Ci mod M
(where Ci is a CRT constant for each modulo, and M is the total product of the moduli).
For each RNS remainder:
ti = miniROMi[ri]
Sum all ti.
Reduce modulo M (conditional subtraction, never a loop).
The original binary value (or the result calculated in RNS) is recovered.
Typical FPGA Pipeline:
Cycle 1: lookup of 9 mini-ROMs in parallel
Cycle 2: addition of 9 results (adder tree)
Cycle 3: reduction mod M, output
Example with base {7,15,31} (to keep numbers simple):
Let x = 12345.
12345 mod 7 = 4
12345 mod 15 = 0
12345 mod 31 = 7
Suppose the RNS calculation yields (6, 2, 9) after an addition of +2 in each channel. The goal is to recover the final number.
CRT constants for M = 3255:
M7 = 465, N7 = 5, C7 = 2325
M15 = 217, N15 = 13, C15 = 2821
M31 = 105, N31 = 13, C31 = 1365
Mini-ROM outputs for residues (6,2,9): t = (6×2325 mod 3255, 2×2821 mod 3255, 9×1365 mod 3255) = (930, 2387, 2520)
Sum = 930 + 2387 + 2520 = 5837, and 5837 mod 3255 = 5837 − 3255 = 2582.
This is (12345 + 2) mod 3255 = 2582. If you want the exact value 12347 (not modulo), you must choose M large enough to cover the dynamic range of all results.
A Complete Example, Step by Step
Let's run through the same example again, bitwise, from start to finish:
x = 12345 (binary: 0011 0000 0011 1001), base {7, 15, 31}.
a) Binary → RNS: 12345 mod 7 = 4, 12345 mod 15 = 0, 12345 mod 31 = 7, so the RNS vector is (4, 0, 7).
b) Operation in RNS ("+2" on each channel): (4 + 2) mod 7 = 6, (0 + 2) mod 15 = 2, (7 + 2) mod 31 = 9. RNS vector = (6, 2, 9).
c) Binary return (CRT): the mini-ROMs give MiniROM7[6] = 930, MiniROM15[2] = 2387, MiniROM31[9] = 2520. Sum = 5837; one conditional subtraction of 3255 gives 2582, which is indeed (12345 + 2) mod 3255.
Architectural Benefits and FPGA Implementation Details
The use of a Residue Number System (RNS) on FPGA provides several structural and computational advantages. Because all modular channels are processed independently, there is no carry propagation, making it possible to achieve much higher clock frequencies compared to traditional binary designs. The pipeline is highly regular and deterministic: conversion from binary to RNS is performed in a single clock cycle using simple additions or lookups, while the FIR processing itself is carried out over as many cycles as there are taps in the filter. The final conversion from RNS back to binary, using a precomputed Chinese Remainder Theorem (CRT) logic with mini-LUTs, typically takes only three cycles.
The memory footprint for this approach remains minimal. For nine moduli, the total LUT usage for binary-to-RNS conversion and CRT mini-ROMs generally stays below four kilobytes, which is negligible on most modern FPGAs. The inherent parallelism of the RNS architecture allows each modular channel to function as a self-contained “processing lane,” leading to more predictable timing and easier place-and-route, even on entry-level FPGAs.
In practical signal-processing pipelines—such as a 256-tap Q15 FIR filter—the latency introduced by RNS conversion becomes negligible compared to the main filter operation. The overall design benefits from higher throughput, a shorter pipeline, and a structure that is particularly robust in terms of timing closure.
Limitations, Use Cases, and Practical Perspective
While RNS arithmetic is highly effective for linear operations such as accumulation, multiplication, convolution, or filtering, it is not universally applicable. Non-linear operations, including division and fast Fourier transforms, require either conversion back to binary or hybrid architectures, which may increase complexity and reduce some of the inherent benefits. The selection of coprime moduli remains crucial, requiring careful consideration to balance the necessary dynamic range with hardware resource constraints.
For embedded, low-power, or real-time signal processing applications on FPGA, RNS architectures provide compelling advantages. They naturally exploit parallelism, reduce pipeline latency, and enable higher operating frequencies by avoiding carry chains and wide binary adders. These features make RNS approaches particularly attractive for custom DSP cores, experimental hardware, and scenarios where throughput, efficiency, and predictability are paramount.
Conclusion: Moduli Choice, Conversion Strategy, and Design Justification
For the continuation of the project, the selected RNS base consists of a set of small, pairwise-coprime moduli: 7, 11, 13, 15, 17, 19, 23, 29, 31. The product of these nine moduli is 100,280,245,065, i.e. roughly 1.0 × 10¹¹ (between 2³⁶ and 2³⁷), which still falls short of the ≈2³⁸ worst case computed above for a 256-tap FIR in Q15 (16-bit signed) arithmetic. This is why a tenth modulus, 37, is being considered (see the next section): with it, the product reaches about 3.7 × 10¹², comfortably above the 40-bit target, removing any risk of overflow.
The chosen moduli include both folding-friendly forms (such as 15, 31) and classic small primes (such as 7, 11, 13), which are handled using compact lookup tables (LUTs). All conversions remain hardware-efficient and deterministic, with uniform latency for all operations. The CRT-based reconstruction is similarly optimized using mini-LUTs for each channel, so RNS-to-binary conversion is accomplished in a fixed, low-latency pipeline.
For What Comes Next with 7, 11, 13, 15, 17, 19, 23, 29, 31 (+37 ?)
Work on the FPGA implementation will now begin. The literature review and initial prototyping have clarified both the architectural options and the selection of moduli (more or less).
For communication with the ESP32, the plan is to use two 8-bit parallel buses (one for data input, one for output). This approach maximizes data throughput while keeping the interface logic straightforward and easy to debug.
The hardware platform finally chosen is a Cyclone IV FPGA. This device was selected instead of the Tang Nano, mainly because Intel's development tools for compilation and simulation are more accessible. The Cyclone IV is also affordable (~15 €), widely available, and provides performance well suited to the requirements of this project (more or less on par with the Tang Nano 9K).