First RNS Implementation on Cyclone IV: +2 Operation over an 8-bit Bus

This log presents the first attempt at implementing a full computation chain based on the Residue Number System (RNS) on an FPGA—specifically, a Cyclone IV. Division is systematically avoided in favor of additions, bit shifts, and lookup tables. The complete pipeline, from conversion to reconstruction, is justified step by step. The code is available on my github : https://github.com/Bertrand-selvasystems/RNS-on-FPGA/.

Hardware and development environment

The hardware target is a Cyclone IV FPGA (EP4CE family). Physical programming will be performed later, as soon as the programmer arrives (ordered from Aliexpress, this will take a few more days). At this stage, the entire approach is based on software simulation using Quartus Prime Lite for synthesis, and Questa Intel FPGA (Starter Edition) for timing and functional simulation.
This is the FPGA software suite that appears the most accessible and best documented. However, you must request a .dat license file from Intel, even for the free "starter" version.

Some demonstrations for RNS

I wanted to review all the calculations necessary for the determination of the different moduli. This is very simple as soon as you know the following properties, which can be rigorously demonstrated from the very definition of the remainder in Euclidean division.

Let a and b be two integers, and q a modulus:

a = n_a × q + r_a, with 0 ≤ r_a < q
b = n_b × q + r_b, with 0 ≤ r_b < q

Addition:

(a + b) = (nₐq + rₐ) + (n_bq + r_b) = (nₐ + n_b)q + (rₐ + r_b)

The remainder modulo q is thus (a + b) mod q = (rₐ + r_b) mod q.

Multiplication:

(a × b) = (nₐq + rₐ)(n_bq + r_b)
        = [nₐ n_b q² + (nₐ r_b + n_b rₐ)q + rₐ r_b]

Modulo q, only the rₐ r_b terms remain: (a × b) mod q = (rₐ × r_b) mod q.

Subtraction:

(a - b) = (nₐq + rₐ) - (n_bq + r_b) = (nₐ - n_b)q + (rₐ - r_b)

The remainder is (rₐ - r_b + q) mod q, with the addition of q ensuring a positive result.

These relations are used for the calculation of the residues of different moduli, and can be written as r_i or i mod n...

Binary → RNS conversion without divisions

Use of nibbles: motivation and generalization

A byte is naturally structured in two 4-bit parts, called “nibbles”: the high nibble (H = x[7:4]) and the low nibble (L = x[3:0]). This binary representation is standard in digital architectures, as it allows the byte to be factored as:

x = 16 × H + L

The advantage of this decomposition is that it optimizes arithmetic operations, in particular for modulo conversion, thanks to the modular structure of 16 in most common bases. The use of nibbles, universal in digital electronics, thus allows combinatorial logic to be optimized.

Modulo 5 conversion

We first express 16 modulo 5:
16 = 3 × 5 + 1, so 16 ≡ 1 (mod 5)
It follows that:

x mod 5 = (16H + L) mod 5

x mod 5 = (((H mod 5)*(16 mod 5)) mod 5 + L mod 5) mod 5
x mod 5 = (((H mod 5)*1) mod 5 + L mod 5) mod 5

x mod 5 = (r_H + r_L) mod 5
With

(a + b) mod q = (rₐ + r_b) mod q.

Thus:

x mod 5 = (H + L) mod 5

In other words, the remainder of the division by 5 is obtained simply by summing the two nibbles and applying a reduction modulo 5. This is performed with a short addition (max 5 bits) followed by a lookup table (LUT) to bring the sum into the 0..4 interval.


function [2:0] mod5_nibble;
  input [3:0] h, l;
  begin
    mod5_nibble = lut_sum_mod5(h + l); // LUT over [0..30]
  end
endfunction

Modulo 7 conversion

We use the relation:
16 = 2 × 7 + 2, so 16 ≡ 2 (mod 7)
So: x mod 7 = (2H + L) mod 7.

The calculation of 2H is done by a simple left shift (bit-shift), then the low nibble is added and a modulo 7 reduction is applied via a LUT.

Modulo 9 conversion

Here, 16 = 1 × 9 + 7, so 16 ≡ 7 (mod 9).
Thus: x mod 9 = (7H + L) mod 9.
Long multiplication is avoided: 7H = (4H + 2H + H), i.e., two left shifts and an addition.


function [3:0] mod9_nibble;
  input [3:0] h, l;
  reg [6:0] s;
  begin
    s = (h << 2) + (h << 1) + h + l;
    mod9_nibble = lut_sum_mod9(s);
  end
endfunction

Optimization note: In all cases, the conversion consists of adding small-width values, then using a small lookup table. On FPGA, this translates to:

Additions with small fan-in, no global carry propagation (fast, low-cost logic)
Bit shifts (almost free)
Very small LUTs perfectly matched to FPGA resources

Thus, binary→RNS conversion does not penalize the operating frequency and uses little logic resources, at least in the case of a small number of moduli.

Operation in RNS domain

In the RNS system, each operation is performed locally, for each residue. In this experiment, the operation chosen is addition of 2 to each modulus, that is: for each modulus, r → (r + 2) mod m.

This choice may seem a bit silly (a lot of machinery just to perform a +2), but it allows controlled verification of the entire pipeline, including the validity of the round-trip conversions and information preservation. It's a good starting point.


function [2:0] add5_2; input [2:0] a; reg [3:0] t; begin
  t = {1'b0,a} + 4'd2; if (t >= 4'd5) t = t - 4'd5; add5_2 = t[2:0];
end endfunction
// Same scheme for mod 7 and mod 9.

Benefit: to validate the correctness of the entire pipeline, from binary to RNS and back, on an operation whose result is predictable and controllable. See the last section with the simulation results and code: It works!

RNS → binary (CRT) without divisions

CRT statement

For coprime moduli {m₁, m₂, ..., m_k} with total product M = ∏ m_i,
the Chinese Remainder Theorem (CRT) states that any system of residues (r₁, r₂, ..., r_k) corresponds to a unique integer x ∈ [0, M-1] such that:

x ≡ r_i (mod m_i) for all i

The explicit reconstruction formula is:

x = Σ_i=1..k [r_i × M_i × y_i] mod M

with M_i = M / m_i, and y_i the multiplicative inverse of M_i modulo m_i.
In our example:

{5, 7, 9}, M = 315
M₅ = 63, M₇ = 45, M₉ = 35
y₅ = 2, y₇ = 5, y₉ = 8
We use the products K_i = M_i × y_i as coefficients.


S = K5 * r5 + K7 * r7 + K9 * r9;

Numerical values: K₅ = 126, K₇ = 225, K₉ = 280.

Here is the code corresponding to the LUT definitions:

// mini-ROM Ai*ri (M=315)
  function [11:0] T5; input [2:0] r; begin
    case (r)
      3'd0: T5=12'd0;   3'd1: T5=12'd126; 3'd2: T5=12'd252;
      3'd3: T5=12'd378; 3'd4: T5=12'd504; default: T5=12'd0;
    endcase
  end endfunction
  function [11:0] T7; input [2:0] r; begin
    case (r)
      3'd0: T7=12'd0;   3'd1: T7=12'd225; 3'd2: T7=12'd450;
      3'd3: T7=12'd675; 3'd4: T7=12'd900; 3'd5: T7=12'd1125;
      3'd6: T7=12'd1350; default: T7=12'd0;
    endcase
  end endfunction
  function [11:0] T9; input [3:0] r; begin
    case (r)
      4'd0: T9=12'd0;   4'd1: T9=12'd280; 4'd2: T9=12'd560;
      4'd3: T9=12'd840; 4'd4: T9=12'd1120;4'd5: T9=12'd1400;
      4'd6: T9=12'd1680;4'd7: T9=12'd1960;4'd8: T9=12'd2240;
      default: T9=12'd0;
    endcase
  end endfunction

Modulo 315 reduction without division

To obtain a result bounded in [0,314], the reduction modulo 315 is performed without division, by conditional subtractions:
Proceed as follows:

As long as the intermediate value is greater than or equal to 2520 (8 × 315), subtract 2520
If the result remains greater than or equal to 1260 (4 × 315), subtract 1260
If still greater than or equal to 630 (2 × 315), subtract 630
If greater than or equal to 315, subtract 315
The remaining value then belongs to [0,314]

Each subtraction eliminates a block of 315 (or a multiple). This purely combinatorial strategy avoids the high cost of integer division and fits perfectly with FPGA architectures; it can be generalized to larger moduli.

ref : Piestrak, S.J., “Design of Residue Generators and Multi-Operand Modular Adders Using Carry-Save Adders,”


function [8:0] reduce315;
  input [11:0] val;
  reg [11:0] t;
begin
  t = val;
  if (t >= 12'd2520) t = t - 12'd2520;
  if (t >= 12'd1260) t = t - 12'd1260;
  if (t >= 12'd630 ) t = t - 12'd630;
  if (t >= 12'd315 ) t = t - 12'd315;
  reduce315 = t[8:0];
end endfunction

Synthesis and perspectives

The presented architecture leverages the granularity of the FPGA: each addition, shift, or LUT is mapped onto an elementary logic unit, with minimal pipeline depth. The maximum achievable frequency then depends directly on the propagation delay in these logics.

For simple operations (addition of small integers, LUT), these propagation delays are particularly short, allowing high operating frequencies, often limited only by the critical path.

On FPGA, the general optimization strategy is to favor the use of local combinatorial resources, to limit carry propagation, and to break down operations into well-synchronized sequential phases. This philosophy allows an optimal compromise between frequency, logic occupancy, and power consumption.

Applying RNS to this type of pipeline demonstrates that even conversion operations requiring division can be broken down into simpler independent sub-operations.

Furthermore, RNS allows the core computation to be performed on each modulus independently. For example, our current phase 3 (+2) is executed on each modulus independently. All calculations can therefore be performed in parallel without interdependence, paving the way for large-scale processing on simple, efficient hardware architectures. It is somewhat as if we could rigorously process, for a large number, the units, tens, hundreds, thousands, etc., completely independently...

Project perspectives

This is a first stone thrown into the pond. I will consolidate this initial code as follows in the coming logs:

Extension to 16 bits: by increasing the data width and the set of modules (for example {7, 11, 13, 15, 17, 19, 23, 29, 31}), while maintaining the simplicity of conversions thanks to the nibble + LUT architecture.
Complete FIR pipeline: replacing the elementary operation (+2) by a complete FIR chain (finite impulse response filter) entirely in RNS allows the exploration of massively parallel processing capabilities—a domain where RNS truly shines (and the reason it was used in the Duga K340 computer).
Accurate resource and frequency measurements: the goal is to objectively quantify logic occupancy, maximum frequency, and actual consumption, to support comparisons and guide implementation choices.
Real deployment via fast interface: integration between the FPGA and a microcontroller (ESP32), for example via I2S "LCD", allows validation of real-world operation and evaluation of the system's sustainable throughput.

Your comments and technical feedback are welcome: this work is very exploratory and any help or opinion from a pro is much appreciated...

Summary project

Each clock edge triggers the orderly propagation of operands through the intermediate registers (r5, r7, r9, etc.).

The signals out_valid and out_byte stabilize after two cycles, and the +2 operation works as intended.

A small victory already...

Code

`timescale 1ns/1ps
`default_nettype none

module adder8 (
  input  wire        clk,
  input  wire        rst_n,

  // Bus entrée
  input  wire [7:0]  in_byte,
  input  wire        in_valid,
  output wire        in_ready,

  // Bus sortie
  output reg  [7:0]  out_byte,
  output reg         out_valid,
  input  wire        out_ready
);

  // ====================================================
  // Phase 1 — Capture entrée 
  // ====================================================
  reg  [7:0] x0;
  reg        v0;

  wire s1_ready = 1'b1;
  assign in_ready = (~v0) | s1_ready;

  always @(posedge clk) begin
    if (!rst_n) begin
      v0 <= 1'b0;
      x0 <= 8'd0;
    end else begin
      if (in_ready) begin
        v0 <= in_valid;
        if (in_valid) x0 <= in_byte;
      end
    end
  end

  // ====================================================
  // Phase 2 — Binaire -> RNS (5,7,9) via nibbles + LUT
  // ====================================================
  
  
    
  reg        v1;
  reg  [2:0] r5;
  reg  [2:0] r7;
  reg  [3:0] r9;

  wire [3:0] H = x0[7:4];
  wire [3:0] L = x0[3:0];
  
  
// --- MOD 5 : (H+L) mod 5 ---
// S = H+L est sur 5 bits (0..30)
function [2:0] lut_sum_mod5;
  input [4:0] s;
  begin
    case (s)
      5'd0:  lut_sum_mod5=3'd0;  5'd1:  lut_sum_mod5=3'd1;  5'd2:  lut_sum_mod5=3'd2;  5'd3:  lut_sum_mod5=3'd3;  5'd4:  lut_sum_mod5=3'd4;
      5'd5:  lut_sum_mod5=3'd0;  5'd6:  lut_sum_mod5=3'd1;  5'd7:  lut_sum_mod5=3'd2;  5'd8:  lut_sum_mod5=3'd3;  5'd9:  lut_sum_mod5=3'd4;
      5'd10: lut_sum_mod5=3'd0;  5'd11: lut_sum_mod5=3'd1;  5'd12: lut_sum_mod5=3'd2;  5'd13: lut_sum_mod5=3'd3;  5'd14: lut_sum_mod5=3'd4;
      5'd15: lut_sum_mod5=3'd0;  5'd16: lut_sum_mod5=3'd1;  5'd17: lut_sum_mod5=3'd2;  5'd18: lut_sum_mod5=3'd3;  5'd19: lut_sum_mod5=3'd4;
      5'd20: lut_sum_mod5=3'd0;  5'd21: lut_sum_mod5=3'd1;  5'd22: lut_sum_mod5=3'd2;  5'd23: lut_sum_mod5=3'd3;  5'd24: lut_sum_mod5=3'd4;
      5'd25: lut_sum_mod5=3'd0;  5'd26: lut_sum_mod5=3'd1;  5'd27: lut_sum_mod5=3'd2;  5'd28: lut_sum_mod5=3'd3;  5'd29: lut_sum_mod5=3'd4;
      5'd30: lut_sum_mod5=3'd0;
    endcase
  end
endfunction

function [2:0] mod5_nibble;
  input [3:0] h,l;
  begin
    mod5_nibble = lut_sum_mod5(h + l);  // direct LUT
  end
endfunction


// --- MOD 7 direct : (2H+L) % 7
// S = 2H + L va de 0..45
function [2:0] lut_sum_mod7;
  input [5:0] s;
  begin
    case (s)
      6'd0: lut_sum_mod7=3'd0;   6'd1: lut_sum_mod7=3'd1;   6'd2: lut_sum_mod7=3'd2;   6'd3: lut_sum_mod7=3'd3;
      6'd4: lut_sum_mod7=3'd4;   6'd5: lut_sum_mod7=3'd5;   6'd6: lut_sum_mod7=3'd6;   6'd7: lut_sum_mod7=3'd0;
      6'd8: lut_sum_mod7=3'd1;   6'd9: lut_sum_mod7=3'd2;   6'd10:lut_sum_mod7=3'd3;   6'd11:lut_sum_mod7=3'd4;
      6'd12:lut_sum_mod7=3'd5;   6'd13:lut_sum_mod7=3'd6;   6'd14:lut_sum_mod7=3'd0;   6'd15:lut_sum_mod7=3'd1;
      6'd16:lut_sum_mod7=3'd2;   6'd17:lut_sum_mod7=3'd3;   6'd18:lut_sum_mod7=3'd4;   6'd19:lut_sum_mod7=3'd5;
      6'd20:lut_sum_mod7=3'd6;   6'd21:lut_sum_mod7=3'd0;   6'd22:lut_sum_mod7=3'd1;   6'd23:lut_sum_mod7=3'd2;
      6'd24:lut_sum_mod7=3'd3;   6'd25:lut_sum_mod7=3'd4;   6'd26:lut_sum_mod7=3'd5;   6'd27:lut_sum_mod7=3'd6;
      6'd28:lut_sum_mod7=3'd0;   6'd29:lut_sum_mod7=3'd1;   6'd30:lut_sum_mod7=3'd2;   6'd31:lut_sum_mod7=3'd3;
      6'd32:lut_sum_mod7=3'd4;   6'd33:lut_sum_mod7=3'd5;   6'd34:lut_sum_mod7=3'd6;   6'd35:lut_sum_mod7=3'd0;
      6'd36:lut_sum_mod7=3'd1;   6'd37:lut_sum_mod7=3'd2;   6'd38:lut_sum_mod7=3'd3;   6'd39:lut_sum_mod7=3'd4;
      6'd40:lut_sum_mod7=3'd5;   6'd41:lut_sum_mod7=3'd6;   6'd42:lut_sum_mod7=3'd0;   6'd43:lut_sum_mod7=3'd1;
      6'd44:lut_sum_mod7=3'd2;   6'd45:lut_sum_mod7=3'd3;
      default: lut_sum_mod7=3'd0;
    endcase
  end
endfunction

function [2:0] mod7_nibble;
  input [3:0] h,l;
  begin
    mod7_nibble = lut_sum_mod7((h << 1) + l);  // s = 2H + L
  end
endfunction

// --- MOD 9 : (7H + L) % 9 ---
// s = 7H + L ∈ [0..120]. LUT directe sur s.
function [3:0] lut_sum_mod9;
  input [6:0] s;
  begin
    case (s)
      // s % 9 == 0
      7'd0,  7'd9,  7'd18, 7'd27, 7'd36, 7'd45, 7'd54, 7'd63, 7'd72, 7'd81, 7'd90, 7'd99, 7'd108, 7'd117:
        lut_sum_mod9 = 4'd0;
      // s % 9 == 1
      7'd1,  7'd10, 7'd19, 7'd28, 7'd37, 7'd46, 7'd55, 7'd64, 7'd73, 7'd82, 7'd91, 7'd100, 7'd109, 7'd118:
        lut_sum_mod9 = 4'd1;
      // s % 9 == 2
      7'd2,  7'd11, 7'd20, 7'd29, 7'd38, 7'd47, 7'd56, 7'd65, 7'd74, 7'd83, 7'd92, 7'd101, 7'd110, 7'd119:
        lut_sum_mod9 = 4'd2;
      // s % 9 == 3
      7'd3,  7'd12, 7'd21, 7'd30, 7'd39, 7'd48, 7'd57, 7'd66, 7'd75, 7'd84, 7'd93, 7'd102, 7'd111, 7'd120:
        lut_sum_mod9 = 4'd3;
      // s % 9 == 4
      7'd4,  7'd13, 7'd22, 7'd31, 7'd40, 7'd49, 7'd58, 7'd67, 7'd76, 7'd85, 7'd94, 7'd103, 7'd112:
        lut_sum_mod9 = 4'd4;
      // s % 9 == 5
      7'd5,  7'd14, 7'd23, 7'd32, 7'd41, 7'd50, 7'd59, 7'd68, 7'd77, 7'd86, 7'd95, 7'd104, 7'd113:
        lut_sum_mod9 = 4'd5;
      // s % 9 == 6
      7'd6,  7'd15, 7'd24, 7'd33, 7'd42, 7'd51, 7'd60, 7'd69, 7'd78, 7'd87, 7'd96, 7'd105, 7'd114:
        lut_sum_mod9 = 4'd6;
      // s % 9 == 7
      7'd7,  7'd16, 7'd25, 7'd34, 7'd43, 7'd52, 7'd61, 7'd70, 7'd79, 7'd88, 7'd97, 7'd106, 7'd115:
        lut_sum_mod9 = 4'd7;
      // s % 9 == 8
      7'd8,  7'd17, 7'd26, 7'd35, 7'd44, 7'd53, 7'd62, 7'd71, 7'd80, 7'd89, 7'd98, 7'd107, 7'd116:
        lut_sum_mod9 = 4'd8;

      default: lut_sum_mod9 = 4'd0; // sécurité
    endcase
  end
endfunction

function [3:0] mod9_nibble;
  input [3:0] h, l;
  reg  [6:0] s; // 0..120
  begin
    // 7H = 4H + 2H + H (pas de multiplieur)
    s = (h << 2) + (h << 1) + h + l;
    mod9_nibble = lut_sum_mod9(s);
  end
endfunction


  always @(posedge clk) begin
    if (!rst_n) begin
      v1 <= 1'b0; r5 <= 3'd0; r7 <= 3'd0; r9 <= 4'd0;
    end else begin
      v1 <= v0;
      if (v0) begin
        r5 <= mod5_nibble(H, L);
        r7 <= mod7_nibble(H, L);
        r9 <= mod9_nibble(H, L);
      end
    end
  end

  // ====================================================
  // Phase 3 — Opération (+2) en RNS
  // ====================================================
  reg        v2;
  reg  [2:0] r5_op, r7_op;
  reg  [3:0] r9_op;

  function [2:0] add5_2; input [2:0] a; reg [3:0] t; begin
    t = {1'b0,a} + 4'd2; if (t >= 4'd5) t = t - 4'd5; add5_2 = t[2:0];
  end endfunction
  function [2:0] add7_2; input [2:0] a; reg [3:0] t; begin
    t = {1'b0,a} + 4'd2; if (t >= 4'd7) t = t - 4'd7; add7_2 = t[2:0];
  end endfunction
  function [3:0] add9_2; input [3:0] a; reg [4:0] t; begin
    t = {1'b0,a} + 5'd2; if (t >= 5'd9) t = t - 5'd9; add9_2 = t[3:0];
  end endfunction

  always @(posedge clk) begin
    if (!rst_n) begin
      v2 <= 1'b0; r5_op <= 3'd0; r7_op <= 3'd0; r9_op <= 4'd0;
    end else begin
      v2 <= v1;
      if (v1) begin
        r5_op <= add5_2(r5);
        r7_op <= add7_2(r7);
        r9_op <= add9_2(r9);
      end
    end
  end

  // ====================================================
  // Phase 4 — CRT (RNS -> binaire)
  // ====================================================
  // mini-ROM Ai*ri (M=315)
  function [11:0] T5; input [2:0] r; begin
    case (r)
      3'd0: T5=12'd0;   3'd1: T5=12'd126; 3'd2: T5=12'd252;
      3'd3: T5=12'd378; 3'd4: T5=12'd504; default: T5=12'd0;
    endcase
  end endfunction
  function [11:0] T7; input [2:0] r; begin
    case (r)
      3'd0: T7=12'd0;   3'd1: T7=12'd225; 3'd2: T7=12'd450;
      3'd3: T7=12'd675; 3'd4: T7=12'd900; 3'd5: T7=12'd1125;
      3'd6: T7=12'd1350; default: T7=12'd0;
    endcase
  end endfunction
  function [11:0] T9; input [3:0] r; begin
    case (r)
      4'd0: T9=12'd0;   4'd1: T9=12'd280; 4'd2: T9=12'd560;
      4'd3: T9=12'd840; 4'd4: T9=12'd1120;4'd5: T9=12'd1400;
      4'd6: T9=12'd1680;4'd7: T9=12'd1960;4'd8: T9=12'd2240;
      default: T9=12'd0;
    endcase
  end endfunction

  reg        v3, v4;
  reg [11:0] S;
  reg  [8:0] X;   // debug
  reg  [7:0] y4;

  function [8:0] reduce315;
    input [11:0] val;
    reg   [11:0] t;
  begin
    t = val;
    if (t >= 12'd2520) t = t - 12'd2520;
    if (t >= 12'd1260) t = t - 12'd1260;
    if (t >= 12'd630 ) t = t - 12'd630;
    if (t >= 12'd315 ) t = t - 12'd315;
    reduce315 = t[8:0];
  end endfunction

  wire [8:0] S_red = reduce315(S);

  always @(posedge clk) begin
    if (!rst_n) begin
      v3 <= 1'b0; v4 <= 1'b0; S <= 12'd0; X <= 9'd0; y4 <= 8'd0;
    end else begin
      v3 <= v2;
      if (v2) S <= T5(r5_op) + T7(r7_op) + T9(r9_op);

      v4 <= v3;
      if (v3) begin
        X  <= S_red;        // debug
        y4 <= S_red[7:0];   // aligné avec v4
      end
    end
  end

  // ====================================================
  // Phase 5 — buffer de sortie
  // ====================================================
  reg  [7:0] y5;
  reg        v5;

  wire s5_ready = (~v5) | out_ready;

  always @(posedge clk) begin
    if (!rst_n) begin
      v5 <= 1'b0; y5 <= 8'd0; out_valid <= 1'b0; out_byte <= 8'd0;
    end else begin
      if (s5_ready) begin
        v5 <= v4;
        if (v4) y5 <= y4;
      end
      out_valid <= v5;
      out_byte  <= y5;
    end
  end

endmodule

`default_nettype wire