Why Compile-Time Fixed Coefficients and "mul-by-const mod m" LUTs?
My objective is to design the FIR without multipliers, using few wires, only shifts + additions/subtractions and ROM. Just as in the first part of the program, with the binary/RNS conversion.
In RNS, the multiplication
(x × h) mod m
with h a fixed coefficient (filter tap) becomes a simple ROM lookup:
LUT_m[r] = (c * r) mod m
with c = h_int mod m, r in [0, m-1]
The first case investigated was with free coefficients (no fixed coefficient).
With moduli:
{7, 11, 13, 15, 17, 19, 23, 29, 31},
and max size m=31, encoding requires 5 bits.
All possible multiplications require:
Memory = N × Σ m² × 5 bits
= N × (7² + 11² + 13² + 15² + 17² + 19² + 23² + 29² + 31²) × 5 bits
Numerically:
Σ m² = 3545
- N = 128 ⇒ 2,268,800 bits ≈ 246 M9K (Cyclone IV)
- N = 256 ⇒ 4,537,600 bits ≈ 493 M9K
So it's way too heavy to be efficient and in any case completely exceeds the resources of a Cyclone IV.
Moreover, having free filter taps would mean a much more complicated design, with a loader to load the coefficients into the filter. I'm not a fan of this idea, especially since you'd have to load the program into the FPGA before every execution.And it's a try, a prototype. Not a real product...
The Alternative: Compile-Time Fixed Filter Coefficients
We store only the actually needed products because the coefficients are fixed. Then:
Memory = N × Σ m × 5 bits
Numerically:
Σ m = 165
- N = 128 ⇒ 105,600 bits ≈ 11.5 M9K
- N = 256 ⇒ 211,200 bits ≈ 22.9 M9K
And now, this is entirely manageable and consistent with the "multiply without multiplying" approach.
This is the option that was chosen.
Filter Design: Windowed-Sinc FIR, Symmetry, Quantization and Scaling
The FIR filter implemented here is a windowed-sinc lowpass of length N = 128, using a Hamming window for side-lobe suppression. This is the classic textbook approach for FIR design when you want maximum phase linearity (symetry) and predictable cutoff.
We start from the ideal lowpass impulse response (the sinc), center it around n_c = (N-1)/2, and apply a symmetric window.
h_ideal[n] = 2·fc_norm · sinc(2·fc_norm·(n - n_c))
h[n] = h_ideal[n] · w[n]
h[n] = h[n] / sum(h[n]) // DC gain normalization
With N even, symmetry is perfect: h[n] = h[N-1-n]. This ensures a strictly linear phase and a constant group delay of (N-1)/2 samples.
Quantization and scaling: The entire filter is scaled by a quantization factor Q (here, Q = 1024) and all taps are rounded to integer values:
h_int[n] = round(Q · h[n])
Every result is just "off by a factor Q". This scale is recovered by dividing by Q at the output if needed.
The following figure show the filter elements.
The list of filter elements values :
Columns 1 through 29
0 0 0 0 0 0 -1 -1 0 0 1 1 1 0 0 -1 -1 -1 -1 1 2 2 2 1 -1 -3 -4 -3 -1
Columns 30 through 58
1 4 5 5 2 -2 -6 -8 -7 -3 3 8 11 9 4 -4 -12 -16 -14 -6 6 18 24 21 9 -10 -30 -42 -40
Columns 59 through 87
-18 22 75 130 175 201 201 175 130 75 22 -18 -40 -42 -30 -10 9 21 24 18 6 -6 -14 -16 -12 -4 4 9 11
Columns 88 through 116
8 3 -3 -7 -8 -6 -2 2 5 5 4 1 -1 -3 -4 -3 -1 1 2 2 2 1 -1 -1 -1 -1 0 0 1
Columns 117 through 128
1 1 0 0 -1 -1 0 0 0 0 0 0
MATLAB/OCTAVE script to generate filter coefficients
clear; clc;
%% ---------- Paramètres FIR ----------
N = 128; % nb de taps (pair recommandé pour Hamming)
fc_norm = 0.10; % fréquence de coupure normalisée (jusqu'à 10 % de f_e)
window_fun = @hamming; % @hamming, @hann, ...
Q = 1024; % facteur de quantification (puissance de 2)
%% ---------- Moduli ----------
moduli = [7 11 13 15 17 19 23 29 31];
%% ---------- FIR (fenêtre de Hamming + sinc) ----------
n = 0:N-1;
nc = (N-1)/2;
x = n - nc;
h_ideal = zeros(1, N);
for i = 1:N
h_ideal(i) = (x(i) == 0) * (2 * fc_norm) + ...
(x(i) ~= 0) * (sin(2 * pi * fc_norm * x(i)) / (pi * x(i)));
end
w = window_fun(N).';
h = h_ideal .* w;
h = h / sum(h); % normalisation DC
h_int = round(Q * h); % quantification entière
fprintf('--- FIR report ---\n');
fprintf('N=%d, fc_norm=%.3f, window=%s, Q=%d\n', ...
N, fc_norm, func2str(window_fun), Q);
fprintf('sum(h)=%.12f ; sum(h_int)=%d ~ Q=%d\n', ...
sum(h), sum(h_int), Q);
sum_m2 = sum(moduli .^ 2);
sum_m = sum(moduli);
fprintf('Σm^2=%d ; Σm=%d\n', sum_m2, sum_m);
fprintf('General: N*Σm^2*5 = %d bits\n', N * sum_m2 * 5);
fprintf('Fixed : N*Σm *5 = %d bits\n', N * sum_m * 5);
% Tracé (réel vs quantifié)
figure;
plot(n, h, 'LineWidth', 1.25); hold on;
stairs(n, double(h_int)/Q, 'LineWidth', 1.0);
yline(0, ':');
grid on;
xlabel('Indice n');
ylabel('Amplitude');
title(sprintf('FIR - HAMMING, N=%d, f_c = %.2f f_s', N, fc_norm));
legend({'h (réel)', 'h_{int}/Q (quantifié)', 'zero'}, 'Location', 'best');
MATLAB/OCTAVE script to generate multiplication functions for all filter taps and moduli
This MATLAB/OCTAVE script automatically generates, for each filter tap and each RNS modulus, a Verilog function implementing modular multiplication:
each function multi<modulus>_h<tap> encodes the multiplication table (h_int[tap] * r) % modulus as a LUT (lookup table).
Inputs and outputs are always coded on 5 bits [4:0].
A comment indicates the integer value of the filter coefficient used.
The resulting file, lut_verilog.v, is included with the project files.
close all
clear all
clc
% Paramètres
N = 128; % nombre de taps
moduli = [7 11 13 15 17 19 23 29 31];
Q = 1024; % facteur de quantification
fc_norm = 0.10; % fréquence de coupure normalisée
window_fun = @hamming; % type de fenêtre
% --- FIR (fenêtré Hamming + sinc) ---
n = 0:N-1; nc = (N-1)/2;
x = n - nc;
h_ideal = zeros(1,N);
for i = 1:N
h_ideal(i) = (x(i)==0) * (2*fc_norm) + (x(i)~=0) * (sin(2*pi*fc_norm*x(i)) / (pi*x(i)));
end
w = window_fun(N).';
h = (h_ideal .* w); h = h / sum(h);
h_int = round(Q * h);
outfile = 'lut_verilog.v';
fid = fopen(outfile, 'w');
for k = 1:N
for mi = 1:length(moduli)
m = moduli(mi);
c = mod(h_int(k), m);
fname = sprintf('multi%d_h%d', m, k-1); % tap indices start at 0
% --- commentaire avec valeur du coefficient ---
fprintf(fid, '/// %s: h_int[%d] = %d\n', fname, k-1, h_int(k));
fprintf(fid, 'function [4:0] %s;\n', fname);
fprintf(fid, ' input [4:0] r;\n');
fprintf(fid, ' begin\n');
fprintf(fid, ' case (r)\n');
for r = 0:m-1
v = mod(c*r, m);
fprintf(fid, ' 5''d%d : %s = 5''d%d;\n', r, fname, v);
end
fprintf(fid, ' default: %s = 5''d0;\n', fname);
fprintf(fid, ' endcase\n');
fprintf(fid, ' end\n');
fprintf(fid, 'endfunction\n\n');
end
end
fclose(fid);
disp(['Fichier généré : ', outfile]);
Some example functions:
/// multi7_h0: h_int[0] = 0
function [4:0] multi7_h0;
input [4:0] r;
begin
case (r)
5'd0 : multi7_h0 = 5'd0;
5'd1 : multi7_h0 = 5'd0;
5'd2 : multi7_h0 = 5'd0;
5'd3 : multi7_h0 = 5'd0;
5'd4 : multi7_h0 = 5'd0;
5'd5 : multi7_h0 = 5'd0;
5'd6 : multi7_h0 = 5'd0;
default: multi7_h0 = 5'd0;
endcase
end
endfunction
/// multi11_h0: h_int[0] = 0
function [4:0] multi11_h0;
input [4:0] r;
begin
case (r)
5'd0 : multi11_h0 = 5'd0;
5'd1 : multi11_h0 = 5'd0;
5'd2 : multi11_h0 = 5'd0;
5'd3 : multi11_h0 = 5'd0;
5'd4 : multi11_h0 = 5'd0;
5'd5 : multi11_h0 = 5'd0;
5'd6 : multi11_h0 = 5'd0;
5'd7 : multi11_h0 = 5'd0;
5'd8 : multi11_h0 = 5'd0;
5'd9 : multi11_h0 = 5'd0;
5'd10 : multi11_h0 = 5'd0;
default: multi11_h0 = 5'd0;
endcase
end
endfunction
The presence of zeros here is normal: the first filter coefficients are zero. All these functions are written in lut_verilog.v
Addition modulo m
For the FIR sum in RNS, we need to add two residues, each in the range [0, m-1].
If the sum exceeds the modulus, it can only be in [m, 2m-2], so a single subtraction by m is always sufficient to bring the result back into the correct range.
The following Verilog function implements this modular addition for a given modulus :
// Modular addition function for modulus m
function [4:0] add_mod_m;
input [4:0] a, b;
input [4:0] m; // the modulus
reg [5:0] sum; // one extra bit to detect overflow
begin
sum = a + b;
if (sum >= m)
add_mod_m = sum - m;
else
add_mod_m = sum;
end
endfunction
The result is always in [0, m-1].
In practice, I wonder if it’s not preferable to implement a dedicated modular addition function for each modulus, with the modulus value hardwired in the logic.
This avoids having to pass the modulus value as an input and allows the compiler to generate a fixed (hardwired) comparison and subtraction.
I will therefore implement 9 modular addition functions using this principle in the final program (see next log).
Next step
In the upcoming log, I’ll upload the fully functional program.
This intermediate stage was necessary to clearly formalize the generation of modular addition and multiplication functions and give the MATLAB/OCTAVE script in order to generate the functions for random moduli.
In the end, the original requirements are met: everything runs with LUTs, modular addition is always 5 bits max, just bit shifts, no wide adders, and absolutely no multipliers anywhere.
Bertrand Selva
Discussions
Become a Hackaday.io Member
Create an account to leave a comment. Already have an account? Log In.