Close

Reflections on FIR Filter Architecture

A project log for Winter, FPGAs, and Forgotten Arithmetic

RNS on FPGA: Revisiting an Unusual Number System for Modern Signal Processing

bertrand-selvaBertrand Selva 10/18/2025 at 14:530 Comments

I thought designing the FIR filter would be simpler -  in fact it isn’t that straightforward. To picture how data flows through the filter, I drew tables showing successive cycles with the propagation of values and the operations done at each step. The filter parameters are noted ai. The data to process are xi. Here we choose N = 4 for illustration.

I started with a direct approach, the way you’d do it on an MCU. You’ll see it’s probably not the right method here.

Per modulus mi, the FIR is:
yi[n] = ( Σk=0…N−1 hi[k] · xi[n−k] ) mod mi, for all i in [1, p]

Reconstruction by CRT:
y[n] = CRT( y1[n], y2[n], …, yp[n] )

To escape purely sequential thinking, it helps to reason in terms of a production line: separate “lines” manipulate data items, cycle after cycle, as in a factory.

Moreover I count the operations and accumulations involved in an FIR with N taps and it easier to do with the following tables:

Step-by-step

FIR step 1 diagram
Figure 1. With x0 ingested, compute the four products tied to the filter {a0, a1, a2, a3}.
FIR step 2 diagram
Figure 2. When x1 arrives, compute its four products and start summing: x0a0 + x1a1.
FIR step 3 diagram
Figure 3. Ingest x2, add the new term x2a2 into the highlighted box. Start a new accumulator with x2a1 + x1a0. One more addition is still missing to complete a full convolution cycle.
FIR step 4 diagram
Figure 4. First full cycle completed: we can extract the first valid output.
FIR step 5 diagram
Figure 5. Continue the process. Values like x1a1, x1a2, x1a3, x2a2, etc., are not kept. A red boundary in the step-5 table marks what must be stored vs. what can be discarded.

This is the direct approach. Of course, you must replicate that whole pipeline for each modulus. With k moduli, you duplicate it k times — that’s exactly where RNS parallelism comes from.

From the direct form to parallel and transposed architectures

While analyzing the cycle-by-cycle behavior of the direct form through these tables, a structural issue appears: for an FIR of N taps, the pipeline must effectively include N differentiated phases. At each cycle, every coefficient must be paired with its corresponding delayed sample — which requires a time-varying addressing scheme within the delay line.

As a result, the pipeline is not stationary: its internal data paths evolve from one cycle to the next (with a global sequence of N cycles). This makes programmation and sequencing heavier.

In addition to this non-stationary addressing, the addition chain itself must be broken into multiple intermediate stages to meet timing constraints. Together, these factors make the pipeline hard to program and tune.

Hence the natural question: isn’t there a more regular structure, where each pipeline stage would perform exactly the same operation at every instant?

Parallel direct form (fully parallel direct FIR)

If one wants a “hard” (stationary) pipeline, the solution is to let the samples slide through a delay line:
X₃ ← X₂ ← X₁ ← X₀ ← x[n].

The multipliers always see the correct positions (x[n], x[n−1], x[n−2], …): no more dynamic addressing.

That’s much better — but there’s still a hard point:

Pipelining this reduction introduces a few extra cycles of latency. In practice, if one addition combines two values, the reduction tree has log₂(N) stages. With a ternary adder (three inputs), this drops to roughly log₃(N).

This approach is scalable for frequency if the addition tree is balanced and pipelined (depth ≈ log₂ N). Ressources needed linear growth (N multipliers, ~N additions, and delay-line registers).

This approach is often preferred for large filters: short pipeline latency (≈ log₂ N), easy routing (tree fan-in), and a fully stationary structure.

Transposed pipeline (transposed / systolic form)

Another possible structure is the transposed form. It consists of a chain of identical cells, each performing a multiply–add (mod) operation followed by a register. In this configuration, every stage executes the same operation at each clock cycle, making the pipeline extremely regular. Once the pipeline is primed, it delivers one output per cycle (the same for parallel direct form) and typically achieves a very high operating frequency for small to medium filter sizes.

We reuse the same setup with N = 4 coefficients {a0, a1, a2, a3}.
In the transposed form, the input sample x[n] is broadcast to all four multipliers. Each cell performs a multiply–add (mod mi) followed by a register that stores the running sum for the next cycle.

The internal registers are R1, R2, and R3. All operations below are taken mod mi.

y[n]  = a₀·x[n] + R1
R1'   = a₁·x[n] + R2
R2'   = a₂·x[n] + R3
R3'   = a₃·x[n]

After four cycles the pipeline is primed; from then on, one valid output is produced at every clock cycle. Each stage repeats the same operation at every cycle, which keeps timing and routing highly regular. 

However, as N increases, a technological limitation appears: the input sample x[n] must be broadcast to all N multipliers in parallel. This high fan-out can degrade both fmax and routing quality unless the distribution network is segmented or pipelined.

In contrast, the parallel direct form only feeds x[n] into the first register of the delay line, and the data then propagates by simple shifting—an architecture that scales much more gracefully for large N.

RNS context — and my choice

Since my goal is to exploit the Residue Number System (RNS) to push dynamic range and build deep FIR filters (large N), the RNS approach shows its full advantage when the filter depth is high.

For the next steps of this project, I will therefore focus on the fully parallel direct form: it offers a short pipeline latency, routing better suited to large N (no massive fan-out at input), and a stationary pipeline structure.

In the next log, I’ll present the implementation for the parallel direct approach with N = 64.

Discussions