
Winter, FPGAs, and Forgotten Arithmetic

RNS on FPGA: Revisiting an Unusual Number System for Modern Signal Processing

Winter is coming.
No more tinkering outdoors, no more demanding builds in the cold workshop.
Like marmots and bears — if only they had PCs — I’ll be hibernating behind my screen.
So I needed a project that fits this new pace.
I’ve been meaning to get into FPGA development for quite a while. But the usual “educational” examples always turned me off: blinking an LED or writing a counter never really excited me.
This time, I’ve found a much more inspiring thread to follow: implementing RNS-based FIR filtering on FPGA (Cold War inspiration).
Meanwhile, my "LoRaTube" project is currently on hold — I’m waiting for the hardware.

The Inspiration: Duga and the mysterious K340A

I recently stumbled upon some urban exploration photos from the Duga radar site near Chernobyl, and an excellent documentary video: www.youtube.com/watch?v=kHiCHRB-RlA&t=605s.
This gigantic over-the-horizon radar, made of hundreds of antennas, is famous for its distinctive shortwave noise — the “Russian Woodpecker” — that annoyed ham radio operators in the ‘70s and ‘80s.

Among the pictures, one can still spot a massive computer left inside the abandoned building: the K340A.
Not your average 1970s computer.

Back then, the USSR was lagging behind in microelectronics. To process the data streams from such a large radar, engineers had to invent clever architectures. The K340A used a very peculiar number system: the Residue Number System (RNS).

A Brief Mathematical Detour: What is RNS?

The idea is surprisingly old (Chinese Remainder Theorem, ~3rd century):

  • Instead of representing numbers in binary, you use their remainders modulo several pairwise coprime bases.

  • For example, using moduli (3, 5, 7), the number 23 becomes (2, 3, 2).

  • You can add or multiply component-wise, independently — no carry, no dependencies.

It’s parallel by design. That’s what made it such a good fit for hardware with limited logic.

At the time, this offered a massive gain in logic simplicity and parallelism. Very clever engineering…

To convert back to a standard value, you apply the Chinese Remainder Theorem (CRT):

N = ( ∑ rᵢ ⋅ Mᵢ ⋅ yᵢ ) mod M

Where M is the product of all moduli, Mᵢ = M / mᵢ, and yᵢ is the modular inverse of Mᵢ modulo mᵢ.

In practice, this back-conversion is costly — that’s why systems like the K340A only used CRT when strictly necessary, e.g., to display human-readable output.
All radar processing (correlation, FIR, FFT) stayed entirely in RNS.
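
To make this concrete, here is a small MATLAB/Octave sketch (a toy example of mine, using the tiny moduli (3, 5, 7) from above rather than the set used later in the project). It converts two numbers to residues, multiplies them component-wise with no carries, and recovers the result with the CRT:

% Toy RNS example with moduli (3, 5, 7), so M = 105
moduli = [3 5 7];
M = prod(moduli);                  % 105

a = 23;  b = 17;
ra = mod(a, moduli);               % 23 -> (2, 3, 2)
rb = mod(b, moduli);               % 17 -> (2, 2, 3)

% Component-wise multiplication: each lane is independent, no carries
rprod = mod(ra .* rb, moduli);     % residues of (23*17) mod 105

% CRT reconstruction: N = ( sum r_i * M_i * y_i ) mod M
N = 0;
for i = 1:numel(moduli)
    mi = moduli(i);
    Mi = M / mi;
    [~, x] = gcd(mod(Mi, mi), mi); % x = inverse of (Mi mod mi) modulo mi
    N = mod(N + rprod(i) * Mi * mod(x, mi), M);
end
disp(N)                            % 76, i.e. (23*17) mod 105

Addition works the same way (mod(ra + rb, moduli)); only this final reconstruction ever needs full-width arithmetic, which is why systems like the K340A avoided it whenever possible.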

Why is RNS interesting today?

In the Cold War era, RNS offered a way to perform fast and parallel calculations with minimal hardware.

Nowadays, we live in a very different context. FPGA chips are binary-optimized, with built-in DSP slices and accumulators. So… why go back to RNS?

Well, first of all — because it’s cool. There’s something fascinating about rediscovering the strange and beautiful architectures of the Cold War. But there’s more to it than that.

RNS fits surprisingly well with the constraints of modern embedded systems: low-power, compact, robust, and predictable architectures.

Let’s take a closer look.

With RNS:

  • There’s no carry chain — additions don’t ripple. Logic is simpler, timing is tighter, and power usage is lower.

  • It scales naturally — instead of increasing bus width, you just add another modulo.

  • It can be fault-tolerant — a failing modulo can often be detected (or even corrected) thanks to redundancy.

  • It may save energy — instead of deep binary adders, you build flat parallel modulo lanes that burn fewer LUTs, so the architecture is potentially more efficient.

  • It’s a fresh perspective — in a binary-dominated world, revisiting RNS is like exploring a parallel timeline in computing.

Could we actually see energy savings in a small demonstrator? That’s one of the questions I want to explore here.

This also resonates with modern applications like:

  • frugal DSP (e.g., filters, radar processing),

  • lightweight cryptography (modular arithmetic, side-channel resistance),

  • embedded AI (where matrix additions and multiplications dominate and moderate precision is often enough).

The Project: A Modern RNS DSP Filter

I want to reimplement a basic signal processing chain using RNS, on accessible, modern hardware.

Here’s the plan:

  • I’ll use a Tang Nano 9K (GW1N-9) FPGA to wire up the RNS logic.

  • It will work alongside an ESP32-S3 that handles I/O and orchestration.

Why this setup? Because it's cheap, well-documented, and popular in the maker community. And because I know almost nothing about...

Read more »

PROGRAMME16BITS_FIRCOMPLET64TAPS_BACKPRESSURE.zip

64 TAPS RNS FIR WITH BACKPRESSURE

x-zip-compressed - 31.90 MB - 11/16/2025 at 14:33


PROGRAMME16BITS_FIRCOMPLET64TAPS.zip

64 TAPS RNS FIR

x-zip-compressed - 31.95 MB - 11/10/2025 at 16:06


PROGRAMME16BITS_FIRCOMPLET.zip

4 TAPS RNS FIR

x-zip-compressed - 19.70 MB - 11/09/2025 at 17:38


lut_verilog.v

These functions multi<modulus>_h<tap> encode the multiplication table (h_int[tap] * r) % modulus

v - 885.81 kB - 10/23/2025 at 19:01


  • BACKPRESSURE

    Bertrand Selva, 11/16/2025 at 14:31

    Implementing Backpressure in a No-RAM Pipeline

    In the latest version of the project (see attached source), I've introduced a simple backpressure mechanism to ensure that each produced data word is truly consumed before a new one is accepted. Here’s how it works:

    The design is 100% synchronous. There is no external RAM, no FIFO. The objectives are twofold:
    – Never overwrite an output that hasn’t yet been read.
    – Never accept a new input until the previous output is consumed.
    For debugging and visualization, the 16 output bits will be connected to a parallel bus, monitored by the logic analyzer.

    Validation Protocol

    The pipeline uses two standard handshake signals:
    out_valid: signals that output data is ready.
    out_ready: external pulse signaling the data has been consumed.
    Control logic is built around:
    wire hold_pipeline = (out_valid && !out_ready);
    wire step_en = !hold_pipeline;
    The pipeline steps forward only when there is no pending output waiting to be consumed (out_valid=0, or out_ready=1). Otherwise, the pipeline is frozen everywhere. The step_en signal then drives all internal enables.
    This logic ensures:
    – No new input if previous output isn’t consumed.
    – No risk of internal overflow, even with zero FIFO.
    – The pipeline remains frozen as long as downstream isn’t ready.
    That’s exactly what’s needed for streaming scenarios controlled by a slow peripheral or manual inspection.

    Signal Legend

    Signal      Direction   Role / Meaning
    in_valid    Input       External → FPGA: “Input data is present on in_byte”
    in_ready    Output      FPGA → External: “I can accept a new input now”
    in_byte     Input       External → FPGA: data to process (8 bits)
    out_word    Output      FPGA → External: output word produced (16 bits)
    out_valid   Output      FPGA → External: “Output data is ready, read me!”
    out_ready   Input       External → FPGA: “The output data has just been read/consumed”

    Typical Timing

    Cycle 0 (in_valid=0, in_ready=1, out_word=--, out_valid=0, out_ready=1): Idle state. FPGA does nothing — no input pending, no output produced. in_ready=1: ready to accept. out_valid=0: nothing to read. The outside can inject a value at any time.
    Cycle 1 (in_valid=1, in_ready=1, out_word=X1, out_valid=1, out_ready=0): Input received. External presents a value on in_byte with in_valid=1 (pipeline ready). The pipeline captures the input, computes, produces X1, and sets out_valid=1. out_ready=0: output not yet consumed.
    Cycle 2 (in_valid=0, in_ready=0, out_word=X1, out_valid=1, out_ready=0): Waiting for consumption. Input transfer done (in_valid=0), output not yet read (out_ready=0). The pipeline blocks: in_ready=0, out_valid=1, the output is held stable. Registers frozen, no updates.
    Cycle 3 (in_valid=0, in_ready=0, out_word=X1, out_valid=1, out_ready=1): Output read. External acknowledges the output by asserting out_ready=1 for one cycle. The pipeline sees the “green light”: the data is consumed. X1 is read by the ESP32. On the next clock, the pipeline will advance and be ready for a new input.
    Cycle 0 (in_valid=0, in_ready=1, out_word=X1, out_valid=0, out_ready=1): Return to idle. After data consumption, the pipeline returns to idle: in_ready=1, ready to accept a new word. The cycle repeats as above for the next input.
    Cycle 1 (in_valid=1, in_ready=1, out_word=X2, out_valid=1, out_ready=0): New input accepted. External detects in_ready=1 and injects a new value with in_valid=1. The pipeline captures this new input, computes X2, and outputs it. The sequence is analogous to cycle 1 above.

    Key Principles

    • Computation (internal register update) happens only on a clock edge where step_en=1, i.e.:
      step_en = !hold_pipeline = !(out_valid && !out_ready)
    • If the output is not consumed (out_ready=0), the entire pipeline is frozen: no new input, no state evolution, everything held stable.
    • in_ready=1 only when the pipeline can accept a new input (never if the previous output remains unconsumed).
    • The external system dictates the pace: as long as out_ready isn't asserted, there is no risk of data loss or overflow, as the pipeline remains frozen. The external master sets the tempo.

    hold_pipeline and step_en depending on the states

    In the idle state, the pipeline keeps stepping on every clock edge, because step_en = 1 as long as out_valid = 0.
    But in_fire = in_valid && in_ready stays at 0, since...

    Read more »

  • USB BLASTER V2 OK

    Bertrand Selva, 11/15/2025 at 19:27

    I had bought a USB Blaster on AliExpress for 2€.

    If it had worked, it would have been great…

    I tried it with Quartus 13 and Quartus 22. Nothing worked: it just wouldn’t function.

    So I bought a Waveshare Electronics USB Blaster V2, and this time my FPGA was detected immediately and everything worked on the first try with Quartus 22.

    I paid around thirty euros for this version, but it actually works perfectly!



    The EP4CE6E22 Cyclone IV was detected on the first try…

    At the same time, if I’d bought the 10-wire ribbon cable on its own, it would have cost me more than 2€. I don’t have the motivation to open it up and see what’s inside. But it’s definitely better to go for another model…

  • 64 TAPS RNS FIR

    Bertrand Selva, 11/10/2025 at 16:34

    I’ve uploaded the new version of the RNS-based FIR filter, now extended to 64 taps (see PROGRAMME16BITS_FIRCOMPLET64TAPS.zip in the project files).
    Everything is OK!


    FIR Coefficients 

    The implemented coefficients are:

    h_int =
    
    Columns 1 through 29
      0   0   0   0   1   1   2   2   3   3   2   1  -1  -3  -6  -9 -11 -13 -14 -13  -9  -4   4  15  27  40  54  68  81
    
    Columns 30 through 58
     91  98 102 102  98  91  81  68  54  40  27  15   4  -4  -9 -13 -14 -13 -11  -9  -6  -3  -1   1   2   3   3   2   2
    
    Columns 59 through 64
      1   1   0   0   0   0

    These coefficients come from a windowed sinc with a normalized cutoff frequency of 0.1.

    They were scaled so that the sum of all taps equals 1024 (2¹⁰), allowing normalization after CRT reconstruction by a simple 10-bit right shift, with no hardware divider required. 
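
    As a quick illustration (my own sketch, with a made-up accumulator value), the final normalization then reduces to:

    % sum(h_int) == 1024 == 2^10, so removing the DC gain is a right shift
    acc = 3456789;              % hypothetical value after CRT reconstruction
    y   = bitshift(acc, -10);   % same as floor(acc / 1024), no divider needed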

    All multiplication LUTs were completely rewritten to match these coefficients. The updated code is available in lut_verilog_64TAPS.v.

    Implementation results

    Target: Cyclone IV GX EP4CGX22BF14C6
    Tool: Quartus Prime 22.1std.2 (Lite Edition)

    Metric                               Value
    Logic elements                       16 007 / 21 280 (75 %)
    Registers                            7 982
    Embedded multipliers                 0 / 80 (0 %)
    Memory bits                          0 / 774 144 (0 %)
    PLLs                                 0 / 3 (0 %)
    Max frequency (Slow 1200 mV 85 °C)   96.9 MHz

    This confirms that the 64-tap RNS FIR is built entirely from logic elements, with no DSP usage: a fully LUT-based architecture.


    Next step — Now that everything works:

    Decouple the input data rate from the FPGA clock so that the 64-tap pipeline can handle inputs slower than clk without any data loss or misalignment, and simulate the full chain to verify correct behavior.

    Then move on to the hardware implementation of the FIR.

  • First Functional RNS FIR

    Bertrand Selva, 11/09/2025 at 17:36

    I continued exploring FIR filtering in RNS on the Cyclone IV, in line with what was presented in the previous log. This log introduces an RNS-based FIR filter that does not rely on the FPGA’s DSP blocks, uses no conventional multipliers, and is implemented entirely with combinational logic and LUTs.

    Sizing and Initial Limits

    I started with a 128-tap FIR, as defined in the previous log.
    In that form, it did not fit in the Cyclone IV E22.

    Compilation reported around 30,000 logic elements when not exploiting coefficient symmetry for a 128-tap FIR. By taking advantage of symmetry (which is generally required for linear-phase FIR filters and is a common configuration), the logic usage drops to around 15,000 logic elements for the same filter length. Even if this does not completely saturate the device, this kind of “full logic” architecture quickly eats into the available budget, especially on a family like Cyclone IV.

    With a symmetric implementation, it should be possible to push a 128-tap FIR with 16-bit input data and an effective dynamic range of about 2³⁷. That would be very close to the practical upper bound on this FPGA.

    Minimal FIR: Validating the Pipeline

    Since I had to abandon the initially targeted filter, and in order to simplify debugging and verification, I implemented a symmetric 4-tap FIR with coefficients (1, 3, 3, 1), without exploiting symmetry internally. The goal was to preserve a general architecture that can be reused for non-symmetric filters if needed.

    In this configuration:

    • Around 2,800 logic elements are used

    • No embedded RAM, no DSP blocks, no conventional multipliers

    Simulation becomes manageable again: each pipeline stage is easier to track, and debugging is more straightforward. This makes it a useful intermediate step before moving back to longer filters. The code is provided in a ZIP file attached to the project files.

    The design works. The maximum achievable frequency is about 100 MHz.
    To validate the behavior, I injected a harmonic input over 8,192 samples, covering the full dynamic range.

    The filter output matches the expected response, confirming that the FIR operates correctly in RNS, using only pure logic (including the full pipeline: the FIR itself plus the two correction stages).

    Scaling Law

    These results show that it is possible to implement an RNS FIR on a standard FPGA with 0 RAM and 0 DSP usage. The following figure shows the (approximately linear) relationship between FIR length and logic element usage.

    From the measured points, the logic usage scales approximately as:

    LE ≈ 220 × N + 2200

    (there is likely room for improvement with a more optimized implementation, but this gives the correct order of magnitude).

    For a symmetric FIR, a reasonable target is about:

    LE ≈ 110 × N

    What RNS Really Enables on FPGA

    These results help reassess where RNS is relevant in this kind of architecture. RNS is not meant to replace binary arithmetic, but it is a useful tool in the FPGA designer’s toolbox, with clear advantages in specific contexts.

    On an FPGA like the Cyclone IV E22

    Even with hardware multipliers available, a fully parallel FIR quickly runs into constraints related to accumulation, routing, pipelining, and overall logic usage. The limitation does not come only from the number of multipliers, but also from all the surrounding logic: adders, registers, fan-out, and dataflow control.

    For example, the EP4CE22 provides 66 DSP blocks with 18×18 multipliers (suitable for 16-bit inputs). This is enough to parallelize a binary FIR up to about 64 taps; that is roughly the practical upper bound for a fully parallel implementation. Even in that configuration, the logic elements are still used: they are required to build the adder tree that sums all multiplier outputs.

    It is noteworthy that, on the EP4CE22, both approaches, binary with DSPs and RNS with pure logic, end up constrained to FIR filters of about 64 taps,...

    Read more »

  • Low-pass filter design

    Bertrand Selva, 10/23/2025 at 17:02

    Why Compile-Time Fixed Coefficients and "mul-by-const mod m" LUTs?

    My objective is to design the FIR without multipliers and with few wires, using only shifts, additions/subtractions, and ROM, just as in the first part of the program (the binary-to-RNS conversion).

    In RNS, the multiplication
    (x × h) mod m
    with h a fixed coefficient (filter tap) becomes a simple ROM lookup:

    
    LUT_m[r] = (c * r) mod m
    with c = h_int mod m,   r in [0, m-1]
    

    The first case investigated was with free coefficients (i.e., not fixed at compile time).
    With moduli:
    {7, 11, 13, 15, 17, 19, 23, 29, 31},
    and a maximum modulus of 31, each residue is encoded on 5 bits.

    All possible multiplications require:

    
    Memory = N × Σ m² × 5 bits
          = N × (7² + 11² + 13² + 15² + 17² + 19² + 23² + 29² + 31²) × 5 bits
    

    Numerically:
    Σ m² = 3545

    • N = 128 ⇒ 2,268,800 bits ≈ 246 M9K (Cyclone IV)
    • N = 256 ⇒ 4,537,600 bits ≈ 493 M9K

    So it's way too heavy to be efficient and in any case completely exceeds the resources of a Cyclone IV.
    Moreover, having free filter taps would mean a much more complicated design, with a loader to load the coefficients into the filter. I’m not a fan of this idea, especially since you’d have to load the program into the FPGA before every execution. And it’s a trial, a prototype. Not a real product...

    The Alternative: Compile-Time Fixed Filter Coefficients

    We store only the actually needed products because the coefficients are fixed. Then:

    
    Memory = N × Σ m × 5 bits
    

    Numerically:
    Σ m = 165

    • N = 128 ⇒ 105,600 bits ≈ 11.5 M9K
    • N = 256 ⇒ 211,200 bits ≈ 22.9 M9K

    And now, this is entirely manageable and consistent with the "multiply without multiplying" approach.
    This is the option that was chosen. 
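
    To make this concrete, here is a MATLAB/Octave sketch of mine, in the same spirit as the LUT generator shown in another log, that emits one "mul-by-const mod m" table per (tap, modulus) pair as a Verilog function. The function names, output width and exact formatting of the real lut_verilog.v may differ; this only illustrates the principle.

    % Sketch: generate "mul-by-const mod m" LUTs for compile-time fixed taps
    moduli = [7 11 13 15 17 19 23 29 31];
    h_int  = [1 3 3 1];                      % example taps (4-tap FIR)

    for t = 1:numel(h_int)
      for j = 1:numel(moduli)
        m = moduli(j);
        c = mod(h_int(t), m);                % fixed coefficient, reduced mod m
        fprintf('function [4:0] multi%d_h%d;\n', m, t-1);
        fprintf('  input [4:0] r;\n  begin\n    case (r)\n');
        for r = 0:m-1
          fprintf('      5''d%-2d : multi%d_h%d = 5''d%d;\n', r, m, t-1, mod(c*r, m));
        end
        fprintf('      default: multi%d_h%d = 5''d0;\n    endcase\n  end\nendfunction\n\n', m, t-1);
      end
    end

    Each table holds only m entries of 5 bits, which is where the N × Σ m × 5 bits figure above comes from.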


    Filter Design: Windowed-Sinc FIR, Symmetry, Quantization and Scaling

    The FIR filter implemented here is a windowed-sinc lowpass of length N = 128, using a Hamming window for side-lobe suppression. This is the classic textbook approach for FIR design when you want maximum phase linearity (symmetry) and a predictable cutoff.

    We start from the ideal lowpass impulse response (the sinc), center it around n_c = (N-1)/2, and apply a symmetric window.

    h_ideal[n] = 2·fc_norm · sinc(2·fc_norm·(n - n_c)) 
    h[n] = h_ideal[n] · w[n] 
    h[n] = h[n] / sum(h[n]) // DC gain normalization 

    With N even, symmetry is perfect: h[n] = h[N-1-n]. This ensures a strictly linear phase and a constant group delay of (N-1)/2 samples.

    Quantization and scaling: The entire filter is scaled by a quantization factor Q (here, Q = 1024) and all taps are rounded to integer values:

     h_int[n] = round(Q · h[n]) 

    Every result is just "off by a factor Q". This scale is recovered by dividing by Q at the output if needed.
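
    For reference, here is a minimal MATLAB/Octave sketch of this design flow, reconstructed from the formulas above (the cutoff value is just an example; the actual script used for the project may differ slightly):

    % Windowed-sinc lowpass, Hamming window, quantized integer taps (sketch)
    N       = 128;                      % filter length
    fc_norm = 0.1;                      % example normalized cutoff
    Q       = 1024;                     % quantization factor (2^10)

    n     = 0:N-1;
    n_c   = (N-1)/2;                    % center of symmetry
    sincf = @(x) sin(pi*x) ./ (pi*x);   % safe here: x is never 0 for even N

    h_ideal = 2*fc_norm * sincf(2*fc_norm*(n - n_c));   % ideal lowpass
    w       = 0.54 - 0.46*cos(2*pi*n/(N-1));            % Hamming window
    h       = h_ideal .* w;
    h       = h / sum(h);               % DC gain normalization

    h_int = round(Q * h);               % integer taps, h_int(k) == h_int(N+1-k)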

    The following figure shows the filter coefficients.

    The list of filter coefficient values:

      Columns 1 through 29

         0     0     0     0     0     0    -1    -1     0     0     1     1     1     0     0    -1    -1    -1    -1     1     2     2     2     1    -1    -3    -4    -3    -1

      Columns 30 through 58

         1     4     5     5     2    -2    -6    -8    -7    -3     3     8    11     9     4    -4   -12   -16   -14    -6     6 ...

    Read more »

  • Received the 16-channel logic analyzer

    Bertrand Selva, 10/23/2025 at 13:57

    Another small step forward in the project: I’ve (quickly) received my 16-channel logic analyzer! The package and the box look nice.

    It will allow me to observe the behavior at the FPGA pipeline output and check the signal synchronization.

    No more hardware excuses for not moving forward…


  • Reflections on FIR Filter Architecture

    Bertrand Selva, 10/18/2025 at 14:53

    I thought designing the FIR filter would be simpler; in fact it isn’t that straightforward. To picture how data flows through the filter, I drew tables showing successive cycles, with the propagation of values and the operations done at each step. The filter coefficients are noted ai. The data to process are xi. Here we choose N = 4 for illustration.

    I started with a direct approach, the way you’d do it on an MCU. You’ll see it’s probably not the right method here.

    Per modulus mi, the FIR is:
    yi[n] = ( Σk=0…N−1 hi[k] · xi[n−k] ) mod mi, for all i in [1, p]

    Reconstruction by CRT:
    y[n] = CRT( y1[n], y2[n], …, yp[n] )
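
    Before worrying about the hardware pipeline, a small MATLAB/Octave reference model of these two equations helps (a behavioral sketch of mine, not the FPGA architecture): each modulus runs its own FIR entirely mod mi, and the CRT merges the per-modulus outputs.

    % Behavioral RNS FIR sketch: per-modulus FIR + CRT reconstruction
    moduli = [7 11 13 15 17 19 23 29 31];
    M      = prod(moduli);                   % 100280245065
    h_int  = [1 3 3 1];                      % example taps, N = 4
    x      = randi([0 255], 1, 64);          % example input samples

    % Per-modulus FIR: y_i[n] = ( sum_k h_i[k] * x_i[n-k] ) mod m_i
    y_rns = zeros(numel(moduli), numel(x));
    for i = 1:numel(moduli)
        m = moduli(i);
        y_rns(i, :) = mod(filter(mod(h_int, m), 1, mod(x, m)), m);
    end

    % CRT reconstruction, sample by sample
    y = zeros(1, numel(x));
    for i = 1:numel(moduli)
        m  = moduli(i);
        Mi = M / m;
        [~, t] = gcd(mod(Mi, m), m);         % t = inverse of (Mi mod m) modulo m
        y = mod(y + y_rns(i, :) * mod(Mi * mod(t, m), M), M);
    end

    % Must match a plain binary FIR as long as the true result stays below M
    assert(isequal(y, filter(h_int, 1, x)))

    In hardware, that per-modulus loop becomes one independent lane per modulus; the question below is how to organize the pipeline of a single lane.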

    To escape purely sequential thinking, it helps to reason in terms of a production line: separate “lines” manipulate data items, cycle after cycle, as in a factory.

    Moreover, I counted the operations and accumulations involved in an FIR with N taps; this is easier to do with the following tables:

    • N−1 additions
    • N multiplications
    • N cycles of latency (here one “cycle” means the time to ingest one full sample into the FPGA. The input sample is 16-bit, the input bus is 8-bit, so it takes 2 clocks to advance one cycle.)
    • Looking at the step-5 table, we need to realize N + (N−1) + (N−2) + … + 1 = N(N+1)/2 intermediate products to fill the pipeline.
    • We need to keep N products
    • We also need to keep N−1 partial sums (accumulators)

    Step-by-step

    FIR step 1 diagram
    Figure 1. With x0 ingested, compute the four products tied to the filter {a0, a1, a2, a3}.
    FIR step 2 diagram
    Figure 2. When x1 arrives, compute its four products and start summing: x0a0 + x1a1.
    FIR step 3 diagram
    Figure 3. Ingest x2, add the new term x2a2 into the highlighted box. Start a new accumulator with x2a1 + x1a0. One more addition is still missing to complete a full convolution cycle.
    FIR step 4 diagram
    Figure 4. First full cycle completed: we can extract the first valid output.
    FIR step 5 diagram
    Figure 5. Continue the process. Values like x1a1, x1a2, x1a3, x2a2, etc., are not kept. A red boundary in the step-5 table marks what must be stored vs. what can be discarded.

    This is the direct approach. Of course, you must replicate that whole pipeline for each modulus. With k moduli, you duplicate it k times — that’s exactly where RNS parallelism comes from.

    From the direct form to parallel and transposed architectures

    While analyzing the cycle-by-cycle behavior of the direct form through these tables, a structural issue appears: for an FIR of N taps, the pipeline must effectively include N differentiated phases. At each cycle, every coefficient must be paired with its corresponding delayed sample — which requires a time-varying addressing scheme within the delay line.

    As a result, the pipeline is not stationary: its internal data paths evolve from one cycle to the next (with a global sequence of N cycles). This makes the control logic and sequencing heavier.

    In addition to this non-stationary addressing, the addition chain itself must be broken into multiple intermediate stages to meet timing constraints. Together, these factors make the pipeline hard to program and tune.

    Hence the natural question: isn’t there a more regular structure, where each pipeline stage would perform exactly the same operation at every instant?

    Parallel direct form (fully parallel direct FIR)

    If one wants a “hard” (stationary) pipeline, the solution is to let the samples slide through a delay line:
    X₃ ← X₂ ← X₁ ← X₀ ← x[n].

    The multipliers always see the correct positions (x[n], x[n−1], x[n−2], …): no more dynamic addressing.

    That’s much better — but there’s still a hard point:

    • the reduction of N products (modular addition tree) must itself be pipelined;
    • otherwise, the final addition becomes heavy and limits fmax.

    Pipelining this reduction introduces a few extra cycles of latency. In practice, if one addition combines two values, the reduction tree has log₂(N) stages. ...

    Read more »

  • From 8-bit Demo to Full RNS Pipeline: +2 Across 2^37 Dynamic Range

    Bertrand Selva, 10/17/2025 at 09:42

    This marks one of the key milestones towards the final objective: a design that performs +2, but this time using all 9 moduli, allowing for a dynamic range of 2³⁷.

    Input data arrives via an 8-bit bus, in two cycles per word, while output data is mapped to 16 pins.
    You can find the program on my GitHub: https://github.com/Bertrand-selvasystems/rns-pipeline-16bit

    Program Details (What Changed)

    Data Capture, Phase 1

    Input data enters the FPGA over an 8-bit bus. On each cycle, we alternately load the low and high bytes: the first byte received (LSB) is stored temporarily, then the next one (MSB) completes the 16-bit word. As soon as the word is complete, a one-cycle validity pulse (v0) is issued to signal "word ready" for the next pipeline phase.
    This simple handshake (in_fire = in_valid & in_ready, with in_ready = rst_n) ensures the input is always ready (except during reset), simplifying both testbenches and real use.

    // Phase 1 — Capture 16b from 8b bus (LSB then MSB)
    reg  [7:0]  x_lo;
    reg         have_lo;     // 1 = LSB already captured
    reg         v0;          // 1-cycle pulse: 16b word ready
    reg  [15:0] x0_16;
    
    wire in_fire = in_valid & in_ready;
    assign in_ready = rst_n;
    
    always @(posedge clk) begin
      if (!rst_n) begin
        have_lo <= 1'b0;
        v0      <= 1'b0;
        x_lo    <= 8'd0;
        x0_16   <= 16'd0;
      end else begin
        v0 <= 1'b0; // default: no pulse
        if (in_fire) begin
          if (!have_lo) begin
            // First byte: LSB
            x_lo    <= in_byte;
            have_lo <= 1'b1;
          end else begin
            // Second byte: MSB -> complete word + v0 pulse
            x0_16   <= {in_byte, x_lo};
            have_lo <= 1'b0;
            v0      <= 1'b1; // **1-cycle pulse**
          end
        end
      end
    end
    

    On each cycle when in_fire is true:

    • If have_lo is 0, we capture the LSB.
    • If have_lo is 1, we capture the MSB, assemble the 16-bit word, and issue v0.

    This mechanism guarantees no data is lost, and every incoming byte is processed at the rate dictated by the bus and module availability.

    If you later want to handle discontinuous or asynchronous sources, simply make in_ready dependent on an actual FIFO or on the real buffering capacity of the module.

    In short: two input bytes come in, one 16-bit word goes out, marked by v0 for pipeline synchronization.

    Variable roles:
    x_lo: temporary register for the first received byte (LSB).
    have_lo: flag (1 bit) indicating whether the LSB has already been captured, controlling alternation between LSB storage and word assembly.
    x0_16: register assembling the complete 16-bit word from the newly received MSB and the stored LSB.
    v0: 1-cycle validity pulse; signals that x0_16 now holds a valid word ready for downstream processing.
    in_fire: combinational signal that indicates a data transfer is happening this cycle (in_valid & in_ready).
    in_ready: input availability signal; here, always 1 (except during reset), meaning the module never stalls the input—this is ideal for bench testing.

    Principle: On each in_fire, if no LSB is present, we store the LSB in x_lo and set have_lo to 1.
    On the next in_fire, the MSB is received, we assemble the 16-bit word into x0_16, reset have_lo to 0, and v0 emits a one-cycle pulse to signal "word ready".
    This ensures lossless, overlap-free reconstruction, and maintains correct synchronization with the downstream pipeline.

    RNS to Binary Conversion, Phase 4

    Here, we work with 9 coprime moduli (7, 11, 13, 15, 17, 19, 23, 29, 31), yielding a 37-bit dynamic range (M = 100,280,245,065). Reconstruction using the Chinese Remainder Theorem (CRT) is performed through weighted accumulation: each residue (ri) is assigned a value read from a precomputed ROM table (T), and these terms are summed in multiple pipeline stages, each with a reduction modulo M.

    // Phase 4 — Pipelined CRT "mod M" at each addition stage
    localparam [36:0] M = 37'd100280245065;
    
    // Modular addition: returns (a+b) mod M
    function [36:0] add_modM;
      input [36:0] a, b;
      reg   [37:0] s;   // 0..(2*M-2) < 2^38
    begin
      s = {1'b0,a} + {1'b0,b};
      if (s >= {1'b0,M}) s = s - {1'b0,M};
      add_modM = s[36:0];
    end
    endfunction
    ...
    Read more »

  • Automatic Generation of RNS LUTs in MATLAB

    Bertrand Selva, 10/16/2025 at 09:02

    This MATLAB/Octave script automatically computes look-up tables (LUTs) for a Residue Number System (RNS) with a given set of moduli. It generates:

    • The total product M = m1 * m2 * ... * mn
    • The reconstruction coefficients Ai
    • An example table Tj for a modulus mj
    • Verilog-style outputs (case / endcase) ready to integrate into your FPGA design

    Method

    The reconstruction follows the classical Chinese Remainder Theorem (CRT):

    M  = product of all moduli
    Mi = M / mi
    Ai = (Mi * inverse(Mi mod mi, mi)) mod M
    Tm(r) = (Ai * r) mod M   for r = 0 .. mi-1
    

    Each Tm table stores precomputed modular residues, allowing fast RNS operations on FPGA without any division or modulo logic.

    MATLAB/OCTAVE Script

    close all;
    clear all;
    clc;
    format long g;
    
    %% RNS LUT generation
    
    %% Moduli list
    moduli = [7 11 13 15 17 19 23 29 31];
    
    %% Compute M
    M = 1;
    for i = 1:length(moduli)
        M = M * moduli(i);
    end
    disp(M);   % 100280245065
    
    %% Coefficients Ai
    Ai = zeros(1, numel(moduli));   % preallocation
    for i = 1:numel(moduli)
        m  = moduli(i);
        Mi = M / m;                 % M_i
        t  = mod(Mi, m);            % (M_i mod m)
        [g, x, ~] = gcd(t, m);      % t*x + m*y = g
        if g ~= 1
            error('No modular inverse for m=%d (gcd=%d).', m, g);
        end
        inv_t = mod(x, m);          % inverse of t modulo m
        Ai(i) = mod(inv_t * Mi, M); % A_i = (M_i * inv_t) mod M
    end
    
    %% Example LUT Tj (e.g., j=9 -> m = 31)
    j  = 9;                         % choose index in 'moduli'
    mj = moduli(j);
    Tj = zeros(1, mj);              % Tj(r+1) = (Ai(j)*r) mod M
    
    for r = 0 : mj-1
        Tj(r+1) = mod(Ai(j) * r, M);
    end
    
    % Display
    disp('Ai ='); disp(Ai);
    disp(['T' num2str(mj) '(r) for r=0..' num2str(mj-1) ':']);
    disp(Tj);
    
    %% Verilog-like table generator
    for j = 1:numel(moduli)
        m = moduli(j);
        fprintf('function [36:0] T%d;\n', m);
        fprintf('  input [4:0] r;\n  begin\n    case (r)\n');
        for r = 0:m-1
            fprintf('      5''d%-2d : T%d = 37''d%u;\n', r, m, mod(Ai(j)*r,M));
        end
        fprintf('      default: T%d = 37''d0;\n    endcase\n  end\nendfunction\n\n', m);
    end
    

    Notes

    • Chosen moduli: [7, 11, 13, 15, 17, 19, 23, 29, 31]
    • Total product: M = 100,280,245,065
    • Because 2^37 = 137,438,953,472 > M, 37-bit wide signals are sufficient.
    • Each input r fits in 5 bits (max modulus = 31).
    • In this configuration, M is large enough to prevent overflow when using Q15 coefficients in an RNS-based FIR filter (the goal project).

    Conclusion

    This log automates the generation of RNS LUTs and Verilog snippets. Run the script, copy the output, and integrate it directly into your FPGA design without manual editing. Because Typing 300 Constants by Hand Is for Humans...

    Results

    function [36:0] T7;
      input [4:0] r;
      begin
        case (r)
          5'd0  : T7 = 37'd0;
          5'd1  : T7 = 37'd28651498590;
          5'd2  : T7 = 37'd57302997180;
          5'd3  : T7 = 37'd85954495770;
          5'd4  : T7 = 37'd14325749295;
          5'd5  : T7 = 37'd42977247885;
          5'd6  : T7 = 37'd71628746475;
          default: T7 = 37'd0;
        endcase
      end
    endfunction

    function [36:0] T11;
      input [4:0] r;
      begin
        case (r)
          5'd0  : T11 = 37'd0;
          5'd1  : T11 = 37'd91163859150;
          5'd2  : T11 = 37'd82047473235;
          5'd3  : T11 = 37'd72931087320;
          5'd4  : T11 = 37'd63814701405;
          5'd5  : T11 = 37'd54698315490;
          5'd6  : T11 = 37'd45581929575;
          5'd7  : T11 = 37'd36465543660;
          5'd8  : T11 = 37'd27349157745;
          5'd9  : T11 = 37'd18232771830;
          5'd10 : T11 = 37'd9116385915;
          default: T11 = 37'd0;
        endcase
      end
    endfunction

    function [36:0] T13;
      input [4:0] r;
      begin
        case (r)
          5'd0  : T13 = 37'd0;
          5'd1  : T13 = 37'd53997055035;
       ...

    Read more »

  • Receiving the USB Blaster

    Bertrand Selva, 10/14/2025 at 14:17

    I just received the Altera USB Blaster, used to program the Cyclone IV FPGA.

    What it’s for:

    • JTAG interface between the PC and Intel/Altera FPGA or CPLD (Cyclone, MAX, etc.)

    • Programming of configuration files (.sof, .pof, .jic) via Quartus

    • Hardware debugging access (SignalTap, JTAG UART, boundary-scan)

    • Low-level memory read/write and diagnostic operations

    The real tests will begin soon. I’ll post a log once the first successful programming sequence is done.



Discussions

Bertrand Selva wrote 12/06/2025 at 10:01

The video is excellent (the kind of professor one wishes they had).
I hadn’t seen that lecture before starting my own RNS project, and it would clearly have helped me structure things much earlier.

I’ve just commented on his video in the hope of opening a small technical discussion.
Thank you again for sharing these materials.


Pascal wrote 12/05/2025 at 08:55

I found some interesting resources regarding RNS, which you might enjoy as well. There is a book on Computer Arithmetic which has a whole chapter dedicated to RNS: https://web.ece.ucsb.edu/~parhami/text_comp_arit.htm 

Buying the book is a little bit costly but a bit of google-foo solves this nicely. The same professor also has his presentation online on Youtube regarding this chapter: https://www.youtube.com/watch?v=aPagJWEeDMc

