I just hit a performance milestone on Darśana. On a Raspberry Pi Pico 2 (RP2350 @ 204 MHz), using Core 0 only, it now runs a 5-voice polyphonic engine with 3 oscillators, a noise source, and a TPT ladder filter per voice. That is 15 oscillators, 5 noise sources, and 5 TPT ladder filters in total, with peak CPU usage under 90% and no dropouts.
I chose TPT (topology-preserving transform) for a simple reason: I wanted a musical, high-quality ladder filter at the heart of the synth. A ladder filter is where a synth’s character often lives: that smooth “grip” when you move the cutoff is a big part of what makes it feel like an instrument. The trade-off is that TPT (especially ZDF-style structures) can be computationally demanding compared to lightweight traditional IIR designs. So I drew as much as I could on guidance from the literature, combined with general real-time DSP optimization techniques, to keep the sound quality while fitting within a realistic CPU budget.
TPT + ZDF: aiming for “no one-sample delay” behavior
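With a ZDF (zero-delay feedback) TPT structure based on trapezoidal integration, the feedback loop is solved within the current sample instead of being fed through a one-sample delay. As a result, the filter stays stable and “analog-like” even under aggressive cutoff modulation, without the thinness you often get from naïve digital structures.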
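To make the "no one-sample delay" idea concrete, here is a minimal sketch of a single TPT one-pole low-pass stage using a trapezoidal integrator. It is not Darśana's code: the names `tpt1_t`, `tpt1_coeff`, and `tpt1_process` are hypothetical, the stage is purely linear, and the caller is assumed to supply the prewarped gain g = tan(pi * fc / fs).

```c
#include <assert.h>

/* Hypothetical one-pole TPT low-pass stage (linear, illustration only). */
typedef struct {
    float s;  /* trapezoidal integrator state */
} tpt1_t;

/* Control rate: turn the prewarped gain g = tan(pi * fc / fs) into
 * G = g / (1 + g). The division happens here, not per sample. */
static float tpt1_coeff(float g)
{
    assert(g > 0.0f);
    return g / (1.0f + g);
}

/* Audio rate: the implicit equation y = s + g * (x - y) is solved
 * algebraically, giving y = G * x + (1 - G) * s with no unit delay
 * inside the loop. */
static inline float tpt1_process(tpt1_t *st, float G, float x)
{
    const float v = (x - st->s) * G;  /* integrator input, already resolved */
    const float y = v + st->s;        /* output for this sample */
    st->s = y + v;                    /* trapezoidal state update */
    return y;
}
```

Four such stages in series, plus a feedback path, form the usual ladder topology. The feedback path is where the zero-delay solve becomes more involved.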
Key speedups
To reduce real-time cost, I applied these optimizations:
- Non-iterative PWL solver (no Newton iterations): Instead of heavy iterative solvers, I use a piecewise-linear (PWL) approach that resolves the saturation characteristic (hard-clipper style) in a single algebraic step.
- Division-free inner loop: Per-sample divisions are eliminated. Reciprocals such as `inv_denom` are computed at the control rate, so the audio loop is dominated by multiplications.
- Efficient 2× oversampling with IIR: To mitigate aliasing from nonlinearities, I use 2× oversampling, but with a 4th-order (2-biquad) IIR instead of FIR, achieving strong attenuation with far fewer cycles.
- Full unrolling + register-friendly state: The 4-stage ladder is unrolled and internal states are kept as locals to maximize time in registers, minimizing memory traffic and loop overhead.
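The sketch below shows how three of these ideas can combine in one routine: a single-step PWL solve, a reciprocal precomputed at control rate, and an unrolled 4-stage ladder with states held in locals. It rests on assumptions the post does not confirm: a single hard clipper limited to ±1 placed at the loop input, a feedback amount `k`, and the hypothetical names `ladder_t`, `ladder_set`, `ladder_reset`, and `ladder_process`. Only `inv_denom` is a name from the post. With this clipper placement, the implicit equation u = clip(x - k * (G^4 * u + S)) has a unique solution, which is the linear solution clamped to the clipper range.

```c
#include <assert.h>

/* Hypothetical 4-pole ZDF ladder with ONE hard clipper at the loop input.
 * The clipper placement and the +/-1 range are assumptions for illustration. */
typedef struct {
    float s1, s2, s3, s4;  /* trapezoidal integrator states */
    float G, G2, G3;       /* G = g / (1 + g) and its powers */
    float one_minus_G;
    float k;               /* feedback amount (resonance) */
    float inv_denom;       /* 1 / (1 + k * G^4), control rate only */
} ladder_t;

/* Control rate: the only place a division appears.
 * g is the prewarped gain tan(pi * fc / fs), supplied by the caller. */
static void ladder_set(ladder_t *f, float g, float k)
{
    assert(g > 0.0f && k >= 0.0f);
    f->G = g / (1.0f + g);
    f->G2 = f->G * f->G;
    f->G3 = f->G2 * f->G;
    f->one_minus_G = 1.0f - f->G;
    f->k = k;
    f->inv_denom = 1.0f / (1.0f + k * f->G3 * f->G);
}

static void ladder_reset(ladder_t *f)
{
    f->s1 = f->s2 = f->s3 = f->s4 = 0.0f;
}

/* Audio rate: multiplies, adds, and one clamp. No division, no iteration. */
static inline float ladder_process(ladder_t *f, float x)
{
    const float G = f->G;
    float s1 = f->s1, s2 = f->s2, s3 = f->s3, s4 = f->s4;  /* keep in registers */

    /* The ladder output is y4 = G^4 * u + S, where S depends only on states. */
    const float S = f->one_minus_G * (f->G3 * s1 + f->G2 * s2 + G * s3 + s4);

    /* Single-step PWL solve: linear-region solution, then clamp. */
    float u = (x - f->k * S) * f->inv_denom;
    if (u > 1.0f) u = 1.0f;
    else if (u < -1.0f) u = -1.0f;

    /* Four unrolled TPT one-pole stages. */
    float v, y;
    v = (u - s1) * G;  y = v + s1;  s1 = y + v;
    v = (y - s2) * G;  y = v + s2;  s2 = y + v;
    v = (y - s3) * G;  y = v + s3;  s3 = y + v;
    v = (y - s4) * G;  y = v + s4;  s4 = y + v;

    f->s1 = s1;  f->s2 = s2;  f->s3 = s3;  f->s4 = s4;
    return y;
}
```

A caller would run `ladder_set` whenever cutoff or resonance changes and `ladder_process` once per (oversampled) sample. A softer saturation curve with more segments would need a region test per segment, which this sketch does not attempt.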
Hardware-aware optimization: running critical DSP from SRAM to kill jitter
Executing from flash via XIP can introduce unpredictable latency (cache misses → jitter). Audio needs deterministic timing, so:
- I mark time-critical routines (e.g., `filter_process`, the PWL solver) with `__not_in_flash_func` and run them from SRAM.
- This improves determinism under load and helps prevent dropouts even when many peripherals are active.
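As an illustration of the SRAM placement, here is a minimal sketch of how `__not_in_flash_func` wraps a function name in the Pico SDK. The function `voice_filter_block` and its body are hypothetical stand-ins, not the project's `filter_process`. The fallback macro exists only so the snippet also compiles on a host machine without the SDK.

```c
#include <assert.h>
#include <stddef.h>

/* On a Pico build, the SDK headers provide __not_in_flash_func.
 * Elsewhere, fall back to a no-op so the sketch still compiles. */
#if defined(__has_include)
#  if __has_include("pico/stdlib.h")
#    include "pico/stdlib.h"
#  endif
#endif
#ifndef __not_in_flash_func
#  define __not_in_flash_func(name) name  /* host fallback, assumption */
#endif

/* Hypothetical hot routine: a one-pole TPT low-pass applied in place.
 * With the SDK macro, the function is linked into a RAM section, so
 * calling it does not depend on the XIP cache. */
void __not_in_flash_func(voice_filter_block)(float *buf, size_t n,
                                             float G, float *state)
{
    assert(buf != NULL && state != NULL);
    float s = *state;  /* keep the state in a register during the loop */
    for (size_t i = 0; i < n; ++i) {
        const float v = (buf[i] - s) * G;
        const float y = v + s;
        s = y + v;
        buf[i] = y;
    }
    *state = s;
}
```

Any helper called from such a routine needs the same treatment (or must be inlined), otherwise the call still reaches into flash.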
It's an analog blueprint running at digital speed.
Hiroyuki OYAMA