From 8ec442da0a91b37ecfbff3dae33f6f133a206bed Mon Sep 17 00:00:00 2001
From: Stan Grams
Date: Sun, 1 Mar 2026 11:06:27 +0100
Subject: [PATCH] [docs](trx-rs): add DSP chain performance optimization guidelines

Document lessons learned from WFM stereo decoder and audio encoding
optimization: quadrature NCO, double-angle identities, AVX2 batching,
polyphase resampler design, filter matching, stereo detection
decimation, and opus encoder tuning.

Co-Authored-By: Claude Opus 4.6
Signed-off-by: Stan Grams
---
 OPTIMIZATION.md | 175 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 175 insertions(+)
 create mode 100644 OPTIMIZATION.md

diff --git a/OPTIMIZATION.md b/OPTIMIZATION.md
new file mode 100644
index 0000000..30146b5
--- /dev/null
+++ b/OPTIMIZATION.md
@@ -0,0 +1,175 @@
+# DSP Chain Performance Optimization Guidelines
+
+This document captures lessons learned and best practices for optimizing
+the real-time DSP pipelines in trx-rs, particularly the WFM stereo decoder
+and audio encoding paths.
+
+## General Principles
+
+1. **Measure first.** Profile with real workloads before optimizing.
+   Synthetic benchmarks miss cache effects, branch prediction patterns,
+   and real signal statistics.
+
+2. **Eliminate transcendentals from inner loops.** A single `sin_cos` or
+   `atan2` per sample at 200 kHz composite rate costs millions of calls
+   per second. Replace with:
+   - **Quadrature NCO** for oscillators: maintain `(cos, sin)` state and
+     rotate by a precomputed `(cos_inc, sin_inc)` each sample. Cost:
+     4 muls + 2 adds. Renormalize every ~1024 samples to prevent drift.
+   - **Double-angle identities** to derive `sin(2θ), cos(2θ)` from
+     `sin(θ), cos(θ)`: `sin2 = 2·sin·cos`, `cos2 = 2·cos²−1`.
+   - **I/Q arm extraction** for PLL phase error: if you have
+     `i = lp(signal * cos)` and `q = lp(signal * -sin)`, then
+     `sin(err) = q/mag`, `cos(err) = i/mag` — no `atan2` or `sin_cos`
+     needed for the rotation.
+
+3. **Batch operations for SIMD.** Separate data-parallel work (e.g. FM
+   discriminator: conjugate-multiply + atan2) from sequential-state work
+   (PLL, biquads). Process the parallel part in batches of 8 using AVX2,
+   then feed scalar results into the sequential pipeline.
+
+4. **Power-of-2 sizes for circular buffers.** Use `& (N-1)` bitmask
+   instead of `% N` modulo. Ensure buffer lengths (e.g. `WFM_RESAMP_TAPS`)
+   are powers of two.
+
+5. **Circular buffers over shift registers.** Writing one sample at a
+   ring-buffer position is O(1); `rotate_left(1)` is O(N). For a 32-tap
+   FIR called 3× per composite sample, this eliminates ~200 byte-moves
+   per sample.
+
+6. **Decimate slow-changing metrics.** Stereo detection (pilot coherence,
+   lock, drive) changes over tens of milliseconds. Running it every 16th
+   sample instead of every sample saves ~94% of that work with no audible
+   effect. Accumulate values over the window and process the average.
+
+## Filter Design
+
+- **Match filter cutoffs** across parallel paths (sum and diff) to ensure
+  identical group delay. Mismatched cutoffs cause frequency-dependent
+  phase errors that directly degrade stereo separation.
+
+- **4th-order Butterworth** (two cascaded biquads) is generally sufficient
+  when the polyphase resampler provides additional stopband rejection.
+  6th-order adds 50% more biquad evaluations per sample for diminishing
+  returns.
+
+- **Q values for Butterworth cascades:**
+  - 4th-order: Q₁ = 0.5412, Q₂ = 1.3066
+  - 6th-order: Q₁ = 0.5176, Q₂ = 0.7071, Q₃ = 1.9319
+
+## Polyphase Resampler
+
+- **Compute cutoff from actual rate ratio:** `cutoff = output_rate / input_rate`.
+  A fixed cutoff (e.g. 0.94) can be catastrophically wrong — at 200 kHz
+  composite to 48 kHz audio, it passes everything up to 94 kHz while the
+  output Nyquist is only 24 kHz. The 38 kHz stereo subcarrier residuals
+  alias directly into the treble range.
+
+- **Blackman-Harris window** gives ~92 dB stopband rejection vs ~43 dB
+  for Hamming, at the same tap count. Use it for the windowed-sinc
+  coefficients:
+  ```
+  w(n) = 0.35875 − 0.48829·cos(2πn/N) + 0.14128·cos(4πn/N) − 0.01168·cos(6πn/N)
+  ```
+
+- **32 taps** with Blackman-Harris and a proper cutoff gives >60 dB
+  stopband rejection — more than enough. 64 taps doubles the MAC count
+  for marginal improvement.
+
+- **64 polyphase phases** balances fractional sample resolution against
+  coefficient bank size (64 × 32 × 4 = 8 KB fits comfortably in L1
+  cache). 128 phases offer diminishing returns for double the memory.
+
+## FM Discriminator
+
+- **Batch with AVX2:** The conjugate-multiply + atan2 pattern is
+  data-parallel (each output depends only on two adjacent input samples).
+  Process 8 samples at a time using 256-bit SIMD.
+
+- **Use a high-precision atan2 polynomial** for AVX2. A 7th-order minimax
+  polynomial (max error ~2.4e-7 rad) avoids the treble distortion that
+  cheap 1st-order approximations (e.g. `0.273*(1−|z|)`) introduce on
+  strong signals. Coefficients:
+  ```
+  c0 = 0.999_999_5
+  c1 = −0.333_326_1
+  c2 = 0.199_777_1
+  c3 = −0.138_776_8
+  ```
+
+- **Branchless argument reduction** for atan2: swap `|y|` and `|x|` using
+  masks rather than branches, apply quadrant correction via arithmetic
+  shift and copysign.
+
+## WFM Stereo Specifics
+
+- **Pilot notch before diff demod:** The 19 kHz pilot leaks into the
+  38 kHz multiplication and creates intermod products. Notch it from the
+  composite signal before `x * cos(2θ)`. This notch is separate from the
+  mono-path pilot notch (which sits after the sum LPF).
+
+- **IQ hard limiter before FM discriminator:** For WFM, only the phase
+  carries information. Normalizing IQ magnitude to 1.0 prevents
+  overdeviation artifacts and clipping. Guard against zero magnitude.
+
+- **Binary stereo blend:** A smooth blend function (e.g. smoothstep)
+  sounds good in theory but reduces real-world separation. Use
+  `blend = 1.0` when pilot is detected, `0.0` otherwise.
+
+- **STEREO_MATRIX_GAIN = 0.50:** The correct unity factor for
+  `L = (S+D)/2`, `R = (S−D)/2`. Lower values waste headroom; higher
+  values clip.
+
+## Opus Encoding
+
+- **Complexity 5** (down from default 9-10) saves significant CPU with
+  minimal quality impact at bitrates ≥128 kbps. The higher complexity
+  levels run expensive psychoacoustic search algorithms that produce
+  negligible improvement at high bitrates.
+
+- **256 kbps** is transparent for stereo FM broadcast audio. Going higher
+  wastes bandwidth; going below 128 kbps may introduce artifacts on
+  complex program material.
+
+- **`Application::Audio`** (not VoIP) — uses the MDCT-based CELT mode
+  optimized for music and broadband audio rather than speech.
+
+## AVX2 Guidelines
+
+- Gate all AVX2 code behind `#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]`
+  and runtime `is_x86_feature_detected!("avx2")` checks.
+
+- Mark unsafe SIMD functions with `#[target_feature(enable = "avx2")]`
+  so the compiler generates AVX2 code for the function body.
+
+- Provide scalar fallbacks for non-x86 targets and CPUs without AVX2.
+
+- Add epsilon guards (e.g. `1e-12`) to denominators in SIMD paths where
+  both numerator and denominator can be zero simultaneously.
+
+## What NOT to Optimize
+
+- **Biquad filters** — already minimal (5 muls + 4 adds per sample).
+  The sequential state dependency prevents SIMD vectorization within a
+  single stream.
+
+- **One-pole lowpass filters** — single multiply-accumulate, cannot be
+  made faster.
+
+- **DC blockers** — trivial per-sample cost.
+
+- **Deemphasis** — single biquad, runs at audio rate (not composite rate).
+
+## Profiling Tips
+
+- Use `cargo build --release` — debug builds are 10-50x slower and
+  misleading for DSP profiling.
+
+- `perf stat` / `Instruments` on the inner loop to check IPC, cache
+  misses, and branch mispredictions.
+
+- Compare CPU% with stereo enabled vs disabled to isolate stereo-specific
+  costs (diff path biquads, pilot PLL, 38 kHz demod, resampler channels).
+
+- Watch for unexpected `libm` calls in disassembly — the compiler may
+  not inline `f32::atan2` or `f32::sin_cos` even in release mode.
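
The quadrature NCO and double-angle items under "General Principles" in the patch above can be sketched in Rust roughly as follows. This is a minimal illustration, not code taken from trx-rs: the `QuadNco` type, its `advance` method, the `double_angle` helper, and the fixed 1024-sample renormalization interval are all assumed names and choices made for the example.

```rust
/// Quadrature NCO: a unit phasor advanced by one complex multiply per sample.
/// Hypothetical type for illustration; not part of trx-rs.
pub struct QuadNco {
    cos: f32,
    sin: f32,
    cos_inc: f32,
    sin_inc: f32,
    since_renorm: u32,
}

impl QuadNco {
    /// Assumed interval, following the "~1024 samples" guidance.
    const RENORM_INTERVAL: u32 = 1024;

    pub fn new(freq_hz: f32, sample_rate_hz: f32) -> Self {
        // The only transcendental call happens once, at setup time.
        let step = 2.0 * std::f32::consts::PI * freq_hz / sample_rate_hz;
        let (sin_inc, cos_inc) = step.sin_cos();
        Self { cos: 1.0, sin: 0.0, cos_inc, sin_inc, since_renorm: 0 }
    }

    /// Returns (cos θ, sin θ) for the current sample, then advances θ.
    pub fn advance(&mut self) -> (f32, f32) {
        let out = (self.cos, self.sin);
        // Complex rotation by the precomputed increment: 4 muls + 2 adds.
        let c = self.cos * self.cos_inc - self.sin * self.sin_inc;
        let s = self.sin * self.cos_inc + self.cos * self.sin_inc;
        self.cos = c;
        self.sin = s;
        self.since_renorm += 1;
        if self.since_renorm == Self::RENORM_INTERVAL {
            // Pull the phasor back onto the unit circle to cancel rounding drift.
            let inv_mag = 1.0 / (c * c + s * s).sqrt();
            self.cos *= inv_mag;
            self.sin *= inv_mag;
            self.since_renorm = 0;
        }
        out
    }
}

/// Double-angle identities: (cos 2θ, sin 2θ) from (cos θ, sin θ).
pub fn double_angle(cos: f32, sin: f32) -> (f32, f32) {
    (2.0 * cos * cos - 1.0, 2.0 * sin * cos)
}
```

Under these assumptions, a 19 kHz pilot reference at the 200 kHz composite rate would be `QuadNco::new(19_000.0, 200_000.0)`, and passing each `(cos, sin)` pair from `advance()` through `double_angle` yields the 38 kHz subcarrier phasor without any per-sample `sin_cos` call.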