trx-rs/docs/Optimization-Guidelines.md
sjg ba48de2d30
Initial commit
Signed-off-by: Stan Grams <sjg@haxx.space>
2026-05-17 23:25:14 +02:00

# DSP Optimization Guidelines
This document captures lessons learned and best practices for optimizing
the real-time DSP pipelines in trx-rs, particularly the WFM stereo decoder
and audio encoding paths.
## General Principles
1. **Measure first.** Profile with real workloads before optimizing.
Synthetic benchmarks miss cache effects, branch prediction patterns,
and real signal statistics.
2. **Eliminate transcendentals from inner loops.** Even one `sin_cos` or
`atan2` per sample at the 200 kHz composite rate means 200,000 libm
calls per second per call site. Replace with:
   - **Quadrature NCO** for oscillators: maintain `(cos, sin)` state and
     rotate by a precomputed `(cos_inc, sin_inc)` each sample. Cost:
     4 muls + 2 adds. Renormalize every ~1024 samples to prevent drift.
   - **Double-angle identities** to derive `sin(2θ), cos(2θ)` from
     `sin(θ), cos(θ)`: `sin2 = 2·sin·cos`, `cos2 = 2·cos²−1`.
   - **I/Q arm extraction** for PLL phase error: if you have
     `i = lp(signal * cos)` and `q = lp(signal * -sin)`, then
     `sin(err) = q/mag`, `cos(err) = i/mag`; no `atan2` or `sin_cos`
     is needed for the rotation.
3. **Batch operations for SIMD.** Separate data-parallel work (e.g. FM
discriminator: conjugate-multiply + atan2) from sequential-state work
(PLL, biquads). Process the parallel part in batches of 8 using AVX2,
then feed scalar results into the sequential pipeline.
4. **Power-of-2 sizes for circular buffers.** Use `& (N-1)` bitmask
instead of `% N` modulo. Ensure buffer lengths (e.g. `WFM_RESAMP_TAPS`)
are powers of two.
5. **Circular buffers over shift registers.** Writing one sample at a
ring-buffer position is O(1); `rotate_left(1)` is O(N). For a 32-tap
FIR called 3× per composite sample, this eliminates ~200 byte-moves
per sample.
6. **Decimate slow-changing metrics.** Stereo detection (pilot coherence,
lock, drive) changes over tens of milliseconds. Running it every 16th
sample instead of every sample saves ~94% of that work with no audible
effect. Accumulate values over the window and process the average.
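
The quadrature NCO and double-angle identities from principle 2 can be sketched as follows. This is a minimal illustration, not the trx-rs implementation: the `QuadNco` type, its `step` method, and the `double_angle` helper are hypothetical names, and the 1024-sample renormalization interval is simply the figure quoted above.

```rust
/// Quadrature NCO: keeps (cos, sin) of the current phase and rotates it
/// by a fixed, precomputed increment each sample.
pub struct QuadNco {
    cos: f32,
    sin: f32,
    cos_inc: f32,
    sin_inc: f32,
    count: u32,
}

impl QuadNco {
    /// The only transcendental calls happen here, once, outside the loop.
    pub fn new(freq_hz: f64, sample_rate_hz: f64) -> Self {
        let w = 2.0 * std::f64::consts::PI * freq_hz / sample_rate_hz;
        Self {
            cos: 1.0,
            sin: 0.0,
            cos_inc: w.cos() as f32,
            sin_inc: w.sin() as f32,
            count: 0,
        }
    }

    /// Returns (cos θ, sin θ) for the current sample, then advances θ.
    pub fn step(&mut self) -> (f32, f32) {
        let out = (self.cos, self.sin);
        // Complex rotation: 4 muls + 2 adds.
        let c = self.cos * self.cos_inc - self.sin * self.sin_inc;
        let s = self.sin * self.cos_inc + self.cos * self.sin_inc;
        self.cos = c;
        self.sin = s;
        self.count = self.count.wrapping_add(1);
        // Every 1024 samples, pull the vector back onto the unit circle.
        if self.count & 1023 == 0 {
            let inv = 1.0 / (c * c + s * s).sqrt();
            self.cos *= inv;
            self.sin *= inv;
        }
        out
    }
}

/// Derives (cos 2θ, sin 2θ) from (cos θ, sin θ) with no trig calls.
pub fn double_angle(cos: f32, sin: f32) -> (f32, f32) {
    (2.0 * cos * cos - 1.0, 2.0 * sin * cos)
}
```

For the WFM pilot, one such oscillator at 19 kHz combined with `double_angle` would yield the 38 kHz subcarrier phase without running a second oscillator.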
## Filter Design
- **Match filter cutoffs** across parallel paths (sum and diff) to ensure
identical group delay. Mismatched cutoffs cause frequency-dependent
phase errors that directly degrade stereo separation.
- **4th-order Butterworth** (two cascaded biquads) is generally sufficient
when the polyphase resampler provides additional stopband rejection.
6th-order adds 50% more biquad evaluations per sample for diminishing
returns.
- **Q values for Butterworth cascades:**
  - 4th-order: Q₁ = 0.5412, Q₂ = 1.3066
  - 6th-order: Q₁ = 0.5176, Q₂ = 0.7071, Q₃ = 1.9319
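
The Q values listed above follow from the Butterworth pole angles: for an even order N, section k has Q = 1 / (2·cos((2k+1)π / 2N)) for k = 0 … N/2−1. A small sketch that reproduces the table (the function name `butterworth_qs` is hypothetical, not part of trx-rs):

```rust
use std::f64::consts::PI;

/// Q factors of the second-order sections of an even-order Butterworth
/// low-pass, returned in ascending order (Q1, Q2, ...).
pub fn butterworth_qs(order: usize) -> Vec<f64> {
    assert!(order >= 2 && order % 2 == 0, "even order expected");
    (0..order / 2)
        .map(|k| {
            // Angle of the k-th conjugate pole pair, measured from the
            // negative real axis.
            let theta = (2 * k + 1) as f64 * PI / (2 * order) as f64;
            1.0 / (2.0 * theta.cos())
        })
        .collect()
}
```

Calling `butterworth_qs(4)` or `butterworth_qs(6)` gives the per-biquad Q values to feed into whatever biquad design routine the pipeline uses.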
## Polyphase Resampler
- **Compute cutoff from actual rate ratio:** `cutoff = output_rate / input_rate`.
A fixed cutoff (e.g. 0.94) can be catastrophically wrong — at 200 kHz
composite to 48 kHz audio, it passes everything up to 94 kHz while the
output Nyquist is only 24 kHz. The 38 kHz stereo subcarrier residuals
alias directly into the treble range.
- **Blackman-Harris window** gives ~92 dB stopband rejection vs ~43 dB
for Hamming, at the same tap count. Use it for the windowed-sinc
coefficients:
```
w(n) = 0.35875 − 0.48829·cos(2πn/N) + 0.14128·cos(4πn/N) − 0.01168·cos(6πn/N)
```
- **32 taps** with Blackman-Harris and a proper cutoff gives >60 dB
stopband rejection — more than enough. 64 taps doubles the MAC count
for marginal improvement.
- **64 polyphase phases** balances fractional sample resolution against
coefficient bank size (64 × 32 × 4 = 8 KB fits comfortably in L1
cache). 128 phases offer diminishing returns for double the memory.
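
As a rough sketch of how the points above fit together, the following builds a 64 × 32 coefficient bank from a rate-derived cutoff and a Blackman-Harris window. Several details are assumptions made for illustration and may differ from trx-rs: the names `rate_cutoff` and `build_bank`, the row-major phase layout, clamping the cutoff to 1.0 when upsampling, centring the prototype at `TAPS / 2`, and normalizing each phase to unity DC gain.

```rust
use std::f64::consts::PI;

pub const TAPS: usize = 32;
pub const PHASES: usize = 64;

/// Cutoff as a fraction of the input Nyquist frequency.
/// 200 kHz -> 48 kHz gives 0.24, i.e. 24 kHz.
pub fn rate_cutoff(input_rate: f64, output_rate: f64) -> f64 {
    (output_rate / input_rate).min(1.0)
}

/// 4-term Blackman-Harris window evaluated at position `u` in [0, n).
fn blackman_harris(u: f64, n: f64) -> f64 {
    let x = 2.0 * PI * u / n;
    0.35875 - 0.48829 * x.cos() + 0.14128 * (2.0 * x).cos() - 0.01168 * (3.0 * x).cos()
}

/// Normalized sinc, sin(pi x) / (pi x), with the removable singularity handled.
fn sinc(x: f64) -> f64 {
    if x.abs() < 1e-12 {
        1.0
    } else {
        (PI * x).sin() / (PI * x)
    }
}

/// Builds PHASES rows of TAPS windowed-sinc coefficients, row-major.
pub fn build_bank(cutoff: f64) -> Vec<f32> {
    let mut bank = Vec::with_capacity(PHASES * TAPS);
    for p in 0..PHASES {
        let frac = p as f64 / PHASES as f64;
        let mut row = [0.0f64; TAPS];
        for (k, tap) in row.iter_mut().enumerate() {
            let u = k as f64 + frac; // position inside the prototype
            let t = u - TAPS as f64 / 2.0; // offset from the filter centre
            *tap = cutoff * sinc(cutoff * t) * blackman_harris(u, TAPS as f64);
        }
        // Normalize each phase so a constant input keeps its level.
        let sum: f64 = row.iter().sum();
        bank.extend(row.iter().map(|&c| (c / sum) as f32));
    }
    bank
}
```

With `f32` coefficients the bank occupies 64 × 32 × 4 = 8192 bytes, matching the 8 KB figure above.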
## FM Discriminator
- **Batch with AVX2:** The conjugate-multiply + atan2 pattern is
data-parallel (each output depends only on two adjacent input samples).
Process 8 samples at a time using 256-bit SIMD.
- **Use a high-precision atan2 polynomial** for AVX2. A 7th-order minimax
polynomial (max error ~2.4e-7 rad) avoids the treble distortion that
cheap 1st-order approximations (e.g. `0.273*(1-|z|)`) introduce on
strong signals. Coefficients:
```
c0 =  0.999_999_5
c1 = -0.333_326_1
c2 =  0.199_777_1
c3 = -0.138_776_8
```
- **Branchless argument reduction** for atan2: swap `|y|` and `|x|` using
masks rather than branches, apply quadrant correction via arithmetic
shift and copysign.
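
For reference, a scalar version of the conjugate-multiply + atan2 discriminator might look like the sketch below. It uses the standard library `atan2` rather than the polynomial, and the function name and the `(f32, f32)` sample representation are assumptions for illustration. Apart from the carried-over `prev` sample, each output depends only on two adjacent inputs, which is what makes the loop body a candidate for 8-wide batching.

```rust
/// Scalar FM discriminator: the phase of z[n] * conj(z[n-1]).
/// `prev` carries the last sample across calls so blocks can be chained.
pub fn fm_discriminate(iq: &[(f32, f32)], prev: &mut (f32, f32), out: &mut Vec<f32>) {
    for &(i, q) in iq {
        let (i0, q0) = *prev;
        // (i + jq) * (i0 - j q0)
        let re = i * i0 + q * q0;
        let im = q * i0 - i * q0;
        // Phase step between the two samples, in radians.
        out.push(im.atan2(re));
        *prev = (i, q);
    }
}
```

A scalar routine of this shape can also serve as the fallback path and as a test oracle for a SIMD implementation.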
## WFM Stereo Specifics
- **Pilot notch before diff demod:** The 19 kHz pilot leaks into the
38 kHz multiplication and creates intermod products. Notch it from the
composite signal before `x * cos(2θ)`. This notch is separate from the
mono-path pilot notch (which sits after the sum LPF).
- **IQ hard limiter before FM discriminator:** For WFM, only the phase
carries information. Normalizing IQ magnitude to 1.0 prevents
overdeviation artifacts and clipping. Guard against zero magnitude.
- **Binary stereo blend:** A smooth blend function (e.g. smoothstep)
sounds good in theory but reduces real-world separation. Use
`blend = 1.0` when pilot is detected, `0.0` otherwise.
- **STEREO_MATRIX_GAIN = 0.50:** The correct unity factor for
`L = (S+D)/2`, `R = (S−D)/2`. Lower values waste headroom; higher
values clip.
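
The limiter and matrix steps above can be sketched in a few lines. Only `STEREO_MATRIX_GAIN` comes from the text; the function names, the choice to output `(0.0, 0.0)` for a zero-magnitude sample, and applying the blend factor to the difference signal are assumptions made for this illustration.

```rust
pub const STEREO_MATRIX_GAIN: f32 = 0.50;

/// Normalizes an IQ sample to unit magnitude so only phase remains.
/// Near-zero input is mapped to (0, 0) instead of dividing by zero.
pub fn limit_iq(i: f32, q: f32) -> (f32, f32) {
    let mag = (i * i + q * q).sqrt();
    if mag > 1e-12 {
        (i / mag, q / mag)
    } else {
        (0.0, 0.0)
    }
}

/// Binary-blend stereo matrix: L = (S + D) / 2, R = (S - D) / 2 when the
/// pilot is detected, and plain mono (L = R = S / 2) otherwise.
pub fn stereo_matrix(sum: f32, diff: f32, pilot_detected: bool) -> (f32, f32) {
    let blend = if pilot_detected { 1.0 } else { 0.0 };
    let d = blend * diff;
    (
        (sum + d) * STEREO_MATRIX_GAIN,
        (sum - d) * STEREO_MATRIX_GAIN,
    )
}
```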
## Opus Encoding
- **Complexity 5** (down from default 9-10) saves significant CPU with
minimal quality impact at bitrates ≥128 kbps. The higher complexity
levels run expensive psychoacoustic search algorithms that produce
negligible improvement at high bitrates.
- **256 kbps** is transparent for stereo FM broadcast audio. Going higher
wastes bandwidth; going below 128 kbps may introduce artifacts on
complex program material.
- **`Application::Audio`** (not VoIP) — uses the MDCT-based CELT mode
optimized for music and broadband audio rather than speech.
## AVX2 Guidelines
- Gate all AVX2 code behind `#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]`
and runtime `is_x86_feature_detected!("avx2")` checks.
- Mark unsafe SIMD functions with `#[target_feature(enable = "avx2")]`
so the compiler generates AVX2 code for the function body.
- Provide scalar fallbacks for non-x86 targets and CPUs without AVX2.
- Add epsilon guards (e.g. `1e-12`) to denominators in SIMD paths where
both numerator and denominator can be zero simultaneously.
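
A minimal sketch of the gating pattern described above, using a trivial gain kernel as a stand-in for real DSP work. The function names are hypothetical; the `cfg` attribute, `#[target_feature]`, `is_x86_feature_detected!`, and the `std::arch` intrinsics are standard Rust.

```rust
/// Scalar fallback, used on non-x86 targets and CPUs without AVX2.
fn scale_scalar(buf: &mut [f32], gain: f32) {
    for v in buf.iter_mut() {
        *v *= gain;
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(buf: &mut [f32], gain: f32) {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    let mut chunks = buf.chunks_exact_mut(8);
    // SAFETY: the caller guarantees AVX2 is available; every chunk holds
    // exactly 8 f32 values and the loadu/storeu intrinsics accept
    // unaligned pointers.
    unsafe {
        let g = _mm256_set1_ps(gain);
        for chunk in &mut chunks {
            let v = _mm256_loadu_ps(chunk.as_ptr());
            _mm256_storeu_ps(chunk.as_mut_ptr(), _mm256_mul_ps(v, g));
        }
    }
    // Handle the 0..7 leftover samples with the scalar path.
    scale_scalar(chunks.into_remainder(), gain);
}

/// Public entry point: picks the AVX2 path at runtime when possible.
pub fn scale(buf: &mut [f32], gain: f32) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            unsafe { scale_avx2(buf, gain) };
            return;
        }
    }
    scale_scalar(buf, gain);
}
```

Keeping the scalar routine as the tail handler means both paths share one reference implementation, which simplifies testing that they agree.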
## What NOT to Optimize
- **Biquad filters** — already minimal (5 muls + 4 adds per sample).
The sequential state dependency prevents SIMD vectorization within a
single stream.
- **One-pole lowpass filters** — single multiply-accumulate, cannot be
made faster.
- **DC blockers** — trivial per-sample cost.
- **Deemphasis** — single biquad, runs at audio rate (not composite rate).
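
To make the biquad cost and its sequential dependency concrete, here is a sketch of one section in transposed direct form II. The topology and the `Biquad` type are assumptions chosen for illustration; the text above does not say which form trx-rs uses.

```rust
/// One second-order section, transposed direct form II.
/// Coefficients are normalized so that a0 = 1.
pub struct Biquad {
    pub b0: f32,
    pub b1: f32,
    pub b2: f32,
    pub a1: f32,
    pub a2: f32,
    z1: f32,
    z2: f32,
}

impl Biquad {
    pub fn new(b0: f32, b1: f32, b2: f32, a1: f32, a2: f32) -> Self {
        Self { b0, b1, b2, a1, a2, z1: 0.0, z2: 0.0 }
    }

    /// 5 multiplies and 4 adds per sample. Each output needs the state
    /// left behind by the previous sample, so consecutive samples of a
    /// single stream cannot be computed in parallel.
    pub fn process(&mut self, x: f32) -> f32 {
        let y = self.b0 * x + self.z1;
        self.z1 = self.b1 * x - self.a1 * y + self.z2;
        self.z2 = self.b2 * x - self.a2 * y;
        y
    }
}
```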
## Profiling Tips
- Use `cargo build --release` — debug builds are 10-50x slower and
misleading for DSP profiling.
- `perf stat` / `Instruments` on the inner loop to check IPC, cache
misses, and branch mispredictions.
- Compare CPU% with stereo enabled vs disabled to isolate stereo-specific
costs (diff path biquads, pilot PLL, 38 kHz demod, resampler channels).
- Watch for unexpected `libm` calls in disassembly — the compiler may
not inline `f32::atan2` or `f32::sin_cos` even in release mode.