[chore](trx-rs): add local copy of docs
Signed-off-by: Stan Grams <sjg@haxx.space>
@@ -0,0 +1,175 @@
# DSP Optimization Guidelines

This document captures lessons learned and best practices for optimizing
the real-time DSP pipelines in trx-rs, particularly the WFM stereo decoder
and audio encoding paths.

## General Principles

1. **Measure first.** Profile with real workloads before optimizing.
   Synthetic benchmarks miss cache effects, branch prediction patterns,
   and real signal statistics.

2. **Eliminate transcendentals from inner loops.** A single `sin_cos` or
   `atan2` per sample at the 200 kHz composite rate costs 200,000 calls
   per second, and every additional call per sample adds another 200,000.
   Replace with:
   - **Quadrature NCO** for oscillators: maintain `(cos, sin)` state and
     rotate by a precomputed `(cos_inc, sin_inc)` each sample. Cost:
     4 muls + 2 adds. Renormalize every ~1024 samples to prevent drift.
   - **Double-angle identities** to derive `sin(2θ), cos(2θ)` from
     `sin(θ), cos(θ)`: `sin2 = 2·sin·cos`, `cos2 = 2·cos²−1`.
   - **I/Q arm extraction** for PLL phase error: if you have
     `i = lp(signal * cos)` and `q = lp(signal * -sin)`, then
     `sin(err) = q/mag`, `cos(err) = i/mag` (with `mag = sqrt(i² + q²)`),
     so no `atan2` or `sin_cos` is needed for the rotation.

3. **Batch operations for SIMD.** Separate data-parallel work (e.g. FM
   discriminator: conjugate-multiply + atan2) from sequential-state work
   (PLL, biquads). Process the parallel part in batches of 8 using AVX2,
   then feed scalar results into the sequential pipeline.

4. **Power-of-2 sizes for circular buffers.** Use `& (N-1)` bitmask
   instead of `% N` modulo. Ensure buffer lengths (e.g. `WFM_RESAMP_TAPS`)
   are powers of two.

5. **Circular buffers over shift registers.** Writing one sample at a
   ring-buffer position is O(1); `rotate_left(1)` is O(N). For a 32-tap
   FIR called 3× per composite sample, this eliminates roughly 96 element
   moves (32 × 3) per sample.

6. **Decimate slow-changing metrics.** Stereo detection (pilot coherence,
   lock, drive) changes over tens of milliseconds. Running it every 16th
   sample instead of every sample saves ~94% of that work with no audible
   effect. Accumulate values over the window and process the average.
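
The quadrature NCO and double-angle items in principle 2 can be sketched as follows. This is a minimal illustration, not the trx-rs implementation: the type `QuadratureNco`, the function `double_angle`, and the constant `RENORM_INTERVAL` are hypothetical names, and the interval of 1024 is taken from the "~1024 samples" guidance above.

```rust
/// Samples between magnitude renormalizations (assumed value, see lead-in).
const RENORM_INTERVAL: u32 = 1024;

/// Quadrature NCO: holds (cos θ, sin θ) and rotates it by a fixed
/// increment every sample instead of calling `sin_cos` per sample.
pub struct QuadratureNco {
    cos: f32,
    sin: f32,
    cos_inc: f32,
    sin_inc: f32,
    since_renorm: u32,
}

impl QuadratureNco {
    pub fn new(freq_hz: f32, sample_rate_hz: f32) -> Self {
        // The only transcendental call: once at setup, never in the loop.
        let w = std::f32::consts::TAU * freq_hz / sample_rate_hz;
        let (sin_inc, cos_inc) = w.sin_cos();
        Self { cos: 1.0, sin: 0.0, cos_inc, sin_inc, since_renorm: 0 }
    }

    /// Returns (cos θ, sin θ) for the current sample, then advances θ.
    pub fn step(&mut self) -> (f32, f32) {
        let out = (self.cos, self.sin);
        // Complex multiply by the increment: 4 muls + 2 adds.
        let c = self.cos * self.cos_inc - self.sin * self.sin_inc;
        let s = self.sin * self.cos_inc + self.cos * self.sin_inc;
        self.cos = c;
        self.sin = s;
        self.since_renorm += 1;
        if self.since_renorm == RENORM_INTERVAL {
            // Pull the state back onto the unit circle to cancel drift.
            self.since_renorm = 0;
            let inv = 1.0 / (c * c + s * s).sqrt();
            self.cos *= inv;
            self.sin *= inv;
        }
        out
    }
}

/// Double-angle identities: (cos 2θ, sin 2θ) from (cos θ, sin θ).
pub fn double_angle(cos: f32, sin: f32) -> (f32, f32) {
    (2.0 * cos * cos - 1.0, 2.0 * sin * cos)
}
```

Under these assumptions, a 19 kHz NCO fed through `double_angle` yields a 38 kHz carrier pair without a second oscillator.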
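
Principles 4 and 5 combine naturally into a FIR delay line built on a power-of-two ring buffer. The sketch below is only an illustration; `RingFir` and its fields are hypothetical names and do not come from trx-rs.

```rust
/// FIR filter whose history is a ring buffer with a power-of-two length,
/// so the wrap-around is a bitmask rather than `%` or `rotate_left`.
pub struct RingFir {
    taps: Vec<f32>,
    history: Vec<f32>,
    mask: usize,
    pos: usize, // slot that receives the next input sample
}

impl RingFir {
    pub fn new(taps: Vec<f32>) -> Self {
        assert!(taps.len().is_power_of_two(), "tap count must be a power of two");
        let n = taps.len();
        Self { taps, history: vec![0.0; n], mask: n - 1, pos: 0 }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        // O(1) write: nothing else in the history moves.
        self.history[self.pos] = x;
        let mut acc = 0.0f32;
        for (k, &h) in self.taps.iter().enumerate() {
            // taps[k] pairs with the sample written k steps ago.
            let idx = self.pos.wrapping_sub(k) & self.mask;
            acc += h * self.history[idx];
        }
        self.pos = (self.pos + 1) & self.mask;
        acc
    }
}
```

`wrapping_sub` followed by the mask is safe here because the `usize` range is itself a multiple of any power-of-two length.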

## Filter Design

- **Match filter cutoffs** across parallel paths (sum and diff) to ensure
  identical group delay. Mismatched cutoffs cause frequency-dependent
  phase errors that directly degrade stereo separation.

- **4th-order Butterworth** (two cascaded biquads) is generally sufficient
  when the polyphase resampler provides additional stopband rejection.
  6th-order adds 50% more biquad evaluations per sample for diminishing
  returns.

- **Q values for Butterworth cascades:**
  - 4th-order: Q₁ = 0.5412, Q₂ = 1.3066
  - 6th-order: Q₁ = 0.5176, Q₂ = 0.7071, Q₃ = 1.9319
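
The Q values listed above follow from the Butterworth pole angles, `Q_k = 1 / (2·sin((2k−1)·π / (2n)))` for an even order `n`. The sketch below reproduces them and turns one section into low-pass coefficients. The coefficient formulas are the widely used RBJ "Audio EQ Cookbook" low-pass form, which is an assumption here: the source does not say which biquad design trx-rs uses, and the names `butterworth_qs`, `lowpass_biquad`, and `BiquadCoeffs` are hypothetical.

```rust
use std::f64::consts::PI;

/// Section Q values for an even-order Butterworth cascade, lowest Q first.
pub fn butterworth_qs(order: usize) -> Vec<f64> {
    assert!(order >= 2 && order % 2 == 0, "order must be even and >= 2");
    let n = order as f64;
    (1..=order / 2)
        .rev()
        .map(|k| 1.0 / (2.0 * ((2 * k - 1) as f64 * PI / (2.0 * n)).sin()))
        .collect()
}

/// Normalized biquad coefficients (a0 divided out).
pub struct BiquadCoeffs {
    pub b0: f64,
    pub b1: f64,
    pub b2: f64,
    pub a1: f64,
    pub a2: f64,
}

/// RBJ cookbook low-pass section (assumed design, see lead-in).
pub fn lowpass_biquad(cutoff_hz: f64, sample_rate_hz: f64, q: f64) -> BiquadCoeffs {
    let w0 = 2.0 * PI * cutoff_hz / sample_rate_hz;
    let (sin_w0, cos_w0) = w0.sin_cos();
    let alpha = sin_w0 / (2.0 * q);
    let a0 = 1.0 + alpha;
    BiquadCoeffs {
        b0: (1.0 - cos_w0) / 2.0 / a0,
        b1: (1.0 - cos_w0) / a0,
        b2: (1.0 - cos_w0) / 2.0 / a0,
        a1: -2.0 * cos_w0 / a0,
        a2: (1.0 - alpha) / a0,
    }
}
```

Building the sum and diff paths from the same `cutoff_hz` and the same Q list is one way to keep their group delays matched, as the first bullet recommends.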

## Polyphase Resampler

- **Compute cutoff from the actual rate ratio:** `cutoff = output_rate / input_rate`,
  expressed as a fraction of the input Nyquist frequency.
  A fixed cutoff (e.g. 0.94) can be catastrophically wrong: at 200 kHz
  composite to 48 kHz audio, it passes everything up to 94 kHz while the
  output Nyquist is only 24 kHz. The 38 kHz stereo subcarrier residuals
  alias directly into the treble range.

- **Blackman-Harris window** gives ~92 dB sidelobe suppression vs ~43 dB
  for Hamming, at the same tap count (at the cost of a wider transition
  band). Use it for the windowed-sinc coefficients:
  ```
  w(n) = 0.35875 − 0.48829·cos(2πn/N) + 0.14128·cos(4πn/N) − 0.01168·cos(6πn/N)
  ```

- **32 taps** with Blackman-Harris and a proper cutoff gives >60 dB
  stopband rejection, which is more than enough. 64 taps doubles the MAC
  count for marginal improvement.

- **64 polyphase phases** balances fractional sample resolution against
  coefficient bank size (64 × 32 × 4 = 8 KB fits comfortably in L1
  cache). 128 phases offer diminishing returns for double the memory.
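
A coefficient bank matching the numbers above (32 taps, 64 phases, Blackman-Harris window, cutoff from the rate ratio) could be generated roughly as shown below. Several details are assumptions rather than facts from the source: the function name `build_polyphase_bank`, windowing the full 2048-point prototype rather than each phase, clamping the cutoff to 1.0 when upsampling, and normalizing every phase to unity DC gain.

```rust
use std::f64::consts::PI;

pub const TAPS: usize = 32;
pub const PHASES: usize = 64;

/// 4-term Blackman-Harris window, sample `n` of a `len`-point window.
fn blackman_harris(n: usize, len: usize) -> f64 {
    let x = 2.0 * PI * n as f64 / (len - 1) as f64;
    0.35875 - 0.48829 * x.cos() + 0.14128 * (2.0 * x).cos() - 0.01168 * (3.0 * x).cos()
}

/// Builds a PHASES x TAPS bank of windowed-sinc low-pass coefficients.
pub fn build_polyphase_bank(input_rate: f64, output_rate: f64) -> Vec<[f32; TAPS]> {
    // Cutoff as a fraction of the input Nyquist frequency.
    let cutoff = (output_rate / input_rate).min(1.0);
    let proto_len = TAPS * PHASES;
    let center = (proto_len - 1) as f64 / 2.0;
    let mut bank = vec![[0.0f32; TAPS]; PHASES];
    for i in 0..proto_len {
        // Distance from the filter center, measured in input samples.
        let t = (i as f64 - center) / PHASES as f64;
        let x = PI * cutoff * t;
        let sinc = if x.abs() < 1e-12 { 1.0 } else { x.sin() / x };
        let h = cutoff * sinc * blackman_harris(i, proto_len);
        // Prototype sample i belongs to phase (i mod PHASES), tap (i / PHASES).
        bank[i % PHASES][i / PHASES] = h as f32;
    }
    // Unity DC gain per phase, so the level does not depend on the phase.
    for phase in bank.iter_mut() {
        let sum: f32 = phase.iter().sum();
        for c in phase.iter_mut() {
            *c /= sum;
        }
    }
    bank
}
```

For 200 kHz to 48 kHz this yields `cutoff = 0.24`, i.e. a 24 kHz corner, and the bank occupies 64 × 32 × 4 bytes as stated above.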

## FM Discriminator

- **Batch with AVX2:** The conjugate-multiply + atan2 pattern is
  data-parallel (each output depends only on two adjacent input samples).
  Process 8 samples at a time using 256-bit SIMD.

- **Use a high-precision atan2 polynomial** for AVX2. A 7th-order minimax
  polynomial (max error ~2.4e-7 rad) avoids the treble distortion that
  cheap 1st-order approximations (e.g. `0.273*(1−|z|)`) introduce on
  strong signals. Coefficients of the odd polynomial
  `z·(c0 + c1·z² + c2·z⁴ + c3·z⁶)`:
  ```
  c0 =  0.999_999_5
  c1 = -0.333_326_1
  c2 =  0.199_777_1
  c3 = -0.138_776_8
  ```

- **Branchless argument reduction** for atan2: swap `|y|` and `|x|` using
  masks rather than branches, apply quadrant correction via arithmetic
  shift and copysign.
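
As a scalar reference for the batched discriminator, the conjugate-multiply + atan2 pattern looks like the sketch below. The name `fm_discriminate` and the tuple layout are hypothetical. The sketch deliberately calls `f32::atan2` instead of the polynomial: evaluated at `|z| = 1`, the four coefficients listed above give about 0.728 rather than π/4, so they appear to assume a narrower argument range than the swap-based reduction alone provides, and that extra reduction step is not described in the source.

```rust
/// Phase difference between consecutive IQ samples, in radians per sample.
///
/// `prev` carries the last sample of the previous block across calls.
/// Each output depends only on iq[n] and iq[n-1], so the loop body has
/// no sequential state and can be vectorized 8 samples at a time.
pub fn fm_discriminate(iq: &[(f32, f32)], prev: &mut (f32, f32), out: &mut Vec<f32>) {
    for &(i, q) in iq {
        // z[n] * conj(z[n-1])
        let re = i * prev.0 + q * prev.1;
        let im = q * prev.0 - i * prev.1;
        // Stand-in for the polynomial approximation (see lead-in).
        out.push(im.atan2(re));
        *prev = (i, q);
    }
}
```

A reference like this is also useful for checking a SIMD path: feed both the same block and compare the outputs within a small tolerance.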

## WFM Stereo Specifics

- **Pilot notch before diff demod:** The 19 kHz pilot leaks into the
  38 kHz multiplication and creates intermod products. Notch it from the
  composite signal before `x * cos(2θ)`. This notch is separate from the
  mono-path pilot notch (which sits after the sum LPF).

- **IQ hard limiter before FM discriminator:** For WFM, only the phase
  carries information. Normalizing IQ magnitude to 1.0 prevents
  overdeviation artifacts and clipping. Guard against zero magnitude.

- **Binary stereo blend:** A smooth blend function (e.g. smoothstep)
  sounds good in theory but reduces real-world separation. Use
  `blend = 1.0` when pilot is detected, `0.0` otherwise.

- **STEREO_MATRIX_GAIN = 0.50:** The correct unity factor for
  `L = (S+D)/2`, `R = (S−D)/2`. Lower values waste headroom; higher
  values clip.
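
The limiter, binary blend, and matrix bullets can be illustrated together. This is a sketch under assumptions: `hard_limit` and `stereo_matrix` are hypothetical function names, the `1e-12` threshold is borrowed from the epsilon guidance elsewhere in this document, and returning `(0.0, 0.0)` for a zero-magnitude sample is one possible guard, not necessarily the one trx-rs uses.

```rust
pub const STEREO_MATRIX_GAIN: f32 = 0.5;

/// Normalizes an IQ sample to unit magnitude, keeping only its phase.
pub fn hard_limit(i: f32, q: f32) -> (f32, f32) {
    let mag = (i * i + q * q).sqrt();
    if mag > 1e-12 {
        (i / mag, q / mag)
    } else {
        // Zero-magnitude guard (assumed behavior): emit silence, not NaN.
        (0.0, 0.0)
    }
}

/// Stereo matrix with a binary blend: full separation when the pilot is
/// detected, plain mono otherwise.
pub fn stereo_matrix(sum: f32, diff: f32, pilot_detected: bool) -> (f32, f32) {
    let blend = if pilot_detected { 1.0 } else { 0.0 };
    let d = diff * blend;
    let left = STEREO_MATRIX_GAIN * (sum + d);
    let right = STEREO_MATRIX_GAIN * (sum - d);
    (left, right)
}
```

With `S = L + R` and `D = L − R`, the 0.5 gain returns the original `L` and `R` at unity level, which is why other values either waste headroom or clip.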

## Opus Encoding

- **Complexity 5** (down from default 9-10) saves significant CPU with
  minimal quality impact at bitrates ≥128 kbps. The higher complexity
  levels run expensive psychoacoustic search algorithms that produce
  negligible improvement at high bitrates.

- **256 kbps** is transparent for stereo FM broadcast audio. Going higher
  wastes bandwidth; going below 128 kbps may introduce artifacts on
  complex program material.

- **`Application::Audio`** (not VoIP): uses the MDCT-based CELT mode
  optimized for music and broadband audio rather than speech.

## AVX2 Guidelines

- Gate all AVX2 code behind `#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]`
  and runtime `is_x86_feature_detected!("avx2")` checks.

- Mark unsafe SIMD functions with `#[target_feature(enable = "avx2")]`
  so the compiler generates AVX2 code for the function body.

- Provide scalar fallbacks for non-x86 targets and CPUs without AVX2.

- Add epsilon guards (e.g. `1e-12`) to denominators in SIMD paths where
  both numerator and denominator can be zero simultaneously.
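
The gating rules above can be put together in one dispatch function. The sketch below applies them to the conjugate multiply from the FM discriminator; the function names and the split re/im buffer layout are assumptions made for illustration. It uses only `std::arch` intrinsics and avoids FMA, so the SIMD path rounds exactly like the scalar one.

```rust
/// out = a * conj(b), element-wise, on split re/im buffers.
pub fn conj_mul(a_re: &[f32], a_im: &[f32], b_re: &[f32], b_im: &[f32],
                out_re: &mut [f32], out_im: &mut [f32]) {
    let n = a_re.len();
    assert!(a_im.len() == n && b_re.len() == n && b_im.len() == n);
    assert!(out_re.len() == n && out_im.len() == n);
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected at runtime; lengths checked above.
            unsafe { conj_mul_avx2(a_re, a_im, b_re, b_im, out_re, out_im) };
            return;
        }
    }
    conj_mul_scalar(a_re, a_im, b_re, b_im, out_re, out_im);
}

/// Scalar fallback for non-x86 targets and CPUs without AVX2.
pub fn conj_mul_scalar(a_re: &[f32], a_im: &[f32], b_re: &[f32], b_im: &[f32],
                       out_re: &mut [f32], out_im: &mut [f32]) {
    for k in 0..a_re.len() {
        out_re[k] = a_re[k] * b_re[k] + a_im[k] * b_im[k];
        out_im[k] = a_im[k] * b_re[k] - a_re[k] * b_im[k];
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn conj_mul_avx2(a_re: &[f32], a_im: &[f32], b_re: &[f32], b_im: &[f32],
                        out_re: &mut [f32], out_im: &mut [f32]) {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    let n = a_re.len();
    let body = n - n % 8; // largest multiple of 8 that fits
    let mut k = 0;
    while k < body {
        // SAFETY: k + 8 <= body <= len of every slice (caller checked lengths).
        unsafe {
            let ar = _mm256_loadu_ps(a_re.as_ptr().add(k));
            let ai = _mm256_loadu_ps(a_im.as_ptr().add(k));
            let br = _mm256_loadu_ps(b_re.as_ptr().add(k));
            let bi = _mm256_loadu_ps(b_im.as_ptr().add(k));
            let re = _mm256_add_ps(_mm256_mul_ps(ar, br), _mm256_mul_ps(ai, bi));
            let im = _mm256_sub_ps(_mm256_mul_ps(ai, br), _mm256_mul_ps(ar, bi));
            _mm256_storeu_ps(out_re.as_mut_ptr().add(k), re);
            _mm256_storeu_ps(out_im.as_mut_ptr().add(k), im);
        }
        k += 8;
    }
    // Remaining 0..7 samples go through the scalar path.
    conj_mul_scalar(&a_re[body..], &a_im[body..], &b_re[body..], &b_im[body..],
                    &mut out_re[body..], &mut out_im[body..]);
}
```

Callers only ever see `conj_mul`; the `unsafe` surface stays confined to one function whose precondition is established right at the call site.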

## What NOT to Optimize

- **Biquad filters**: already minimal (5 muls + 4 adds per sample).
  The sequential state dependency prevents SIMD vectorization within a
  single stream.

- **One-pole lowpass filters**: single multiply-accumulate, cannot be
  made faster.

- **DC blockers**: trivial per-sample cost.

- **Deemphasis**: single biquad, runs at audio rate (not composite rate).
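
To make the operation counts above concrete, here is a minimal sketch of a biquad in transposed direct form II and a one-pole lowpass. The choice of structure is an assumption (the source does not say which form trx-rs uses), and `Biquad` and `OnePole` are hypothetical names.

```rust
/// Biquad, transposed direct form II: 5 muls + 4 adds per sample.
pub struct Biquad {
    b0: f32,
    b1: f32,
    b2: f32,
    a1: f32,
    a2: f32,
    s1: f32,
    s2: f32,
}

impl Biquad {
    pub fn new(b0: f32, b1: f32, b2: f32, a1: f32, a2: f32) -> Self {
        Self { b0, b1, b2, a1, a2, s1: 0.0, s2: 0.0 }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        // y depends on s1, and s1/s2 depend on y: each sample must wait
        // for the previous one, which is what blocks SIMD within a stream.
        let y = self.b0 * x + self.s1;
        self.s1 = self.b1 * x - self.a1 * y + self.s2;
        self.s2 = self.b2 * x - self.a2 * y;
        y
    }
}

/// One-pole lowpass: y += a * (x - y).
pub struct OnePole {
    a: f32,
    y: f32,
}

impl OnePole {
    pub fn new(a: f32) -> Self {
        Self { a, y: 0.0 }
    }

    pub fn process(&mut self, x: f32) -> f32 {
        self.y += self.a * (x - self.y);
        self.y
    }
}
```

Both loops are already a handful of dependent operations per sample, so profiling time is better spent on the transcendental and resampler hot spots described earlier.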

## Profiling Tips

- Use `cargo build --release`: debug builds are 10-50x slower and
  misleading for DSP profiling.

- `perf stat` / `Instruments` on the inner loop to check IPC, cache
  misses, and branch mispredictions.

- Compare CPU% with stereo enabled vs disabled to isolate stereo-specific
  costs (diff path biquads, pilot PLL, 38 kHz demod, resampler channels).

- Watch for unexpected `libm` calls in disassembly: the compiler may
  not inline `f32::atan2` or `f32::sin_cos` even in release mode.