
# DSP Optimization Guidelines

This document captures lessons learned and best practices for optimizing
the real-time DSP pipelines in trx-rs, particularly the WFM stereo decoder
and audio encoding paths.

## General Principles

1. **Measure first.** Profile with real workloads before optimizing.
   Synthetic benchmarks miss cache effects, branch prediction patterns,
   and real signal statistics.

2. **Eliminate transcendentals from inner loops.** A single `sin_cos` or
   `atan2` per sample at the 200 kHz composite rate means 200,000 calls
   per second for each call site. Replace with:
   - **Quadrature NCO** for oscillators: maintain `(cos, sin)` state and
     rotate by a precomputed `(cos_inc, sin_inc)` each sample. Cost:
     4 muls + 2 adds. Renormalize every ~1024 samples to prevent drift.
   - **Double-angle identities** to derive `sin(2θ), cos(2θ)` from
     `sin(θ), cos(θ)`: `sin2 = 2·sin·cos`, `cos2 = 2·cos²−1`.
   - **I/Q arm extraction** for PLL phase error: if you have
     `i = lp(signal * cos)` and `q = lp(signal * -sin)`, then with
     `mag = sqrt(i² + q²)`, `sin(err) = q/mag` and `cos(err) = i/mag`.
     No `atan2` or `sin_cos` is needed for the rotation.

3. **Batch operations for SIMD.** Separate data-parallel work (e.g. FM
   discriminator: conjugate-multiply + atan2) from sequential-state work
   (PLL, biquads). Process the parallel part in batches of 8 using AVX2,
   then feed scalar results into the sequential pipeline.

4. **Power-of-2 sizes for circular buffers.** Use `& (N-1)` bitmask
   instead of `% N` modulo. Ensure buffer lengths (e.g. `WFM_RESAMP_TAPS`)
   are powers of two.

5. **Circular buffers over shift registers.** Writing one sample at a
   ring-buffer position is O(1); `rotate_left(1)` is O(N). For a 32-tap
   FIR called 3× per composite sample, this eliminates ~200 byte-moves
   per sample.

6. **Decimate slow-changing metrics.** Stereo detection (pilot coherence,
   lock, drive) changes over tens of milliseconds. Running it every 16th
   sample instead of every sample saves ~94% of that work with no audible
   effect. Accumulate values over the window and process the average.
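
The quadrature NCO and double-angle items from principle 2 can be sketched as follows. This is a minimal illustration, not the trx-rs implementation: the names `QuadratureNco`, `step`, `double_angle`, and `RENORM_INTERVAL` are hypothetical, and the fixed 1024-sample renormalization interval is simply the figure quoted above.

```rust
/// Quadrature NCO: keeps (cos, sin) of the running phase and advances it by
/// a fixed complex rotation each sample, so the inner loop needs no
/// transcendental calls. (Hypothetical type, for illustration only.)
struct QuadratureNco {
    cos: f32,
    sin: f32,
    cos_inc: f32,
    sin_inc: f32,
    samples_since_renorm: u32,
}

impl QuadratureNco {
    /// Assumed renormalization interval, taken from the guideline text.
    const RENORM_INTERVAL: u32 = 1024;

    fn new(freq_hz: f64, sample_rate_hz: f64) -> Self {
        // The only sin/cos evaluation happens once, at setup time.
        let phase_inc = 2.0 * std::f64::consts::PI * freq_hz / sample_rate_hz;
        Self {
            cos: 1.0,
            sin: 0.0,
            cos_inc: phase_inc.cos() as f32,
            sin_inc: phase_inc.sin() as f32,
            samples_since_renorm: 0,
        }
    }

    /// Returns (cos θ, sin θ) for the current sample, then advances θ.
    fn step(&mut self) -> (f32, f32) {
        let out = (self.cos, self.sin);
        // Complex rotation by the phase increment: 4 multiplies + 2 adds.
        let c = self.cos * self.cos_inc - self.sin * self.sin_inc;
        let s = self.sin * self.cos_inc + self.cos * self.sin_inc;
        self.cos = c;
        self.sin = s;
        self.samples_since_renorm += 1;
        if self.samples_since_renorm == Self::RENORM_INTERVAL {
            // Pull the magnitude back to 1 to cancel rounding drift.
            let inv_mag = 1.0 / (c * c + s * s).sqrt();
            self.cos *= inv_mag;
            self.sin *= inv_mag;
            self.samples_since_renorm = 0;
        }
        out
    }
}

/// Double-angle identities: (cos 2θ, sin 2θ) from (cos θ, sin θ).
fn double_angle(cos: f32, sin: f32) -> (f32, f32) {
    (2.0 * cos * cos - 1.0, 2.0 * sin * cos)
}
```

In a stereo decoder, one such oscillator could track the 19 kHz pilot while `double_angle` derives the 38 kHz carrier from the same state, so a second oscillator is not needed.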

## Filter Design

- **Match filter cutoffs** across parallel paths (sum and diff) to ensure
  identical group delay. Mismatched cutoffs cause frequency-dependent
  phase errors that directly degrade stereo separation.

- **4th-order Butterworth** (two cascaded biquads) is generally sufficient
  when the polyphase resampler provides additional stopband rejection.
  6th-order adds 50% more biquad evaluations per sample for diminishing
  returns.

- **Q values for Butterworth cascades:**
  - 4th-order: Q₁ = 0.5412, Q₂ = 1.3066
  - 6th-order: Q₁ = 0.5176, Q₂ = 0.7071, Q₃ = 1.9319
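
The Q values above follow from the Butterworth pole positions on the unit circle: for an even order `n`, section `k` has `Q = 1 / (2·sin((2k−1)·π / (2n)))`. The sketch below recomputes them; `butterworth_qs` is a hypothetical helper name, not an existing trx-rs function.

```rust
/// Section Q values for an even-order Butterworth low-pass split into
/// cascaded biquads, returned in ascending order.
/// (Hypothetical helper, for illustration only.)
fn butterworth_qs(order: usize) -> Vec<f64> {
    assert!(order >= 2 && order % 2 == 0, "order must be even and >= 2");
    let n = order as f64;
    let mut qs: Vec<f64> = (1..=order / 2)
        .map(|k| {
            // Angle of the k-th pole pair, measured from the imaginary axis.
            let theta = (2 * k - 1) as f64 * std::f64::consts::PI / (2.0 * n);
            // For a unit-magnitude pole pair, Q = 1 / (2 * |real part|).
            1.0 / (2.0 * theta.sin())
        })
        .collect();
    qs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    qs
}
```

Calling `butterworth_qs(4)` and `butterworth_qs(6)` reproduces the two rows listed above to four decimal places.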

## Polyphase Resampler

- **Compute cutoff from actual rate ratio:** `cutoff = output_rate / input_rate`,
  with the cutoff normalized to the input Nyquist frequency (1.0 = half the
  input rate). A fixed cutoff (e.g. 0.94) can be catastrophically wrong: at
  200 kHz composite to 48 kHz audio, it passes everything up to 94 kHz while
  the output Nyquist is only 24 kHz. The 38 kHz stereo subcarrier residuals
  then alias directly into the treble range (38 kHz folds to 10 kHz).

- **Blackman-Harris window** gives ~92 dB sidelobe suppression vs ~43 dB
  for Hamming at the same tap count, in exchange for a wider transition
  band. Use it for the windowed-sinc coefficients:
  ```
  w(n) = 0.35875 − 0.48829·cos(2πn/N) + 0.14128·cos(4πn/N) − 0.01168·cos(6πn/N)
  ```

- **32 taps** with Blackman-Harris and a proper cutoff gives >60 dB
  stopband rejection — more than enough. 64 taps doubles the MAC count
  for marginal improvement.

- **64 polyphase phases** balances fractional sample resolution against
  coefficient bank size (64 × 32 × 4 = 8 KB fits comfortably in L1
  cache). 128 phases offer diminishing returns for double the memory.
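
A coefficient bank following these points might be built as in the sketch below. It is a sketch under assumptions: `build_bank`, `TAPS`, and `PHASES` are hypothetical names, the cutoff is taken as normalized to the input Nyquist frequency, the window is centred on the shifted sinc peak, and the fractional-delay sign convention and per-phase DC normalization are illustrative choices rather than the project's documented behaviour.

```rust
use std::f64::consts::PI;

const TAPS: usize = 32; // taps per phase
const PHASES: usize = 64; // fractional-delay phases

/// 4-term Blackman-Harris window evaluated at position `x` in [0, 1].
fn blackman_harris(x: f64) -> f64 {
    0.35875 - 0.48829 * (2.0 * PI * x).cos() + 0.14128 * (4.0 * PI * x).cos()
        - 0.01168 * (6.0 * PI * x).cos()
}

/// Normalized sinc: sin(πx) / (πx), with the removable singularity handled.
fn sinc(x: f64) -> f64 {
    if x.abs() < 1e-12 {
        1.0
    } else {
        (PI * x).sin() / (PI * x)
    }
}

/// Builds a PHASES x TAPS windowed-sinc low-pass bank for downsampling.
fn build_bank(input_rate: f64, output_rate: f64) -> Vec<[f32; TAPS]> {
    // Cutoff relative to input Nyquist: 48 kHz / 200 kHz = 0.24 -> 24 kHz.
    let cutoff = (output_rate / input_rate).min(1.0);
    let center = (TAPS as f64 - 1.0) / 2.0;
    (0..PHASES)
        .map(|p| {
            let frac = p as f64 / PHASES as f64; // fractional delay in samples
            let mut row = [0.0f64; TAPS];
            for (n, tap) in row.iter_mut().enumerate() {
                let t = n as f64 - center - frac; // distance from the sinc peak
                let x = 0.5 + t / (TAPS as f64 + 1.0); // window position in (0, 1)
                *tap = cutoff * sinc(cutoff * t) * blackman_harris(x);
            }
            // Normalize every phase to unity DC gain.
            let sum: f64 = row.iter().sum();
            let mut out = [0.0f32; TAPS];
            for (o, c) in out.iter_mut().zip(row.iter()) {
                *o = (*c / sum) as f32;
            }
            out
        })
        .collect()
}
```

With `f32` coefficients the resulting bank occupies 64 × 32 × 4 bytes, matching the 8 KB figure above.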

## FM Discriminator

- **Batch with AVX2:** The conjugate-multiply + atan2 pattern is
  data-parallel (each output depends only on two adjacent input samples).
  Process 8 samples at a time using 256-bit SIMD.

- **Use a high-precision atan2 polynomial** for AVX2. A 7th-order minimax
  polynomial (max error ~2.4e-7 rad) avoids the treble distortion that
  cheap low-order approximations (e.g. `(π/4)·z + 0.273·z·(1−|z|)`, max
  error ~0.004 rad) introduce on strong signals. Coefficients:
  ```
  c0 =  0.999_999_5
  c1 = −0.333_326_1
  c2 =  0.199_777_1
  c3 = −0.138_776_8
  ```

- **Branchless argument reduction** for atan2: swap `|y|` and `|x|` using
  masks rather than branches, apply quadrant correction via arithmetic
  shift and copysign.
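
When writing the SIMD path it helps to keep a scalar reference to compare against. The sketch below shows the conjugate-multiply + atan2 pattern with the standard library's `atan2`; `fm_discriminate` and the `(f32, f32)` sample layout are assumptions for illustration, not the project's actual API.

```rust
/// Scalar reference FM discriminator: the output is the phase of
/// s[n] * conj(s[n-1]), i.e. the phase step between adjacent samples.
/// `prev` carries the last sample of the previous block across calls.
/// (Hypothetical helper, for illustration only.)
fn fm_discriminate(iq: &[(f32, f32)], prev: &mut (f32, f32), out: &mut Vec<f32>) {
    out.clear();
    for &(i, q) in iq {
        let (pi, pq) = *prev;
        // (i + jq) * (pi - j*pq)
        let re = i * pi + q * pq;
        let im = q * pi - i * pq;
        out.push(im.atan2(re));
        *prev = (i, q);
    }
}
```

Each output depends only on the current and previous input sample, which is what makes the loop a candidate for processing 8 lanes at a time.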

## WFM Stereo Specifics

- **Pilot notch before diff demod:** The 19 kHz pilot leaks into the
  38 kHz multiplication and creates intermod products. Notch it from the
  composite signal before `x * cos(2θ)`. This notch is separate from the
  mono-path pilot notch (which sits after the sum LPF).

- **IQ hard limiter before FM discriminator:** For WFM, only the phase
  carries information. Normalizing IQ magnitude to 1.0 prevents
  overdeviation artifacts and clipping. Guard against zero magnitude.

- **Binary stereo blend:** A smooth blend function (e.g. smoothstep)
  sounds good in theory but reduces real-world separation. Use
  `blend = 1.0` when pilot is detected, `0.0` otherwise.

- **STEREO_MATRIX_GAIN = 0.50:** The correct unity factor for
  `L = (S+D)/2`, `R = (S−D)/2`, where `S = L+R` is the sum signal and
  `D = L−R` the difference signal. Lower values waste headroom; higher
  values clip.
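
The limiter and matrix items above reduce to a few lines each. The following is a minimal sketch: `hard_limit` and `dematrix` are hypothetical function names, and only the `STEREO_MATRIX_GAIN` constant and the binary blend rule come from the text.

```rust
const STEREO_MATRIX_GAIN: f32 = 0.5;

/// Normalizes an IQ sample to unit magnitude. A sample whose magnitude is
/// zero (or underflows to zero) is returned as (0, 0) instead of dividing.
fn hard_limit(i: f32, q: f32) -> (f32, f32) {
    let mag = (i * i + q * q).sqrt();
    if mag > 0.0 {
        (i / mag, q / mag)
    } else {
        (0.0, 0.0)
    }
}

/// Stereo dematrix with a binary blend. `sum` is L+R, `diff` is L-R.
/// Without a detected pilot the difference signal is dropped entirely,
/// so both outputs carry the mono signal.
fn dematrix(sum: f32, diff: f32, pilot_detected: bool) -> (f32, f32) {
    let blend = if pilot_detected { 1.0 } else { 0.0 };
    let d = blend * diff;
    (
        STEREO_MATRIX_GAIN * (sum + d),
        STEREO_MATRIX_GAIN * (sum - d),
    )
}
```

For example, `L = 0.75` and `R = -0.25` give `sum = 0.5` and `diff = 1.0`; with the pilot detected, `dematrix` returns the original pair, and without it both channels are `0.25`.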

## Opus Encoding

- **Complexity 5** (down from default 9-10) saves significant CPU with
  minimal quality impact at bitrates ≥128 kbps. The higher complexity
  levels run expensive psychoacoustic search algorithms that produce
  negligible improvement at high bitrates.

- **256 kbps** is transparent for stereo FM broadcast audio. Going higher
  wastes bandwidth; going below 128 kbps may introduce artifacts on
  complex program material.

- **`Application::Audio`** (not VoIP): favors the MDCT-based CELT mode,
  which is tuned for music and broadband audio rather than speech.

## AVX2 Guidelines

- Gate all AVX2 code behind `#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]`
  and runtime `is_x86_feature_detected!("avx2")` checks.

- Mark unsafe SIMD functions with `#[target_feature(enable = "avx2")]`
  so the compiler generates AVX2 code for the function body.

- Provide scalar fallbacks for non-x86 targets and CPUs without AVX2.

- Add epsilon guards (e.g. `1e-12`) to denominators in SIMD paths where
  both numerator and denominator can be zero simultaneously.
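
Put together, the four guidelines give a dispatch pattern along these lines. The example normalizes IQ pairs to unit magnitude, chosen only because it shows an epsilon-guarded denominator. The function names (`normalize`, `normalize_avx2`, `normalize_scalar`) and the split I/Q slice layout are assumptions for illustration.

```rust
/// Scalar fallback: normalizes each (i, q) pair to unit magnitude in place.
fn normalize_scalar(i: &mut [f32], q: &mut [f32]) {
    for (a, b) in i.iter_mut().zip(q.iter_mut()) {
        // Epsilon guard: (0, 0) yields 0 / 1e-12 = 0 instead of 0 / 0 = NaN.
        let den = (*a * *a + *b * *b).sqrt() + 1e-12;
        *a /= den;
        *b /= den;
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn normalize_avx2(i: &mut [f32], q: &mut [f32]) {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    let n = i.len().min(q.len());
    let mut k = 0;
    // SAFETY: every load and store touches indices k..k+8, which the loop
    // condition keeps inside both slices.
    unsafe {
        let eps = _mm256_set1_ps(1e-12);
        while k + 8 <= n {
            let vi = _mm256_loadu_ps(i.as_ptr().add(k));
            let vq = _mm256_loadu_ps(q.as_ptr().add(k));
            let mag2 = _mm256_add_ps(_mm256_mul_ps(vi, vi), _mm256_mul_ps(vq, vq));
            let den = _mm256_add_ps(_mm256_sqrt_ps(mag2), eps);
            _mm256_storeu_ps(i.as_mut_ptr().add(k), _mm256_div_ps(vi, den));
            _mm256_storeu_ps(q.as_mut_ptr().add(k), _mm256_div_ps(vq, den));
            k += 8;
        }
    }
    // Remaining 0..7 samples go through the scalar path.
    normalize_scalar(&mut i[k..n], &mut q[k..n]);
}

/// Entry point: uses AVX2 when the CPU reports it, otherwise the fallback.
fn normalize(i: &mut [f32], q: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above guarantees AVX2 is available.
            unsafe { normalize_avx2(i, q) };
            return;
        }
    }
    normalize_scalar(i, q);
}
```

Keeping the scalar function as both the fallback and the tail handler means the two paths can be compared directly in a unit test.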

## What NOT to Optimize

- **Biquad filters** — already minimal (5 muls + 4 adds per sample).
  The sequential state dependency prevents SIMD vectorization within a
  single stream.

- **One-pole lowpass filters** — single multiply-accumulate, cannot be
  made faster.

- **DC blockers** — trivial per-sample cost.

- **Deemphasis** — single biquad, runs at audio rate (not composite rate).

## Profiling Tips

- Use `cargo build --release` — debug builds are 10-50x slower and
  misleading for DSP profiling.

- Run `perf stat` (Linux) or Instruments (macOS) against the inner loop to
  check IPC, cache misses, and branch mispredictions.

- Compare CPU% with stereo enabled vs disabled to isolate stereo-specific
  costs (diff path biquads, pilot PLL, 38 kHz demod, resampler channels).

- Watch for unexpected `libm` calls in disassembly — the compiler may
  not inline `f32::atan2` or `f32::sin_cos` even in release mode.