From 8ec442da0a91b37ecfbff3dae33f6f133a206bed Mon Sep 17 00:00:00 2001
From: Stan Grams <sjg@haxx.space>
Date: Sun, 1 Mar 2026 11:06:27 +0100
Subject: [PATCH] [docs](trx-rs): add DSP chain performance optimization
 guidelines

Document lessons learned from WFM stereo decoder and audio encoding
optimization: quadrature NCO, double-angle identities, AVX2 batching,
polyphase resampler design, filter matching, stereo detection decimation,
and Opus encoder tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Stan Grams <sjg@haxx.space>
---
 OPTIMIZATION.md | 175 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 175 insertions(+)
 create mode 100644 OPTIMIZATION.md

diff --git a/OPTIMIZATION.md b/OPTIMIZATION.md
new file mode 100644
index 0000000..30146b5
--- /dev/null
+++ b/OPTIMIZATION.md
@@ -0,0 +1,175 @@
+# DSP Chain Performance Optimization Guidelines
+
+This document captures lessons learned and best practices for optimizing
+the real-time DSP pipelines in trx-rs, particularly the WFM stereo decoder
+and audio encoding paths.
+
+## General Principles
+
+1. **Measure first.** Profile with real workloads before optimizing.
+   Synthetic benchmarks miss cache effects, branch prediction patterns,
+   and real signal statistics.
+
+2. **Eliminate transcendentals from inner loops.** A single `sin_cos` or
+   `atan2` per sample at 200 kHz composite rate means 200,000 calls per
+   second for each call site. Replace with:
+   - **Quadrature NCO** for oscillators: maintain `(cos, sin)` state and
+     rotate by a precomputed `(cos_inc, sin_inc)` each sample. Cost:
+     4 muls + 2 adds. Renormalize every ~1024 samples to prevent drift.
+   - **Double-angle identities** to derive `sin(2θ), cos(2θ)` from
+     `sin(θ), cos(θ)`: `sin2 = 2·sin·cos`, `cos2 = 2·cos²−1`.
+   - **I/Q arm extraction** for PLL phase error: if you have
+     `i = lp(signal * cos)` and `q = lp(signal * -sin)`, then with
+     `mag = sqrt(i² + q²)`, `sin(err) = q/mag` and `cos(err) = i/mag`,
+     so the rotation needs no `atan2` or `sin_cos`.
+
+3. **Batch operations for SIMD.** Separate data-parallel work (e.g. FM
+   discriminator: conjugate-multiply + atan2) from sequential-state work
+   (PLL, biquads). Process the parallel part in batches of 8 using AVX2,
+   then feed scalar results into the sequential pipeline.
+
+4. **Power-of-2 sizes for circular buffers.** Use `& (N-1)` bitmask
+   instead of `% N` modulo. Ensure buffer lengths (e.g. `WFM_RESAMP_TAPS`)
+   are powers of two.
+
+5. **Circular buffers over shift registers.** Writing one sample at a
+   ring-buffer position is O(1); `rotate_left(1)` is O(N). For a 32-tap
+   FIR called 3× per composite sample, this eliminates ~96 element moves
+   (~384 bytes for `f32` history) per sample.
+
+6. **Decimate slow-changing metrics.** Stereo detection (pilot coherence,
+   lock, drive) changes over tens of milliseconds. Running it every 16th
+   sample instead of every sample saves ~94% of that work with no audible
+   effect. Accumulate values over the window and process the average.
+
+## Filter Design
+
+- **Match filter cutoffs** (and orders) across parallel paths (sum and diff)
+  to ensure identical group delay. Mismatched filters cause frequency-dependent
+  phase errors that directly degrade stereo separation.
+
+- **4th-order Butterworth** (two cascaded biquads) is generally sufficient
+  when the polyphase resampler provides additional stopband rejection.
+  6th-order adds 50% more biquad evaluations per sample for diminishing
+  returns.
+
+- **Q values for Butterworth cascades:**
+  - 4th-order: Q₁ = 0.5412, Q₂ = 1.3066
+  - 6th-order: Q₁ = 0.5176, Q₂ = 0.7071, Q₃ = 1.9319
+
+## Polyphase Resampler
+
+- **Compute cutoff from actual rate ratio:** `cutoff = output_rate / input_rate`
+  (normalized to the input Nyquist frequency). A fixed cutoff (e.g. 0.94)
+  can be catastrophically wrong: at 200 kHz composite to 48 kHz audio, it
+  passes everything up to 94 kHz while the output Nyquist is only 24 kHz,
+  so 38 kHz stereo subcarrier residuals alias to 10 kHz, in the treble range.
+
+- **Blackman-Harris window** gives ~92 dB sidelobe suppression vs ~43 dB
+  for Hamming at the same tap count, at the cost of a wider transition
+  band. Use it for the windowed-sinc coefficients:
+  ```
+  w(n) = 0.35875 − 0.48829·cos(2πn/N) + 0.14128·cos(4πn/N) − 0.01168·cos(6πn/N)
+  ```
+
+- **32 taps** with Blackman-Harris and a proper cutoff gives >60 dB
+  stopband rejection — more than enough. 64 taps doubles the MAC count
+  for marginal improvement.
+
+- **64 polyphase phases** balance fractional sample resolution against
+  coefficient bank size (64 phases × 32 taps × 4 bytes = 8 KB, which fits
+  comfortably in L1 cache). 128 phases offer diminishing returns for double the memory.
+
+## FM Discriminator
+
+- **Batch with AVX2:** The conjugate-multiply + atan2 pattern is
+  data-parallel (each output depends only on two adjacent input samples).
+  Process 8 samples at a time using 256-bit SIMD.
+
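+- **Use a high-precision atan2 polynomial** for AVX2. A 7th-order minimax
+  polynomial (max error ~2.4e-7 rad) avoids the treble distortion that
+  cheap low-order approximations (e.g. `0.273*(1−|z|)`) introduce on
+  strong signals. Coefficients of `atan(z) ≈ z·(c0 + c1·z² + c2·z⁴ + c3·z⁶)` (at `z = 1` these four terms alone sum to ≈0.7277 versus π/4 ≈ 0.7854, so the quoted error bound presumably depends on further terms or a reduced argument range not listed here):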
+  ```
+  c0 =  0.999_999_5
+  c1 = −0.333_326_1
+  c2 =  0.199_777_1
+  c3 = −0.138_776_8
+  ```
+
+- **Branchless argument reduction** for atan2: swap `|y|` and `|x|` using
+  masks rather than branches, apply quadrant correction via arithmetic
+  shift and copysign.
+
+## WFM Stereo Specifics
+
+- **Pilot notch before diff demod:** The 19 kHz pilot leaks into the
+  38 kHz multiplication and creates intermod products. Notch it from the
+  composite signal before `x * cos(2θ)`. This notch is separate from the
+  mono-path pilot notch (which sits after the sum LPF).
+
+- **IQ hard limiter before FM discriminator:** For WFM, only the phase
+  carries information. Normalizing IQ magnitude to 1.0 prevents
+  overdeviation artifacts and clipping. Guard against zero magnitude.
+
+- **Binary stereo blend:** A smooth blend function (e.g. smoothstep)
+  sounds good in theory but reduces real-world separation. Use
+  `blend = 1.0` when pilot is detected, `0.0` otherwise.
+
+- **STEREO_MATRIX_GAIN = 0.50:** The correct unity factor for
+  `L = (S+D)/2`, `R = (S−D)/2`. Lower values waste headroom; higher
+  values clip.
+
+## Opus Encoding
+
+- **Complexity 5** (down from default 9-10) saves significant CPU with
+  minimal quality impact at bitrates ≥128 kbps. The higher complexity
+  levels run expensive psychoacoustic search algorithms that produce
+  negligible improvement at high bitrates.
+
+- **256 kbps** is transparent for stereo FM broadcast audio. Going higher
+  wastes bandwidth; going below 128 kbps may introduce artifacts on
+  complex program material.
+
+- **`Application::Audio`** (not VoIP): biases the encoder toward the
+  MDCT-based CELT layer, which suits music and broadband audio better than speech.
+
+## AVX2 Guidelines
+
+- Gate all AVX2 code behind `#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]`
+  and runtime `is_x86_feature_detected!("avx2")` checks.
+
+- Mark unsafe SIMD functions with `#[target_feature(enable = "avx2")]`
+  so the compiler generates AVX2 code for the function body.
+
+- Provide scalar fallbacks for non-x86 targets and CPUs without AVX2.
+
+- Add epsilon guards (e.g. `1e-12`) to denominators in SIMD paths where
+  both numerator and denominator can be zero simultaneously.
+
+## What NOT to Optimize
+
+- **Biquad filters** — already minimal (5 muls + 4 adds per sample).
+  The sequential state dependency prevents SIMD vectorization within a
+  single stream.
+
+- **One-pole lowpass filters** — single multiply-accumulate, cannot be
+  made faster.
+
+- **DC blockers** — trivial per-sample cost.
+
+- **Deemphasis** — single biquad, runs at audio rate (not composite rate).
+
+## Profiling Tips
+
+- Use `cargo build --release` — debug builds are 10-50x slower and
+  misleading for DSP profiling.
+
+- `perf stat` / `Instruments` on the inner loop to check IPC, cache
+  misses, and branch mispredictions.
+
+- Compare CPU% with stereo enabled vs disabled to isolate stereo-specific
+  costs (diff path biquads, pilot PLL, 38 kHz demod, resampler channels).
+
+- Watch for unexpected `libm` calls in disassembly: `f32::atan2` and
+  `f32::sin_cos` usually lower to library calls even in release mode.