From 9df71cf36cff9cd69fab6b0a1076d34653b00486 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 28 Mar 2026 09:52:08 +0000 Subject: [PATCH] [docs](trx-rs): deep codebase review with updated architecture and improvement plan Architecture.md: added detailed component notes covering rig_task internals, audio pipeline, remote client dual-connection model, FrontendRuntimeContext field groups, decoder implementation patterns, and FT-817 backend workarounds. Improvement-Areas.md: added 10 new findings from deep review including LIFO command batching, unbounded decoder history, missing jitter in backoff, rig task crash recovery, SoapySdrRig constructor complexity, and protocol versioning. CLAUDE.md: refreshed review observations with accurate LOC counts, prioritized improvement items (P1/P2/P3), and new strengths identified. https://claude.ai/code/session_011aiY4GfrmDUrpYVvEUGNGm Signed-off-by: Claude --- CLAUDE.md | 53 +++++++---- docs/Architecture.md | 87 +++++++++++++++++ docs/Home.md | 2 +- docs/Improvement-Areas.md | 193 +++++++++++++++++++++++++++++++++++++- 4 files changed, 314 insertions(+), 21 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 53647ce..b3df909 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -104,27 +104,44 @@ Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`. Use `(trx-rs ## Codebase Review Observations -Full architecture documentation: `docs/architecture.md` -Improvement plan: `docs/improvement-plan.md` +Full architecture documentation: `docs/Architecture.md` +Improvement plan: `docs/Improvement-Areas.md` + +*Last reviewed: 2026-03-28* ### Strengths -- **Explicit state machine**: `RigMachineState` FSM prevents invalid states with a deterministic transition table and exhaustive matching. Well-tested with lifecycle, error recovery, and invalid transition tests. -- **Trait-based polymorphism**: Clean abstraction boundaries (`RigCat`, `RigSdr`, `AudioSource`, `RigListener`, `RigCommandHandler`, `CommandExecutor`, `TokenValidator`, `FrontendSpawner`) enable loose coupling and testability. -- **Multi-rig architecture**: Per-rig task isolation with `HashMap` routing, per-rig state/spectrum/audio watch channels, and backward-compatible single-rig mode. -- **Async concurrency model**: Proper use of tokio channels -- `watch` for state snapshots, `broadcast` for PCM/decode fan-out, `mpsc` for commands. No mutex contention on hot paths. -- **Comprehensive SDR support**: Full DSP pipeline with multi-mode demodulation (SSB, AM, SAM, FM, WFM, AIS, VDES), virtual channel management, squelch, noise blanker, spectrum FFT, RDS decoding. -- **Pure Rust decoders**: FT8/FT4/FT2, APRS, CW, WSPR, AIS, VDES, RDS -- all implemented without C FFI dependencies. -- **Good test coverage** in protocol layer: codec, mapping, auth all have thorough unit tests with round-trip verification. -- **Feature-gated backends**: ft817, ft450d, soapysdr compiled conditionally to minimize binary size. +- **Explicit state machine**: `RigMachineState` FSM (7 states) prevents invalid states with a deterministic transition table and exhaustive matching. Well-tested with lifecycle, error recovery, and invalid transition tests. `ReadyStateData`/`TransmittingStateData` use `pub(crate)` fields with controlled accessors. +- **Trait-based polymorphism**: Clean abstraction boundaries (`RigCat`, `RigSdr`, `AudioSource`, `RigListener`, `RigCommandHandler`, `CommandExecutor`, `TokenValidator`, `FrontendSpawner`) enable loose coupling and testability. `RigCat`/`RigSdr` split cleanly separates CAT ops from SDR-specific methods. +- **Multi-rig architecture**: Per-rig task isolation with `HashMap` routing, per-rig state/spectrum/audio/decoder-history channels, dual-connection model (main + spectrum) in the client, and backward-compatible single-rig mode. +- **Async concurrency model**: Proper use of tokio channels -- `watch` for state snapshots, `broadcast` for PCM/decode fan-out, `mpsc` for commands. No mutex contention on hot paths. Spectrum deduplication collapses concurrent GetSpectrum requests. +- **Comprehensive SDR support**: Full DSP pipeline with multi-mode demodulation (SSB, AM, SAM, FM, WFM, AIS, VDES), virtual channel management, squelch, noise blanker, spectrum FFT, RDS decoding. AVX2-optimized FM discriminator with scalar fallbacks. +- **Pure Rust decoders**: FT8/FT4/FT2, APRS, CW, WSPR, AIS, VDES, RDS -- all implemented without C FFI dependencies. Consistent decoder pattern: stateful struct → `process_block()` → `decode_if_ready()`. +- **Good test coverage** in protocol layer: codec, mapping, auth all have thorough unit tests with round-trip verification. 45+ mapping tests cover all command variants. +- **Feature-gated backends**: ft817, ft450d, soapysdr compiled conditionally to minimize binary size. Factory pattern with name normalization for registration. +- **Defensive error handling**: Lock poisoning recovery, stream error deduplication with 60s summaries, input truncation in logs (128 chars), per-IP rate limiting on auth endpoints. +- **Well-documented DSP guidelines**: `docs/Optimization-Guidelines.md` captures lessons on NCO design, polyphase resampling, AVX2 batching, and stereo FM decoding. ### Areas for Improvement -- **FrontendRuntimeContext** (`trx-frontend/src/lib.rs`) is a 60+ field god-struct mixing audio, decode, auth, UI config, and per-rig routing. Should be decomposed into sub-structs. -- **Dual command enums**: `ClientCommand` and `RigCommand` are near-identical 40+ variant enums with mechanical 1:1 mapping in `mapping.rs`. Adding a command requires 4-file changes. -- **Command handler boilerplate**: 11 implementations of `RigCommandHandler` follow identical patterns across 500+ lines. A macro could reduce this to ~100 lines. -- **No integration tests** for `rig_task.rs` (1050 lines) or `audio.rs` (850+ lines) -- the most critical server components. -- **No command execution timeouts** at the `CommandExecutor` level. Backend stalls propagate up. The 10s timeout only exists at the rig_task layer. -- **Event emitter not panic-safe**: If one `RigListener` panics, remaining listeners in the notification loop are skipped. -- **Sparse structured logging** in trx-core: only one `debug!()` call. State transitions, command execution, and retries are not instrumented. -- **No command rate limiting**: rapid-fire commands from clients can overwhelm slow serial CAT interfaces. +**P1 — High:** +- **FrontendRuntimeContext** (`trx-frontend/src/lib.rs`) is a ~50-field god-struct mixing audio channels, decode histories (9 types), auth config (7 fields), UI settings, rig routing, virtual channels, and branding. Should be decomposed into sub-structs (see `docs/Improvement-Areas.md`). +- **Rig task command batching uses LIFO** (`rig_task.rs`): `batch.pop()` reverses arrival order. Commands execute newest-first, causing unexpected transient states. +- **Decoder history unbounded** (`audio.rs`): No capacity limit on `VecDeque` queues; only 24h time-based pruning. Busy AIS channels can exhaust memory. +- **ExponentialBackoff has no jitter** (`policies.rs`): All rigs/clients retry at identical times after a server restart (thundering herd). +- **No rig task crash recovery** (`main.rs`): If a rig task panics, it silently disappears. No supervisor, no restart, no health monitoring. + +**P2 — Medium:** +- **Dual command enums**: `ClientCommand` and `RigCommand` are near-identical 40+ variant enums with mechanical 1:1 mapping in `mapping.rs` (675 lines). Adding a command requires 4-file changes. `GetRigs` triggers `unreachable!()`. +- **SoapySdrRig 20-parameter constructor**: No builder pattern, fragile call sites. +- **Lock poisoning recovery hides panics**: `unwrap_or_else(|e| e.into_inner())` throughout `audio.rs` silently continues with potentially inconsistent data. +- **Hardcoded timeouts**: 10+ timeout/retention constants scattered across files, none configurable via TOML. +- **Config duplication**: `config.rs` in server (1,512 LOC) and client (1,181 LOC) mirror many structs. + +**P3 — Low:** +- **Command handler boilerplate**: 11 `RigCommandHandler` impls follow identical patterns across 500+ lines. Macro opportunity. +- **No integration tests** for `rig_task.rs` (1,315 LOC) or `audio.rs` (3,977 LOC) — the two largest server modules. +- **No command execution timeouts** at the `CommandExecutor` level. Backend stalls propagate up. +- **FT-817 VFO inference fragile**: Fails when VFO A and B share the same frequency. +- **VDES decoder incomplete**: Turbo FEC and link-layer parsing not implemented. +- **No protocol versioning**: Unknown commands cause parse errors with no graceful degradation. diff --git a/docs/Architecture.md b/docs/Architecture.md index 3f0b503..ada509c 100644 --- a/docs/Architecture.md +++ b/docs/Architecture.md @@ -1003,3 +1003,90 @@ DecoderHistories buffer ↓ listener connections ↓ stream decoder messages HTTP WebSocket / local speakers ``` + +--- + +## Detailed Component Notes + +### Rig Task Internals (`rig_task.rs` — 1,315 lines) + +The rig task is the heart of the server. Key implementation details: + +- **Command batching**: Accumulates pending requests before processing sequentially. Uses `batch.pop()` (LIFO order, not FIFO). +- **Spectrum deduplication**: Concurrent `GetSpectrum` requests are collapsed — one DSP computation broadcasts to all waiting responders. +- **Adaptive polling**: Poll interval adjusts based on TX state (100ms during TX, 500ms idle). +- **Grace period**: 800ms pause on polling after power-on/off operations to let hardware settle. +- **VFO priming**: Optional initialization sequence that toggles VFO A/B to populate the state cache. +- **Per-rig decoder histories**: Each rig maintains independent `Arc` for all 11 decoder types. +- **Timeout enforcement**: Commands have `COMMAND_EXEC_TIMEOUT` (10s), polling has `POLL_REFRESH_TIMEOUT` (8s). + +### Audio Pipeline (`audio.rs` — 3,977 lines) + +The audio module handles decoder history storage and stream management: + +- **`DecoderHistories`**: Per-rig mutable store for 11 decoder history queues (AIS, VDES, APRS, HF_APRS, CW, FT8, FT4, FT2, WSPR, WXSAT, LRPT). +- **Time-based retention**: 24h TTL on all history with periodic pruning. +- **Atomic total count**: `AtomicUsize` with CAS loop avoids acquiring 11 mutex locks in `snapshot_all()`. +- **Lock poisoning tolerance**: Uses `.unwrap_or_else(|e| e.into_inner())` to survive poisoned mutexes. +- **`StreamErrorLogger`**: Suppresses duplicate stream errors with 60s periodic summaries and error classification (alsa_poll_failure, input/output_stream_error). +- **CRC filtering**: APRS records filtered by `crc_ok` before storage. + +### Remote Client Dual-Connection Model + +`remote_client.rs` maintains two independent TCP connections to the server: + +1. **Main connection** (port 4530): State polling, command forwarding, rig discovery. +2. **Spectrum connection** (dedicated): Polls `GetSpectrum` at 50ms intervals (20 fps) independently to avoid blocking the main connection during command processing. + +Constants: `CONNECT_TIMEOUT: 5s`, `IO_TIMEOUT: 15s`, `SPECTRUM_IO_TIMEOUT: 3s`. Exponential backoff with jitter on reconnect. + +### FrontendRuntimeContext Field Groups (~50 fields) + +The `FrontendRuntimeContext` struct in `trx-frontend/src/lib.rs` organizes into logical groups: + +| Group | Fields | Examples | +|-------|--------|---------| +| Audio/decode channels | 13 | `audio_rx`, `decode_rx`, `ais_history`, `ft8_history` | +| Client tracking | 4 | `sse_clients`, `rigctl_clients`, `audio_clients` | +| Connection state | 4 | `server_connected`, `rig_server_connected` | +| HTTP auth config | 7 | `http_auth_enabled`, `http_auth_session_ttl_secs` | +| HTTP UI config | 6 | `http_initial_map_zoom`, `http_spectrum_usable_span_ratio` | +| Remote rig management | 4 | `remote_active_rig_id`, `remote_rigs`, `rig_states` | +| Owner/branding | 4 | `owner_callsign`, `owner_website_url`, `ais_vessel_url_base` | +| Spectrum & per-rig audio | 4 | `spectrum`, `rig_audio_rx`, `rig_audio_info` | +| Virtual channel audio | 4 | `rig_vchan_audio_cmd`, `vchan_audio`, `vchan_destroyed` | + +### Decoder Implementation Patterns + +All real-time decoders follow a consistent pattern: + +```rust +// 1. Stateful decoder struct with sample buffer +pub struct XxxDecoder { sample_buf: Vec, ... } + +// 2. Block/sample processing +pub fn process_block(&mut self, samples: &[f32]) { ... } + +// 3. Result extraction +pub fn decode_if_ready(&mut self) -> Vec { ... } +``` + +| Decoder | Algorithm | Sample Rate | Key Constants | +|---------|-----------|-------------|---------------| +| FT8/FT4/FT2 | Waterfall + LDPC/OSD | Varies | MAX_LDPC_ITERATIONS=20, MAX_CANDIDATES=120 | +| CW | Goertzel tone detection | Varies | 10ms windows, tone range 300–1200 Hz | +| APRS | Bell 202 AFSK (1200/2200 Hz) | 9600 | HDLC framing, NRZI, CRC-16-CCITT | +| AIS | GMSK 9600 baud | 9600 | Narrowband FM input | +| WSPR | Fano decoder | 12000 | 162 symbols, 120s slot, 1.46 Hz spacing | +| RDS | RRC matched filter + Costas PLL | Native | 57 kHz subcarrier, 1187.5 bps, OSD FEC | +| VDES | pi/4-QPSK 76.8 ksps | 100k | Burst detection, partial Turbo FEC | + +### Backend Reliability Workarounds (FT-817) + +The FT-817 CAT backend (`trx-backend-ft817/`) includes empirical workarounds for hardware quirks: + +- **Duplicate frame sends**: `set_mode()` and `set_ptt()` send CAT frames twice with 80ms delay (radio sometimes drops first frame). +- **Panel unlock before commands**: Clears stale bytes from the serial buffer. +- **Power-on dummy frame**: CPU wakes before CAT framing locks; dummy frame ensures readiness. +- **VFO state inference**: Infers VFO A/B by matching frequencies against cached values (fragile when frequencies collide). +- **Read timeout**: 800ms per CAT read operation (not configurable). diff --git a/docs/Home.md b/docs/Home.md index 81792e2..e9b80f1 100644 --- a/docs/Home.md +++ b/docs/Home.md @@ -11,4 +11,4 @@ and remote control are exposed elsewhere on the network. - [Architecture](Architecture) — system design, crate layout, data flow, and internals - [Optimization Guidelines](Optimization-Guidelines) — performance guidelines for the real-time DSP pipeline - [Planned Features](Planned-Features) — planned features and design notes -- [Improvement Areas](Improvement-Areas) — codebase audit: quality, architecture, security, and performance +- [Improvement Areas](Improvement-Areas) — codebase audit: quality, architecture, security, performance, and improvement plan diff --git a/docs/Improvement-Areas.md b/docs/Improvement-Areas.md index 9793222..500ab31 100644 --- a/docs/Improvement-Areas.md +++ b/docs/Improvement-Areas.md @@ -4,7 +4,7 @@ A comprehensive audit of the trx-rs codebase covering code quality, architecture security, testing, and performance. Each item includes the affected location and a suggested fix. -*Last updated: 2026-03-26* +*Last updated: 2026-03-28* --- @@ -122,4 +122,193 @@ No plugin API version, capability manifest, or unload/reload semantics. Old plugins break silently on API changes. **Fix:** Add a version field to the registration struct and reject incompatible -plugins at load time. \ No newline at end of file +plugins at load time. + +--- + +## New Findings (2026-03-28 Deep Review) + +### High Priority (P1) + +#### Rig task command batching uses LIFO order + +**Location:** `src/trx-server/src/` — `batch.pop()` + +Pending commands are accumulated into a `Vec` and processed with `pop()`, which +reverses arrival order. If a client sends `SetFreq(14.074)` then `SetMode(USB)`, +the mode change executes before the frequency change. This can cause unexpected +transient state on hardware that validates mode against frequency (e.g. FT-817 +rejects CW below 1.8 MHz). + +**Fix:** Replace `pop()` with `drain(..)` or iterate in forward order. + +#### FrontendRuntimeContext god-struct (~50 fields) + +**Location:** `src/trx-client/trx-frontend/src/` + +Mixes audio channels, decode histories, auth config, UI settings, rig routing, +virtual channel management, and branding info into a single struct passed +through `Arc`. Every frontend receives all 50 fields even if it only needs a +subset. Changes to any field group force recompilation of all frontends. + +**Suggested decomposition:** +``` +FrontendRuntimeContext + ├── AudioContext (audio_rx, audio_tx, audio_info, decode_rx) + ├── DecodeHistoryContext (ais, vdes, aprs, hf_aprs, cw, ft8, ft4, ft2, wspr) + ├── HttpAuthConfig (enabled, passphrases, session_ttl, cookie settings) + ├── HttpUiConfig (map_zoom, spectrum settings, history retention) + ├── RigRoutingContext (active_rig_id, remote_rigs, rig_states, rig_spectrums) + ├── OwnerInfo (callsign, website_url, website_name, ais_vessel_url) + └── VChanContext (vchan_audio, vchan_audio_cmd, vchan_destroyed) +``` + +#### Decoder history queues have no capacity bounds + +**Location:** `src/trx-server/src/` — `DecoderHistories` + +History queues (`VecDeque`) grow unbounded until the 24h retention period expires. +Under high traffic (e.g. busy AIS channel near a port), a single queue could +accumulate millions of entries and consume gigabytes of memory. + +**Fix:** Add per-decoder max capacity (e.g. 10,000 entries). Evict oldest entries +when capacity is reached, independent of time-based pruning. + +#### ExponentialBackoff has no jitter + +**Location:** `src/trx-core/src/rig/controller/` + +Multiple rigs or reconnecting clients using the same backoff parameters will retry +at identical times (thundering herd). This is especially problematic when a server +restarts and all clients reconnect simultaneously. + +**Fix:** Add randomized jitter (e.g. ±25% of the computed delay) to the +`ExponentialBackoff::delay()` method. + +#### No crash recovery for rig tasks + +**Location:** `src/trx-server/src/` + +If a rig task panics (e.g. due to an unexpected backend error), the task simply +disappears. The listener continues routing commands to the dead rig's channel, +where they silently timeout. No automatic restart or health monitoring exists. + +**Fix:** Wrap rig tasks in a supervisor loop that detects task completion/panic +and restarts with backoff. Emit a `RigMachineState::Error` on the watch channel +so clients see the failure. + +--- + +### Medium Priority (P2) + +#### SoapySdrRig constructor takes 20+ parameters + +**Location:** `src/trx-server/trx-backend/trx-backend-soapysdr/src/` +— `new_with_config()` + +The constructor accepts 20+ positional parameters with no builder pattern, +making call sites fragile and hard to read. Adding a new parameter requires +updating all callers. + +**Fix:** Introduce a `SoapySdrConfig` builder struct with sensible defaults. + +#### Dual command enums with mechanical 1:1 mapping + +**Location:** `src/trx-protocol/src/` (675 lines), +`src/trx-protocol/src/`, `src/trx-core/src/rig/` + +`ClientCommand` and `RigCommand` are near-identical 40+ variant enums with +purely mechanical mapping in `mapping.rs`. Adding a new command requires editing +4 files (command.rs, types.rs, mapping.rs in both directions, codec.rs). +`mapping.rs` contains an `unreachable!()` for `GetRigs` that would panic if +the listener logic changes. + +**Fix:** Consider a macro that generates both enums and the mapping from a single +definition. Alternatively, collapse to a single enum with serde annotations. + +#### Lock poisoning recovery hides panics + +**Location:** `src/trx-server/src/` — `DecoderHistories` + +All mutex acquisitions use `.unwrap_or_else(|e| e.into_inner())` which silently +recovers from poisoned mutexes. While this prevents cascading panics, it hides +the original panic and may operate on inconsistent data. + +**Fix:** Log a warning when recovering from a poisoned lock, and consider whether +the recovered data is actually safe to use. For history queues, clearing the +queue on poison recovery may be safer than continuing with partial data. + +#### Configuration duplication between server and client + +**Location:** `src/trx-server/src/` (1,512 lines), +`src/trx-client/src/` (1,181 lines) + +14 config structs each, many mirrored between server and client (GeneralConfig, +rig model definitions, defaults). Shared config definitions should live in +`trx-app`. + +#### Hardcoded timeouts and retention periods + +**Locations:** Multiple files + +| Constant | Value | Location | +|----------|-------|----------| +| COMMAND_EXEC_TIMEOUT | 10s | rig_task.rs | +| POLL_REFRESH_TIMEOUT | 8s | rig_task.rs | +| IO_TIMEOUT | 10s | listener.rs | +| REQUEST_TIMEOUT | 12s | listener.rs | +| History retention | 24h | audio.rs | +| FT-817 read timeout | 800ms | trx-backend-ft817 | +| RIG_TASK_CHANNEL_BUFFER | 32 | main.rs | + +None are configurable. Making these part of the TOML config would help +deployments with slow serial links or high-latency networks. + +--- + +### Low Priority (P3) + +#### FT-817 VFO state inference is fragile + +**Location:** `src/trx-server/trx-backend/trx-backend-ft817/src/` + +VFO state starts as `Unknown` and is inferred by matching frequencies against +cached values. When VFO A and VFO B are set to the same frequency, inference +fails and the rig may report the wrong VFO. + +**Fix:** Some FT-817 variants support extended status bytes that indicate active +VFO directly. Detect firmware version and use the direct query when available. + +#### VDES decoder has incomplete FEC + +**Location:** `src/decoders/trx-vdes/src/` + +The VDES decoder implements burst detection and pi/4-QPSK demodulation but the +Turbo FEC (1/2 rate) decoder and link-layer (M.2092-1) parser are not complete. +Decoded output is limited to raw symbols. + +#### No forward compatibility in protocol + +**Location:** `src/trx-protocol/src/` + +Unknown commands cause parse errors. There is no protocol version field in the +envelope, making it impossible for older clients to gracefully degrade when +connecting to newer servers. + +**Fix:** Add an optional `protocol_version` field to `ClientEnvelope`. Unknown +commands should return an error response rather than a parse failure. + +#### Command handler boilerplate + +**Location:** `src/trx-core/src/rig/controller/` + +11 `RigCommandHandler` implementations follow identical patterns across 500+ +lines (validate state → call executor method → return result). A declarative +macro could reduce this to ~100 lines. + +#### Missing tests for critical paths + +Serial backends (FT-817, FT-450D), plugin loading/discovery, and the audio +pipeline (Opus encode/decode) have no or minimal test coverage. `rig_task.rs` +(1,315 lines) and `audio.rs` (3,977 lines) — the two largest server modules — +have no integration tests. \ No newline at end of file