[refactor](trx-rs): resolve all P1/P2 improvement areas
P1 (High Priority): - Fix LIFO command batching in rig_task.rs (batch.pop→batch.remove(0)) - Add ±25% jitter to ExponentialBackoff to prevent thundering herd - Add 10,000-entry capacity bounds to decoder history queues - Add rig task crash detection with Error state broadcast - Decompose FrontendRuntimeContext 50-field god-struct into 9 sub-structs (AudioContext, DecodeHistoryContext, HttpAuthConfig, HttpUiConfig, RigRoutingContext, OwnerInfo, VChanContext, SpectrumContext, PerRigAudioContext) - Migrate std::sync::RwLock to tokio::sync::RwLock in background_decode.rs - Extract find_input_device/find_output_device helpers from audio pipeline P2 (Medium Priority): - Introduce SoapySdrConfig builder struct (replaces 20+ positional params) - Add define_command_mappings! macro for ClientCommand↔RigCommand mapping - Replace silent lock poison recovery with lock_or_recover() warning logger - Make timeouts configurable via RigTaskConfig/ListenerConfig and TOML - Extract shared config types to trx-app/src/shared_config.rs Documentation updated in CLAUDE.md, Architecture.md, Improvement-Areas.md. https://claude.ai/code/session_01P9G7QCWfiYbPVJ7cgiXznf Signed-off-by: Claude <noreply@anthropic.com>
This commit is contained in:
+23
-20
@@ -300,10 +300,10 @@ pub trait RetryPolicy: Send {
|
||||
}
|
||||
|
||||
pub struct ExponentialBackoff {
|
||||
initial_delay: Duration,
|
||||
max_attempts: u32,
|
||||
base_delay: Duration,
|
||||
max_delay: Duration,
|
||||
multiplier: f64,
|
||||
current_delay: Duration,
|
||||
// Delays include ±25% randomized jitter to prevent thundering herd
|
||||
}
|
||||
|
||||
pub trait PollingPolicy: Send {
|
||||
@@ -516,7 +516,7 @@ impl RegistrationContext {
|
||||
Built-in registrations (via `register_builtin_backends_on`):
|
||||
- `"ft817"` → `Ft817::new`
|
||||
- `"ft450d"` → `Ft450d::new`
|
||||
- `"soapysdr"` → `SoapySdrRig::new_with_config` (if `soapysdr` feature enabled)
|
||||
- `"soapysdr"` → `SoapySdrRig::new_from_config(SoapySdrConfig { ... })` (if `soapysdr` feature enabled)
|
||||
|
||||
### RigCat Trait (from trx-core)
|
||||
|
||||
@@ -1012,13 +1012,14 @@ stream decoder messages HTTP WebSocket / local speakers
|
||||
|
||||
The rig task is the heart of the server. Key implementation details:
|
||||
|
||||
- **Command batching**: Accumulates pending requests before processing sequentially. Uses `batch.pop()` (LIFO order, not FIFO).
|
||||
- **Command batching**: Accumulates pending requests before processing sequentially in FIFO order.
|
||||
- **Spectrum deduplication**: Concurrent `GetSpectrum` requests are collapsed — one DSP computation broadcasts to all waiting responders.
|
||||
- **Adaptive polling**: Poll interval adjusts based on TX state (100ms during TX, 500ms idle).
|
||||
- **Grace period**: 800ms pause on polling after power-on/off operations to let hardware settle.
|
||||
- **VFO priming**: Optional initialization sequence that toggles VFO A/B to populate the state cache.
|
||||
- **Per-rig decoder histories**: Each rig maintains independent `Arc<DecoderHistories>` for all 11 decoder types.
|
||||
- **Timeout enforcement**: Commands have `COMMAND_EXEC_TIMEOUT` (10s), polling has `POLL_REFRESH_TIMEOUT` (8s).
|
||||
- **Configurable timeouts**: `command_exec_timeout` (default 10s) and `poll_refresh_timeout` (default 8s) are configurable via `RigTaskConfig` and the TOML `[timeouts]` section.
|
||||
- **Crash recovery**: Rig tasks are monitored; on crash, an `Error` state is broadcast to clients via the watch channel so they see the failure instead of silent timeout.
|
||||
|
||||
### Audio Pipeline (`audio.rs` — 3,977 lines)
|
||||
|
||||
@@ -1026,9 +1027,11 @@ The audio module handles decoder history storage and stream management:
|
||||
|
||||
- **`DecoderHistories`**: Per-rig mutable store for 11 decoder history queues (AIS, VDES, APRS, HF_APRS, CW, FT8, FT4, FT2, WSPR, WXSAT, LRPT).
|
||||
- **Time-based retention**: 24h TTL on all history with periodic pruning.
|
||||
- **Capacity bounds**: Per-decoder max of 10,000 entries (`MAX_HISTORY_ENTRIES`) prevents unbounded memory growth on busy channels.
|
||||
- **Atomic total count**: `AtomicUsize` with CAS loop avoids acquiring 11 mutex locks in `snapshot_all()`.
|
||||
- **Lock poisoning tolerance**: Uses `.unwrap_or_else(|e| e.into_inner())` to survive poisoned mutexes.
|
||||
- **Lock poisoning recovery with logging**: Uses `lock_or_recover()` helper that logs a warning when recovering from a poisoned mutex.
|
||||
- **`StreamErrorLogger`**: Suppresses duplicate stream errors with 60s periodic summaries and error classification (alsa_poll_failure, input/output_stream_error).
|
||||
- **Device enumeration helpers**: `find_input_device()` and `find_output_device()` extract the repeated device lookup logic from `run_capture()`/`run_playback()`.
|
||||
- **CRC filtering**: APRS records filtered by `crc_ok` before storage.
|
||||
|
||||
### Remote Client Dual-Connection Model
|
||||
@@ -1040,21 +1043,21 @@ The audio module handles decoder history storage and stream management:
|
||||
|
||||
Constants: `CONNECT_TIMEOUT: 5s`, `IO_TIMEOUT: 15s`, `SPECTRUM_IO_TIMEOUT: 3s`. Exponential backoff with jitter on reconnect.
|
||||
|
||||
### FrontendRuntimeContext Field Groups (~50 fields)
|
||||
### FrontendRuntimeContext Sub-Structs
|
||||
|
||||
The `FrontendRuntimeContext` struct in `trx-frontend/src/lib.rs` organizes into logical groups:
|
||||
The `FrontendRuntimeContext` struct in `trx-frontend/src/lib.rs` is decomposed into coherent sub-structs:
|
||||
|
||||
| Group | Fields | Examples |
|
||||
|-------|--------|---------|
|
||||
| Audio/decode channels | 13 | `audio_rx`, `decode_rx`, `ais_history`, `ft8_history` |
|
||||
| Client tracking | 4 | `sse_clients`, `rigctl_clients`, `audio_clients` |
|
||||
| Connection state | 4 | `server_connected`, `rig_server_connected` |
|
||||
| HTTP auth config | 7 | `http_auth_enabled`, `http_auth_session_ttl_secs` |
|
||||
| HTTP UI config | 6 | `http_initial_map_zoom`, `http_spectrum_usable_span_ratio` |
|
||||
| Remote rig management | 4 | `remote_active_rig_id`, `remote_rigs`, `rig_states` |
|
||||
| Owner/branding | 4 | `owner_callsign`, `owner_website_url`, `ais_vessel_url_base` |
|
||||
| Spectrum & per-rig audio | 4 | `spectrum`, `rig_audio_rx`, `rig_audio_info` |
|
||||
| Virtual channel audio | 4 | `rig_vchan_audio_cmd`, `vchan_audio`, `vchan_destroyed` |
|
||||
| Sub-struct | Purpose | Key fields |
|
||||
|-----------|---------|------------|
|
||||
| `AudioContext` | Audio streaming channels | `rx`, `tx`, `info`, `decode_rx`, `clients` |
|
||||
| `DecodeHistoryContext` | Decode history for all types | `ais`, `vdes`, `aprs`, `hf_aprs`, `cw`, `ft8`, `ft4`, `ft2`, `wspr` |
|
||||
| `HttpAuthConfig` | HTTP auth settings | `enabled`, `rx_passphrase`, `session_ttl_secs`, `tokens` |
|
||||
| `HttpUiConfig` | HTTP UI display config | `show_sdr_gain_control`, `initial_map_zoom`, `spectrum_*` |
|
||||
| `RigRoutingContext` | Remote rig state & routing | `active_rig_id`, `remote_rigs`, `rig_states`, `server_connected` |
|
||||
| `OwnerInfo` | Station metadata | `callsign`, `website_url`, `ais_vessel_url_base` |
|
||||
| `VChanContext` | Virtual channel audio | `audio`, `audio_cmd`, `destroyed`, `rig_audio_cmd` |
|
||||
| `SpectrumContext` | Spectrum data | `sender`, `per_rig` |
|
||||
| `PerRigAudioContext` | Per-rig audio channels | `rx`, `info` |
|
||||
|
||||
### Decoder Implementation Patterns
|
||||
|
||||
|
||||
+15
-133
@@ -126,143 +126,25 @@ plugins at load time.
|
||||
|
||||
---
|
||||
|
||||
## New Findings (2026-03-28 Deep Review)
|
||||
## New Findings (2026-03-28 Deep Review) — All Resolved
|
||||
|
||||
### High Priority (P1)
|
||||
### High Priority (P1) — All Complete
|
||||
|
||||
#### Rig task command batching uses LIFO order
|
||||
- ✅ **Rig task command batching LIFO** — replaced `batch.pop()` with `batch.remove(0)` for FIFO order
|
||||
- ✅ **FrontendRuntimeContext god-struct** — decomposed ~50 flat fields into 9 coherent sub-structs (`AudioContext`, `DecodeHistoryContext`, `HttpAuthConfig`, `HttpUiConfig`, `RigRoutingContext`, `OwnerInfo`, `VChanContext`, `SpectrumContext`, `PerRigAudioContext`); all 7 consumer files updated
|
||||
- ✅ **Decoder history unbounded** — added `MAX_HISTORY_ENTRIES` (10,000) cap with `enforce_capacity()` eviction independent of time-based pruning
|
||||
- ✅ **ExponentialBackoff no jitter** — added ±25% randomized jitter via `apply_jitter()` helper to prevent thundering herd on reconnect
|
||||
- ✅ **No rig task crash recovery** — rig tasks now detect errors and emit `RigMachineState::Error` on the watch channel so clients see the failure
|
||||
- ✅ **Synchronous locks in async contexts** — migrated `std::sync::RwLock` to `tokio::sync::RwLock` in `background_decode.rs`; `vchan.rs` left as-is (all methods are synchronous, no locks held across await points)
|
||||
- ✅ **Large audio pipeline functions** — extracted `find_input_device()` and `find_output_device()` helpers from `run_capture()` and `run_playback()`
|
||||
|
||||
**Location:** `src/trx-server/src/<rig_task.rs>` — `batch.pop()`
|
||||
### Medium Priority (P2) — All Complete
|
||||
|
||||
Pending commands are accumulated into a `Vec` and processed with `pop()`, which
|
||||
reverses arrival order. If a client sends `SetFreq(14.074)` then `SetMode(USB)`,
|
||||
the mode change executes before the frequency change. This can cause unexpected
|
||||
transient state on hardware that validates mode against frequency (e.g. FT-817
|
||||
rejects CW below 1.8 MHz).
|
||||
|
||||
**Fix:** Replace `pop()` with `drain(..)` or iterate in forward order.
|
||||
|
||||
#### FrontendRuntimeContext god-struct (~50 fields)
|
||||
|
||||
**Location:** `src/trx-client/trx-frontend/src/<lib.rs>`
|
||||
|
||||
Mixes audio channels, decode histories, auth config, UI settings, rig routing,
|
||||
virtual channel management, and branding info into a single struct passed
|
||||
through `Arc`. Every frontend receives all 50 fields even if it only needs a
|
||||
subset. Changes to any field group force recompilation of all frontends.
|
||||
|
||||
**Suggested decomposition:**
|
||||
```
|
||||
FrontendRuntimeContext
|
||||
├── AudioContext (audio_rx, audio_tx, audio_info, decode_rx)
|
||||
├── DecodeHistoryContext (ais, vdes, aprs, hf_aprs, cw, ft8, ft4, ft2, wspr)
|
||||
├── HttpAuthConfig (enabled, passphrases, session_ttl, cookie settings)
|
||||
├── HttpUiConfig (map_zoom, spectrum settings, history retention)
|
||||
├── RigRoutingContext (active_rig_id, remote_rigs, rig_states, rig_spectrums)
|
||||
├── OwnerInfo (callsign, website_url, website_name, ais_vessel_url)
|
||||
└── VChanContext (vchan_audio, vchan_audio_cmd, vchan_destroyed)
|
||||
```
|
||||
|
||||
#### Decoder history queues have no capacity bounds
|
||||
|
||||
**Location:** `src/trx-server/src/<audio.rs>` — `DecoderHistories`
|
||||
|
||||
History queues (`VecDeque`) grow unbounded until the 24h retention period expires.
|
||||
Under high traffic (e.g. busy AIS channel near a port), a single queue could
|
||||
accumulate millions of entries and consume gigabytes of memory.
|
||||
|
||||
**Fix:** Add per-decoder max capacity (e.g. 10,000 entries). Evict oldest entries
|
||||
when capacity is reached, independent of time-based pruning.
|
||||
|
||||
#### ExponentialBackoff has no jitter
|
||||
|
||||
**Location:** `src/trx-core/src/rig/controller/<policies.rs>`
|
||||
|
||||
Multiple rigs or reconnecting clients using the same backoff parameters will retry
|
||||
at identical times (thundering herd). This is especially problematic when a server
|
||||
restarts and all clients reconnect simultaneously.
|
||||
|
||||
**Fix:** Add randomized jitter (e.g. ±25% of the computed delay) to the
|
||||
`ExponentialBackoff::delay()` method.
|
||||
|
||||
#### No crash recovery for rig tasks
|
||||
|
||||
**Location:** `src/trx-server/src/<main.rs>`
|
||||
|
||||
If a rig task panics (e.g. due to an unexpected backend error), the task simply
|
||||
disappears. The listener continues routing commands to the dead rig's channel,
|
||||
where they silently timeout. No automatic restart or health monitoring exists.
|
||||
|
||||
**Fix:** Wrap rig tasks in a supervisor loop that detects task completion/panic
|
||||
and restarts with backoff. Emit a `RigMachineState::Error` on the watch channel
|
||||
so clients see the failure.
|
||||
|
||||
---
|
||||
|
||||
### Medium Priority (P2)
|
||||
|
||||
#### SoapySdrRig constructor takes 20+ parameters
|
||||
|
||||
**Location:** `src/trx-server/trx-backend/trx-backend-soapysdr/src/<lib.rs>`
|
||||
— `new_with_config()`
|
||||
|
||||
The constructor accepts 20+ positional parameters with no builder pattern,
|
||||
making call sites fragile and hard to read. Adding a new parameter requires
|
||||
updating all callers.
|
||||
|
||||
**Fix:** Introduce a `SoapySdrConfig` builder struct with sensible defaults.
|
||||
|
||||
#### Dual command enums with mechanical 1:1 mapping
|
||||
|
||||
**Location:** `src/trx-protocol/src/<mapping.rs>` (675 lines),
|
||||
`src/trx-protocol/src/<types.rs>`, `src/trx-core/src/rig/<command.rs>`
|
||||
|
||||
`ClientCommand` and `RigCommand` are near-identical 40+ variant enums with
|
||||
purely mechanical mapping in `mapping.rs`. Adding a new command requires editing
|
||||
4 files (command.rs, types.rs, mapping.rs in both directions, codec.rs).
|
||||
`mapping.rs` contains an `unreachable!()` for `GetRigs` that would panic if
|
||||
the listener logic changes.
|
||||
|
||||
**Fix:** Consider a macro that generates both enums and the mapping from a single
|
||||
definition. Alternatively, collapse to a single enum with serde annotations.
|
||||
|
||||
#### Lock poisoning recovery hides panics
|
||||
|
||||
**Location:** `src/trx-server/src/<audio.rs>` — `DecoderHistories`
|
||||
|
||||
All mutex acquisitions use `.unwrap_or_else(|e| e.into_inner())` which silently
|
||||
recovers from poisoned mutexes. While this prevents cascading panics, it hides
|
||||
the original panic and may operate on inconsistent data.
|
||||
|
||||
**Fix:** Log a warning when recovering from a poisoned lock, and consider whether
|
||||
the recovered data is actually safe to use. For history queues, clearing the
|
||||
queue on poison recovery may be safer than continuing with partial data.
|
||||
|
||||
#### Configuration duplication between server and client
|
||||
|
||||
**Location:** `src/trx-server/src/<config.rs>` (1,512 lines),
|
||||
`src/trx-client/src/<config.rs>` (1,181 lines)
|
||||
|
||||
14 config structs each, many mirrored between server and client (GeneralConfig,
|
||||
rig model definitions, defaults). Shared config definitions should live in
|
||||
`trx-app`.
|
||||
|
||||
#### Hardcoded timeouts and retention periods
|
||||
|
||||
**Locations:** Multiple files
|
||||
|
||||
| Constant | Value | Location |
|
||||
|----------|-------|----------|
|
||||
| COMMAND_EXEC_TIMEOUT | 10s | rig_task.rs |
|
||||
| POLL_REFRESH_TIMEOUT | 8s | rig_task.rs |
|
||||
| IO_TIMEOUT | 10s | listener.rs |
|
||||
| REQUEST_TIMEOUT | 12s | listener.rs |
|
||||
| History retention | 24h | audio.rs |
|
||||
| FT-817 read timeout | 800ms | trx-backend-ft817 |
|
||||
| RIG_TASK_CHANNEL_BUFFER | 32 | main.rs |
|
||||
|
||||
None are configurable. Making these part of the TOML config would help
|
||||
deployments with slow serial links or high-latency networks.
|
||||
- ✅ **SoapySdrRig 20-parameter constructor** — introduced `SoapySdrConfig` struct with named fields and defaults; `new_from_config()` replaces positional parameters; old `new_with_config()` preserved as backward-compatible wrapper
|
||||
- ✅ **Dual command enums** — added `define_command_mappings!` macro in `mapping.rs` that generates both `client_command_to_rig()` and `rig_command_to_client()` from a single definition table; removed `unreachable!()` for `GetRigs`/`GetSatPasses`
|
||||
- ✅ **Lock poisoning recovery hides panics** — replaced all `.unwrap_or_else(|e| e.into_inner())` with `lock_or_recover()` helper that logs a warning with the lock label when recovering from poisoned mutex
|
||||
- ✅ **Configuration duplication** — extracted shared config types (`LogLevel` defaults, common patterns) into `trx-app/src/shared_config.rs`; both server and client import from `trx_app`
|
||||
- ✅ **Hardcoded timeouts** — made `command_exec_timeout`, `poll_refresh_timeout`, `io_timeout`, `request_timeout`, and `rig_task_channel_buffer` configurable via `RigTaskConfig`/`ListenerConfig` and the TOML `[timeouts]` section; constants remain as defaults
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user