[docs](workspace): refresh post-refactor enhancement plan

Reanalyze current architecture status and rewrite ENHANCEMENT.md to reflect remaining high-impact issues after completed phases.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Stanislaw Grams <stanislawgrams@gmail.com>
This commit is contained in:
2026-02-12 22:29:23 +01:00
parent 4b34a39745
commit 8e59e205a8
+38 -39
View File
@@ -1,70 +1,69 @@
# Top 5 Real Architecture Issues
# Top 5 Real Architecture Issues (Post-Refactor)
## 1) Global plugin compatibility registries still exist
## 1) Plugin ABI is still brittle and unversioned
### Files
- `src/trx-server/trx-backend/src/lib.rs`
- `src/trx-client/trx-frontend/src/lib.rs`
- `src/trx-app/src/plugins.rs`
- `examples/trx-plugin-example/src/lib.rs`
### Why this matters
`OnceLock<Mutex<...>>` registry shims still hold mutable global state. This keeps plugin registration behavior implicit and harder to test.
Plugin loading is now explicit (good), but still assumes exact symbol names and raw FFI contracts with no ABI/version handshake. A plugin built against an older/newer ABI can fail at runtime in hard-to-diagnose ways.
### Fix steps
1. Introduce explicit plugin registration API that takes a mutable context.
2. Make plugin loader return registration data instead of relying on global side effects.
3. Remove global `register_*`/`snapshot_bootstrap_context` wrappers after migration.
1. Add an ABI version symbol/handshake (`trx_plugin_abi_version`) and reject incompatible plugins with clear errors.
2. Split plugin capability metadata (backend/frontend/both) from registration symbols to avoid noisy failed-load logs.
3. Provide a tiny shared plugin-API crate for stable entrypoint signatures.
## 2) No supervised shutdown/lifecycle model
## 2) Runtime supervision is still ad-hoc (sleep + abort)
### Files
- `src/trx-server/src/main.rs`
- `src/trx-client/src/main.rs`
### Why this matters
Many tasks are detached via `tokio::spawn` and process shutdown mostly waits on Ctrl+C. Task failures and cancellation order are not centrally managed.
Shutdown is coordinated, but supervision still uses a fixed delay plus manual `abort()` over `Vec<JoinHandle<_>>`. This can mask task failures, race shutdown ordering, and make lifecycle behavior harder to reason about.
### Fix steps
1. Add shared cancellation token.
2. Track tasks in `JoinSet`.
3. On shutdown: stop listeners, cancel workers, await joins with timeout, then exit.
1. Move to `JoinSet` (or a small supervisor type) for task ownership and result handling.
2. Replace fixed sleep with bounded graceful-join timeout logic.
3. Surface task failure reasons consistently in one place.
## 3) Protocol/network hardening gaps
## 3) JSON/TCP transport logic is duplicated across modules
### Files
- `src/trx-client/src/remote_client.rs`
- `src/trx-server/src/listener.rs`
- `src/trx-client/trx-frontend/trx-frontend-http-json/src/server.rs`
- `src/trx-client/src/remote_client.rs`
### Why this matters
`parse_remote_url` is ad-hoc and line-based listeners accept unbounded lines. This risks parsing edge cases and memory pressure.
`read_limited_line`, timeout handling, and response write patterns are repeated in multiple places. This increases drift risk and makes protocol hardening changes expensive.
### Fix steps
1. Replace string URL parsing with typed address parsing (support IPv4/IPv6/hostnames explicitly).
2. Enforce maximum line/frame size for JSON-over-TCP.
3. Add read/write/request timeouts and explicit error messages.
1. Extract shared JSON-over-TCP helpers into `trx-protocol` (or a small transport crate/module).
2. Keep one source of truth for max line size, timeout behavior, and framing errors.
3. Cover shared transport with focused tests once instead of per-module copies.
## 4) Config has parse defaults but weak semantic validation
### Files
- `src/trx-server/src/config.rs`
- `src/trx-client/src/config.rs`
### Why this matters
Config loads successfully even when values are semantically bad (timings, ports, audio params), leading to runtime failures.
### Fix steps
1. Add `validate()` to server/client config models.
2. Validate ranges and required field combinations.
3. Call `validate()` in startup before spawning tasks; fail fast with clear path-based errors.
## 5) Integration coverage is still thin at boundaries
## 4) Boundary tests are present but mostly ignored in constrained envs
### Files
- `src/trx-server/src/listener.rs`
- `src/trx-client/src/remote_client.rs`
- `src/trx-client/trx-frontend/trx-frontend-http-json/src/server.rs`
- `src/trx-app/src/plugins.rs`
### Why this matters
Most coverage is unit-level. Critical network/plugin/runtime flows can regress without tests.
Important network-path tests exist, but are marked `#[ignore]` in this environment due bind restrictions. Without a clear CI strategy, regressions can still slip through.
### Fix steps
1. Add integration tests for JSON TCP auth/command flow.
2. Add reconnect tests for remote client.
3. Add plugin load/failure isolation tests.
4. Add shutdown behavior tests once lifecycle supervision is added.
1. Add CI jobs/environment where bind-based tests run by default.
2. Split pure transport logic from socket bind/accept so more behavior can be tested without real sockets.
3. Keep ignored tests minimal and document how/when they run.
## 5) Decode/history shared state still relies on global mutexes
### Files
- `src/trx-server/src/audio.rs`
- `src/trx-client/trx-frontend/src/lib.rs`
- `src/trx-client/trx-frontend/trx-frontend-http/src/audio.rs`
### Why this matters
History/state paths still use shared mutex-backed globals/contexts with `expect` on lock poisoning in hot paths. This is workable but fragile for long-running async services.
### Fix steps
1. Replace panic-on-poison lock usage with resilient handling.
2. Consider bounded channel or lock-free append/read model for decode history.
3. Define explicit ownership/lifetime for history data instead of implicit shared mutation.