From 002f7d0cdaa3947e0559af8cd8aea629561222e8 Mon Sep 17 00:00:00 2001 From: Jan Svabenik Date: Wed, 25 Mar 2026 14:23:27 +0100 Subject: [PATCH] docs: document telemetry API --- README.md | 26 ++ docs/telemetry-api.md | 711 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 737 insertions(+) create mode 100644 docs/telemetry-api.md diff --git a/README.md b/README.md index a2625ec..d872a1f 100644 --- a/README.md +++ b/README.md @@ -192,6 +192,32 @@ go build -tags sdrplay ./cmd/sdrd - `GET /api/signals` -> current live signals - `GET /api/events?limit=&since=` -> recent events +### Debug Telemetry +- `GET /api/debug/telemetry/live` -> current telemetry snapshot (counters, gauges, distributions, recent events, collector status/config) +- `GET /api/debug/telemetry/history` -> historical metric samples with filtering by time/name/prefix/tags +- `GET /api/debug/telemetry/events` -> telemetry event/anomaly history with filtering by time/name/prefix/level/tags +- `GET /api/debug/telemetry/config` -> current collector config plus `debug.telemetry` runtime config +- `POST /api/debug/telemetry/config` -> update telemetry settings at runtime and persist them to autosave config + +Telemetry query params (`history` / `events`) include: +- `since`, `until` -> unix seconds, unix milliseconds, or RFC3339 timestamps +- `limit` +- `name`, `prefix` +- `signal_id`, `session_id`, `stage`, `trace_id`, `component` +- `tag_=` for arbitrary tag filters +- `include_persisted=true|false` (default `true`) +- `level` on the events endpoint + +Telemetry config lives under `debug.telemetry`: +- `enabled`, `heavy_enabled`, `heavy_sample_every` +- `metric_sample_every`, `metric_history_max`, `event_history_max` +- `retention_seconds` +- `persist_enabled`, `persist_dir`, `rotate_mb`, `keep_files` + +See also: +- `docs/telemetry-api.md` for the full telemetry API reference +- `docs/telemetry-debug-runbook.md` for the short operational debug flow + ### Recordings - `GET /api/recordings` - `GET /api/recordings/:id` (meta.json) diff --git a/docs/telemetry-api.md b/docs/telemetry-api.md new file mode 100644 index 0000000..9ac8672 --- /dev/null +++ b/docs/telemetry-api.md @@ -0,0 +1,711 @@ +# Telemetry API Reference + +This document describes the server-side telemetry collector, its runtime configuration, and the HTTP API exposed by `sdrd`. + +The telemetry system is intended for debugging and performance analysis of the SDR pipeline, especially around source cadence, extraction, DSP timing, boundary artifacts, queue pressure, and other runtime anomalies. + +## Goals + +The telemetry layer gives you three different views of runtime state: + +1. **Live snapshot** + - Current counters, gauges, distributions, recent events, and collector status. +2. **Historical metrics** + - Timestamped metric samples that can be filtered by name, prefix, or tags. +3. **Historical events** + - Structured anomalies / warnings / debug events with optional fields. + +It is designed to be lightweight in normal operation and more detailed when `heavy_enabled` is turned on. + +--- + +## Base URLs + +All telemetry endpoints live under: + +- `/api/debug/telemetry/live` +- `/api/debug/telemetry/history` +- `/api/debug/telemetry/events` +- `/api/debug/telemetry/config` + +Responses are JSON. + +--- + +## Data model + +### Metric types + +Telemetry metrics are stored in three logical groups: + +- **counter** + - Accumulating values, usually incremented over time. +- **gauge** + - Latest current value. +- **distribution** + - Observed numeric samples with summary stats. + +A historical metric sample is returned as: + +```json +{ + "ts": "2026-03-25T12:00:00Z", + "name": "stage.extract_stream.duration_ms", + "type": "distribution", + "value": 4.83, + "tags": { + "stage": "extract_stream", + "signal_id": "1" + } +} +``` + +### Events + +Telemetry events are structured anomaly/debug records: + +```json +{ + "id": 123, + "ts": "2026-03-25T12:00:02Z", + "name": "demod_boundary", + "level": "warn", + "message": "boundary discontinuity detected", + "tags": { + "signal_id": "1", + "stage": "demod" + }, + "fields": { + "d2": 0.3358, + "index": 25 + } +} +``` + +### Tags + +Tags are string key/value metadata used for filtering and correlation. + +Common tag keys already supported by the HTTP layer: + +- `signal_id` +- `session_id` +- `stage` +- `trace_id` +- `component` + +You can also filter on arbitrary tags via `tag_=` query parameters. + +--- + +## Endpoint: `GET /api/debug/telemetry/live` + +Returns a live snapshot of the in-memory collector state. + +### Response shape + +```json +{ + "now": "2026-03-25T12:00:05Z", + "started_at": "2026-03-25T11:52:10Z", + "uptime_ms": 472500, + "config": { + "enabled": true, + "heavy_enabled": false, + "heavy_sample_every": 12, + "metric_sample_every": 2, + "metric_history_max": 12000, + "event_history_max": 4000, + "retention": 900000000000, + "persist_enabled": false, + "persist_dir": "debug/telemetry", + "rotate_mb": 16, + "keep_files": 8 + }, + "counters": [ + { + "name": "source.resets", + "value": 1, + "tags": { + "component": "source" + } + } + ], + "gauges": [ + { + "name": "source.buffer_samples", + "value": 304128, + "tags": { + "component": "source" + } + } + ], + "distributions": [ + { + "name": "dsp.frame.duration_ms", + "count": 96, + "min": 82.5, + "max": 212.4, + "mean": 104.8, + "last": 98.3, + "p95": 149.2, + "tags": { + "stage": "dsp" + } + } + ], + "recent_events": [], + "status": { + "source_state": "running" + } +} +``` + +### Notes + +- `counters`, `gauges`, and `distributions` are sorted by metric name. +- `recent_events` contains the most recent in-memory event slice. +- `status` is optional and contains arbitrary runtime status published by code using `SetStatus(...)`. +- If telemetry is unavailable, the server returns a small JSON object instead of a full snapshot. + +### Typical uses + +- Check whether telemetry is enabled. +- Look for timing hotspots in `*.duration_ms` distributions. +- Inspect current queue or source gauges. +- See recent anomaly events without querying history. + +--- + +## Endpoint: `GET /api/debug/telemetry/history` + +Returns historical metric samples from in-memory history and, optionally, persisted JSONL files. + +### Response shape + +```json +{ + "items": [ + { + "ts": "2026-03-25T12:00:01Z", + "name": "stage.extract_stream.duration_ms", + "type": "distribution", + "value": 5.2, + "tags": { + "stage": "extract_stream", + "signal_id": "2" + } + } + ], + "count": 1 +} +``` + +### Supported query parameters + +#### Time filters + +- `since` +- `until` + +Accepted formats: + +- Unix seconds +- Unix milliseconds +- RFC3339 +- RFC3339Nano + +Examples: + +- `?since=1711368000` +- `?since=1711368000123` +- `?since=2026-03-25T12:00:00Z` + +#### Result shaping + +- `limit` + - Default normalization is 500. + - Values above 5000 are clamped down by the collector query layer. + +#### Name filters + +- `name=` +- `prefix=` + +Examples: + +- `?name=source.read.duration_ms` +- `?prefix=stage.` +- `?prefix=iq.extract.` + +#### Tag filters + +Special convenience query params map directly to tag filters: + +- `signal_id` +- `session_id` +- `stage` +- `trace_id` +- `component` + +Arbitrary tag filters: + +- `tag_=` + +Examples: + +- `?signal_id=1` +- `?stage=extract_stream` +- `?tag_path=gpu` +- `?tag_zone=broadcast` + +#### Persistence control + +- `include_persisted=true|false` + - Default: `true` + +When enabled and persistence is active, the server reads matching data from rotated JSONL telemetry files in addition to in-memory history. + +### Notes + +- Results are sorted by timestamp ascending. +- If `limit` is hit, the most recent matching items are retained. +- Exact retention depends on both in-memory retention and persisted file availability. +- A small set of boundary-related IQ metrics is force-stored regardless of the normal metric sample cadence. + +### Typical queries + +Get all stage timing since a specific start: + +```text +/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage. +``` + +Get extraction metrics for a single signal: + +```text +/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=extract.&signal_id=2 +``` + +Get source cadence metrics only from in-memory history: + +```text +/api/debug/telemetry/history?prefix=source.&include_persisted=false +``` + +--- + +## Endpoint: `GET /api/debug/telemetry/events` + +Returns historical telemetry events from memory and, optionally, persisted storage. + +### Response shape + +```json +{ + "items": [ + { + "id": 991, + "ts": "2026-03-25T12:00:03Z", + "name": "source_reset", + "level": "warn", + "message": "source reader reset observed", + "tags": { + "component": "source" + }, + "fields": { + "reason": "short_read" + } + } + ], + "count": 1 +} +``` + +### Supported query parameters + +All `history` filters are also supported here, plus: + +- `level=` + +Examples: + +- `?since=2026-03-25T12:00:00Z&level=warn` +- `?prefix=audio.&signal_id=1` +- `?name=demod_boundary&signal_id=1` + +### Notes + +- Event matching supports `name`, `prefix`, `level`, time range, and tags. +- Event `level` matching is case-insensitive. +- Results are timestamp-sorted ascending. + +### Typical queries + +Get warnings during a reproduction run: + +```text +/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn +``` + +Get boundary-related events for one signal: + +```text +/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&signal_id=1&prefix=demod_ +``` + +--- + +## Endpoint: `GET /api/debug/telemetry/config` + +Returns both: + +1. the active collector configuration, and +2. the current runtime config under `debug.telemetry` + +### Response shape + +```json +{ + "collector": { + "enabled": true, + "heavy_enabled": false, + "heavy_sample_every": 12, + "metric_sample_every": 2, + "metric_history_max": 12000, + "event_history_max": 4000, + "retention": 900000000000, + "persist_enabled": false, + "persist_dir": "debug/telemetry", + "rotate_mb": 16, + "keep_files": 8 + }, + "config": { + "enabled": true, + "heavy_enabled": false, + "heavy_sample_every": 12, + "metric_sample_every": 2, + "metric_history_max": 12000, + "event_history_max": 4000, + "retention_seconds": 900, + "persist_enabled": false, + "persist_dir": "debug/telemetry", + "rotate_mb": 16, + "keep_files": 8 + } +} +``` + +### Important distinction + +- `collector.retention` is a Go duration serialized in nanoseconds. +- `config.retention_seconds` is the config-facing field used by YAML and the POST update API. + +If you are writing tooling, prefer `config.retention_seconds` for human-facing config edits. + +--- + +## Endpoint: `POST /api/debug/telemetry/config` + +Updates telemetry settings at runtime and writes them back via the autosave config path. + +### Request body + +All fields are optional. Only provided fields are changed. + +```json +{ + "enabled": true, + "heavy_enabled": true, + "heavy_sample_every": 8, + "metric_sample_every": 1, + "metric_history_max": 20000, + "event_history_max": 6000, + "retention_seconds": 1800, + "persist_enabled": true, + "persist_dir": "debug/telemetry", + "rotate_mb": 32, + "keep_files": 12 +} +``` + +### Response shape + +```json +{ + "ok": true, + "collector": { + "enabled": true, + "heavy_enabled": true, + "heavy_sample_every": 8, + "metric_sample_every": 1, + "metric_history_max": 20000, + "event_history_max": 6000, + "retention": 1800000000000, + "persist_enabled": true, + "persist_dir": "debug/telemetry", + "rotate_mb": 32, + "keep_files": 12 + }, + "config": { + "enabled": true, + "heavy_enabled": true, + "heavy_sample_every": 8, + "metric_sample_every": 1, + "metric_history_max": 20000, + "event_history_max": 6000, + "retention_seconds": 1800, + "persist_enabled": true, + "persist_dir": "debug/telemetry", + "rotate_mb": 32, + "keep_files": 12 + } +} +``` + +### Persistence behavior + +A POST updates: + +- the runtime manager snapshot/config +- the in-process collector config +- the autosave config file via `config.Save(...)` + +That means these updates are runtime-effective immediately and also survive restarts through autosave, unless manually reverted. + +### Error cases + +- Invalid JSON -> `400 Bad Request` +- Invalid collector reconfiguration -> `400 Bad Request` +- Telemetry unavailable -> `503 Service Unavailable` + +--- + +## Configuration fields (`debug.telemetry`) + +Telemetry config lives under: + +```yaml +debug: + telemetry: + enabled: true + heavy_enabled: false + heavy_sample_every: 12 + metric_sample_every: 2 + metric_history_max: 12000 + event_history_max: 4000 + retention_seconds: 900 + persist_enabled: false + persist_dir: debug/telemetry + rotate_mb: 16 + keep_files: 8 +``` + +### Field reference + +#### `enabled` +Master on/off switch for telemetry collection. + +If false: +- metrics are not recorded +- events are not recorded +- live snapshot remains effectively empty/minimal + +#### `heavy_enabled` +Enables more expensive / more detailed telemetry paths that should not be left on permanently unless needed. + +Use this for deep extractor/IQ/boundary debugging. + +#### `heavy_sample_every` +Sampling cadence for heavy telemetry. + +- `1` means every eligible heavy sample +- higher numbers reduce cost by sampling less often + +#### `metric_sample_every` +Sampling cadence for normal historical metric point storage. + +Collector summaries still update live, but historical storage becomes less dense when this value is greater than 1. + +#### `metric_history_max` +Maximum number of in-memory historical metric samples retained. + +#### `event_history_max` +Maximum number of in-memory telemetry events retained. + +#### `retention_seconds` +Time-based in-memory retention window. + +Older in-memory metrics/events are trimmed once they fall outside this retention period. + +#### `persist_enabled` +When enabled, telemetry metrics/events are also appended to rotated JSONL files. + +#### `persist_dir` +Directory where rotated telemetry JSONL files are written. + +Default: + +- `debug/telemetry` + +#### `rotate_mb` +Approximate JSONL file rotation threshold in megabytes. + +#### `keep_files` +How many rotated telemetry files to retain in `persist_dir`. + +Older files beyond this count are pruned. + +--- + +## Collector behavior and caveats + +### In-memory vs persisted data + +The query endpoints can read from both: + +- current in-memory collector state/history +- persisted JSONL files + +This means a request may return data older than current in-memory retention if: + +- `persist_enabled=true`, and +- `include_persisted=true` + +### Sampling behavior + +Not every observation necessarily becomes a historical metric point. + +The collector: + +- always updates live counters/gauges/distributions while enabled +- stores historical points according to `metric_sample_every` +- force-stores selected boundary IQ metrics even when sampling would normally skip them + +So the live snapshot and historical series density are intentionally different. + +### Distribution summaries + +Distribution values in the live snapshot include: + +- `count` +- `min` +- `max` +- `mean` +- `last` +- `p95` + +The p95 estimate is based on the collector's bounded rolling sample buffer, not an unbounded full-history quantile computation. + +### Config serialization detail + +The collector's `retention` field is a Go duration. In JSON this appears as an integer nanosecond count. + +This is expected. + +--- + +## Recommended workflows + +### Fast low-overhead runtime watch + +Use: + +- `enabled=true` +- `heavy_enabled=false` +- `persist_enabled=false` or `true` if you want an archive + +Then query: + +- `/api/debug/telemetry/live` +- `/api/debug/telemetry/history?prefix=stage.` +- `/api/debug/telemetry/events?level=warn` + +### 5-10 minute anomaly capture + +Suggested settings: + +- `enabled=true` +- `heavy_enabled=false` +- `persist_enabled=true` +- moderate `metric_sample_every` + +Then: + +1. note start time +2. reproduce workload +3. fetch live snapshot +4. inspect warning events +5. inspect `stage.*`, `streamer.*`, and `source.*` history + +### Deep extractor / boundary investigation + +Temporarily enable: + +- `heavy_enabled=true` +- `heavy_sample_every` > 1 unless you really need every sample +- `persist_enabled=true` + +Then inspect: + +- `iq.*` +- `extract.*` +- `audio.*` +- boundary/anomaly events for specific `signal_id` or `session_id` + +Turn heavy telemetry back off once done. + +--- + +## Example requests + +### Fetch live snapshot + +```bash +curl http://localhost:8080/api/debug/telemetry/live +``` + +### Fetch stage timings from the last 10 minutes + +```bash +curl "http://localhost:8080/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage." +``` + +### Fetch source metrics for one signal + +```bash +curl "http://localhost:8080/api/debug/telemetry/history?prefix=source.&signal_id=1" +``` + +### Fetch warning events only + +```bash +curl "http://localhost:8080/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn" +``` + +### Fetch events with a custom tag filter + +```bash +curl "http://localhost:8080/api/debug/telemetry/events?tag_zone=broadcast" +``` + +### Enable persistence and heavy telemetry temporarily + +```bash +curl -X POST http://localhost:8080/api/debug/telemetry/config \ + -H "Content-Type: application/json" \ + -d '{ + "heavy_enabled": true, + "heavy_sample_every": 8, + "persist_enabled": true + }' +``` + +--- + +## Related docs + +- `README.md` - high-level project overview and endpoint summary +- `docs/telemetry-debug-runbook.md` - quick operational runbook for short debug sessions +- `internal/telemetry/telemetry.go` - collector implementation details +- `cmd/sdrd/http_handlers.go` - HTTP wiring for telemetry endpoints