# Telemetry API Reference

This document describes the server-side telemetry collector, its runtime configuration, and the HTTP API exposed by `sdrd`. The telemetry system is intended for debugging and performance analysis of the SDR pipeline, especially around source cadence, extraction, DSP timing, boundary artifacts, queue pressure, and other runtime anomalies.

## Goals

The telemetry layer gives you three different views of runtime state:

1. **Live snapshot** - Current counters, gauges, distributions, recent events, and collector status.
2. **Historical metrics** - Timestamped metric samples that can be filtered by name, prefix, or tags.
3. **Historical events** - Structured anomalies / warnings / debug events with optional fields.

It is designed to be lightweight in normal operation and more detailed when `heavy_enabled` is turned on.

---

## Base URLs

All telemetry endpoints live under:

- `/api/debug/telemetry/live`
- `/api/debug/telemetry/history`
- `/api/debug/telemetry/events`
- `/api/debug/telemetry/config`

Responses are JSON.

---

## Data model

### Metric types

Telemetry metrics are stored in three logical groups:

- **counter** - Accumulating values, usually incremented over time.
- **gauge** - Latest current value.
- **distribution** - Observed numeric samples with summary stats.

A historical metric sample is returned as:

```json
{
  "ts": "2026-03-25T12:00:00Z",
  "name": "stage.extract_stream.duration_ms",
  "type": "distribution",
  "value": 4.83,
  "tags": { "stage": "extract_stream", "signal_id": "1" }
}
```

### Events

Telemetry events are structured anomaly/debug records:

```json
{
  "id": 123,
  "ts": "2026-03-25T12:00:02Z",
  "name": "demod_boundary",
  "level": "warn",
  "message": "boundary discontinuity detected",
  "tags": { "signal_id": "1", "stage": "demod" },
  "fields": { "d2": 0.3358, "index": 25 }
}
```

### Tags

Tags are string key/value metadata used for filtering and correlation.
Common tag keys already supported by the HTTP layer:

- `signal_id`
- `session_id`
- `stage`
- `trace_id`
- `component`

You can also filter on arbitrary tags via `tag_<key>=<value>` query parameters.

---

## Endpoint: `GET /api/debug/telemetry/live`

Returns a live snapshot of the in-memory collector state.

### Response shape

```json
{
  "now": "2026-03-25T12:00:05Z",
  "started_at": "2026-03-25T11:52:10Z",
  "uptime_ms": 472500,
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "counters": [
    { "name": "source.resets", "value": 1, "tags": { "component": "source" } }
  ],
  "gauges": [
    { "name": "source.buffer_samples", "value": 304128, "tags": { "component": "source" } }
  ],
  "distributions": [
    {
      "name": "dsp.frame.duration_ms",
      "count": 96,
      "min": 82.5,
      "max": 212.4,
      "mean": 104.8,
      "last": 98.3,
      "p95": 149.2,
      "tags": { "stage": "dsp" }
    }
  ],
  "recent_events": [],
  "status": { "source_state": "running" }
}
```

### Notes

- `counters`, `gauges`, and `distributions` are sorted by metric name.
- `recent_events` contains the most recent in-memory event slice.
- `status` is optional and contains arbitrary runtime status published by code using `SetStatus(...)`.
- If telemetry is unavailable, the server returns a small JSON object instead of a full snapshot.

### Typical uses

- Check whether telemetry is enabled.
- Look for timing hotspots in `*.duration_ms` distributions.
- Inspect current queue or source gauges.
- See recent anomaly events without querying history.

---

## Endpoint: `GET /api/debug/telemetry/history`

Returns historical metric samples from in-memory history and, optionally, persisted JSONL files.
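As a client-side sketch, the filters this endpoint accepts can be assembled with standard URL encoding. The base URL and the `history_url` helper below are illustrative assumptions, not part of `sdrd`:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080"  # assumption: adjust to your sdrd instance


def history_url(since=None, prefix=None, name=None, include_persisted=None,
                limit=None, tags=None):
    """Build a /api/debug/telemetry/history query URL (hypothetical helper).

    `tags` maps tag keys to values; each entry is sent as tag_<key>=<value>.
    Convenience keys such as signal_id can also be passed as plain params by
    the caller if preferred.
    """
    params = {}
    if since is not None:
        params["since"] = since
    if prefix is not None:
        params["prefix"] = prefix
    if name is not None:
        params["name"] = name
    if include_persisted is not None:
        params["include_persisted"] = "true" if include_persisted else "false"
    if limit is not None:
        params["limit"] = limit
    for key, value in (tags or {}).items():
        params[f"tag_{key}"] = value
    return f"{BASE}/api/debug/telemetry/history?{urlencode(params)}"


print(history_url(since="2026-03-25T12:00:00Z", prefix="stage.",
                  tags={"zone": "broadcast"}))
```

The same pattern applies to the `events` endpoint, since it accepts a superset of these filters.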
### Response shape

```json
{
  "items": [
    {
      "ts": "2026-03-25T12:00:01Z",
      "name": "stage.extract_stream.duration_ms",
      "type": "distribution",
      "value": 5.2,
      "tags": { "stage": "extract_stream", "signal_id": "2" }
    }
  ],
  "count": 1
}
```

### Supported query parameters

#### Time filters

- `since`
- `until`

Accepted formats:

- Unix seconds
- Unix milliseconds
- RFC3339
- RFC3339Nano

Examples:

- `?since=1711368000`
- `?since=1711368000123`
- `?since=2026-03-25T12:00:00Z`

#### Result shaping

- `limit`
  - Defaults to 500.
  - Values above 5000 are clamped down by the collector query layer.

#### Name filters

- `name=`
- `prefix=`

Examples:

- `?name=source.read.duration_ms`
- `?prefix=stage.`
- `?prefix=iq.extract.`

#### Tag filters

Special convenience query params map directly to tag filters:

- `signal_id`
- `session_id`
- `stage`
- `trace_id`
- `component`

Arbitrary tag filters:

- `tag_<key>=<value>`

Examples:

- `?signal_id=1`
- `?stage=extract_stream`
- `?tag_path=gpu`
- `?tag_zone=broadcast`

#### Persistence control

- `include_persisted=true|false`
- Default: `true`

When enabled and persistence is active, the server reads matching data from rotated JSONL telemetry files in addition to in-memory history.

### Notes

- Results are sorted by timestamp ascending.
- If `limit` is hit, the most recent matching items are retained.
- Exact retention depends on both in-memory retention and persisted file availability.
- A small set of boundary-related IQ metrics is force-stored regardless of the normal metric sample cadence.

### Typical queries

Get all stage timings since a specific start:

```text
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage.
```

Get extraction metrics for a single signal:

```text
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=extract.&signal_id=2
```

Get source cadence metrics only from in-memory history:

```text
/api/debug/telemetry/history?prefix=source.&include_persisted=false
```

---

## Endpoint: `GET /api/debug/telemetry/events`

Returns historical telemetry events from memory and, optionally, persisted storage.

### Response shape

```json
{
  "items": [
    {
      "id": 991,
      "ts": "2026-03-25T12:00:03Z",
      "name": "source_reset",
      "level": "warn",
      "message": "source reader reset observed",
      "tags": { "component": "source" },
      "fields": { "reason": "short_read" }
    }
  ],
  "count": 1
}
```

### Supported query parameters

All `history` filters are also supported here, plus:

- `level=`

Examples:

- `?since=2026-03-25T12:00:00Z&level=warn`
- `?prefix=audio.&signal_id=1`
- `?name=demod_boundary&signal_id=1`

### Notes

- Event matching supports `name`, `prefix`, `level`, time range, and tags.
- Event `level` matching is case-insensitive.
- Results are timestamp-sorted ascending.

### Typical queries

Get warnings during a reproduction run:

```text
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn
```

Get boundary-related events for one signal:

```text
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&signal_id=1&prefix=demod_
```

---

## Endpoint: `GET /api/debug/telemetry/config`

Returns both:

1. the active collector configuration, and
2. the current runtime config under `debug.telemetry`.

### Response shape

```json
{
  "collector": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention_seconds": 900,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  }
}
```

### Important distinction

- `collector.retention` is a Go duration serialized in nanoseconds.
- `config.retention_seconds` is the config-facing field used by YAML and the POST update API.

If you are writing tooling, prefer `config.retention_seconds` for human-facing config edits.

---

## Endpoint: `POST /api/debug/telemetry/config`

Updates telemetry settings at runtime and writes them back via the autosave config path.

### Request body

All fields are optional. Only provided fields are changed.
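Since only provided fields are applied, client tooling can construct the body by omitting anything unset. A minimal sketch (the `telemetry_update` helper is hypothetical; the field names are the ones this endpoint accepts):

```python
import json


def telemetry_update(**fields):
    """Serialize a partial POST /api/debug/telemetry/config body.

    Only explicitly supplied fields are included, matching the endpoint's
    "only provided fields are changed" semantics. No value validation is
    done here; the server rejects an invalid reconfiguration with 400.
    """
    allowed = {
        "enabled", "heavy_enabled", "heavy_sample_every", "metric_sample_every",
        "metric_history_max", "event_history_max", "retention_seconds",
        "persist_enabled", "persist_dir", "rotate_mb", "keep_files",
    }
    unknown = set(fields) - allowed
    if unknown:
        raise ValueError(f"unknown telemetry fields: {sorted(unknown)}")
    return json.dumps(fields)


# Enable persistence and raise retention, leaving everything else untouched:
body = telemetry_update(persist_enabled=True, retention_seconds=1800)
```

The resulting string can be sent as the request body with any HTTP client, as in the curl examples later in this document.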
```json
{
  "enabled": true,
  "heavy_enabled": true,
  "heavy_sample_every": 8,
  "metric_sample_every": 1,
  "metric_history_max": 20000,
  "event_history_max": 6000,
  "retention_seconds": 1800,
  "persist_enabled": true,
  "persist_dir": "debug/telemetry",
  "rotate_mb": 32,
  "keep_files": 12
}
```

### Response shape

```json
{
  "ok": true,
  "collector": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention": 1800000000000,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  },
  "config": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention_seconds": 1800,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  }
}
```

### Persistence behavior

A POST updates:

- the runtime manager snapshot/config
- the in-process collector config
- the autosave config file via `config.Save(...)`

That means these updates take effect immediately at runtime and also survive restarts through autosave, unless manually reverted.

### Error cases

- Invalid JSON -> `400 Bad Request`
- Invalid collector reconfiguration -> `400 Bad Request`
- Telemetry unavailable -> `503 Service Unavailable`

---

## Configuration fields (`debug.telemetry`)

Telemetry config lives under:

```yaml
debug:
  telemetry:
    enabled: true
    heavy_enabled: false
    heavy_sample_every: 12
    metric_sample_every: 2
    metric_history_max: 12000
    event_history_max: 4000
    retention_seconds: 900
    persist_enabled: false
    persist_dir: debug/telemetry
    rotate_mb: 16
    keep_files: 8
```

### Field reference

#### `enabled`

Master on/off switch for telemetry collection.
If false:

- metrics are not recorded
- events are not recorded
- the live snapshot remains effectively empty/minimal

#### `heavy_enabled`

Enables more expensive, more detailed telemetry paths that should not be left on permanently unless needed. Use this for deep extractor/IQ/boundary debugging.

#### `heavy_sample_every`

Sampling cadence for heavy telemetry.

- `1` records every eligible heavy sample
- higher values reduce cost by sampling less often

#### `metric_sample_every`

Sampling cadence for normal historical metric point storage. Collector summaries still update live, but historical storage becomes less dense when this value is greater than 1.

#### `metric_history_max`

Maximum number of in-memory historical metric samples retained.

#### `event_history_max`

Maximum number of in-memory telemetry events retained.

#### `retention_seconds`

Time-based in-memory retention window. Older in-memory metrics/events are trimmed once they fall outside this retention period.

#### `persist_enabled`

When enabled, telemetry metrics/events are also appended to rotated JSONL files.

#### `persist_dir`

Directory where rotated telemetry JSONL files are written.

Default:

- `debug/telemetry`

#### `rotate_mb`

Approximate JSONL file rotation threshold in megabytes.

#### `keep_files`

How many rotated telemetry files to retain in `persist_dir`. Older files beyond this count are pruned.

---

## Collector behavior and caveats

### In-memory vs persisted data

The query endpoints can read from both:

- current in-memory collector state/history
- persisted JSONL files

This means a request may return data older than the current in-memory retention if:

- `persist_enabled=true`, and
- `include_persisted=true`

### Sampling behavior

Not every observation necessarily becomes a historical metric point.
The collector:

- always updates live counters/gauges/distributions while enabled
- stores historical points according to `metric_sample_every`
- force-stores selected boundary IQ metrics even when sampling would normally skip them

So the live snapshot and historical series density are intentionally different.

### Distribution summaries

Distribution values in the live snapshot include:

- `count`
- `min`
- `max`
- `mean`
- `last`
- `p95`

The p95 estimate is based on the collector's bounded rolling sample buffer, not an unbounded full-history quantile computation.

### Config serialization detail

The collector's `retention` field is a Go duration. In JSON this appears as an integer nanosecond count. This is expected.

---

## Recommended workflows

### Fast low-overhead runtime watch

Use:

- `enabled=true`
- `heavy_enabled=false`
- `persist_enabled=false`, or `true` if you want an archive

Then query:

- `/api/debug/telemetry/live`
- `/api/debug/telemetry/history?prefix=stage.`
- `/api/debug/telemetry/events?level=warn`

### 5-10 minute anomaly capture

Suggested settings:

- `enabled=true`
- `heavy_enabled=false`
- `persist_enabled=true`
- moderate `metric_sample_every`

Then:

1. note the start time
2. reproduce the workload
3. fetch the live snapshot
4. inspect warning events
5. inspect `stage.*`, `streamer.*`, and `source.*` history

### Deep extractor / boundary investigation

Temporarily enable:

- `heavy_enabled=true`
- `heavy_sample_every` > 1 unless you really need every sample
- `persist_enabled=true`

Then inspect:

- `iq.*`
- `extract.*`
- `audio.*`
- boundary/anomaly events for specific `signal_id` or `session_id`

Turn heavy telemetry back off once done.

---

## Example requests

### Fetch live snapshot

```bash
curl http://localhost:8080/api/debug/telemetry/live
```

### Fetch stage timings from the last 10 minutes

```bash
curl "http://localhost:8080/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage."
```

### Fetch source metrics for one signal

```bash
curl "http://localhost:8080/api/debug/telemetry/history?prefix=source.&signal_id=1"
```

### Fetch warning events only

```bash
curl "http://localhost:8080/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn"
```

### Fetch events with a custom tag filter

```bash
curl "http://localhost:8080/api/debug/telemetry/events?tag_zone=broadcast"
```

### Enable persistence and heavy telemetry temporarily

```bash
curl -X POST http://localhost:8080/api/debug/telemetry/config \
  -H "Content-Type: application/json" \
  -d '{
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "persist_enabled": true
  }'
```

---

## Related docs

- `README.md` - high-level project overview and endpoint summary
- `docs/telemetry-debug-runbook.md` - quick operational runbook for short debug sessions
- `internal/telemetry/telemetry.go` - collector implementation details
- `cmd/sdrd/http_handlers.go` - HTTP wiring for telemetry endpoints
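---

One pattern worth noting for the events endpoint: each event carries an `id`, and the examples in this document are consistent with ids growing over time. Under that assumption (verify it against the collector implementation before relying on it), incremental polling can deduplicate on the highest id seen so far. The `new_events` helper below is a hypothetical sketch of that cursor logic, with the HTTP fetch omitted:

```python
def new_events(batch, last_id):
    """Filter a polled event batch down to events not yet seen.

    Assumes event ids increase over time. Returns the fresh events and the
    advanced cursor to remember for the next poll.
    """
    fresh = [ev for ev in batch if ev["id"] > last_id]
    next_id = max([last_id] + [ev["id"] for ev in fresh])
    return fresh, next_id


# A repeat poll returns id 990 again plus a new event 991:
batch = [
    {"id": 990, "name": "source_reset", "level": "warn"},
    {"id": 991, "name": "demod_boundary", "level": "warn"},
]
fresh, cursor = new_events(batch, last_id=990)
# fresh contains only the id-991 event; cursor advances to 991
```

Combining this with a `since` filter on each request keeps the polled batches small.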