|
|
@@ -0,0 +1,711 @@ |
|
|
|
|
|
# Telemetry API Reference |
|
|
|
|
|
|
|
|
|
|
|
This document describes the server-side telemetry collector, its runtime configuration, and the HTTP API exposed by `sdrd`. |
|
|
|
|
|
|
|
|
|
|
|
The telemetry system is intended for debugging and performance analysis of the SDR pipeline, especially around source cadence, extraction, DSP timing, boundary artifacts, queue pressure, and other runtime anomalies. |
|
|
|
|
|
|
|
|
|
|
|
## Goals |
|
|
|
|
|
|
|
|
|
|
|
The telemetry layer gives you three different views of runtime state: |
|
|
|
|
|
|
|
|
|
|
|
1. **Live snapshot** |
|
|
|
|
|
- Current counters, gauges, distributions, recent events, and collector status. |
|
|
|
|
|
2. **Historical metrics** |
|
|
|
|
|
- Timestamped metric samples that can be filtered by name, prefix, or tags. |
|
|
|
|
|
3. **Historical events** |
|
|
|
|
|
- Structured anomalies / warnings / debug events with optional fields. |
|
|
|
|
|
|
|
|
|
|
|
It is designed to be lightweight in normal operation and more detailed when `heavy_enabled` is turned on. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Base URLs |
|
|
|
|
|
|
|
|
|
|
|
All telemetry endpoints live under: |
|
|
|
|
|
|
|
|
|
|
|
- `/api/debug/telemetry/live` |
|
|
|
|
|
- `/api/debug/telemetry/history` |
|
|
|
|
|
- `/api/debug/telemetry/events` |
|
|
|
|
|
- `/api/debug/telemetry/config` |
|
|
|
|
|
|
|
|
|
|
|
Responses are JSON. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Data model |
|
|
|
|
|
|
|
|
|
|
|
### Metric types |
|
|
|
|
|
|
|
|
|
|
|
Telemetry metrics are stored in three logical groups: |
|
|
|
|
|
|
|
|
|
|
|
- **counter** |
|
|
|
|
|
- Accumulating values, usually incremented over time. |
|
|
|
|
|
- **gauge** |
|
|
|
|
|
- Latest current value. |
|
|
|
|
|
- **distribution** |
|
|
|
|
|
- Observed numeric samples with summary stats. |
|
|
|
|
|
|
|
|
|
|
|
A historical metric sample is returned as: |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"ts": "2026-03-25T12:00:00Z", |
|
|
|
|
|
"name": "stage.extract_stream.duration_ms", |
|
|
|
|
|
"type": "distribution", |
|
|
|
|
|
"value": 4.83, |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"stage": "extract_stream", |
|
|
|
|
|
"signal_id": "1" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Events |
|
|
|
|
|
|
|
|
|
|
|
Telemetry events are structured anomaly/debug records: |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"id": 123, |
|
|
|
|
|
"ts": "2026-03-25T12:00:02Z", |
|
|
|
|
|
"name": "demod_boundary", |
|
|
|
|
|
"level": "warn", |
|
|
|
|
|
"message": "boundary discontinuity detected", |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"signal_id": "1", |
|
|
|
|
|
"stage": "demod" |
|
|
|
|
|
}, |
|
|
|
|
|
"fields": { |
|
|
|
|
|
"d2": 0.3358, |
|
|
|
|
|
"index": 25 |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Tags |
|
|
|
|
|
|
|
|
|
|
|
Tags are string key/value metadata used for filtering and correlation. |
|
|
|
|
|
|
|
|
|
|
|
Common tag keys already supported by the HTTP layer: |
|
|
|
|
|
|
|
|
|
|
|
- `signal_id` |
|
|
|
|
|
- `session_id` |
|
|
|
|
|
- `stage` |
|
|
|
|
|
- `trace_id` |
|
|
|
|
|
- `component` |
|
|
|
|
|
|
|
|
|
|
|
You can also filter on arbitrary tags via `tag_<key>=<value>` query parameters. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Endpoint: `GET /api/debug/telemetry/live` |
|
|
|
|
|
|
|
|
|
|
|
Returns a live snapshot of the in-memory collector state. |
|
|
|
|
|
|
|
|
|
|
|
### Response shape |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"now": "2026-03-25T12:00:05Z", |
|
|
|
|
|
"started_at": "2026-03-25T11:52:10Z", |
|
|
|
|
|
"uptime_ms": 472500, |
|
|
|
|
|
"config": { |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": false, |
|
|
|
|
|
"heavy_sample_every": 12, |
|
|
|
|
|
"metric_sample_every": 2, |
|
|
|
|
|
"metric_history_max": 12000, |
|
|
|
|
|
"event_history_max": 4000, |
|
|
|
|
|
"retention": 900000000000, |
|
|
|
|
|
"persist_enabled": false, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 16, |
|
|
|
|
|
"keep_files": 8 |
|
|
|
|
|
}, |
|
|
|
|
|
"counters": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "source.resets", |
|
|
|
|
|
"value": 1, |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"component": "source" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"gauges": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "source.buffer_samples", |
|
|
|
|
|
"value": 304128, |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"component": "source" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"distributions": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"name": "dsp.frame.duration_ms", |
|
|
|
|
|
"count": 96, |
|
|
|
|
|
"min": 82.5, |
|
|
|
|
|
"max": 212.4, |
|
|
|
|
|
"mean": 104.8, |
|
|
|
|
|
"last": 98.3, |
|
|
|
|
|
"p95": 149.2, |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"stage": "dsp" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"recent_events": [], |
|
|
|
|
|
"status": { |
|
|
|
|
|
"source_state": "running" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Notes |
|
|
|
|
|
|
|
|
|
|
|
- `counters`, `gauges`, and `distributions` are sorted by metric name. |
|
|
|
|
|
- `recent_events` contains the most recent in-memory event slice. |
|
|
|
|
|
- `status` is optional and contains arbitrary runtime status published by code using `SetStatus(...)`. |
|
|
|
|
|
- If telemetry is unavailable, the server returns a small JSON object instead of a full snapshot. |
|
|
|
|
|
|
|
|
|
|
|
### Typical uses |
|
|
|
|
|
|
|
|
|
|
|
- Check whether telemetry is enabled. |
|
|
|
|
|
- Look for timing hotspots in `*.duration_ms` distributions. |
|
|
|
|
|
- Inspect current queue or source gauges. |
|
|
|
|
|
- See recent anomaly events without querying history. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Endpoint: `GET /api/debug/telemetry/history` |
|
|
|
|
|
|
|
|
|
|
|
Returns historical metric samples from in-memory history and, optionally, persisted JSONL files. |
|
|
|
|
|
|
|
|
|
|
|
### Response shape |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"items": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"ts": "2026-03-25T12:00:01Z", |
|
|
|
|
|
"name": "stage.extract_stream.duration_ms", |
|
|
|
|
|
"type": "distribution", |
|
|
|
|
|
"value": 5.2, |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"stage": "extract_stream", |
|
|
|
|
|
"signal_id": "2" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"count": 1 |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Supported query parameters |
|
|
|
|
|
|
|
|
|
|
|
#### Time filters |
|
|
|
|
|
|
|
|
|
|
|
- `since` |
|
|
|
|
|
- `until` |
|
|
|
|
|
|
|
|
|
|
|
Accepted formats: |
|
|
|
|
|
|
|
|
|
|
|
- Unix seconds |
|
|
|
|
|
- Unix milliseconds |
|
|
|
|
|
- RFC3339 |
|
|
|
|
|
- RFC3339Nano |
|
|
|
|
|
|
|
|
|
|
|
Examples: |
|
|
|
|
|
|
|
|
|
|
|
- `?since=1711368000` |
|
|
|
|
|
- `?since=1711368000123` |
|
|
|
|
|
- `?since=2026-03-25T12:00:00Z` |
|
|
|
|
|
|
|
|
|
|
|
#### Result shaping |
|
|
|
|
|
|
|
|
|
|
|
- `limit` |
|
|
|
|
|
- Default normalization is 500. |
|
|
|
|
|
- Values above 5000 are clamped down by the collector query layer. |
|
|
|
|
|
|
|
|
|
|
|
#### Name filters |
|
|
|
|
|
|
|
|
|
|
|
- `name=<exact_metric_name>` |
|
|
|
|
|
- `prefix=<metric_name_prefix>` |
|
|
|
|
|
|
|
|
|
|
|
Examples: |
|
|
|
|
|
|
|
|
|
|
|
- `?name=source.read.duration_ms` |
|
|
|
|
|
- `?prefix=stage.` |
|
|
|
|
|
- `?prefix=iq.extract.` |
|
|
|
|
|
|
|
|
|
|
|
#### Tag filters |
|
|
|
|
|
|
|
|
|
|
|
Special convenience query params map directly to tag filters: |
|
|
|
|
|
|
|
|
|
|
|
- `signal_id` |
|
|
|
|
|
- `session_id` |
|
|
|
|
|
- `stage` |
|
|
|
|
|
- `trace_id` |
|
|
|
|
|
- `component` |
|
|
|
|
|
|
|
|
|
|
|
Arbitrary tag filters: |
|
|
|
|
|
|
|
|
|
|
|
- `tag_<key>=<value>` |
|
|
|
|
|
|
|
|
|
|
|
Examples: |
|
|
|
|
|
|
|
|
|
|
|
- `?signal_id=1` |
|
|
|
|
|
- `?stage=extract_stream` |
|
|
|
|
|
- `?tag_path=gpu` |
|
|
|
|
|
- `?tag_zone=broadcast` |
|
|
|
|
|
|
|
|
|
|
|
#### Persistence control |
|
|
|
|
|
|
|
|
|
|
|
- `include_persisted=true|false` |
|
|
|
|
|
- Default: `true` |
|
|
|
|
|
|
|
|
|
|
|
When enabled and persistence is active, the server reads matching data from rotated JSONL telemetry files in addition to in-memory history. |
|
|
|
|
|
|
|
|
|
|
|
### Notes |
|
|
|
|
|
|
|
|
|
|
|
- Results are sorted by timestamp ascending. |
|
|
|
|
|
- If `limit` is hit, the most recent matching items are retained. |
|
|
|
|
|
- Exact retention depends on both in-memory retention and persisted file availability. |
|
|
|
|
|
- A small set of boundary-related IQ metrics is force-stored regardless of the normal metric sample cadence. |
|
|
|
|
|
|
|
|
|
|
|
### Typical queries |
|
|
|
|
|
|
|
|
|
|
|
Get all stage timing since a specific start: |
|
|
|
|
|
|
|
|
|
|
|
```text |
|
|
|
|
|
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage. |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
Get extraction metrics for a single signal: |
|
|
|
|
|
|
|
|
|
|
|
```text |
|
|
|
|
|
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=extract.&signal_id=2 |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
Get source cadence metrics only from in-memory history: |
|
|
|
|
|
|
|
|
|
|
|
```text |
|
|
|
|
|
/api/debug/telemetry/history?prefix=source.&include_persisted=false |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Endpoint: `GET /api/debug/telemetry/events` |
|
|
|
|
|
|
|
|
|
|
|
Returns historical telemetry events from memory and, optionally, persisted storage. |
|
|
|
|
|
|
|
|
|
|
|
### Response shape |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"items": [ |
|
|
|
|
|
{ |
|
|
|
|
|
"id": 991, |
|
|
|
|
|
"ts": "2026-03-25T12:00:03Z", |
|
|
|
|
|
"name": "source_reset", |
|
|
|
|
|
"level": "warn", |
|
|
|
|
|
"message": "source reader reset observed", |
|
|
|
|
|
"tags": { |
|
|
|
|
|
"component": "source" |
|
|
|
|
|
}, |
|
|
|
|
|
"fields": { |
|
|
|
|
|
"reason": "short_read" |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
], |
|
|
|
|
|
"count": 1 |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Supported query parameters |
|
|
|
|
|
|
|
|
|
|
|
All `history` filters are also supported here, plus: |
|
|
|
|
|
|
|
|
|
|
|
- `level=<debug|info|warn|error|...>` |
|
|
|
|
|
|
|
|
|
|
|
Examples: |
|
|
|
|
|
|
|
|
|
|
|
- `?since=2026-03-25T12:00:00Z&level=warn` |
|
|
|
|
|
- `?prefix=audio.&signal_id=1` |
|
|
|
|
|
- `?name=demod_boundary&signal_id=1` |
|
|
|
|
|
|
|
|
|
|
|
### Notes |
|
|
|
|
|
|
|
|
|
|
|
- Event matching supports `name`, `prefix`, `level`, time range, and tags. |
|
|
|
|
|
- Event `level` matching is case-insensitive. |
|
|
|
|
|
- Results are timestamp-sorted ascending. |
|
|
|
|
|
|
|
|
|
|
|
### Typical queries |
|
|
|
|
|
|
|
|
|
|
|
Get warnings during a reproduction run: |
|
|
|
|
|
|
|
|
|
|
|
```text |
|
|
|
|
|
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
Get boundary-related events for one signal: |
|
|
|
|
|
|
|
|
|
|
|
```text |
|
|
|
|
|
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&signal_id=1&prefix=demod_ |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Endpoint: `GET /api/debug/telemetry/config` |
|
|
|
|
|
|
|
|
|
|
|
Returns both: |
|
|
|
|
|
|
|
|
|
|
|
1. the active collector configuration, and |
|
|
|
|
|
2. the current runtime config under `debug.telemetry` |
|
|
|
|
|
|
|
|
|
|
|
### Response shape |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"collector": { |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": false, |
|
|
|
|
|
"heavy_sample_every": 12, |
|
|
|
|
|
"metric_sample_every": 2, |
|
|
|
|
|
"metric_history_max": 12000, |
|
|
|
|
|
"event_history_max": 4000, |
|
|
|
|
|
"retention": 900000000000, |
|
|
|
|
|
"persist_enabled": false, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 16, |
|
|
|
|
|
"keep_files": 8 |
|
|
|
|
|
}, |
|
|
|
|
|
"config": { |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": false, |
|
|
|
|
|
"heavy_sample_every": 12, |
|
|
|
|
|
"metric_sample_every": 2, |
|
|
|
|
|
"metric_history_max": 12000, |
|
|
|
|
|
"event_history_max": 4000, |
|
|
|
|
|
"retention_seconds": 900, |
|
|
|
|
|
"persist_enabled": false, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 16, |
|
|
|
|
|
"keep_files": 8 |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Important distinction |
|
|
|
|
|
|
|
|
|
|
|
- `collector.retention` is a Go duration serialized in nanoseconds. |
|
|
|
|
|
- `config.retention_seconds` is the config-facing field used by YAML and the POST update API. |
|
|
|
|
|
|
|
|
|
|
|
If you are writing tooling, prefer `config.retention_seconds` for human-facing config edits. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Endpoint: `POST /api/debug/telemetry/config` |
|
|
|
|
|
|
|
|
|
|
|
Updates telemetry settings at runtime and writes them back via the autosave config path. |
|
|
|
|
|
|
|
|
|
|
|
### Request body |
|
|
|
|
|
|
|
|
|
|
|
All fields are optional. Only provided fields are changed. |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": true, |
|
|
|
|
|
"heavy_sample_every": 8, |
|
|
|
|
|
"metric_sample_every": 1, |
|
|
|
|
|
"metric_history_max": 20000, |
|
|
|
|
|
"event_history_max": 6000, |
|
|
|
|
|
"retention_seconds": 1800, |
|
|
|
|
|
"persist_enabled": true, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 32, |
|
|
|
|
|
"keep_files": 12 |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Response shape |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"ok": true, |
|
|
|
|
|
"collector": { |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": true, |
|
|
|
|
|
"heavy_sample_every": 8, |
|
|
|
|
|
"metric_sample_every": 1, |
|
|
|
|
|
"metric_history_max": 20000, |
|
|
|
|
|
"event_history_max": 6000, |
|
|
|
|
|
"retention": 1800000000000, |
|
|
|
|
|
"persist_enabled": true, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 32, |
|
|
|
|
|
"keep_files": 12 |
|
|
|
|
|
}, |
|
|
|
|
|
"config": { |
|
|
|
|
|
"enabled": true, |
|
|
|
|
|
"heavy_enabled": true, |
|
|
|
|
|
"heavy_sample_every": 8, |
|
|
|
|
|
"metric_sample_every": 1, |
|
|
|
|
|
"metric_history_max": 20000, |
|
|
|
|
|
"event_history_max": 6000, |
|
|
|
|
|
"retention_seconds": 1800, |
|
|
|
|
|
"persist_enabled": true, |
|
|
|
|
|
"persist_dir": "debug/telemetry", |
|
|
|
|
|
"rotate_mb": 32, |
|
|
|
|
|
"keep_files": 12 |
|
|
|
|
|
} |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Persistence behavior |
|
|
|
|
|
|
|
|
|
|
|
A POST updates: |
|
|
|
|
|
|
|
|
|
|
|
- the runtime manager snapshot/config |
|
|
|
|
|
- the in-process collector config |
|
|
|
|
|
- the autosave config file via `config.Save(...)` |
|
|
|
|
|
|
|
|
|
|
|
That means these updates are runtime-effective immediately and also survive restarts through autosave, unless manually reverted. |
|
|
|
|
|
|
|
|
|
|
|
### Error cases |
|
|
|
|
|
|
|
|
|
|
|
- Invalid JSON -> `400 Bad Request` |
|
|
|
|
|
- Invalid collector reconfiguration -> `400 Bad Request` |
|
|
|
|
|
- Telemetry unavailable -> `503 Service Unavailable` |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Configuration fields (`debug.telemetry`) |
|
|
|
|
|
|
|
|
|
|
|
Telemetry config lives under: |
|
|
|
|
|
|
|
|
|
|
|
```yaml |
|
|
|
|
|
debug: |
|
|
|
|
|
telemetry: |
|
|
|
|
|
enabled: true |
|
|
|
|
|
heavy_enabled: false |
|
|
|
|
|
heavy_sample_every: 12 |
|
|
|
|
|
metric_sample_every: 2 |
|
|
|
|
|
metric_history_max: 12000 |
|
|
|
|
|
event_history_max: 4000 |
|
|
|
|
|
retention_seconds: 900 |
|
|
|
|
|
persist_enabled: false |
|
|
|
|
|
persist_dir: debug/telemetry |
|
|
|
|
|
rotate_mb: 16 |
|
|
|
|
|
keep_files: 8 |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Field reference |
|
|
|
|
|
|
|
|
|
|
|
#### `enabled` |
|
|
|
|
|
Master on/off switch for telemetry collection. |
|
|
|
|
|
|
|
|
|
|
|
If false: |
|
|
|
|
|
- metrics are not recorded |
|
|
|
|
|
- events are not recorded |
|
|
|
|
|
- live snapshot remains effectively empty/minimal |
|
|
|
|
|
|
|
|
|
|
|
#### `heavy_enabled` |
|
|
|
|
|
Enables more expensive / more detailed telemetry paths that should not be left on permanently unless needed. |
|
|
|
|
|
|
|
|
|
|
|
Use this for deep extractor/IQ/boundary debugging. |
|
|
|
|
|
|
|
|
|
|
|
#### `heavy_sample_every` |
|
|
|
|
|
Sampling cadence for heavy telemetry. |
|
|
|
|
|
|
|
|
|
|
|
- `1` means every eligible heavy sample |
|
|
|
|
|
- higher numbers reduce cost by sampling less often |
|
|
|
|
|
|
|
|
|
|
|
#### `metric_sample_every` |
|
|
|
|
|
Sampling cadence for normal historical metric point storage. |
|
|
|
|
|
|
|
|
|
|
|
Collector summaries still update live, but historical storage becomes less dense when this value is greater than 1. |
|
|
|
|
|
|
|
|
|
|
|
#### `metric_history_max` |
|
|
|
|
|
Maximum number of in-memory historical metric samples retained. |
|
|
|
|
|
|
|
|
|
|
|
#### `event_history_max` |
|
|
|
|
|
Maximum number of in-memory telemetry events retained. |
|
|
|
|
|
|
|
|
|
|
|
#### `retention_seconds` |
|
|
|
|
|
Time-based in-memory retention window. |
|
|
|
|
|
|
|
|
|
|
|
Older in-memory metrics/events are trimmed once they fall outside this retention period. |
|
|
|
|
|
|
|
|
|
|
|
#### `persist_enabled` |
|
|
|
|
|
When enabled, telemetry metrics/events are also appended to rotated JSONL files. |
|
|
|
|
|
|
|
|
|
|
|
#### `persist_dir` |
|
|
|
|
|
Directory where rotated telemetry JSONL files are written. |
|
|
|
|
|
|
|
|
|
|
|
Default: |
|
|
|
|
|
|
|
|
|
|
|
- `debug/telemetry` |
|
|
|
|
|
|
|
|
|
|
|
#### `rotate_mb` |
|
|
|
|
|
Approximate JSONL file rotation threshold in megabytes. |
|
|
|
|
|
|
|
|
|
|
|
#### `keep_files` |
|
|
|
|
|
How many rotated telemetry files to retain in `persist_dir`. |
|
|
|
|
|
|
|
|
|
|
|
Older files beyond this count are pruned. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Collector behavior and caveats |
|
|
|
|
|
|
|
|
|
|
|
### In-memory vs persisted data |
|
|
|
|
|
|
|
|
|
|
|
The query endpoints can read from both: |
|
|
|
|
|
|
|
|
|
|
|
- current in-memory collector state/history |
|
|
|
|
|
- persisted JSONL files |
|
|
|
|
|
|
|
|
|
|
|
This means a request may return data older than current in-memory retention if: |
|
|
|
|
|
|
|
|
|
|
|
- `persist_enabled=true`, and |
|
|
|
|
|
- `include_persisted=true` |
|
|
|
|
|
|
|
|
|
|
|
### Sampling behavior |
|
|
|
|
|
|
|
|
|
|
|
Not every observation necessarily becomes a historical metric point. |
|
|
|
|
|
|
|
|
|
|
|
The collector: |
|
|
|
|
|
|
|
|
|
|
|
- always updates live counters/gauges/distributions while enabled |
|
|
|
|
|
- stores historical points according to `metric_sample_every` |
|
|
|
|
|
- force-stores selected boundary IQ metrics even when sampling would normally skip them |
|
|
|
|
|
|
|
|
|
|
|
So the live snapshot and historical series density are intentionally different. |
|
|
|
|
|
|
|
|
|
|
|
### Distribution summaries |
|
|
|
|
|
|
|
|
|
|
|
Distribution values in the live snapshot include: |
|
|
|
|
|
|
|
|
|
|
|
- `count` |
|
|
|
|
|
- `min` |
|
|
|
|
|
- `max` |
|
|
|
|
|
- `mean` |
|
|
|
|
|
- `last` |
|
|
|
|
|
- `p95` |
|
|
|
|
|
|
|
|
|
|
|
The p95 estimate is based on the collector's bounded rolling sample buffer, not an unbounded full-history quantile computation. |
|
|
|
|
|
|
|
|
|
|
|
### Config serialization detail |
|
|
|
|
|
|
|
|
|
|
|
The collector's `retention` field is a Go duration. In JSON this appears as an integer nanosecond count. |
|
|
|
|
|
|
|
|
|
|
|
This is expected. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Recommended workflows |
|
|
|
|
|
|
|
|
|
|
|
### Fast low-overhead runtime watch |
|
|
|
|
|
|
|
|
|
|
|
Use: |
|
|
|
|
|
|
|
|
|
|
|
- `enabled=true` |
|
|
|
|
|
- `heavy_enabled=false` |
|
|
|
|
|
- `persist_enabled=false` or `true` if you want an archive |
|
|
|
|
|
|
|
|
|
|
|
Then query: |
|
|
|
|
|
|
|
|
|
|
|
- `/api/debug/telemetry/live` |
|
|
|
|
|
- `/api/debug/telemetry/history?prefix=stage.` |
|
|
|
|
|
- `/api/debug/telemetry/events?level=warn` |
|
|
|
|
|
|
|
|
|
|
|
### 5-10 minute anomaly capture |
|
|
|
|
|
|
|
|
|
|
|
Suggested settings: |
|
|
|
|
|
|
|
|
|
|
|
- `enabled=true` |
|
|
|
|
|
- `heavy_enabled=false` |
|
|
|
|
|
- `persist_enabled=true` |
|
|
|
|
|
- moderate `metric_sample_every` |
|
|
|
|
|
|
|
|
|
|
|
Then: |
|
|
|
|
|
|
|
|
|
|
|
1. note start time |
|
|
|
|
|
2. reproduce workload |
|
|
|
|
|
3. fetch live snapshot |
|
|
|
|
|
4. inspect warning events |
|
|
|
|
|
5. inspect `stage.*`, `streamer.*`, and `source.*` history |
|
|
|
|
|
|
|
|
|
|
|
### Deep extractor / boundary investigation |
|
|
|
|
|
|
|
|
|
|
|
Temporarily enable: |
|
|
|
|
|
|
|
|
|
|
|
- `heavy_enabled=true` |
|
|
|
|
|
- `heavy_sample_every` > 1 unless you really need every sample |
|
|
|
|
|
- `persist_enabled=true` |
|
|
|
|
|
|
|
|
|
|
|
Then inspect: |
|
|
|
|
|
|
|
|
|
|
|
- `iq.*` |
|
|
|
|
|
- `extract.*` |
|
|
|
|
|
- `audio.*` |
|
|
|
|
|
- boundary/anomaly events for specific `signal_id` or `session_id` |
|
|
|
|
|
|
|
|
|
|
|
Turn heavy telemetry back off once done. |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Example requests |
|
|
|
|
|
|
|
|
|
|
|
### Fetch live snapshot |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl http://localhost:8080/api/debug/telemetry/live |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Fetch stage timings from the last 10 minutes |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl "http://localhost:8080/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage." |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Fetch source metrics for one signal |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl "http://localhost:8080/api/debug/telemetry/history?prefix=source.&signal_id=1" |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Fetch warning events only |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl "http://localhost:8080/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn" |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Fetch events with a custom tag filter |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl "http://localhost:8080/api/debug/telemetry/events?tag_zone=broadcast" |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
### Enable persistence and heavy telemetry temporarily |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
curl -X POST http://localhost:8080/api/debug/telemetry/config \ |
|
|
|
|
|
-H "Content-Type: application/json" \ |
|
|
|
|
|
-d '{ |
|
|
|
|
|
"heavy_enabled": true, |
|
|
|
|
|
"heavy_sample_every": 8, |
|
|
|
|
|
"persist_enabled": true |
|
|
|
|
|
}' |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
## Related docs |
|
|
|
|
|
|
|
|
|
|
|
- `README.md` - high-level project overview and endpoint summary |
|
|
|
|
|
- `docs/telemetry-debug-runbook.md` - quick operational runbook for short debug sessions |
|
|
|
|
|
- `internal/telemetry/telemetry.go` - collector implementation details |
|
|
|
|
|
- `cmd/sdrd/http_handlers.go` - HTTP wiring for telemetry endpoints |