
docs: document telemetry API

master
Jan Svabenik, 2 days ago
Parent commit 002f7d0cda
2 changed files with 737 additions and 0 deletions
  1. README.md (+26, -0)
  2. docs/telemetry-api.md (+711, -0)

README.md (+26, -0)

@@ -192,6 +192,32 @@ go build -tags sdrplay ./cmd/sdrd
- `GET /api/signals` -> current live signals
- `GET /api/events?limit=&since=` -> recent events

### Debug Telemetry
- `GET /api/debug/telemetry/live` -> current telemetry snapshot (counters, gauges, distributions, recent events, collector status/config)
- `GET /api/debug/telemetry/history` -> historical metric samples with filtering by time/name/prefix/tags
- `GET /api/debug/telemetry/events` -> telemetry event/anomaly history with filtering by time/name/prefix/level/tags
- `GET /api/debug/telemetry/config` -> current collector config plus `debug.telemetry` runtime config
- `POST /api/debug/telemetry/config` -> update telemetry settings at runtime and persist them to autosave config

Telemetry query params (`history` / `events`) include:
- `since`, `until` -> unix seconds, unix milliseconds, or RFC3339 timestamps
- `limit`
- `name`, `prefix`
- `signal_id`, `session_id`, `stage`, `trace_id`, `component`
- `tag_<key>=<value>` for arbitrary tag filters
- `include_persisted=true|false` (default `true`)
- `level` on the events endpoint

Telemetry config lives under `debug.telemetry`:
- `enabled`, `heavy_enabled`, `heavy_sample_every`
- `metric_sample_every`, `metric_history_max`, `event_history_max`
- `retention_seconds`
- `persist_enabled`, `persist_dir`, `rotate_mb`, `keep_files`

See also:
- `docs/telemetry-api.md` for the full telemetry API reference
- `docs/telemetry-debug-runbook.md` for the short operational debug flow

### Recordings
- `GET /api/recordings`
- `GET /api/recordings/:id` (meta.json)


docs/telemetry-api.md (+711, -0)

@@ -0,0 +1,711 @@
# Telemetry API Reference

This document describes the server-side telemetry collector, its runtime configuration, and the HTTP API exposed by `sdrd`.

The telemetry system is intended for debugging and performance analysis of the SDR pipeline, especially around source cadence, extraction, DSP timing, boundary artifacts, queue pressure, and other runtime anomalies.

## Goals

The telemetry layer gives you three different views of runtime state:

1. **Live snapshot**
   - Current counters, gauges, distributions, recent events, and collector status.
2. **Historical metrics**
   - Timestamped metric samples that can be filtered by name, prefix, or tags.
3. **Historical events**
   - Structured anomalies / warnings / debug events with optional fields.

It is designed to be lightweight in normal operation and more detailed when `heavy_enabled` is turned on.

---

## Base URLs

All telemetry endpoints live under:

- `/api/debug/telemetry/live`
- `/api/debug/telemetry/history`
- `/api/debug/telemetry/events`
- `/api/debug/telemetry/config`

Responses are JSON.

---

## Data model

### Metric types

Telemetry metrics are stored in three logical groups:

- **counter**
  - Accumulating values, usually incremented over time.
- **gauge**
  - Latest current value.
- **distribution**
  - Observed numeric samples with summary stats.

A historical metric sample is returned as:

```json
{
  "ts": "2026-03-25T12:00:00Z",
  "name": "stage.extract_stream.duration_ms",
  "type": "distribution",
  "value": 4.83,
  "tags": {
    "stage": "extract_stream",
    "signal_id": "1"
  }
}
```

### Events

Telemetry events are structured anomaly/debug records:

```json
{
  "id": 123,
  "ts": "2026-03-25T12:00:02Z",
  "name": "demod_boundary",
  "level": "warn",
  "message": "boundary discontinuity detected",
  "tags": {
    "signal_id": "1",
    "stage": "demod"
  },
  "fields": {
    "d2": 0.3358,
    "index": 25
  }
}
```

### Tags

Tags are string key/value metadata used for filtering and correlation.

Common tag keys already supported by the HTTP layer:

- `signal_id`
- `session_id`
- `stage`
- `trace_id`
- `component`

You can also filter on arbitrary tags via `tag_<key>=<value>` query parameters.
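Assembling these queries by hand is error-prone; a small Go sketch using `net/url` shows the `tag_<key>` convention (the helper name `buildQuery` is ours, not part of the server):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery assembles a telemetry query string from convenience
// filters and arbitrary tag filters (sent as tag_<key>=<value>).
func buildQuery(base string, params, tags map[string]string) string {
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	for k, v := range tags {
		q.Set("tag_"+k, v) // arbitrary tags get the tag_ prefix
	}
	return base + "?" + q.Encode() // Encode sorts keys deterministically
}

func main() {
	u := buildQuery("/api/debug/telemetry/history",
		map[string]string{"prefix": "stage.", "signal_id": "1"},
		map[string]string{"zone": "broadcast"})
	fmt.Println(u)
}
```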

---

## Endpoint: `GET /api/debug/telemetry/live`

Returns a live snapshot of the in-memory collector state.

### Response shape

```json
{
  "now": "2026-03-25T12:00:05Z",
  "started_at": "2026-03-25T11:52:10Z",
  "uptime_ms": 472500,
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "counters": [
    {
      "name": "source.resets",
      "value": 1,
      "tags": {
        "component": "source"
      }
    }
  ],
  "gauges": [
    {
      "name": "source.buffer_samples",
      "value": 304128,
      "tags": {
        "component": "source"
      }
    }
  ],
  "distributions": [
    {
      "name": "dsp.frame.duration_ms",
      "count": 96,
      "min": 82.5,
      "max": 212.4,
      "mean": 104.8,
      "last": 98.3,
      "p95": 149.2,
      "tags": {
        "stage": "dsp"
      }
    }
  ],
  "recent_events": [],
  "status": {
    "source_state": "running"
  }
}
```

### Notes

- `counters`, `gauges`, and `distributions` are sorted by metric name.
- `recent_events` contains the most recent in-memory event slice.
- `status` is optional and contains arbitrary runtime status published by code using `SetStatus(...)`.
- If telemetry is unavailable, the server returns a small JSON object instead of a full snapshot.

### Typical uses

- Check whether telemetry is enabled.
- Look for timing hotspots in `*.duration_ms` distributions.
- Inspect current queue or source gauges.
- See recent anomaly events without querying history.

---

## Endpoint: `GET /api/debug/telemetry/history`

Returns historical metric samples from in-memory history and, optionally, persisted JSONL files.

### Response shape

```json
{
  "items": [
    {
      "ts": "2026-03-25T12:00:01Z",
      "name": "stage.extract_stream.duration_ms",
      "type": "distribution",
      "value": 5.2,
      "tags": {
        "stage": "extract_stream",
        "signal_id": "2"
      }
    }
  ],
  "count": 1
}
```

### Supported query parameters

#### Time filters

- `since`
- `until`

Accepted formats:

- Unix seconds
- Unix milliseconds
- RFC3339
- RFC3339Nano

Examples:

- `?since=1711368000`
- `?since=1711368000123`
- `?since=2026-03-25T12:00:00Z`

#### Result shaping

- `limit`
  - Defaults to 500 when omitted.
  - Values above 5000 are clamped by the collector query layer.

#### Name filters

- `name=<exact_metric_name>`
- `prefix=<metric_name_prefix>`

Examples:

- `?name=source.read.duration_ms`
- `?prefix=stage.`
- `?prefix=iq.extract.`

#### Tag filters

Special convenience query params map directly to tag filters:

- `signal_id`
- `session_id`
- `stage`
- `trace_id`
- `component`

Arbitrary tag filters:

- `tag_<key>=<value>`

Examples:

- `?signal_id=1`
- `?stage=extract_stream`
- `?tag_path=gpu`
- `?tag_zone=broadcast`

#### Persistence control

- `include_persisted=true|false`
  - Default: `true`

When enabled and persistence is active, the server reads matching data from rotated JSONL telemetry files in addition to in-memory history.

### Notes

- Results are sorted by timestamp ascending.
- If `limit` is hit, the most recent matching items are retained.
- Exact retention depends on both in-memory retention and persisted file availability.
- A small set of boundary-related IQ metrics is force-stored regardless of the normal metric sample cadence.

### Typical queries

Get all stage timing since a specific start:

```text
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage.
```

Get extraction metrics for a single signal:

```text
/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=extract.&signal_id=2
```

Get source cadence metrics only from in-memory history:

```text
/api/debug/telemetry/history?prefix=source.&include_persisted=false
```

---

## Endpoint: `GET /api/debug/telemetry/events`

Returns historical telemetry events from memory and, optionally, persisted storage.

### Response shape

```json
{
  "items": [
    {
      "id": 991,
      "ts": "2026-03-25T12:00:03Z",
      "name": "source_reset",
      "level": "warn",
      "message": "source reader reset observed",
      "tags": {
        "component": "source"
      },
      "fields": {
        "reason": "short_read"
      }
    }
  ],
  "count": 1
}
```

### Supported query parameters

All `history` filters are also supported here, plus:

- `level=<debug|info|warn|error|...>`

Examples:

- `?since=2026-03-25T12:00:00Z&level=warn`
- `?prefix=audio.&signal_id=1`
- `?name=demod_boundary&signal_id=1`

### Notes

- Event matching supports `name`, `prefix`, `level`, time range, and tags.
- Event `level` matching is case-insensitive.
- Results are timestamp-sorted ascending.

### Typical queries

Get warnings during a reproduction run:

```text
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn
```

Get boundary-related events for one signal:

```text
/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&signal_id=1&prefix=demod_
```

---

## Endpoint: `GET /api/debug/telemetry/config`

Returns both:

1. the active collector configuration, and
2. the current runtime config under `debug.telemetry`

### Response shape

```json
{
  "collector": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention_seconds": 900,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  }
}
```

### Important distinction

- `collector.retention` is a Go duration serialized in nanoseconds.
- `config.retention_seconds` is the config-facing field used by YAML and the POST update API.

If you are writing tooling, prefer `config.retention_seconds` for human-facing config edits.

---

## Endpoint: `POST /api/debug/telemetry/config`

Updates telemetry settings at runtime and writes them back via the autosave config path.

### Request body

All fields are optional. Only provided fields are changed.

```json
{
  "enabled": true,
  "heavy_enabled": true,
  "heavy_sample_every": 8,
  "metric_sample_every": 1,
  "metric_history_max": 20000,
  "event_history_max": 6000,
  "retention_seconds": 1800,
  "persist_enabled": true,
  "persist_dir": "debug/telemetry",
  "rotate_mb": 32,
  "keep_files": 12
}
```

### Response shape

```json
{
  "ok": true,
  "collector": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention": 1800000000000,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  },
  "config": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention_seconds": 1800,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  }
}
```

### Persistence behavior

A POST updates:

- the runtime manager snapshot/config
- the in-process collector config
- the autosave config file via `config.Save(...)`

That means these updates are runtime-effective immediately and also survive restarts through autosave, unless manually reverted.

### Error cases

- Invalid JSON -> `400 Bad Request`
- Invalid collector reconfiguration -> `400 Bad Request`
- Telemetry unavailable -> `503 Service Unavailable`

---

## Configuration fields (`debug.telemetry`)

Telemetry config lives under:

```yaml
debug:
  telemetry:
    enabled: true
    heavy_enabled: false
    heavy_sample_every: 12
    metric_sample_every: 2
    metric_history_max: 12000
    event_history_max: 4000
    retention_seconds: 900
    persist_enabled: false
    persist_dir: debug/telemetry
    rotate_mb: 16
    keep_files: 8
```

### Field reference

#### `enabled`
Master on/off switch for telemetry collection.

If false:
- metrics are not recorded
- events are not recorded
- live snapshot remains effectively empty/minimal

#### `heavy_enabled`
Enables more expensive / more detailed telemetry paths that should not be left on permanently unless needed.

Use this for deep extractor/IQ/boundary debugging.

#### `heavy_sample_every`
Sampling cadence for heavy telemetry.

- `1` means every eligible heavy sample
- higher numbers reduce cost by sampling less often

#### `metric_sample_every`
Sampling cadence for normal historical metric point storage.

Collector summaries still update live, but historical storage becomes less dense when this value is greater than 1.

#### `metric_history_max`
Maximum number of in-memory historical metric samples retained.

#### `event_history_max`
Maximum number of in-memory telemetry events retained.

#### `retention_seconds`
Time-based in-memory retention window.

Older in-memory metrics/events are trimmed once they fall outside this retention period.

#### `persist_enabled`
When enabled, telemetry metrics/events are also appended to rotated JSONL files.

#### `persist_dir`
Directory where rotated telemetry JSONL files are written.

Default:

- `debug/telemetry`

#### `rotate_mb`
Approximate JSONL file rotation threshold in megabytes.

#### `keep_files`
How many rotated telemetry files to retain in `persist_dir`.

Older files beyond this count are pruned.

---

## Collector behavior and caveats

### In-memory vs persisted data

The query endpoints can read from both:

- current in-memory collector state/history
- persisted JSONL files

This means a request may return data older than current in-memory retention if:

- `persist_enabled=true`, and
- `include_persisted=true`

### Sampling behavior

Not every observation necessarily becomes a historical metric point.

The collector:

- always updates live counters/gauges/distributions while enabled
- stores historical points according to `metric_sample_every`
- force-stores selected boundary IQ metrics even when sampling would normally skip them

So the live snapshot and historical series density are intentionally different.
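The cadence plus the force-store exception can be sketched as follows; the metric name in the forced set is hypothetical, and the collector's real decision logic may differ:

```go
package main

import "fmt"

// shouldStore decides whether the n-th observation (1-based) becomes a
// historical point under a sample-every cadence, with a force-stored
// exception set mimicking the boundary-IQ behavior described above.
func shouldStore(n, sampleEvery int, name string, forced map[string]bool) bool {
	if forced[name] {
		return true // force-stored regardless of cadence
	}
	if sampleEvery <= 1 {
		return true // cadence of 1 stores every observation
	}
	return n%sampleEvery == 0
}

func main() {
	forced := map[string]bool{"iq.boundary.d2": true} // hypothetical name
	stored := 0
	for n := 1; n <= 10; n++ {
		if shouldStore(n, 2, "stage.dsp.duration_ms", forced) {
			stored++
		}
	}
	fmt.Println(stored) // every 2nd of 10 observations is stored
	fmt.Println(shouldStore(3, 2, "iq.boundary.d2", forced))
}
```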

### Distribution summaries

Distribution values in the live snapshot include:

- `count`
- `min`
- `max`
- `mean`
- `last`
- `p95`

The p95 estimate is based on the collector's bounded rolling sample buffer, not an unbounded full-history quantile computation.
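For intuition, a nearest-rank p95 over a bounded buffer looks like this in Go; the collector's exact estimator may differ, and the index rule here is an assumption:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 estimates the 95th percentile of a bounded rolling sample buffer
// using the nearest-rank method: sort a copy, take element ceil(0.95*n).
func p95(buf []float64) float64 {
	if len(buf) == 0 {
		return 0
	}
	s := append([]float64(nil), buf...) // don't disturb the live buffer
	sort.Float64s(s)
	idx := (len(s)*95 + 99) / 100 // integer ceil(0.95 * n)
	if idx > len(s) {
		idx = len(s)
	}
	return s[idx-1]
}

func main() {
	var buf []float64
	for i := 1; i <= 100; i++ {
		buf = append(buf, float64(i))
	}
	fmt.Println(p95(buf)) // the 95th of 100 evenly spread samples
}
```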

### Config serialization detail

The collector's `retention` field is a Go duration. In JSON this appears as an integer nanosecond count.

This is expected.

---

## Recommended workflows

### Fast low-overhead runtime watch

Use:

- `enabled=true`
- `heavy_enabled=false`
- `persist_enabled=false` (or `true` if you want an archive)

Then query:

- `/api/debug/telemetry/live`
- `/api/debug/telemetry/history?prefix=stage.`
- `/api/debug/telemetry/events?level=warn`

### 5-10 minute anomaly capture

Suggested settings:

- `enabled=true`
- `heavy_enabled=false`
- `persist_enabled=true`
- moderate `metric_sample_every`

Then:

1. note start time
2. reproduce workload
3. fetch live snapshot
4. inspect warning events
5. inspect `stage.*`, `streamer.*`, and `source.*` history

### Deep extractor / boundary investigation

Temporarily enable:

- `heavy_enabled=true`
- `heavy_sample_every` > 1 unless you really need every sample
- `persist_enabled=true`

Then inspect:

- `iq.*`
- `extract.*`
- `audio.*`
- boundary/anomaly events for specific `signal_id` or `session_id`

Turn heavy telemetry back off once done.

---

## Example requests

### Fetch live snapshot

```bash
curl http://localhost:8080/api/debug/telemetry/live
```

### Fetch stage timings from the last 10 minutes

```bash
curl "http://localhost:8080/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage."
```

### Fetch source metrics for one signal

```bash
curl "http://localhost:8080/api/debug/telemetry/history?prefix=source.&signal_id=1"
```

### Fetch warning events only

```bash
curl "http://localhost:8080/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn"
```

### Fetch events with a custom tag filter

```bash
curl "http://localhost:8080/api/debug/telemetry/events?tag_zone=broadcast"
```

### Enable persistence and heavy telemetry temporarily

```bash
curl -X POST http://localhost:8080/api/debug/telemetry/config \
  -H "Content-Type: application/json" \
  -d '{
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "persist_enabled": true
  }'
```

---

## Related docs

- `README.md` - high-level project overview and endpoint summary
- `docs/telemetry-debug-runbook.md` - quick operational runbook for short debug sessions
- `internal/telemetry/telemetry.go` - collector implementation details
- `cmd/sdrd/http_handlers.go` - HTTP wiring for telemetry endpoints
