Wideband autonomous SDR analysis engine forked from sdr-visual-suite

Telemetry API Reference

This document describes the server-side telemetry collector, its runtime configuration, and the HTTP API exposed by sdrd.

The telemetry system is intended for debugging and performance analysis of the SDR pipeline, especially around source cadence, extraction, DSP timing, boundary artifacts, queue pressure, and other runtime anomalies.

Goals

The telemetry layer gives you three different views of runtime state:

  1. Live snapshot
    • Current counters, gauges, distributions, recent events, and collector status.
  2. Historical metrics
    • Timestamped metric samples that can be filtered by name, prefix, or tags.
  3. Historical events
    • Structured anomalies / warnings / debug events with optional fields.

It is designed to be lightweight in normal operation and more detailed when heavy_enabled is turned on.


Base URLs

The telemetry API exposes the following endpoints:

  • /api/debug/telemetry/live
  • /api/debug/telemetry/history
  • /api/debug/telemetry/events
  • /api/debug/telemetry/config

Responses are JSON.


Data model

Metric types

Telemetry metrics come in three types:

  • counter
    • Accumulating values, usually incremented over time.
  • gauge
    • Latest current value.
  • distribution
    • Observed numeric samples with summary stats.

A historical metric sample is returned as:

{
  "ts": "2026-03-25T12:00:00Z",
  "name": "stage.extract_stream.duration_ms",
  "type": "distribution",
  "value": 4.83,
  "tags": {
    "stage": "extract_stream",
    "signal_id": "1"
  }
}

Events

Telemetry events are structured anomaly/debug records:

{
  "id": 123,
  "ts": "2026-03-25T12:00:02Z",
  "name": "demod_boundary",
  "level": "warn",
  "message": "boundary discontinuity detected",
  "tags": {
    "signal_id": "1",
    "stage": "demod"
  },
  "fields": {
    "d2": 0.3358,
    "index": 25
  }
}

Tags

Tags are string key/value metadata used for filtering and correlation.

Common tag keys already supported by the HTTP layer:

  • signal_id
  • session_id
  • stage
  • trace_id
  • component

You can also filter on arbitrary tags via tag_<key>=<value> query parameters.
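As an illustration of how these parameters compose, here is a hypothetical Python helper (the function name is ours, not part of sdrd) that builds a filtered query URL using only the conventions documented above:

```python
from urllib.parse import urlencode

# Tag keys the HTTP layer accepts as bare query parameters.
KNOWN_TAG_KEYS = {"signal_id", "session_id", "stage", "trace_id", "component"}

def build_query(path, since=None, level=None, tags=None):
    """Build a telemetry query URL; unknown tag keys use the tag_<key> form."""
    params = {}
    if since is not None:
        params["since"] = since
    if level is not None:
        params["level"] = level
    for key, value in (tags or {}).items():
        name = key if key in KNOWN_TAG_KEYS else "tag_" + key
        params[name] = value
    return path + ("?" + urlencode(params) if params else "")
```

For example, `build_query("/api/debug/telemetry/events", since="1711368000", tags={"signal_id": "1", "zone": "broadcast"})` yields `/api/debug/telemetry/events?since=1711368000&signal_id=1&tag_zone=broadcast`.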


Endpoint: GET /api/debug/telemetry/live

Returns a live snapshot of the in-memory collector state.

Response shape

{
  "now": "2026-03-25T12:00:05Z",
  "started_at": "2026-03-25T11:52:10Z",
  "uptime_ms": 472500,
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "counters": [
    {
      "name": "source.resets",
      "value": 1,
      "tags": {
        "component": "source"
      }
    }
  ],
  "gauges": [
    {
      "name": "source.buffer_samples",
      "value": 304128,
      "tags": {
        "component": "source"
      }
    }
  ],
  "distributions": [
    {
      "name": "dsp.frame.duration_ms",
      "count": 96,
      "min": 82.5,
      "max": 212.4,
      "mean": 104.8,
      "last": 98.3,
      "p95": 149.2,
      "tags": {
        "stage": "dsp"
      }
    }
  ],
  "recent_events": [],
  "status": {
    "source_state": "running"
  }
}

Notes

  • counters, gauges, and distributions are sorted by metric name.
  • recent_events contains the most recent in-memory event slice.
  • status is optional and contains arbitrary runtime status published by code using SetStatus(...).
  • If telemetry is unavailable, the server returns a small JSON object instead of a full snapshot.

Typical uses

  • Check whether telemetry is enabled.
  • Look for timing hotspots in *.duration_ms distributions.
  • Inspect current queue or source gauges.
  • See recent anomaly events without querying history.
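The hotspot check can be scripted against the snapshot shape above; a minimal client-side sketch (the helper name is ours):

```python
def timing_hotspots(snapshot, p95_threshold_ms=100.0):
    """List *.duration_ms distributions in a live snapshot whose p95 exceeds
    the threshold, slowest first.

    `snapshot` is the decoded JSON body of /api/debug/telemetry/live.
    """
    hot = [
        (d["name"], d["p95"])
        for d in snapshot.get("distributions", [])
        if d["name"].endswith(".duration_ms") and d.get("p95", 0.0) > p95_threshold_ms
    ]
    return sorted(hot, key=lambda item: item[1], reverse=True)
```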

Endpoint: GET /api/debug/telemetry/history

Returns historical metric samples from in-memory history and, optionally, persisted JSONL files.

Response shape

{
  "items": [
    {
      "ts": "2026-03-25T12:00:01Z",
      "name": "stage.extract_stream.duration_ms",
      "type": "distribution",
      "value": 5.2,
      "tags": {
        "stage": "extract_stream",
        "signal_id": "2"
      }
    }
  ],
  "count": 1
}

Supported query parameters

Time filters

  • since
  • until

Accepted formats:

  • Unix seconds
  • Unix milliseconds
  • RFC3339
  • RFC3339Nano

Examples:

  • ?since=1711368000
  • ?since=1711368000123
  • ?since=2026-03-25T12:00:00Z

Result shaping

  • limit
    • Defaults to 500 when not specified.
    • Values above 5000 are clamped by the collector query layer.

Name filters

  • name=<exact_metric_name>
  • prefix=<metric_name_prefix>

Examples:

  • ?name=source.read.duration_ms
  • ?prefix=stage.
  • ?prefix=iq.extract.

Tag filters

Special convenience query params map directly to tag filters:

  • signal_id
  • session_id
  • stage
  • trace_id
  • component

Arbitrary tag filters:

  • tag_<key>=<value>

Examples:

  • ?signal_id=1
  • ?stage=extract_stream
  • ?tag_path=gpu
  • ?tag_zone=broadcast

Persistence control

  • include_persisted=true|false
    • Default: true

When enabled and persistence is active, the server reads matching data from rotated JSONL telemetry files in addition to in-memory history.

Notes

  • Results are sorted by timestamp ascending.
  • If limit is hit, the most recent matching items are retained.
  • Exact retention depends on both in-memory retention and persisted file availability.
  • A small set of boundary-related IQ metrics is force-stored regardless of the normal metric sample cadence.

Typical queries

Get all stage timing since a specific start:

/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage.

Get extraction metrics for a single signal:

/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=extract.&signal_id=2

Get source cadence metrics only from in-memory history:

/api/debug/telemetry/history?prefix=source.&include_persisted=false

Endpoint: GET /api/debug/telemetry/events

Returns historical telemetry events from memory and, optionally, persisted storage.

Response shape

{
  "items": [
    {
      "id": 991,
      "ts": "2026-03-25T12:00:03Z",
      "name": "source_reset",
      "level": "warn",
      "message": "source reader reset observed",
      "tags": {
        "component": "source"
      },
      "fields": {
        "reason": "short_read"
      }
    }
  ],
  "count": 1
}

Supported query parameters

All history filters are also supported here, plus:

  • level=<debug|info|warn|error|...>

Examples:

  • ?since=2026-03-25T12:00:00Z&level=warn
  • ?prefix=audio.&signal_id=1
  • ?name=demod_boundary&signal_id=1

Notes

  • Event matching supports name, prefix, level, time range, and tags.
  • Event level matching is case-insensitive.
  • Results are timestamp-sorted ascending.

Typical queries

Get warnings during a reproduction run:

/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn

Get boundary-related events for one signal:

/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&signal_id=1&prefix=demod_

Endpoint: GET /api/debug/telemetry/config

Returns both:

  1. the active collector configuration, and
  2. the current runtime config under debug.telemetry

Response shape

{
  "collector": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention": 900000000000,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  },
  "config": {
    "enabled": true,
    "heavy_enabled": false,
    "heavy_sample_every": 12,
    "metric_sample_every": 2,
    "metric_history_max": 12000,
    "event_history_max": 4000,
    "retention_seconds": 900,
    "persist_enabled": false,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 16,
    "keep_files": 8
  }
}

Important distinction

  • collector.retention is a Go duration serialized in nanoseconds.
  • config.retention_seconds is the config-facing field used by YAML and the POST update API.

If you are writing tooling, prefer config.retention_seconds for human-facing config edits.
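Tooling that needs to compare the two fields can convert directly; an illustrative one-liner (field names match the response shape above):

```python
def retention_seconds(collector_cfg):
    """Convert the collector's nanosecond `retention` duration to whole seconds."""
    return collector_cfg["retention"] // 1_000_000_000
```

With the example response above, `retention_seconds({"retention": 900000000000})` gives `900`, matching `config.retention_seconds`.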


Endpoint: POST /api/debug/telemetry/config

Updates telemetry settings at runtime and writes them back via the autosave config path.

Request body

All fields are optional. Only provided fields are changed.

{
  "enabled": true,
  "heavy_enabled": true,
  "heavy_sample_every": 8,
  "metric_sample_every": 1,
  "metric_history_max": 20000,
  "event_history_max": 6000,
  "retention_seconds": 1800,
  "persist_enabled": true,
  "persist_dir": "debug/telemetry",
  "rotate_mb": 32,
  "keep_files": 12
}

Response shape

{
  "ok": true,
  "collector": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention": 1800000000000,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  },
  "config": {
    "enabled": true,
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "metric_sample_every": 1,
    "metric_history_max": 20000,
    "event_history_max": 6000,
    "retention_seconds": 1800,
    "persist_enabled": true,
    "persist_dir": "debug/telemetry",
    "rotate_mb": 32,
    "keep_files": 12
  }
}

Persistence behavior

A POST updates:

  • the runtime manager snapshot/config
  • the in-process collector config
  • the autosave config file via config.Save(...)

Updates therefore take effect immediately at runtime and, via autosave, also survive restarts unless manually reverted.

Error cases

  • Invalid JSON -> 400 Bad Request
  • Invalid collector reconfiguration -> 400 Bad Request
  • Telemetry unavailable -> 503 Service Unavailable
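Because only provided fields are changed, client tooling should send a minimal patch. A hypothetical body builder (field names taken from the request shape above; the validation set is ours):

```python
import json

# Fields documented for POST /api/debug/telemetry/config.
ALLOWED_FIELDS = {
    "enabled", "heavy_enabled", "heavy_sample_every", "metric_sample_every",
    "metric_history_max", "event_history_max", "retention_seconds",
    "persist_enabled", "persist_dir", "rotate_mb", "keep_files",
}

def config_patch(**changes):
    """Serialize a partial telemetry config update containing only the
    provided fields; rejects field names the API does not document."""
    unknown = set(changes) - ALLOWED_FIELDS
    if unknown:
        raise ValueError("unknown fields: %s" % sorted(unknown))
    return json.dumps(changes)
```

Sending an undocumented field client-side error-checks before the server's 400 response would.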

Configuration fields (debug.telemetry)

Telemetry config lives under:

debug:
  telemetry:
    enabled: true
    heavy_enabled: false
    heavy_sample_every: 12
    metric_sample_every: 2
    metric_history_max: 12000
    event_history_max: 4000
    retention_seconds: 900
    persist_enabled: false
    persist_dir: debug/telemetry
    rotate_mb: 16
    keep_files: 8

Field reference

enabled

Master on/off switch for telemetry collection.

If false:

  • metrics are not recorded
  • events are not recorded
  • live snapshot remains effectively empty/minimal

heavy_enabled

Enables more expensive, more detailed telemetry paths. These should only be turned on while actively debugging, not left on permanently.

Use this for deep extractor/IQ/boundary debugging.

heavy_sample_every

Sampling cadence for heavy telemetry.

  • 1 means every eligible heavy sample
  • higher numbers reduce cost by sampling less often

metric_sample_every

Sampling cadence for normal historical metric point storage.

Collector summaries still update live, but historical storage becomes less dense when this value is greater than 1.

metric_history_max

Maximum number of in-memory historical metric samples retained.

event_history_max

Maximum number of in-memory telemetry events retained.

retention_seconds

Time-based in-memory retention window.

Older in-memory metrics/events are trimmed once they fall outside this retention period.

persist_enabled

When enabled, telemetry metrics/events are also appended to rotated JSONL files.

persist_dir

Directory where rotated telemetry JSONL files are written.

Default:

  • debug/telemetry

rotate_mb

Approximate JSONL file rotation threshold in megabytes.

keep_files

How many rotated telemetry files to retain in persist_dir.

Older files beyond this count are pruned.


Collector behavior and caveats

In-memory vs persisted data

The query endpoints can read from both:

  • current in-memory collector state/history
  • persisted JSONL files

This means a request may return data older than current in-memory retention if:

  • persist_enabled=true, and
  • include_persisted=true

Sampling behavior

Not every observation necessarily becomes a historical metric point.

The collector:

  • always updates live counters/gauges/distributions while enabled
  • stores historical points according to metric_sample_every
  • force-stores selected boundary IQ metrics even when sampling would normally skip them

So the live snapshot and historical series density are intentionally different.
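The cadence logic can be pictured as a simple counter gate. This is an illustrative sketch of the documented behavior, not the actual code in internal/telemetry/telemetry.go:

```python
class CadenceSampler:
    """Store every Nth observation, with a force-store escape hatch for
    metrics like the boundary IQ set."""

    def __init__(self, sample_every):
        self.sample_every = max(1, sample_every)  # 0/negative behaves like 1
        self.seen = 0

    def should_store(self, force=False):
        self.seen += 1
        return force or self.seen % self.sample_every == 0
```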

Distribution summaries

Distribution values in the live snapshot include:

  • count
  • min
  • max
  • mean
  • last
  • p95

The p95 estimate is based on the collector's bounded rolling sample buffer, not an unbounded full-history quantile computation.
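A nearest-rank p95 over a bounded buffer looks roughly like this (illustrative only; the collector's exact estimator may differ):

```python
import math

def p95(buffer):
    """Nearest-rank 95th percentile over a bounded sample buffer."""
    if not buffer:
        return 0.0
    ordered = sorted(buffer)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```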

Config serialization detail

The collector's retention field is a Go duration. In JSON this appears as an integer nanosecond count.

This is expected.


Recommended workflows

Fast low-overhead runtime watch

Use:

  • enabled=true
  • heavy_enabled=false
  • persist_enabled=false (set it to true if you also want an archive)

Then query:

  • /api/debug/telemetry/live
  • /api/debug/telemetry/history?prefix=stage.
  • /api/debug/telemetry/events?level=warn

5-10 minute anomaly capture

Suggested settings:

  • enabled=true
  • heavy_enabled=false
  • persist_enabled=true
  • moderate metric_sample_every

Then:

  1. note start time
  2. reproduce workload
  3. fetch live snapshot
  4. inspect warning events
  5. inspect stage.*, streamer.*, and source.* history

Deep extractor / boundary investigation

Temporarily enable:

  • heavy_enabled=true
  • heavy_sample_every > 1 unless you really need every sample
  • persist_enabled=true

Then inspect:

  • iq.*
  • extract.*
  • audio.*
  • boundary/anomaly events for specific signal_id or session_id

Turn heavy telemetry back off once done.


Example requests

Fetch live snapshot

curl http://localhost:8080/api/debug/telemetry/live

Fetch stage timings from the last 10 minutes

curl "http://localhost:8080/api/debug/telemetry/history?since=2026-03-25T12:00:00Z&prefix=stage."

Fetch source metrics for one signal

curl "http://localhost:8080/api/debug/telemetry/history?prefix=source.&signal_id=1"

Fetch warning events only

curl "http://localhost:8080/api/debug/telemetry/events?since=2026-03-25T12:00:00Z&level=warn"

Fetch events with a custom tag filter

curl "http://localhost:8080/api/debug/telemetry/events?tag_zone=broadcast"

Enable persistence and heavy telemetry temporarily

curl -X POST http://localhost:8080/api/debug/telemetry/config \
  -H "Content-Type: application/json" \
  -d '{
    "heavy_enabled": true,
    "heavy_sample_every": 8,
    "persist_enabled": true
  }'


See also

  • README.md - high-level project overview and endpoint summary
  • docs/telemetry-debug-runbook.md - quick operational runbook for short debug sessions
  • internal/telemetry/telemetry.go - collector implementation details
  • cmd/sdrd/http_handlers.go - HTTP wiring for telemetry endpoints