
Audio Click Debug Notes — 2026-03-24

Context

This note captures the intermediate findings from the live/recording audio click investigation on sdr-wideband-suite.

Goal: preserve the reasoning, experiments, false leads, and current best understanding so future work does not restart from scratch.


High-level outcome so far

We do not yet have the final root cause.

But we now know substantially more about what the clicks are not, and we identified at least one real bug plus several strong behavioral constraints in the pipeline.


What was tested

1. Session/context recovery

  • Reconstructed prior debugging context from reset-session backup files.
  • Confirmed the relevant investigation was the persistent audio clicking bug in live audio / recordings.

2. Codebase deep-read

Reviewed in detail:

  • cmd/sdrd/dsp_loop.go
  • cmd/sdrd/pipeline_runtime.go
  • cmd/sdrd/helpers.go
  • internal/recorder/streamer.go
  • internal/recorder/demod_live.go
  • internal/dsp/fir.go
  • internal/dsp/fir_stateful.go
  • internal/dsp/resample.go
  • internal/demod/fm.go
  • internal/demod/gpudemod/*
  • web/app.js

Main conclusion from static reading: the pipeline contains several stateful continuity mechanisms, so clicks are likely to emerge at boundaries or from phase/timing inconsistencies rather than from one obvious isolated bug.

3. AM vs FM tests

Observed by ear:

  • AM clicks too.
  • Therefore this is not an FM-only issue.
  • That shifted focus away from purely FM-specific explanations and toward shared-path / continuity / transport / demod-adjacent causes.

4. Recording vs live path comparison

Observed by ear:

  • Recordings click too.
  • Therefore browser/WebSocket/live playback is not the sole cause.
  • The root problem exists in the server-side audio pipeline before browser playback.

5. Boundary instrumentation added

Temporary diagnostics were added to inspect:

  • extract trimming
  • snippet lengths
  • demod path lengths
  • boundary click / intra-click detector
  • IQ continuity at various stages

6. Discriminator-overlap hypothesis

A test switch temporarily disabled the extra 1-sample discriminator overlap prepend in streamer.go.

Result:

  • This extra overlap was a real problem.
  • It caused the downstream decimation phase to flip between blocks.
  • Removing it cleaned up the boundary model and was the correct change.

However:

  • Removing it did not eliminate the audible clicks.
  • Therefore it was a real bug, but not the main remaining root cause.

7. GPU vs CPU extraction test

Forced CPU-only stream extraction.

Result:

  • CPU-only made things dramatically worse in real time.
  • Large feed_gap values appeared.
  • Huge backlogs built up.
  • Therefore CPU-only is not a solution, and the GPU path is not the sole main problem.

8. Fixed read-size test

Forced a constant extraction read size (389120) instead of variable read sizing based on backlog.

Result:

  • allIQ, gpuIQ_len, raw_len, and out_len became very stable.
  • This reduced pipeline variability and made logs much cleaner.
  • Subjectively, audio may have become slightly better, but clicks remained.
  • Therefore variable block sizing is likely a contributing factor, but not the full explanation.

9. Multi-stage audio dump test

Added optional debug dumping for:

  • demod audio (*-demod.wav)
  • final audio after resampler (*-final.wav)

Observed by ear:

  • Clicks are present in both dump types.
  • Therefore the click is already present by the time demodulated audio exists.
  • Resampler/final audio path is not the primary origin.

10. CPU monitoring

A process-level CSV monitor was added and used.

Result:

  • Overall process CPU usage was modest (not near full machine saturation).
  • This does not support “overall CPU is pegged” as the main explanation.
  • Caveat: this does not fully exclude a hot thread or scheduler issue, but gross total CPU overload is not the main story.

What we now know with reasonable confidence

A. The issue is not primarily caused by:

  • Browser playback
  • WebSocket transport
  • Final PCM fanout only
  • Resampler alone
  • CPU-only vs GPU-only as the core dichotomy
  • The old extra discriminator overlap prepend (that was a bug, but not the remaining dominant one)
  • Purely variable block sizes alone
  • Gross whole-process CPU saturation

B. The issue is server-side and exists before final playback

Because:

  • recordings click
  • demod dump clicks
  • final dump clicks

C. The issue is present by the demodulated audio stage

This is one of the strongest current findings.

D. The WFM/FM-demod-adjacent path remains highly suspicious

Current best area of suspicion:

  • decimated IQ may still contain subtle corruption/instability not fully captured by current metrics
  • OR the FM discriminator (fmDiscrim) is producing pathological output from otherwise “boundary-clean-looking” IQ

Important runtime/pathology observations

1. Backlog amplification is real

Several debug runs showed severe buffer growth and drops:

  • large buf= values
  • growing drop= counts
  • repeated audio_gap

This means some debug configurations can easily become self-distorting and produce additional artifacts that are not representative of the original bug.

2. Too much debug output causes self-inflicted load

At one point:

  • rate limiter was disabled (rate_limit_ms: 0)
  • aggressive boundary logging was enabled
  • many short WAV files were generated

This clearly increased overhead and likely polluted some runs.

3. Many short WAVs were a bad debug design

That was replaced with a design intended to write one continuous window file instead of many micro-files.

4. Total process CPU saturation does not appear to be the main cause

A process-level CSV monitor was collected and showed only modest total CPU utilisation during the relevant tests. This does not support a simple “the machine is pegged” explanation. A hot thread / scheduling issue is still theoretically possible, but gross overall CPU overload is not the main signal.


Current debug state in repo

Branch

All current work is on:

  • debug/audio-clicks

Commits so far

  • 94c132d debug: instrument audio click investigation
  • ffbc45d debug: add advanced boundary metering

Current config/logging state

The active debug logging was trimmed down to:

  • demod
  • discrim
  • gap
  • boundary

Rate limit is currently back to a nonzero value to avoid self-induced spam.

Dump/CPU debug state

A debug: config section was added with:

  • audio_dump_enabled: false
  • cpu_monitoring: false

Meaning:

  • heavy WAV dumping is now OFF by default
  • CPU monitoring is conceptually OFF by default (script still exists, but must be explicitly used)

Most important code changes/findings to remember

1. Removed the extra discriminator overlap prepend in streamer.go

This was a correct fix.

Reason:

  • it introduced a blockwise extra IQ sample
  • this shifted decimation phase between blocks
  • it created real boundary artifacts

This should not be reintroduced casually.

2. Fixed read-size test exists and is useful for investigation

A temporary mechanism exists to force stable extraction block sizes. This is useful diagnostically because it removes one source of pipeline variability.

IMPORTANT DECISION / DO NOT LOSE:

  • The fixed read-size path currently lives behind the environment variable SDR_FORCE_FIXED_STREAM_READ_SAMPLES.
  • The tested value 389120 clearly helps by making allIQ, gpuIQ_len, raw_len, and out_len much more stable and by reducing one major source of pipeline variability.
  • Current plan: once the remaining click root cause is solved, promote this behavior into the normal code path instead of leaving it as an env-var-only debug switch.
  • In other words: treat fixed read sizing as a likely permanent stabilization improvement, but do not bake it in blindly until the click investigation is complete.

3. FM discriminator metering exists

internal/demod/fm.go now emits targeted discriminator stats under discrim logging, including:

  • min/max IQ magnitude
  • maximum absolute phase step
  • count of large phase steps

This was useful to establish that large discriminator steps correlate with low IQ magnitude, but discriminator logging was later disabled from the active category list to reduce log spam.

4. Strong dec-IQ findings before demod

Additional metering in streamer.go showed:

  • repeated dec_iq_head_dip
  • repeated low magnitude near min_idx ~= 25
  • repeated early large local phase step near max_step_idx ~= 24
  • repeated demod_boundary and audible clicks shortly afterward

This is the strongest currently known mechanism in the chain.

5. Group delay observation

For the current pre-demod FIR:

  • taps = 101
  • FIR group delay = (101 - 1) / 2 = 50 input samples
  • with decim1 = 2, this projects to about 25 output samples

This matches the repeatedly observed problematic dec indices (~24-25) remarkably well. That strongly suggests the audible issue is connected to the FIR/decimation settling region at the beginning of the dec block.
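The arithmetic can be checked directly (standard linear-phase FIR group delay, projected through the decimator):

```go
package main

import "fmt"

func main() {
	taps := 101
	decim1 := 2
	// Linear-phase FIR group delay in input samples: (N-1)/2.
	inputDelay := (taps - 1) / 2
	// Projected onto the decimated output sample grid.
	outputDelay := inputDelay / decim1
	fmt.Println(inputDelay, outputDelay) // 50 25
}
```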

6. Pre-FIR vs post-FIR comparison

A dedicated pre-FIR probe was added on fullSnip (the input to the pre-demod FIR) and compared against the existing dec-side probes.

Observed pattern:

  • pre-FIR head probe usually looked relatively normal
  • no equally strong or equally reproducible hot spot appeared there
  • after FIR + decimation, the problematic dip/step repeatedly appeared near dec indices ~24-25

Interpretation:

  • the strongest currently observed defect is not already present in the same form before the FIR
  • it is much more likely to emerge in the FIR/decimation section (or its settling behavior) than in the raw pre-FIR input

7. Head-trim test results

A debug head-trim on dec was tested. Subjective result:

  • trim=32 sounded best among the tested values (16/32/48/64)
  • but it did not remove the clicks entirely

Interpretation:

  • the early dec settling region is a real contributor
  • but it is probably not the only contributor, or trimming alone is not the final correct fix

8. Current architectural conclusion

The likely clean fix is not to keep trimming samples away. The FIR/decimation section is still suspicious, but later tests showed it is likely not the sole origin.

Important nuance:

  • the currently suspicious FIR + decimation section is already running in Go/CPU (processSnippet), not in CUDA
  • therefore the next correctness fix should be developed and validated in Go first

Later update:

  • a stateful decimating FIR / polyphase-style replacement was implemented in Go and tested
  • it was architecturally cleaner than the old separated FIR->decimate handoff
  • but it did not remove the recurring hot spot / clicks
  • therefore the old handoff was not the whole root cause, even if the newer path is still cleaner

Best current hypothesis

The remaining audible clicks are most likely generated at or immediately before FM demodulation.

Most plausible interpretations:

  1. The decimated IQ stream still contains subtle corruption/instability not fully captured by the earliest boundary metrics.
  2. The FM discriminator is reacting violently to short abnormal IQ behavior inside blocks, not just at block boundaries.
  3. The problematic region is likely a very specific early decimated-IQ settling zone, not broad corruption across the whole block.

At this point, the most valuable next data is low-overhead IQ telemetry right before demod, plus carefully controlled demod-vs-final audio comparison.

Stronger updated working theory (later findings, same day)

After discriminator-focused metering and targeted dec-IQ probes, the strongest current theory is:

A reproducible early defect in the dec IQ block appears around sample index 24-25, where IQ magnitude dips sharply and the effective FM phase step becomes abnormally large. This then shows up as demod_boundary and audible clicks.

Crucially:

  • this issue appears in demod.wav, so it exists before the final resampler/playback path
  • it is not spread uniformly across the whole dec block
  • it repeatedly appears near the same index
  • trimming the first ~32 samples subjectively reduces the click, but does not eliminate it entirely

This strongly suggests a settling/transient zone at the beginning of the decimated IQ block.

Later refinements to this theory:

  • pre-FIR probing originally looked cleaner than post-FIR probing, which made FIR/decimation look like the main culprit
  • however, a temporary FIR bypass showed the clicks were still present, only somewhat quieter / less aggressive
  • this indicates the pre-demod FIR likely amplifies or sharpens an upstream issue, but is not the sole origin
  • a cleaner stateful decimating FIR implementation also failed to eliminate the recurring hot spot, further weakening the idea that the old FIR->decimate handoff alone caused the bug

Recommended next steps

  1. Run with reduced logging only and keep heavy dump features OFF unless explicitly needed.
  2. Continue investigating the extractor path and its immediate surroundings (extractForStreaming, signal parameter source, offset/BW stability, overlap/trim behavior).
  3. Treat FIR/decimation as a possible amplifier/focuser of the issue, but not the only suspect.
  4. When testing fixes, prefer low-overhead, theory-driven experiments over broad logging/dump spam.
  5. Only re-enable audio dump windows selectively and briefly.

Debug TODO / operational reminders

  • The current telemetry collector is not using a true ring buffer for metric/event history.
  • Internally it keeps append-only history slices (metricsHistory, events) and periodically trims them by copying tail slices.
  • Under heavy per-block telemetry this can add enough mutex/copy overhead to make the live stream start stuttering after a short run.
  • Therefore: keep telemetry sampling conservative during live reproduction runs; do not leave full heavy telemetry enabled longer than needed.
  • Follow-up engineering task: replace or redesign telemetry history storage to use a proper low-overhead ring-buffer style structure (or equivalent bounded lock-light design) if live telemetry is to remain a standard debugging tool.

2026-03-25 update — extractor-focused live telemetry findings

Where the investigation moved

The investigation was deliberately refocused away from browser/feed/demod-only suspicions and toward:

  • shared upstream IQ cadence / block boundaries
  • extractor input/output continuity
  • raw vs trimmed extractor-head behaviour

This was driven by two observations:

  1. all signals still click
  2. the newly added live telemetry made it possible to inspect the shared path while the system was running

Telemetry infrastructure / config notes

Two config files matter for debug telemetry defaults:

  • config.yaml
  • config.autosave.yaml

The autosave file can overwrite intended telemetry defaults after restart, so both must be updated together.

Current conservative live-debug defaults that worked better:

  • heavy_enabled: false
  • heavy_sample_every: 12
  • metric_sample_every: 8
  • metric_history_max: 6000
  • event_history_max: 1500

Important operational lesson:

  • runtime POST /api/debug/telemetry/config changes only affect the current sdrd process
  • after restart, the process reloads config defaults again
  • if autosave still contains older values (for example heavy_enabled: true or very large history limits), the debug run can accidentally become self-distorting again

Telemetry endpoints

The live debug work used these HTTP endpoints on the sdrd web server (typically http://127.0.0.1:8080):

GET /api/debug/telemetry/config

Returns the current effective telemetry configuration. Useful for verifying:

  • whether heavy telemetry is enabled
  • history sizes
  • persistence settings
  • sample rates actually active in the running process

Typical fields:

  • enabled
  • heavy_enabled
  • heavy_sample_every
  • metric_sample_every
  • metric_history_max
  • event_history_max
  • retention_seconds
  • persist_enabled
  • persist_dir

POST /api/debug/telemetry/config

Applies runtime telemetry config changes to the current process. Used during investigation to temporarily reduce telemetry load without editing files.

Example body used during investigation:

{
  "heavy_enabled": true,
  "heavy_sample_every": 12,
  "metric_sample_every": 8
}

GET /api/debug/telemetry/live

Returns the current live metric snapshot (gauges/counters/distributions). Useful for:

  • quick sanity checks
  • verifying that a metric family exists
  • confirming whether a new metric name is actually being emitted

GET /api/debug/telemetry/history?prefix=<prefix>&limit=<n>

Returns stored metric history entries filtered by metric-name prefix. This is the main endpoint for time-series debugging during live runs.

Useful examples:

  • prefix=stage.
  • prefix=source.
  • prefix=iq.boundary.all
  • prefix=iq.extract.input
  • prefix=iq.extract.output
  • prefix=iq.extract.raw.
  • prefix=iq.extract.trimmed.
  • prefix=iq.pre_demod
  • prefix=audio.demod
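A small Go client for this endpoint might look like the sketch below. The path and query-parameter names come from these notes; the function names are hypothetical, and the response is returned as raw JSON because the payload schema is not pinned down in this document.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// buildHistoryURL assembles a query against the telemetry history endpoint.
func buildHistoryURL(base, prefix string, limit int) string {
	return fmt.Sprintf("%s/api/debug/telemetry/history?prefix=%s&limit=%d",
		base, url.QueryEscape(prefix), limit)
}

// fetchHistory performs the GET and returns the raw JSON body.
func fetchHistory(base, prefix string, limit int) (string, error) {
	resp, err := http.Get(buildHistoryURL(base, prefix, limit))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	fmt.Println(buildHistoryURL("http://127.0.0.1:8080", "iq.extract.raw.", 50))
}
```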

GET /api/debug/telemetry/events?limit=<n>

Returns recent structured telemetry events. Used heavily once compact per-block event probes were added, because events were often easier to inspect reliably than sparsely sampled distribution histories.

This ended up being especially useful for:

  • raw extractor head probes
  • trimmed extractor head probes
  • extractor input head probes
  • GPU kernel input/output head probes
  • boundary snapshots

Important telemetry families added/used

Shared-path / global boundary metrics

  • iq.boundary.all.head_mean_mag
  • iq.boundary.all.prev_tail_mean_mag
  • iq.boundary.all.delta_mag
  • iq.boundary.all.delta_phase
  • iq.boundary.all.discontinuity_score

Purpose:

  • detect whether the shared allIQ block boundary was already obviously broken before signal-specific extraction

Extractor input/output metrics

  • iq.extract.input.length
  • iq.extract.input.overlap_length
  • iq.extract.input.head_mean_mag
  • iq.extract.input.prev_tail_mean_mag
  • iq.extract.input.discontinuity_score
  • iq.extract.output.length
  • iq.extract.output.head_mean_mag
  • iq.extract.output.head_min_mag
  • iq.extract.output.head_max_step
  • iq.extract.output.head_p95_step
  • iq.extract.output.head_tail_ratio
  • iq.extract.output.head_low_magnitude_count
  • iq.extract.output.boundary.delta_mag
  • iq.extract.output.boundary.delta_phase
  • iq.extract.output.boundary.d2
  • iq.extract.output.boundary.discontinuity_score

Purpose:

  • isolate whether the final per-signal extractor output itself was discontinuous across blocks

Raw vs trimmed extractor-head telemetry

  • iq.extract.raw.length
  • iq.extract.raw.head_mag
  • iq.extract.raw.tail_mag
  • iq.extract.raw.head_zero_count
  • iq.extract.raw.first_nonzero_index
  • iq.extract.raw.head_max_step
  • iq.extract.trim.trim_samples
  • iq.extract.trimmed.head_mag
  • iq.extract.trimmed.tail_mag
  • iq.extract.trimmed.head_zero_count
  • iq.extract.trimmed.first_nonzero_index
  • iq.extract.trimmed.head_max_step
  • event extract_raw_head_probe
  • event extract_trimmed_head_probe

Purpose:

  • answer the key question: is the corruption already present in the raw extractor output head, or created by trimming/overlap logic afterward?
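The head-probe quantities above are simple to compute; a sketch follows, with names modeled on the telemetry fields rather than on actual code identifiers:

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// headProbe reports the length of the leading exact-zero run, the index of
// the first non-zero sample (-1 if the whole window is zero), and the
// largest absolute phase step inside the head window.
func headProbe(iq []complex128) (zeroCount, firstNonzero int, maxStep float64) {
	firstNonzero = -1
	for i, s := range iq {
		if s == 0 {
			if firstNonzero == -1 {
				zeroCount++ // still inside the leading-zero run
			}
		} else if firstNonzero == -1 {
			firstNonzero = i
		}
		if i > 0 {
			if step := math.Abs(cmplx.Phase(s * cmplx.Conj(iq[i-1]))); step > maxStep {
				maxStep = step
			}
		}
	}
	return
}

func main() {
	// Shape reported below for raw heads: exact zero first, then a short ramp.
	raw := []complex128{0, 0.000388, 0.002316, 0.004152, 0.019126}
	z, fn, ms := headProbe(raw)
	fmt.Println(z, fn, ms)
}
```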

Additional extractor input / GPU-kernel probe telemetry

  • iq.extract.input_head.zero_count
  • iq.extract.input_head.first_nonzero_index
  • iq.extract.input_head.max_step
  • event extract_input_head_probe
  • event gpu_kernel_input_head_probe
  • event gpu_kernel_output_head_probe

Purpose:

  • split the remaining uncertainty between:
    • signal-specific input already being bad
    • GPU extractor kernel/start semantics producing the bad raw head
    • later output assembly after the kernel

Pre-demod / audio-stage metrics

  • iq.pre_demod.head_mean_mag
  • iq.pre_demod.head_min_mag
  • iq.pre_demod.head_max_step
  • iq.pre_demod.head_p95_step
  • iq.pre_demod.head_low_magnitude_count
  • audio.demod.head_mean_abs
  • audio.demod.tail_mean_abs
  • audio.demod.edge_delta_abs
  • existing audio.demod_boundary.*

Purpose:

  • verify where artifacts become visible/audible downstream

What the 2026-03-25 telemetry actually showed

1. Feed / enqueue remained relatively uninteresting

stage.feed_enqueue.duration_ms was usually effectively zero.

Representative values during live runs:

  • mostly 0
  • occasional small spikes such as 0.5 ms and 5.8 ms

Interpretation:

  • feed enqueue is not the main source of clicks

2. Extract-stream time was usually modest

stage.extract_stream.duration_ms was usually small and stable compared with the main loop.

Representative values:

  • often 1–5 ms
  • occasional spikes such as 10.7 ms and 18.9 ms

Interpretation:

  • extraction is not free, but runtime cost alone does not explain the clicks

3. Shared capture / source cadence still fluctuated heavily

Representative live values:

  • dsp.frame.duration_ms: often around 90–100 ms, but also 110–150 ms, with one observed spike around 212.6 ms
  • source.read.duration_ms: roughly 80–90 ms often, but also about 60 ms, 47 ms, 19 ms, and even 0.677 ms
  • source.buffer_samples: ranged from very small to very large bursts, including examples like 512, 4608, 94720, 179200, 304544
  • a source_reset event was seen and source.resets=1

Interpretation:

  • shared upstream cadence is clearly unstable enough to remain suspicious
  • but this alone did not localize the final click mechanism

4. Pre-demod stage showed repeated hard phase anomalies even when energy looked healthy

Representative live values for normal non-vanishing signals:

  • iq.pre_demod.head_mean_mag around 0.25–0.31
  • iq.pre_demod.head_low_magnitude_count = 0
  • iq.pre_demod.head_max_step repeatedly high, including roughly:
    • 1.5
    • 2.0
    • 2.4
    • 2.8
    • 3.08

Interpretation:

  • not primarily an amplitude collapse
  • rather a strong phase/continuity defect reaching the pre-demod stage

5. Audio stage still showed real block-edge artifacts

Representative values:

  • audio.demod.edge_delta_abs repeatedly around 0.4–0.8
  • outliers up to roughly 1.21 and 1.26
  • audio.demod_boundary.count continued to fire repeatedly

Interpretation:

  • demod is where the problem becomes audible, but the root cause still appeared to be earlier/shared

Key extractor findings from the new telemetry

A. Per-signal extractor output boundary is genuinely broken

For a representative strong signal (signal_id=2), iq.extract.output.boundary.delta_phase repeatedly showed very large jumps such as:

  • 2.60
  • 3.06
  • 2.14
  • 2.71
  • 3.09
  • 2.92
  • 2.63
  • 2.78

Also observed for iq.extract.output.boundary.discontinuity_score:

  • 2.86
  • 3.08
  • 2.92
  • 2.52
  • 2.40
  • 2.85

Later runs using d2 made the discontinuity even easier to see. Representative iq.extract.output.boundary.d2 values for the same strong signal included:

  • 0.347
  • 0.303
  • 0.362
  • 0.359
  • 0.382
  • 0.344
  • 0.337
  • 0.206

At the same time, iq.extract.output.boundary.delta_mag was often comparatively small (examples around 0.0003–0.0038).

Interpretation:

  • the main boundary defect is not primarily amplitude mismatch
  • it is much more consistent with complex/phase discontinuity across output blocks

B. The raw extractor head is systematically bad on all signals

The new extract_raw_head_probe events were the strongest finding of the day.

Representative repeated pattern for strong signals (signal_id=1 and signal_id=2):

  • first_nonzero_index = 1
  • zero_count = 1
  • first magnitude sample exactly 0
  • then a short ramp: e.g. for signal_id=2
    • 0
    • 0.000388
    • 0.002316
    • 0.004152
    • 0.019126
    • 0.011418
    • 0.124034
    • 0.257569
    • 0.317579
  • head_max_step often near π, e.g.:
    • 3.141592653589793
    • 3.088773696463606
    • 3.0106854446936318
    • 2.9794833659932527

The same qualitative pattern appeared for weaker signals too:

  • raw head starts at 0
  • a brief near-zero ramp follows
  • only after several samples does the magnitude look like a normal extracted band

Interpretation:

  • the raw extractor output head is already damaged / settling / invalid before trimming
  • this strongly supports an upstream/shared-start-condition problem rather than a trim-created artifact

C. The trimmed extractor head usually looks sane

Representative repeated pattern for the same signals after trim_samples = 64:

  • first_nonzero_index = 0
  • zero_count = 0
  • magnitudes look immediately plausible and stable
  • head_max_step is dramatically lower than raw, often around 0.15–0.9 for strong channels

Example trimmed head magnitudes for signal_id=2:

  • 0.299350
  • 0.300954
  • 0.298032
  • 0.298738
  • 0.312258
  • 0.296932
  • 0.239010
  • 0.266881
  • 0.313193

Example trimmed head magnitudes for signal_id=1:

  • 0.277400
  • 0.275994
  • 0.273718
  • 0.272846
  • 0.277842
  • 0.278398
  • 0.268829
  • 0.273790
  • 0.279031

Interpretation:

  • trimming is removing a genuinely bad raw head region
  • trimming is therefore not the main origin of the problem
  • it acts more like cleanup of an already bad upstream/raw start region

Input-vs-raw-vs-trimmed extractor result (important refinement)

A later, more targeted telemetry pass added a direct probe on the signal-specific extractor input head (extract_input_head_probe) and compared it against the raw and trimmed extractor output heads.

This materially refined the earlier conclusion.

Input-head result

Representative values from iq.extract.input_head.*:

  • iq.extract.input_head.zero_count = 0
  • iq.extract.input_head.first_nonzero_index = 0

Interpretation:

  • the signal-specific input head going into the GPU extractor is not starting with a zero sample
  • the head is not arriving already dead/null from the immediate input probe point

Raw-head result

Representative values from iq.extract.raw.*:

  • iq.extract.raw.head_mag = 0
  • iq.extract.raw.head_zero_count = 1
  • iq.extract.raw.head_max_step frequently around 2.4–3.14

These values repeated for strong channels such as signal_id=2, and similarly across other signals.

Interpretation:

  • the first raw output sample is repeatedly exactly zero
  • therefore the visibly bad raw head is being created after the probed input head and before/during raw extractor output generation

Trimmed-head result

Representative values from iq.extract.trimmed.*:

  • iq.extract.trimmed.head_zero_count = 0
  • iq.extract.trimmed.head_mag often looked healthy immediately after trimming, for example:
    • signal 1: about 0.275–0.300
    • signal 2: about 0.311
  • iq.extract.trimmed.head_max_step was much lower than raw for strong channels, often around:
    • 0.11
    • 0.14
    • 0.19
    • 0.30
    • 0.75

Interpretation:

  • trimming cleans up the visibly bad raw head region
  • trimming still does not explain the deeper output-boundary continuity issue

Further refinement after direct extractor-input and GPU-kernel probes

A final telemetry round added:

  • extract_input_head_probe
  • gpu_kernel_input_head_probe
  • gpu_kernel_output_head_probe

These probes further sharpened the likely fault location.

Signal-specific extractor input head looked sane

Representative values:

  • iq.extract.input_head.zero_count = 0
  • iq.extract.input_head.first_nonzero_index = 0

Interpretation:

  • at the observed signal-specific input probe point, the GPU extractor is not receiving a dead/null head

Raw GPU output head remained systematically broken

Representative repeated values:

  • iq.extract.raw.head_mag = 0
  • iq.extract.raw.head_zero_count = 1
  • iq.extract.raw.head_max_step repeatedly around:
    • 3.141592653589793
    • 3.122847934305907
    • 3.101915352902961
    • 3.080672178550904
    • 3.062425574273907
    • 2.9785041567778427
    • 2.7508533785793476

Representative repeated examples from strong channels:

  • signal 2: head_mag = 0, head_zero_count = 1
  • signal 3: head_mag = 0, head_zero_count = 1
  • signal 1/4 showed the same qualitative head-zero pattern as well

Interpretation:

  • the raw extractor output head is still repeatedly born broken
  • the problem is therefore after the currently probed input head and before/during raw output creation

Trimmed head still looked healthier

Representative values:

  • iq.extract.trimmed.head_zero_count = 0
  • signal 1 iq.extract.trimmed.head_mag repeatedly around:
    • 0.2868
    • 0.2907
    • 0.3036
    • 0.3116
    • 0.2838
    • 0.2760
  • signal 2 examples:
    • 0.3461
    • 0.3182

Representative iq.extract.trimmed.head_max_step values for strong channels were much lower than raw, often around:

  • 0.11
  • 0.13
  • 0.21
  • 0.30
  • 0.44
  • 0.69
  • 0.86

Interpretation:

  • trimming still removes the most visibly broken head region
  • but trimming does not explain the deeper output-boundary continuity issue

Refined strongest current conclusion after the full 2026-03-25 telemetry pass

The strongest current reading is now:

The click root cause is very likely not that the signal-specific extractor input already starts dead/null. Instead, the bad raw head appears to be introduced inside the GPU extractor path itself (or at its immediate start/output semantics) before final trimming.

More specifically:

  • signal-specific extractor input head looks non-zero and sane at the probe point
  • raw GPU output head still repeatedly starts with an exact zero sample and a short bad settling region
  • the trimmed head usually looks healthier
  • yet the final extractor output still exhibits significant complex boundary discontinuity from block to block

This now points away from a simple “shared global input head is already zero” theory and toward one of these narrower causes:

  1. GPU extractor kernel start semantics / warmup / first-output handling
  2. phase-start or alignment handling at extractor block start
  3. raw GPU output assembly semantics within the extractor path

What should not be forgotten from this stage

  • The overlap-prepend bug was real and worth fixing, but was not sufficient.
  • The fixed read-size path (SDR_FORCE_FIXED_STREAM_READ_SAMPLES=389120) remains useful and likely worth promoting later, but it is not the root-cause fix.
  • The telemetry system itself can perturb runs if overused; conservative sampling matters.
  • config.autosave.yaml must be kept in sync with config.yaml or telemetry defaults can silently revert after restart.
  • The most promising root-cause area is now the shared upstream/extractor-start boundary path, not downstream playback.

2026-03-25 refactor work status (post-reviewer instruction)

After the reviewer guidance, work pivoted away from symptomatic patching and onto the required two-track architecture change:

Track 1 — CPU/oracle path repair (in progress)

The following was added to start building a trustworthy streaming oracle:

  • internal/demod/gpudemod/streaming_types.go
  • internal/demod/gpudemod/cpu_oracle.go
  • internal/demod/gpudemod/cpu_oracle_test.go
  • internal/demod/gpudemod/streaming_oracle_extract.go
  • internal/demod/gpudemod/polyphase.go
  • internal/demod/gpudemod/polyphase_test.go

What exists now:

  • explicit StreamingExtractJob / StreamingExtractResult
  • explicit CPUOracleState
  • exact integer decimation enforcement (ExactIntegerDecimation)
  • monolithic-vs-chunked CPU oracle test
  • explicit polyphase tap layout (phase-major)
  • CPU oracle direct-vs-polyphase equivalence test
  • persistent CPU oracle runner state keyed by signal ID
  • config-hash reset behavior
  • cleanup of disappeared signals from oracle state
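
The "phase-major" tap layout mentioned above can be sketched as follows. This is an illustrative reconstruction, not the actual code in polyphase.go; the function name and exact padding/ordering are assumptions.

```go
package main

import "fmt"

// reorderPhaseMajor regroups a flat FIR tap array h[0..N-1] into phase-major
// layout for decimation by decim: phase p holds taps h[p], h[p+decim], ...
// Sketch only; the real layout in polyphase.go may differ in padding/order.
func reorderPhaseMajor(taps []float64, decim int) [][]float64 {
	phases := make([][]float64, decim)
	for p := 0; p < decim; p++ {
		for i := p; i < len(taps); i += decim {
			phases[p] = append(phases[p], taps[i])
		}
	}
	return phases
}

func main() {
	taps := []float64{0, 1, 2, 3, 4, 5, 6, 7} // toy 8-tap filter
	for p, ph := range reorderPhaseMajor(taps, 4) {
		fmt.Println(p, ph)
	}
	// 0 [0 4]
	// 1 [1 5]
	// 2 [2 6]
	// 3 [3 7]
}
```

Phase-major storage makes each decimation phase's taps contiguous, which is what the direct-vs-polyphase equivalence test exercises.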

Important limitation:

  • this is not finished production validation yet
  • the CPU oracle path is being built toward the reviewer’s required semantics, but it is not yet the final signed-off oracle for GPU validation

Track 2 — GPU path architecture refactor (in progress)

The following was added to begin the new stateful GPU architecture:

  • internal/demod/gpudemod/stream_state.go
  • internal/demod/gpudemod/streaming_gpu_stub.go
  • docs/gpu-streaming-refactor-plan-2026-03-25.md
  • cmd/sdrd/streaming_refactor.go

What exists now:

  • explicit ExtractStreamState
  • batch-runner-owned per-signal state map
  • config-hash reset behavior for GPU-side stream state
  • exact integer decimation enforcement in relevant batch path
  • base taps and polyphase taps initialized into GPU-side stream state
  • explicit future production entry point: StreamingExtractGPU(...)
  • explicit separation between current legacy extractor path and the new streaming/oracle path
  • persistent oracle-runner lifecycle hooks, including reset on stream-drop events
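
The per-signal state map with config-hash reset can be sketched like this. The struct and function names here are illustrative stand-ins, not the actual ExtractStreamState definition in stream_state.go.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// streamState is an illustrative stand-in for ExtractStreamState: persistent
// per-signal continuity data plus the hash of the config that produced it.
type streamState struct {
	configHash [32]byte
	ncoPhase   float64
	phaseCount int
	history    []complex128
}

// stateFor returns the existing state for id, resetting it whenever the
// config hash changed (the reset-on-config-change behavior described above).
func stateFor(states map[string]*streamState, id, cfg string) *streamState {
	h := sha256.Sum256([]byte(cfg))
	st, ok := states[id]
	if !ok || st.configHash != h {
		st = &streamState{configHash: h}
		states[id] = st
	}
	return st
}

func main() {
	states := map[string]*streamState{}
	a := stateFor(states, "sig1", "decim=8")
	a.phaseCount = 3
	fmt.Println(stateFor(states, "sig1", "decim=8").phaseCount) // same config: 3
	fmt.Println(stateFor(states, "sig1", "decim=4").phaseCount) // config changed: 0
}
```

Keying the reset on a config hash rather than on individual fields is what makes the reset semantics testable without enumerating every tunable parameter.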

Important limitation:

  • the new GPU production path is not implemented yet
  • the legacy overlap+trim production path still exists and is still the current active path
  • the new GPU entry point currently exists as an explicit architectural boundary and state owner, not as the finished stateful polyphase kernel path

Tests currently passing during refactor

Repeatedly verified during the refactor work:

  • go test ./internal/demod/gpudemod/...
  • go test ./cmd/sdrd/...

Incremental progress reached so far inside the refactor

Additional progress after the initial refactor scaffolding:

  • the CPU oracle runner now uses the explicit polyphase oracle path (CPUOracleExtractPolyphase) instead of only carrying polyphase tap data passively
  • the CPU oracle now has a direct-vs-polyphase equivalence test
  • the GPU-side stream state now initializes both BaseTaps and PolyphaseTaps
  • the GPU side now has an explicit future production entry point StreamingExtractGPU(...)
  • the GPU streaming stub now advances NCOPhase over NEW samples only
  • the GPU streaming stub now advances PhaseCount modulo exact integer decimation
  • the GPU streaming stub now builds and persists ShiftedHistory from already frequency-shifted NEW samples
  • the new streaming/oracle path is explicitly separated from the current legacy overlap+trim production path
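
The NEW-samples-only NCO advance and the modulo phase-count carry can be sketched as below. Function names are illustrative; the stub's actual code lives in streaming_gpu_stub.go.

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// shiftNew mixes only the NEW samples by the NCO, continuing ncoPhase from
// the previous call; history samples were already shifted when first seen.
// freqNorm is the shift in cycles per sample. Names are illustrative.
func shiftNew(newSamples []complex128, ncoPhase, freqNorm float64) ([]complex128, float64) {
	out := make([]complex128, len(newSamples))
	for i, s := range newSamples {
		out[i] = s * cmplx.Exp(complex(0, -2*math.Pi*ncoPhase))
		ncoPhase += freqNorm
		ncoPhase -= math.Floor(ncoPhase) // keep phase in [0,1)
	}
	return out, ncoPhase
}

// advancePhaseCount carries the decimation phase across block boundaries.
func advancePhaseCount(phaseCount, nNew, decim int) int {
	return (phaseCount + nNew) % decim
}

func main() {
	_, phase := shiftNew(make([]complex128, 1000), 0, 0.125)
	fmt.Println(phase)                         // 1000 * 0.125 wraps to 0
	fmt.Println(advancePhaseCount(5, 1000, 8)) // (5 + 1000) % 8 = 5
}
```

Advancing phase over NEW samples only is what prevents the double-counting that an overlap-prepend scheme invites: history samples must not re-advance the NCO or the decimation phase on the next call.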

Important current limitation:

  • StreamingExtractGPU(...) still intentionally returns a not-implemented error rather than pretending to be the finished production path
  • this is deliberate, to avoid hidden quick-fix semantics or silent goalpost shifts

Additional note on the latest step:

  • the GPU streaming stub now also reports an estimated output-count schedule (NOut) derived from NEW sample consumption plus carried PhaseCount
  • this still does not make it a production path; it only means the stub now models output cadence semantics more honestly
  • the new CPU/oracle path is also now exposing additional runtime telemetry such as streaming.oracle.rate and streaming.oracle.output_len, so the reference path becomes easier to inspect as it matures
  • a reusable complex-slice comparison helper now exists (CompareComplexSlices) to support later oracle-vs-GPU equivalence work without improvising comparison logic at the last minute
  • a dedicated TestCPUOracleMonolithicVsChunkedPolyphase now verifies chunked-vs-monolithic self-consistency for the polyphase oracle path specifically
  • explicit reset tests now exist for both CPU oracle state and GPU streaming state, so config-change reset semantics are no longer only implicit in code review
  • a dedicated ExtractDebugMetrics structure now exists as a future comparison/telemetry contract for reviewer-required state/error/boundary metrics
  • the first mapper from oracle results into that debug-metric structure now exists, so the comparison contract is beginning to attach to real refactor code rather than staying purely conceptual
  • the same minimal debug-metric mapping now also exists for GPU-stub results, so both sides of the future GPU-vs-oracle comparison now have an initial common reporting shape
  • a first comparison-pipeline helper now exists to turn oracle-vs-GPU-stub results into shared CompareStats / ExtractDebugMetrics output, even though the GPU path is still intentionally incomplete
  • that comparison helper is now also covered by a dedicated unit test, so even the scaffolding around future GPU-vs-oracle validation is being locked down incrementally
  • GPU-side stream-state initialization is now also unit-tested (Decim, BaseTaps, PolyphaseTaps, ShiftedHistory capacity), so the new state ownership layer is no longer just trusted by inspection
  • the GPU streaming stub now also has a dedicated test proving that it advances persistent state while still explicitly failing as a not-yet-implemented production path
  • at this point, enough scaffolding exists that the next sensible step is to build the broader validation/test harness in one larger pass before continuing the actual production-path rewrite
  • that harness pass has now happened: deterministic IQ/tone fixtures, harness config/state builders, chunked polyphase oracle runners, and additional validation tests now exist, so the next step is back to the actual production-path rewrite
  • the first non-stub NEW-samples-only production-like path now exists as StreamingExtractGPUHostOracle(...): it is still host-side, but it executes the new streaming/stateful semantics and therefore serves as a concrete bridge between pure test infrastructure and the eventual real GPU production path
  • that host-side production-like path is now directly compared against the CPU oracle in tests and currently matches within tight tolerance, which is an important confidence step before any real CUDA-path replacement
  • the canonical new production entry point StreamingExtractGPU(...) is now structurally wired so that the host-side production-like implementation can sit behind the same API later, without forcing a premature switch today
  • a top-level cmd/sdrd production path hook now exists as well (extractForStreamingProduction plus useStreamingProductionPath=false), so the new architecture is no longer isolated to internal packages only
  • the new production path now also emits first-class output/heading telemetry (rate, output_len, head_mean_mag, head_max_step) in addition to pure state counters, which will make activation/debugging easier later
  • a top-level comparison observation hook now also exists in cmd/sdrd, so oracle-vs-production metrics no longer have to remain buried inside internal package helpers
  • after the broader monitoring/comparison consolidation pass, the next agreed work mode is to continue in larger clusters rather than micro-steps: (1) wire the new production semantics more deeply, (2) isolate the legacy path more sharply, (3) keep preparing the eventual real GPU production path behind the same architecture
  • after the first larger cluster, the next explicit target is to complete Cluster B: make the host-oracle bridge sit more naturally behind the new production execution architecture, rather than leaving production-path semantics spread across loosely connected files
  • after Cluster B, the remaining GPU rewrite work is now best split into two explicit parts: C1 = prepare and C2 = definitive implementation, so the project can keep momentum without pretending that the final CUDA/stateful production path is already done
  • Cluster B is now effectively complete: CPU oracle runner, host-oracle production-like path, and top-level production comparison all share the same host streaming core, and that common core is directly tested against the polyphase oracle
  • Cluster C1 is now also complete: the new GPU production layer has an explicit invocation contract, execution-result contract, state handoff/build/apply stages, and a host-side execution strategy already running behind the same model
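
The NOut output-count schedule mentioned above (outputs derived from NEW-sample consumption plus carried PhaseCount) can be sketched under one common convention; the stub's exact emission rule may differ in detail, so treat this as an assumption.

```go
package main

import "fmt"

// outputSchedule returns how many decimated outputs a block of nNew samples
// yields given the carried phase count, plus the phase count to carry forward.
// One output per decim input samples, with the remainder carried; this is one
// common convention and may differ in detail from the stub's exact rule.
func outputSchedule(phaseCount, nNew, decim int) (nOut, nextPhase int) {
	total := phaseCount + nNew
	return total / decim, total % decim
}

func main() {
	// Three consecutive blocks of 1001 samples at decim=8: the remainder
	// accumulates across blocks instead of being dropped at each boundary.
	phase := 0
	for i := 0; i < 3; i++ {
		var n int
		n, phase = outputSchedule(phase, 1001, 8)
		fmt.Println(n, phase)
	}
	// 125 1
	// 125 2
	// 125 3
}
```

The point of carrying `nextPhase` is exactly the no-rounding rule: over many blocks the total output count matches the monolithic computation, which is what the chunked-vs-monolithic oracle tests check.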

Current refactor status before C2

At this point the project has:

  • a corrected streaming/oracle architecture direction
  • a shared host-side streaming core used by both the CPU oracle runner and the host-side production-like bridge
  • explicit production-path hooks in cmd/sdrd
  • comparison and monitoring scaffolding above and below the execution layer
  • a prepared GPU execution contract (StreamingGPUInvocation / StreamingGPUExecutionResult)

What it does not have yet:

  • a real native CUDA streaming/polyphase execution entry point with history-in/history-out and phase-count in/out semantics
  • a real CUDA-backed implementation behind StreamingExtractGPUExec(...)
  • completed GPU-vs-oracle validation on the final native execution path

C2 plan

C2-A — native CUDA / bridge entry preparation

Goal:

  • introduce the real native entry shape for stateful streaming/polyphase execution

Status note before starting C2-A:

  • C2 is not honestly complete yet because the native CUDA side still only exposes the old separate freq-shift/FIR/decimate pieces.
  • Therefore C2-A must begin by creating the real native entry shape rather than continuing to stack more Go-only abstractions on top of the old kernels.

Required outcomes:

  • explicit native/CUDA function signature for streaming execution
  • bridge bindings for history in/out, phase count in/out, new samples in, outputs out
  • Go-side wrapper ready to call the new native path through the prepared invocation/result model
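
A hypothetical Go-side shape for that contract is sketched below, purely to make the required history-in/out and phase-in/out round-trip concrete. All names and field types here are invented for illustration; the real invocation/result model and the Windows bridge bindings will differ.

```go
package main

import "fmt"

// nativeStreamingCall is a hypothetical sketch of the Go-side wrapper input
// for the new native entry point: it makes the state round-trip explicit.
type nativeStreamingCall struct {
	NewSamples []complex64 // new input block (device upload)
	HistoryIn  []complex64 // shifted history carried from the previous call
	PhaseIn    int         // decimation phase count carried in
	Taps       []float32   // phase-major polyphase taps
	Decim      int
}

// nativeStreamingResult is the matching hypothetical output contract.
type nativeStreamingResult struct {
	Outputs    []complex64 // decimated outputs read back from the device
	HistoryOut []complex64 // updated history to persist for the next call
	PhaseOut   int         // updated phase count to persist
}

func main() {
	call := nativeStreamingCall{NewSamples: make([]complex64, 1024), Decim: 8, PhaseIn: 3}
	// A real implementation would execute the native kernel here; this sketch
	// only shows the contract shape: state in, state out.
	res := nativeStreamingResult{PhaseOut: (call.PhaseIn + len(call.NewSamples)) % call.Decim}
	fmt.Println(res.PhaseOut) // (3 + 1024) % 8 = 3
}
```

Making history and phase explicit call parameters (rather than hidden native globals) is what keeps the Go-side state owner authoritative, matching the batch-runner-owned state map already in place.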

C2-B — definitive execution implementation hookup

Goal:

  • put a real native CUDA-backed execution strategy behind StreamingExtractGPUExec(...)

Status note after C2-A:

  • the native entry shape now exists in CUDA, the Windows bridge can resolve it, and the Go execution layer can route into a native-prepared strategy.
  • what is still missing for C2-B is the actual stateful execution body behind that new native entrypoint.
  • therefore C2-B now means exactly one serious thing: replace the current placeholder body of the new native entrypoint with real stateful streaming/polyphase execution semantics, rather than adding more scaffolding around it.
  • C2-B is now materially done: the new native entrypoint no longer returns only placeholder state, and the Go native execution path now uploads inputs/history/taps, runs the new native function, and reads back outputs plus updated state.
  • when the new exact-integer streaming decimation rules were turned on, an immediate runtime integration issue appeared: previous WFM extraction defaults expected outRate=500000, but the live sample rate was 4096000, which is not exactly divisible by 500000 (4096000 / 500000 = 8.192). The correct fix is to align streaming defaults with the new integer-decimation model instead of trying to preserve the old rounded-ratio behavior.
  • the concrete immediate adjustment made for this was: wfmStreamOutRate = 512000 (instead of 500000), because 4096000 / 512000 = 8 is exactly divisible and therefore consistent with the new streaming architecture’s no-rounding rule.
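
The divisibility rule behind that adjustment can be sketched as a small check; the function name here is illustrative, not the actual validation code.

```go
package main

import "fmt"

// checkExactDecimation mirrors the no-rounding rule: the output rate must
// divide the input rate exactly, otherwise the configuration is rejected.
func checkExactDecimation(inRate, outRate int) (int, error) {
	if outRate <= 0 || inRate%outRate != 0 {
		return 0, fmt.Errorf("inRate %d not exactly divisible by outRate %d", inRate, outRate)
	}
	return inRate / outRate, nil
}

func main() {
	if _, err := checkExactDecimation(4096000, 500000); err != nil {
		fmt.Println(err) // old default: 4096000/500000 = 8.192, rejected
	}
	decim, _ := checkExactDecimation(4096000, 512000)
	fmt.Println(decim) // new default: exact decimation factor 8
}
```

Rejecting non-integer ratios up front is preferable to silently rounding, since a rounded ratio drifts the output timebase and produces exactly the kind of boundary artifacts under investigation.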

Required outcomes:

  • StreamingExtractGPUExec(...) can execute a real native stateful path
  • host-oracle bridge remains available only as a comparison/support path, not as the disguised production implementation
  • state apply/backflow goes through the already prepared invocation/result contract

C2-C — final validation and serious completion gate

Goal:

  • validate the real CUDA-backed path against the corrected oracle and make the completion criterion explicit

Required outcomes:

  • GPU-vs-oracle comparison active on the real native path
  • test coverage and runtime comparison hooks in place
  • after C2-C, the CUDA story must be treated as complete, correct, and serious — not half-switched or pseudo-finished

Why the refactor is intentionally incremental

The reviewer explicitly required:

  • no start-index-only production patch
  • no continued reliance on overlap+trim as final continuity model
  • no silent decimation rounding
  • no GPU sign-off without a corrected CPU oracle

Because of that, the work is being done in ordered layers:

  1. define streaming types and state
  2. build the CPU oracle with exact streaming semantics
  3. establish shared polyphase/tap semantics
  4. prepare GPU-side persistent state ownership
  5. only then replace the actual production GPU execution path

This means the repo now contains partially completed new architecture pieces that are deliberate stepping stones, not abandoned half-fixes.

Reviewer package artifacts created for second-opinion review

To support external/secondary review of the GPU extractor path, a focused reviewer package was created in the project root:

  • reviewer-gpu-extractor-package/
  • reviewer-gpu-extractor-package.zip
  • reviewer-gpu-extractor-package.json

The package intentionally contains:

  • relevant GPU extractor / kernel code
  • surrounding host-path code needed for context
  • current debug notes
  • a reviewer brief
  • a short reviewer prompt
  • relevant config files used during live telemetry work

The JSON variant is uncompressed: it stores all included package files in a single JSON document as a list of entries, each containing:

  • path
  • content

This was created specifically so the same reviewer payload can be consumed by tools or APIs that prefer a single structured text file instead of a ZIP archive.
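
A minimal illustration of that shape is below. The top-level `files` key and the specific entries are invented for illustration; the actual document may use a bare array or different wrapping.

```json
{
  "files": [
    {
      "path": "internal/demod/gpudemod/cpu_oracle.go",
      "content": "package gpudemod\n// ...file contents...\n"
    },
    {
      "path": "docs/gpu-streaming-refactor-plan-2026-03-25.md",
      "content": "# GPU streaming refactor plan\n...\n"
    }
  ]
}
```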


Meta note

This investigation already disproved several plausible explanations. That is progress.

The most important thing not to forget is:

  • the overlap prepend bug was real, but not sufficient
  • the click is already present in demod audio
  • whole-process CPU saturation is not the main explanation
  • excessive debug instrumentation can itself create misleading secondary problems
  • the 2026-03-25 extractor telemetry strongly suggests the remaining root cause is upstream of the final trim stage