This note captures the intermediate findings from the live/recording audio click investigation on sdr-wideband-suite.
Goal: preserve the reasoning, experiments, false leads, and current best understanding so future work does not restart from scratch.
We do not yet have the final root cause.
But we now know substantially more about what the clicks are not, and we identified at least one real bug plus several strong behavioral constraints in the pipeline.
Reviewed in detail:
- cmd/sdrd/dsp_loop.go
- cmd/sdrd/pipeline_runtime.go
- cmd/sdrd/helpers.go
- internal/recorder/streamer.go
- internal/recorder/demod_live.go
- internal/dsp/fir.go
- internal/dsp/fir_stateful.go
- internal/dsp/resample.go
- internal/demod/fm.go
- internal/demod/gpudemod/*
- web/app.js

Main conclusion from static reading: the pipeline contains several stateful continuity mechanisms, so clicks are likely to emerge at boundaries or from phase/timing inconsistencies rather than from one obvious isolated bug.
Observed by ear:
Temporary diagnostics were added to inspect:
A test switch temporarily disabled the extra 1-sample discriminator overlap prepend in streamer.go.
Result:
However:
Forced CPU-only stream extraction.
Result:
- feed_gap values appeared.

Forced a constant extraction read size (389120) instead of variable read sizing based on backlog.
Result:
- allIQ, gpuIQ_len, raw_len, and out_len became very stable.

Added optional debug dumping for:
- (*-demod.wav)
- (*-final.wav)

Observed by ear:
A process-level CSV monitor was added and used.
Result:
Because:
This is one of the strongest current findings.
Current best area of suspicion:
- (fmDiscrim) is producing pathological output from otherwise “boundary-clean-looking” IQ

Several debug runs showed severe buffer growth and drops:
- buf= values
- drop= counts
- audio_gap

This means some debug configurations can easily become self-distorting and produce additional artifacts that are not representative of the original bug.
At one point:
- (rate_limit_ms: 0)

This clearly increased overhead and likely polluted some runs.
That was replaced with a design intended to write one continuous window file instead of many micro-files.
A process-level CSV monitor was collected and showed only modest total CPU utilisation during the relevant tests. This does not support a simple “the machine is pegged” explanation. A hot thread / scheduling issue is still theoretically possible, but gross overall CPU overload is not the main signal.
All current work is on:
- debug/audio-clicks
- 94c132d - debug: instrument audio click investigation
- ffbc45d - debug: add advanced boundary metering

The active debug logging was trimmed down to:
- demod
- discrim
- gap
- boundary

The rate limit is currently back to a nonzero value to avoid self-induced spam.
A debug: config section was added with:
- audio_dump_enabled: false
- cpu_monitoring: false

Meaning:
streamer.go

This was a correct fix.
Reason:
This should not be reintroduced casually.
A temporary mechanism exists to force stable extraction block sizes. This is useful diagnostically because it removes one source of pipeline variability.
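A minimal sketch of how such an override could be read, assuming it is exposed as the environment variable SDR_FORCE_FIXED_STREAM_READ_SAMPLES named in this note. The function name, the injected lookup parameter, and the fallback behaviour are illustrative assumptions, not the actual sdrd code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// fixedReadEnv is the override name used in this note.
const fixedReadEnv = "SDR_FORCE_FIXED_STREAM_READ_SAMPLES"

// fixedReadSamples returns the forced read size and true when the override
// holds a positive integer. Otherwise it returns 0 and false, so the caller
// can fall back to backlog-based read sizing.
// The lookup parameter is a hypothetical seam that keeps the function testable.
func fixedReadSamples(lookup func(string) (string, bool)) (int, bool) {
	raw, ok := lookup(fixedReadEnv)
	if !ok {
		return 0, false
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

func main() {
	if n, ok := fixedReadSamples(os.LookupEnv); ok {
		fmt.Println("fixed read size:", n)
	} else {
		fmt.Println("variable read sizing based on backlog")
	}
}
```

Treating malformed or non-positive values as "not set" keeps a typo in the environment from silently changing the read size.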
IMPORTANT DECISION / DO NOT LOSE:
- SDR_FORCE_FIXED_STREAM_READ_SAMPLES.
- 389120 clearly helps by making allIQ, gpuIQ_len, raw_len, and out_len much more stable and by reducing one major source of pipeline variability.

internal/demod/fm.go now emits targeted discriminator stats under discrim logging, including:
This was useful to establish that large discriminator steps correlate with low IQ magnitude, but discriminator logging was later disabled from the active category list to reduce log spam.
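The correlation between large discriminator steps and low IQ magnitude can be illustrated with a small sketch. It assumes a conventional phase-difference discriminator, arg(x[n] * conj(x[n-1])); the function names and the synthetic block are hypothetical, and the real fmDiscrim may differ.

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// discrimStep returns the phase advance from prev to cur in radians,
// computed as arg(cur * conj(prev)); the result lies in [-pi, pi].
func discrimStep(prev, cur complex128) float64 {
	return cmplx.Phase(cur * cmplx.Conj(prev))
}

// largestStep scans a block and reports the index of the largest absolute
// phase step, the step itself, and the IQ magnitude at that index.
// It returns idx = -1 when the block has fewer than two samples.
func largestStep(iq []complex128) (idx int, step, mag float64) {
	idx = -1
	for n := 1; n < len(iq); n++ {
		s := math.Abs(discrimStep(iq[n-1], iq[n]))
		if idx == -1 || s > step {
			idx, step, mag = n, s, cmplx.Abs(iq[n])
		}
	}
	return idx, step, mag
}

// syntheticBlock builds a steady carrier (magnitude 0.3, 0.1 rad per sample)
// with one sample where the magnitude collapses and the phase is arbitrary.
// The data is synthetic and only illustrates the mechanism.
func syntheticBlock() []complex128 {
	iq := make([]complex128, 8)
	for n := range iq {
		iq[n] = cmplx.Rect(0.3, 0.1*float64(n))
	}
	iq[4] = cmplx.Rect(0.001, 2.9)
	return iq
}

func main() {
	idx, step, mag := largestStep(syntheticBlock())
	fmt.Printf("idx=%d step=%.2f rad mag=%.3f\n", idx, step, mag)
}
```

When the magnitude is close to zero the phase is dominated by noise, so a single weak sample is enough to produce a near-π step in the demodulated audio.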
dec-IQ findings before demod

Additional metering in streamer.go showed:
- dec_iq_head_dip
- min_idx ~= 25
- max_step_idx ~= 24
- demod_boundary and audible clicks shortly afterward

This is the strongest currently known mechanism in the chain.
For the current pre-demod FIR:
- taps: 101
- group delay: (101 - 1) / 2 = 50 input samples
- with decim1 = 2, this projects to about 25 output samples

This matches the repeatedly observed problematic dec indices (~24-25) remarkably well.
That strongly suggests the audible issue is connected to the FIR/decimation settling region at the beginning of the dec block.
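The arithmetic above can be written down as a tiny helper. This is a sketch that assumes a linear-phase (symmetric) FIR with an odd tap count, for which the group delay is (taps - 1) / 2 input samples; the function names are illustrative.

```go
package main

import "fmt"

// groupDelaySamples returns the group delay of a linear-phase FIR with an
// odd number of taps, in input samples: (taps - 1) / 2.
func groupDelaySamples(taps int) int {
	return (taps - 1) / 2
}

// settlingOutputIndex projects that delay onto the decimated output,
// i.e. the output index at which the filter has seen one full group delay.
func settlingOutputIndex(taps, decim int) int {
	return groupDelaySamples(taps) / decim
}

func main() {
	fmt.Println(groupDelaySamples(101))      // 50
	fmt.Println(settlingOutputIndex(101, 2)) // 25
}
```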
A dedicated pre-FIR probe was added on fullSnip (the input to the pre-demod FIR) and compared against the existing dec-side probes.
Observed pattern:
- dec indices ~24-25

Interpretation:
A debug head-trim on dec was tested.
Subjective result:
- trim=32 sounded best among the tested values (16/32/48/64)

Interpretation:
- dec settling region is a real contributor

The likely clean fix is not to keep trimming samples away. The FIR/decimation section is still suspicious, but later tests showed it is likely not the sole origin.
Important nuance:
- (processSnippet), not in CUDA

Later update:
The remaining audible clicks are most likely generated at or immediately before FM demodulation.
Most plausible interpretations:
At this point, the most valuable next data is low-overhead IQ telemetry right before demod, plus carefully controlled demod-vs-final audio comparison.
After discriminator-focused metering and targeted dec-IQ probes, the strongest current theory is:
A reproducible early defect in the dec IQ block appears around sample index 24-25, where IQ magnitude dips sharply and the effective FM phase step becomes abnormally large. This then shows up as demod_boundary and audible clicks.
Crucially:
- demod.wav, so it exists before the final resampler/playback path
- dec block

This strongly suggests a settling/transient zone at the beginning of the decimated IQ block.
Later refinements to this theory:
- (extractForStreaming, signal parameter source, offset/BW stability, overlap/trim behavior).
- (metricsHistory, events) and periodically trims them by copying tail slices.

The investigation was deliberately refocused away from browser/feed/demod-only suspicions and toward:
This was driven by two observations:
Two config files matter for debug telemetry defaults:
- config.yaml
- config.autosave.yaml

The autosave file can overwrite intended telemetry defaults after restart, so both must be updated together.
Current conservative live-debug defaults that worked better:
- heavy_enabled: false
- heavy_sample_every: 12
- metric_sample_every: 8
- metric_history_max: 6000
- event_history_max: 1500

Important operational lesson:
- POST /api/debug/telemetry/config changes only affect the current sdrd process
- (heavy_enabled: true or very large history limits), the debug run can accidentally become self-distorting again

The live debug work used these HTTP endpoints on the sdrd web server (typically http://127.0.0.1:8080):

GET /api/debug/telemetry/config

Returns the current effective telemetry configuration. Useful for verifying:
Typical fields:
- enabled
- heavy_enabled
- heavy_sample_every
- metric_sample_every
- metric_history_max
- event_history_max
- retention_seconds
- persist_enabled
- persist_dir

POST /api/debug/telemetry/config

Applies runtime telemetry config changes to the current process. Used during investigation to temporarily reduce telemetry load without editing files.
Example body used during investigation:
{
  "heavy_enabled": true,
  "heavy_sample_every": 12,
  "metric_sample_every": 8
}
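A hedged sketch of sending that body from Go. The endpoint path and field names come from this note; the struct, the helper names, the JSON content type, and the assumption that only the status code matters are illustrative, since the response shape is not documented here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// telemetryPatch mirrors the three fields of the example body.
// The Go type itself is hypothetical.
type telemetryPatch struct {
	HeavyEnabled      bool `json:"heavy_enabled"`
	HeavySampleEvery  int  `json:"heavy_sample_every"`
	MetricSampleEvery int  `json:"metric_sample_every"`
}

// encodePatch serialises the patch into the JSON request body.
func encodePatch(p telemetryPatch) ([]byte, error) {
	return json.Marshal(p)
}

// postTelemetryConfig sends the patch to the running sdrd process and
// returns the HTTP status code. It assumes the server accepts JSON.
func postTelemetryConfig(client *http.Client, baseURL string, p telemetryPatch) (int, error) {
	body, err := encodePatch(p)
	if err != nil {
		return 0, err
	}
	resp, err := client.Post(baseURL+"/api/debug/telemetry/config", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	patch := telemetryPatch{HeavyEnabled: true, HeavySampleEvery: 12, MetricSampleEvery: 8}
	status, err := postTelemetryConfig(client, "http://127.0.0.1:8080", patch)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("status:", status)
}
```

Because the change only affects the current process, a follow-up GET on the same path is a cheap way to confirm what is actually in effect.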
GET /api/debug/telemetry/live

Returns the current live metric snapshot (gauges/counters/distributions). Useful for:

GET /api/debug/telemetry/history?prefix=<prefix>&limit=<n>

Returns stored metric history entries filtered by metric-name prefix. This is the main endpoint for time-series debugging during live runs.
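A history query URL can be assembled with the standard library so that the prefix is always escaped correctly. This is a sketch: the helper name and the limit of 200 are illustrative assumptions, and the response format is deliberately not decoded because it is not specified in this note.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// historyURL builds the history query for one metric-name prefix.
// url.Values.Encode sorts parameters by key, so "limit" precedes "prefix";
// the order carries no meaning for the request.
func historyURL(baseURL, prefix string, limit int) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/api/debug/telemetry/history"
	q := url.Values{}
	q.Set("prefix", prefix)
	q.Set("limit", strconv.Itoa(limit))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	for _, prefix := range []string{"iq.extract.raw.", "iq.pre_demod", "audio.demod"} {
		u, err := historyURL("http://127.0.0.1:8080", prefix, 200)
		if err != nil {
			fmt.Println("bad base URL:", err)
			return
		}
		fmt.Println(u)
	}
}
```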
Useful examples:
- prefix=stage.
- prefix=source.
- prefix=iq.boundary.all
- prefix=iq.extract.input
- prefix=iq.extract.output
- prefix=iq.extract.raw.
- prefix=iq.extract.trimmed.
- prefix=iq.pre_demod
- prefix=audio.demod

GET /api/debug/telemetry/events?limit=<n>

Returns recent structured telemetry events. Used heavily once compact per-block event probes were added, because events were often easier to inspect reliably than sparsely sampled distribution histories.
This ended up being especially useful for:
- iq.boundary.all.head_mean_mag
- iq.boundary.all.prev_tail_mean_mag
- iq.boundary.all.delta_mag
- iq.boundary.all.delta_phase
- iq.boundary.all.discontinuity_score

Purpose:
- allIQ block boundary was already obviously broken before signal-specific extraction

- iq.extract.input.length
- iq.extract.input.overlap_length
- iq.extract.input.head_mean_mag
- iq.extract.input.prev_tail_mean_mag
- iq.extract.input.discontinuity_score
- iq.extract.output.length
- iq.extract.output.head_mean_mag
- iq.extract.output.head_min_mag
- iq.extract.output.head_max_step
- iq.extract.output.head_p95_step
- iq.extract.output.head_tail_ratio
- iq.extract.output.head_low_magnitude_count
- iq.extract.output.boundary.delta_mag
- iq.extract.output.boundary.delta_phase
- iq.extract.output.boundary.d2
- iq.extract.output.boundary.discontinuity_score

Purpose:

- iq.extract.raw.length
- iq.extract.raw.head_mag
- iq.extract.raw.tail_mag
- iq.extract.raw.head_zero_count
- iq.extract.raw.first_nonzero_index
- iq.extract.raw.head_max_step
- iq.extract.trim.trim_samples
- iq.extract.trimmed.head_mag
- iq.extract.trimmed.tail_mag
- iq.extract.trimmed.head_zero_count
- iq.extract.trimmed.first_nonzero_index
- iq.extract.trimmed.head_max_step
- extract_raw_head_probe
- extract_trimmed_head_probe

Purpose:

- iq.pre_demod.head_mean_mag
- iq.pre_demod.head_min_mag
- iq.pre_demod.head_max_step
- iq.pre_demod.head_p95_step
- iq.pre_demod.head_low_magnitude_count
- audio.demod.head_mean_abs
- audio.demod.tail_mean_abs
- audio.demod.edge_delta_abs
- audio.demod_boundary.*

Purpose:
stage.feed_enqueue.duration_ms was usually effectively zero.
Representative values during live runs:
- 0
- 0.5 ms and 5.8 ms

Interpretation:
stage.extract_stream.duration_ms was usually small and stable compared with the main loop.
Representative values:
- 1–5 ms
- 10.7 ms and 18.9 ms

Interpretation:
Representative live values:
- dsp.frame.duration_ms: often around 90–100 ms, but also 110–150 ms, with one observed spike around 212.6 ms
- source.read.duration_ms: roughly 80–90 ms often, but also about 60 ms, 47 ms, 19 ms, and even 0.677 ms
- source.buffer_samples: ranged from very small to very large bursts, including examples like 512, 4608, 94720, 179200, 304544
- source_reset event was seen and source.resets=1

Interpretation:
Representative live values for normal non-vanishing signals:
- iq.pre_demod.head_mean_mag around 0.25–0.31
- iq.pre_demod.head_low_magnitude_count = 0
- iq.pre_demod.head_max_step repeatedly high, including roughly:
  - 1.5, 2.0, 2.4, 2.8, 3.08

Interpretation:
Representative values:
- audio.demod.edge_delta_abs repeatedly around 0.4–0.8
- 1.21 and 1.26
- audio.demod_boundary.count continued to fire repeatedly

Interpretation:
For a representative strong signal (signal_id=2), iq.extract.output.boundary.delta_phase repeatedly showed very large jumps such as:
- 2.60, 3.06, 2.14, 2.71, 3.09, 2.92, 2.63, 2.78

Also observed for iq.extract.output.boundary.discontinuity_score:
- 2.86, 3.08, 2.92, 2.52, 2.40, 2.85

Later runs using d2 made the discontinuity even easier to see. Representative iq.extract.output.boundary.d2 values for the same strong signal included:
- 0.347, 0.303, 0.362, 0.359, 0.382, 0.344, 0.337, 0.206

At the same time, iq.extract.output.boundary.delta_mag was often comparatively small (examples around 0.0003–0.0038).
Interpretation:
The new extract_raw_head_probe events were the strongest finding of the day.
Representative repeated pattern for strong signals (signal_id=1 and signal_id=2):
- first_nonzero_index = 1
- zero_count = 1
- first head magnitude: 0
- example raw head magnitudes for signal_id=2:
  - 0, 0.000388, 0.002316, 0.004152, 0.019126, 0.011418, 0.124034, 0.257569, 0.317579
- head_max_step often near π, e.g.:
  - 3.141592653589793
  - 3.088773696463606
  - 3.0106854446936318
  - 2.9794833659932527

The same qualitative pattern appeared for weaker signals too:
- 0

Interpretation:
Representative repeated pattern for the same signals after trim_samples = 64:
- first_nonzero_index = 0
- zero_count = 0
- head_max_step is dramatically lower than raw, often around 0.15–0.9 for strong channels

Example trimmed head magnitudes for signal_id=2:
- 0.299350, 0.300954, 0.298032, 0.298738, 0.312258, 0.296932, 0.239010, 0.266881, 0.313193

Example trimmed head magnitudes for signal_id=1:
- 0.277400, 0.275994, 0.273718, 0.272846, 0.277842, 0.278398, 0.268829, 0.273790, 0.279031

Interpretation:
The current best reading is:
The click root cause is very likely upstream of final trimming, at or before the raw extractor output head, and likely tied to shared block-boundary / extractor-start conditions rather than to feed enqueue, browser playback, or trimming itself.
More specifically:
- (SDR_FORCE_FIXED_STREAM_READ_SAMPLES=389120) remains useful and likely worth promoting later, but it is not the root-cause fix.
- config.autosave.yaml must be kept in sync with config.yaml or telemetry defaults can silently revert after restart.

This investigation already disproved several plausible explanations. That is progress.
The most important thing not to forget is: