
Audio Click Debug Notes — 2026-03-24

Context

This note captures the intermediate findings from the live/recording audio click investigation on sdr-wideband-suite.

Goal: preserve the reasoning, experiments, false leads, and current best understanding so future work does not restart from scratch.


High-level outcome so far

We do not yet have the final root cause.

But we now know substantially more about what the clicks are not, and we identified at least one real bug plus several strong behavioral constraints in the pipeline.


What was tested

1. Session/context recovery

  • Reconstructed prior debugging context from reset-session backup files.
  • Confirmed the relevant investigation was the persistent audio clicking bug in live audio / recordings.

2. Codebase deep-read

Reviewed in detail:

  • cmd/sdrd/dsp_loop.go
  • cmd/sdrd/pipeline_runtime.go
  • cmd/sdrd/helpers.go
  • internal/recorder/streamer.go
  • internal/recorder/demod_live.go
  • internal/dsp/fir.go
  • internal/dsp/fir_stateful.go
  • internal/dsp/resample.go
  • internal/demod/fm.go
  • internal/demod/gpudemod/*
  • web/app.js

Main conclusion from static reading: the pipeline contains several stateful continuity mechanisms, so clicks are likely to emerge at boundaries or from phase/timing inconsistencies rather than from one obvious isolated bug.

3. AM vs FM tests

Observed by ear:

  • AM clicks too.
  • Therefore this is not an FM-only issue.
  • That shifted focus away from purely FM-specific explanations and toward shared-path / continuity / transport / demod-adjacent causes.

4. Recording vs live path comparison

Observed by ear:

  • Recordings click too.
  • Therefore browser/WebSocket/live playback is not the sole cause.
  • The root problem exists in the server-side audio pipeline before browser playback.

5. Boundary instrumentation added

Temporary diagnostics were added to inspect:

  • extract trimming
  • snippet lengths
  • demod path lengths
  • boundary click / intra-click detector
  • IQ continuity at various stages

6. Discriminator-overlap hypothesis

A test switch temporarily disabled the extra 1-sample discriminator overlap prepend in streamer.go.

Result:

  • This extra overlap was a real problem.
  • It caused the downstream decimation phase to flip between blocks.
  • Removing it cleaned up the boundary model and was the correct change.

However:

  • Removing it did not eliminate the audible clicks.
  • Therefore it was a real bug, but not the main remaining root cause.

7. GPU vs CPU extraction test

Forced CPU-only stream extraction.

Result:

  • CPU-only made things dramatically worse in real time.
  • Large feed_gap values appeared.
  • Huge backlogs built up.
  • Therefore CPU-only is not a solution, and the GPU path is not the sole main problem.

8. Fixed read-size test

Forced a constant extraction read size (389120) instead of variable read sizing based on backlog.

Result:

  • allIQ, gpuIQ_len, raw_len, and out_len became very stable.
  • This reduced pipeline variability and made logs much cleaner.
  • Subjectively, audio may have become slightly better, but clicks remained.
  • Therefore variable block sizing is likely a contributing factor, but not the full explanation.

9. Multi-stage audio dump test

Added optional debug dumping for:

  • demod audio (*-demod.wav)
  • final audio after resampler (*-final.wav)

Observed by ear:

  • Clicks are present in both dump types.
  • Therefore the click is already present by the time demodulated audio exists.
  • Resampler/final audio path is not the primary origin.

10. CPU monitoring

A process-level CSV monitor was added and used.

Result:

  • Overall process CPU usage was modest (not near full machine saturation).
  • This does not support “overall CPU is pegged” as the main explanation.
  • Caveat: this does not fully exclude a hot thread or scheduler issue, but gross total CPU overload is not the main story.

What we now know with reasonable confidence

A. The issue is not primarily caused by:

  • Browser playback
  • WebSocket transport
  • Final PCM fanout only
  • Resampler alone
  • CPU-only vs GPU-only as the core dichotomy
  • The old extra discriminator overlap prepend (that was a bug, but not the remaining dominant one)
  • Purely variable block sizes alone
  • Gross whole-process CPU saturation

B. The issue is server-side and exists before final playback

Because:

  • recordings click
  • demod dump clicks
  • final dump clicks

C. The issue is present by the demodulated audio stage

This is one of the strongest current findings.

D. The WFM/FM-demod-adjacent path remains highly suspicious

Current best area of suspicion:

  • decimated IQ may still contain subtle corruption/instability not fully captured by current metrics
  • OR the FM discriminator (fmDiscrim) is producing pathological output from otherwise “boundary-clean-looking” IQ

Important runtime/pathology observations

1. Backlog amplification is real

Several debug runs showed severe buffer growth and drops:

  • large buf= values
  • growing drop= counts
  • repeated audio_gap

This means some debug configurations can easily become self-distorting and produce additional artifacts that are not representative of the original bug.

2. Too much debug output causes self-inflicted load

At one point:

  • rate limiter was disabled (rate_limit_ms: 0)
  • aggressive boundary logging was enabled
  • many short WAV files were generated

This clearly increased overhead and likely polluted some runs.

3. Many short WAVs were a bad debug design

That was replaced with a design intended to write one continuous window file instead of many micro-files.

4. Total process CPU saturation does not appear to be the main cause

A process-level CSV monitor was collected and showed only modest total CPU utilisation during the relevant tests. This does not support a simple “the machine is pegged” explanation. A hot thread / scheduling issue is still theoretically possible, but gross overall CPU overload is not the main signal.


Current debug state in repo

Branch

All current work is on:

  • debug/audio-clicks

Commits so far

  • 94c132d debug: instrument audio click investigation
  • ffbc45d debug: add advanced boundary metering

Current config/logging state

The active debug logging was trimmed down to:

  • demod
  • discrim
  • gap
  • boundary

Rate limit is currently back to a nonzero value to avoid self-induced spam.

Dump/CPU debug state

A debug: config section was added with:

  • audio_dump_enabled: false
  • cpu_monitoring: false

Meaning:

  • heavy WAV dumping is now OFF by default
  • CPU monitoring is conceptually OFF by default (script still exists, but must be explicitly used)

Most important code changes/findings to remember

1. Removed the extra discriminator overlap prepend in streamer.go

This was a correct fix.

Reason:

  • it introduced a blockwise extra IQ sample
  • this shifted decimation phase between blocks
  • it created real boundary artifacts

This should not be reintroduced casually.

2. Fixed read-size test exists and is useful for investigation

A temporary mechanism exists to force stable extraction block sizes. This is useful diagnostically because it removes one source of pipeline variability.

IMPORTANT DECISION / DO NOT LOSE:

  • The fixed read-size path currently lives behind the environment variable SDR_FORCE_FIXED_STREAM_READ_SAMPLES.
  • The tested value 389120 clearly helps by making allIQ, gpuIQ_len, raw_len, and out_len much more stable and by reducing one major source of pipeline variability.
  • Current plan: once the remaining click root cause is solved, promote this behavior into the normal code path instead of leaving it as an env-var-only debug switch.
  • In other words: treat fixed read sizing as a likely permanent stabilization improvement, but do not bake it in blindly until the click investigation is complete.

3. FM discriminator metering exists

internal/demod/fm.go now emits targeted discriminator stats under discrim logging, including:

  • min/max IQ magnitude
  • maximum absolute phase step
  • count of large phase steps

This was useful to establish that large discriminator steps correlate with low IQ magnitude, but discriminator logging was later disabled from the active category list to reduce log spam.

4. Strong dec-IQ findings before demod

Additional metering in streamer.go showed:

  • repeated dec_iq_head_dip
  • repeated low magnitude near min_idx ~= 25
  • repeated early large local phase step near max_step_idx ~= 24
  • repeated demod_boundary and audible clicks shortly afterward

This is the strongest currently known mechanism in the chain.

5. Group delay observation

For the current pre-demod FIR:

  • taps = 101
  • FIR group delay = (101 - 1) / 2 = 50 input samples
  • with decim1 = 2, this projects to about 25 output samples

This matches the repeatedly observed problematic dec indices (~24-25) remarkably well. That strongly suggests the audible issue is connected to the FIR/decimation settling region at the beginning of the dec block.
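The arithmetic can be checked directly (standard linear-phase FIR group delay, projected through the decimator):

```go
package main

import "fmt"

func main() {
	taps := 101
	decim1 := 2
	// Linear-phase FIR group delay in input samples: (N-1)/2.
	inputDelay := (taps - 1) / 2
	// Projected onto the decimated output sample grid.
	outputDelay := inputDelay / decim1
	fmt.Println(inputDelay, outputDelay) // 50 25
}
```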

6. Pre-FIR vs post-FIR comparison

A dedicated pre-FIR probe was added on fullSnip (the input to the pre-demod FIR) and compared against the existing dec-side probes.

Observed pattern:

  • pre-FIR head probe usually looked relatively normal
  • no equally strong or equally reproducible hot spot appeared there
  • after FIR + decimation, the problematic dip/step repeatedly appeared near dec indices ~24-25

Interpretation:

  • the strongest currently observed defect is not already present in the same form before the FIR
  • it is much more likely to emerge in the FIR/decimation section (or its settling behavior) than in the raw pre-FIR input

7. Head-trim test results

A debug head-trim on dec was tested. Subjective result:

  • trim=32 sounded best among the tested values (16/32/48/64)
  • but it did not remove the clicks entirely

Interpretation:

  • the early dec settling region is a real contributor
  • but it is probably not the only contributor, or trimming alone is not the final correct fix

8. Current architectural conclusion

The likely clean fix is not to keep trimming samples away. The FIR/decimation section is still suspicious, but later tests showed it is likely not the sole origin.

Important nuance:

  • the currently suspicious FIR + decimation section is already running in Go/CPU (processSnippet), not in CUDA
  • therefore the next correctness fix should be developed and validated in Go first

Later update:

  • a stateful decimating FIR / polyphase-style replacement was implemented in Go and tested
  • it was architecturally cleaner than the old separated FIR->decimate handoff
  • but it did not remove the recurring hot spot / clicks
  • therefore the old handoff was not the whole root cause, even if the newer path is still cleaner

Best current hypothesis

The remaining audible clicks are most likely generated at or immediately before FM demodulation.

Most plausible interpretations:

  1. The decimated IQ stream still contains subtle corruption/instability not fully captured by the earliest boundary metrics.
  2. The FM discriminator is reacting violently to short abnormal IQ behavior inside blocks, not just at block boundaries.
  3. The problematic region is likely a very specific early decimated-IQ settling zone, not broad corruption across the whole block.

At this point, the most valuable next data is low-overhead IQ telemetry right before demod, plus carefully controlled demod-vs-final audio comparison.

Stronger updated working theory (later findings, same day)

After discriminator-focused metering and targeted dec-IQ probes, the strongest current theory is:

A reproducible early defect in the dec IQ block appears around sample index 24-25, where IQ magnitude dips sharply and the effective FM phase step becomes abnormally large. This then shows up as demod_boundary and audible clicks.

Crucially:

  • this issue appears in demod.wav, so it exists before the final resampler/playback path
  • it is not spread uniformly across the whole dec block
  • it repeatedly appears near the same index
  • trimming the first ~32 samples subjectively reduces the click, but does not eliminate it entirely

This strongly suggests a settling/transient zone at the beginning of the decimated IQ block.

Later refinements to this theory:

  • pre-FIR probing originally looked cleaner than post-FIR probing, which made FIR/decimation look like the main culprit
  • however, a temporary FIR bypass showed the clicks were still present, only somewhat quieter / less aggressive
  • this indicates the pre-demod FIR likely amplifies or sharpens an upstream issue, but is not the sole origin
  • a cleaner stateful decimating FIR implementation also failed to eliminate the recurring hot spot, further weakening the idea that the old FIR->decimate handoff alone caused the bug

Recommended next steps

  1. Run with reduced logging only and keep heavy dump features OFF unless explicitly needed.
  2. Continue investigating the extractor path and its immediate surroundings (extractForStreaming, signal parameter source, offset/BW stability, overlap/trim behavior).
  3. Treat FIR/decimation as a possible amplifier/focuser of the issue, but not the only suspect.
  4. When testing fixes, prefer low-overhead, theory-driven experiments over broad logging/dump spam.
  5. Only re-enable audio dump windows selectively and briefly.

Debug TODO / operational reminders

  • The current telemetry collector is not using a true ring buffer for metric/event history.
  • Internally it keeps append-only history slices (metricsHistory, events) and periodically trims them by copying tail slices.
  • Under heavy per-block telemetry this can add enough mutex/copy overhead to make the live stream start stuttering after a short run.
  • Therefore: keep telemetry sampling conservative during live reproduction runs; do not leave full heavy telemetry enabled longer than needed.
  • Follow-up engineering task: replace or redesign telemetry history storage to use a proper low-overhead ring-buffer style structure (or equivalent bounded lock-light design) if live telemetry is to remain a standard debugging tool.

2026-03-25 update — extractor-focused live telemetry findings

Where the investigation moved

The investigation was deliberately refocused away from browser/feed/demod-only suspicions and toward:

  • shared upstream IQ cadence / block boundaries
  • extractor input/output continuity
  • raw vs trimmed extractor-head behaviour

This was driven by two observations:

  1. all signals still click
  2. the newly added live telemetry made it possible to inspect the shared path while the system was running

Telemetry infrastructure / config notes

Two config files matter for debug telemetry defaults:

  • config.yaml
  • config.autosave.yaml

The autosave file can overwrite intended telemetry defaults after restart, so both must be updated together.

Current conservative live-debug defaults that worked better:

  • heavy_enabled: false
  • heavy_sample_every: 12
  • metric_sample_every: 8
  • metric_history_max: 6000
  • event_history_max: 1500

Important operational lesson:

  • runtime POST /api/debug/telemetry/config changes only affect the current sdrd process
  • after restart, the process reloads config defaults again
  • if autosave still contains older values (for example heavy_enabled: true or very large history limits), the debug run can accidentally become self-distorting again

Telemetry endpoints

The live debug work used these HTTP endpoints on the sdrd web server (typically http://127.0.0.1:8080):

GET /api/debug/telemetry/config

Returns the current effective telemetry configuration. Useful for verifying:

  • whether heavy telemetry is enabled
  • history sizes
  • persistence settings
  • sample rates actually active in the running process

Typical fields:

  • enabled
  • heavy_enabled
  • heavy_sample_every
  • metric_sample_every
  • metric_history_max
  • event_history_max
  • retention_seconds
  • persist_enabled
  • persist_dir

POST /api/debug/telemetry/config

Applies runtime telemetry config changes to the current process. Used during investigation to temporarily reduce telemetry load without editing files.

Example body used during investigation:

{
  "heavy_enabled": true,
  "heavy_sample_every": 12,
  "metric_sample_every": 8
}

GET /api/debug/telemetry/live

Returns the current live metric snapshot (gauges/counters/distributions). Useful for:

  • quick sanity checks
  • verifying that a metric family exists
  • confirming whether a new metric name is actually being emitted

GET /api/debug/telemetry/history?prefix=<prefix>&limit=<n>

Returns stored metric history entries filtered by metric-name prefix. This is the main endpoint for time-series debugging during live runs.

Useful examples:

  • prefix=stage.
  • prefix=source.
  • prefix=iq.boundary.all
  • prefix=iq.extract.input
  • prefix=iq.extract.output
  • prefix=iq.extract.raw.
  • prefix=iq.extract.trimmed.
  • prefix=iq.pre_demod
  • prefix=audio.demod
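A small Go client for this endpoint might look like the sketch below. The path and query-parameter names come from these notes; the function names are hypothetical, and the response is returned as raw JSON because the payload schema is not pinned down in this document.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// buildHistoryURL assembles a query against the telemetry history endpoint.
func buildHistoryURL(base, prefix string, limit int) string {
	return fmt.Sprintf("%s/api/debug/telemetry/history?prefix=%s&limit=%d",
		base, url.QueryEscape(prefix), limit)
}

// fetchHistory performs the GET and returns the raw JSON body.
func fetchHistory(base, prefix string, limit int) (string, error) {
	resp, err := http.Get(buildHistoryURL(base, prefix, limit))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	fmt.Println(buildHistoryURL("http://127.0.0.1:8080", "iq.extract.raw.", 50))
}
```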

GET /api/debug/telemetry/events?limit=<n>

Returns recent structured telemetry events. Used heavily once compact per-block event probes were added, because events were often easier to inspect reliably than sparsely sampled distribution histories.

This ended up being especially useful for:

  • raw extractor head probes
  • trimmed extractor head probes
  • extractor input head probes
  • GPU kernel input/output head probes
  • boundary snapshots

Important telemetry families added/used

Shared-path / global boundary metrics

  • iq.boundary.all.head_mean_mag
  • iq.boundary.all.prev_tail_mean_mag
  • iq.boundary.all.delta_mag
  • iq.boundary.all.delta_phase
  • iq.boundary.all.discontinuity_score

Purpose:

  • detect whether the shared allIQ block boundary was already obviously broken before signal-specific extraction

Extractor input/output metrics

  • iq.extract.input.length
  • iq.extract.input.overlap_length
  • iq.extract.input.head_mean_mag
  • iq.extract.input.prev_tail_mean_mag
  • iq.extract.input.discontinuity_score
  • iq.extract.output.length
  • iq.extract.output.head_mean_mag
  • iq.extract.output.head_min_mag
  • iq.extract.output.head_max_step
  • iq.extract.output.head_p95_step
  • iq.extract.output.head_tail_ratio
  • iq.extract.output.head_low_magnitude_count
  • iq.extract.output.boundary.delta_mag
  • iq.extract.output.boundary.delta_phase
  • iq.extract.output.boundary.d2
  • iq.extract.output.boundary.discontinuity_score

Purpose:

  • isolate whether the final per-signal extractor output itself was discontinuous across blocks

Raw vs trimmed extractor-head telemetry

  • iq.extract.raw.length
  • iq.extract.raw.head_mag
  • iq.extract.raw.tail_mag
  • iq.extract.raw.head_zero_count
  • iq.extract.raw.first_nonzero_index
  • iq.extract.raw.head_max_step
  • iq.extract.trim.trim_samples
  • iq.extract.trimmed.head_mag
  • iq.extract.trimmed.tail_mag
  • iq.extract.trimmed.head_zero_count
  • iq.extract.trimmed.first_nonzero_index
  • iq.extract.trimmed.head_max_step
  • event extract_raw_head_probe
  • event extract_trimmed_head_probe

Purpose:

  • answer the key question: is the corruption already present in the raw extractor output head, or created by trimming/overlap logic afterward?
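The head-probe quantities above are simple to compute; a sketch follows, with names modeled on the telemetry fields rather than on actual code identifiers:

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// headProbe reports the length of the leading exact-zero run, the index of
// the first non-zero sample (-1 if the whole window is zero), and the
// largest absolute phase step inside the head window.
func headProbe(iq []complex128) (zeroCount, firstNonzero int, maxStep float64) {
	firstNonzero = -1
	for i, s := range iq {
		if s == 0 {
			if firstNonzero == -1 {
				zeroCount++ // still inside the leading-zero run
			}
		} else if firstNonzero == -1 {
			firstNonzero = i
		}
		if i > 0 {
			if step := math.Abs(cmplx.Phase(s * cmplx.Conj(iq[i-1]))); step > maxStep {
				maxStep = step
			}
		}
	}
	return
}

func main() {
	// Shape reported below for raw heads: exact zero first, then a short ramp.
	raw := []complex128{0, 0.000388, 0.002316, 0.004152, 0.019126}
	z, fn, ms := headProbe(raw)
	fmt.Println(z, fn, ms)
}
```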

Additional extractor input / GPU-kernel probe telemetry

  • iq.extract.input_head.zero_count
  • iq.extract.input_head.first_nonzero_index
  • iq.extract.input_head.max_step
  • event extract_input_head_probe
  • event gpu_kernel_input_head_probe
  • event gpu_kernel_output_head_probe

Purpose:

  • split the remaining uncertainty between:
    • signal-specific input already being bad
    • GPU extractor kernel/start semantics producing the bad raw head
    • later output assembly after the kernel

Pre-demod / audio-stage metrics

  • iq.pre_demod.head_mean_mag
  • iq.pre_demod.head_min_mag
  • iq.pre_demod.head_max_step
  • iq.pre_demod.head_p95_step
  • iq.pre_demod.head_low_magnitude_count
  • audio.demod.head_mean_abs
  • audio.demod.tail_mean_abs
  • audio.demod.edge_delta_abs
  • existing audio.demod_boundary.*

Purpose:

  • verify where artifacts become visible/audible downstream

What the 2026-03-25 telemetry actually showed

1. Feed / enqueue remained relatively uninteresting

stage.feed_enqueue.duration_ms was usually effectively zero.

Representative values during live runs:

  • mostly 0
  • occasional small spikes such as 0.5 ms and 5.8 ms

Interpretation:

  • feed enqueue is not the main source of clicks

2. Extract-stream time was usually modest

stage.extract_stream.duration_ms was usually small and stable compared with the main loop.

Representative values:

  • often 1–5 ms
  • occasional spikes such as 10.7 ms and 18.9 ms

Interpretation:

  • extraction is not free, but runtime cost alone does not explain the clicks

3. Shared capture / source cadence still fluctuated heavily

Representative live values:

  • dsp.frame.duration_ms: often around 90–100 ms, but also 110–150 ms, with one observed spike around 212.6 ms
  • source.read.duration_ms: roughly 80–90 ms often, but also about 60 ms, 47 ms, 19 ms, and even 0.677 ms
  • source.buffer_samples: ranged from very small to very large bursts, including examples like 512, 4608, 94720, 179200, 304544
  • a source_reset event was seen and source.resets=1

Interpretation:

  • shared upstream cadence is clearly unstable enough to remain suspicious
  • but this alone did not localize the final click mechanism

4. Pre-demod stage showed repeated hard phase anomalies even when energy looked healthy

Representative live values for normal non-vanishing signals:

  • iq.pre_demod.head_mean_mag around 0.25–0.31
  • iq.pre_demod.head_low_magnitude_count = 0
  • iq.pre_demod.head_max_step repeatedly high, including roughly:
    • 1.5
    • 2.0
    • 2.4
    • 2.8
    • 3.08

Interpretation:

  • not primarily an amplitude collapse
  • rather a strong phase/continuity defect reaching the pre-demod stage

5. Audio stage still showed real block-edge artifacts

Representative values:

  • audio.demod.edge_delta_abs repeatedly around 0.4–0.8
  • outliers up to roughly 1.21 and 1.26
  • audio.demod_boundary.count continued to fire repeatedly

Interpretation:

  • demod is where the problem becomes audible, but the root cause still appeared to be earlier/shared

Key extractor findings from the new telemetry

A. Per-signal extractor output boundary is genuinely broken

For a representative strong signal (signal_id=2), iq.extract.output.boundary.delta_phase repeatedly showed very large jumps such as:

  • 2.60
  • 3.06
  • 2.14
  • 2.71
  • 3.09
  • 2.92
  • 2.63
  • 2.78

Also observed for iq.extract.output.boundary.discontinuity_score:

  • 2.86
  • 3.08
  • 2.92
  • 2.52
  • 2.40
  • 2.85

Later runs using d2 made the discontinuity even easier to see. Representative iq.extract.output.boundary.d2 values for the same strong signal included:

  • 0.347
  • 0.303
  • 0.362
  • 0.359
  • 0.382
  • 0.344
  • 0.337
  • 0.206

At the same time, iq.extract.output.boundary.delta_mag was often comparatively small (examples around 0.0003–0.0038).

Interpretation:

  • the main boundary defect is not primarily amplitude mismatch
  • it is much more consistent with complex/phase discontinuity across output blocks

B. The raw extractor head is systematically bad on all signals

The new extract_raw_head_probe events were the strongest finding of the day.

Representative repeated pattern for strong signals (signal_id=1 and signal_id=2):

  • first_nonzero_index = 1
  • zero_count = 1
  • first magnitude sample exactly 0
  • then a short ramp: e.g. for signal_id=2
    • 0
    • 0.000388
    • 0.002316
    • 0.004152
    • 0.019126
    • 0.011418
    • 0.124034
    • 0.257569
    • 0.317579
  • head_max_step often near π, e.g.:
    • 3.141592653589793
    • 3.088773696463606
    • 3.0106854446936318
    • 2.9794833659932527

The same qualitative pattern appeared for weaker signals too:

  • raw head starts at 0
  • a brief near-zero ramp follows
  • only after several samples does the magnitude look like a normal extracted band

Interpretation:

  • the raw extractor output head is already damaged / settling / invalid before trimming
  • this strongly supports an upstream/shared-start-condition problem rather than a trim-created artifact

C. The trimmed extractor head usually looks sane

Representative repeated pattern for the same signals after trim_samples = 64:

  • first_nonzero_index = 0
  • zero_count = 0
  • magnitudes look immediately plausible and stable
  • head_max_step is dramatically lower than raw, often around 0.15–0.9 for strong channels

Example trimmed head magnitudes for signal_id=2:

  • 0.299350
  • 0.300954
  • 0.298032
  • 0.298738
  • 0.312258
  • 0.296932
  • 0.239010
  • 0.266881
  • 0.313193

Example trimmed head magnitudes for signal_id=1:

  • 0.277400
  • 0.275994
  • 0.273718
  • 0.272846
  • 0.277842
  • 0.278398
  • 0.268829
  • 0.273790
  • 0.279031

Interpretation:

  • trimming is removing a genuinely bad raw head region
  • trimming is therefore not the main origin of the problem
  • it acts more like cleanup of an already bad upstream/raw start region

Input-vs-raw-vs-trimmed extractor result (important refinement)

A later, more targeted telemetry pass added a direct probe on the signal-specific extractor input head (extract_input_head_probe) and compared it against the raw and trimmed extractor output heads.

This materially refined the earlier conclusion.

Input-head result

Representative values from iq.extract.input_head.*:

  • iq.extract.input_head.zero_count = 0
  • iq.extract.input_head.first_nonzero_index = 0

Interpretation:

  • the signal-specific input head going into the GPU extractor is not starting with a zero sample
  • the head is not arriving already dead/null from the immediate input probe point

Raw-head result

Representative values from iq.extract.raw.*:

  • iq.extract.raw.head_mag = 0
  • iq.extract.raw.head_zero_count = 1
  • iq.extract.raw.head_max_step frequently around 2.4–3.14

These values repeated for strong channels such as signal_id=2, and similarly across other signals.

Interpretation:

  • the first raw output sample is repeatedly exactly zero
  • therefore the visibly bad raw head is being created after the probed input head and before/during raw extractor output generation

Trimmed-head result

Representative values from iq.extract.trimmed.*:

  • iq.extract.trimmed.head_zero_count = 0
  • iq.extract.trimmed.head_mag often looked healthy immediately after trimming, for example:
    • signal 1: about 0.275–0.300
    • signal 2: about 0.311
  • iq.extract.trimmed.head_max_step was much lower than raw for strong channels, often around:
    • 0.11
    • 0.14
    • 0.19
    • 0.30
    • 0.75

Interpretation:

  • trimming cleans up the visibly bad raw head region
  • trimming still does not explain the deeper output-boundary continuity issue

Further refinement after direct extractor-input and GPU-kernel probes

A final telemetry round added:

  • extract_input_head_probe
  • gpu_kernel_input_head_probe
  • gpu_kernel_output_head_probe

These probes further sharpened the likely fault location.

Signal-specific extractor input head looked sane

Representative values:

  • iq.extract.input_head.zero_count = 0
  • iq.extract.input_head.first_nonzero_index = 0

Interpretation:

  • at the observed signal-specific input probe point, the GPU extractor is not receiving a dead/null head

Raw GPU output head remained systematically broken

Representative repeated values:

  • iq.extract.raw.head_mag = 0
  • iq.extract.raw.head_zero_count = 1
  • iq.extract.raw.head_max_step repeatedly around:
    • 3.141592653589793
    • 3.122847934305907
    • 3.101915352902961
    • 3.080672178550904
    • 3.062425574273907
    • 2.9785041567778427
    • 2.7508533785793476

Representative repeated examples from strong channels:

  • signal 2: head_mag = 0, head_zero_count = 1
  • signal 3: head_mag = 0, head_zero_count = 1
  • signal 1/4 showed the same qualitative head-zero pattern as well

Interpretation:

  • the raw extractor output head is still repeatedly born broken
  • the problem is therefore after the currently probed input head and before/during raw output creation

Trimmed head still looked healthier

Representative values:

  • iq.extract.trimmed.head_zero_count = 0
  • signal 1 iq.extract.trimmed.head_mag repeatedly around:
    • 0.2868
    • 0.2907
    • 0.3036
    • 0.3116
    • 0.2838
    • 0.2760
  • signal 2 examples:
    • 0.3461
    • 0.3182

Representative iq.extract.trimmed.head_max_step values for strong channels were much lower than raw, often around:

  • 0.11
  • 0.13
  • 0.21
  • 0.30
  • 0.44
  • 0.69
  • 0.86

Interpretation:

  • trimming still removes the most visibly broken head region
  • but trimming does not explain the deeper output-boundary continuity issue

Refined strongest current conclusion after the full 2026-03-25 telemetry pass

The strongest current reading is now:

The click root cause is very likely not that the signal-specific extractor input already starts dead/null. Instead, the bad raw head appears to be introduced inside the GPU extractor path itself (or at its immediate start/output semantics) before final trimming.

More specifically:

  • signal-specific extractor input head looks non-zero and sane at the probe point
  • raw GPU output head still repeatedly starts with an exact zero sample and a short bad settling region
  • the trimmed head usually looks healthier
  • yet the final extractor output still exhibits significant complex boundary discontinuity from block to block

This now points away from a simple “shared global input head is already zero” theory and toward one of these narrower causes:

  1. GPU extractor kernel start semantics / warmup / first-output handling
  2. phase-start or alignment handling at extractor block start
  3. raw GPU output assembly semantics within the extractor path

What should not be forgotten from this stage

  • The overlap-prepend bug was real and worth fixing, but was not sufficient.
  • The fixed read-size path (SDR_FORCE_FIXED_STREAM_READ_SAMPLES=389120) remains useful and likely worth promoting later, but it is not the root-cause fix.
  • The telemetry system itself can perturb runs if overused; conservative sampling matters.
  • config.autosave.yaml must be kept in sync with config.yaml or telemetry defaults can silently revert after restart.
  • The most promising root-cause area is now the shared upstream/extractor-start boundary path, not downstream playback.

2026-03-25 refactor work status (post-reviewer instruction)

After the reviewer guidance, work pivoted away from symptomatic patching and onto the required two-track architecture change:

Track 1 — CPU/oracle path repair (in progress)

The following was added to start building a trustworthy streaming oracle:

  • internal/demod/gpudemod/streaming_types.go
  • internal/demod/gpudemod/cpu_oracle.go
  • internal/demod/gpudemod/cpu_oracle_test.go
  • internal/demod/gpudemod/streaming_oracle_extract.go
  • internal/demod/gpudemod/polyphase.go
  • internal/demod/gpudemod/polyphase_test.go

What exists now:

  • explicit StreamingExtractJob / StreamingExtractResult
  • explicit CPUOracleState
  • exact integer decimation enforcement (ExactIntegerDecimation)
  • monolithic-vs-chunked CPU oracle test
  • explicit polyphase tap layout (phase-major)
  • CPU oracle direct-vs-polyphase equivalence test
  • persistent CPU oracle runner state keyed by signal ID
  • config-hash reset behavior
  • cleanup of disappeared signals from oracle state
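
The "phase-major" tap layout mentioned above can be sketched as follows. This is an illustrative reconstruction, not the actual code in polyphase.go; the function name and exact padding/ordering are assumptions.

```go
package main

import "fmt"

// reorderPhaseMajor regroups a flat FIR tap array h[0..N-1] into phase-major
// layout for decimation by decim: phase p holds taps h[p], h[p+decim], ...
// Sketch only; the real layout in polyphase.go may differ in padding/order.
func reorderPhaseMajor(taps []float64, decim int) [][]float64 {
	phases := make([][]float64, decim)
	for p := 0; p < decim; p++ {
		for i := p; i < len(taps); i += decim {
			phases[p] = append(phases[p], taps[i])
		}
	}
	return phases
}

func main() {
	taps := []float64{0, 1, 2, 3, 4, 5, 6, 7} // toy 8-tap filter
	for p, ph := range reorderPhaseMajor(taps, 4) {
		fmt.Println(p, ph)
	}
	// 0 [0 4]
	// 1 [1 5]
	// 2 [2 6]
	// 3 [3 7]
}
```

Phase-major storage makes each decimation phase's taps contiguous, which is what the direct-vs-polyphase equivalence test exercises.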

Important limitation:

  • this is not finished production validation yet
  • the CPU oracle path is being built toward the reviewer’s required semantics, but it is not yet the final signed-off oracle for GPU validation

Track 2 — GPU path architecture refactor (in progress)

The following was added to begin the new stateful GPU architecture:

  • internal/demod/gpudemod/stream_state.go
  • internal/demod/gpudemod/streaming_gpu_stub.go
  • docs/gpu-streaming-refactor-plan-2026-03-25.md
  • cmd/sdrd/streaming_refactor.go

What exists now:

  • explicit ExtractStreamState
  • batch-runner-owned per-signal state map
  • config-hash reset behavior for GPU-side stream state
  • exact integer decimation enforcement in relevant batch path
  • base taps and polyphase taps initialized into GPU-side stream state
  • explicit future production entry point: StreamingExtractGPU(...)
  • explicit separation between current legacy extractor path and the new streaming/oracle path
  • persistent oracle-runner lifecycle hooks, including reset on stream-drop events
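
The per-signal state map with config-hash reset can be sketched like this. The struct and function names here are illustrative stand-ins, not the actual ExtractStreamState definition in stream_state.go.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// streamState is an illustrative stand-in for ExtractStreamState: persistent
// per-signal continuity data plus the hash of the config that produced it.
type streamState struct {
	configHash [32]byte
	ncoPhase   float64
	phaseCount int
	history    []complex128
}

// stateFor returns the existing state for id, resetting it whenever the
// config hash changed (the reset-on-config-change behavior described above).
func stateFor(states map[string]*streamState, id, cfg string) *streamState {
	h := sha256.Sum256([]byte(cfg))
	st, ok := states[id]
	if !ok || st.configHash != h {
		st = &streamState{configHash: h}
		states[id] = st
	}
	return st
}

func main() {
	states := map[string]*streamState{}
	a := stateFor(states, "sig1", "decim=8")
	a.phaseCount = 3
	fmt.Println(stateFor(states, "sig1", "decim=8").phaseCount) // same config: 3
	fmt.Println(stateFor(states, "sig1", "decim=4").phaseCount) // config changed: 0
}
```

Keying the reset on a config hash rather than on individual fields is what makes the reset semantics testable without enumerating every tunable parameter.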

Important limitation:

  • the new GPU production path is not implemented yet
  • the legacy overlap+trim production path still exists and is still the current active path
  • the new GPU entry point currently exists as an explicit architectural boundary and state owner, not as the finished stateful polyphase kernel path

Tests currently passing during refactor

Repeatedly verified during the refactor work:

  • go test ./internal/demod/gpudemod/...
  • go test ./cmd/sdrd/...

Incremental progress reached so far inside the refactor

Additional progress after the initial refactor scaffolding:

  • the CPU oracle runner now uses the explicit polyphase oracle path (CPUOracleExtractPolyphase) instead of only carrying polyphase tap data passively
  • the CPU oracle now has a direct-vs-polyphase equivalence test
  • the GPU-side stream state now initializes both BaseTaps and PolyphaseTaps
  • the GPU side now has an explicit future production entry point StreamingExtractGPU(...)
  • the GPU streaming stub now advances NCOPhase over NEW samples only
  • the GPU streaming stub now advances PhaseCount modulo exact integer decimation
  • the GPU streaming stub now builds and persists ShiftedHistory from already frequency-shifted NEW samples
  • the new streaming/oracle path is explicitly separated from the current legacy overlap+trim production path
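
The NEW-samples-only NCO advance and the modulo phase-count carry can be sketched as below. Function names are illustrative; the stub's actual code lives in streaming_gpu_stub.go.

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// shiftNew mixes only the NEW samples by the NCO, continuing ncoPhase from
// the previous call; history samples were already shifted when first seen.
// freqNorm is the shift in cycles per sample. Names are illustrative.
func shiftNew(newSamples []complex128, ncoPhase, freqNorm float64) ([]complex128, float64) {
	out := make([]complex128, len(newSamples))
	for i, s := range newSamples {
		out[i] = s * cmplx.Exp(complex(0, -2*math.Pi*ncoPhase))
		ncoPhase += freqNorm
		ncoPhase -= math.Floor(ncoPhase) // keep phase in [0,1)
	}
	return out, ncoPhase
}

// advancePhaseCount carries the decimation phase across block boundaries.
func advancePhaseCount(phaseCount, nNew, decim int) int {
	return (phaseCount + nNew) % decim
}

func main() {
	_, phase := shiftNew(make([]complex128, 1000), 0, 0.125)
	fmt.Println(phase)                         // 1000 * 0.125 wraps to 0
	fmt.Println(advancePhaseCount(5, 1000, 8)) // (5 + 1000) % 8 = 5
}
```

Advancing phase over NEW samples only is what prevents the double-counting that an overlap-prepend scheme invites: history samples must not re-advance the NCO or the decimation phase on the next call.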

Important current limitation:

  • StreamingExtractGPU(...) still intentionally returns a not-implemented error rather than pretending to be the finished production path
  • this is deliberate, to avoid hidden quick-fix semantics or silent goalpost shifts

Additional note on the latest step:

  • the GPU streaming stub now also reports an estimated output-count schedule (NOut) derived from NEW sample consumption plus carried PhaseCount
  • this still does not make it a production path; it only means the stub now models output cadence semantics more honestly
  • the new CPU/oracle path is also now exposing additional runtime telemetry such as streaming.oracle.rate and streaming.oracle.output_len, so the reference path becomes easier to inspect as it matures
  • a reusable complex-slice comparison helper now exists (CompareComplexSlices) to support later oracle-vs-GPU equivalence work without improvising comparison logic at the last minute
  • a dedicated TestCPUOracleMonolithicVsChunkedPolyphase now verifies chunked-vs-monolithic self-consistency for the polyphase oracle path specifically
  • explicit reset tests now exist for both CPU oracle state and GPU streaming state, so config-change reset semantics are no longer only implicit in code review
  • a dedicated ExtractDebugMetrics structure now exists as a future comparison/telemetry contract for reviewer-required state/error/boundary metrics
  • the first mapper from oracle results into that debug-metric structure now exists, so the comparison contract is beginning to attach to real refactor code rather than staying purely conceptual
  • the same minimal debug-metric mapping now also exists for GPU-stub results, so both sides of the future GPU-vs-oracle comparison now have an initial common reporting shape
  • a first comparison-pipeline helper now exists to turn oracle-vs-GPU-stub results into shared CompareStats / ExtractDebugMetrics output, even though the GPU path is still intentionally incomplete
  • that comparison helper is now also covered by a dedicated unit test, so even the scaffolding around future GPU-vs-oracle validation is being locked down incrementally
  • GPU-side stream-state initialization is now also unit-tested (Decim, BaseTaps, PolyphaseTaps, ShiftedHistory capacity), so the new state ownership layer is no longer just trusted by inspection
  • the GPU streaming stub now also has a dedicated test proving that it advances persistent state while still explicitly failing as a not-yet-implemented production path
  • at this point, enough scaffolding exists that the next sensible step is to build the broader validation/test harness in one larger pass before continuing the actual production-path rewrite
  • that harness pass has now happened: deterministic IQ/tone fixtures, harness config/state builders, chunked polyphase oracle runners, and additional validation tests now exist, so the next step is back to the actual production-path rewrite
  • the first non-stub NEW-samples-only production-like path now exists as StreamingExtractGPUHostOracle(...): it is still host-side, but it executes the new streaming/stateful semantics and therefore serves as a concrete bridge between pure test infrastructure and the eventual real GPU production path
  • that host-side production-like path is now directly compared against the CPU oracle in tests and currently matches within tight tolerance, which is an important confidence step before any real CUDA-path replacement
  • the canonical new production entry point StreamingExtractGPU(...) is now structurally wired so that the host-side production-like implementation can sit behind the same API later, without forcing a premature switch today
  • a top-level cmd/sdrd production path hook now exists as well (extractForStreamingProduction plus useStreamingProductionPath=false), so the new architecture is no longer isolated to internal packages only
  • the new production path now also emits first-class output/heading telemetry (rate, output_len, head_mean_mag, head_max_step) in addition to pure state counters, which will make activation/debugging easier later
  • a top-level comparison observation hook now also exists in cmd/sdrd, so oracle-vs-production metrics no longer have to remain buried inside internal package helpers
  • after the broader monitoring/comparison consolidation pass, the next agreed work mode is to continue in larger clusters rather than micro-steps: (1) wire the new production semantics more deeply, (2) isolate the legacy path more sharply, (3) keep preparing the eventual real GPU production path behind the same architecture
  • after the first larger cluster, the next explicit target is to complete Cluster B: make the host-oracle bridge sit more naturally behind the new production execution architecture, rather than leaving production-path semantics spread across loosely connected files
  • after Cluster B, the remaining GPU rewrite work is now best split into two explicit parts: C1 = prepare and C2 = definitive implementation, so the project can keep momentum without pretending that the final CUDA/stateful production path is already done
  • Cluster B is now effectively complete: CPU oracle runner, host-oracle production-like path, and top-level production comparison all share the same host streaming core, and that common core is directly tested against the polyphase oracle
  • Cluster C1 is now also complete: the new GPU production layer has an explicit invocation contract, execution-result contract, state handoff/build/apply stages, and a host-side execution strategy already running behind the same model
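
The NOut output-count schedule mentioned above (outputs derived from NEW-sample consumption plus carried PhaseCount) can be sketched under one common convention; the stub's exact emission rule may differ in detail, so treat this as an assumption.

```go
package main

import "fmt"

// outputSchedule returns how many decimated outputs a block of nNew samples
// yields given the carried phase count, plus the phase count to carry forward.
// One output per decim input samples, with the remainder carried; this is one
// common convention and may differ in detail from the stub's exact rule.
func outputSchedule(phaseCount, nNew, decim int) (nOut, nextPhase int) {
	total := phaseCount + nNew
	return total / decim, total % decim
}

func main() {
	// Three consecutive blocks of 1001 samples at decim=8: the remainder
	// accumulates across blocks instead of being dropped at each boundary.
	phase := 0
	for i := 0; i < 3; i++ {
		var n int
		n, phase = outputSchedule(phase, 1001, 8)
		fmt.Println(n, phase)
	}
	// 125 1
	// 125 2
	// 125 3
}
```

The point of carrying `nextPhase` is exactly the no-rounding rule: over many blocks the total output count matches the monolithic computation, which is what the chunked-vs-monolithic oracle tests check.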

Current refactor status before C2

At this point the project has:

  • a corrected streaming/oracle architecture direction
  • a shared host-side streaming core used by both the CPU oracle runner and the host-side production-like bridge
  • explicit production-path hooks in cmd/sdrd
  • comparison and monitoring scaffolding above and below the execution layer
  • a prepared GPU execution contract (StreamingGPUInvocation / StreamingGPUExecutionResult)

What it does not have yet:

  • a real native CUDA streaming/polyphase execution entry point with history-in/history-out and phase-count in/out semantics
  • a real CUDA-backed implementation behind StreamingExtractGPUExec(...)
  • completed GPU-vs-oracle validation on the final native execution path

C2 plan

C2-A — native CUDA / bridge entry preparation

Goal:

  • introduce the real native entry shape for stateful streaming/polyphase execution

Status note before starting C2-A:

  • C2 is not honestly complete yet because the native CUDA side still only exposes the old separate freq-shift/FIR/decimate pieces.
  • Therefore C2-A must begin by creating the real native entry shape rather than continuing to stack more Go-only abstractions on top of the old kernels.

Required outcomes:

  • explicit native/CUDA function signature for streaming execution
  • bridge bindings for history in/out, phase count in/out, new samples in, outputs out
  • Go-side wrapper ready to call the new native path through the prepared invocation/result model
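
A hypothetical Go-side shape for that contract is sketched below, purely to make the required history-in/out and phase-in/out round-trip concrete. All names and field types here are invented for illustration; the real invocation/result model and the Windows bridge bindings will differ.

```go
package main

import "fmt"

// nativeStreamingCall is a hypothetical sketch of the Go-side wrapper input
// for the new native entry point: it makes the state round-trip explicit.
type nativeStreamingCall struct {
	NewSamples []complex64 // new input block (device upload)
	HistoryIn  []complex64 // shifted history carried from the previous call
	PhaseIn    int         // decimation phase count carried in
	Taps       []float32   // phase-major polyphase taps
	Decim      int
}

// nativeStreamingResult is the matching hypothetical output contract.
type nativeStreamingResult struct {
	Outputs    []complex64 // decimated outputs read back from the device
	HistoryOut []complex64 // updated history to persist for the next call
	PhaseOut   int         // updated phase count to persist
}

func main() {
	call := nativeStreamingCall{NewSamples: make([]complex64, 1024), Decim: 8, PhaseIn: 3}
	// A real implementation would execute the native kernel here; this sketch
	// only shows the contract shape: state in, state out.
	res := nativeStreamingResult{PhaseOut: (call.PhaseIn + len(call.NewSamples)) % call.Decim}
	fmt.Println(res.PhaseOut) // (3 + 1024) % 8 = 3
}
```

Making history and phase explicit call parameters (rather than hidden native globals) is what keeps the Go-side state owner authoritative, matching the batch-runner-owned state map already in place.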

C2-B — definitive execution implementation hookup

Goal:

  • put a real native CUDA-backed execution strategy behind StreamingExtractGPUExec(...)

Status note after C2-A:

  • the native entry shape now exists in CUDA, the Windows bridge can resolve it, and the Go execution layer can route into a native-prepared strategy.
  • what is still missing for C2-B is the actual stateful execution body behind that new native entrypoint.
  • therefore C2-B now means exactly one serious thing: replace the current placeholder body of the new native entrypoint with real stateful streaming/polyphase execution semantics, rather than adding more scaffolding around it.
  • C2-B is now materially done: the new native entrypoint no longer returns only placeholder state, and the Go native execution path now uploads inputs/history/taps, runs the new native function, and reads back outputs plus updated state.
  • when the new exact-integer streaming decimation rules were turned on, an immediate runtime integration issue appeared: previous WFM extraction defaults expected outRate=500000, but the live sample rate was 4096000, which is not exactly divisible by 500000 (4096000 / 500000 = 8.192). The correct fix is to align streaming defaults with the new integer-decimation model instead of trying to preserve the old rounded-ratio behavior.
  • the concrete immediate adjustment made for this was: wfmStreamOutRate = 512000 (instead of 500000), because 4096000 / 512000 = 8 is exactly divisible and therefore consistent with the new streaming architecture’s no-rounding rule.
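
The divisibility rule behind that adjustment can be sketched as a small check; the function name here is illustrative, not the actual validation code.

```go
package main

import "fmt"

// checkExactDecimation mirrors the no-rounding rule: the output rate must
// divide the input rate exactly, otherwise the configuration is rejected.
func checkExactDecimation(inRate, outRate int) (int, error) {
	if outRate <= 0 || inRate%outRate != 0 {
		return 0, fmt.Errorf("inRate %d not exactly divisible by outRate %d", inRate, outRate)
	}
	return inRate / outRate, nil
}

func main() {
	if _, err := checkExactDecimation(4096000, 500000); err != nil {
		fmt.Println(err) // old default: 4096000/500000 = 8.192, rejected
	}
	decim, _ := checkExactDecimation(4096000, 512000)
	fmt.Println(decim) // new default: exact decimation factor 8
}
```

Rejecting non-integer ratios up front is preferable to silently rounding, since a rounded ratio drifts the output timebase and produces exactly the kind of boundary artifacts under investigation.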

Required outcomes:

  • StreamingExtractGPUExec(...) can execute a real native stateful path
  • host-oracle bridge remains available only as a comparison/support path, not as the disguised production implementation
  • state apply/backflow goes through the already prepared invocation/result contract

C2-C — final validation and serious completion gate

Goal:

  • validate the real CUDA-backed path against the corrected oracle and make the completion criterion explicit

Required outcomes:

  • GPU-vs-oracle comparison active on the real native path
  • test coverage and runtime comparison hooks in place
  • after C2-C, the CUDA story must be treated as complete, correct, and serious — not half-switched or pseudo-finished

Why the refactor is intentionally incremental

The reviewer explicitly required:

  • no start-index-only production patch
  • no continued reliance on overlap+trim as final continuity model
  • no silent decimation rounding
  • no GPU sign-off without a corrected CPU oracle

Because of that, the work is being done in ordered layers:

  1. define streaming types and state
  2. build the CPU oracle with exact streaming semantics
  3. establish shared polyphase/tap semantics
  4. prepare GPU-side persistent state ownership
  5. only then replace the actual production GPU execution path

This means the repo now contains partially completed new architecture pieces that are deliberate stepping stones, not abandoned half-fixes.

Reviewer package artifacts created for second-opinion review

To support external/secondary review of the GPU extractor path, a focused reviewer package was created in the project root:

  • reviewer-gpu-extractor-package/
  • reviewer-gpu-extractor-package.zip
  • reviewer-gpu-extractor-package.json

The package intentionally contains:

  • relevant GPU extractor / kernel code
  • surrounding host-path code needed for context
  • current debug notes
  • a reviewer brief
  • a short reviewer prompt
  • relevant config files used during live telemetry work

The JSON variant is uncompressed: it stores all included package files in a single JSON document as a list of entries, each containing:

  • path
  • content

This was created specifically so the same reviewer payload can be consumed by tools or APIs that prefer a single structured text file instead of a ZIP archive.
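
A minimal illustration of that shape is below. The top-level `files` key and the specific entries are invented for illustration; the actual document may use a bare array or different wrapping.

```json
{
  "files": [
    {
      "path": "internal/demod/gpudemod/cpu_oracle.go",
      "content": "package gpudemod\n// ...file contents...\n"
    },
    {
      "path": "docs/gpu-streaming-refactor-plan-2026-03-25.md",
      "content": "# GPU streaming refactor plan\n...\n"
    }
  ]
}
```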


Meta note

This investigation already disproved several plausible explanations. That is progress.

The most important thing not to forget is:

  • the overlap prepend bug was real, but not sufficient
  • the click is already present in demod audio
  • whole-process CPU saturation is not the main explanation
  • excessive debug instrumentation can itself create misleading secondary problems
  • the 2026-03-25 extractor telemetry strongly suggests the remaining root cause is upstream of the final trim stage