# Audio Click Debug Notes — 2026-03-24

## Context

This note captures the intermediate findings from the live/recording audio click investigation on `sdr-wideband-suite`. Goal: preserve the reasoning, experiments, false leads, and current best understanding so future work does not restart from scratch.

---

## High-level outcome so far

We do **not** yet have the final root cause. But we now know substantially more about what the clicks are **not**, and we identified at least one real bug plus several strong behavioral constraints in the pipeline.

---

## What was tested

### 1. Session/context recovery

- Reconstructed prior debugging context from reset-session backup files.
- Confirmed the relevant investigation was the persistent audio clicking bug in live audio / recordings.

### 2. Codebase deep-read

Reviewed in detail:

- `cmd/sdrd/dsp_loop.go`
- `cmd/sdrd/pipeline_runtime.go`
- `cmd/sdrd/helpers.go`
- `internal/recorder/streamer.go`
- `internal/recorder/demod_live.go`
- `internal/dsp/fir.go`
- `internal/dsp/fir_stateful.go`
- `internal/dsp/resample.go`
- `internal/demod/fm.go`
- `internal/demod/gpudemod/*`
- `web/app.js`

Main conclusion from static reading: the pipeline contains several stateful continuity mechanisms, so clicks are likely to emerge at boundaries or from phase/timing inconsistencies rather than from one obvious isolated bug.

### 3. AM vs FM tests

Observed by ear:

- AM clicks too.
- Therefore this is **not** an FM-only issue.
- That shifted focus away from purely FM-specific explanations and toward shared-path / continuity / transport / demod-adjacent causes.

### 4. Recording vs live path comparison

Observed by ear:

- Recordings click too.
- Therefore browser/WebSocket/live playback is **not** the sole cause.
- The root problem exists in the server-side audio pipeline before browser playback.

### 5. Boundary instrumentation added

Temporary diagnostics were added to inspect:

- extract trimming
- snippet lengths
- demod path lengths
- boundary click / intra-click detector
- IQ continuity at various stages

### 6. Discriminator-overlap hypothesis

A test switch temporarily disabled the extra 1-sample discriminator overlap prepend in `streamer.go`.

Result:

- This extra overlap **was** a real problem.
- It caused the downstream decimation phase to flip between blocks.
- Removing it cleaned up the boundary model and was the correct change.

However:

- Removing it did **not** eliminate the audible clicks.
- Therefore it was a real bug, but **not the main remaining root cause**.

### 7. GPU vs CPU extraction test

Forced CPU-only stream extraction.

Result:

- CPU-only made things dramatically worse in real time.
- Large `feed_gap` values appeared.
- Huge backlogs built up.
- Therefore CPU-only is not a solution, and the GPU path is not the sole main problem.

### 8. Fixed read-size test

Forced a constant extraction read size (`389120`) instead of variable read sizing based on backlog.

Result:

- `allIQ`, `gpuIQ_len`, `raw_len`, and `out_len` became very stable.
- This reduced pipeline variability and made logs much cleaner.
- Subjectively, audio may have become slightly better, but clicks remained.
- Therefore variable block sizing is likely a contributing factor, but not the full explanation.

### 9. Multi-stage audio dump test

Added optional debug dumping for:

- demod audio (`*-demod.wav`)
- final audio after resampler (`*-final.wav`)

Observed by ear:

- Clicks are present in **both** dump types.
- Therefore the click is already present by the time demodulated audio exists.
- The resampler/final audio path is not the primary origin.

### 10. CPU monitoring

A process-level CSV monitor was added and used.

Result:

- Overall process CPU usage was modest (not near full machine saturation).
- This does **not** support “overall CPU is pegged” as the main explanation.
- Caveat: this does not fully exclude a hot thread or scheduler issue, but gross total CPU overload is not the main story.

---

## What we now know with reasonable confidence

### A. The issue is not primarily caused by:

- Browser playback
- WebSocket transport
- Final PCM fanout only
- Resampler alone
- CPU-only vs GPU-only as the core dichotomy
- The old extra discriminator overlap prepend (that was a bug, but not the remaining dominant one)
- Purely variable block sizes alone
- Gross whole-process CPU saturation

### B. The issue is server-side and exists before final playback

Because:

- recordings click
- the demod dump clicks
- the final dump clicks

### C. The issue is present by the demodulated audio stage

This is one of the strongest current findings.

### D. The WFM/FM-demod-adjacent path remains highly suspicious

Current best area of suspicion:

- decimated IQ may still contain subtle corruption/instability not fully captured by current metrics
- OR the FM discriminator (`fmDiscrim`) is producing pathological output from otherwise “boundary-clean-looking” IQ

---

## Important runtime/pathology observations

### 1. Backlog amplification is real

Several debug runs showed severe buffer growth and drops:

- large `buf=` values
- growing `drop=` counts
- repeated `audio_gap`

This means some debug configurations can easily become self-distorting and produce additional artifacts that are not representative of the original bug.

### 2. Too much debug output causes self-inflicted load

At one point:

- the rate limiter was disabled (`rate_limit_ms: 0`)
- aggressive boundary logging was enabled
- many short WAV files were generated

This clearly increased overhead and likely polluted some runs.

### 3. Many short WAVs were a bad debug design

That approach was replaced with a design intended to write one continuous window file instead of many micro-files.

### 4. Total process CPU saturation does not appear to be the main cause

A process-level CSV monitor was collected and showed only modest total CPU utilisation during the relevant tests. This does **not** support a simple “the machine is pegged” explanation. A hot thread / scheduling issue is still theoretically possible, but gross overall CPU overload is not the main signal.

---

## Current debug state in repo

### Branch

All current work is on:

- `debug/audio-clicks`

### Commits so far

- `94c132d` — `debug: instrument audio click investigation`
- `ffbc45d` — `debug: add advanced boundary metering`

### Current config/logging state

The active debug logging was trimmed down to:

- `demod`
- `discrim`
- `gap`
- `boundary`

The rate limit is currently back to a nonzero value to avoid self-induced spam.

### Dump/CPU debug state

A `debug:` config section was added with:

- `audio_dump_enabled: false`
- `cpu_monitoring: false`

Meaning:

- heavy WAV dumping is now OFF by default
- CPU monitoring is conceptually OFF by default (the script still exists, but must be explicitly used)

---

## Most important code changes/findings to remember

### 1. Removed the extra discriminator overlap prepend in `streamer.go`

This was a correct fix.

Reason:

- it introduced a blockwise extra IQ sample
- this shifted the decimation phase between blocks
- it created real boundary artifacts

This should **not** be reintroduced casually.

### 2. Fixed read-size test exists and is useful for investigation

A temporary mechanism exists to force stable extraction block sizes. This is useful diagnostically because it removes one source of pipeline variability.

**IMPORTANT DECISION / DO NOT LOSE:**

- The fixed read-size path currently lives behind the environment variable `SDR_FORCE_FIXED_STREAM_READ_SAMPLES`.
- The tested value `389120` clearly helps by making `allIQ`, `gpuIQ_len`, `raw_len`, and `out_len` much more stable and by reducing one major source of pipeline variability.
- Current plan: **once the remaining click root cause is solved, promote this behavior into the normal code path instead of leaving it as an env-var-only debug switch.**
- In other words: treat fixed read sizing as a likely permanent stabilization improvement, but do not bake it in blindly until the click investigation is complete.

### 3. FM discriminator metering exists

`internal/demod/fm.go` now emits targeted discriminator stats under `discrim` logging, including:

- min/max IQ magnitude
- maximum absolute phase step
- count of large phase steps

This was useful to establish that large discriminator steps correlate with low IQ magnitude, but discriminator logging was later disabled from the active category list to reduce log spam.

### 4. Strong `dec`-IQ findings before demod

Additional metering in `streamer.go` showed:

- repeated `dec_iq_head_dip`
- repeated low magnitude near `min_idx ~= 25`
- repeated early large local phase steps near `max_step_idx ~= 24`
- repeated `demod_boundary` and audible clicks shortly afterward

This is the strongest currently known mechanism in the chain.

### 5. Group delay observation

For the current pre-demod FIR:

- taps = `101`
- FIR group delay = `(101 - 1) / 2 = 50` input samples
- with `decim1 = 2`, this projects to about `25` output samples

This matches the repeatedly observed problematic `dec` indices (~24-25) remarkably well. That strongly suggests the audible issue is connected to the FIR/decimation settling region at the beginning of the `dec` block.

### 6. Pre-FIR vs post-FIR comparison

A dedicated pre-FIR probe was added on `fullSnip` (the input to the pre-demod FIR) and compared against the existing `dec`-side probes.
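The group-delay arithmetic from point 5 can be spot-checked with a few lines of Go. This is a standalone sketch; the helper names are illustrative and not from the repo:

```go
package main

import "fmt"

// firGroupDelay returns the group delay of a symmetric (linear-phase)
// FIR filter in input samples: (N - 1) / 2 for N taps.
func firGroupDelay(taps int) int {
	return (taps - 1) / 2
}

// projectedSettlingIndex projects the input-side group delay through a
// decimation stage to an approximate output-sample index.
func projectedSettlingIndex(taps, decim int) int {
	return firGroupDelay(taps) / decim
}

func main() {
	// Values from the notes: 101 taps, decim1 = 2.
	fmt.Println(firGroupDelay(101))             // 50 input samples
	fmt.Println(projectedSettlingIndex(101, 2)) // 25 output samples
}
```

The projected index of 25 lines up with the observed problematic `dec` indices (~24-25) noted above.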
Observed pattern:

- the pre-FIR head probe usually looked relatively normal
- no equally strong or equally reproducible hot spot appeared there
- after FIR + decimation, the problematic dip/step repeatedly appeared near `dec` indices ~24-25

Interpretation:

- the strongest currently observed defect is **not** already present in the same form before the FIR
- it is much more likely to emerge in the FIR/decimation section (or its settling behavior) than in the raw pre-FIR input

### 7. Head-trim test results

A debug head-trim on `dec` was tested.

Subjective result:

- `trim=32` sounded best among the tested values (`16/32/48/64`)
- but it did **not** remove the clicks entirely

Interpretation:

- the early `dec` settling region is a real contributor
- but it is probably not the only contributor, or trimming alone is not the final correct fix

### 8. Current architectural conclusion

The likely clean fix is **not** to keep trimming samples away. The FIR/decimation section is still suspicious, but later tests showed it is likely not the sole origin.

Important nuance:

- the currently suspicious FIR + decimation section is already running in **Go/CPU** (`processSnippet`), not in CUDA
- therefore the next correctness fix should be developed and validated in Go first

Later update:

- a stateful decimating FIR / polyphase-style replacement was implemented in Go and tested
- it was architecturally cleaner than the old separated FIR->decimate handoff
- but it did **not** remove the recurring hot spot / clicks
- therefore the old handoff was not the whole root cause, even if the newer path is still cleaner

---

## Best current hypothesis

The remaining audible clicks are most likely generated **at or immediately before FM demodulation**.

Most plausible interpretations:

1. The decimated IQ stream still contains subtle corruption/instability not fully captured by the earliest boundary metrics.
2. The FM discriminator is reacting violently to short abnormal IQ behavior inside blocks, not just at block boundaries.
3. The problematic region is likely a **very specific early decimated-IQ settling zone**, not broad corruption across the whole block.

At this point, the most valuable next data is low-overhead IQ telemetry right before demod, plus carefully controlled demod-vs-final audio comparison.

### Stronger updated working theory (later findings, same day)

After discriminator-focused metering and targeted `dec`-IQ probes, the strongest current theory is:

> A reproducible early defect in the `dec` IQ block appears around sample index **24-25**, where the IQ magnitude dips sharply and the effective FM phase step becomes abnormally large. This then shows up as `demod_boundary` and audible clicks.

Crucially:

- this issue appears in `demod.wav`, so it exists before the final resampler/playback path
- it is **not** spread uniformly across the whole `dec` block
- it repeatedly appears near the same index
- trimming the first ~32 samples subjectively reduces the click, but does not eliminate it entirely

This strongly suggests a **settling/transient zone at the beginning of the decimated IQ block**.

Later refinements to this theory:

- pre-FIR probing originally looked cleaner than post-FIR probing, which made FIR/decimation look like the main culprit
- however, a temporary FIR bypass showed the clicks were still present, only somewhat quieter / less aggressive
- this indicates the pre-demod FIR likely amplifies or sharpens an upstream issue, but is not the sole origin
- a cleaner stateful decimating FIR implementation also failed to eliminate the recurring hot spot, further weakening the idea that the old FIR->decimate handoff alone caused the bug

---

## Recommended next steps

1. Run with reduced logging only and keep heavy dump features OFF unless explicitly needed.
2. Continue investigating the extractor path and its immediate surroundings (`extractForStreaming`, signal parameter source, offset/BW stability, overlap/trim behavior).
3. Treat FIR/decimation as a possible amplifier/focuser of the issue, but not the only suspect.
4. When testing fixes, prefer low-overhead, theory-driven experiments over broad logging/dump spam.
5. Only re-enable audio dump windows selectively and briefly.

### Debug TODO / operational reminders

- The current telemetry collector is **not** using a true ring buffer for metric/event history.
- Internally it keeps append-only history slices (`metricsHistory`, `events`) and periodically trims them by copying tail slices.
- Under heavy per-block telemetry this can add enough mutex/copy overhead to make the live stream start stuttering after a short run.
- Therefore: keep telemetry sampling conservative during live reproduction runs; do **not** leave full heavy telemetry enabled longer than needed.
- Follow-up engineering task: replace or redesign the telemetry history storage to use a proper low-overhead ring-buffer style structure (or an equivalent bounded, lock-light design) if live telemetry is to remain a standard debugging tool.

---

## 2026-03-25 update — extractor-focused live telemetry findings

### Where the investigation moved

The investigation was deliberately refocused away from browser/feed/demod-only suspicions and toward:

- shared upstream IQ cadence / block boundaries
- extractor input/output continuity
- raw vs trimmed extractor-head behaviour

This was driven by two observations:

1. all signals still click
2. the newly added live telemetry made it possible to inspect the shared path while the system was running

### Telemetry infrastructure / config notes

Two config files matter for debug telemetry defaults:

- `config.yaml`
- `config.autosave.yaml`

The autosave file can overwrite intended telemetry defaults after restart, so both must be updated together.
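The bounded ring-buffer replacement called out in the debug TODO above could start from a minimal structure like this. Type and method names are illustrative, not repo code; the point is O(1) appends with no reallocation or tail-copying:

```go
package main

import "fmt"

// metricRing is a fixed-capacity ring buffer for metric samples.
// Appends are O(1) and never reallocate, unlike append-only history
// slices that are periodically trimmed by copying tail slices.
type metricRing struct {
	buf   []float64
	head  int // index of the oldest element
	count int
}

func newMetricRing(capacity int) *metricRing {
	return &metricRing{buf: make([]float64, capacity)}
}

// Push overwrites the oldest sample once the ring is full.
func (r *metricRing) Push(v float64) {
	idx := (r.head + r.count) % len(r.buf)
	r.buf[idx] = v
	if r.count < len(r.buf) {
		r.count++
	} else {
		r.head = (r.head + 1) % len(r.buf)
	}
}

// Snapshot returns samples oldest-to-newest without disturbing the ring.
func (r *metricRing) Snapshot() []float64 {
	out := make([]float64, r.count)
	for i := 0; i < r.count; i++ {
		out[i] = r.buf[(r.head+i)%len(r.buf)]
	}
	return out
}

func main() {
	r := newMetricRing(3)
	for _, v := range []float64{1, 2, 3, 4, 5} {
		r.Push(v)
	}
	fmt.Println(r.Snapshot()) // [3 4 5]
}
```

A production version would still need a mutex (or a lock-light scheme) around Push/Snapshot, but the bounded storage removes the copy-on-trim overhead entirely.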
Current conservative live-debug defaults that worked better:

- `heavy_enabled: false`
- `heavy_sample_every: 12`
- `metric_sample_every: 8`
- `metric_history_max: 6000`
- `event_history_max: 1500`

Important operational lesson:

- runtime `POST /api/debug/telemetry/config` changes only affect the current `sdrd` process
- after restart, the process reloads config defaults again
- if autosave still contains older values (for example `heavy_enabled: true` or very large history limits), the debug run can accidentally become self-distorting again

### Telemetry endpoints

The live debug work used these HTTP endpoints on the `sdrd` web server (typically `http://127.0.0.1:8080`):

#### `GET /api/debug/telemetry/config`

Returns the current effective telemetry configuration.

Useful for verifying:

- whether heavy telemetry is enabled
- history sizes
- persistence settings
- the sample rates actually active in the running process

Typical fields:

- `enabled`
- `heavy_enabled`
- `heavy_sample_every`
- `metric_sample_every`
- `metric_history_max`
- `event_history_max`
- `retention_seconds`
- `persist_enabled`
- `persist_dir`

#### `POST /api/debug/telemetry/config`

Applies runtime telemetry config changes to the current process. Used during the investigation to temporarily reduce telemetry load without editing files.

Example body used during the investigation:

```json
{
  "heavy_enabled": true,
  "heavy_sample_every": 12,
  "metric_sample_every": 8
}
```

#### `GET /api/debug/telemetry/live`

Returns the current live metric snapshot (gauges/counters/distributions).

Useful for:

- quick sanity checks
- verifying that a metric family exists
- confirming whether a new metric name is actually being emitted

#### `GET /api/debug/telemetry/history?prefix=&limit=`

Returns stored metric history entries filtered by metric-name prefix. This is the main endpoint for time-series debugging during live runs.
Useful examples:

- `prefix=stage.`
- `prefix=source.`
- `prefix=iq.boundary.all`
- `prefix=iq.extract.input`
- `prefix=iq.extract.output`
- `prefix=iq.extract.raw.`
- `prefix=iq.extract.trimmed.`
- `prefix=iq.pre_demod`
- `prefix=audio.demod`

#### `GET /api/debug/telemetry/events?limit=`

Returns recent structured telemetry events. Used heavily once compact per-block event probes were added, because events were often easier to inspect reliably than sparsely sampled distribution histories.

This ended up being especially useful for:

- raw extractor head probes
- trimmed extractor head probes
- extractor input head probes
- GPU kernel input/output head probes
- boundary snapshots

### Important telemetry families added/used

#### Shared-path / global boundary metrics

- `iq.boundary.all.head_mean_mag`
- `iq.boundary.all.prev_tail_mean_mag`
- `iq.boundary.all.delta_mag`
- `iq.boundary.all.delta_phase`
- `iq.boundary.all.discontinuity_score`

Purpose:

- detect whether the shared `allIQ` block boundary was already obviously broken before signal-specific extraction

#### Extractor input/output metrics

- `iq.extract.input.length`
- `iq.extract.input.overlap_length`
- `iq.extract.input.head_mean_mag`
- `iq.extract.input.prev_tail_mean_mag`
- `iq.extract.input.discontinuity_score`
- `iq.extract.output.length`
- `iq.extract.output.head_mean_mag`
- `iq.extract.output.head_min_mag`
- `iq.extract.output.head_max_step`
- `iq.extract.output.head_p95_step`
- `iq.extract.output.head_tail_ratio`
- `iq.extract.output.head_low_magnitude_count`
- `iq.extract.output.boundary.delta_mag`
- `iq.extract.output.boundary.delta_phase`
- `iq.extract.output.boundary.d2`
- `iq.extract.output.boundary.discontinuity_score`

Purpose:

- isolate whether the final per-signal extractor output itself was discontinuous across blocks

#### Raw vs trimmed extractor-head telemetry

- `iq.extract.raw.length`
- `iq.extract.raw.head_mag`
- `iq.extract.raw.tail_mag`
- `iq.extract.raw.head_zero_count`
- `iq.extract.raw.first_nonzero_index`
- `iq.extract.raw.head_max_step`
- `iq.extract.trim.trim_samples`
- `iq.extract.trimmed.head_mag`
- `iq.extract.trimmed.tail_mag`
- `iq.extract.trimmed.head_zero_count`
- `iq.extract.trimmed.first_nonzero_index`
- `iq.extract.trimmed.head_max_step`
- event `extract_raw_head_probe`
- event `extract_trimmed_head_probe`

Purpose:

- answer the key question: is the corruption already present in the raw extractor output head, or created by trimming/overlap logic afterward?

#### Additional extractor input / GPU-kernel probe telemetry

- `iq.extract.input_head.zero_count`
- `iq.extract.input_head.first_nonzero_index`
- `iq.extract.input_head.max_step`
- event `extract_input_head_probe`
- event `gpu_kernel_input_head_probe`
- event `gpu_kernel_output_head_probe`

Purpose:

- split the remaining uncertainty between:
  - the signal-specific input already being bad
  - the GPU extractor kernel/start semantics producing the bad raw head
  - later output assembly after the kernel

#### Pre-demod / audio-stage metrics

- `iq.pre_demod.head_mean_mag`
- `iq.pre_demod.head_min_mag`
- `iq.pre_demod.head_max_step`
- `iq.pre_demod.head_p95_step`
- `iq.pre_demod.head_low_magnitude_count`
- `audio.demod.head_mean_abs`
- `audio.demod.tail_mean_abs`
- `audio.demod.edge_delta_abs`
- existing `audio.demod_boundary.*`

Purpose:

- verify where artifacts become visible/audible downstream

### What the 2026-03-25 telemetry actually showed

#### 1. Feed / enqueue remained relatively uninteresting

`stage.feed_enqueue.duration_ms` was usually effectively zero.

Representative values during live runs:

- mostly `0`
- occasional small spikes such as `0.5 ms` and `5.8 ms`

Interpretation:

- feed enqueue is not the main source of clicks

#### 2. Extract-stream time was usually modest

`stage.extract_stream.duration_ms` was usually small and stable compared with the main loop.
Representative values:

- often `1–5 ms`
- occasional spikes such as `10.7 ms` and `18.9 ms`

Interpretation:

- extraction is not free, but runtime cost alone does not explain the clicks

#### 3. Shared capture / source cadence still fluctuated heavily

Representative live values:

- `dsp.frame.duration_ms`: often around `90–100 ms`, but also `110–150 ms`, with one observed spike around `212.6 ms`
- `source.read.duration_ms`: roughly `80–90 ms` often, but also about `60 ms`, `47 ms`, `19 ms`, and even `0.677 ms`
- `source.buffer_samples`: ranged from very small to very large bursts, including examples like `512`, `4608`, `94720`, `179200`, `304544`
- a `source_reset` event was seen and `source.resets=1`

Interpretation:

- shared upstream cadence is clearly unstable enough to remain suspicious
- but this alone did not localize the final click mechanism

#### 4. Pre-demod stage showed repeated hard phase anomalies even when energy looked healthy

Representative live values for normal non-vanishing signals:

- `iq.pre_demod.head_mean_mag` around `0.25–0.31`
- `iq.pre_demod.head_low_magnitude_count = 0`
- `iq.pre_demod.head_max_step` repeatedly high, including roughly:
  - `1.5`
  - `2.0`
  - `2.4`
  - `2.8`
  - `3.08`

Interpretation:

- not primarily an amplitude collapse
- rather a strong phase/continuity defect reaching the pre-demod stage

#### 5. Audio stage still showed real block-edge artifacts

Representative values:

- `audio.demod.edge_delta_abs` repeatedly around `0.4–0.8`
- outliers up to roughly `1.21` and `1.26`
- `audio.demod_boundary.count` continued to fire repeatedly

Interpretation:

- demod is where the problem becomes audible, but the root cause still appeared to be earlier/shared

### Key extractor findings from the new telemetry

#### A. Per-signal extractor output boundary is genuinely broken

For a representative strong signal (`signal_id=2`), `iq.extract.output.boundary.delta_phase` repeatedly showed very large jumps such as:

- `2.60`
- `3.06`
- `2.14`
- `2.71`
- `3.09`
- `2.92`
- `2.63`
- `2.78`

Also observed for `iq.extract.output.boundary.discontinuity_score`:

- `2.86`
- `3.08`
- `2.92`
- `2.52`
- `2.40`
- `2.85`

Later runs using `d2` made the discontinuity even easier to see. Representative `iq.extract.output.boundary.d2` values for the same strong signal included:

- `0.347`
- `0.303`
- `0.362`
- `0.359`
- `0.382`
- `0.344`
- `0.337`
- `0.206`

At the same time, `iq.extract.output.boundary.delta_mag` was often comparatively small (examples around `0.0003–0.0038`).

Interpretation:

- the main boundary defect is not primarily an amplitude mismatch
- it is much more consistent with a complex/phase discontinuity across output blocks

#### B. The raw extractor head is systematically bad on all signals

The new `extract_raw_head_probe` events were the strongest finding of the day.

Representative repeated pattern for strong signals (`signal_id=1` and `signal_id=2`):

- `first_nonzero_index = 1`
- `zero_count = 1`
- first magnitude sample exactly `0`
- then a short ramp, e.g. for `signal_id=2`:
  - `0`
  - `0.000388`
  - `0.002316`
  - `0.004152`
  - `0.019126`
  - `0.011418`
  - `0.124034`
  - `0.257569`
  - `0.317579`
- `head_max_step` often near π, e.g.:
  - `3.141592653589793`
  - `3.088773696463606`
  - `3.0106854446936318`
  - `2.9794833659932527`

The same qualitative pattern appeared for weaker signals too:

- the raw head starts at `0`
- a brief near-zero ramp follows
- only after several samples does the magnitude look like a normal extracted band

Interpretation:

- the raw extractor output head is already damaged / settling / invalid before trimming
- this strongly supports an upstream/shared-start-condition problem rather than a trim-created artifact

#### C. The trimmed extractor head usually looks sane

Representative repeated pattern for the same signals after `trim_samples = 64`:

- `first_nonzero_index = 0`
- `zero_count = 0`
- magnitudes look immediately plausible and stable
- `head_max_step` is dramatically lower than raw, often around `0.15–0.9` for strong channels

Example trimmed head magnitudes for `signal_id=2`:

- `0.299350`
- `0.300954`
- `0.298032`
- `0.298738`
- `0.312258`
- `0.296932`
- `0.239010`
- `0.266881`
- `0.313193`

Example trimmed head magnitudes for `signal_id=1`:

- `0.277400`
- `0.275994`
- `0.273718`
- `0.272846`
- `0.277842`
- `0.278398`
- `0.268829`
- `0.273790`
- `0.279031`

Interpretation:

- trimming is removing a genuinely bad raw head region
- trimming is therefore **not** the main origin of the problem
- it acts more like cleanup of an already bad upstream/raw start region

### Input-vs-raw-vs-trimmed extractor result (important refinement)

A later, more targeted telemetry pass added a direct probe on the signal-specific extractor input head (`extract_input_head_probe`) and compared it against the raw and trimmed extractor output heads. This materially refined the earlier conclusion.

#### Input-head result

Representative values from `iq.extract.input_head.*`:

- `iq.extract.input_head.zero_count = 0`
- `iq.extract.input_head.first_nonzero_index = 0`

Interpretation:

- the signal-specific input head going into the GPU extractor is **not** starting with a zero sample
- the head is not arriving already dead/null at the immediate input probe point

#### Raw-head result

Representative values from `iq.extract.raw.*`:

- `iq.extract.raw.head_mag = 0`
- `iq.extract.raw.head_zero_count = 1`
- `iq.extract.raw.head_max_step` frequently around `2.4–3.14`

These values repeated for strong channels such as `signal_id=2`, and similarly across other signals.
Interpretation:

- the first raw output sample is repeatedly exactly zero
- therefore the visibly bad raw head is being created **after** the probed input head and **before/during** raw extractor output generation

#### Trimmed-head result

Representative values from `iq.extract.trimmed.*`:

- `iq.extract.trimmed.head_zero_count = 0`
- `iq.extract.trimmed.head_mag` often looked healthy immediately after trimming, for example:
  - signal 1: about `0.275–0.300`
  - signal 2: about `0.311`
- `iq.extract.trimmed.head_max_step` was much lower than raw for strong channels, often around:
  - `0.11`
  - `0.14`
  - `0.19`
  - `0.30`
  - `0.75`

Interpretation:

- trimming cleans up the visibly bad raw head region
- trimming still does **not** explain the deeper output-boundary continuity issue

### Further refinement after direct extractor-input and GPU-kernel probes

A final telemetry round added:

- `extract_input_head_probe`
- `gpu_kernel_input_head_probe`
- `gpu_kernel_output_head_probe`

These probes further sharpened the likely fault location.
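A head probe of the kind used above can be sketched in a few lines; the summary fields mirror the `zero_count` / `first_nonzero_index` / `max_step` values reported by the `extract_*_head_probe` events, but the implementation itself is illustrative, not the repo code:

```go
package main

import (
	"fmt"
	"math"
	"math/cmplx"
)

// headProbe summarizes the first n samples of a complex IQ block.
type headProbe struct {
	ZeroCount       int
	FirstNonzeroIdx int     // -1 if every head sample is zero
	MaxStep         float64 // largest absolute phase step in the head
}

func probeHead(iq []complex128, n int) headProbe {
	if n > len(iq) {
		n = len(iq)
	}
	p := headProbe{FirstNonzeroIdx: -1}
	for i := 0; i < n; i++ {
		if cmplx.Abs(iq[i]) == 0 {
			p.ZeroCount++
		} else if p.FirstNonzeroIdx < 0 {
			p.FirstNonzeroIdx = i
		}
		if i > 0 {
			// Phase step between consecutive samples, wrapped to (-π, π].
			step := cmplx.Phase(iq[i] * cmplx.Conj(iq[i-1]))
			if s := math.Abs(step); s > p.MaxStep {
				p.MaxStep = s
			}
		}
	}
	return p
}

func main() {
	// A synthetic "bad head": exact-zero first sample, then a ramp.
	iq := []complex128{0, 0.0004, 0.002, complex(0.3, 0.01), complex(0.3, 0.02)}
	fmt.Printf("%+v\n", probeHead(iq, len(iq)))
}
```

On a raw head like the ones observed above (`0`, then a near-zero ramp), such a probe reports `ZeroCount = 1` and `FirstNonzeroIdx = 1`, matching the telemetry pattern.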
#### Signal-specific extractor input head looked sane

Representative values:

- `iq.extract.input_head.zero_count = 0`
- `iq.extract.input_head.first_nonzero_index = 0`

Interpretation:

- at the observed signal-specific input probe point, the GPU extractor is **not** receiving a dead/null head

#### Raw GPU output head remained systematically broken

Representative repeated values:

- `iq.extract.raw.head_mag = 0`
- `iq.extract.raw.head_zero_count = 1`
- `iq.extract.raw.head_max_step` repeatedly around:
  - `3.141592653589793`
  - `3.122847934305907`
  - `3.101915352902961`
  - `3.080672178550904`
  - `3.062425574273907`
  - `2.9785041567778427`
  - `2.7508533785793476`

Representative repeated examples from strong channels:

- signal 2: `head_mag = 0`, `head_zero_count = 1`
- signal 3: `head_mag = 0`, `head_zero_count = 1`
- signals 1/4 showed the same qualitative head-zero pattern as well

Interpretation:

- the raw extractor output head is still repeatedly born broken
- the problem is therefore after the currently probed input head and before/during raw output creation

#### Trimmed head still looked healthier

Representative values:

- `iq.extract.trimmed.head_zero_count = 0`
- signal 1 `iq.extract.trimmed.head_mag` repeatedly around:
  - `0.2868`
  - `0.2907`
  - `0.3036`
  - `0.3116`
  - `0.2838`
  - `0.2760`
- signal 2 examples:
  - `0.3461`
  - `0.3182`

Representative `iq.extract.trimmed.head_max_step` values for strong channels were much lower than raw, often around:

- `0.11`
- `0.13`
- `0.21`
- `0.30`
- `0.44`
- `0.69`
- `0.86`

Interpretation:

- trimming still removes the most visibly broken head region
- but trimming does not explain the deeper output-boundary continuity issue

### Refined strongest current conclusion after the full 2026-03-25 telemetry pass

The strongest current reading is now:

> The click root cause is very likely **not** that the signal-specific extractor input already starts dead/null. Instead, the bad raw head appears to be introduced **inside the GPU extractor path itself** (or at its immediate start/output semantics) before final trimming.

More specifically:

- the signal-specific extractor input head looks non-zero and sane at the probe point
- the raw GPU output head still repeatedly starts with an exact zero sample and a short bad settling region
- the trimmed head usually looks healthier
- yet the final extractor output still exhibits significant complex boundary discontinuity from block to block

This now points away from a simple "shared global input head is already zero" theory and toward one of these narrower causes:

1. GPU extractor kernel start semantics / warmup / first-output handling
2. phase-start or alignment handling at extractor block start
3. raw GPU output assembly semantics within the extractor path

### What should not be forgotten from this stage

- The overlap-prepend bug was real and worth fixing, but was not sufficient.
- The fixed read-size path (`SDR_FORCE_FIXED_STREAM_READ_SAMPLES=389120`) remains useful and likely worth promoting later, but it is not the root-cause fix.
- The telemetry system itself can perturb runs if overused; conservative sampling matters.
- `config.autosave.yaml` must be kept in sync with `config.yaml`, or telemetry defaults can silently revert after restart.
- The most promising root-cause area is now the shared upstream/extractor-start boundary path, not downstream playback.
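The suspected warm-up mechanism, a filter restarted with empty history at every block start, can be demonstrated in isolation. This is a toy sketch, not the repo's extractor; it shows how a per-block restart produces exactly the kind of settling ramp seen in the raw head probes, while carried history does not:

```go
package main

import "fmt"

// firBlock filters one block with the given history (the samples that
// precede the block). An empty history models a filter restarted per
// block, i.e. implicit zero-padding at the block start.
func firBlock(taps, history, block []float64) []float64 {
	ext := append(append([]float64{}, history...), block...)
	out := make([]float64, len(block))
	for i := range block {
		pos := len(history) + i
		var acc float64
		for k, t := range taps {
			if idx := pos - k; idx >= 0 {
				acc += t * ext[idx]
			}
		}
		out[i] = acc
	}
	return out
}

func main() {
	taps := []float64{0.25, 0.25, 0.25, 0.25} // simple moving-average FIR
	prev := []float64{1, 1, 1}                // tail of the previous block
	block := []float64{1, 1, 1, 1}            // steady DC input continues

	// Stateful: history carried across the boundary -> flat output.
	fmt.Println(firBlock(taps, prev, block)) // [1 1 1 1]

	// Stateless restart: empty history -> settling ramp at the block
	// head, the same qualitative "raw head ramp" seen in the probes.
	fmt.Println(firBlock(taps, nil, block)) // [0.25 0.5 0.75 1]
}
```

The stateless variant ramps over roughly one filter length, which is why a fixed head-trim hides the ramp but cannot restore cross-block continuity.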
### 2026-03-25 refactor work status (post-reviewer instruction)

After the reviewer guidance, work pivoted away from symptomatic patching and onto the required two-track architecture change:

#### Track 1 — CPU/oracle path repair (in progress)

The following was added to start building a trustworthy streaming oracle:

- `internal/demod/gpudemod/streaming_types.go`
- `internal/demod/gpudemod/cpu_oracle.go`
- `internal/demod/gpudemod/cpu_oracle_test.go`
- `internal/demod/gpudemod/streaming_oracle_extract.go`
- `internal/demod/gpudemod/polyphase.go`
- `internal/demod/gpudemod/polyphase_test.go`

What exists now:

- explicit `StreamingExtractJob` / `StreamingExtractResult`
- explicit `CPUOracleState`
- exact integer decimation enforcement (`ExactIntegerDecimation`)
- monolithic-vs-chunked CPU oracle test
- explicit polyphase tap layout (`phase-major`)
- CPU oracle direct-vs-polyphase equivalence test
- persistent CPU oracle runner state keyed by signal ID
- config-hash reset behavior
- cleanup of disappeared signals from oracle state

Important limitation:

- this is **not finished production validation yet**
- the CPU oracle path is being built toward the reviewer's required semantics, but it is not yet the final signed-off oracle for GPU validation

#### Track 2 — GPU path architecture refactor (in progress)

The following was added to begin the new stateful GPU architecture:

- `internal/demod/gpudemod/stream_state.go`
- `internal/demod/gpudemod/streaming_gpu_stub.go`
- `docs/gpu-streaming-refactor-plan-2026-03-25.md`
- `cmd/sdrd/streaming_refactor.go`

What exists now:

- explicit `ExtractStreamState`
- batch-runner-owned per-signal state map
- config-hash reset behavior for GPU-side stream state
- exact integer decimation enforcement in the relevant batch path
- base taps and polyphase taps initialized into GPU-side stream state
- explicit future production entry point: `StreamingExtractGPU(...)`
- explicit separation between the current legacy extractor path and the new streaming/oracle path
- persistent oracle-runner lifecycle hooks, including reset on stream-drop events

Important limitation:

- the new GPU production path is **not implemented yet**
- the legacy overlap+trim production path still exists and is still the current active path
- the new GPU entry point currently exists as an explicit architectural boundary and state owner, not as the finished stateful polyphase kernel path

#### Tests currently passing during refactor

Repeatedly verified during the refactor work:

- `go test ./internal/demod/gpudemod/...`
- `go test ./cmd/sdrd/...`

#### Incremental progress reached so far inside the refactor

Additional progress after the initial refactor scaffolding:

- the CPU oracle runner now uses the explicit polyphase oracle path (`CPUOracleExtractPolyphase`) instead of only carrying polyphase tap data passively
- the CPU oracle now has a direct-vs-polyphase equivalence test
- the GPU-side stream state now initializes both `BaseTaps` and `PolyphaseTaps`
- the GPU side now has an explicit future production entry point `StreamingExtractGPU(...)`
- the GPU streaming stub now advances `NCOPhase` over NEW samples only
- the GPU streaming stub now advances `PhaseCount` modulo the exact integer decimation factor
- the GPU streaming stub now builds and persists `ShiftedHistory` from already frequency-shifted NEW samples
- the new streaming/oracle path is explicitly separated from the current legacy overlap+trim production path

Important current limitation:

- `StreamingExtractGPU(...)` still intentionally returns a not-implemented error rather than pretending to be the finished production path
- this is deliberate, to avoid hidden quick-fix semantics or silent goalpost shifts

Additional notes on the latest steps:

- the GPU streaming stub now also reports an estimated output-count schedule (`NOut`) derived from NEW sample consumption plus the carried `PhaseCount`
- this still does **not** make it a production path; it only means the stub now models output cadence semantics more honestly
- the new CPU/oracle path also now exposes additional runtime telemetry such as `streaming.oracle.rate` and `streaming.oracle.output_len`, so the reference path becomes easier to inspect as it matures
- a reusable complex-slice comparison helper now exists (`CompareComplexSlices`) to support later oracle-vs-GPU equivalence work without improvising comparison logic at the last minute
- a dedicated `TestCPUOracleMonolithicVsChunkedPolyphase` now verifies chunked-vs-monolithic self-consistency for the polyphase oracle path specifically
- explicit reset tests now exist for both CPU oracle state and GPU streaming state, so config-change reset semantics are no longer only implicit in code review
- a dedicated `ExtractDebugMetrics` structure now exists as a future comparison/telemetry contract for the reviewer-required state/error/boundary metrics
- the first mapper from oracle results into that debug-metric structure now exists, so the comparison contract is beginning to attach to real refactor code rather than staying purely conceptual
- the same minimal debug-metric mapping now also exists for GPU-stub results, so both sides of the future GPU-vs-oracle comparison now have an initial common reporting shape
- a first comparison-pipeline helper now exists to turn oracle-vs-GPU-stub results into shared `CompareStats` / `ExtractDebugMetrics` output, even though the GPU path is still intentionally incomplete
- that comparison helper is now also covered by a dedicated unit test, so even the scaffolding around future GPU-vs-oracle validation is being locked down incrementally
- GPU-side stream-state initialization is now also unit-tested (`Decim`, `BaseTaps`, `PolyphaseTaps`, `ShiftedHistory` capacity), so the new state-ownership layer is no longer just trusted by inspection
- the GPU streaming stub now also has a dedicated test proving that it advances persistent state while still explicitly failing as a not-yet-implemented production path
- at this point, enough scaffolding existed that the next sensible step was to build the broader validation/test harness in one larger pass before continuing the actual production-path rewrite
- that harness pass has now happened: deterministic IQ/tone fixtures, harness config/state builders, chunked polyphase oracle runners, and additional validation tests now exist, so the next step is back to the actual production-path rewrite
- the first non-stub NEW-samples-only production-like path now exists as `StreamingExtractGPUHostOracle(...)`: it is still host-side, but it executes the new streaming/stateful semantics and therefore serves as a concrete bridge between pure test infrastructure and the eventual real GPU production path
- that host-side production-like path is now directly compared against the CPU oracle in tests and currently matches within tight tolerance, which is an important confidence step before any real CUDA-path replacement
- the canonical new production entry point `StreamingExtractGPU(...)` is now structurally wired so that the host-side production-like implementation can sit behind the same API later, without forcing a premature switch today
- a top-level `cmd/sdrd` production-path hook now exists as well (`extractForStreamingProduction` plus `useStreamingProductionPath=false`), so the new architecture is no longer isolated to internal packages only
- the new production path now also emits first-class output/head telemetry (`rate`, `output_len`, `head_mean_mag`, `head_max_step`) in addition to pure state counters, which will make activation/debugging easier later
- a top-level comparison observation hook now also exists in `cmd/sdrd`, so oracle-vs-production metrics no longer have to remain buried inside internal package helpers
- after the broader monitoring/comparison consolidation pass, the agreed work mode is to continue in larger clusters rather than micro-steps: (1) wire the new production semantics more deeply, (2) isolate the legacy path more sharply, (3) keep preparing the eventual real GPU production path behind the same architecture
- after the first larger cluster, the next explicit target was to complete Cluster B: make the host-oracle bridge sit more naturally behind the new production execution architecture, rather than leaving production-path semantics spread across loosely connected files
- after Cluster B, the remaining GPU rewrite work is best split into two explicit parts, `C1 = prepare` and `C2 = definitive implementation`, so the project can keep momentum without pretending that the final CUDA/stateful production path is already done
- Cluster B is now effectively complete: the CPU oracle runner, the host-oracle production-like path, and the top-level production comparison all share the same host streaming core, and that common core is directly tested against the polyphase oracle
- Cluster C1 is now also complete: the new GPU production layer has an explicit invocation contract, an execution-result contract, state handoff/build/apply stages, and a host-side execution strategy already running behind the same model

### Current refactor status before C2

At this point the project has:

- a corrected streaming/oracle architecture direction
- a shared host-side streaming core used by both the CPU oracle runner and the host-side production-like bridge
- explicit production-path hooks in `cmd/sdrd`
- comparison and monitoring scaffolding above and below the execution layer
- a prepared GPU execution contract (`StreamingGPUInvocation` / `StreamingGPUExecutionResult`)

What it does **not** have yet:

- a real native CUDA streaming/polyphase execution entry point with history-in/history-out and phase-count in/out semantics
- a real CUDA-backed implementation behind `StreamingExtractGPUExec(...)`
- completed GPU-vs-oracle validation on the final native execution path

### C2 plan

#### C2-A — native CUDA / bridge entry preparation

Goal:

- introduce the real native entry shape for stateful streaming/polyphase execution

Status note before starting C2-A:

- C2 is **not** honestly complete yet, because the native CUDA side still only exposes the old separate freq-shift/FIR/decimate pieces.
- Therefore C2-A must begin by creating the real native entry shape rather than continuing to stack more Go-only abstractions on top of the old kernels.

Required outcomes:

- explicit native/CUDA function signature for streaming execution
- bridge bindings for history in/out, phase count in/out, new samples in, outputs out
- Go-side wrapper ready to call the new native path through the prepared invocation/result model

#### C2-B — definitive execution implementation hookup

Goal:

- put a real native CUDA-backed execution strategy behind `StreamingExtractGPUExec(...)`

Status notes after C2-A:

- the native entry shape now exists in CUDA, the Windows bridge can resolve it, and the Go execution layer can route into a native-prepared strategy.
- what was still missing for C2-B was the actual stateful execution body behind that new native entrypoint.
- therefore C2-B meant exactly one serious thing: replace the placeholder body of the new native entrypoint with real stateful streaming/polyphase execution semantics, rather than adding more scaffolding around it.
- C2-B is now materially done: the new native entrypoint no longer returns only placeholder state, and the Go native execution path now uploads inputs/history/taps, runs the new native function, and reads back outputs plus updated state.
- when the new exact-integer streaming decimation rules were turned on, an immediate runtime integration issue appeared: previous WFM extraction defaults expected `outRate=500000`, but the live sample rate was `4096000`, which is not exactly divisible. The correct fix is to align streaming defaults with the new integer-decimation model instead of trying to preserve the old rounded-ratio behavior.
- the concrete immediate adjustment made for this was `wfmStreamOutRate = 512000` (instead of `500000`), because `4096000 / 512000 = 8` is exactly divisible and therefore consistent with the new streaming architecture's no-rounding rule.

Required outcomes:

- `StreamingExtractGPUExec(...)` can execute a real native stateful path
- the host-oracle bridge remains available only as a comparison/support path, not as the disguised production implementation
- state apply/backflow goes through the already prepared invocation/result contract

#### C2-C — final validation and serious completion gate

Goal:

- validate the real CUDA-backed path against the corrected oracle and make the completion criterion explicit

Required outcomes:

- GPU-vs-oracle comparison active on the real native path
- test coverage and runtime comparison hooks in place
- after C2-C, the CUDA story must be treated as complete, correct, and serious — not half-switched or pseudo-finished

#### Why the refactor is intentionally incremental

The reviewer explicitly required:

- no start-index-only production patch
- no continued reliance on overlap+trim as the final continuity model
- no silent decimation rounding
- no GPU sign-off without a corrected CPU oracle

Because of that, the work is being done in ordered layers:

1. define streaming types and state
2. build the CPU oracle with exact streaming semantics
3. establish shared polyphase/tap semantics
4. prepare GPU-side persistent state ownership
5. only then replace the actual production GPU execution path

This means the repo now contains partially completed new architecture pieces that are deliberate stepping stones, not abandoned half-fixes.
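The exact-divisibility rule behind the `wfmStreamOutRate` change (4096000 / 512000 = 8) can be sketched as follows. This is a hedged illustration: `exactDecim` and `nearestIntegerOutRate` are hypothetical helper names, not the repo's actual `ExactIntegerDecimation` code.

```go
package main

import "fmt"

// exactDecim returns the decimation factor inRate/outRate, or an error if the
// ratio is not an exact integer, enforcing the no-rounding rule.
func exactDecim(inRate, outRate int) (int, error) {
	if outRate <= 0 || inRate%outRate != 0 {
		return 0, fmt.Errorf("inRate %d not exactly divisible by outRate %d", inRate, outRate)
	}
	return inRate / outRate, nil
}

// nearestIntegerOutRate picks the exact divisor of inRate whose quotient is
// closest to the desired output rate, e.g. desired 500000 maps to 512000 for
// inRate 4096000 (decimation factor 8).
func nearestIntegerOutRate(inRate, desired int) int {
	best, bestDiff := inRate, inRate
	for d := 1; d <= inRate; d++ {
		if inRate%d != 0 {
			continue
		}
		out := inRate / d
		diff := out - desired
		if diff < 0 {
			diff = -diff
		}
		if diff < bestDiff {
			best, bestDiff = out, diff
		}
	}
	return best
}

func main() {
	// The old default is rejected under the exact-integer rule.
	if _, err := exactDecim(4096000, 500000); err != nil {
		fmt.Println("rejected:", err)
	}
	d, _ := exactDecim(4096000, 512000)
	fmt.Println("decim:", d) // prints: decim: 8
	fmt.Println("suggested outRate:", nearestIntegerOutRate(4096000, 500000))
}
```

Choosing the nearest exact divisor, rather than rounding the decimation factor, keeps block lengths and phase counts consistent across every chunk boundary.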
### Reviewer package artifacts created for second-opinion review

To support external/secondary review of the GPU extractor path, a focused reviewer package was created in the project root:

- `reviewer-gpu-extractor-package/`
- `reviewer-gpu-extractor-package.zip`
- `reviewer-gpu-extractor-package.json`

The package intentionally contains:

- relevant GPU extractor / kernel code
- surrounding host-path code needed for context
- current debug notes
- a reviewer brief
- a short reviewer prompt
- relevant config files used during live telemetry work

The JSON variant is uncompressed and stores all included package files as a single JSON document with repeated entries of:

- `path`
- `content`

This was created specifically so the same reviewer payload can be consumed by tools or APIs that prefer a single structured text file instead of a ZIP archive.

---

## Meta note

This investigation already disproved several plausible explanations. That is progress.

The most important things not to forget:

- the overlap prepend bug was real, but not sufficient
- the click is already present in demod audio
- whole-process CPU saturation is not the main explanation
- excessive debug instrumentation can itself create misleading secondary problems
- the 2026-03-25 extractor telemetry strongly suggests the remaining root cause is upstream of the final trim stage
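As a closing illustration, the chunked-vs-monolithic consistency property that the oracle tests (e.g. `TestCPUOracleMonolithicVsChunkedPolyphase`) enforce can be shown with a toy stateful decimator. This is a minimal sketch under simplified assumptions: `decimState` and `process` are hypothetical, not the repo's actual `CPUOracleState` or polyphase code, and plain decimation stands in for the full shift/filter/decimate chain.

```go
package main

import "fmt"

// decimState carries the only streaming state this toy needs: how many input
// samples are still pending before the next output (the phase count).
type decimState struct{ phase int }

// process consumes NEW samples only, emitting every decim-th sample and
// carrying the phase count across chunk boundaries (modulo decim), in the
// spirit of the streaming semantics described in these notes.
func (s *decimState) process(in []float64, decim int) []float64 {
	var out []float64
	for _, v := range in {
		if s.phase == 0 {
			out = append(out, v)
		}
		s.phase = (s.phase + 1) % decim
	}
	return out
}

func main() {
	input := make([]float64, 20)
	for i := range input {
		input[i] = float64(i)
	}
	// Monolithic pass over the whole input.
	mono := (&decimState{}).process(input, 4)
	// Chunked pass with uneven chunk sizes; state carries across boundaries.
	st := &decimState{}
	var chunked []float64
	for _, cut := range [][2]int{{0, 7}, {7, 9}, {9, 20}} {
		chunked = append(chunked, st.process(input[cut[0]:cut[1]], 4)...)
	}
	fmt.Println(mono)    // prints: [0 4 8 12 16]
	fmt.Println(chunked) // prints: [0 4 8 12 16]
}
```

If the phase count were reset at each chunk instead of carried, the chunked output would emit an extra sample at every chunk start, which is exactly the kind of block-boundary artifact the streaming refactor is designed to rule out.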