Only embed watermark chips in STFT bins where the audio signal provides
sufficient masking. Bins in spectral valleys (more than 25 dB below the
local peak within ±4 bins) are skipped: the watermark would be audible
there, and such bins contribute more carrier noise than signal to the
correlation.
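A minimal sketch of the valley test described above, assuming linear
STFT magnitudes and a depth measured as 20·log10(peak/bin). The function
name maskedBins and its parameters are hypothetical, not taken from the
actual encoder.

```go
package main

import (
	"fmt"
	"math"
)

// maskedBins reports, for each STFT bin, whether its magnitude lies within
// thresholdDB of the largest magnitude found within ±radius bins.
// Bins that fail the test sit in a spectral valley and would be skipped.
// Assumption: mag holds linear amplitudes, so depth is 20*log10(peak/bin).
func maskedBins(mag []float64, radius int, thresholdDB float64) []bool {
	const eps = 1e-12 // guards against log10(0) on silent bins
	embed := make([]bool, len(mag))
	for k := range mag {
		lo, hi := k-radius, k+radius
		if lo < 0 {
			lo = 0
		}
		if hi > len(mag)-1 {
			hi = len(mag) - 1
		}
		peak := 0.0
		for j := lo; j <= hi; j++ {
			peak = math.Max(peak, mag[j])
		}
		depthDB := 20 * math.Log10((peak+eps)/(mag[k]+eps))
		embed[k] = depthDB <= thresholdDB
	}
	return embed
}

func main() {
	// Bin 2 is 40 dB below its neighbourhood peak, so it is skipped.
	mag := []float64{1.0, 0.9, 0.01, 0.8, 1.0, 0.7}
	fmt.Println(maskedBins(mag, 4, 25)) // [true true false true true true]
}
```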
This masking (PAFM) is applied in the encoder only. The decoder
correlates all bins unconditionally, because the FM channel alters the
spectral shape, so masking decisions made at the encoder do not match
the receiver's spectrum. Skipped bins contribute no watermark energy
(the encoder did not modify them), only carrier noise, which the
cepstrum filter already suppresses by ~6 dB.
On average ~60-70% of bins carry watermark energy per frame, matching
Kirovski's observation. The remaining bins are left unmodified
(multiplicative embedding: magnitude × 1.0 = unchanged).
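The multiplicative embedding can be sketched as follows. This is an
illustration only: the gain form 1 + strength·chip, the strength
parameter, and the name embedFrame are assumptions, not the encoder's
actual formula. The point it shows is that a skipped bin is multiplied
by exactly 1.0.

```go
package main

import "fmt"

// embedFrame scales each masked bin by (1 + strength*chip) and multiplies
// skipped bins by exactly 1.0, leaving them unchanged.
// Assumptions: chips holds ±1 PN values, strength is a hypothetical
// embedding depth, and embed comes from the encoder's masking decision.
func embedFrame(mag, chips []float64, embed []bool, strength float64) []float64 {
	out := make([]float64, len(mag))
	for k := range mag {
		gain := 1.0
		if embed[k] {
			gain += strength * chips[k]
		}
		out[k] = mag[k] * gain
	}
	return out
}

func main() {
	mag := []float64{2, 2, 2}
	chips := []float64{1, -1, 1}
	embed := []bool{true, true, false} // last bin is in a spectral valley
	fmt.Println(embedFrame(mag, chips, embed, 0.5)) // [3 1 2]
}
```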
Over-the-air result (62-minute recording):
  avg|c|   = 6286 (27 WM cycles averaged)
  BER      = 0/128
  Erasures = 0
The decoder took ~70s for a 20-minute recording. Profiling revealed the
bottleneck was not the 6400-candidate cycle-offset search, but the
cepstrum filter's naive O(N²) DCT calling math.Cos() in the inner loop:
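  55458 STFT frames × 2 passes × 256² math.Cos() calls = 7.27 billion calls
  At ~20 ns per call: ~145 s estimated. This exceeds the measured ~70 s
  total, so the real per-call cost is lower than 20 ns, but the DCT
  still dominated the runtime.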
Fixes:
1. Precomputed cosine table: compute 256×256 = 65536 cosine values
   once, then use table lookups in the inner loop. Eliminates all
   math.Cos() calls from the per-frame processing.
2. Parallel cycle-offset search: 5 goroutines (one per rep offset),
   each searching 1280 cycle offsets independently. The rep offsets
   are fully independent: no shared state, no synchronization needed
   until the final result merge.
3. Precomputed center-frame lists: instead of checking f%timeRep for
   every frame in every candidate test, precompute which frames are
   center frames for each rep offset. Eliminates per-frame branching.
4. Float64 PN chip arrays: convert int8 PN chips to float64 once at
   startup. Eliminates int8→float64 conversion in the hot inner loop
   (204 conversions × 11000 frames × 6400 candidates = 14.4 billion
   avoided conversions).
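Fix 1 can be sketched as below. The commit does not say which DCT
variant the cepstrum filter uses, so this assumes an unnormalized
DCT-II; newCosTable, dctNaive, and dctWithTable are hypothetical names.
The naive version is included only to show that the table path returns
the same values.

```go
package main

import (
	"fmt"
	"math"
)

// newCosTable precomputes cos(pi/n * (i+0.5) * k) for an n-point DCT-II.
// Layout is row-major: table[k*n+i]. For n = 256 this is 65536 values.
func newCosTable(n int) []float64 {
	table := make([]float64, n*n)
	for k := 0; k < n; k++ {
		for i := 0; i < n; i++ {
			table[k*n+i] = math.Cos(math.Pi / float64(n) * (float64(i) + 0.5) * float64(k))
		}
	}
	return table
}

// dctNaive is the slow reference: one math.Cos call per inner iteration.
func dctNaive(x, out []float64) {
	n := len(x)
	for k := 0; k < n; k++ {
		sum := 0.0
		for i, v := range x {
			sum += v * math.Cos(math.Pi/float64(n)*(float64(i)+0.5)*float64(k))
		}
		out[k] = sum
	}
}

// dctWithTable computes the same unnormalized DCT-II using only lookups.
// table must come from newCosTable(len(x)).
func dctWithTable(x, out, table []float64) {
	n := len(x)
	for k := 0; k < n; k++ {
		row := table[k*n : (k+1)*n]
		sum := 0.0
		for i, v := range x {
			sum += v * row[i]
		}
		out[k] = sum
	}
}

func main() {
	const n = 256
	table := newCosTable(n) // built once, reused for every frame
	x := make([]float64, n)
	for i := range x {
		x[i] = float64(i%7) - 3 // arbitrary test signal
	}
	a, b := make([]float64, n), make([]float64, n)
	dctNaive(x, a)
	dctWithTable(x, b, table)
	maxDiff := 0.0
	for k := range a {
		maxDiff = math.Max(maxDiff, math.Abs(a[k]-b[k]))
	}
	fmt.Printf("max |naive - table| = %g\n", maxDiff)
}
```

The table costs 65536 × 8 bytes = 512 KiB for n = 256, which is paid
once instead of calling math.Cos() in every inner-loop iteration.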
Performance (20-minute recording, 55458 STFT frames):
  Before: 70s (math.Cos dominated)
  After:  11.5s (6x faster)
  Unit test (round-trip): 20s → 1.4s (14x faster)
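Fixes 3 and 4 from the list above can be sketched as one-time setup
steps. The definition of a center frame is an assumption here (the
middle frame of each timeRep-long block starting at the rep offset), and
centerFrames and chipsToFloat are hypothetical names.

```go
package main

import "fmt"

// centerFrames returns, for each rep offset r in [0, timeRep), the list of
// frame indices treated as center frames for that offset.
// Assumption: blocks of timeRep frames start at frame r, and the center
// frame of a block is its middle frame, r + timeRep/2 + m*timeRep.
func centerFrames(numFrames, timeRep int) [][]int {
	lists := make([][]int, timeRep)
	for r := 0; r < timeRep; r++ {
		for f := r + timeRep/2; f < numFrames; f += timeRep {
			lists[r] = append(lists[r], f)
		}
	}
	return lists
}

// chipsToFloat converts int8 PN chips to float64 once, so the correlation
// inner loop multiplies float64 values without a per-element conversion.
func chipsToFloat(chips []int8) []float64 {
	out := make([]float64, len(chips))
	for i, c := range chips {
		out[i] = float64(c)
	}
	return out
}

func main() {
	lists := centerFrames(12, 5)
	fmt.Println(lists[0]) // [2 7]
	fmt.Println(chipsToFloat([]int8{1, -1, -1, 1})) // [1 -1 -1 1]
}
```

Each candidate test then iterates over lists[r] directly instead of
evaluating f%timeRep for every frame.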
Note: a coarse/fine search (testing every 10th group offset, then
refining) was attempted but abandoned: the chi-squared metric peak is
too narrow, so the coarse step missed the true peak and produced false
positives. The full 6400-candidate brute-force search is kept for
correctness; the speedup comes entirely from eliminating per-operation
overhead and parallelizing the search, not from reducing the number of
operations.
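A minimal sketch of the parallel brute-force search, assuming a
higher-is-better score per (rep offset, cycle offset) pair. The
candidate type, searchAll, and the score callback are hypothetical; the
real decoder's chi-squared metric is not reproduced here. Each goroutine
writes only its own slot, so no locking is needed before the merge.

```go
package main

import (
	"fmt"
	"sync"
)

// candidate is one (rep offset, cycle offset) pair with its metric value.
type candidate struct {
	rep, cycle int
	score      float64
}

// searchAll tests every cycle offset for every rep offset, one goroutine
// per rep offset, and returns the highest-scoring candidate.
// Requires numReps >= 1 and cyclesPerRep >= 1. score is a stand-in for
// the decoder's detection metric and must be safe to call concurrently.
func searchAll(numReps, cyclesPerRep int, score func(rep, cycle int) float64) candidate {
	best := make([]candidate, numReps) // one private slot per goroutine
	var wg sync.WaitGroup
	for rep := 0; rep < numReps; rep++ {
		wg.Add(1)
		go func(rep int) {
			defer wg.Done()
			local := candidate{rep: rep, cycle: 0, score: score(rep, 0)}
			for c := 1; c < cyclesPerRep; c++ {
				if s := score(rep, c); s > local.score {
					local = candidate{rep: rep, cycle: c, score: s}
				}
			}
			best[rep] = local
		}(rep)
	}
	wg.Wait()

	// Final merge: the only step that looks across rep offsets.
	overall := best[0]
	for _, b := range best[1:] {
		if b.score > overall.score {
			overall = b
		}
	}
	return overall
}

func main() {
	// Toy metric with a single-sample peak, the case a coarse step misses.
	score := func(rep, cycle int) float64 {
		if rep == 3 && cycle == 777 {
			return 1
		}
		return 0
	}
	got := searchAll(5, 1280, score) // 5 × 1280 = 6400 candidates
	fmt.Println(got.rep, got.cycle)  // 3 777
}
```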
Add an STFT watermark path inspired by Kirovski & Malvar, including the frequency-domain embedder/decoder, FFT support, and round-trip coverage. Wire the generator and CLI tools to use the new analysis/synthesis flow for watermark experiments on the watermark-rework branch.