The decoder took ~70s for a 20-minute recording. Profiling revealed the
bottleneck was not the 6400-candidate cycle-offset search, but the
cepstrum filter's naive O(N²) DCT, which calls math.Cos() in the inner loop:
55458 STFT frames × 2 passes × 256² × math.Cos() = 7.27 billion calls
At ~20ns per call: ~145 seconds (dominated total runtime)
Fixes:
1. Precomputed cosine table: compute 256×256 = 65536 cosine values
once, then use table lookups in the inner loop. Eliminates all
math.Cos() calls from the per-frame processing.
2. Parallel cycle-offset search: 5 goroutines (one per rep offset),
each searching 1280 cycle offsets independently. The rep offsets
are fully independent — no shared state, no synchronization needed
until the final result merge.
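The fan-out/merge shape can be sketched as below; score() here is a hypothetical stand-in for the real chi-squared metric, chosen so the peak lands at a known offset. Each goroutine writes only to its own slot, so the WaitGroup is the only synchronization.

```go
package main

import (
	"fmt"
	"sync"
)

// score is a hypothetical stand-in for the chi-squared metric; it
// peaks at repOffset=2, cycleOffset=500 so the sketch is testable.
func score(repOffset, cycleOffset int) float64 {
	d := float64((repOffset-2)*(repOffset-2) + (cycleOffset-500)*(cycleOffset-500))
	return 1.0 / (1.0 + d)
}

type result struct {
	repOffset, cycleOffset int
	score                  float64
}

func search() result {
	results := make([]result, 5) // one slot per goroutine: no shared state
	var wg sync.WaitGroup
	for rep := 0; rep < 5; rep++ {
		wg.Add(1)
		go func(rep int) {
			defer wg.Done()
			best := result{repOffset: rep, score: -1}
			for cyc := 0; cyc < 1280; cyc++ { // 1280 cycle offsets per rep offset
				if s := score(rep, cyc); s > best.score {
					best.cycleOffset, best.score = cyc, s
				}
			}
			results[rep] = best // distinct index per goroutine: no lock needed
		}(rep)
	}
	wg.Wait()
	// Final merge is the only step that looks across goroutines.
	best := results[0]
	for _, r := range results[1:] {
		if r.score > best.score {
			best = r
		}
	}
	return best
}

func main() {
	r := search()
	fmt.Printf("rep=%d cycle=%d\n", r.repOffset, r.cycleOffset)
}
```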
3. Precomputed center-frame lists: instead of checking f%timeRep for
every frame in every candidate test, precompute which frames are
center frames for each rep offset. Eliminates per-frame branching.
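A sketch of the precomputation (names illustrative): the f%timeRep test runs once per frame at startup instead of once per frame per candidate, and candidate tests then iterate a dense index slice with no modulo or branch.

```go
package main

import "fmt"

// centerFrames builds one slice of center-frame indices per rep offset,
// replacing the per-frame f%timeRep check in every candidate test.
func centerFrames(numFrames, timeRep int) [][]int {
	lists := make([][]int, timeRep)
	for f := 0; f < numFrames; f++ {
		r := f % timeRep
		lists[r] = append(lists[r], f)
	}
	return lists
}

func main() {
	lists := centerFrames(20, 5)
	// A candidate test for rep offset 2 now walks lists[2] directly.
	fmt.Println(lists[2])
}
```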
4. Float64 PN chip arrays: convert int8 PN chips to float64 once at
startup. Eliminates int8→float64 conversion in the hot inner loop
(204 conversions × 11000 frames × 6400 candidates = 14.4 billion
avoided conversions).
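The conversion hoist can be sketched as follows (correlate() is a hypothetical stand-in for the real inner loop): the int8→float64 widening happens once per chip at startup, and the hot loop multiplies float64s directly.

```go
package main

import "fmt"

// widenChips converts the int8 PN sequence to float64 once at startup.
func widenChips(chips []int8) []float64 {
	out := make([]float64, len(chips))
	for i, c := range chips {
		out[i] = float64(c) // the only int8→float64 conversion, done once
	}
	return out
}

// correlate is a stand-in for the hot inner loop: pure float64 math,
// no per-element conversion.
func correlate(frame, chips []float64) float64 {
	var sum float64
	for i := range chips {
		sum += frame[i] * chips[i]
	}
	return sum
}

func main() {
	pn := []int8{1, -1, 1, 1, -1}
	chips := widenChips(pn)
	frame := []float64{0.5, 0.5, 0.5, 0.5, 0.5}
	fmt.Printf("%.2f\n", correlate(frame, chips))
}
```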
Performance (20-minute recording, 55458 STFT frames):
Before: 70s (math.Cos dominated)
After: 11.5s (6x faster)
Unit test (round-trip): 20s → 1.4s (14x faster)
Note: a coarse/fine search (testing every 10th group offset, then
refining) was attempted but abandoned: the chi-squared metric peak is
too narrow, so the coarse step missed the true peak and caused false
positives. The full 6400-candidate brute-force search is kept for
correctness; the speedup comes entirely from eliminating per-operation
overhead, not from reducing the number of operations.