| @@ -0,0 +1,31 @@ | |||||
| # gpudemod | |||||
| Phase 1 CUDA demod scaffolding. | |||||
| ## Current state | |||||
| - Standard Go builds use `gpudemod_stub.go` (`!cufft`). | |||||
| - `cufft` builds allocate GPU buffers and cross the CGO/CUDA launch boundary. | |||||
| - If CUDA launch wrappers are not backed by compiled kernels yet, the code falls back to CPU DSP. | |||||
| - The shifted IQ path is already wired so a successful GPU freq-shift result can be copied back and reused immediately. | |||||
| ## First real kernel | |||||
| `kernels.cu` contains the first candidate implementation: | |||||
| - `gpud_freq_shift_kernel` | |||||
| This is **not compiled automatically yet** in the current environment because the machine currently lacks a CUDA compiler toolchain in PATH (`nvcc` not found). | |||||
| ## Next machine-side step | |||||
| On a CUDA-capable dev machine with toolchain installed: | |||||
| 1. Compile `kernels.cu` into an object file | |||||
| 2. Link it into the `cufft` build | |||||
| 3. Replace `gpud_launch_freq_shift(...)` stub body with the real kernel launch | |||||
| 4. Validate copied-back shifted IQ against `dsp.FreqShift` | |||||
| 5. Only then move the next stage (FM discriminator) onto the GPU | |||||
| ## Why this is still useful | |||||
| The runtime/buffer/recorder/fallback structure is already in place, so once kernel compilation is available, real acceleration can be inserted without another architecture rewrite. | |||||
| @@ -5,6 +5,8 @@ | |||||
| // - cufft builds allocate GPU buffers and cross the CGO/CUDA launch boundary. | // - cufft builds allocate GPU buffers and cross the CGO/CUDA launch boundary. | ||||
| // - If/when a CUDA freq-shift launch succeeds, the shifted IQ is copied back and | // - If/when a CUDA freq-shift launch succeeds, the shifted IQ is copied back and | ||||
| // reused by the remaining CPU-side FIR/decimate/NFM pipeline. | // reused by the remaining CPU-side FIR/decimate/NFM pipeline. | ||||
| // - kernels.cu contains the first real candidate kernel source, ready for | |||||
| // toolchain-backed integration on a CUDA build machine. | |||||
| // | // | ||||
| // This keeps Phase 1 incremental and verifiable while later phases replace the | // This keeps Phase 1 incremental and verifiable while later phases replace the | ||||
| // placeholder launch wrappers with real kernels. | // placeholder launch wrappers with real kernels. | ||||
| @@ -0,0 +1,21 @@ | |||||
| #include <cuda_runtime.h> | |||||
| #include <math.h> | |||||
| extern "C" __global__ void gpud_freq_shift_kernel( | |||||
| const float2* __restrict__ in, | |||||
| float2* __restrict__ out, | |||||
| int n, | |||||
| double phase_inc, | |||||
| double phase_start | |||||
| ) { | |||||
| int idx = blockIdx.x * blockDim.x + threadIdx.x; | |||||
| if (idx >= n) return; | |||||
| double phase = phase_start + phase_inc * (double)idx; | |||||
| float si, co; | |||||
| sincosf((float)phase, &si, &co); | |||||
| float2 v = in[idx]; | |||||
| out[idx].x = v.x * co - v.y * si; | |||||
| out[idx].y = v.x * si + v.y * co; | |||||
| } | |||||