15-618 Final Project

Team: Steven Kolawole

We implement and optimize batched speculative decoding verification kernels in CUDA, targeting the parallelism challenges that arise when verifying draft tokens across multiple sequences simultaneously: warp divergence, non-coalesced KV cache writes, and dynamic load imbalance caused by sequences accepting different numbers of draft tokens at runtime. We profile these bottlenecks with Nsight Compute, implement optimizations using warp-level primitives and prefix sums, and evaluate performance on a GPU across batch sizes and acceptance rates.
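To make the load-imbalance problem concrete, here is a minimal host-side C++ sketch of the logic a verification kernel has to reproduce. It is an illustration under stated assumptions, not the project's kernel: it assumes greedy verification (a draft token is accepted only if it equals the target model's token at that position), and the names `accepted_prefix_length` and `exclusive_scan` are hypothetical. Each sequence accepts a different number of tokens, and an exclusive prefix sum over those counts gives every sequence a contiguous write offset for its accepted entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: greedy verification. Returns how many leading draft tokens
// match the target model's tokens for one sequence; verification stops at
// the first mismatch, so the result varies per sequence at runtime.
std::size_t accepted_prefix_length(const std::vector<int>& draft,
                                   const std::vector<int>& target) {
    assert(draft.size() == target.size());
    std::size_t n = 0;
    while (n < draft.size() && draft[n] == target[n]) {
        ++n;
    }
    return n;
}

// Exclusive prefix sum over per-sequence accepted counts.
// offsets[i] is where sequence i would start writing its accepted entries
// in a packed buffer; the final element is the total accepted count.
std::vector<std::size_t> exclusive_scan(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets(counts.size() + 1, 0);
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i + 1] = offsets[i] + counts[i];
    }
    return offsets;
}
```

For example, accepted counts of 2, 0, and 4 across three sequences yield offsets 0, 2, 2 with a total of 6. On a GPU, the per-sequence loop and the scan would typically be replaced by parallel equivalents, which is where warp-level primitives become relevant.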

Reports

Proposal
Milestone Report
Final Report