A – initial port to CUDA
B – hoisting loop-invariant video memory
allocation/deallocation out of the loop
C – three kernels (x and y of the numerator, and the
denominator) merged into a single kernel
D – general optimization of conditional logic in the
kernel
E – moving loop-invariant computations out into a separate
kernel
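
As an illustration of step B, the sketch below allocates the device buffer once before the iteration loop and frees it once afterwards, instead of calling cudaMalloc/cudaFree on every pass. It is a minimal sketch, not the original code: the kernel `step`, the helper `run_iterations`, and the scaling workload are hypothetical placeholders chosen only to show the pattern.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-iteration kernel: scales each element in place.
__global__ void step(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs `iters` kernel launches over `host`. The device buffer does not
// depend on the loop index, so it is allocated once before the loop and
// released once after it. Returns false if any CUDA call fails.
bool run_iterations(std::vector<float> &host, int iters, float factor) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;  // nothing to do; avoids a zero-block launch
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;  // once, outside the loop
    if (cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iters; ++it) {
        // Only the kernel launch remains inside the loop.
        step<<<blocks, threads>>>(dev, n, factor);
    }

    // Check for launch errors, then copy the result back (the copy also
    // synchronizes with the preceding kernels on the default stream).
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);  // once, after the loop
    return err == cudaSuccess;
}
```

Calling `run_iterations(v, 3, 2.0f)` on a vector of ones would leave every element equal to 8. The same reasoning motivates step E: work whose result does not change between iterations can be computed once, in its own kernel, and reused.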