CUDA vs CPU maintenance

Hi… I have an algorithm I originally coded up in Numba, and then used Numba’s CUDA support to move it to the GPU. Now that I have it working in both, I’m thinking about maintenance of both versions and wondering if there’s a pattern I can use where my CUDA code could perhaps be reused by the CPU version. At the end of the day the CUDA kernels should all be nopython compatible, I think, and could maybe become ufuncs that can then operate over NumPy arrays? Has anyone put any thought into this and come up with a good way to handle this type of situation?

I’m thinking that if I define a decorator that I use on my code, and then set it to either cuda.jit or numba.jit, I should be able to at least get my kernels to compile for both environments.
I’d probably have to do something tricky for the CUDA device/host copies to make them no-ops when compiling for the CPU configuration.
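
Something like this is what I have in mind (just a sketch; my_jit and MY_TARGET are placeholder names I made up, not real Numba settings):

import os
from numba import cuda, njit

# Placeholder configuration switch, not a real Numba setting.
TARGET = os.environ.get("MY_TARGET", "cpu")

def my_jit(func):
    # Shared element-wise logic becomes a CUDA device function on the GPU
    # path and an ordinary nopython function on the CPU path; the kernel
    # launch / array loop around it would still be target-specific.
    if TARGET == "cuda":
        return cuda.jit(device=True)(func)
    return njit(func)

@my_jit
def saxpy_element(a, x, y):
    return a * x + y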

And the places where I’m invoking the CUDA kernels would need to instead invoke the vectorized ufuncs, which would operate over the arrays much like NumPy’s built-in functions?

Any wisdom people can share?

wondering if there’s a pattern I can use where my CUDA code could perhaps be reused by the CPU version. At the end of the day the CUDA kernels should all be nopython compatible, I think, and could maybe become ufuncs that can then operate over NumPy arrays? Has anyone put any thought into this and come up with a good way to handle this type of situation?

Functions decorated with @jit that compile in nopython mode can be called by CUDA kernels. For example, within the CUDA target’s RNG implementation, the xoroshiro128p_next function is decorated with @jit and is called both from CUDA kernels (via xoroshiro128p_uniform_float32 and similar functions) and from init_xoroshiro128p_states_cpu (via xoroshiro128p_jump). In fact, the functions in the cuda.random module are all decorated only with @jit.
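
As a minimal sketch of that pattern (the helper scale_and_offset and the two wrappers are made-up names for illustration, not part of Numba), the same @jit-compiled helper can serve both a CPU loop and a CUDA kernel:

from numba import cuda, jit
import numpy as np

# Shared element-wise logic, compiled in nopython mode.
@jit(nopython=True)
def scale_and_offset(x, a, b):
    return a * x + b

# CPU path: an ordinary jitted loop that calls the shared helper.
@jit(nopython=True)
def apply_cpu(arr, a, b):
    out = np.empty_like(arr)
    for i in range(arr.size):
        out[i] = scale_and_offset(arr[i], a, b)
    return out

# GPU path: a CUDA kernel that calls the same helper.
@cuda.jit
def apply_gpu(arr, a, b, out):
    i = cuda.grid(1)
    if i < arr.size:
        out[i] = scale_and_offset(arr[i], a, b)

x = np.random.random(16)
cpu_result = apply_cpu(x, 2.0, 1.0)
gpu_result = np.empty_like(x)
apply_gpu[1, 32](x, 2.0, 1.0, gpu_result)
np.testing.assert_allclose(cpu_result, gpu_result)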

For ufuncs, one can write a single function definition then vectorize it for both targets:

from numba import vectorize, float64
import numpy as np

x = np.random.random(10)
y = np.random.random(10)

# A function to vectorize for both targets
def add(a, b):
    return a + b

# CUDA target requires types to be specified - a list of signatures
# in the form return_type(arg_types, ...)
cuda_add = vectorize([float64(float64, float64)], target='cuda')(add)
cpu_add = vectorize(target='cpu')(add)

gpu_result = cuda_add(x, y)
cpu_result = cpu_add(x, y)

# Sanity check
np.testing.assert_allclose(gpu_result, cpu_result)

Do these options provide enough flexibility for your use case?

If you go down the route of heavily optimizing for one or both targets, you may find that you need separate specialised versions of the functions for each target, due to the different characteristics of CPUs and GPUs. Starting off with a single function for both is a good first step, though, and you can specialise further if the need arises.

I’d probably have to do something tricky for the CUDA device/host copies to make them no-ops when compiling for the CPU configuration.

If you can ensure that you are only passing device arrays to the CUDA-jitted functions, then this should avoid Numba automatically inserting any copies.
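
For example, a sketch reusing the cuda_add ufunc from above (passing device arrays to a CUDA ufunc should also give back a device array rather than copying the result to the host automatically):

from numba import cuda

# Copy the inputs to the device explicitly, once.
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)

# No implicit host/device transfers are inserted for device array arguments.
d_result = cuda_add(d_x, d_y)

# Copy back to the host only when the result is actually needed there.
result = d_result.copy_to_host()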