Kernel within a kernel

Hi -

Is it possible to start a kernel within a @cuda.jit-ed kernel? Like you can do in C/C++?
PyCuda also supports this dynamic parallelism when compiled with “-rdc=true” option. Is there a way to do the same in Numba?

Thanks.

Numba doesn’t presently support dynamic parallelism, but you can call other @jit- and @cuda.jit-decorated functions from within a CUDA kernel - see for example Examples — Numba 0.56.2+0.gd6731f6d2.dirty-py3.7-linux-x86_64.egg documentation

Does this address your use case? If not, can you outline the pattern you’re interested in implementing and we can look for a way to build what you want with Numba?

Thank you @gmarkall. Does that mean that we can use @njit-ed cpu functions with some native numpy (or potentially even cupy?) functionality within @cuda.jit-ed kernels/functions? Thanks again.

You can’t call CuPy functions from inside a kernel (but you can call them on Numba device arrays from outside a kernel). There is a limited subset of NumPy functionality available in kernels, documented at Supported Python features in CUDA Python — Numba 0.56.2+0.gd6731f6d2.dirty-py3.7-linux-x86_64.egg documentation

I am working on expanding the supported NumPy features in kernels, starting with trigonometric functions (CUDA: Add trig ufunc support by gmarkall · Pull Request #8294 · numba/numba · GitHub), and I hope to enable most ufuncs following this PR.

Thank you again @gmarkall.
Our sample use case is outlined in the pseudo code below. Please refer to short comments in code to get more context.

FYI - most critical numpy functions for business logic are:
np.where, np.sort, np.argsort, np.split, np.unique, np.isin, and similar functions.

These are supported in numba but not in numba cuda.

Any workaround for achieving below objectives would be much appreciated.

@jit(nopython=True, cache=True)
def where(d2_array, col_index, _data_):
    return d2_array[np.where(d2_array[:, col_index] == _data_)]


@cuda.jit 
"""
    :param unique_data: unique data is a one dimentional array
    :param cp_data: is a huge 2D Array
    :param cuda_array_of_arrays: empty array of 2D arrays where data will be filtered by member of the unique_data
    :param col_index: index of the column
    :param book_size: book size (int)
    :param long_book: is 1 or 0; True/False
    :return: cuda_array_of_arrays array of 2D arrays filtered by date unique_data

"""
def fill_arrays_with_data_based_on_unique_data(unique_data, cp_data, cuda_array_of_arrays, col_index, book_size, long_book):
    _index = cuda.grid(1)
    if _index < unique_data.shape[0]:
        arr = cuda_array_of_arrays[_index]
        _data_ = unique_data[_index]
        counter = 0
        rank = book_size
        # would be lovely if below (where) function worked, but it doesn't
        cp_2d_array = where(cp_data, col_index, _data_)
       # would be great if below for loop could execute in parallel using seperate kernel function
        for row in range(cp_2d_array.shape[0]):
            if counter < arr.shape[0]:
                for col in range(cp_2d_array.shape[1]):
                    arr[counter, col] = cp_2d_array[row, col]
                if long_book == 1:
                    arr[counter, 3] = rank  # would be very usefuls if np.argsort() worked in numba cuda
                    arr[counter, 10] = arr[counter, 7]
                    arr[counter, 13] = arr[counter, 8]
                    rank -= 1
                counter += 1

Thanks for sharing the example - I think it will be tricky to make this work in Numba without rewriting the NumPy functions using more primitive constructs and working to avoid memory allocation in the kernel.

I think CuPy supports most or all of the functions you’re using, so if you can rework it to use CuPy calls on large enough items of data, then I think that could be a more promising route than using Numba to begin with.

Agreed! But often default CuPy functions are insufficient for parallelizing complex [business logic driven] algorithms which are in principle susceptible to parallelization.

In fact, we could achieve these objectives by leveraging CUDA’s dynamic parallelism, even without the ability of using numpy/cupy within the kernel itself. But outsourcing the same functionality to a numba’s device function running on a single thread will simply not cut it, performance wise.

Hope that Numba Team will considering implementing dynamic parallelism sometime soon. This would really bring Python into the Cuda world, fully and finally. :slight_smile:

Next best thing, PyCuda, is still a C/C++ stack disguised as python.

Hi @gmarkall , I know that we can call @jit decorated functions inside a CUDA kernel in which case Numba compiler will automatically compile a CUDA version of it to act as CUDA device function taking into consideration the restrictions of the GPU complier. Does The same thing happen when we call a @cuda.jit decorated function inside another @cuda.jit decorated function (kernel inside a kernel) will it be considered a device function? and how will it deal with the thread index operations that is done inside the kernel? (I know that calling a kernel inside a kernel is typically referred to as dynamic parallelism which is not supported yet in Numba but you said we can call a @cuda.jit decorated function inside a kernel so does the behavior is different as I described that the inner kernel will be executed thread-wise?)

Yes.

If you use the thread index operations inside a device function, they will return the current thread index, just as they do as if you’d called them inside the “top level” kernel.

It will be called as if it’s a device function - each thread executing the kernel will call into the callee.

You’re correct that dynamic parallelism is not yet supported in Numba - I’m not sure of how many use cases there are for it, but it hasn’t ever come up as a priority before.

1 Like

Thank you so much for clarification I really appreciate it. I now get it that it is possible in Numba to call a kernel inside a kernel as long as they are operating on the same grid while using the same thread index, I didn’t know that this was possible before. (What is not possible as per my understanding now is a thread inside a kernel launching a new grid).

To demonstrate the point, I wrote the below snippet of two kernels where the inner kernel multiplies the array elements by 2 and the outer kernel adds 1 to the result, and it gave expected output, just in case if anyone find it useful.

import numpy as np
from numba import cuda

random_cpu = np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
random_gpu = cuda.to_device(random_cpu)
out_arr = cuda.device_array_like(random_cpu)

# Inner kernel definition
@cuda.jit(inline=True)
def multtiplies_by_two(in_arr, out_arr):
    pos = cuda.grid(1)
    if pos < in_arr.shape[0]:
        out_arr[pos] = in_arr[pos] * 2

# Outer kernel definition
@cuda.jit
def adds_one(in_arr, out_arr):
    pos = cuda.grid(1)
    if pos < in_arr.shape[0]:
        multtiplies_by_two(in_arr, out_arr)
        out_arr[pos] = out_arr[pos] + 1

# Outer kernel config
thread_per_block = len(random_cpu)
blocks_per_grid = 1

# Launch the outer kernel
adds_one[blocks_per_grid, thread_per_block](random_gpu, out_arr)

# Copy output to host
out_from_gpu = out_arr.copy_to_host()
# The result is as intended: array([ 1.,  3.,  5.,  7.,  9., 11., 13., 15., 17., 19.])
1 Like

Thanks for sharing your example. A couple of other comments, which I hope are helpful:

  • The inline=True kwarg has no effect, so it can be removed.
  • It’s also possible to return values from device functions, which can be more convenient for elementwise operations, to save indexing in a device function. I’ve added another function (add_one()) which does this.

The example modified with these points in mind follows:

import numpy as np
from numba import cuda

random_cpu = np.arange(10)
random_gpu = cuda.to_device(random_cpu)
out_arr = cuda.device_array_like(random_cpu)


# Inner function definition
@cuda.jit
def multtiplies_by_two(in_arr, out_arr):
    pos = cuda.grid(1)
    if pos < in_arr.shape[0]:
        out_arr[pos] = in_arr[pos] * 2


# Another inner function
@cuda.jit
def adds_one(x):
    return x + 1


# Outer kernel definition
@cuda.jit
def multiply_and_add(in_arr, out_arr):
    pos = cuda.grid(1)
    if pos < in_arr.shape[0]:
        multtiplies_by_two(in_arr, out_arr)
        out_arr[pos] = adds_one(out_arr[pos])


# Outer kernel config
thread_per_block = len(random_cpu)
blocks_per_grid = 1

# Launch the outer kernel
multiply_and_add[blocks_per_grid, thread_per_block](random_gpu, out_arr)

# Copy output to host
out_from_gpu = out_arr.copy_to_host()
print(out_from_gpu)
1 Like