Kernel within a kernel

Hi -

Is it possible to launch a kernel from within a @cuda.jit-ed kernel, like you can in CUDA C/C++?
PyCUDA also supports this dynamic parallelism when compiled with the “-rdc=true” option. Is there a way to do the same in Numba?

Thanks.

Numba doesn’t presently support dynamic parallelism, but you can call other @jit- and @cuda.jit-decorated functions from within a CUDA kernel - see, for example, the Examples page of the Numba documentation.
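
To illustrate, here is a minimal sketch of a device function being called from a kernel (the names are illustrative):

import numpy as np
from numba import cuda

# Device functions are compiled for the GPU and are callable
# from kernels (but not from the host).
@cuda.jit(device=True)
def add_one(x):
    return x + 1

@cuda.jit
def increment_kernel(arr):
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] = add_one(arr[i])

arr = np.zeros(16)
increment_kernel[1, 16](arr)  # arr becomes all ones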

Does this address your use case? If not, can you outline the pattern you’re interested in implementing and we can look for a way to build what you want with Numba?

Thank you @gmarkall. Does that mean we can use @njit-ed CPU functions, with some native NumPy (or potentially even CuPy?) functionality, within @cuda.jit-ed kernels/functions? Thanks again.

You can’t call CuPy functions from inside a kernel (but you can call them on Numba device arrays from outside a kernel). There is a limited subset of NumPy functionality available in kernels, documented on the “Supported Python features in CUDA Python” page of the Numba documentation.
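
For instance, something along these lines works (a sketch, assuming CuPy is installed; cp.asarray takes a zero-copy view of the Numba device array via __cuda_array_interface__):

import numpy as np
import cupy as cp
from numba import cuda

d_arr = cuda.to_device(np.array([3.0, 1.0, 2.0]))  # Numba device array
cp_view = cp.asarray(d_arr)     # zero-copy view of the same device memory
sorted_data = cp.sort(cp_view)  # CuPy call on the data, outside any kernel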

I am working on expanding the supported NumPy features in kernels, starting with trigonometric functions (numba/numba PR #8294, “CUDA: Add trig ufunc support”), and I hope to enable most ufuncs following this PR.
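
In the meantime, the scalar equivalents from Python’s math module are already supported inside kernels - a minimal sketch:

import math
import numpy as np
from numba import cuda

@cuda.jit
def sin_kernel(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = math.sin(x[i])  # math.sin works in kernels today

x = np.linspace(0, np.pi, 32)
out = np.zeros_like(x)
sin_kernel[1, 32](x, out)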

Thank you again @gmarkall.
Our sample use case is outlined in the pseudocode below. Please refer to the short comments in the code for more context.

FYI - the most critical NumPy functions for our business logic are:
np.where, np.sort, np.argsort, np.split, np.unique, np.isin, and similar functions.

These are supported in Numba on the CPU, but not in Numba CUDA.

Any workaround for achieving the objectives below would be much appreciated.

import numpy as np
from numba import cuda, jit


@jit(nopython=True, cache=True)
def where(d2_array, col_index, _data_):
    return d2_array[np.where(d2_array[:, col_index] == _data_)]


@cuda.jit
def fill_arrays_with_data_based_on_unique_data(unique_data, cp_data, cuda_array_of_arrays, col_index, book_size, long_book):
    """
    :param unique_data: a one-dimensional array of unique values
    :param cp_data: a huge 2D array
    :param cuda_array_of_arrays: empty array of 2D arrays, to be filled with data filtered by each member of unique_data
    :param col_index: index of the column to filter on
    :param book_size: book size (int)
    :param long_book: 1 or 0 (True/False)
    :return: cuda_array_of_arrays, an array of 2D arrays filtered by each value in unique_data
    """
    _index = cuda.grid(1)
    if _index < unique_data.shape[0]:
        arr = cuda_array_of_arrays[_index]
        _data_ = unique_data[_index]
        counter = 0
        rank = book_size
        # would be lovely if the where() function below worked in a kernel, but it doesn't
        cp_2d_array = where(cp_data, col_index, _data_)
        # would be great if the loop below could execute in parallel as a separate kernel launch
        for row in range(cp_2d_array.shape[0]):
            if counter < arr.shape[0]:
                for col in range(cp_2d_array.shape[1]):
                    arr[counter, col] = cp_2d_array[row, col]
                if long_book == 1:
                    arr[counter, 3] = rank  # would be very useful if np.argsort() worked in Numba CUDA
                    arr[counter, 10] = arr[counter, 7]
                    arr[counter, 13] = arr[counter, 8]
                    rank -= 1
                counter += 1

Thanks for sharing the example - I think it will be tricky to make this work in Numba without rewriting the NumPy functions using more primitive constructs and working to avoid memory allocation in the kernel.
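
To give a flavour of the kind of rewrite I mean: the where() above can be expressed as an explicit loop that writes matching rows into a preallocated buffer, avoiding allocation in the kernel. This is only a sketch - out is a hypothetical per-thread scratch array you would need to allocate from the host:

from numba import cuda

@cuda.jit(device=True)
def filter_rows(d2_array, col_index, value, out):
    # Equivalent of d2_array[np.where(d2_array[:, col_index] == value)]:
    # copy the matching rows into `out` and return how many were found.
    count = 0
    for row in range(d2_array.shape[0]):
        if d2_array[row, col_index] == value and count < out.shape[0]:
            for col in range(d2_array.shape[1]):
                out[count, col] = d2_array[row, col]
            count += 1
    return count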

I think CuPy supports most or all of the functions you’re using, so if you can rework it to use CuPy calls on large enough items of data, then I think that could be a more promising route than using Numba to begin with.
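
For example, the filtering step maps naturally onto a boolean mask in CuPy (a sketch, assuming the 2D layout above):

import cupy as cp

def where_cp(d2_array, col_index, value):
    # CuPy equivalent of the @jit-ed where() above, executed on the GPU
    return d2_array[d2_array[:, col_index] == value]

cp_data = cp.random.rand(1000, 16)
subset = where_cp(cp_data, 0, cp_data[0, 0])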

Agreed! But the stock CuPy functions are often insufficient for parallelizing complex, business-logic-driven algorithms that are in principle amenable to parallelization.

In fact, we could achieve these objectives by leveraging CUDA’s dynamic parallelism, even without the ability to use NumPy/CuPy within the kernel itself. But outsourcing the same functionality to a Numba device function running on a single thread simply won’t cut it, performance-wise.

I hope the Numba team will consider implementing dynamic parallelism sometime soon. This would really bring Python into the CUDA world, fully and finally. :slight_smile:

The next best thing, PyCUDA, is still a C/C++ stack disguised as Python.