Dynamically adjust size of cuda.local.array without environment variables

For CUDA/Numba, is there a way to dynamically adjust the size of cuda.local.array without needing to rely on os.environ? I have a very hacky workaround using os.environ that essentially looks like the following:

import os
from importlib import reload
os.environ["COLUMNS"] = A.shape[1]
import my_module # noqa
reload(my_module)
fn = my_module.fn

my_module.py

import os
from numba import cuda
cols = int(os.environ.get("COLUMNS"))  # env values come back as strings

# ...
# functions that use cuda.local.array with `cols` as the size

However, this has been the source of much pain. While this hack has been working for deployment code, my current pinch point is that pytest doesn’t seem to inherit the environment variables, and it’s turning into a rabbit hole of “workarounds for my workaround”. Should I write the variable to a file instead? I suspect there is something more robust than what I just showed.

Are you looking for something that varies the size of a local array in a kernel? e.g.:

from numba import cuda, types


def gen_kernel(local_size):
    @cuda.jit
    def f():
        x = cuda.local.array(local_size, types.int32)
        print("Length of local array is", len(x))

    return f


# Generate kernels with different sized local arrays and launch them with a
# single thread
gen_kernel(2)[1, 1]()
gen_kernel(3)[1, 1]()

# Synchronize to make sure we don't exit before the output can be printed
cuda.synchronize()

which outputs:

Length of local array is 2
Length of local array is 3
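
As an aside (not from the thread, just a sketch with pure-Python stand-ins for the CUDA pieces): if the same sizes recur, the factory can be memoized so each size is only built once — with Numba, the cached object would be the already-compiled kernel, avoiding repeated compilation.

```python
from functools import lru_cache

# Hypothetical sketch: cache one specialized "kernel" per size so repeated
# calls with the same width reuse the already-built function. The list is a
# pure-Python stand-in for cuda.local.array(local_size, types.int32).
@lru_cache(maxsize=None)
def gen_kernel(local_size):
    def f():
        x = [0] * local_size  # stand-in for the local array
        return len(x)
    return f

k2 = gen_kernel(2)
assert k2 is gen_kernel(2)  # same specialized function reused, not rebuilt
assert gen_kernel(3)() == 3
```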

Hi @gmarkall, this is great. I hadn’t considered defining the cuda.jit decorated function within a gen_kernel function itself.

(edited to avoid shadowing issue mentioned by gmarkall)

Slightly modifying this seems to work as well:

import numpy as np
from numba import cuda, types

arr1 = np.zeros(2)
arr2 = np.zeros(3)

def gen_kernel(x):
    local_size = len(x)

    @cuda.jit
    def f():
        local_arr = cuda.local.array(local_size, types.int32)
        print("Length of local array is", len(local_arr))

    return f

gen_kernel(arr1)[1, 1]()
gen_kernel(arr2)[1, 1]()
cuda.synchronize()

If I want to be able to use f in other contexts, I can also do the following with a device function:

import numpy as np
from numba import cuda, types, jit

arr1 = np.zeros(2)
arr2 = np.zeros(3)

@jit
def f2(local_size):
    x = cuda.local.array(local_size, types.int32)
    print("Length of local array is", len(x))

def gen_kernel(x):
    local_size = len(x)

    @cuda.jit
    def f():
        return f2(local_size)
    return f

gen_kernel(arr1)[1, 1]()
gen_kernel(arr2)[1, 1]()
cuda.synchronize()

It seems to break down if I don’t pass in the local_size as a parameter explicitly to the device function.

import numpy as np
from numba import cuda, types, jit

arr1 = np.zeros(2)
arr2 = np.zeros(3)

@jit
def f2(arr):
    local_size = len(arr)
    x = cuda.local.array(local_size, types.int32)
    print("Length of local array is", len(x))

def gen_kernel(x):
    local_size = len(x)

    @cuda.jit
    def f():
        return f2(x)
    return f

gen_kernel(arr1)[1, 1]()
gen_kernel(arr2)[1, 1]()
cuda.synchronize()
Failed in cuda mode pipeline (step: nopython frontend)
Internal error at <numba.core.typeinfer.CallConstraint object at 0x000001A7FD29EEE0>.
module, class, method, function, traceback, frame, or code object was expected, got CPUDispatcher
During: resolving callee type: Function(<numba.cuda.compiler.DeviceDispatcher object at 0x000001A7FD29E7F0>)
During: typing of call at c:\Users\sterg\Documents\GitHub\scratch\discover-size-check\mat_discover\ElM2D\local_array3.py (20)

Enable logging at debug level for details.

File "mat_discover\ElM2D\local_array3.py", line 20:
    def f():
        return f2(x)
        ^
  File "C:\Users\sterg\Documents\GitHub\scratch\discover-size-check\mat_discover\ElM2D\local_array3.py", line 27, in <module> (Current frame)
    gen_kernel(x)[1, 1]()

It’s not immediately obvious why this last one doesn’t work, so the workaround would seem to be that local_size needs to be passed in explicitly as a parameter:

import numpy as np
from numba import cuda, types, jit

arr1 = np.zeros(2)
arr2 = np.zeros(3)

@jit(debug=True)
def f2(arr, local_size):
    x = cuda.local.array(local_size, types.int32)
    print("Length of local array is", len(x))

def gen_kernel(x):
    local_size = len(x)
    @cuda.jit(debug=True)
    def f():
        return f2(x, local_size)
    return f

gen_kernel(arr1)[1, 1]()
gen_kernel(arr2)[1, 1]()
cuda.synchronize()

A minimal sacrifice though.

In your slightly modified version, it is working as you expect, but note that you’re shadowing the definition of x in the kernel inside gen_kernel - it will probably be a good idea to use a different name for the local array inside the function to avoid potential confusion.

Parameterising the size of the local array in a device function only works because constant propagation manages to supply the literal value in the right place. In general using a variable for the local array size doesn’t work. (I have some WIP towards allowing true dynamic local array allocation, but I need to get the chance to finish up that patch).
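
The closure behaviour described here can be seen even in plain Python (a sketch, with a list standing in for the local array): the size is baked into each generated function at the moment the factory runs, which is analogous to the compile-time constant Numba needs to see.

```python
# Each call to the factory produces a distinct closure carrying its own
# frozen 'size' - roughly what lets Numba treat it as a constant.
def gen(size):
    def f():
        return [0] * size  # 'size' is fixed per-closure, not looked up later
    return f

f2, f3 = gen(2), gen(3)
assert f2.__closure__[0].cell_contents == 2  # the baked-in value
assert len(f3()) == 3
```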


Kind of an aside, but another hacky workaround that doesn’t mess with the device function args (and avoids os.environ) is:
my_script.py

import json
import numpy as np
from importlib import reload
import gen_kernel

arr1 = np.zeros(2)

settings = {"COLUMNS": len(arr1)}
with open("dist_matrix_settings.json", "w") as f:
    json.dump(settings, f)

reload(gen_kernel)

gen_kernel_fn = gen_kernel.gen_kernel

gen_kernel_fn(arr1)[1, 1]()

arr2 = np.zeros(3)

settings = {"COLUMNS": len(arr2)}
with open("dist_matrix_settings.json", "w") as f:
    json.dump(settings, f)

reload(gen_kernel)

gen_kernel_fn(arr2)[1, 1]()

gen_kernel.py

import json
from numba import cuda, types, jit

with open("dist_matrix_settings.json", "r") as f:
    settings = json.load(f)
local_size = settings["COLUMNS"]


@jit(debug=True)
def f2(arr):
    x = cuda.local.array(local_size, types.int32)
    print("Length of local array is", len(x))


def gen_kernel(x):
    @cuda.jit(debug=True)
    def f():
        return f2(x)

    return f
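
For comparison, a sketch (pure-Python stand-ins for the cuda.jit pieces, not code from the thread): generating the device function inside the factory as well sidesteps both the JSON file and the module reload, since both functions close over the same size.

```python
# Hypothetical sketch: build f2 and the kernel together so the size flows
# through closures alone - no settings file, no importlib.reload.
def gen_kernel(local_size):
    def f2():
        x = [0] * local_size  # stand-in for cuda.local.array(local_size, ...)
        return len(x)

    def f():
        return f2()

    return f

assert gen_kernel(2)() == 2
assert gen_kernel(3)() == 3
```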

Good point about shadowing. Updated it in the post. I’m not sure I fully understand why that parameterization works, but “In general using a variable for the local array size doesn’t work” has definitely been consistent with my experience so far. Thanks for showing an alternative! :slight_smile: As always, thank you for the thorough and timely help!
