Large gap between launching CUDA kernels

Hello everyone,
When I benchmark the performance of my CUDA kernel like this:

for _ in range(500):
    my_kernel[block_per_grid, thread_per_block](args)

with Nsight Systems, I find there is always a large gap between the launches of consecutive kernels.

The gap is even longer than the execution time of the kernel itself, which never happens when I call a CuPy function multiple times.

My question is: why does this gap exist, and how can I get rid of it?

Best,
Zhenyu

Configuring the kernel is expensive in Numba. Things should be much better in the next release due to a couple of PRs that recently went in.

In the meantime, how much does changing your code to:

configured_kernel = my_kernel[block_per_grid, thread_per_block]
for _ in range(500):
    configured_kernel(args)

close the gap?
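
As a rough way of quantifying this, here's a minimal timing sketch (just an illustration - it reuses my_kernel, block_per_grid, thread_per_block, and args from your snippet, and the helper name average_launch_time is my own) that measures the average time per launch for each form:

import time
from numba import cuda

def average_launch_time(launch, n=500):
    # Time n launches plus a final device synchronize; return the average per launch.
    start = time.perf_counter()
    for _ in range(n):
        launch()
    cuda.synchronize()
    return (time.perf_counter() - start) / n

# Re-configuring on every launch:
per_launch = average_launch_time(lambda: my_kernel[block_per_grid, thread_per_block](args))
# Configuring once up front:
configured_kernel = my_kernel[block_per_grid, thread_per_block]
pre_configured = average_launch_time(lambda: configured_kernel(args))
print(f"{per_launch * 1e6:.1f} us vs {pre_configured * 1e6:.1f} us per launch")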

Hi @gmarkall, thanks for your reply. I think this makes sense, but when I replace the kernel with the configured one, the gap is not closed. Checking the CUDA API trace in Nsight, I notice that cuStreamSynchronize is called three times within the gap. I think this is the reason the gap exists, but I don't understand why synchronization is called multiple times.

Best,

Many thanks for the info!

What are your arguments to the kernel? Can you paste some code that reproduces the issue?

A hunch: are you calling the kernel on CuPy arrays? If so, and you know you don't need to synchronize on any prior CuPy operations, you could try setting the environment variable NUMBA_CUDA_ARRAY_INTERFACE_SYNC to 0 (or, as a config variable in Python, numba.config.CUDA_ARRAY_INTERFACE_SYNC = False) to close the gap. For example, if I run:

from numba import cuda
import cupy as cp

x = cp.arange(10)

@cuda.jit
def kernel(x):
    pass

kernel[1, 1](x)

with CUDA debug logging enabled, I see:

$ NUMBA_CUDA_LOG_LEVEL=DEBUG python repro.py 
== CUDA [434]  INFO -- init
== CUDA [434] DEBUG -- call driver api: cuInit
== CUDA [434] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [434] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [434] DEBUG -- call driver api: cuDeviceGetCount
== CUDA [434] DEBUG -- call driver api: cuDeviceGet
== CUDA [434] DEBUG -- call driver api: cuDeviceGetAttribute
== CUDA [434] DEBUG -- call driver api: cuDeviceGetAttribute
== CUDA [434] DEBUG -- call driver api: cuDeviceGetName
== CUDA [434] DEBUG -- call driver api: cuDeviceGetUuid_v2
== CUDA [434] DEBUG -- call driver api: cuDevicePrimaryCtxRetain
== CUDA [435] DEBUG -- call driver api: cuMemGetInfo_v2
== CUDA [435] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [435] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [435] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [435] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [589] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [589] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [590] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [597] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [597] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [597] DEBUG -- call driver api: cuLinkCreate_v2
== CUDA [597] DEBUG -- call driver api: cuLinkAddData_v2
== CUDA [598] DEBUG -- call driver api: cuLinkComplete
== CUDA [598] DEBUG -- call driver api: cuLinkDestroy
== CUDA [598] DEBUG -- call driver api: cuModuleLoadDataEx
== CUDA [598] DEBUG -- call driver api: cuModuleGetFunction
== CUDA [598] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [598] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [598] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [598] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [598] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [598] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [598] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [598] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [598] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [599] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [599] DEBUG -- call driver api: cuStreamSynchronize
== CUDA [599] DEBUG -- call driver api: cuLaunchKernel
== CUDA [599]  INFO -- add pending dealloc: module_unload ? bytes

and with the sync disabled via the environment variable, I see:

$ NUMBA_CUDA_ARRAY_INTERFACE_SYNC=0 NUMBA_CUDA_LOG_LEVEL=DEBUG python repro.py 
== CUDA [445]  INFO -- init
== CUDA [445] DEBUG -- call driver api: cuInit
== CUDA [446] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [446] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [446] DEBUG -- call driver api: cuDeviceGetCount
== CUDA [446] DEBUG -- call driver api: cuDeviceGet
== CUDA [446] DEBUG -- call driver api: cuDeviceGetAttribute
== CUDA [446] DEBUG -- call driver api: cuDeviceGetAttribute
== CUDA [446] DEBUG -- call driver api: cuDeviceGetName
== CUDA [446] DEBUG -- call driver api: cuDeviceGetUuid_v2
== CUDA [446] DEBUG -- call driver api: cuDevicePrimaryCtxRetain
== CUDA [446] DEBUG -- call driver api: cuMemGetInfo_v2
== CUDA [446] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [447] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [447] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [447] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [600] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [600] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [601] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [607] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [607] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [607] DEBUG -- call driver api: cuLinkCreate_v2
== CUDA [608] DEBUG -- call driver api: cuLinkAddData_v2
== CUDA [608] DEBUG -- call driver api: cuLinkComplete
== CUDA [608] DEBUG -- call driver api: cuLinkDestroy
== CUDA [609] DEBUG -- call driver api: cuModuleLoadDataEx
== CUDA [609] DEBUG -- call driver api: cuModuleGetFunction
== CUDA [609] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [609] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [609] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [609] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [609] DEBUG -- call driver api: cuFuncGetAttribute
== CUDA [609] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [609] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [609] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [609] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [609] DEBUG -- call driver api: cuPointerGetAttribute
== CUDA [609] DEBUG -- call driver api: cuLaunchKernel
== CUDA [609]  INFO -- add pending dealloc: module_unload ? bytes

the difference being basically:

--- sync.txt	2022-06-27 10:56:04.357343269 +0100
+++ no-sync.txt	2022-06-27 10:56:13.741458107 +0100
@@ -35,7 +35,6 @@
 == CUDA [] DEBUG -- call driver api: cuCtxGetCurrent
 == CUDA [] DEBUG -- call driver api: cuCtxGetDevice
 == CUDA [] DEBUG -- call driver api: cuPointerGetAttribute
-== CUDA [] DEBUG -- call driver api: cuStreamSynchronize
 == CUDA [] DEBUG -- call driver api: cuLaunchKernel
 == CUDA []  INFO -- add pending dealloc: module_unload ? bytes

This synchronization behaviour is enabled by default to ensure correctness. It is explained and documented in this section of the CUDA Array Interface specification: Numba's behaviour is intended to meet the requirements laid out by the specification for general correctness when sharing arrays between libraries that operate asynchronously.
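
For illustration, here's a minimal sketch (not from the example above, and it assumes cuda.external_stream is available in your Numba version) of one way to keep ordering correct without the per-array sync: queue the CuPy work and the Numba launch on the same explicit stream.

from numba import cuda, config
import cupy as cp

config.CUDA_ARRAY_INTERFACE_SYNC = False  # skip the per-array cuStreamSynchronize

@cuda.jit
def kernel(x):
    pass

cp_stream = cp.cuda.Stream(non_blocking=True)    # CuPy-owned stream
nb_stream = cuda.external_stream(cp_stream.ptr)  # the same stream, wrapped for Numba

with cp_stream:
    x = cp.arange(10) * 2                        # CuPy work queued on cp_stream

# Queue the Numba launch on the same stream: it runs after the CuPy work
# without needing the per-array synchronize.
kernel[1, 1, nb_stream](x)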

Hi @gmarkall, I really appreciate the quick reply! Following the two optimization strategies you suggested, I set numba.config.CUDA_ARRAY_INTERFACE_SYNC = False and profiled with and without a configured kernel.

The total time changed from 219 ms (no configured kernel, sync=True) to 158 ms (no configured kernel, sync=False) and 147 ms (configured kernel, sync=True). It seems the sync behaviour has a big influence. However, I found that the kernel gap disappears for the first several loop iterations and then appears again, while the gap shrinks from 250 us to 150 us, which is still comparable to the kernel execution time. The change in the gap is really strange to me, and there is no other CUDA API call within the gap.

Here is the code containing the target kernel:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
file : repro.py
created time : 2022/06/27
author : Zhenyu Wei
version : 1.0
contact : zhenyuwei99@gmail.com
copyright : (C)Copyright 2021-2021, Zhenyu Wei and Southeast University
"""

import numpy as np
import cupy as cp
import numba as nb
import numba.cuda as cuda
from cupy.cuda.nvtx import RangePush, RangePop

GRID_POINTS_PER_BLOCK = 8  # 8*8*8 grid points will be solved in one block
TOTAL_POINTS_PER_BLOCK = GRID_POINTS_PER_BLOCK + 2  # 10*10*10 neighbors are needed
nb.config.CUDA_ARRAY_INTERFACE_SYNC = False


@cuda.jit(
    nb.void(
        nb.float32[:, :, ::1],  # relative_permittivity_map
        nb.float32[:, :, ::1],  # coulombic_electric_potential_map
        nb.float32[::1],  # cavity_relative_permittivity
        nb.int32[::1],  # inner_grid_size
        nb.float32[:, :, ::1],  # reaction_field_electric_potential_map
    )
)
def solve_equation(
    relative_permittivity_map,
    coulombic_electric_potential_map,
    cavity_relative_permittivity,
    inner_grid_size,
    reaction_field_electric_potential_map,
):
    # Local index
    local_thread_index = cuda.local.array((3), nb.int32)
    local_thread_index[0] = cuda.threadIdx.x
    local_thread_index[1] = cuda.threadIdx.y
    local_thread_index[2] = cuda.threadIdx.z
    # Global index
    global_thread_index = cuda.local.array((3), nb.int32)
    global_thread_index[0] = local_thread_index[0] + cuda.blockIdx.x * cuda.blockDim.x
    global_thread_index[1] = local_thread_index[1] + cuda.blockIdx.y * cuda.blockDim.y
    global_thread_index[2] = local_thread_index[2] + cuda.blockIdx.z * cuda.blockDim.z
    # Grid size
    grid_size = cuda.local.array((3), nb.int32)
    for i in range(3):
        grid_size[i] = inner_grid_size[i]
        if global_thread_index[i] >= grid_size[i]:
            return
    # Neighbor array index
    local_array_index = cuda.local.array((3), nb.int32)
    for i in range(3):
        local_array_index[i] = local_thread_index[i] + nb.int32(1)
    # Shared array
    shared_relative_permittivity_map = cuda.shared.array(
        (TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK),
        nb.float32,
    )
    shared_coulombic_electric_potential_map = cuda.shared.array(
        (TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK),
        nb.float32,
    )
    shared_reaction_field_electric_potential_map = cuda.shared.array(
        (TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK, TOTAL_POINTS_PER_BLOCK),
        nb.float32,
    )
    # Load self point data
    shared_relative_permittivity_map[
        local_array_index[0], local_array_index[1], local_array_index[2]
    ] = relative_permittivity_map[
        global_thread_index[0], global_thread_index[1], global_thread_index[2]
    ]
    shared_coulombic_electric_potential_map[
        local_array_index[0], local_array_index[1], local_array_index[2]
    ] = coulombic_electric_potential_map[
        global_thread_index[0], global_thread_index[1], global_thread_index[2]
    ]
    shared_reaction_field_electric_potential_map[
        local_array_index[0], local_array_index[1], local_array_index[2]
    ] = reaction_field_electric_potential_map[
        global_thread_index[0], global_thread_index[1], global_thread_index[2]
    ]
    # Load boundary point
    for i in range(3):
        if local_thread_index[i] == 0:
            tmp_local_array_index = local_array_index[i]
            tmp_global_thread_index = global_thread_index[i]
            local_array_index[i] = 0
            global_thread_index[i] -= 1
            shared_relative_permittivity_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = relative_permittivity_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            shared_coulombic_electric_potential_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = coulombic_electric_potential_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            shared_reaction_field_electric_potential_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = reaction_field_electric_potential_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            local_array_index[i] = tmp_local_array_index
            global_thread_index[i] = tmp_global_thread_index
        elif local_array_index[i] == GRID_POINTS_PER_BLOCK or global_thread_index[
            i
        ] == (grid_size[i] - nb.float32(1)):
            # Equivalent to local_thread_index[i] == GRID_POINTS_PER_BLOCK - 1
            tmp_local_array_index = local_array_index[i]
            tmp_global_thread_index = global_thread_index[i]
            local_array_index[i] += 1
            global_thread_index[i] += 1
            global_thread_index[i] *= nb.int32(global_thread_index[i] != grid_size[i])
            shared_relative_permittivity_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = relative_permittivity_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            shared_coulombic_electric_potential_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = coulombic_electric_potential_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            shared_reaction_field_electric_potential_map[
                local_array_index[0], local_array_index[1], local_array_index[2]
            ] = reaction_field_electric_potential_map[
                global_thread_index[0],
                global_thread_index[1],
                global_thread_index[2],
            ]
            local_array_index[i] = tmp_local_array_index
            global_thread_index[i] = tmp_global_thread_index
    cuda.syncthreads()
    # Load constant
    cavity_relative_permittivity = cavity_relative_permittivity[0]
    # Calculate
    new_val = nb.float32(0)
    denominator = nb.float32(0)
    # Self term
    self_relative_permittivity = shared_relative_permittivity_map[
        local_array_index[0],
        local_array_index[1],
        local_array_index[2],
    ]
    self_coulombic_electric_potential = shared_coulombic_electric_potential_map[
        local_array_index[0],
        local_array_index[1],
        local_array_index[2],
    ]
    new_val += (
        nb.float32(6) * cavity_relative_permittivity * self_coulombic_electric_potential
    )
    # Neighbor term
    for i in range(3):
        for j in [-1, 1]:
            local_array_index[i] += j
            relative_permittivity = nb.float32(0.5) * (
                self_relative_permittivity
                + shared_relative_permittivity_map[
                    local_array_index[0],
                    local_array_index[1],
                    local_array_index[2],
                ]
            )
            new_val += (
                relative_permittivity
                * shared_reaction_field_electric_potential_map[
                    local_array_index[0],
                    local_array_index[1],
                    local_array_index[2],
                ]
            )
            new_val += shared_coulombic_electric_potential_map[
                local_array_index[0],
                local_array_index[1],
                local_array_index[2],
            ] * (relative_permittivity - cavity_relative_permittivity)
            denominator += relative_permittivity
            local_array_index[i] -= j
    # Update
    new_val /= denominator
    new_val -= self_coulombic_electric_potential
    old_val = reaction_field_electric_potential_map[
        global_thread_index[0],
        global_thread_index[1],
        global_thread_index[2],
    ]
    cuda.atomic.add(
        reaction_field_electric_potential_map,
        (
            global_thread_index[0],
            global_thread_index[1],
            global_thread_index[2],
        ),
        nb.float32(0.9) * new_val - nb.float32(0.9) * old_val,
    )


if __name__ == "__main__":
    inner_grid_size = [100, 100, 100]
    device_inner_grid_size = cp.array(inner_grid_size, cp.int32)  # int32 to match the declared signature
    device_relative_permittivity_map = cp.ones(inner_grid_size, cp.float32)
    device_coulombic_electric_potential_map = cp.ones(inner_grid_size, cp.float32)
    device_cavity_relative_permittivity = cp.ones([1], cp.float32)
    device_reaction_filed_electric_potential_map = cp.zeros(inner_grid_size, cp.float32)

    thread_per_block = (
        GRID_POINTS_PER_BLOCK,
        GRID_POINTS_PER_BLOCK,
        GRID_POINTS_PER_BLOCK,
    )
    block_per_grid = (
        int(np.ceil(inner_grid_size[0] / GRID_POINTS_PER_BLOCK)),
        int(np.ceil(inner_grid_size[1] / GRID_POINTS_PER_BLOCK)),
        int(np.ceil(inner_grid_size[2] / GRID_POINTS_PER_BLOCK)),
    )
    RangePush("Configure Kernel")
    configured_kernel = solve_equation[block_per_grid, thread_per_block]
    RangePop()
    RangePush("Update")
    for _ in range(500):
        configured_kernel(
            device_relative_permittivity_map,
            device_coulombic_electric_potential_map,
            device_cavity_relative_permittivity,
            device_inner_grid_size,
            device_reaction_filed_electric_potential_map,
        )
    RangePop()
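
For reference, here is a minimal sketch (just an addition for comparison, reusing the imports and names from the script above) of timing a single launch on the device with CUDA events, which could be set against the launch gap reported by Nsight:

start_event, stop_event = cuda.event(), cuda.event()
start_event.record()
configured_kernel(
    device_relative_permittivity_map,
    device_coulombic_electric_potential_map,
    device_cavity_relative_permittivity,
    device_inner_grid_size,
    device_reaction_filed_electric_potential_map,
)
stop_event.record()
stop_event.synchronize()
# Device-side execution time of one launch, in milliseconds
print("kernel time: %.3f ms" % cuda.event_elapsed_time(start_event, stop_event))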

Hope this helps!

Best,