Performance is comparable between two different Numba CUDA approaches

I am implementing a ray tracer using Numba CUDA and trying to optimize the performance, since I have tons of frames to trace. Two approaches are taking about the same amount of time, even though it seems that one should be clearly faster than the other.

Approach 1 is to initialize the image with dim=(512,512) and simply loop through the desired number of rays in the kernel (1 thread per pixel):

@cuda.jit('void(float64[:], float64[:,:], float64[:,:,:], int64[:,:,:], float64[:,:], int64, float64[:,:])')
def trace_aabb(camera_data, rng_vals, triangles, stack, aabbtree, num_rays, image):

    xpix, ypix = cuda.grid(2)
    
    # same random offsets for all sets of rays across pixels; probably fine
    for ray_idx in range(num_rays):
        dx = rng_vals[ray_idx, RNG_DX]
        dy = rng_vals[ray_idx, RNG_DY]
        angle = rng_vals[ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_idx, RNG_R]
        ...

Approach 2 is to initialize the image with dim=(512, 512*ray_block_size) and have 1 thread per pixel per block of rays (ray_block_size threads per pixel):

@cuda.jit('void(float64[:], float64[:,:,:], float64[:,:,:], int64[:,:,:,:], float64[:,:], int64, int64, float64[:,:])')
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):
    
    
    xpix, y = cuda.grid(2)

    # the second grid axis packs the pixel row and the ray-block index together,
    # so split it back out here
    dim_x = image.shape[0]
    dim_y = image.shape[1] // ray_block_size

    ypix = y % dim_y
    ray_block_idx = y // dim_y

    for ray_idx in range(rays_per_block):
        dx = rng_vals[ray_block_idx, ray_idx, RNG_DX]
        dy = rng_vals[ray_block_idx, ray_idx, RNG_DY]
        angle = rng_vals[ray_block_idx, ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_block_idx, ray_idx, RNG_R]
        ...
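
For reference, the two kernels are launched with grids shaped roughly like this (the block size and argument names below are illustrative placeholders, not my exact setup):

tpb = (16, 16)

# Approach 1: grid covers the (512, 512) image, one thread per pixel
bpg = (512 // tpb[0], 512 // tpb[1])
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_image)

# Approach 2: grid covers (512, 512 * ray_block_size), one thread per pixel per ray block
bpg_new = (512 // tpb[0], (512 * ray_block_size) // tpb[1])
trace_aabb_new[bpg_new, tpb](d_camera_data, d_rng_vals_new, d_tris, d_stack_new, d_tree,
                             ray_block_size, rays_per_block, d_image_new)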

In the first approach the stack has shape (512, 512, stack_size); in the second it is (512, 512, ray_block_size, stack_size). Stack access is per thread, so lookups are stack[xpix, ypix] in approach 1 and stack[xpix, ypix, ray_block_idx] in approach 2. Approach 1 looping through 256 rays takes 13 seconds, while approach 2 with ray_block_size=2 and 128 rays per block takes 11.5 seconds. I expected approach 2 to be significantly faster, so I'm wondering whether I'm doing something obviously wrong with how I'm indexing things. TIA!

I haven’t time to dig into this, but an obvious issue is the signatures:

@cuda.jit('void(float64[:], float64[:,:,:], float64[:,:,:], int64[:,:,:,:], float64[:,:], int64, int64, float64[:,:])')
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):

When you write float64[:], you're telling Numba that the array can have any layout (arbitrary strides), which forces extra index computation on every access. It would be better to omit the signatures and let Numba determine the correct ones, e.g.:

@cuda.jit
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):

Whoops, hit “reply” too soon - I was going to add that I hope to take another look at this after the Christmas break.

Thank you for the feedback! Interestingly, removing the signatures didn't speed things up; it actually increased the time by around 10%. Does this indicate that there might be something more fundamentally wrong with my code? I am using Numba version 0.51.2.

Are you counting the compilation time in your measurements?

I don't believe so; I do an untimed warm-up launch first and then time the second launch like this:

trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()

gpu_start = time.time()
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()
print("GPU Time: {0:1.6f}s ".format(time.time() - gpu_start))

With that version of Numba it may be that dispatch is slower when signatures are not provided (it’s quite an old version, and I can’t remember when it was improved).

I would suggest trying with Numba 0.60.
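
If you want to separate kernel execution time from Python-side launch and dispatch overhead, you could also time the kernel with CUDA events instead of time.time(). A rough sketch, reusing the launch from your snippet:

from numba import cuda

start = cuda.event(timing=True)
end = cuda.event(timing=True)

# warm-up launch so compilation is excluded from the measurement
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()

start.record()
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
end.record()
end.synchronize()

# event_elapsed_time returns milliseconds
print("Kernel time: {0:1.6f}s".format(cuda.event_elapsed_time(start, end) / 1000.0))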

If you need to keep the signatures for performance, I think the following kind of signature (written here for trace_aabb) would be better if your input data is C-contiguous:

'void(float64[::1], float64[:,::1], float64[:,:,::1], int64[:,:,::1], float64[:,::1], int64, float64[:,::1])'

where ::1 in the innermost dimension indicates that its stride is 1 (i.e. the data is C-contiguous), so the indexing computation can be simpler.
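
Before committing to that, you can check whether your host arrays are actually C-contiguous; for example (the array below is just a stand-in for your real data):

import numpy as np

rng_vals = np.random.rand(256, 4)
print(rng_vals.flags['C_CONTIGUOUS'])    # True for a freshly allocated array

# slicing or transposing can produce non-contiguous views; this forces a C-contiguous copy
rng_vals = np.ascontiguousarray(rng_vals)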