Performance is comparable between two different Numba CUDA approaches

I am implementing a ray tracer using Numba CUDA and trying to optimize the performance, since I have tons of frames to trace. Two approaches are taking about the same amount of time, even though it seems that one should be clearly faster than the other.

Approach 1 is to initialize the image with dim=(512,512) and simply loop through the desired number of rays in the kernel (1 thread per pixel):

@cuda.jit('void(float64[:], float64[:,:], float64[:,:,:], int64[:,:,:], float64[:,:], int64, float64[:,:])')
def trace_aabb(camera_data, rng_vals, triangles, stack, aabbtree, num_rays, image):

    xpix, ypix = cuda.grid(2)
    
    # same random offsets for all sets of rays across pixels; probably fine
    for ray_idx in range(num_rays):
        dx = rng_vals[ray_idx, RNG_DX]
        dy = rng_vals[ray_idx, RNG_DY]
        angle = rng_vals[ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_idx, RNG_R]
        ...

Approach 2 is to initialize the image with dim=(512, 512*ray_block_size) and have 1 thread per pixel per block of rays (ray_block_size threads per pixel):

@cuda.jit('void(float64[:], float64[:,:,:], float64[:,:,:], int64[:,:,:,:], float64[:,:], int64, int64, float64[:,:])')
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):
    
    
    xpix, y = cuda.grid(2)

    # the second grid axis packs the pixel row and the ray-block index together,
    # so split it back out here
    dim_x = image.shape[0]
    dim_y = image.shape[1] // ray_block_size

    ypix = y % dim_y
    ray_block_idx = y // dim_y

    for ray_idx in range(rays_per_block):
        dx = rng_vals[ray_block_idx, ray_idx, RNG_DX]
        dy = rng_vals[ray_block_idx, ray_idx, RNG_DY]
        angle = rng_vals[ray_block_idx, ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_block_idx, ray_idx, RNG_R]
        ...
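
For reference, the two kernels are launched with grids shaped roughly like this (the block size and argument names below are illustrative placeholders, not my exact setup):

tpb = (16, 16)

# Approach 1: grid covers the (512, 512) image, one thread per pixel
bpg = (512 // tpb[0], 512 // tpb[1])
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_image)

# Approach 2: grid covers (512, 512 * ray_block_size), one thread per pixel per ray block
bpg_new = (512 // tpb[0], (512 * ray_block_size) // tpb[1])
trace_aabb_new[bpg_new, tpb](d_camera_data, d_rng_vals_new, d_tris, d_stack_new, d_tree,
                             ray_block_size, rays_per_block, d_image_new)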

In the first approach the stack has shape (512, 512, stack_size); in the second it is (512, 512, ray_block_size, stack_size). Stack access is per thread, so lookups are stack[xpix, ypix] in approach 1 and stack[xpix, ypix, ray_block_idx] in approach 2. Approach 1 looping through 256 rays takes 13 seconds, while approach 2 with ray_block_size=2 and 128 rays per block takes 11.5 seconds. I expected approach 2 to be significantly faster, so I'm wondering whether I'm doing something obviously wrong with how I'm indexing things. TIA!

I haven’t time to dig into this, but an obvious issue is the signatures:

@cuda.jit('void(float64[:], float64[:,:,:], float64[:,:,:], int64[:,:,:,:], float64[:,:], int64, int64, float64[:,:])')
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):

When you write float64[:], you're telling Numba that the array can have any layout (arbitrary strides), which forces extra index computation on every access. It would be better to omit the signatures and let Numba determine the correct ones, e.g.:

@cuda.jit
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):

Whoops, hit “reply” too soon - I was going to add that I hope to take another look at this after the Christmas break.

Thank you for the feedback! Interestingly, removing the signatures didn't speed things up; it actually increased the time by around 10%. Does this indicate that there might be something more fundamentally wrong with my code? I am using Numba version 0.51.2.

Are you counting the compilation time in your measurements?

I don't believe so; I do an untimed warm-up launch first and then time the second launch like this:

trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()

gpu_start = time.time()
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()
print("GPU Time: {0:1.6f}s ".format(time.time() - gpu_start))

With that version of Numba it may be that dispatch is slower when signatures are not provided (it’s quite an old version, and I can’t remember when it was improved).

I would suggest trying with Numba 0.60.
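
If you want to separate kernel execution time from Python-side launch and dispatch overhead, you could also time the kernel with CUDA events instead of time.time(). A rough sketch, reusing the launch from your snippet:

from numba import cuda

start = cuda.event(timing=True)
end = cuda.event(timing=True)

# warm-up launch so compilation is excluded from the measurement
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
cuda.synchronize()

start.record()
trace_aabb[bpg, tpb](d_camera_data, d_rng_vals, d_tris, d_stack, d_tree, num_rays, d_materials, image)
end.record()
end.synchronize()

# event_elapsed_time returns milliseconds
print("Kernel time: {0:1.6f}s".format(cuda.event_elapsed_time(start, end) / 1000.0))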

If you need to keep the signatures for performance, I think the following kind of signature (written here for trace_aabb) would be better if your input data is C-contiguous:

'void(float64[::1], float64[:,::1], float64[:,:,::1], int64[:,:,::1], float64[:,::1], int64, float64[:,::1])'

where ::1 in the innermost dimension indicates that its stride is 1 (i.e. the data is C-contiguous), so the indexing computation can be simpler.
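
Before committing to that, you can check whether your host arrays are actually C-contiguous; for example (the array below is just a stand-in for your real data):

import numpy as np

rng_vals = np.random.rand(256, 4)
print(rng_vals.flags['C_CONTIGUOUS'])    # True for a freshly allocated array

# slicing or transposing can produce non-contiguous views; this forces a C-contiguous copy
rng_vals = np.ascontiguousarray(rng_vals)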