I am implementing a ray tracer using Numba CUDA and trying to optimize its performance, since I have a large number of frames to trace. Two approaches take about the same amount of time, even though it seems that one should be faster than the other.
Approach 1 is to initialize the image with dim=(512,512) and simply loop through the desired number of rays in the kernel (1 thread per pixel):
@cuda.jit('void(float64[:], float64[:,:], float64[:,:,:], int64[:,:,:], float64[:,:], int64, float64[:,:])')
def trace_aabb(camera_data, rng_vals, triangles, stack, aabbtree, num_rays, image):
    xpix, ypix = cuda.grid(2)
    # same random offsets for all sets of rays across pixels; probably fine
    for ray_idx in range(0, num_rays):
        dx = rng_vals[ray_idx, RNG_DX]
        dy = rng_vals[ray_idx, RNG_DY]
        angle = rng_vals[ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_idx, RNG_R]
        ...
Approach 2 is to initialize the image with dim=(512, 512*ray_block_size) and have 1 thread per pixel per block of rays (ray_block_size threads per pixel):
@cuda.jit('void(float64[:], float64[:,:,:], float64[:,:,:], int64[:,:,:,:], float64[:,:], int64, int64, float64[:,:])')
def trace_aabb_new(camera_data, rng_vals, triangles, stack, aabbtree, ray_block_size, rays_per_block, image):
    xpix, y = cuda.grid(2)
    y_placeholder = y
    dim_x = image.shape[0]
    dim_y = image.shape[1] // ray_block_size
    ypix = y_placeholder % dim_y
    ray_block_idx = y_placeholder // dim_y
    for ray_idx in range(rays_per_block):
        dx = rng_vals[ray_block_idx, ray_idx, RNG_DX]
        dy = rng_vals[ray_block_idx, ray_idx, RNG_DY]
        angle = rng_vals[ray_block_idx, ray_idx, RNG_THETA]
        distance = camera_data[CAMERA_APERTURE_RADIUS] * rng_vals[ray_block_idx, ray_idx, RNG_R]
        ...
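To sanity-check the index arithmetic in the kernel above, here is a minimal host-side sketch in plain Python (no GPU needed). The helper name split_y and the constants are hypothetical, chosen only for illustration; the arithmetic mirrors the ypix / ray_block_idx lines in trace_aabb_new.

```python
def split_y(y, padded_dim_y, ray_block_size):
    """Split the flattened y thread index the same way trace_aabb_new does.

    padded_dim_y plays the role of image.shape[1], i.e. 512 * ray_block_size.
    """
    dim_y = padded_dim_y // ray_block_size  # real image height in pixels
    ypix = y % dim_y                        # pixel row handled by this thread
    ray_block_idx = y // dim_y              # which block of rays it traces
    return ypix, ray_block_idx


# Hypothetical sizes matching the numbers quoted in the post.
DIM_Y = 512
RAY_BLOCK_SIZE = 2
PADDED_DIM_Y = DIM_Y * RAY_BLOCK_SIZE

# Every y index in the padded range should map to a distinct (ypix, block) pair.
pairs = [split_y(y, PADDED_DIM_Y, RAY_BLOCK_SIZE) for y in range(PADDED_DIM_Y)]
```

If every y maps to a distinct pair, the decomposition itself is at least not the source of duplicated or missing work.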
In the first approach, the stack has shape (512,512,stack_size); in the second, it has shape (512,512,ray_block_size,stack_size). Each thread uses its own stack, so the lookup is stack[xpix,ypix] for approach 1 and stack[xpix,ypix,ray_block_idx] for approach 2.

Approach 1, looping through 256 rays, takes 13 seconds; approach 2, using ray_block_size=2 with 128 rays per block, takes 11.5 seconds. Since approach 2 uses twice as many threads for the same number of rays, I expected it to be significantly faster, so I was wondering if I'm doing something obviously wrong with how I'm indexing things. TIA!
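For reference, here is a rough sketch of where each thread's stack slice starts in memory for the two layouts. It assumes the stack arrays are C-contiguous (the NumPy default) and uses a hypothetical stack_size of 64; flat_offset is an illustrative helper, not part of Numba or NumPy.

```python
def flat_offset(index, shape):
    """Element offset of `index` in a C-contiguous array with the given shape."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


STACK_SIZE = 64      # hypothetical; the real value is not given in the post
RAY_BLOCK_SIZE = 2

shape1 = (512, 512, STACK_SIZE)                  # approach 1
shape2 = (512, 512, RAY_BLOCK_SIZE, STACK_SIZE)  # approach 2

# Distance (in elements) between the stack slices of neighbouring threads.
step_x_1 = flat_offset((1, 0, 0), shape1)         # xpix -> xpix + 1
step_y_1 = flat_offset((0, 1, 0), shape1)         # ypix -> ypix + 1
step_x_2 = flat_offset((1, 0, 0, 0), shape2)      # xpix -> xpix + 1
step_y_2 = flat_offset((0, 1, 0, 0), shape2)      # ypix -> ypix + 1
step_block_2 = flat_offset((0, 0, 1, 0), shape2)  # ray_block_idx -> + 1
```

Under these assumptions, threads that differ only in xpix have stack slices that sit far apart in memory in both layouts, which may or may not matter for the timings.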