About understanding simple CUDA results

In order to write GPU-based simulation code for use in deep learning, I am interested in Numba CUDA.
The code attached below is a simple addition example I wrote to check Numba's parallel-computation performance.
The CUDA code that I run:

import math
import time
import numpy as np
from numba import cuda

N_x, N_y, N_z = 400, 400, 100  # grid dimensions for result 1 below; result 2 uses (800, 800, 200)
n_iter = 10                    # number of timing iterations (the value I used is not shown here)

grid_size = (N_x, N_y, N_z)
a = np.random.random(grid_size)
b = np.random.random(grid_size)

a_g = cuda.to_device(a)          # copy the inputs to the GPU
b_g = cuda.to_device(b)
c_g = cuda.device_array_like(a)  # uninitialised output array on the GPU

@cuda.jit
def f(a, b, c):
    xid, yid, zid = cuda.grid(3)  # global 3D thread index
    size = a.shape
    if xid < size[0] and yid < size[1] and zid < size[2]:  # guard threads outside the grid
        c[xid, yid, zid] = a[xid, yid, zid] + b[xid, yid, zid]

threads_per_block = (16, 16, 4)
blockspergrid_x = math.ceil(grid_size[0] / threads_per_block[0])
blockspergrid_y = math.ceil(grid_size[1] / threads_per_block[1])
blockspergrid_z = math.ceil(grid_size[2] / threads_per_block[2])
blocks_per_grid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)

print(f"CUDA threads: {threads_per_block}, blocks: {blocks_per_grid}, grid_size: {grid_size}")

for i in range(n_iter):
    start_time_cuda = time.time()
    f[blocks_per_grid, threads_per_block](a_g, b_g, c_g)
    cuda.synchronize()  # wait for the kernel to finish before stopping the clock
    end_time_cuda = time.time()
    cuda_execution_time = end_time_cuda - start_time_cuda

# Note: only the last iteration's time is printed; the first iteration also
# includes Numba's JIT compilation of the kernel.
print(f"CUDA Execution Time: {cuda_execution_time:.6f} seconds")

And the results:

Result 1
Used memory [GB]: 4.802609152
grid_size: (400, 400, 100), CUDA threads: (16, 16, 4), blocks: (25, 25, 25)
CUDA Execution Time: 0.004559 seconds

Result 2
Used memory [GB]: 7.877230592
grid_size: (800, 800, 200), CUDA threads: (16, 16, 4), blocks: (50, 50, 50)
CUDA Execution Time: 0.038483 seconds
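
As a back-of-the-envelope check on these two results (a sketch, not part of my benchmark; effective_bandwidth_gb_s is a hypothetical helper, and it assumes float64 data with three arrays' worth of memory traffic per element, i.e. two reads and one write):

def effective_bandwidth_gb_s(shape, seconds, bytes_per_elem=8, n_arrays=3):
    # bytes moved by the kernel divided by the measured time
    n_elems = shape[0] * shape[1] * shape[2]
    return n_elems * bytes_per_elem * n_arrays / seconds / 1e9

print(effective_bandwidth_gb_s((400, 400, 100), 0.004559))  # ~84 GB/s
print(effective_bandwidth_gb_s((800, 800, 200), 0.038483))  # ~80 GB/s

Both runs sustain roughly the same throughput (about 80 GB/s), which is why the time grows in proportion to the grid size.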

I wonder whether it is acceptable to conclude that the GPU's execution time increasing in proportion to the matrix size is a problem caused by the small number of CUDA cores on the GPU I am running.

I used a TITAN RTX (~4,600 CUDA cores). Would using an RTX 3090, which has more CUDA cores, fix that issue?
Or should I use an approach that computes parts of the matrix separately on multiple TITAN RTX cards?

Personally, I find that the matrix computation takes much more time than I expected for building a grid-based simulation solver.

Pasting the full code you're using for benchmarking would help identify exactly how to get a good measurement, but looking at the kernel, I would say it is doing far too little meaningful work to be a useful benchmark.
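
For instance, a cleaner measurement would warm the kernel up once, so Numba's JIT compilation is excluded, and then average over many launches. A minimal sketch, reusing f, the launch configuration, and the device arrays from your code (n_iter is whatever iteration count you prefer):

f[blocks_per_grid, threads_per_block](a_g, b_g, c_g)  # warm-up launch (triggers JIT compilation)
cuda.synchronize()

start = time.perf_counter()
for _ in range(n_iter):
    f[blocks_per_grid, threads_per_block](a_g, b_g, c_g)
cuda.synchronize()  # wait for all launches before stopping the clock
avg = (time.perf_counter() - start) / n_iter
print(f"Average kernel time: {avg:.6f} seconds")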

Generally, grid-based simulations map very well onto GPUs (which is why they are often used as demonstrations of GPU performance), so your application domain should map onto the GPU quite well. When benchmarking, though, you will likely need to write something that more closely models the actual workload to measure performance in a meaningful way.
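
For example (purely illustrative, not your IB-LBM update; jacobi_step is a made-up kernel), a 7-point Jacobi-style stencil reads six neighbours per cell and is closer in shape to a grid-based solver step than elementwise addition. It can be launched with the same blocks_per_grid and threads_per_block as your addition kernel:

@cuda.jit
def jacobi_step(u, u_new):
    # Replace each interior cell with the average of its six axis neighbours
    x, y, z = cuda.grid(3)
    if (x >= 1 and x < u.shape[0] - 1 and
            y >= 1 and y < u.shape[1] - 1 and
            z >= 1 and z < u.shape[2] - 1):
        u_new[x, y, z] = (u[x - 1, y, z] + u[x + 1, y, z] +
                          u[x, y - 1, z] + u[x, y + 1, z] +
                          u[x, y, z - 1] + u[x, y, z + 1]) / 6.0

Even this is still quite light per cell; the closer the benchmark kernel is to the real per-cell update and memory-access pattern, the more transferable its numbers will be.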

I attached the additional code regarding the execution of the Numba code. Please understand that the code above is not a function included in the simulation code. What I am actually writing is an IB-LBM code in Numba, but I think an appropriate interpretation of the results of the simple CUDA code above will resolve all my questions. I would appreciate it if you could share anything you know about my questions regarding the above results.