Errors when using Numba's vectorize with the cuda target and on the normal CPU for denoising point clouds


Recently, I decided to practice using Numba on my NumPy scripts, but Numba refuses to work with a NumPy-only section of my script. At first I thought Numba was throwing errors because I had non-NumPy code in the function I applied the vectorize decorator to, but after creating a new function with only NumPy arguments and code, I still get errors. Here is the section of the code I am attempting to vectorize:

@vectorize([float64(float64, float64, float64, float64, float64, float64)], target='cuda')
def np_filter(k, idx, epsilon, points, points_copy, i):
    neighbors = points[idx, :]
    mean = np.mean(neighbors)
    cov = np.cov(neighbors.T)
    e = np.linalg.inv(cov + epsilon * np.eye(3))

    A = cov @ e
    b = mean - A @ mean
    points_copy[i] = A @ points[i] + b
    return points_copy

I used float64 as both the input and output dtype values, since I want both to be float64. When I try the code with the target='cuda' flag, I get these errors:

 getitem(float64, Tuple(float64, slice<a:b>))
 There are 16 candidate implementations:
    - Of which 16 did not match due to:
    Overload of function 'getitem': File: <numerous>: Line N/A.
      With argument(s): '(float64, Tuple(float64, slice<a:b>))':
     No match.

I thought this might be because the NumPy functions I called cannot be parallelized, so I removed the cuda flag and got new errors:

  File "", line 61, in <module>
  File "", line 14, in main
    pcd = guided_filter(pcd, 0.01, 0.25)
  File "", line 43, in guided_filter
    points_copy = np_filter(k, idx, epsilon, points, points_copy, i)
  ValueError: operands could not be broadcast together with shapes () (114,) () (35947,3) (35947,3) ()

Could someone please help me resolve these errors? I've been having a hard time understanding what might be causing these issues, and even the pure-NumPy function in the filter cannot be optimized, so I'm unsure how to fix my program.

When I run your reproducer without the target='cuda' kwarg to the vectorize decorator, I get:

$ python NumbaWarning: 
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "np_filter" failed type inference due to: No implementation of function Function(<built-in function getitem>) found for signature:
 >>> getitem(float64, Tuple(float64, slice<a:b>))
There are 22 candidate implementations:
   - Of which 22 did not match due to:
   Overload of function 'getitem': File: <numerous>: Line N/A.
     With argument(s): '(float64, Tuple(float64, slice<a:b>))':
    No match.

During: typing of intrinsic-call at (7)

File "", line 7:
def np_filter(k, idx, epsilon, points, points_copy, i):
    neighbors = points[idx, :]

This is because points is typed in your signature as a float64, but you are treating it like a 2D array of float64 when you do points[idx, :]. Note that the vectorize decorator compiles a function that accepts and returns scalars, and when it is invoked, it is invoked elementwise on its input operands. Perhaps the @jit decorator (or cuda.jit) is more appropriate for your function.

To make progress, I’d suggest the following steps:

  1. Remove the vectorize decorator, and make sure your function works as you intend when it is pure Python.
  2. Add the @jit decorator. Don’t initially add a signature specifying the types of arguments - Numba can work these out at call time and compile an appropriate specialization of your function, and adding them (especially when starting out with Numba) can cause more problems than it solves.
  3. If the function no longer works because it contains unsupported operations or functions, you may have to modify it - see the lists of supported Python language features and NumPy functions in the Numba documentation.

Once you have the function working properly (and hopefully fast) with the @jit decorator, you can think about moving to CUDA. My strategy for this is usually:

  1. Replace the @jit decorator with @cuda.jit, and change the syntax when calling the function from f(args) to f[1, 1](args) - this specifies invoking a CUDA kernel with one thread and one block - in effect running the kernel sequentially on the GPU.
  2. There are fewer features supported on the CUDA target than the CPU target, so you may need to rewrite the function - for example, array operations need to be rewritten using for loops over the arrays (i.e. replacing things like r = x + y with for i in range(len(x)): r[i] = x[i] + y[i] where r, x, and y are arrays).
  3. Once you have the function working on the GPU single-threaded, parallelise across threads. This usually involves distributing loop iterations across threads by replacing loops and their induction variables with the result of cuda.grid() - see the kernel examples in the Numba CUDA documentation. When you do this, change the launch parameters to use multiple threads and blocks - e.g. f[1, 1](args) becomes f[griddim, blockdim](args), where griddim and blockdim are appropriate sizes for your data.
  4. At this point you can also think about optimizing data movement by declaring arrays on the device and explicitly transferring data to and from the GPU only when necessary - see the Numba documentation on CUDA memory management.

If you’re having problems progressing through these steps, do feel free to post back - it will also be helpful to have an example that can be run to demonstrate any error messages you encounter in the future.