Since a kernel function can take an np.array as input, do we still need to call to_device explicitly?

I found that kernel functions also accept np.array. Is it recommended to pass np.array directly to reduce the lines of code, or even to get a performance gain?

No, the opposite - it’s recommended to use device arrays so that you don’t force implicit data transfers and synchronization with the device.

If you do pass NumPy arrays, Numba emits a warning:

NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.

Thanks for the advice! By the way, is there any way to initialize a device array with a certain value? Right now I have to create the array with NumPy and then move the data to the GPU.