Hi sklam, that did not happen earlier in the program. I installed CUDA Toolkit 11 on the machine with the GeForce GTX 1060 and the code still runs fine there. The machine with the Quadro RTX 5000 also has CUDA Toolkit 11 installed, and it is the one that gives me those errors.
I found the cause of the CUDA error; it was in this part:
First I created the main_data_structure with a specific shape, which is the source of data for the computation, and a reply_data_structure, which is the one responsible for copying the values from the original data structure in the .copy_to_host() call.
The issue was solved by adding bounds checking when copying the data. I am still not sure why it happens only when changing the type of GPU, because both data structures have the same shape.
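For readers hitting the same error, here is a minimal sketch of what such a bounds check can look like in a Numba kernel. It assumes a simple 1D copy. The names blocks_for, surplus_threads and guarded_copy are hypothetical and are not taken from the code discussed above. The point it illustrates is that the grid is rounded up to whole blocks, so some threads usually get an index past the end of the array and must be filtered out by a guard.

```python
try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except Exception:  # Numba not installed, or the CUDA driver is not usable
    cuda = None
    HAVE_GPU = False


def blocks_for(n_items, threads_per_block):
    """Ceiling division: the number of blocks needed to cover n_items."""
    return (n_items + threads_per_block - 1) // threads_per_block


def surplus_threads(n_items, threads_per_block):
    """Number of launched threads whose index falls past the end of the array."""
    return blocks_for(n_items, threads_per_block) * threads_per_block - n_items


if HAVE_GPU:
    @cuda.jit
    def guarded_copy(src, dst):
        # Hypothetical kernel: copy src into dst, one element per thread.
        i = cuda.grid(1)
        # Guard: threads beyond either array must not read or write.
        if i < src.shape[0] and i < dst.shape[0]:
            dst[i] = src[i]
```

With 1000 elements and 128 threads per block, blocks_for gives 8 blocks, so 1024 threads are launched and 24 of them have no element to work on. On a machine with a GPU the kernel would be launched as guarded_copy[blocks_for(n, 128), 128](src, dst). Out-of-bounds accesses without the guard are undefined behaviour, so they may go unnoticed on one device and fail on another, although I cannot confirm that this is what happened here.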
After solving that CUDA error I got another problem. My algorithm runs something like this:

And my_kernel_func() looks roughly like this; the actual kernel call is inside it:

It runs for most of the datasets, but sometimes it just freezes inside a kernel call. I saw a related post saying that cuda.synchronize() should be used before the kernel call because of some queue timing issues. But that didn’t solve the problem; it still freezes sometimes for some datasets. Do you have any idea why it might be freezing?
Here is the Stack Overflow topic that talks about cuda.synchronize() in a similar problem, though it might not be the same case: https://stackoverflow.com/questions/52263701/why-numba-cuda-is-running-slow-after-recalling-it-several-times/52263834#52263834
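To make the synchronize placement concrete, here is a rough sketch of the pattern, not the actual code from this thread. The function run_with_sync and its parameters launch and synchronize are hypothetical names; the demonstration at the bottom uses stand-in callables so it runs without a GPU.

```python
def run_with_sync(datasets, launch, synchronize):
    """Launch once per dataset, synchronizing before and after each launch."""
    completed = []
    for index, data in enumerate(datasets):
        synchronize()  # wait for any previously queued GPU work
        launch(data)   # on a real GPU this returns before the kernel finishes
        synchronize()  # wait for this launch; a hang would block here
        completed.append(index)
        # Flushed progress output shows which dataset was last to finish.
        print(f"dataset {index} finished", flush=True)
    return completed


# CPU-only demonstration with stand-in callables (assumption: no GPU needed).
events = []
done = run_with_sync(
    datasets=[[1, 2], [3]],
    launch=lambda data: events.append(("launch", len(data))),
    synchronize=lambda: events.append(("sync", None)),
)
```

With Numba, cuda.synchronize would be passed as synchronize and a small function wrapping the kernel call as launch. The second synchronize() after each launch does not fix a freeze, but it should at least show which dataset the program stops on.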