[HELP] Numba CUDA running on different GPUs

Hello there, I am writing code for the GPU using Numba CUDA. The code works perfectly on my machine (GeForce GTX 1060), but when I run it on a machine with a Quadro RTX 5000 it gives me an error:

numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemFree results in UNKNOWN_CUDA_ERROR

Please, I need to know what might be happening and how to fix it.

Did that happen early on in the program, e.g. before calling any kernels?
It probably has to do with which CUDA toolkit version you have. The Quadro RTX 5000 is a CC 7.5 card, which requires CUDA toolkit 10 or newer.
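You can check which toolkit and driver Numba detects by running:

numba -s

(the sysinfo output includes a CUDA section with the detected driver and toolkit versions).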

pinging @gmarkall for additional insights.

Hi sklam, it did not happen early on in the program. I tried installing CUDA toolkit 11 on the machine with the GeForce GTX 1060 and the code still runs fine; the machine with the Quadro RTX 5000 also has CUDA toolkit 11 installed, and it still gives me those errors.

I found the cause of the CUDA error; it was in this part:
First I create a main_data_structure with a specific shape, which is the source of data for the computation, and a reply_data_structure, which receives the values copied from the original data structure in the .copy_to_host() call.

The issue was solved by adding bounds checking when copying the data. I am still not sure why it happens only when changing the type of GPU, since both data structures have the same shape.
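In simplified form, the copy with the added bounds check looks roughly like this (a minimal sketch with placeholder names and sizes, not my actual structures):

import numpy as np
from numba import cuda

@cuda.jit
def copy_kernel(src, dst):
    i = cuda.grid(1)
    if i < dst.shape[0]:  # the added bounds check
        dst[i] = src[i]

main_data_structure = cuda.to_device(np.arange(1024, dtype=np.float32))
reply_data_structure = cuda.device_array(1024, dtype=np.float32)

threads_per_block = 256
blocks = (1024 + threads_per_block - 1) // threads_per_block
copy_kernel[blocks, threads_per_block](main_data_structure, reply_data_structure)
host_values = reply_data_structure.copy_to_host()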

After solving that CUDA error I ran into another problem. My algorithm runs something like this:

[image: code of the driver loop that calls my_kernel_func() for each dataset]
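In text form it is roughly this (a simplified sketch; the datasets here are placeholders, and my_kernel_func() is sketched below):

import numpy as np

# Placeholder datasets; in my real code these come from files.
datasets = [np.random.rand(4096).astype(np.float32) for _ in range(10)]

results = []
for data in datasets:
    results.append(my_kernel_func(data))  # my_kernel_func is defined below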

And my_kernel_func() is almost like this; the actual kernel call is inside it:

[image: body of my_kernel_func(), containing the actual kernel launch]
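Roughly, it does the following (again simplified; gpu_kernel is a trivial stand-in for my real kernel):

from numba import cuda

@cuda.jit
def gpu_kernel(src, dst):  # stand-in for my real kernel
    i = cuda.grid(1)
    if i < dst.shape[0]:
        dst[i] = src[i] * 2.0

def my_kernel_func(data):
    d_data = cuda.to_device(data)
    d_out = cuda.device_array_like(d_data)
    threads_per_block = 256
    blocks = (d_data.shape[0] + threads_per_block - 1) // threads_per_block
    gpu_kernel[blocks, threads_per_block](d_data, d_out)  # the actual kernel call
    return d_out.copy_to_host()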

It runs for most of the datasets, but sometimes it just freezes inside a kernel call. I saw a related post saying that cuda.synchronize() should be used before the kernel call because of kernel queuing issues, but that didn't solve the problem; it still freezes sometimes for some datasets. Do you have any ideas why it might be freezing?

Here is the Stack Overflow topic discussing cuda.synchronize() in a similar problem, although it might not be the same case: https://stackoverflow.com/questions/52263701/why-numba-cuda-is-running-slow-after-recalling-it-several-times/52263834#52263834
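Concretely, what I tried was roughly this (using the placeholder names from the sketches above):

cuda.synchronize()  # wait for all previously queued GPU work to finish
gpu_kernel[blocks, threads_per_block](d_data, d_out)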

I set a timeout on the function that calls the kernel, so that I could end it and restart it. If I set a very small timeout, the kernel is terminated and restarted normally. But when the timeout is long enough and the code still freezes, the timeout is hit and the kernel is (apparently) terminated, but it never restarts. Any CUDA call I make after that doesn't respond, which raised a red flag that the Quadro RTX 5000 might have silently crashed or got into a frozen state. It doesn't respond anymore and never finishes. What might cause that? A wrong memory access? Any ideas?
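For reference, a minimal sketch of the kind of timeout I mean (worker and run_one_dataset are placeholder names; my real code differs in the details):

import multiprocessing as mp

def worker(data, queue):
    queue.put(run_one_dataset(data))  # run_one_dataset: placeholder for the real GPU work

def run_with_timeout(data, timeout_s):
    queue = mp.Queue()
    proc = mp.Process(target=worker, args=(data, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # timeout hit: kill the worker and report failure
        proc.terminate()
        proc.join()
        return None
    return queue.get()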

Also, is there a way I can force the kernel to terminate so that I can run it again?

Can you try running your code under cuda-memcheck to see if it reports any errors? If you were running with:

python code.py

then run instead with:

cuda-memcheck python code.py

I ran:

cuda-memcheck python code.py

and it returned:

========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

Maybe because the kernel call is deeper in the code?

Is it possible that I have to develop different code for different GPU architectures using Numba? Or should it ALWAYS work the same way on every GPU supported by Numba?

Is it possible that I have to develop different code for different GPU architectures using Numba? Or should it ALWAYS work the same way on every GPU supported by Numba?

A correct program should always work the same way on every GPU supported by Numba, with the exception of functionality that requires specific minimum compute capabilities - e.g. some of the warp intrinsics: CUDA Kernel API — Numba 0.50.1 documentation
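For example (an illustrative sketch, not code from this thread), cuda.ballot_sync is one of the warp intrinsics whose availability depends on the toolkit and device:

from numba import cuda
import numpy as np

@cuda.jit
def vote_kernel(flags, out):
    i = cuda.grid(1)
    # All 32 lanes of each warp participate (the launch below uses multiples of 32)
    out[i] = cuda.ballot_sync(0xFFFFFFFF, flags[i] > 0)

flags = np.random.randint(0, 2, 64).astype(np.int32)
out = np.zeros(64, dtype=np.uint32)
vote_kernel[2, 32](flags, out)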

Maybe because the kernel call is deeper in the code?

When you say “deeper in the code”, what do you mean? Did your program launch a subprocess?

Another thing to try: can you test on the RTX 5000 with this Numba PR? https://github.com/numba/numba/pull/6030 - one of the things it fixes is some weird behaviour (an out-of-bounds access) that I’ve seen only on an RTX and not on a GTX - this is one instance of it: CUDA: Invalid global accesses with for loops and indexing · Issue #5576 · numba/numba · GitHub. You probably don’t have an out-of-bounds access if your program runs successfully under cuda-memcheck, but this patch is worth trying because I suspect there’s a multitude of subtle issues that can be caused by the behaviour it removes.

Sorry for this kind of question, but how do I use that specific PR? I installed Numba directly from pip; could you give me an example of how I could use it?

Apologies for the delay in the reply.

The PR is now merged, so you could build Numba from the latest master branch and install it to test whether PR 6030 resolves the issue. The Numba Developer Documentation, starting with Getting Set Up, explains how to get set up and build Numba from source.
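For example, something along these lines should work (assuming the build prerequisites described in those docs, such as a compiler and a compatible llvmlite, are already installed):

git clone https://github.com/numba/numba.git
cd numba
pip install -e .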