[HELP] Numba CUDA running on different GPUs

Hello there, I am writing code for the GPU using Numba CUDA. The code works perfectly on my machine (GeForce GTX 1060), but when I run it on a machine with a Quadro RTX 5000 it gives me an error:

numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemFree results in UNKNOWN_CUDA_ERROR

Please, I need to know what might be happening and how to fix it.

Did that happen early on in the program, e.g. before calling any kernels?
It probably has to do with which CUDA toolkit version you have. The Quadro RTX 5000 is a CC 7.5 card, which requires CUDA toolkit 10 or newer.
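You can check which toolkit and driver Numba detects by running:

numba -s

(the sysinfo output includes a CUDA section with the detected driver and toolkit versions).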

pinging @gmarkall for additional insights.

Hi sklam, it did not happen early on in the program. I tried installing CUDA toolkit 11 on the machine with the GeForce GTX 1060 and the code still runs fine; the machine with the Quadro RTX 5000 also has CUDA toolkit 11 installed, and it still gives me those errors.

I found the cause of the CUDA error; it was in this part:
First I create a main_data_structure with a specific shape, which is the source of data for the computation, and a reply_data_structure, which receives the values copied from the original data structure in the .copy_to_host() call.

The issue was solved by adding bounds checking when copying the data. I am still not sure why it happens only when changing the type of GPU, since both data structures have the same shape.
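In simplified form, the copy with the added bounds check looks roughly like this (a minimal sketch with placeholder names and sizes, not my actual structures):

import numpy as np
from numba import cuda

@cuda.jit
def copy_kernel(src, dst):
    i = cuda.grid(1)
    if i < dst.shape[0]:  # the added bounds check
        dst[i] = src[i]

main_data_structure = cuda.to_device(np.arange(1024, dtype=np.float32))
reply_data_structure = cuda.device_array(1024, dtype=np.float32)

threads_per_block = 256
blocks = (1024 + threads_per_block - 1) // threads_per_block
copy_kernel[blocks, threads_per_block](main_data_structure, reply_data_structure)
host_values = reply_data_structure.copy_to_host()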

After solving that CUDA error I ran into another problem. My algorithm runs something like this:

[image: code of the driver loop that calls my_kernel_func() for each dataset]
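In text form it is roughly this (a simplified sketch; the datasets here are placeholders, and my_kernel_func() is sketched below):

import numpy as np

# Placeholder datasets; in my real code these come from files.
datasets = [np.random.rand(4096).astype(np.float32) for _ in range(10)]

results = []
for data in datasets:
    results.append(my_kernel_func(data))  # my_kernel_func is defined below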

And my_kernel_func() is almost like this; the actual kernel call is inside it:

[image: body of my_kernel_func(), containing the actual kernel launch]
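Roughly, it does the following (again simplified; gpu_kernel is a trivial stand-in for my real kernel):

from numba import cuda

@cuda.jit
def gpu_kernel(src, dst):  # stand-in for my real kernel
    i = cuda.grid(1)
    if i < dst.shape[0]:
        dst[i] = src[i] * 2.0

def my_kernel_func(data):
    d_data = cuda.to_device(data)
    d_out = cuda.device_array_like(d_data)
    threads_per_block = 256
    blocks = (d_data.shape[0] + threads_per_block - 1) // threads_per_block
    gpu_kernel[blocks, threads_per_block](d_data, d_out)  # the actual kernel call
    return d_out.copy_to_host()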

It runs for most of the datasets, but sometimes it just freezes inside a kernel call. I saw a related post saying that cuda.synchronize() should be used before the kernel call because of kernel queuing issues, but that didn't solve the problem; it still freezes sometimes for some datasets. Do you have any ideas why it might be freezing?

Here is the Stack Overflow topic discussing cuda.synchronize() in a similar problem, although it might not be the same case: https://stackoverflow.com/questions/52263701/why-numba-cuda-is-running-slow-after-recalling-it-several-times/52263834#52263834
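Concretely, what I tried was roughly this (using the placeholder names from the sketches above):

cuda.synchronize()  # wait for all previously queued GPU work to finish
gpu_kernel[blocks, threads_per_block](d_data, d_out)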

I set a timeout on the function that calls the kernel, so that I could end it and restart it. If I set a very small timeout, the kernel is terminated and restarted normally. But when the timeout is long enough and the code still freezes, the timeout is hit and the kernel is (apparently) terminated, but it never restarts. Any CUDA call I make after that doesn't respond, which raised a red flag that the Quadro RTX 5000 might have silently crashed or got into a frozen state. It doesn't respond anymore and never finishes. What might cause that? A wrong memory access? Any ideas?
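For reference, a minimal sketch of the kind of timeout I mean (worker and run_one_dataset are placeholder names; my real code differs in the details):

import multiprocessing as mp

def worker(data, queue):
    queue.put(run_one_dataset(data))  # run_one_dataset: placeholder for the real GPU work

def run_with_timeout(data, timeout_s):
    queue = mp.Queue()
    proc = mp.Process(target=worker, args=(data, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # timeout hit: kill the worker and report failure
        proc.terminate()
        proc.join()
        return None
    return queue.get()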

Also, is there a way I can force the kernel to terminate so that I can run it again?

Can you try running your code under cuda-memcheck to see if it reports any errors? If you were running with:

python code.py

then run instead with:

cuda-memcheck python code.py

I ran:

cuda-memcheck python code.py

and it returned:

========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

Maybe because the kernel call is deeper in the code?

Is it possible that I have to develop different code for different GPU architectures using Numba? Or should it ALWAYS work the same way on every GPU supported by Numba?

Is it possible that I have to develop different code for different GPU architectures using Numba? Or should it ALWAYS work the same way on every GPU supported by Numba?

A correct program should always work the same way on every GPU supported by Numba, with the exception of functionality that requires specific minimum compute capabilities - e.g. some of the warp intrinsics: CUDA Kernel API — Numba 0.50.1 documentation
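For example (an illustrative sketch, not code from this thread), cuda.ballot_sync is one of the warp intrinsics whose availability depends on the toolkit and device:

from numba import cuda
import numpy as np

@cuda.jit
def vote_kernel(flags, out):
    i = cuda.grid(1)
    # All 32 lanes of each warp participate (the launch below uses multiples of 32)
    out[i] = cuda.ballot_sync(0xFFFFFFFF, flags[i] > 0)

flags = np.random.randint(0, 2, 64).astype(np.int32)
out = np.zeros(64, dtype=np.uint32)
vote_kernel[2, 32](flags, out)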

Maybe because the kernel call is deeper in the code?

When you say “deeper in the code”, what do you mean? Did your program launch a subprocess?

Another thing to try: can you test on the RTX 5000 with this Numba PR? https://github.com/numba/numba/pull/6030 - one of the things it fixes is some weird behaviour (an out-of-bounds access) that I’ve seen only on an RTX and not on a GTX - this is one instance of it: CUDA: Invalid global accesses with for loops and indexing · Issue #5576 · numba/numba · GitHub. You probably don’t have an out-of-bounds access if your program runs successfully under cuda-memcheck, but this patch is worth trying because I suspect there’s a multitude of subtle issues that can be caused by the behaviour it removes.

Sorry for this kind of question, but how do I use that specific PR? I installed Numba directly from pip; could you give me an example of how I could use it?

Apologies for the delay in the reply.

The PR is now merged, so you could build Numba from the latest master branch and install it to test whether PR 6030 resolves the issue. The Numba Developer Documentation, starting with Getting Set Up, explains how to get set up and build Numba from source.
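For example, something along these lines should work (assuming the build prerequisites described in those docs, such as a compiler and a compatible llvmlite, are already installed):

git clone https://github.com/numba/numba.git
cd numba
pip install -e .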