Hello all,

I was trying to speed up my Python code and discovered the Numba library, which works pretty well on the CPU: I got up to a 14x speed-up in numerically heavy code (mostly FEM analysis). Next, I wanted to run this numerically heavy code in parallel on an NVIDIA GPU.

For the last 10-12 days I have been trying to implement some basic functions with the numba.cuda library, and it has turned into a disaster for me. I have some questions; if anybody can help me, I would be very grateful.

I have listed my questions below in a situation/question format.

Situation: None of numpy, cupy, cupy.cublas, or cupy.[any cuda library] works under the cuda.jit decorator.

Question: How can I use the numba.cuda library efficiently without any basic array creation, array operations, or linear algebra operations?

Question: Should I create my own linear algebra library with matmul, inverse, QR, and LU?

Question: Are there any plans to change this situation and enable the use of external, well-established libraries (numpy, cupy, or any other library) inside numba.cuda code?

Situation: It is really hard to work with arrays on the GPU, since numba.cuda does not allow creating numpy or cupy arrays inside a device function. It lets you create an empty array in local or global memory, but not via a library call. Also, the library claims that the arrays are the same, both being cuda array types.

Question: Why is it so difficult to get a zeros, ones, or identity array? Again, should I create my own library for these basic matrices?

Situation: I tried to multiply a cuda.local.array() by an integer. The console gave me an error saying that an int64 cannot be multiplied with a 2D float32 array.

Question: Am I doing something wrong, or should I work like a lone computer scientist, designing every single operation from vectorization to array operations myself? Should I really have to define my own function that multiplies an array by an integer?

Situation: I think there is a paradox in the CUDA examples of the Numba documentation. They build a fast_matmul function in Numba that uses threads to multiply two matrices. It is not a device function; it is a kernel. Then the documentation says that dynamic parallelism is not supported.

Question: How can I use the fast_matmul function from a different kernel? It seems impossible, a contradiction: because dynamic parallelism is not supported, I can't launch a kernel inside a kernel. I cannot understand the point of this fast_matmul function. If I want to use it, I can't; if I don't use it, numba.cuda seems meaningless.

Question: Why would I use a single fast_matmul function when BLAS is available through cupy.cublas? Are there any advantages to using numba.cuda over other libraries, beyond running simple arithmetic code in parallel?

As far as I understand, I would need a whole summer break to get meaningful use out of this library, and it is impossible for me to find that amount of time. I think it is a good thing to have a library that can run Python on the GPU with simple syntax. However, once things get complicated, one has to use vstacks, hstacks, reordering, multiplications, factorizations, comparisons, etc. Basically, I am trying to understand: is this feasible and doable right now?

Forgive me if I am missing some points. Are these situations universal across GPU programming frameworks (CUDA, ROCm, OpenCL, etc.)? Would I get the same kinds of errors in CUDA C++ or CUDA Fortran? I am not very experienced with GPU programming; I started learning the GPU programming mindset two months ago.

TL;DR: I have numerical code (again, computational mechanics) that must be sped up. I wanted to use numba.cuda, but it does not seem feasible. I would like to use it just like numpy (or cupy), because I am already experienced with Python code and the intuition of the language. Do you have any suggestions for me? If you can help, I would be very happy. Any advice for doing some linear algebra inside a kernel? I am open to any suggestion, because I could not find any source around me or on the internet that can help better than this discourse community. Thank you so much.