Usage of CUDA Python, Linear Algebra on GPU and Computational Code

Hello all,

I was trying to speed up my Python code and discovered the Numba library; it works pretty well on the CPU. I got up to a 14x speed-up in numerically heavy code (mostly FEM analysis) and I wanted to go further by running in parallel, so I was planning to speed up my numerically heavy code on an NVIDIA GPU.

For the last 10-12 days, I have been trying to implement some basic functions with the numba.cuda library and it has been a disaster for me. I have some questions. If anybody can help me, I would be so grateful.

I have listed my questions below in a situation-question style.

Situation: None of numpy, cupy, cupy.cublas, cupy.[any cuda library] works under the cuda.jit decorator.
Question: How can I use the numba.cuda library efficiently without any basic array creation, array operations, or linear algebra operations?
Question: Should I create my own linear algebra library with matmul, inverse, QR, and LU?
Question: Are there any plans to change this situation and enable the usage of external, well-established libraries (numpy, cupy or any other library) with the numba.cuda library?

Situation: It is really hard to work with arrays on the GPU since it does not allow you to create numpy or cupy arrays in a device function. It allows you to create an empty array in local or global memory, but does not allow you to create one with a library. Also, the library claims that the arrays are the same; both are cuda.array-type arrays.
Question: Why is it difficult to get a zeros, ones, or identity array? Again, should I create my own library to create these basic matrices?

Situation: I tried to multiply a cuda.local.array() by an integer. The console gave me an error saying that I cannot multiply an int64 with a 2D float32 array.
Question: Am I doing something wrong, or should I work like a lone computer scientist designing every single operation from vectorization to array operations? Should I define my own function that multiplies an array by an integer?

Situation: I think there is a paradox in the CUDA examples of the Numba documentation. A fast_matmul function is created in Numba which uses threads to multiply two matrices. It is not a device function, it is a kernel. Then, it is said that dynamic parallelism is not supported.
Question: How can I use the fast_matmul function in a different kernel? It is impossible, a contradiction. Because dynamic parallelism is not supported, I can't use a kernel inside a kernel. I cannot understand the point of this fast_matmul function. If I want to use it, I can't. If I don't use it, numba.cuda is meaningless.
Question: Why would I use a single fast_matmul function while BLAS is available in cupy.cublas? Are there any advantages of using numba.cuda with respect to other libraries, other than running simple arithmetic code in parallel?

As far as I understand, I would need a whole summer break to make meaningful use of this library, and it is impossible for me to have that amount of time. I think it is a good thing to have a library that can run Python on the GPU with simple syntax. However, if things get complicated, somebody has to use vstacks, hstacks, reordering, multiplications, factorizations, comparisons, etc. Basically, I am trying to understand: is it feasible and doable right now?

Forgive me if I am missing some points. Are these situations universal for any GPU programming framework (CUDA, ROC, OpenGL, etc.)? Would I get these types of errors in CUDA C++ or CUDA Fortran? I am not so experienced with GPU programming; I started to learn the mindset of GPU programming 2 months ago.

TL;DR: I have a numerical code (again, computational mechanics) that must be sped up. I wanted to use numba.cuda, but it does not seem feasible. I would like to use it just like numpy (or cupy) because I am already experienced with Python and the intuition of the language. Do you have any suggestions for me? If you can help, I would be so happy. Any advice on doing some linear algebra in a kernel? I am open to any suggestion because I could not find any source around me or on the internet that can help me better than this Discourse community. Thank you so much.

Did you use parallel=True with the Numba JIT decorator or not? I don’t know how many cores you have but parallel=True is, in my opinion, an easier path to parallelization than CUDA and may be sufficient for your needs.
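
For reference, a minimal sketch of the kind of thing I mean (the function here is purely illustrative, not your code):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def scaled_sum(a, b):
    out = np.empty_like(a)
    # iterations of this prange loop are distributed across the available CPU cores
    for i in prange(a.shape[0]):
        out[i] = 2.0 * a[i] + b[i]
    return out

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(scaled_sum(a, b)[:5])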

Here are some answers to a few of your questions.

All of the numpy, cupy, cupy.cublas and other cupy library functions are implemented as regular (host-side) Python functions. In order for a function to be callable on the GPU in Numba it must be a GPU device function. There is a list of all the supported features in the Numba documentation ("Supported Python features in CUDA Python"). All of the basic mathematical functions have device-equivalent functions. For anything else you would need to implement it yourself. In terms of supported linear algebra functions, CuPy has a number of CUDA-related functions that call into things like the cuBLAS library and other CUDA libraries, but they are geared more towards solving single large problems (i.e. computing one large matrix multiply in parallel using the GPU's threads). It sounds like what you want is for each thread to compute its own linear algebra operation on smaller problems, and there aren't really libraries that allow you to do this, at least not that I know of.
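
As a small illustrative sketch (not taken from your code), the functions in Python's math module map to their device equivalents and can be called directly inside a kernel:

import math
import numpy as np
from numba import cuda

@cuda.jit
def exp_sin(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        # math.exp / math.sin compile to device-side math functions
        out[i] = math.exp(x[i]) * math.sin(x[i])

x = np.linspace(0.0, 1.0, 128).astype(np.float32)
out = np.zeros_like(x)
exp_sin[1, 128](x, out)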

A multiply between a scalar and a local array isn't defined. You'll have to do it the old-fashioned way with a for loop. There are a lot of layers of abstraction that the CPU provides that aren't readily available on the GPU.

I'm not totally sure where the paradox comes from with dynamic parallelism. Dynamic parallelism is a feature where a CUDA kernel can itself launch a CUDA kernel. This isn't supported in Numba. If you want to launch the fast_matmul CUDA kernel you can launch it from the host. This function works similarly to what I was describing earlier with cupy, where a single kernel is launched to solve a single large problem.

Dr. Todd,

Yes, my @njit contains the parallel=True argument, but I also wanted to process, say, every element in a structure in parallel. I aimed to assign a thread to every element and perform calculations for that particular element. For my GPU program to work, every thread had to perform matrix inversion, multiplication, etc. That is where my problem started to get larger and larger. I have no problem with the CPU application. I am open to any suggestions on the GPU application. Thank you for your advice :slight_smile: .

uchytilc,

Yes, my aim was exactly as you stated: every thread has to perform linear algebra operations (a maximum matrix size of 3x3, but many, many threads). It is good to know that a large amount of the work must be done by me; at least I will not keep searching for a library. I thought cupy.cublas would work inside my threads, but I think those are the kind of functions that can be called from the host, not from device code, so I could not use the cupy.cublas functions.

For the second part, I hope that @guvectorize("…", “…”, target=“cuda”) can perform these kinds of functions.
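
Something like the sketch below is what I have in mind (an untested sketch on my part; the batched 3x3 matmul is only an illustration):

import numpy as np
from numba import guvectorize, float32

# one "core" call multiplies a single pair of small matrices; with
# target="cuda" the leading batch dimension is spread over GPU threads
@guvectorize([(float32[:, :], float32[:, :], float32[:, :])],
             '(m,k),(k,n)->(m,n)', target='cuda')
def batched_matmul(A, B, C):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            tot = 0.0
            for k in range(A.shape[1]):
                tot += A[i, k] * B[k, j]
            C[i, j] = tot

A = np.random.rand(1000, 3, 3).astype(np.float32)
B = np.random.rand(1000, 3, 3).astype(np.float32)
C = batched_matmul(A, B)   # shape (1000, 3, 3)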

Okay, maybe it is not a paradox, but my point was that if I cannot call fast_matmul in my kernel, I have to use it on its own. If I am going to use a single standalone fast_matmul function every time, why would I use that function? CuPy already has the cuBLAS library, so I can use that. In summary, if I am only allowed to use it standalone, I would not use a single function written by me; I would rather take one off the shelf. My optimization skills are limited, whereas (I hope) CuPy is written by computer science professionals, and their single function would be much faster than mine. From my perspective, that makes it meaningless to use.

Thank you for your responses :slight_smile:

Hi Moukann,

Thanks for trying out the CUDA target and posting a detailed summary of your experience with it - I’ll try to address each of your questions here - before I do, I’d note that it is expected to be quite a lot of work to build a well-optimized FEM solver on a GPU, particularly if it uses unstructured meshes and/or low-order elements.

A general approach for implementing it would be to try to reuse other libraries that package higher level functionality where possible (e.g. linear solves) and only resort to writing Numba kernels for operations specific to your application (e.g. local / global assembly). Many of the Python libraries that provide CUDA functionality can share their data structures with each other transparently through the CUDA Array Interface (CAI), so combining multiple libraries / abstractions is not usually an issue.

Situation: None of numpy, cupy, cupy.cublas, cupy.[any cuda library] works under the cuda.jit decorator.
Question: How can I use the numba.cuda library efficiently without any basic array creation, array operations, or linear algebra operations?

Usually you would create any large arrays outside kernels and pass them in as parameters - this is in keeping with the CUDA programming model in general. Coming from a CPU-based code, this often requires the code to be restructured to meet this requirement.

Rather than creating Numba device arrays, I’d suggest creating arrays using CuPy - its arrays are much richer in functionality than Numba’s (many more NumPy features are implemented on them) and thanks to the CAI you can pass them directly to Numba kernels.
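
For example, a minimal sketch (the kernel here is purely illustrative) of creating an array with CuPy and handing it straight to a Numba kernel:

import cupy as cp
from numba import cuda

@cuda.jit
def double(x):
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] *= 2.0

x = cp.arange(8, dtype=cp.float32)
double[1, 32](x)   # the CuPy array is accepted via the CUDA Array Interface
print(x)           # still a CuPy array, now doubled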

Question: Should I create my own linear algebra library with matmul, inverse, QR, and LU?

Only if you can’t find a library that already provides them. I believe at least some of those operations listed will be provided by CuPy.

Question: Are there any plans to change this situation and enable the usage of external, well-established libraries (numpy, cupy or any other library) with the numba.cuda library?

I think this is already addressed by the CAI - if there’s a specific library you’d like to use in conjunction with Numba’s CUDA target, please do create a new thread specifying which library / function you’re trying to use, any steps you might have already tried, and I (or someone) will try to help in that specific instance.

Situation: It is really hard to work with arrays on the GPU since it does not allow you to create numpy or cupy arrays in a device function.

There is a learning curve here that takes some getting used to - it is most difficult in the beginning when none of your code follows the pattern of creating arrays outside of kernels, but it does become more manageable as your code evolves towards following the pattern and you become more used to thinking about the separation between memory allocation and computation.

A slightly tangential observation - in general you will always need to optimize the way you allocate memory to get the highest-performing code. By using a GPU, you make this requirement hard and explicit, rather than being able to get things working with potentially less-than-optimal performance, then work on optimizing / restructuring later.

It allows you to create an empty array in local or global memory, but does not allow you to create one with a library. Also, the library claims that the arrays are the same; both are cuda.array-type arrays.

I’m not quite sure I understand this remark, but I hope I can add some clarifying points:

  • Inside kernels, you can allocate local memory (arrays that are private to each thread) and shared memory (arrays that are private to each block, but shared between threads in the block).
  • Outside kernels, you can allocate global memory by creating a device array with Numba's memory management API functions or any other library that supports CAI (a short sketch of all three kinds of allocation follows below).
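
A minimal sketch of all three kinds of allocation (the kernel itself is only illustrative):

import numpy as np
from numba import cuda, float32

@cuda.jit
def kernel(out):
    # local memory: private to each thread
    tmp = cuda.local.array(4, dtype=float32)
    # shared memory: one array per block, shared by the threads in that block
    buf = cuda.shared.array(32, dtype=float32)
    i = cuda.grid(1)
    if i < out.shape[0]:
        tmp[0] = 2.0
        buf[cuda.threadIdx.x] = tmp[0]
        cuda.syncthreads()
        out[i] = buf[cuda.threadIdx.x]

# global memory: allocated outside the kernel and passed in
out = cuda.device_array(64, dtype=np.float32)
kernel[2, 32](out)
print(out.copy_to_host())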

Question: Why is it difficult to get a zeros, ones, or identity array? Again, should I create my own library to create these basic matrices?

In the past I did consider adding many functions to Numba that would allocate and initialize these types of arrays - however, I concluded that this would duplicate the effort of other libraries like CuPy, so instead I recommend using CuPy to create these. I would imagine you can avoid building your own implementation for most of these.
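
For instance, the usual CuPy constructors cover these cases, and the resulting arrays can be passed straight into kernels:

import cupy as cp

z = cp.zeros((4, 4), dtype=cp.float32)      # zeros
o = cp.ones((4, 4), dtype=cp.float32)       # ones
ident = cp.eye(4, dtype=cp.float32)         # identity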

Situation: I tried to multiply a cuda.local.array() by an integer. The console gave me an error saying that I cannot multiply an int64 with a 2D float32 array.
Question: Am I doing something wrong, or should I work like a lone computer scientist designing every single operation from vectorization to array operations?

CUDA kernels are generally written from the perspective of a single thread / element, and there are many array / matrix operations that are unsupported in them - this generally aligns reasonably well with the common case of individual threads operating on individual elements of data - for example, in a basic a * x + y operation:

@cuda.jit
def axpy(r, a, x, y):
    # Determine Thread ID / index into arrays
    i = cuda.grid(1)

    # Ensure we're not going to step off the end of the array
    # if there are more threads than elements
    if i >= len(r):
        return

    # Compute a * x + y elementwise
    r[i] = a * x[i] + y[i]

However, where an individual thread is required to iterate over an array, as in your case, this fits less well - it is on the roadmap to enable more support for array operations in the CUDA target - this needs some work to align the CUDA target more closely with the rest of Numba so that it can reuse the implementations that the CPU target uses. My aim is to make progress here after the upcoming release (0.55, which will be in January).

Should I define my own function that multiplies an array by an integer?

At present you may well need to do that, e.g.:

@cuda.jit(device=True)
def ax(r, a, x):
    for i in range(len(r)):
        r[i] = a * x[i]

for a multiplication across an entire array by a single thread. Note that if you do this, all calls to this function will be inlined into the kernel / device function they are called from, so from a performance perspective it will be as if you wrote the for loop in the caller (i.e. there is no overhead from the function call).
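
As a sketch of how such a device function might be used (the kernel and names here are illustrative, assuming the ax device function above and 2D arrays where each thread handles one row):

from numba import cuda

@cuda.jit
def scale_rows(r, a, x):
    # one thread per row; the call to ax is inlined into this kernel
    i = cuda.grid(1)
    if i < r.shape[0]:
        ax(r[i], a, x[i])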

Situation: I think there is a paradox in the CUDA examples of the Numba documentation. A fast_matmul function is created in Numba which uses threads to multiply two matrices. It is not a device function, it is a kernel. Then, it is said that dynamic parallelism is not supported.

I’m not sure I understand what the paradox is - in order to multiply two matrices using the kernel from the documentation one would enqueue the kernel from host-side code. The example would need some modification to turn it into a device function.

Dynamic parallelism is not yet supported by Numba. I can’t see an issue explicitly requesting it in the Issue Tracker - if this is something that would be useful to you, please do open a feature request for dynamic parallelism, and I can look into the possibility for adding support.

Question: How can I use the fast_matmul function in a different kernel? It is impossible, a contradiction.
Because dynamic parallelism is not supported, I can't use a kernel inside a kernel.

Indeed, the example would need some modification to make it a device function.

I cannot understand the point of this fast_matmul function. If I want to use it, I can't. If I don't use it, numba.cuda is meaningless.

It’s an example intended to show how a CUDA kernel can be written in Python, using concepts that are already familiar to many users - I don’t expect this specific kernel will be of use in many real-world codes.

Question: Why would I use a single fast_matmul function while BLAS is available in cupy.cublas?

I would suggest using CuPy for this purpose instead.

Are there any advantages of using numba.cuda with respect to other libraries, other than running simple arithmetic code in parallel?

The advantage is that you can implement whatever you need inside a CUDA-jitted kernel - it is not trivial to do so, but it provides the ability to write arbitrary operations and the flexibility to modify your implementation towards optimizing performance (for example, editing the code to use shared memory, coalescing, libdevice functions, etc.) - it is not quick / easy and not the first tool you should reach for, but if you need to implement (for example) local assembly of any finite element form, preconditioner, mass lumping scheme, etc. of your own derivation, you can do it in a Numba kernel.

As far as I understand, I would need a whole summer break to make meaningful use of this library, and it is impossible for me to have that amount of time. I think it is a good thing to have a library that can run Python on the GPU with simple syntax. However, if things get complicated, somebody has to use vstacks, hstacks, reordering, multiplications, factorizations, comparisons, etc. Basically, I am trying to understand: is it feasible and doable right now?

To determine how feasible / doable it is right now, I would try to do everything you can with CuPy or any other appropriate library, then see how much you have left that has to run in your own Python functions on the CPU. Then, pick a small set of these functions and aim to translate them into a CUDA kernel - you might find that after doing a couple, a pattern emerges in your code and you can work through the remainder a lot faster.

If you’re happy with sharing some code, I can take a look and suggest some general directions that might make it easier for you to see the right way to go and arrive at a better estimate of the effort.

Forgive me if I am missing some points. Are these situations universal for any GPU programming framework (CUDA, ROC, OpenGL, etc.)? Would I get these types of errors in CUDA C++ or CUDA Fortran? I am not so experienced with GPU programming; I started to learn the mindset of GPU programming 2 months ago.

I think your experience is quite common and you’re encountering issues that most people find difficult in the beginning. The programming model is the same for CUDA C++ and CUDA Fortran, so if Python is the language you’re most comfortable with, then I’d suggest continuing to work in it rather than switching to a lower-level language, which will present additional challenges.

The “mindset” of CUDA programming (and accelerators in general) certainly is difficult compared to working on the CPU, and deciding whether it is the right choice for your application is a tradeoff between performance, effort, complexity of the code, etc.

TL;DR: I have a numerical code (again, computational mechanics) that must be sped up. I wanted to use numba.cuda, but it does not seem feasible. I would like to use it just like numpy (or cupy) because I am already experienced with Python and the intuition of the language. Do you have any suggestions for me? If you can help, I would be so happy. Any advice on doing some linear algebra in a kernel? I am open to any suggestion because I could not find any source around me or on the internet that can help me better than this Discourse community. Thank you so much.

Many thanks for posting all your questions - they are all good questions and I think they’re commonly encountered so it’s good to have your post to be able to answer and serve as a reference for the future. CUDA (and GPU in general) programming is high effort in general, so it depends on your performance requirements whether it is the best path for your application.

I am happy to try to help understand the right way to go, so please do follow up with further questions or specific examples from your code that we might discuss. (Also, if I missed or misunderstood any of your questions above then please do let me know).

In any case - good luck with your implementation work over the summer :slight_smile:

Graham Markall,

Thank you for your extensive response; now I have a much clearer picture in my mind of what I should do.

Since this topic will serve as a future reference for other people, I would like to ask some more questions and touch on new points about CUDA Python. This message will be quite long. The questions are separated by long lines of equals signs.

Q1 start ===================================================================

Yes, but my first aim is to train myself on numba.cuda. My code is not a complex FEM; it was a course project. Because I am familiar with the code, I wanted to transfer it to the GPU. I think it is parallelizable; however, I had many problems, so I created this topic to seek help. After I gain some experience, I will try to implement new projects.

Yes, I pass pre-created large arrays to the kernel; however, I cannot manipulate them inside kernels with library functions. Please see the example below.

import numba.cuda as cuda
import cupy as cp

size = 4   
Am = cp.random.rand(size,size,size).astype(cp.float32)
Bm = cp.random.rand(size,size,size).astype(cp.float32)
Cm = cp.zeros((size,size,size), dtype=cp.float32)

# user defined dot product
@cuda.jit(device=True)
def dot_product(vec1, vec2):
    if vec1.shape[0] == vec2.shape[0]:
        tot = 0.0
        for k in range(vec1.shape[0]):
            tot += vec1[k] * vec2[k]
        return tot
    else:
        raise ArithmeticError("Vector lengths are different.")

# user defined matrix multiplication
@cuda.jit(device=True)
def matrix_multiplication(A, B, C):
    if A.shape[1] == B.shape[0]:
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                C[i,j] = dot_product(A[i,:], B[:,j])
    else:
        raise ArithmeticError("mxk != kxn")



# matmul kernel with user defined functions
@cuda.jit
def kernel_matmul_user(Am, Bm, Cm, size):
    tidx = cuda.grid(1)
    if tidx < size:
        matrix_multiplication(Am[tidx, :, :], Bm[tidx, :, :], Cm[tidx, :, :])


# matmul kernel with cupy defined functions
@cuda.jit
def kernel_matmul_cupy(Am, Bm, Cm, size):
    tidx = cuda.grid(1)
    if tidx < size:
        cp.matmul(Am[tidx, :, :], Bm[tidx, :, :], Cm[tidx, :, :])


# works
kernel_matmul_user[32,128](Am, Bm, Cm, size)

# does not work
kernel_matmul_cupy[32,128](Am, Bm, Cm, size)

This is a very basic example, but the point I am trying to emphasize is: how can I use cupy.matmul inside a kernel? How can I make cupy.matmul work for every thread? We can replace cupy.matmul with cupy.linalg.inv, for example. For every element, I need to invert a small 2x2 matrix. How can someone utilize cupy, cupy.linalg, or even cupy.cublas inside their kernels? As far as I understand, it is not possible right now. If using cupy is not possible inside kernels, I have to write my own linear algebra functions as in the example above (user_dot, user_matmul, etc.). This was the point I wanted to make. Do you have any suggestions?
Q1 end ===================================================================

Q2 start ===================================================================

Again, I will try to make my point with an example: matrix inversion. Consider two situations: computing the matrix inverse with cupy, and with my own functions.

import numba
import numba.cuda as cuda
import cupy as cp

#user defined identity matrix maker for local memory
@cuda.jit(device=True)
def identity(mat0):
    for r in range(mat0.shape[0]):
        for c in range(mat0.shape[1]):
            if r == c:
                mat0[r, c] = 1
            else:
                mat0[r, c] = 0  
    return mat0


# user defined matrix copy
@cuda.jit(device=True)
def copy_matrix(mat0, mat1):
    for r in range(mat0.shape[0]):
        for c in range(mat0.shape[1]):
            mat1[r,c] = mat0[r,c]    

    return mat1

# user defined matrix inversion
@cuda.jit(device=True)
def invert_matrix(AM, IM):
    for fd in range(AM.shape[0]): 
        fdScaler = 1.0 / AM[fd,fd]

        for j in range(AM.shape[0]): 
            AM[fd,j] *= fdScaler
            IM[fd,j] *= fdScaler
        
        for i in range(AM.shape[0]):
            if fd != i:
                crScaler = AM[i,fd] 
                for j in range(AM.shape[0]): 
                    AM[i,j] = AM[i,j] - crScaler * AM[fd,j]
                    IM[i,j] = IM[i,j] - crScaler * IM[fd,j]


@cuda.jit
def kernel_invert(Am, inv, size):
    tidx = cuda.grid(1)
    if tidx < size:
        # notice I had to pass the literal 12 because local arrays cannot be sized with a variable
        temp_identity = identity(cuda.local.array((12,12), numba.float32))
        temp_matrix   = cuda.local.array((12,12), numba.float32)
        
        # I have to copy the original matrix before inverting, because the classical algorithm
        # reduces the original matrix to the identity while turning the identity matrix into the inverse.
        copy_matrix(Am[tidx, :, :], temp_matrix)
        
        # my inversion function
        invert_matrix(temp_matrix, temp_identity)
        
        # I have to use another function to copy result because inv[tidx, :, :] = temp_identity is not supported
        copy_matrix(temp_identity, inv[tidx, :, :])
        
        # synchronize the threads within the block
        cuda.syncthreads()
 
# inversion with cupy (host-side loop)
def cupy_invert(Am, inv_out, size):
    for i in range(size):
        inv_out[i, :, :] = cp.linalg.inv(Am[i, :, :])
 
        
size = 12  
Am = cp.random.rand(size,size,size).astype(cp.float32)
cupy_inverted = cp.zeros((size,size,size), dtype=cp.float32)
user_inverted = cp.zeros((size,size,size), dtype=cp.float32)


#execute functions
kernel_invert[32,128](Am, user_inverted, size)
cupy_invert(Am, cupy_inverted, size)

# check whether functions inverted the matrices correctly. We should get identity matrices back. (and yes they are correct, you can check it on your computer.)
for i in range(size):
    print("check cupy inversion")
    print(cp.round(cupy_inverted[i, :, :] @ Am[i, :, :], 3))
    print("=======")
    print("check user inversion")
    print(cp.round(user_inverted[i, :, :] @ Am[i, :, :], 3))
    print("-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=")

So after a little work, I managed to have a function that works inside kernels to invert matrices. This function was originally created by https://github.com/ThomIves; I converted it to work on GPUs. The point I want to make in this situation is that I need to pass AM and IM to the invert_matrix(AM, IM) function to invert matrices. Is it possible to create an identity matrix inside the invert_matrix function? It could be a local array, device array, temporary array, or some other array. If I were allowed to create matrices inside device functions, I could remove the IM variable from the function parameters and create it inside the function. This would reduce the number of lines of code and make the code more beautiful. If you think about it, it is not natural to have to pass an identity matrix in order to invert; the user should be able to pass an array and just get the inverse as the result. I am not criticizing; I am trying to understand and help the community. I think this is an important point. Do you have any suggestions?

The other point about this inversion function is that I had to write

temp_identity = identity(cuda.local.array((12,12), numba.float32))
temp_matrix = cuda.local.array((12,12), numba.float32)

Ideally, I should have been able to create these arrays as

temp_identity = identity(cuda.local.array((size,size), numba.float32))
temp_matrix = cuda.local.array((size,size), numba.float32)

However, this is not allowed. I think for practicality and universal code (for different sizes, adaptive code), this is something that is needed. Am I doing something wrong? Is it possible to create local arrays with variables? Okay, it works with size 12. What if I change the size to 24? I would have to manually edit my kernel. I hope I am expressing myself successfully.
Q2 end ===================================================================

Q3 start ===================================================================

Yes, just to make it clear, this is not allowed:

@cuda.jit
def some_kernel():
     a = cp.zeros((10, 10))

This is something allowed:

@cuda.jit
def some_kernel():
     a = cuda.local.array((10, 10), numba.float32)
     # user-defined zeros function (nested for loops)
     zeros(a)

Just to make clear, this is something I should do (for practicality):

a = cp.zeros((10, 10))

@cuda.jit
def some_kernel(a):
    ...

This is something I should not do (for practicality):

a = cuda.device_array((10, 10))

@cuda.jit
def some_kernel(a):
    ...

Q3 end ===================================================================

Q4 start ===================================================================

A clear explanation, no need for another question. Thank you.
Q4 end ===================================================================

Q5 start ===================================================================

My answer to this "paradox" is below.

If I come up with a feature that could be added, I will request it on GitHub, thank you. Being able to launch kernels from inside kernels would be good for CUDA Python.
Q5 end ===================================================================

Q6 start ===================================================================

Thank you for these examples. It is good to hear from a professional that this is possible.
Q6 end ===================================================================

Q7 start ===================================================================

I should discuss it with my academic advisor; not everything is up to me :slight_smile:. But if I am allowed, I can ask about specific points of numba.cuda.
Q7 end ===================================================================

Q8 start ===================================================================

It looks like I will bother you many, many times :slight_smile:. Thank you for all your responses. I had many questions in this post; maybe I should ask some more after you answer these. Again, these answers are incredibly valuable to me. They can also help another person seeking help. Thank you for clarifying some points in my mind.
Q8 end ===================================================================

Hi Moukann,

Apologies for the delay in my reply - it’s taken a while to get some free time to concentrate on producing some proper (and hopefully helpful!) responses.

Q1: Linear algebra in kernels

This is a very basic example, but the point I am trying to emphasize is: how can I use cupy.matmul inside a kernel? How can I make cupy.matmul work for every thread? We can replace cupy.matmul with cupy.linalg.inv, for example. For every element, I need to invert a small 2x2 matrix. How can someone utilize cupy, cupy.linalg, or even cupy.cublas inside their kernels? As far as I understand, it is not possible right now. If using cupy is not possible inside kernels, I have to write my own linear algebra functions as in the example above (user_dot, user_matmul, etc.). This was the point I wanted to make. Do you have any suggestions?

Your observations here are correct: it isn't possible to use the CuPy functions, and at present you need to write your own linear algebra functions for each thread to do e.g. a 2x2 matrix inversion. I have been looking for a good CUDA C/C++ library that does this and that I can integrate into Numba (we can provide support for calling C/C++ functions from Python kernels transparently), but I haven't found anything suitable yet. Once I come across a suitable library it should be possible to enable what you want. I previously looked at CUTLASS, but I think it doesn't quite fit the kinds of operations we want to do here (it is mainly focused on launching standalone linear algebra kernels).

Do you know of a library that fits your use case well that’s written in CUDA C/C++? If so, we could try and find a way to use it from the Python kernels.

Q2 - Allocation of local arrays

Is it possible to create an identity matrix inside the invert_matrix function?

I think so, but it’s taking me too long to correctly modify your code so I’m going to leave this for now and come back to it in a subsequent post (put it down to the end of the year :slightly_smiling_face:)

I think for practicality and universal code (for different sizes, adaptive code), this is something that is needed. Am I doing something wrong? Is it possible to create local arrays with variables? Okay, it works with size 12. What if I change the size to 24? I would have to manually edit my kernel. I hope I am expressing myself successfully.

You can’t yet dynamically allocate local memory (I have a half-finished patch somewhere that supports this). However, you can use a closure to generate kernels with the required local memory size programmatically, e.g.:

from numba import cuda
import numpy as np


def factory(size):
    @cuda.jit('void(float32[:])')
    def f(x):
        # `size` is captured from the closure and becomes a compile-time constant
        lx = cuda.local.array(size, dtype=np.float32)

        # ...

    return f

# kernel_2 has a local array of size 2
kernel_2 = factory(2)

# kernel_10 has a local array of size 10
kernel_10 = factory(10)

Q3 - Use of CuPy arrays or Numba device arrays

That is correct.

That is also correct.

If you have CuPy available in your setup, I would always prefer it.

If CuPy is not available, then you would have to do this. If CuPy is available, then I probably wouldn’t bother doing this. Note that if you’re working with another library that uses Numba CUDA and it happened to give you Numba device arrays, you could still call CuPy functions on them because of the CUDA Array Interface. e.g.

import cupy as cp
from numba import cuda

# Assume `x` came from some third-party library
x = cuda.device_array(10)
for i in range(len(x)):
    x[i] = i - 5

print(cp.abs(x))

works and gives

[5. 4. 3. 2. 1. 0. 1. 2. 3. 4.]

as expected.

Q4 - (skipped, no further question)

Q5 - Dynamic parallelism

I just checked on the issue tracker and there’s presently no feature request for dynamic parallelism, so please feel free to open a request there. If you do so, sharing a motivating code / example, and a short explanation of how you think the API should look would be really helpful in moving forward with an implementation.

Q6-7 - (skipped, no further questions)

Q8

It looks like I will bother you many, many times :slight_smile:.

You’re welcome to do so :slightly_smiling_face: - I’ll respond as soon as I can, but feel free to ping posts if I don’t get back to you very quickly.

Thank you for all your responses. I had many questions in this post; maybe I should ask some more after you answer these. Again, these answers are incredibly valuable to me. They can also help another person seeking help. Thank you for clarifying some points in my mind.

Thank you for the questions! I have no doubt that the questions and the answers will be a useful reference for others, so thank you for keeping the discussion going!

Happy New Year, and best of luck in your endeavours for 2022!