Extending LAPACK functions into numba

Hello,

I was hoping to call scipy.linalg.ordqz inside a Numba njit function. Under the hood it calls two LAPACK routines, and I found an issue on GitHub (Issue 5301, sorry I’m not allowed to include links) showing how to do this, so I was initially optimistic. The first routine I tried to implement was zgges, the complex generalized Schur decomposition. I’m on Windows, Numba 0.55.1, Python 3.10.4. The Windows part turns out to be relevant, quelle surprise!

Anyway, here is my attempt to implement the function:

from numba import njit
import numpy as np
from numba.extending import get_cython_function_address
import ctypes

# Datatype pointers to give to the cython LAPACK functions
_PTR = ctypes.POINTER

_dbl = ctypes.c_double
_int = ctypes.c_int

_ptr_dbl = _PTR(_dbl)
_ptr_int = _PTR(_int)

# zgges is the complex QZ-decomposition
zgges_addr = get_cython_function_address('scipy.linalg.cython_lapack', 'zgges')
zgges_functype = ctypes.CFUNCTYPE(None,
                                  _ptr_int,  # JOBVSL
                                  _ptr_int,  # JOBVSR
                                  _ptr_int,  # SORT
                                  _ptr_int,  # SELCTG
                                  _ptr_int,  # N
                                  _ptr_dbl,  # A, complex
                                  _ptr_int,  # LDA
                                  _ptr_dbl,  # B, complex
                                  _ptr_int,  # LDB
                                  _ptr_int,  # SDIM
                                  _ptr_dbl,  # ALPHA, complex
                                  _ptr_dbl,  # BETA, complex
                                  _ptr_dbl,  # VSL, complex
                                  _ptr_int,  # LDVSL
                                  _ptr_dbl,  # VSR, complex
                                  _ptr_int,  # LDVSR
                                  _ptr_dbl,  # WORK, complex
                                  _ptr_int,  # LWORK
                                  _ptr_dbl,  # RWORK
                                  _ptr_int,  # BWORK
                                  _ptr_int)  # INFO
zgges_fn = zgges_functype(zgges_addr)

@njit
def numba_zgges(x, y):
    _M, _N = x.shape

    A = x
    B = y

    JOBVSL = np.array([ord('V')], dtype=np.int32)
    JOBVSR = np.array([ord('V')], dtype=np.int32)
    SORT = np.array([ord('N')], dtype=np.int32)
    SELCTG = np.empty(1, dtype=np.int32)

    N = np.array(_N, dtype=np.int32)
    LDA = np.array(_N, dtype=np.int32)
    LDB = np.array(_N, dtype=np.int32)
    SDIM = np.array(0, dtype=np.int32) # out

    ALPHA = np.empty(_N, dtype=np.complex128) # out
    BETA = np.empty(_N, dtype=np.complex128) # out
    LDVSL = np.array(_N, dtype=np.int32)
    VSL = np.empty((_N, _N), dtype=np.complex128) # out
    LDVSR = np.array(_N, dtype=np.int32)
    VSR = np.empty((_N, _N), dtype=np.complex128) # out

    WORK = np.empty((1,), dtype=np.complex128) #out
    LWORK = np.array(-1, dtype=np.int32)
    RWORK = np.empty(_N, dtype=np.float64)
    BWORK = np.empty(_N, dtype=np.int32)
    INFO = np.empty(1, dtype=np.int32)

    zgges_fn(JOBVSL.ctypes,
             JOBVSR.ctypes,
             SORT.ctypes,
             SELCTG.ctypes,
             N.ctypes,
             A.view(np.float64).ctypes,
             LDA.ctypes,
             B.view(np.float64).ctypes,
             LDB.ctypes,
             SDIM.ctypes,
             ALPHA.view(np.float64).ctypes,
             BETA.view(np.float64).ctypes,
             VSL.view(np.float64).ctypes,
             LDVSL.ctypes,
             VSR.view(np.float64).ctypes,
             LDVSR.ctypes,
             WORK.view(np.float64).ctypes,
             LWORK.ctypes,
             RWORK.ctypes,
             BWORK.ctypes,
             INFO.ctypes)
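    # The call above with LWORK = -1 is a workspace-size query: LAPACK only
    # writes the optimal (complex) workspace size into WORK[0]; the actual
    # decomposition happens in the second call below.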

    print("Calculated workspace size as", WORK[0])
    WS_SIZE = np.int32(WORK[0].real)
    LWORK = np.array(WS_SIZE, np.int32)
    WORK = np.empty(WS_SIZE, dtype=np.complex128)
    zgges_fn(JOBVSL.ctypes,
             JOBVSR.ctypes,
             SORT.ctypes,
             SELCTG.ctypes,
             N.ctypes,
             A.view(np.float64).ctypes,
             LDA.ctypes,
             B.view(np.float64).ctypes,
             LDB.ctypes,
             SDIM.ctypes,
             ALPHA.view(np.float64).ctypes,
             BETA.view(np.float64).ctypes,
             VSL.view(np.float64).ctypes,
             LDVSL.ctypes,
             VSR.view(np.float64).ctypes,
             LDVSR.ctypes,
             WORK.view(np.float64).ctypes,
             LWORK.ctypes,
             RWORK.ctypes,
             BWORK.ctypes,
             INFO.ctypes)

    # The LAPACK function also returns SDIM, WORK, BWORK, but I don't need them here.
    return A, B, ALPHA, BETA, VSL.T, VSR.T, INFO

This function actually works great on small matrices. Here I reproduce the relevant test case from scipy:

    n = 5
    A = np.random.random([n, n])
    B = np.random.random([n, n])
    AA, BB, a, b, Q, Z, info = numba_zgges(np.asfortranarray(A, dtype='D'), 
                                           np.asfortranarray(B, dtype='D'))

    aa = Q @ AA @ Z.conj().T
    assert np.allclose(aa.real, A)
    assert np.allclose(aa.imag, 0)
    bb = Q @ BB @ Z.conj().T
    assert np.allclose(bb.real, B)
    assert np.allclose(bb.imag, 0)
    assert np.allclose(Q @ Q.conj().T, np.eye(n))
    assert np.allclose(Z @ Z.conj().T, np.eye(n))

    assert np.all(np.diag(BB) >= 0)

When n <= 5 all the tests pass, but for larger matrices I get a memory heap corruption error, code 0xc0000374. I also get this error when I loop the test over and over, or when I try to pass the results of the function on to another jitted function.

I did find another Stack Overflow question (again, apologies because I can’t add links; here’s the partial address – 64957662/cpython-memory-heap-corruption-issue) that seemed similar, involving memory heap corruption when repeatedly calling a C function. In that case it was evidently related to de-referencing, a word which I repeat here as if I had any idea what it means.

On the other hand, the example linked on the Numba GitHub that implements zgees happily loops for arbitrarily large matrices – I tested this as well just to be sure it wasn’t a more general problem. In addition, there is a note in the SciPy code about zgges being janky on Windows. So it might be something idiosyncratic to this function?

Anyway, I’m not sure this is Numba-related per se, but since I got the example from the Numba GitHub, here I am. I’m hoping someone more experienced with these sorts of Cython issues can spot something obvious that I missed.

Cheers.

It looks like the integer datatype is wrong. I guess it should be a 64-bit integer. At least if I change everything to 64-bit integers, it works.

Hi, thanks for taking the time to look at this!

Are you on Windows? Can you run the test in a loop and see if it gives you a memory heap corruption fault?

for _ in range(1000):
    n = 10
    A = np.random.random([n, n])
    B = np.random.random([n, n])
    AA, BB, a, b, Q, Z, info = numba_zgges(np.asfortranarray(A, dtype='D'), 
                                           np.asfortranarray(B, dtype='D'))

    aa = Q @ AA @ Z.conj().T
    assert np.allclose(aa.real, A)
    assert np.allclose(aa.imag, 0)
    bb = Q @ BB @ Z.conj().T
    assert np.allclose(bb.real, B)
    assert np.allclose(bb.imag, 0)
    assert np.allclose(Q @ Q.conj().T, np.eye(n))
    assert np.allclose(Z @ Z.conj().T, np.eye(n))

    assert np.all(np.diag(BB) >= 0)

Yes, I tried it on Windows (including your example). I am not sure how to determine the right integer size programmatically.
I also assumed int64 there: https://github.com/numba/numba/pull/7825#discussion_r807786767
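
One possible heuristic (no guarantee, just a hint): look at the build configuration that NumPy and SciPy report. ILP64 (64-bit integer) BLAS/LAPACK builds usually advertise it in the library names or symbol suffixes.

import numpy as np
import scipy

# Heuristic only: an ILP64 BLAS/LAPACK usually shows up here via the library
# name or a "64" symbol suffix; a 32-bit-integer build is the common default.
np.show_config()
scipy.show_config()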

BTW: if your function needs to be as fast as possible, it may make sense to use stack allocation for small temporary data. Allocating small arrays (often of size 1) on the heap just to pass something by reference is quite slow.

from numba import njit
import numpy as np
from numba.extending import get_cython_function_address
import ctypes

# Datatype pointers to give to the cython LAPACK functions
_PTR = ctypes.POINTER

_dbl = ctypes.c_double
_int = ctypes.c_int64

_ptr_dbl = _PTR(_dbl)
_ptr_int = _PTR(_int)

# zgges is the complex QZ-decomposition
zgges_addr = get_cython_function_address('scipy.linalg.cython_lapack', 'zgges')
zgges_functype = ctypes.CFUNCTYPE(None,
                                  _ptr_int,  # JOBVSL
                                  _ptr_int,  # JOBVSR
                                  _ptr_int,  # SORT
                                  _ptr_int,  # SELCTG
                                  _ptr_int,  # N
                                  _ptr_dbl,  # A, complex
                                  _ptr_int,  # LDA
                                  _ptr_dbl,  # B, complex
                                  _ptr_int,  # LDB
                                  _ptr_int,  # SDIM
                                  _ptr_dbl,  # ALPHA, complex
                                  _ptr_dbl,  # BETA, complex
                                  _ptr_dbl,  # VSL, complex
                                  _ptr_int,  # LDVSL
                                  _ptr_dbl,  # VSR, complex
                                  _ptr_int,  # LDVSR
                                  _ptr_dbl,  # WORK, complex
                                  _ptr_int,  # LWORK
                                  _ptr_dbl,  # RWORK
                                  _ptr_int,  # BWORK
                                  _ptr_int)  # INFO
zgges_fn = zgges_functype(zgges_addr)

@njit
def numba_zgges(x, y):
    _M, _N = x.shape

    A = x
    B = y

    JOBVSL = np.array([ord('V')], dtype=np.int64)
    JOBVSR = np.array([ord('V')], dtype=np.int64)
    SORT = np.array([ord('N')], dtype=np.int64)
    SELCTG = np.empty(1, dtype=np.int64)

    N = np.array(_N, dtype=np.int64)
    LDA = np.array(_N, dtype=np.int64)
    LDB = np.array(_N, dtype=np.int64)
    SDIM = np.array(0, dtype=np.int64) # out

    ALPHA = np.empty(_N, dtype=np.complex128) # out
    BETA = np.empty(_N, dtype=np.complex128) # out
    LDVSL = np.array(_N, dtype=np.int64)
    VSL = np.empty((_N, _N), dtype=np.complex128) # out
    LDVSR = np.array(_N, dtype=np.int64)
    VSR = np.empty((_N, _N), dtype=np.complex128) # out

    WORK = np.empty((1,), dtype=np.complex128) #out
    LWORK = np.array(-1, dtype=np.int64)
    RWORK = np.empty(8*_N, dtype=np.float64)
    BWORK = np.empty(_N, dtype=np.int64)
    INFO = np.empty(1, dtype=np.int64)

    zgges_fn(JOBVSL.ctypes,
             JOBVSR.ctypes,
             SORT.ctypes,
             SELCTG.ctypes,
             N.ctypes,
             A.view(np.float64).ctypes,
             LDA.ctypes,
             B.view(np.float64).ctypes,
             LDB.ctypes,
             SDIM.ctypes,
             ALPHA.view(np.float64).ctypes,
             BETA.view(np.float64).ctypes,
             VSL.view(np.float64).ctypes,
             LDVSL.ctypes,
             VSR.view(np.float64).ctypes,
             LDVSR.ctypes,
             WORK.view(np.float64).ctypes,
             LWORK.ctypes,
             RWORK.ctypes,
             BWORK.ctypes,
             INFO.ctypes)

    print("Calculated workspace size as", WORK[0])
    WS_SIZE = np.int64(WORK[0].real)
    LWORK = np.array(WS_SIZE, np.int64)
    WORK = np.empty(WS_SIZE, dtype=np.complex128)
    zgges_fn(JOBVSL.ctypes,
             JOBVSR.ctypes,
             SORT.ctypes,
             SELCTG.ctypes,
             N.ctypes,
             A.view(np.float64).ctypes,
             LDA.ctypes,
             B.view(np.float64).ctypes,
             LDB.ctypes,
             SDIM.ctypes,
             ALPHA.view(np.float64).ctypes,
             BETA.view(np.float64).ctypes,
             VSL.view(np.float64).ctypes,
             LDVSL.ctypes,
             VSR.view(np.float64).ctypes,
             LDVSR.ctypes,
             WORK.view(np.float64).ctypes,
             LWORK.ctypes,
             RWORK.ctypes,
             BWORK.ctypes,
             INFO.ctypes)

    # The LAPACK function also returns SDIM, WORK, BWORK, but I don't need them here.
    return A, B, ALPHA, BETA, VSL.T, VSR.T, INFO

for _ in range(1000):
    n = 10
    A = np.random.random([n, n])
    B = np.random.random([n, n])
    AA, BB, a, b, Q, Z, info = numba_zgges(np.asfortranarray(A, dtype='D'), 
                                           np.asfortranarray(B, dtype='D'))

    aa = Q @ AA @ Z.conj().T
    assert np.allclose(aa.real, A)
    assert np.allclose(aa.imag, 0)
    bb = Q @ BB @ Z.conj().T
    assert np.allclose(bb.real, B)
    assert np.allclose(bb.imag, 0)
    assert np.allclose(Q @ Q.conj().T, np.eye(n))
    assert np.allclose(Z @ Z.conj().T, np.eye(n))

    assert np.all(np.diag(BB) >= 0)

Interesting, my kernel still crashes on the looping test after making the int64 switch. Making that switch does let me set n very large without any problems, though. Any guesses why we would be getting different results? If it matters, I’m using g++ 5.3.0 (installed via m2w64-toolchain from conda-forge) as a compiler.

Your code looks super interesting; it seems like it would be relevant in my case, i.e. calling the Fortran code directly without going through Cython?

I would like my function to be as fast as possible (this is all going into an MCMC sampler, so it will be called a lot), but I’m afraid I don’t know what a stack-allocated array is. A quick search turned up some C++ resources. I assume, then, that this isn’t something I would be doing at the Python level? Can you point me to some references I could look at?

Thanks again for taking the time to look at this.

The first try was just coincidence.
Having a look at the LAPACK documentation, I found out that RWORK was too small. Now it works. I am using SciPy from Anaconda, but SciPy’s compilation settings may make a difference here. Also, some other values may have the wrong type (char -> np.uint8), but it looks like this doesn’t matter here.
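
For reference, the fix is the single RWORK line in the workspace setup (already reflected in the corrected listing above): the zgges documentation specifies RWORK with dimension (8*N).

RWORK = np.empty(8 * _N, dtype=np.float64)  # zgges needs RWORK of dimension (8*N); the first listing only allocated _N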

Yes, you can call functions from a shared object (or DLL) without Cython, using ctypes, as long as you know the inputs and outputs. It doesn’t matter where you get the function pointer from. With quite a bit of additional effort, F2PY could be extended, or parts of it reused, to (semi-)automatically generate Numba wrappers for many functions for which a Python interface generated with F2PY already exists.
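
Something along these lines, for example (the library path and the exact Fortran-mangled symbol name are just placeholders and depend on the build; the point is only that any function pointer can be wrapped with the same CFUNCTYPE as above):

import ctypes

# Placeholder path and symbol name; a real build may expose zgges, zgges_ or ZGGES.
lib = ctypes.CDLL("path/to/your/lapack_library.dll")     # or .so on Linux
addr = ctypes.cast(lib.zgges_, ctypes.c_void_p).value    # raw function address
zgges_fn_direct = zgges_functype(addr)                   # reuse zgges_functype from earlier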

In C you would simply pass an integer by reference instead of allocating an array (calloc()/free()) just to get a pointer to a value. This is not only more convenient, but also faster. With some intrinsics you can do the same in Numba.

If the called function has a very short runtime, the overhead of array allocation could be as high as, or higher than, the runtime of the called function itself. Example:
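
Roughly, the pattern is two tiny intrinsics that stack-allocate a scalar and read it back (this is just a sketch of the idea, not the linked example itself), so values like N, LWORK, SDIM or INFO can be passed by pointer without going through np.empty:

from numba import njit, types
from numba.extending import intrinsic
from numba.core import cgutils
import numpy as np

@intrinsic
def val_to_ptr(typingctx, data):
    # Stack-allocate a single value and return a pointer to it
    def impl(context, builder, signature, args):
        return cgutils.alloca_once_value(builder, args[0])
    sig = types.CPointer(data)(data)
    return sig, impl

@intrinsic
def ptr_to_val(typingctx, data):
    # Load the value behind a pointer produced by val_to_ptr
    def impl(context, builder, signature, args):
        return builder.load(args[0])
    sig = data.dtype(types.CPointer(data.dtype))
    return sig, impl

@njit
def demo():
    # e.g. instead of INFO = np.empty(1, dtype=np.int64) and passing INFO.ctypes,
    # keep the scalar on the stack and pass info_ptr to the LAPACK call
    info_ptr = val_to_ptr(np.int64(0))
    return ptr_to_val(info_ptr)

The same pattern applies to the other scalar arguments (N, LDA, LDB, LWORK and so on); only the genuinely array-sized buffers like WORK, RWORK, VSL and VSR need real allocations.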


The change to RWORK did it. I stared at the LAPACK documentation for hours; I’m very embarrassed I missed the note about the size of RWORK. Thank you again for everything. I will also work on reducing the array allocation overhead as you suggest. Cheers!