Calling a function from a shared object from within guvectorized code works, but induces extra slowness

This is how the function from the shared object (a C library) is set up:

from numba import guvectorize, float64, float32, int32
from numba.core import types, typing
from llvmlite import binding

binding.load_library_permanently("../../fitting_library.so")

return_type = types.int32
int_ty = types.CPointer(int32)
float_ty = types.CPointer(float32)
ret_and_arg_sig = typing.signature(return_type, float_ty, int_ty, int_ty,
                                   types.int32, float_ty, types.int32, float_ty,
                                   float_ty, float_ty, float_ty)
fit_gauss = types.ExternalFunction("fit_gauss", ret_and_arg_sig)


@guvectorize(...)
def function_that_does_a_whole_lot(...):
    ...
    fit_gauss(first_ndarray.ctypes, second_ndarray.ctypes, ..., some_integer, ...)

This works and gives correct results.

However, it is much slower than expected. An individual call to fit_gauss takes 17 microseconds; multiplied by the number of times it is called, that comes to just under 3 s.
Yet the call to function_that_does_a_whole_lot now takes 19 s longer than it does without the fit_gauss call.
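
To make the size of the gap concrete, here is a small back-of-the-envelope calculation using only the numbers quoted above. Since the expected total is "just under" 3 s, the call count is an upper bound and the variable names are purely illustrative.

```python
# Figures taken from the measurements described above.
per_call_s = 17e-6        # isolated cost of one fit_gauss call
expected_total_s = 3.0    # upper bound: the expected total is "just under 3 s"
observed_extra_s = 19.0   # extra wall time actually observed

# Implied number of calls (upper bound, because 3 s is itself an upper bound).
n_calls = expected_total_s / per_call_s

# Effective cost per call once fit_gauss runs inside the guvectorized function.
effective_per_call_s = observed_extra_s / n_calls

# How much slower each call is in situ compared with the isolated measurement.
slowdown = observed_extra_s / expected_total_s

print(f"calls (upper bound): {n_calls:,.0f}")
print(f"effective per-call cost: {effective_per_call_s * 1e6:.1f} us")
print(f"slowdown factor: {slowdown:.1f}x")
```

In other words, each call appears to cost roughly 108 microseconds in situ instead of 17, a slowdown of a bit more than six times.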

Does the call to fit_gauss somehow break the vectorization?

Would it make a difference if fit_gauss were registered as a first-class function (via WAP, the Wrapper Address Protocol) and passed as an argument to function_that_does_a_whole_lot?

I am not sure that is possible, though: can a guvectorize-decorated function handle anything other than numbers, so that a function could be provided as an argument?

I designed a test in which I compiled a shared object with a function fit_gauss that does nothing except return 0 immediately.
In that case there is no measurable delay, i.e. no extra slowness.
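
The stub-versus-real comparison can be sketched in plain Python as follows. This is only an illustration of the measurement pattern, not the actual test: time_per_call, noop_stub and busy_stand_in are hypothetical names, and the two functions merely stand in for the stub and the real fit_gauss. Timing a do-nothing function gives the cost of the call mechanism itself, and the difference from the working function is the cost of the computation.

```python
import time


def time_per_call(func, n_calls=20_000, repeats=3):
    """Return the best-of-`repeats` mean wall time per call, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(n_calls):
            func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / n_calls)
    return best


def noop_stub():
    # Hypothetical stand-in for the stub fit_gauss that returns 0 immediately.
    return 0


def busy_stand_in():
    # Hypothetical stand-in for the real fit_gauss: a little floating-point work.
    total = 0.0
    for i in range(50):
        total += (i * 0.5) ** 2
    return total


stub_cost = time_per_call(noop_stub)
work_cost = time_per_call(busy_stand_in)

print(f"call overhead only: {stub_cost * 1e6:.3f} us per call")
print(f"call plus work:     {work_cost * 1e6:.3f} us per call")
print(f"attributable to the work: {(work_cost - stub_cost) * 1e6:.3f} us per call")
```

Running the same harness on the real function both in isolation and from inside the surrounding loop would show whether the per-call cost changes with context.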

I tend to conclude that the slowness is not caused by a design flaw in how Numba’s guvectorize calls a function from an external library, but rather that the extra computation in fit_gauss pushes the CPU load past some hardware-specific limit.