Numba + np.linalg.eigvalsh + parallel=True shows worse performance

Problem Statement

To solve an eigenvalue problem per frequency point, I tried using Numba's parallel mode to accelerate np.linalg.eigvalsh, but I found that Numba does not provide much of a speedup.

For example:
n_freq=2048, n_port=50 gives the following output:

Normal Calculation Time: 0.6144 sec
Numba Calculation Time: 0.6714 sec
Numba Parallel Calculation Time: 1.0486 sec
(Normal Calculation Time) / (Numba Calculation Time): 0.92
(Normal Calculation Time) / (Numba Parallel Calculation Time): 0.5

n_freq=256, n_port=400 gives the following output:

Normal Calculation Time: 22.1956 sec
Numba Calculation Time: 15.4975 sec
Numba Parallel Calculation Time: 94.7215 sec
(Normal Calculation Time) / (Numba Calculation Time): 1.43
(Normal Calculation Time) / (Numba Parallel Calculation Time): 0.23

Does anyone know why the NumPy linear algebra routine does not get a boost from Numba here?

Test Code

import time
import numpy as np
import numba as nb

import warnings
warnings.filterwarnings("ignore")


@nb.njit([nb.float64[:, :](nb.complex128[:, :, :]), nb.float64[:, :](nb.complex64[:, :, :]),],  parallel=True)
def calculate_passivity_matrix_njit_parallel(s_parameter: np.ndarray) -> np.ndarray:
    
    # One Gram-matrix eigenvalue problem per frequency point; prange spreads the
    # frequency loop over Numba's threads while LAPACK does the eigendecomposition.
    passivity_array = np.zeros(s_parameter.shape[:2], dtype=np.float64)
    for i in nb.prange(s_parameter.shape[0]):
        s = s_parameter[i]
        gram_matrix = s @ s.conj().T
        passivity_array[i] = np.linalg.eigvalsh(gram_matrix).real

    return passivity_array

@nb.njit([nb.float64[:, :](nb.complex128[:, :, :]), nb.float64[:, :](nb.complex64[:, :, :]),],  )
def calculate_passivity_matrix_njit(s_parameter: np.ndarray) -> np.ndarray:
    
    # Same loop, compiled with Numba but without explicit parallelism.
    passivity_array = np.zeros(s_parameter.shape[:2], dtype=np.float64)
    for i in range(s_parameter.shape[0]):
        s = s_parameter[i]
        gram_matrix = s @ s.conj().T
        passivity_array[i] = np.linalg.eigvalsh(gram_matrix).real

    return passivity_array

def calculate_passivity_matrix(s_parameter: np.ndarray) -> np.ndarray:
    
    # Pure-NumPy reference implementation.
    passivity_array = np.zeros(s_parameter.shape[:2], dtype=np.float64)
    for i in range(s_parameter.shape[0]):
        s = s_parameter[i]
        gram_matrix = s @ s.conj().T
        passivity_array[i] = np.linalg.eigvalsh(gram_matrix).real

    return passivity_array

# Testing

n_freq = 256
n_port = 400
test_matrix = np.random.rand(n_freq, n_port, n_port) + 1j * np.random.rand(n_freq, n_port, n_port)
test_matrix = (test_matrix + test_matrix.conj().transpose(0, 2, 1)) / 2  # make each frequency slice Hermitian


#* Normal Calculation
start_time = time.time()
calculate_passivity_matrix(test_matrix)
normal_time = time.time() - start_time
print(f"Normal Calculation Time: {normal_time:.4f} sec")

#* Numba Calculation (the explicit signatures trigger eager compilation, so compile time is not measured here)
start_time = time.time()
calculate_passivity_matrix_njit(test_matrix)
numba_time = time.time() - start_time
print(f"Numba Calculation Time: {numba_time:.4f} sec")

#* Numba Parallel Calculation
start_time = time.time()
calculate_passivity_matrix_njit_parallel(test_matrix)
numba_parallel_time = time.time() - start_time
print(f"Numba Calculation with parallel Time: {numba_parallel_time:.4f} sec")

#* Show consuming time ratio
print(f"(Numba Calculation Time) / (Normal Calculation Time): {normal_time/numba_time:.2f}")
print(f"(Numba Calculation with parallel Time) / (Normal Calculation Time): {normal_time/numba_parallel_time:.2f}")

Hey @SHF101202021,
Numba’s and NumPy’s np.linalg.eigvalsh implementations both call into LAPACK, which is already optimized and, depending on the build, multithreaded.
Which path is faster therefore comes down to how much overhead each adds around the external call.
Numba’s explicit parallelism adds extra overhead here, and the prange threads can also compete with LAPACK’s own threads for the same cores, which is why the parallel version can end up even slower.
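
If you want to keep the explicit prange loop, one experiment worth trying (not from the original post, and whether it helps depends on your BLAS/LAPACK build and Numba threading layer) is to pin LAPACK to a single thread so its pool does not compete with Numba's threads. Below is a minimal sketch; the optional threadpoolctl package, the function name passivity_njit_parallel, and the problem sizes are all my own illustrative choices:

import numpy as np
import numba as nb
from threadpoolctl import threadpool_limits  # optional package, assumed installed


@nb.njit(parallel=True)
def passivity_njit_parallel(s_parameter):
    # Same per-frequency eigenvalue loop as in the question, parallelized with prange.
    passivity_array = np.zeros(s_parameter.shape[:2], dtype=np.float64)
    for i in nb.prange(s_parameter.shape[0]):
        s = s_parameter[i]
        gram_matrix = s @ s.conj().T
        passivity_array[i] = np.linalg.eigvalsh(gram_matrix)
    return passivity_array


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_freq, n_port = 256, 100  # smaller than the post's sizes, purely for illustration
    s = rng.random((n_freq, n_port, n_port)) + 1j * rng.random((n_freq, n_port, n_port))

    passivity_njit_parallel(s[:1])  # warm-up call so compilation stays outside any timing

    # Restrict the BLAS/LAPACK pool to one thread while Numba parallelizes over
    # frequency points, so the two thread pools do not fight over the same cores.
    with threadpool_limits(limits=1, user_api="blas"):
        result = passivity_njit_parallel(s)

    print(result.shape)  # (n_freq, n_port)

Timing the call with and without the threadpool_limits block should show whether thread oversubscription is what hurts the parallel version on your machine; setting OPENBLAS_NUM_THREADS or MKL_NUM_THREADS to 1 before importing NumPy is a similar, coarser-grained alternative.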