Huge Slowdown on HPC

I have a simple loop that solves a small linear system (effectively a matrix inversion) at each grid point, which I wanted to parallelize using Numba:

import numpy as np
from numba import jit, prange


@jit(nopython=True, parallel=True)
def _solve_finite_inner(
    solution_vector,
    lattice_inverse_polarisability,
    matrix_a,
    source_vector,
    unique_wavevectors_im
):
    # The outer loop is parallelised; each (kk, ll) entry is an independent solve.
    for ll in prange(solution_vector.shape[1]):
        for kk in range(solution_vector.shape[0]):
            solution_vector[kk, ll] = np.linalg.solve(
                (
                    lattice_inverse_polarisability[kk]
                    - matrix_a[kk] * unique_wavevectors_im[kk] ** 2
                ),
                source_vector[kk, ll, ...]
            )

On my laptop (MacBook Pro), for a small system (matrices of dimension (15, 15), solution_vector.shape = (100, 100)), this loop takes around 5 seconds to execute using 8 cores.
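For reference, a minimal, self-contained way to reproduce the timing would look something like the sketch below (the shapes are from above, but the random data and the warm-up call are illustrative, not my actual setup; the warm-up keeps JIT compilation out of the measurement):

import time
import numpy as np

n_k, n_l, dim = 100, 100, 15  # shapes from the post

# Placeholder data; the real arrays come from the physics setup.
rng = np.random.default_rng(0)
lattice_inverse_polarisability = rng.standard_normal((n_k, dim, dim)) + 3.0 * np.eye(dim)
matrix_a = rng.standard_normal((n_k, dim, dim))
unique_wavevectors_im = rng.standard_normal(n_k)
source_vector = rng.standard_normal((n_k, n_l, dim))
solution_vector = np.empty((n_k, n_l, dim))

args = (solution_vector, lattice_inverse_polarisability,
        matrix_a, source_vector, unique_wavevectors_im)

_solve_finite_inner(*args)  # warm-up call: triggers compilation

t0 = time.perf_counter()
_solve_finite_inner(*args)  # steady-state run
print(f"loop time: {time.perf_counter() - t0:.2f} s")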

I am also running the same code on a Linux HPC, where I had hoped to increase the core count to speed the loop up for larger matrix dimensions. On the HPC however, using 8 cores, the loop takes over 30 minutes for exactly the same parameters.

I am also finding that all the (non-parallelised) njit-decorated functions run significantly more slowly on the HPC, between 2 and 10 times.

What could be going wrong? I’m really at a loss.

What kind of HPC system is this? I once worked on an IBM Blue Gene system that initially needed 30 minutes for a simple hello world. The issue was that every core was attempting to load the Python binaries and modules it needed, and that caused significant traffic on the shared file system. The solution was to use MPI: the first node loaded the Python binaries and modules and then MPI-broadcast them to all the other nodes.
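The general pattern, sketched here with mpi4py (just an illustration; the Blue Gene fix used its own tooling, and the file name below is hypothetical), is to do the expensive read once on rank 0 and broadcast the bytes to everyone else:

from mpi4py import MPI

comm = MPI.COMM_WORLD

payload = None
if comm.Get_rank() == 0:
    # Only rank 0 touches the shared file system.
    with open("big_module_or_data.bin", "rb") as f:  # hypothetical file
        payload = f.read()

# All other ranks receive the bytes over the interconnect instead.
payload = comm.bcast(payload, root=0)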

The other thing to look at is the available threading layers; select one appropriate for your system:

https://numba.readthedocs.io/en/stable/user/threading-layer.html#which-threading-layers-are-available
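For example (a minimal sketch; 'tbb' here is just one of the possible choices), you can request a layer up front and then check which one was actually used after the first parallel run:

import numpy as np
from numba import config, njit, prange, threading_layer

# Request a layer before any parallel function is compiled.
config.THREADING_LAYER = 'tbb'  # alternatives: 'omp', 'workqueue'

@njit(parallel=True)
def demo(x):
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i]
    return total

demo(np.ones(100))        # first call compiles and spins up the layer
print(threading_layer())  # reports the layer Numba actually used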

Thank you for replying so quickly.

It’s a Linux system; the standard nodes have dual 2.0 GHz Intel Skylake processors (40 cores, 196 GB memory). I am only trying to run on one node, as that should be sufficient for what I need.

I have tried setting the threading layer to omp, as the cluster supports OpenMP by default; I also tried installing tbb after running into this issue. The runtime was the same in both cases.

I don’t think it’s a loading issue, or if it is, there are two separate issues. Without parallel=True I get through the loop above in around 12 seconds (about twice as long as on my laptop). The huge slowdown seems to come from parallel=True itself.
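One thing I still plan to rule out is the thread count: as I understand it, Numba defaults to one thread per detected core, which on a 40-core node could exceed the 8 cores my job was actually allocated. A sketch of checking and capping it (assuming Numba 0.49+, where these functions are available):

from numba import get_num_threads, set_num_threads

print(get_num_threads())  # defaults to the number of detected cores
set_num_threads(8)        # match the cores actually allocated to the job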