I have a simple loop that solves a small linear system at each iteration, which I wanted to parallelize using Numba:

```python
import numpy as np
from numba import jit, prange


@jit(nopython=True, parallel=True)
def _solve_finite_inner(
    solution_vector,
    lattice_inverse_polarisability,
    matrix_a,
    source_vector,
    unique_wavevectors_im,
):
    for ll in prange(solution_vector.shape[1]):
        for kk in range(solution_vector.shape[0]):
            solution_vector[kk, ll] = np.linalg.solve(
                (
                    lattice_inverse_polarisability[kk]
                    - matrix_a[kk] * unique_wavevectors_im[kk] ** 2
                ),
                source_vector[kk, ll, ...],
            )
```
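Since the loop is just a stack of independent linear solves, it can also be written in pure NumPy with broadcast `np.linalg.solve`, which I use as a correctness check and single-threaded baseline (the shapes in the docstring are my guesses at the ones the jitted function sees):

```python
import numpy as np


def solve_finite_inner_numpy(lattice_inverse_polarisability, matrix_a,
                             source_vector, unique_wavevectors_im):
    """Pure-NumPy equivalent of the jitted loop, using stacked solves.

    Assumed shapes (hypothetical, matching my setup):
        lattice_inverse_polarisability : (K, M, M)
        matrix_a                       : (K, M, M)
        unique_wavevectors_im          : (K,)
        source_vector                  : (K, L, M)
    Returns a solution array of shape (K, L, M).
    """
    # Build all K system matrices in one shot: (K, M, M)
    systems = (lattice_inverse_polarisability
               - matrix_a * unique_wavevectors_im[:, None, None] ** 2)
    # Broadcast each (M, M) system over the L right-hand sides:
    # a -> (K, 1, M, M), b -> (K, L, M, 1); result is (K, L, M, 1).
    x = np.linalg.solve(systems[:, None, :, :], source_vector[..., None])
    return x[..., 0]
```

The trailing singleton axis on `source_vector` makes the right-hand sides explicit column vectors, which avoids any ambiguity in how `np.linalg.solve` broadcasts 1-D versus stacked operands.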

On my laptop (MacBook Pro), for a small system (matrices of dimension `(15, 15)`, `solution_vector.shape = (100, 100)`), this loop takes around 5 seconds to execute using 8 cores.

I am also running the same code on a Linux HPC, where I had hoped to increase the core count to speed up the loop for larger matrix dimensions. On the HPC, however, using the same 8 cores, the loop takes over 30 minutes with exactly the same parameters.
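One possibility I have been trying to rule out is thread oversubscription: `np.linalg.solve` calls into LAPACK, and if the BLAS library linked on the cluster (OpenBLAS or MKL, I am not sure which) spins up its own thread pool inside every `prange` iteration, the 8 Numba workers could each be fighting over the cores. The pinning I have in mind before launching the script (variable names assume OpenBLAS/MKL) is:

```shell
# Pin the BLAS/LAPACK pool used inside np.linalg.solve to a single
# thread, so it does not multiply with Numba's prange workers.
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMBA_NUM_THREADS=8   # keep Numba's own pool at 8 workers
# ...then launch the script in this same shell as usual.
```

Would this kind of oversubscription be consistent with the slowdown I am seeing, or is there a better way to confirm which threading layer and BLAS build are actually in use?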

I am also finding that all the (non-parallelised) `njit`-decorated functions run significantly more slowly (between 2x and 10x) on the HPC.

What could be going wrong? I’m really at a loss.