Huge Slowdown on HPC

I have a simple loop that solves a small linear system (effectively a matrix inversion) at each grid point, which I wanted to parallelize using Numba:

import numpy as np
from numba import jit, prange


@jit(nopython=True, parallel=True)
def _solve_finite_inner(
    solution_vector,
    lattice_inverse_polarisability,
    matrix_a,
    source_vector,
    unique_wavevectors_im
):
    # The outer loop is parallelised; each (kk, ll) entry is an independent solve.
    for ll in prange(solution_vector.shape[1]):
        for kk in range(solution_vector.shape[0]):
            solution_vector[kk, ll] = np.linalg.solve(
                (
                    lattice_inverse_polarisability[kk]
                    - matrix_a[kk] * unique_wavevectors_im[kk] ** 2
                ),
                source_vector[kk, ll, ...]
            )

On my laptop (MacBook Pro), for a small system (matrices of dimension (15, 15), solution_vector.shape = (100, 100)), this loop takes around 5 seconds to execute using 8 cores.
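For reference, a minimal, self-contained way to reproduce the timing would look something like the sketch below (the shapes are from above, but the random data and the warm-up call are illustrative, not my actual setup; the warm-up keeps JIT compilation out of the measurement):

import time
import numpy as np

n_k, n_l, dim = 100, 100, 15  # shapes from the post

# Placeholder data; the real arrays come from the physics setup.
rng = np.random.default_rng(0)
lattice_inverse_polarisability = rng.standard_normal((n_k, dim, dim)) + 3.0 * np.eye(dim)
matrix_a = rng.standard_normal((n_k, dim, dim))
unique_wavevectors_im = rng.standard_normal(n_k)
source_vector = rng.standard_normal((n_k, n_l, dim))
solution_vector = np.empty((n_k, n_l, dim))

args = (solution_vector, lattice_inverse_polarisability,
        matrix_a, source_vector, unique_wavevectors_im)

_solve_finite_inner(*args)  # warm-up call: triggers compilation

t0 = time.perf_counter()
_solve_finite_inner(*args)  # steady-state run
print(f"loop time: {time.perf_counter() - t0:.2f} s")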

I am also running the same code on a Linux HPC, where I had hoped to increase the core count to speed the loop up for larger matrix dimensions. On the HPC however, using 8 cores, the loop takes over 30 minutes for exactly the same parameters.

I am also finding that all the (non-parallelised) njit-decorated functions run significantly more slowly on the HPC, between 2 and 10 times.

What could be going wrong? I’m really at a loss.

What kind of HPC system is this? I once worked on an IBM Blue Gene system that initially needed 30 minutes for a simple hello world. The issue was that every core was attempting to load the Python binaries and modules it needed, and that caused significant traffic on the shared file system. The solution was to use MPI: the first node loaded the Python binaries and modules and then MPI-broadcast them to all the other nodes.
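The general pattern, sketched here with mpi4py (just an illustration; the Blue Gene fix used its own tooling, and the file name below is hypothetical), is to do the expensive read once on rank 0 and broadcast the bytes to everyone else:

from mpi4py import MPI

comm = MPI.COMM_WORLD

payload = None
if comm.Get_rank() == 0:
    # Only rank 0 touches the shared file system.
    with open("big_module_or_data.bin", "rb") as f:  # hypothetical file
        payload = f.read()

# All other ranks receive the bytes over the interconnect instead.
payload = comm.bcast(payload, root=0)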

The other thing to look at is the available threading layers; select one appropriate for your system:

https://numba.readthedocs.io/en/stable/user/threading-layer.html#which-threading-layers-are-available
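For example (a minimal sketch; 'tbb' here is just one of the possible choices), you can request a layer up front and then check which one was actually used after the first parallel run:

import numpy as np
from numba import config, njit, prange, threading_layer

# Request a layer before any parallel function is compiled.
config.THREADING_LAYER = 'tbb'  # alternatives: 'omp', 'workqueue'

@njit(parallel=True)
def demo(x):
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i]
    return total

demo(np.ones(100))        # first call compiles and spins up the layer
print(threading_layer())  # reports the layer Numba actually used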

Thank you for replying so quickly.

It’s a Linux system; the standard nodes have dual 2.0 GHz Intel Skylake processors (40 cores, 196 GB memory). I am only trying to run on one node, as that should be sufficient for what I need.

I have tried setting the threading layer to omp, as the cluster supports OpenMP by default; I also tried installing tbb after running into this issue. The runtime was the same in both cases.

I don’t think it’s a loading issue, or if it is, there are two separate issues. Without parallel=True I get through the loop above in around 12 seconds (about twice as long as on my laptop). The huge slowdown seems to come from parallel=True itself.
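One thing I still plan to rule out is the thread count: as I understand it, Numba defaults to one thread per detected core, which on a 40-core node could exceed the 8 cores my job was actually allocated. A sketch of checking and capping it (assuming Numba 0.49+, where these functions are available):

from numba import get_num_threads, set_num_threads

print(get_num_threads())  # defaults to the number of detected cores
set_num_threads(8)        # match the cores actually allocated to the job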