Is there any way to get round the overhead of numba prange with one thread?

I'd like to be able to set the number of threads for my code to 1 sometimes. Unfortunately, all the numba code that uses prange then runs more slowly than if I had used range. Is there a way round this?

Would a GitHub issue about this be welcome? Maybe it would be possible for numba to skip the parallel machinery entirely if num threads equals 1?

Do you have a minimal reproducer for the slowdown you’re seeing?

Here is an example:

import numpy as np
import numba as nb
from numba import njit, prange

nb.set_num_threads(1)

@njit(parallel=True, fastmath=True)
def lower_tri_matvec_par(L, x):
    """
    Multiply a lower triangular matrix L (n x n)
    by a vector x (length n).

    Returns y = L @ x, using only the lower triangle.
    """

    n = L.shape[0]
    y = np.zeros(n, dtype=L.dtype)

    for i in prange(n):
        s = 0.0
        # only sum up to the diagonal (j <= i)
        for j in range(i + 1):
            s += L[i, j] * x[j]
        y[i] = s
    return y

@njit(fastmath=True)
def lower_tri_matvec(L, x):
    """
    Multiply a lower triangular matrix L (n x n)
    by a vector x (length n).

    Returns y = L @ x, using only the lower triangle.
    """
    n = L.shape[0]
    y = np.zeros(n, dtype=L.dtype)
    for i in range(n):
        s = 0.0
        # only sum up to the diagonal (j <= i)
        for j in range(i + 1):
            s += L[i, j] * x[j]
        y[i] = s
    return y

N = 2000

L = np.tril(np.random.rand(N, N).astype(np.float32))
x = np.random.rand(N).astype(np.float32)

In [103]: %timeit -n 1000 -r 100 lower_tri_matvec_par(L, x)
851 μs ± 35 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

In [104]: %timeit -n 1000 -r 100 lower_tri_matvec(L, x)
819 μs ± 20.7 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

Thanks… is it possible that compilation time is being included in your observations?

What is the result if you run %timeit a second time, immediately after what you have now?

In [2]: %timeit -n 1000 -r 100 lower_tri_matvec_par(L, x)
831 μs ± 218 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

In [3]: %timeit -n 1000 -r 100 lower_tri_matvec(L, x)
790 μs ± 18.3 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

In [4]: %timeit -n 1000 -r 100 lower_tri_matvec_par(L, x)
812 μs ± 3.56 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

In [5]: %timeit -n 1000 -r 100 lower_tri_matvec(L, x)
803 μs ± 16.7 μs per loop (mean ± std. dev. of 100 runs, 1,000 loops each)

Thanks, your results are directionally similar to mine, though you clearly have a much faster machine than I do :grinning_face:

I don’t know of any way to avoid the ~1% slowdown you’re seeing due to the parallelization overhead. As I understand it, numba doesn’t generate different code paths for different thread counts; the compilation is binary (threaded or not threaded). So clearly I haven’t added much value with respect to your original post. :frowning:

Hey @lesshaste ,

If disk space and compilation time are not critical, you can compile the same core algorithm twice, once with parallel=False and once with parallel=True.
You can use a dispatcher to choose between them at runtime. This lets you switch modes within the same Python session without changing thread settings.

Here is an example:

import numpy as np
from numba import njit, prange

NBCONFIG = {'fastmath': True, 'cache': False}

@njit(**NBCONFIG)
def lower_tri_matvec(L, x):
    """Benchmark"""
    n = L.shape[0]
    y = np.zeros(n, dtype=L.dtype)
    for i in range(n):
        s = 0.0
        for j in range(i + 1):
            s += L[i, j] * x[j]
        y[i] = s
    return y

@njit(**NBCONFIG, inline='always')
def lower_tri_matvec_core(L, x):
    """Core algorithm to be inlined."""
    n = L.shape[0]
    # initialization overhead: use empty array instead
    y = np.zeros(n, dtype=L.dtype)  
    for i in prange(n):
        # casting overhead: use type aware scalar variable instead
        s = 0.0  
        for j in range(i + 1):
            s += L[i, j] * x[j]
        y[i] = s
    return y

@njit(**NBCONFIG, parallel=False)
def lower_tri_matvec_seq(L, x):
    """Sequential algorithm (same as lower_tri_matvec)."""
    return lower_tri_matvec_core(L, x)

@njit(**NBCONFIG, parallel=True)
def lower_tri_matvec_par(L, x):
    """Parallel algorithm."""
    return lower_tri_matvec_core(L, x)

@njit(**NBCONFIG)
def lower_tri_matvec_dispatch(L, x, parallel=False):
    return lower_tri_matvec_par(L, x) if parallel else lower_tri_matvec_seq(L, x)

N = 2000
L = np.tril(np.random.rand(N, N).astype(np.float32))
x = np.random.rand(N).astype(np.float32)

# warm-up
lower_tri_matvec(L, x)
lower_tri_matvec_dispatch(L, x)
lower_tri_matvec_dispatch(L, x, parallel=True)

%timeit -n 300 -r 100 lower_tri_matvec(L, x)
%timeit -n 300 -r 100 lower_tri_matvec_dispatch(L, x, parallel=False)
%timeit -n 300 -r 100 lower_tri_matvec_dispatch(L, x, parallel=True)

# 811 μs ± 80 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 793 μs ± 35.9 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 670 μs ± 29.8 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)

# 794 μs ± 40 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 788 μs ± 22.3 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 673 μs ± 39.4 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)

# 788 μs ± 24.6 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 794 μs ± 38 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)
# 705 μs ± 112 μs per loop (mean ± std. dev. of 100 runs, 300 loops each)