Is Numba suitable for mapping a list of arrays into an array in parallel?

Let’s say we have a typed list containing numpy arrays. I want to reduce each array to a scalar (e.g., its sum), so the result becomes an array with the same length as the input list. Since each reduction is independent, I thought it might be a good idea to parallelize the execution (parallel=True). However, I encountered the following problems:

Method 1: I preallocate an np.zeros array, then update n[idx] inside a prange loop (a sketch of this approach follows below). This method took about 2x longer than the pure Python implementation.

Method 2: create a list inside the jitted function and append to it in the prange loop. Since the list is not thread-safe, this method randomly crashed the program.
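
A minimal sketch of what Method 1 might look like (the input data, array sizes, and the reduction itself are made up for illustration; the real reduction takes tens to hundreds of milliseconds):

import numpy as np
import numba as nb
from numba.typed import List

@nb.njit(parallel=True)
def reduce_each(arrays):
    # Method 1: preallocate the output, one scalar per input array.
    out = np.zeros(len(arrays))
    for i in nb.prange(len(arrays)):
        # Placeholder reduction; each iteration writes to its own slot,
        # so no thread-safety issue arises here.
        out[i] = np.sum(arrays[i])
    return out

# Hypothetical input: a typed list of 1-D float64 arrays.
data = List()
for _ in range(100):
    data.append(np.random.rand(1000))

print(reduce_each(data))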

My reduction operation takes tens to hundreds of milliseconds to complete, so I would expect the parallelization overhead to be offset by the computation time.

My question is whether parallelization in Numba is suitable for the case I described. In the docs, all the examples show a reduction to a scalar. When I want to perform a mapping and return a list, is there a recommended approach for using a parallelized jit function?

Thanks in advance!

Hi @wuyuanyi135,

I agree that this is probably a case where the cost of parallelization is greater than the potential savings for practically sized data. However, the jitted but non-parallel version of such a function still outperforms the numpy/Python version. Tests follow.

import numpy as np
import numba as nb

@nb.njit('f8(f8[:,:,:])', parallel = True)
def msum_pnumba(x: np.ndarray) -> float:
    " Find the sum of an array of arrays. Parallel and jitted. "
    y = np.array([np.sum(_) for _ in x])
    return np.sum(y)

@nb.njit('f8(f8[:,:,:])', parallel = False)
def msum_numba(x: np.ndarray) -> float:
    " Find the sum of an array of arrays. Jitted but not parallel. "
    y = np.array([np.sum(_) for _ in x])
    return np.sum(y)

def msum_numpy(x: np.ndarray) -> float:
    " Find the sum of an array of arrays. Numpy/Python only."
    y = np.array([np.sum(_) for _ in x])
    return np.sum(y)

For small data the jitted but non-parallel version outperforms the others.

x = np.array([np.array(((1.,2.,3.),(4.,5.,6.))), 
              np.array(((7., 8., 9.), (10., 11., 12.)))])

> x[0], x[1]
(array([[1., 2., 3.],
        [4., 5., 6.]]),
 array([[ 7.,  8.,  9.],
        [10., 11., 12.]]))

> %timeit msum_pnumba(x)
131 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

> %timeit msum_numba(x)
563 ns ± 8.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

> %timeit msum_numpy(x)
20.8 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The results hold for 10,000 arrays of random (2*2)s. However, the gap between the parallel and straight numpy versions narrows.

xx = [np.random.rand(2, 2) for _ in range(10000)]
xx = np.array(xx)

> xx[0:3]

array([[[0.67448821, 0.34600937],
        [0.56723079, 0.50345596]],

       [[0.03706687, 0.37920189],
        [0.99338415, 0.1772098 ]],

       [[0.28515366, 0.53257918],
        [0.84548535, 0.15695816]]])

> %timeit msum_pnumba(xx)
487 ms ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

> %timeit msum_numba(xx)
704 µs ± 3.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

> %timeit msum_numpy(xx)
580 ms ± 6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Perhaps with very large data (e.g. 1-million arrays of (2*2)s) the parallel version could outperform the others.


If your code can tolerate fastmath=True, the reductions will likely vectorize and hopefully you’ll get a good performance improvement.

xref docs: Performance Tips — Numba 0.52.0-py3.7-linux-x86_64.egg documentation
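
For example, applied to the parallel sum above (a sketch only; whether the inner sums actually vectorize depends on the compiler and the target CPU):

import numpy as np
import numba as nb

# Same reduction as msum_pnumba above, but compiled with fastmath=True so
# the summations may be vectorized under relaxed floating-point semantics.
@nb.njit('f8(f8[:,:,:])', parallel=True, fastmath=True)
def msum_pnumba_fm(x: np.ndarray) -> float:
    y = np.array([np.sum(_) for _ in x])
    return np.sum(y)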

Thanks for your insightful analysis on this problem!

Thanks! fastmath=True does improve the performance. It’s now comparable with the non-jit version, but the performance gain is not significant yet. I will stick to the non-jit version for the moment, since its own run time is short compared to the other bottlenecks.

I wonder what to do if I want to return a sequence of variable-sized arrays, which I cannot simply join together into a single numpy array. For example:

import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=True)
def fun_jit(n):
    ret = []

    for i in prange(n):
        # Stand-in for a long computation that produces a variable-sized array.
        long_computation_var_size_array = np.zeros((np.random.randint(1, 10), 1))
        ret.append(long_computation_var_size_array)

    return ret

In this case, list.append may cause a crash because the list is not thread-safe. Setting aside the performance gain, can you comment on whether there is any way to achieve parallel list creation?

Hi @wuyuanyi135, I haven’t thought about parallel list creation yet, but it’s interesting and perhaps I’ll come back to it.

The parallelization debate is fun. I’d like to show a case where the parallel algorithm outperforms the single-threaded one.

Say for example I want to find the first principal component of a set of (n*3) arrays:

import numpy as np
import numpy.linalg as la
import numba as nb

@nb.njit('f8[:,:] (f8[:,:])')
def _pca(y:np.ndarray):
    " Find the first principal component for a single numpy array. "
    # Scatter matrix
    mu = [np.mean(_) for _ in y.T]
    mu = np.array(mu)
    ymmu = y - mu
    s = np.dot(ymmu.T, ymmu)
    
    # Eigenvectors
    w, v = la.eig(s)
    v = v.T

    # 1st PC
    pc = v[w == max(w)]

    return pc

@nb.njit(parallel = True)
def mpca_parallel(x:np.ndarray) -> np.ndarray:
    " Find the first principal component for a set of numpy arrays in parallel. "
    n = len(x)
    mpc = np.zeros((n, 3))
    for i in nb.prange(n):
        mpc[i] = _pca(x[i])
    return mpc

@nb.njit(parallel = False)
def mpca_notparallel(x:np.ndarray) -> np.ndarray:
    " Find the first principal component for a set of numpy arrays on a single thread. "
    n = len(x)
    mpc = np.zeros((n, 3))
    for i in range(n):
        mpc[i] = _pca(x[i])
    return mpc

Now generate a set of 995 random numpy arrays of varying length (number of observations) but constant dimensions:

x = [np.random.rand(10*_, 3) for _ in np.arange(5, 1000)]

Timings follow:

> %timeit mpca_parallel(x)
52.7 ms ± 505 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

> %timeit mpca_notparallel(x)
126 ms ± 6.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And to prove both functions return the same result:

> mpca_parallel(x)[:3].squeeze(), mpca_notparallel(x)[:3].squeeze()

(array([[-0.58356343, -0.34276455,  0.73618353],
        [-0.22326603,  0.94215117,  0.25000692],
        [-0.76942768, -0.63292223, -0.08596801]]),
 array([[-0.58356343, -0.34276455,  0.73618353],
        [-0.22326603,  0.94215117,  0.25000692],
        [-0.76942768, -0.63292223, -0.08596801]]))

Anyway my point is that whether or not to parallelize depends on the algorithm and the size of the data. In general, my experience suggests that parallelization increases in effectiveness as algorithms and data increase in complexity and size respectively.

Hope this helps. It made for some enjoyable programming.

@wuyuanyi135 list.append is not thread-safe and, without doing something very involved, it cannot in general be made so. I think in either the next Numba release or the one after that there could be a way to structure the computation so that something like a list-of-lists could be used for such cases; it relies on the investigative work in Improve _get_thread_id and make the schedule accessible. · stuartarchibald/numba@35d2b10 · GitHub

@stuartarchibald @ryanchien Thank you all for the very insightful elaboration of this problem. They are very helpful!

@stuartarchibald The parallel computation and assignment to a preallocated numpy array does work, and I can confirm the performance gain when a large dataset is involved. However, this usage does not cover the case of returning a sequence of arrays with constant dimension but variable size, right? For example, say the input is a list of 1-D vectors and the output is a list of the cumsums of those vectors. Since the size of each input is not fixed, the outputs can be heterogeneous in size and cannot be concatenated into one array. (I understand this is a bad example, as we could simply invoke the function on each element of the list; it is just meant to demonstrate the case.)


@wuyuanyi135 do you have an example of what you are describing? Whilst the input size and output size are determined at run time, this doesn’t seem to preclude allocating an array that’s the same length as the list as a storage buffer.
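
One way to read that suggestion, sketched for the cumsum example above (the function and variable names are hypothetical; the output sizes are taken from the inputs, so all buffers can be allocated before the parallel loop starts):

import numpy as np
import numba as nb
from numba.typed import List

@nb.njit(parallel=True)
def cumsum_each(vectors):
    n = len(vectors)
    # Each output has the same length as its input, so the sizes (and hence
    # the offsets into one flat storage buffer) are known up front.
    sizes = np.empty(n, dtype=np.int64)
    for i in range(n):
        sizes[i] = vectors[i].shape[0]
    offsets = np.zeros(n + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(sizes)

    flat = np.empty(offsets[-1])
    for i in nb.prange(n):
        # Each iteration writes to a disjoint slice of the buffer,
        # so no thread-safety issue arises.
        flat[offsets[i]:offsets[i + 1]] = np.cumsum(vectors[i])
    return flat, offsets

# Hypothetical input: 1-D vectors of varying length.
vecs = List()
for k in range(5):
    vecs.append(np.random.rand(k + 3))

flat, offsets = cumsum_each(vecs)
# Split the flat buffer back into variable-sized arrays outside the jitted function.
results = [flat[offsets[i]:offsets[i + 1]] for i in range(len(vecs))]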