Let's say we have a typed list containing NumPy arrays. I want to reduce each array to a scalar (e.g., its sum), so the result is an array with the same length as the input list. Since the reductions are independent of each other, I thought it might be a good idea to parallelize the execution (parallel=True). However, I ran into the following problems:

Method 1: I preallocate an array with np.zeros, then update n[idx] inside a prange loop. This took about twice as long as the pure Python implementation.

Method 2: I create a list inside the jitted function and append to it in the prange loop. Since the list is not thread-safe, this method crashed the program at random.

Each reduction takes tens to hundreds of milliseconds to complete, so I would expect the computation time to outweigh the parallelization overhead.

My question is whether parallelization in Numba is suitable for the case I described. In the docs, all the examples show a reduction to a single scalar. When I want to perform a mapping and return a list instead, is there a recommended approach for using a parallelized jit function?

Thanks in advance!