Parallel, prange and (fixed length) lists

I’m looking at a specialised case of the rp-forest construction in pynndescent. Right now pynndescent uses joblib to parallelise over tree creation, where each tree is created using numba-jitted functions. In the specialised case the return value is not a complicated tree structure but rather a simple numpy array (of variable, and at the time of calling unknown, length). I was interested in trying to use numba parallelism and avoid having to use joblib. Since the returned arrays are of variable size I can’t assign into a pre-allocated array, but instead have a list of arrays as the result. However, lists aren’t thread-safe, so the parallelism is worrisome. While experimenting I tried something like:

@numba.njit(parallel=True)
def make_forest(data, rng_states, leaf_size, max_depth):
    return [
        make_leaf_array(data, rng_states[i], leaf_size, max_depth=max_depth)
        for i in numba.prange(rng_states.shape[0])
    ]
and it “works”, but I presume it isn’t safe in general. Nonetheless, I feel like there should be a solution, in that the computations are all independent, and the size of the list is fixed (and known at the outset), so it is all embarrassingly parallel. Is there a “right” safe way to do this? Should I pre-allocate the fixed length list of arrays with dummy arrays and then assign to the fixed length list? Is there something more clever I can be doing? Should I just go with the simple joblib parallelism 'cause it’s the right tool for this particular job?

Numba supports parallel array comprehensions (see the documentation), but I don’t think list comprehensions are thread-safe.
But the other solution you suggested sounds good to me. You can fill the list with a single value initially, so you don’t have to create rng_states.shape[0] arrays. E.g. like this:

import numba
import numpy

@numba.njit(parallel=True)
def make_forest(data, rng_states, leaf_size, max_depth):
    # This allocates only one array and fills the list with references to it.
    # The dtype and ndim of the array should match the return type of make_leaf_array.
    ret = [numpy.empty((1, 1))] * rng_states.shape[0]
    for i in numba.prange(len(ret)):
        ret[i] = make_leaf_array(data, rng_states[i], leaf_size, max_depth=max_depth)
    return ret

Unless make_leaf_array is an incredibly cheap function, this should be fine in terms of performance.

I would agree with what @sschaer has mentioned.
If you know the maximum possible size of the array, allocate that up front and also return an integer tracking the filled length, which you can use to slice the array upon completion. I did this recently to get rid of np.concatenate in a prange loop and got a 3x speed increase, because np.concatenate had to constantly allocate memory for a new array in each of the parallel threads.

Along those lines …
If you can do a little re-architecting to pre-allocate as many of the data structures used by make_leaf_array as possible outside the prange loop, and pass them in as parameters so the function just references them rather than creating them, you may be really surprised by the performance increase you see … at least I was.