Prange slowdown

The following function is part of a program that renders the Buddhabrot:

from numba import njit, prange

# Switching parallel=False to parallel=True is what triggers the slowdown.
@njit(parallel=False, fastmath=True)
def buddhabrot_trajectory(c, point_list, iteration_ranges, iteration_ranges_of_escape, iterations_for_escape):
    for o in prange(0, OVERSAMPLE):
        if not definitely_in_mandelbrot_set(c[o]):
            iteration_range_counter = 0
            while iteration_range_counter < NUM_ITER_RANGES:
                k = iterate_and_collect(c[o], iteration_ranges[iteration_range_counter], iteration_ranges[iteration_range_counter + 1], point_list[:, o])
                if k < iteration_ranges[iteration_range_counter + 1]:
                    # Escaped within this range: record it and force the while loop to end.
                    iteration_ranges_of_escape[o] = iteration_range_counter
                    iteration_range_counter = NUM_ITER_RANGES
                    iterations_for_escape[o] = k
                # Advance to the next range; this sits outside the if so that
                # points that have not escaped yet also make progress.
                iteration_range_counter += 1

Here, c holds OVERSAMPLE slightly different complex numbers. For each of them, the function iterate_and_collect performs the Mandelbrot iteration and stores the points it generates in that sample's own slice of point_list, namely point_list[:,o].
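To make the calling convention concrete, here is a minimal sketch of what a function with this interface could look like. It is a hypothetical stand-in, not the real iterate_and_collect from the program: the parameter names start, stop and points are assumptions, as is the idea of resuming from the last stored point. The import fallback only exists so the sketch also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba.
    def njit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap


@njit(fastmath=True)
def iterate_and_collect(c, start, stop, points):
    # Hypothetical stand-in for the real function.
    # When continuing a later range, resume from the last stored point.
    z = points[start - 1] if start > 0 else 0j
    for k in range(start, stop):
        z = z * z + c
        points[k] = z  # collect the trajectory in the caller's slice
        if z.real * z.real + z.imag * z.imag > 4.0:
            return k  # escaped at iteration k, so k < stop
    return stop  # no escape within [start, stop)


points = np.zeros(10, dtype=np.complex128)
k_escape = iterate_and_collect(1 + 1j, 0, 10, points)
k_bounded = iterate_and_collect(0j, 0, 10, np.zeros(10, dtype=np.complex128))
```

With this convention, a return value smaller than stop signals an escape, which matches the test k < iteration_ranges[iteration_range_counter+1] in the loop above.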

iteration_ranges_of_escape and iterations_for_escape are numpy arrays of length OVERSAMPLE and dtype=np.uint32.

My thinking was that by creating these numpy arrays outside the function and reusing them, I would save the overhead of creating local arrays inside the function.
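For illustration, the pre-allocation described above might look roughly like the following. The sizes and the way c is generated are placeholders (OVERSAMPLE = 8 and MAX_ITER = 1000 are assumed values, not the ones from the real program); only the shapes and dtypes follow the description.

```python
import numpy as np

# Assumed sizes for illustration only.
OVERSAMPLE = 8
MAX_ITER = 1000

rng = np.random.default_rng(0)
# One slightly perturbed complex sample per oversampling slot (placeholder values).
jitter = rng.standard_normal(OVERSAMPLE) + 1j * rng.standard_normal(OVERSAMPLE)
c = (-0.5 + 0.1j) + 1e-3 * jitter

# Column o receives the trajectory of sample o.
point_list = np.zeros((MAX_ITER, OVERSAMPLE), dtype=np.complex128)
iteration_ranges_of_escape = np.zeros(OVERSAMPLE, dtype=np.uint32)
iterations_for_escape = np.zeros(OVERSAMPLE, dtype=np.uint32)

# With NumPy's default C order, a column slice is a strided view:
# consecutive elements of point_list[:, o] are OVERSAMPLE items apart in memory.
column_is_contiguous = point_list[:, 0].flags["C_CONTIGUOUS"]
```

In this layout the columns of neighbouring samples are interleaved in memory. Whether that plays any role in the slowdown is not something this sketch can show; it only makes the layout explicit.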

I wrote this function with the idea that the prange(0,OVERSAMPLE) loop would be parallelizable. The calculations for each o in the loop are independent, and I set up the arrays so that no race conditions would occur.

Still, setting parallel=True causes a dramatic slowdown of this function rather than the expected speedup: by rough timing, it runs about 6 times slower. CPU load is above 90% with parallel=True (versus 10% with parallel=False), so the extra cores are clearly doing something, just not delivering the large performance increase I am trying to get.
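One caveat with rough timings: the first call of an @njit function includes JIT compilation, so it helps to make an untimed warm-up call before measuring. The sketch below shows such a comparison on a toy escape-time kernel, not on the real buddhabrot_trajectory. The names make_kernel and time_call are made up for this example, and the import fallback only exists so the sketch also runs without Numba.

```python
import time
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (serially) without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap


def make_kernel(parallel):
    # Build the same toy kernel with or without parallel=True.
    @njit(parallel=parallel)
    def kernel(c, out, max_iter):
        for o in prange(c.shape[0]):
            z = 0j
            k = 0
            while k < max_iter and z.real * z.real + z.imag * z.imag <= 4.0:
                z = z * z + c[o]
                k += 1
            out[o] = k  # each iteration writes only to its own index
    return kernel


def time_call(fn, *args, repeats=3):
    fn(*args)  # warm-up call: triggers compilation, not timed
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best


c = np.linspace(-2.0, 0.5, 128) + 0.3j
out_serial = np.zeros(c.shape[0], dtype=np.uint32)
out_parallel = np.zeros_like(out_serial)

t_serial = time_call(make_kernel(False), c, out_serial, 200)
t_parallel = time_call(make_kernel(True), c, out_parallel, 200)
print(f"serial: {t_serial:.6f} s, parallel: {t_parallel:.6f} s")
```

On a workload this small the parallel version may well be slower simply because of threading overhead, so the numbers only become meaningful with realistic sizes. The point of the sketch is the measurement pattern, with compilation kept out of the timed region.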

I have read "Automatic parallelization with @jit" in the Numba 0.50.1 documentation, but I have not found a solution there. I'm aware that problems arise when parallel threads write to the same variable or slice, but I thought I had taken care to avoid that in my code.