How does Numba parallelize reductions (e.g. dot products)


Im not sure about how Numba parallelizes reductions. The docs state, that there are supported reductions such as “+=” . Which means, the following runs in parallel even though that should normaly result in a “race condition”:

def add():
    x = np.arange(0, 10000)
    res = 0
    for i in numba.prange(10000):
        res += x[i]

For research I’d like to know where I can find informations from a offical numba source relating to this so I can cite them. I scanned the docs, but I couldnt find explanations regarding to what exactly happens here.

Thanks in advanced

Hi @burakmartin,

Perhaps have a read of the related developer docs here and here and see if that answers your question. I’m pretty sure that the description is in the latter link (former is for context)!

Hope this helps?

Hey @stuartarchibald ,

Yes that helped! For anyone coming across the same question, this is the answer:

Parallel reductions are not natively provided by GUFuncs but the parfor lowering strategy allows us to use GUFuncs in a way that reductions can be performed in parallel. To accomplish this, for each reduction variable computed by a parfor, the parallel GUFunc and the code that calls it are modified to make the scalar reduction variable into an array of reduction variables whose length is equal to the number of Numba threads. In addition, the GUFunc still contains a scalar version of the reduction variable that is updated by the parfor body during each iteration. One time at the end of the GUFunc this local reduction variable is copied into the reduction array. In this way, false sharing of the reduction array is prevented. Code is also inserted into the main function after the parallel GUFunc has returned that does a reduction across this smaller reduction array and this final reduction value is then stored into the original scalar reduction variable.

Which, in my own words, means: Numba switches my reduction variable “res” in the given example into a array res = [0,0,0,…], and on every iteration, each thread, instead of working with the real reduction variable, works with a copy of that variable inside the array. At the end of each iteration, the original reduction variable is updated again, which for my example means, that the values inside the smaller array are reduced and the result of that is put into the real reduction variable.