Hey @stuartarchibald ,
Yes that helped! For anyone coming across the same question, this is the answer:
Parallel reductions are not natively provided by GUFuncs but the parfor lowering strategy allows us to use GUFuncs in a way that reductions can be performed in parallel. To accomplish this, for each reduction variable computed by a parfor, the parallel GUFunc and the code that calls it are modified to make the scalar reduction variable into an array of reduction variables whose length is equal to the number of Numba threads. In addition, the GUFunc still contains a scalar version of the reduction variable that is updated by the parfor body during each iteration. One time at the end of the GUFunc this local reduction variable is copied into the reduction array. In this way, false sharing of the reduction array is prevented. Code is also inserted into the main function after the parallel GUFunc has returned that does a reduction across this smaller reduction array and this final reduction value is then stored into the original scalar reduction variable.
Which, in my own words, means: Numba switches my reduction variable “res” in the given example into a array res = [0,0,0,…], and on every iteration, each thread, instead of working with the real reduction variable, works with a copy of that variable inside the array. At the end of each iteration, the original reduction variable is updated again, which for my example means, that the values inside the smaller array are reduced and the result of that is put into the real reduction variable.