How do I parallelize this code?

Hello,

I am learning numba and I am trying to parallelize code that calculates y = f(x, parameters) for a large number of x values, so x is chunked. The following code is an example that calculates y = a*exp(b*x) for 3 parameter sets (a, b) and a few x values, but without any parallelization yet. It is only there to make sure that the approach works.

import numpy as np
from numba import jit, prange

@jit(('float64[:], float64[:,:], int64'), nopython=True, parallel=False)
def myFoo(x, par, numthreads):
    result = np.empty((par.shape[0],len(x)), dtype=np.float64)
    chunklen = (len(x) + numthreads - 1) // numthreads
    for i in range(numthreads):
        result[:,i * chunklen:(i + 1) * chunklen] = par[:,0:1]*np.exp(x[i * chunklen:(i + 1) * chunklen]*par[:,0:1])
    return result


x=np.array(np.arange(0,10,1.0))
par=np.array([[1.0,0.5],[2.0,3.0],[3.0,5.0]])
myFoo(x, par, 4)
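
For reference, this serial loop is equivalent to a single NumPy broadcast, so its output can be sanity-checked directly (a quick check using the same column indexing as in the loop):

expected = par[:, 0:1] * np.exp(x * par[:, 0:1])   # shape (3, 10)
print(np.allclose(myFoo(x, par, 4), expected))     # True for the serial version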

This code works. In my real problem, I have a few parameter sets and a lot of x values, so I set the parallel flag to True and replaced range with prange:

import numpy as np
from numba import jit, prange

@jit(('float64[:], float64[:,:], int64'), nopython=True, parallel=True)
def myFoo(x, par, numthreads):
    result = np.empty((par.shape[0],len(x)), dtype=np.float64)
    chunklen = (len(x) + numthreads - 1) // numthreads
    for i in prange(numthreads):
        result[:,i * chunklen:(i + 1) * chunklen] = par[:,0:1]*np.exp(x[i * chunklen:(i + 1) * chunklen]*par[:,0:1])
    return result
    
    
x=np.array(np.arange(0,10,1.0))
par=np.array([[1.0,0.5],[2.0,3.0],[3.0,5.0]])
myFoo(x, par, 4)

And I got the following error:

AssertionError: Sizes of $80binary_subscr.12, $106binary_subscr.26 do not match on /tmp/ipykernel_17011/815571837.py (9)

How can I solve it?

Thanks for your answer.

Hi @stef1611,

could I briefly ask which version of numba you are using? I seem to be unable to reproduce the error you are experiencing.

Also, slightly off-topic: would you mind sharing what encouraged you to define the function signature in the jit decorator? I see a lot of people doing this, especially when they are new to numba, and since it is generally not needed, and maybe even discouraged, I am curious to understand why so many people define a signature when they could just be lazy instead :’)
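
Just for illustration, being lazy here means something as simple as this (a toy sketch, not your function):

import numpy as np
from numba import njit

# No signature: numba infers the argument types on the first call and
# compiles a specialization for exactly those types.
@njit
def scale(x, a):
    return a * np.exp(x)

scale(np.arange(3.0), 2.0)  # compiled here for (float64[:], float64)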

Cheers!

Hi Hannes,

My numba version is 0.53.1.
I agree with you about the function signature. I think I do this because I often program in C. But I am becoming lazy :slight_smile: and it is better.

Cheers,

Stephane

Hi Hannes,

The code I posted with parallel=True and prange gives no error on your config?

Cheers,

Stephane

Hi Stephane,

I first tested on 0.54.0 and now on 0.55.0rc1 and I was a bit careless yesterday. The parallel code runs, but produces different (and wrong) results.
I think there is a good chance you have hit a bug here; I will try to play around a little more. But prange is known to be a bit fragile in some cases; there have been a few issues raised on GitHub in the past.

Cheers

PS: Oh and thanks for sharing your signature motivation :slight_smile:

Some observations:

  1. Should be irrelevant for the problem, but I think you are indexing par incorrectly (you take the same column for both a and b).

  2. If I run with parallel=True but with range instead of prange, I can reproduce your error!

  3. After slightly refactoring your code and giving names to the various slices and chunks, the error message becomes a bit more helpful. It is complaining about a size mismatch between the parameter slice and the x-value slice. But this is only raised when trying to run in parallel, and it should really be valid NumPy broadcasting syntax.

  4. If I run with parallel=False and prange the result is correct (that is expected since prange should just behave like range in that case)

  5. There is an indexing problem in your loop: on the last iteration you are accessing elements that are out of bounds. Without compilation NumPy seems to take care of this (see the short sketch after this list), but I don’t know why, to be honest. Compiling with boundscheck=True does NOT catch this either, which is weird to me. I think it is because these are slices and not scalar indices.

  6. Fixing the indexing by limiting the column index for the last iteration does NOT solve the problem.

  7. If I change result = np.empty(...) to result = np.zeros(...), most of the (seemingly random) entries in the result turn to 0 when running in parallel. I think this points at some problem with the value assignment in the loop. It almost looks as if only the last chunk fills the result array correctly.

  8. If I extend the length of x by one (to make the last chunk have a length of 2 instead of 1), everything is broken.
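
To illustrate point 5 outside of numba: NumPy slices are clipped to the array bounds, so an over-long stop index does not raise, whereas a scalar index would (a tiny sketch):

import numpy as np

a = np.arange(10.0)
print(a[8:12])   # slice stop past the end is silently clipped -> [8. 9.]
# print(a[12])   # a scalar out-of-bounds index would raise IndexError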

So I get the vibe that there is more than one problem here, or it is some weird interaction:

  1. prange steamrolls the error messages I see with range, but the returned result is wrong

  2. There seems to be a broadcasting issue when running with parallel=True

  3. Probably because of the broadcasting issues, the result values are not filled into the output array

  4. boundscheck=True seems to behave like NumPy, but not necessarily as I expected for slices

Ping @stuartarchibald, @sklam, @sseibert: I feel like this is about as deep as I am able to dig; maybe one of you has an idea? :slight_smile:

Last but not least, here is my modified version of your example. Unless I made a mistake, it should do the same as your original code (EXCEPT for the out-of-bounds index and the initial output allocation). The lazy f-strings I used are not supported in compiled mode yet, but you can uncomment them in pure Python (>=3.8) mode for a little debug output.

import numpy as np
from numba import jit, prange, njit

@njit(parallel=True, boundscheck=True)
def myFoo(x, par, numthreads):
    n_par = par.shape[0]
    n_x = len(x)
    result = np.zeros((n_par, n_x), dtype=np.float64)
    chunklen = (len(x) + numthreads - 1) // numthreads
    
    a = par[:,0:1]
    b = par[:,0:1]  # This is probably wrong but should not have an effect on the problem
        
#     print(f"{n_par = }")
#     print(f"{n_x = }")
#     print(f"{result.shape = }")   
#     print(f"{chunklen = }")   
#     print(f"{a = }")
#     print(f"{b = }")
    
    for i in prange(numthreads):
        start_col = i * chunklen
        stop_col = min(n_x, (i + 1) * chunklen)  # Here I fixed the index issue

        x_chunk = x[start_col:stop_col]
        chunk = a*np.exp(x_chunk*b)    
        result[:,start_col:stop_col] = chunk
        
#         print(f"\nchunk_id = {i}")
#         print(f"{start_col = }")   
#         print(f"{stop_col = }")   
#         print(f"{chunk.shape = }")
    
    return result
    
    
x=np.array(np.arange(0,11,1.0))
par=np.array([[1.0,0.5],[2.0,3.0],[3.0,5.0]])
myFoo(x, par, 4)
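
To make the discrepancy easy to see, the parallel result can be compared against the pure-Python execution of the same function (a quick sketch; py_func runs the original, uncompiled Python code):

reference = myFoo.py_func(x, par, 4)     # plain Python run, no numba involved
compiled = myFoo(x, par, 4)              # parallel compiled run
print(np.allclose(reference, compiled))  # prints False when the parallel result is wrong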

Hi Hannes,

You are right. The second parameter must be par[:,1:2]. But I agree with you, it is irrelevant for the problem.

Concerning the problem with the size of the last chunk, it is also surprising to me that it works. This code is based on an example from Examples — Numba 0.50.1 documentation, and I am trying to use prange instead of using threading explicitly. I planned to correct this problem; thanks for having done it.

Something that could perhaps help you: this very simple code

from numba import jit
import numpy as np

@jit(nopython=True, parallel=True)
def myFoo(par,x):
    par[:,0:1]*x
    return 1

x=np.array(np.arange(0,10,2.0))
par=np.array([[i*10.0,i*4.0] for i in range(1,3)])
myFoo(par,x)

outputs the same error:

AssertionError: Sizes of $18binary_subscr.10, x do not match on /tmp/ipykernel_33664/2489146421.py (6)

But it works when parallel is set to False.
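
For reference, in plain NumPy the expression itself is valid broadcasting (a quick check outside numba):

x = np.arange(0, 10, 2.0)                                 # shape (5,)
par = np.array([[i*10.0, i*4.0] for i in range(1, 3)])    # shape (2, 2)
print((par[:, 0:1] * x).shape)                            # (2, 1) * (5,) broadcasts to (2, 5)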

Thanks for your help.

Cheers

Stephane


Hi,

In addition to my previous post, this code works:

@jit(nopython=True, parallel=True)
def myFoo(par,x):
    print(np.outer(par[:,0],x))
    return 1
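
For this particular expression np.outer is an equivalent formulation, which is perhaps why it avoids the failing broadcasting path (a plain-NumPy check, not run through numba):

x = np.arange(0, 10, 2.0)
par = np.array([[i*10.0, i*4.0] for i in range(1, 3)])
print(np.allclose(np.outer(par[:, 0], x), par[:, 0:1] * x))  # True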

Hi,

that’s a great simplification of the reproducer. To me this really looks like a binary operator broadcasting problem when working in parallel mode.

I think this should go to the issue tracker. Nevermind, Graham was faster <3

Cheers

Many thanks for the discussion of the issue so far. I’ve created issues for the items brought up in the thread:


Another addition: I tried to replace the binary operator * in my program with np.outer, np.dot, np.multiply, … and I ran into a new, but I think related, problem.
This is my new minimalist reproducer of the problem. This code works with parallel set to False.

@jit(nopython=True, parallel=True)
def myFoo(par,x):
    print(np.outer(par[:,0],x))   # Ok, as before
    print(np.multiply(par[:,0:1],x)) # Sizes of $52binary_subscr.27, x do not match
    return 1

Hi,

Thanks a lot for your help. I posted another minimalist reproducer.

Best

Stephane.

@Hannes, @gmarkall

Not sure it is related, but we never know (I am a new numba user, so perhaps this is out of context, but it concerns slicing).

In Python, to keep the second dimension of an array when I extract a column, I used

par[:,i,None] # or np.newaxis (it is the same)

But it does not work with numba (parallel or not parallel). So I used this syntax instead:

par[:,i:i+1]

Cheers

@stef1611

The newaxis thingy is a known limitation. NumPy provides quite a lot of indexing and slicing magic that goes beyond standard Python abilities, and I guess reimplementing all of it is not always easy.

I had actually stumbled over that way of indexing because I didn’t know that it preserves the shape; I think that is a great trick :slight_smile: In the past I always had to resort to using np.atleast_2d.
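
For completeness, a tiny plain-NumPy comparison of the three ways to keep the extracted column 2-D (a sketch run outside numba; the None/newaxis variant is the one numba does not support here):

import numpy as np

par = np.array([[1.0, 0.5], [2.0, 3.0], [3.0, 5.0]])
i = 0
print(par[:, i, None].shape)             # (3, 1) -- works in NumPy, not in numba
print(par[:, i:i+1].shape)               # (3, 1) -- the slicing workaround
print(np.atleast_2d(par[:, i]).T.shape)  # (3, 1) -- the atleast_2d alternative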