Another List of Arrays question (Numpy array inside a List Comprehension)

Hi team!!

Does anyone know how to make a list comprehension work with NumPy arrays inside?

I’m having trouble getting the following jitted function to work, which is a simplified version of a larger code:

def multimask(search_arr: np.ndarray, bigarray: np.ndarray):
    return np.array([bigarray == q for q in search_arr])   # <-- Numba error

It doesn’t work when jitted. Below are more explanations and questions…

Expected return:

The function should return a boolean 2D array where each row shows where the corresponding value in search_arr was found in bigarray. For example:

>>> a = np.array([10,20,10,30,10])
>>> s = np.array([10,30])
>>> multimask(s, a)
array([[ True, False,  True, False,  True],  # Found 10 in 1st, 3rd, 5th
       [False, False, False,  True, False]])  # Found 30 in 4th item

But I get the following error:

  No implementation of function Function(<built-in function setitem>) found for signature:
   >>> setitem(array(undefined, 1d, C), int64, array(bool, 1d, C))
  There are 16 candidate implementations:
     - Of which 16 did not match due to:
     Overload of function 'setitem': File: <numerous>: Line N/A.
       With argument(s): '(array(undefined, 1d, C), int64, array(bool, 1d, C))':
      No match.

Workaround, but verbose…

While I managed to work around the jit error by unrolling the list comprehension into an explicit loop, I don’t like the verbosity:

def multimask_okcompile(search_arr:np.ndarray, bigarray:np.ndarray):
    size_search = len(search_arr)
    size_bigarray = len(bigarray)
    arr_mask = np.empty((size_search, size_bigarray), dtype="bool")
    for pos in range(size_search):
        q = search_arr[pos]
        arr_mask[pos,:] = (bigarray == q)
    return arr_mask

… I would prefer a cleaner, more Pythonic solution, ideally a list comprehension or something similar (if that is possible in Numba or NumPy).

Any ideas?

I know that list comprehension of simple types works fine in Numba, but I’m not sure how to make List Comprehension work with NumPy arrays inside. I’m wondering:

  • Do I need to add (the right) type hints ?
  • Can TypedList be used somehow to avoid explicit loop?
  • Or do I need to overload np.array or the built-in function setitem in some way? (I have looked at Numba issue 4470 (“Can’t create a numpy array from a numpy array”), which overloads np.array to accept arrays. Something like that would be fine for me while waiting for Numba to support arrays inside list comprehensions.)

Any direction or help on this will be highly appreciated!

What I have explored so far:

Before posting here, I did a lot of testing and research (in the documentation, this forum, and even ChatGPT). So far, I have tried:

  • Adding type hints within the function in different ways.
  • Defining the types outside the JIT scope and adding them as a dtype parameter.
  • Searching unsuccessfully for a different NumPy function to replace the Numba code (I have tried np.isin, np.vstack, and various NumPy array creation functions) in case it is not possible to make a list comprehension work with NumPy arrays inside.
  • Reading a lot about Numba arrays of arrays or lists of arrays (for example, “passing a list of numpy arrays into np array with numba”). However, something like make_2d() is essentially the same as my verbose workaround.

You could consider using guvectorize, which simplifies it to a 1D problem and is perhaps less verbose for you.

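import numpy as np
from numba import guvectorize

# No type signature given, so the output array must be passed explicitly.
@guvectorize("(n),()->(n)")
def multimask(a, q, out):
    for i in range(a.size):
        out[i] = a[i] == q

a = np.array([10,20,10,30,10])
s = np.array([10,30])

out = np.empty((s.size, a.size), dtype="bool")
multimask(a, s, out)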

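You can have Numba do the output allocation if you can pin down the datatypes it should use, for example:

# The declared types must match the input dtype (int32 here); use int64
# if that is NumPy's default integer on your platform.
@guvectorize("void(int32[:], int32, boolean[:])", "(n),()->(n)")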
def multimask(a, q, out):
    for i in range(a.size):
        out[i] = a[i] == q

multimask(a, s)

You can flip the “axis” of vectorization if needed. If the sizes between the two arrays are very different, it might matter for performance.

Thanks, @Rutger, for introducing me to guvectorize. I plan to use it in the future. However, it would be great if I could keep the NumPy arrays inside a list comprehension in Numba. Is there a known way to make Numba handle this? Perhaps it can be implemented by overloading something?

Why insist on arrays inside a list comprehension?

  • The original Python code can be jitted without modification. This is similar to how the solution to Numba issue #4470 opens the door to using np.array directly with arrays in Numba, avoiding awkward modifications before jitting the code.
  • It enhances code clarity and expressiveness.
  • It would be a very powerful tool to have available as an alternative in Numba.

I’m taking this as a personal challenge and I’m willing to try a few more things before giving up. Do you have any ideas for me to try?

For a first attempt at my challenge, I’d like to explore a solution similar to the one presented in issue #4470.

I may be mistaken, but I have a hunch that overloading the setitem function or the list creation method (if there is one in Numba) could be the way to implement list comprehensions with arrays inside. The documentation for @overload and @overload_method is clear, but I’m unsure which object should be overloaded for the list comprehension case.

If anyone has information or guidance on this topic, I would greatly appreciate it!

PS: To make the quest less generic, I am willing to focus on one specific type of NumPy array. For example, it could be a 1D boolean array inside a list comprehension iterating over integers.

Thanks for elaborating on why you want to use the comprehensions.

  • It’s indeed a nice feature if “regular” NumPy code can be jitted without modifications, since that allows seamlessly switching between both without needing to maintain redundant implementations. I do find, though, that Numba especially shines when abandoning the NumPy-vectorized notation (e.g. using explicit loops), but it depends on the specific case of course.
  • I disagree somewhat with calling comprehensions clear. I use them all the time, it’s a very powerful notation, but not the most readable to me personally. This is very subjective of course.
  • It would indeed be great to have Numba support this.

Specifically regarding your requirement of wanting to keep the code generic Numpy, an alternative way of achieving that with the current release of Numba is to utilize the broadcasting mechanism.

I would normally write this myself as a[None,:] == s[:,None] or equivalently a[np.newaxis,:] == s[:,np.newaxis], neither of which Numba supports at the moment. But the same result is obtained when using the reshape method, which does work:

def multimask(a, s):
    return a.reshape(1,-1) == s.reshape(-1,1)


It shows good performance for me: slightly slower than the guvectorized alternative mentioned above, but faster than the same un-jitted function (pure NumPy).
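
To make the broadcasting step explicit, here is a small plain-NumPy sketch (no Numba involved) that shows the shapes and checks the result against the list comprehension from the original question. The variable names row, col and reference are only for illustration.

```python
import numpy as np

a = np.array([10, 20, 10, 30, 10])
s = np.array([10, 30])

row = a.reshape(1, -1)   # shape (1, 5): one row holding all of `a`
col = s.reshape(-1, 1)   # shape (2, 1): one search value per row

# Comparing (1, 5) with (2, 1) broadcasts to (2, 5):
# mask[i, j] is True where a[j] == s[i].
mask = row == col

# Same result as the original list comprehension, in plain NumPy.
reference = np.array([a == q for q in s])
```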

Thanks @Rutger! For my original problem, your last solution is what I needed! a.reshape(1,-1) == s.reshape(-1,1) is super clean compared with my original list comprehension. (Looking forward to the day when Numba supports the equivalent with an even cleaner syntax: a[None,:] == s[:,None].)

On readability: of course it is subjective, and I would say it depends on the case. Having more than two for loops inside a list comprehension is too cryptic for my taste. However, a simple list comprehension with one or two for loops plus an if statement is usually easier to grasp than the explicit loop with an if condition… in my subjective opinion…

On Numba supporting this: I fully agree. I’m looking forward to seeing that implemented someday.

Thank you again Rutger for all your help!

PS: Unfortunately, no news from my side on using @overload to handle the simple use case of a 1D boolean array inside a list comprehension iterating over integers. (But I haven’t been able to make many attempts either.)

Hi @andressommerhoff

The solution of @Rutger is indeed very elegant. As for the slowdown compared to the guvectorize function, my tip is to reshape only one of the arguments. For example:

def multimask(a, s):
    return a == s.reshape(-1,1)

This will be considerably faster, and even faster than using guvectorize.
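
The single reshape is enough because broadcasting aligns shapes from the trailing dimension, so a 1D array of shape (n,) already behaves like (1, n). A minimal sketch of the jitted version follows; the name multimask_one_reshape and the import fallback are additions for illustration, so that the snippet also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speed-up then).
    def njit(func):
        return func

@njit
def multimask_one_reshape(a, s):
    # `a` has shape (n,), the reshaped `s` has shape (m, 1);
    # broadcasting yields an (m, n) boolean mask.
    return a == s.reshape(-1, 1)

a = np.array([10, 20, 10, 30, 10])
s = np.array([10, 30])
mask = multimask_one_reshape(a, s)
```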

Such a nice solution @sschaer! You nailed it! (And now I feel more like a newbie than I thought I was, hahaha)