Best practices for using read-only Python lists

Thanks for making Numba, I am a big fan! I’ve got the Numba fan-poster, t-shirt, cap, underwear and everything, even my bed-sheets say Numba! (OK, that’s not really true, but it could be!)

My question relates to the best practices of passing read-only Python lists to Numba jit functions.

Versions

  • Numba: 0.54.1
  • Numpy: 1.20.3
  • Python: 3.8.12

Simple Example

Let me first show the problem with a simple example, where we just sum all the elements of a list:

import numba
import numpy as np
from numba import njit

# Number of elements in the list.
n = 100000

# Numpy array with random floats.
x_np = np.random.normal(size=n)

# Convert Numpy array to Python list.
x_list = x_np.tolist()

@njit
def sum_list(x):
    # Sum all elements of the list/array x.
    s = 0
    for i in range(len(x)):
        s += x[i]
    return s

%timeit sum_list(x=x_np)
# 126 µs ± 3.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit sum_list(x=x_list)
# This works but generates a warning:
# NumbaPendingDeprecationWarning: Encountered the use of a type that is
# scheduled for deprecation: type 'reflected list' found for argument
# 'x' of function 'sum_list'.
# 91 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum_list(x=numba.typed.List(x_list))
# 141 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The fastest method by far is passing a Numpy array. The slowest is to use the new and experimental numba.typed.List.
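
Given these timings, one practical workaround is to pay the list-to-array conversion cost once with np.asarray and reuse the resulting array across calls. A minimal sketch with small hypothetical data (the try/except fallback just lets the sketch run even without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch runs without Numba installed
    njit = lambda f: f

# Small hypothetical list standing in for x_list above.
x_list = [0.5, -1.25, 2.0, 3.25]

# Convert once; np.asarray produces a contiguous float64 array.
x_arr = np.asarray(x_list)

@njit
def sum_list(x):
    # Sum all elements of the array x.
    s = 0.0
    for i in range(len(x)):
        s += x[i]
    return s

# Reuse x_arr for every call, so the one-time conversion cost is
# amortized over many calls.
total = sum_list(x_arr)  # 4.5
```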

Sparse Matrix Example

In my actual use-case I have a function that operates on a sparse matrix, whose elements are defined as a list of tuples like this: [(i0, j0, v0), (i1, j1, v1), ...], where iX and jX are integer coordinates into the sparse matrix and vX are the float values of the matrix elements at those coordinates. This does not seem to be well supported in Numba, as it runs very slowly.
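
For concreteness, a tiny hypothetical instance of this format, and how the integer coordinates address the float values, might look like:

```python
import numpy as np

# Hypothetical (i, j, v) tuples for three nonzero elements.
x_list_tup = [(0, 3, 0.5), (2, 7, 1.25), (9, 1, -0.75)]

# Scatter the tuples into a dense 10x10 matrix, just to show how the
# coordinates are used.
dense = np.zeros((10, 10))
for i, j, v in x_list_tup:
    dense[i, j] = v
```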

Below is a somewhat contrived example which just sums all the elements of a sparse matrix and could therefore ignore the coordinates for the sparse matrix, but my actual problem needs to use the coordinates of the sparse matrix as well.

import scipy.sparse

# Generate a sparse matrix with random elements.
n = 100
x_sparse = scipy.sparse.random(n, n, density=0.1, format='coo')

# Convert the sparse matrix to a dense matrix.
x_dense = x_sparse.toarray()

# Numpy arrays for the column and row indices, and the matrix values.
x_col_np = x_sparse.col
x_row_np = x_sparse.row
x_val_np = x_sparse.data

# Convert Numpy arrays to Python lists.
x_col_list = x_col_np.tolist()
x_row_list = x_row_np.tolist()
x_val_list = x_val_np.tolist()

# Convert to a list of tuples, which is the format I prefer in my use-case.
x_list_tup = list(zip(x_col_list, x_row_list, x_val_list))

@njit
def sum_matrix_dense(x):
    # Sum all elements of a dense matrix.
    s = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            s += x[i,j]
    return s

%timeit sum_matrix_dense(x=x_dense)
# 13.1 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@njit
def sum_matrix_sparse1(x_list_tup):
    # Sum all elements of a sparse matrix.
    s = 0
    for i, j, v in x_list_tup:
        # This is a bit contrived because the indices i and j are ignored,
        # but in my real use-case I need those as well.
        s += v
    return s

%timeit sum_matrix_sparse1(x_list_tup=x_list_tup)
# This runs but generates a warning:
# NumbaPendingDeprecationWarning: Encountered the use of a type that is
# scheduled for deprecation: type 'reflected list' found for argument
# 'x_list_tup' of function 'sum_matrix_sparse1'.
# 24.7 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit sum_matrix_sparse1(x_list_tup=numba.typed.List(x_list_tup))
# This runs without warnings but is much slower than 'sum_matrix_dense()'.
# 1.85 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@njit
def sum_matrix_sparse2(x_col_list, x_row_list, x_val_list):
    # Sum all elements of a sparse matrix.
    s = 0
    for k in range(len(x_col_list)):
        col = x_col_list[k]
        row = x_row_list[k]
        val = x_val_list[k]
        s += val
    return s

%timeit sum_matrix_sparse2(x_col_list=x_col_list, x_row_list=x_row_list, x_val_list=x_val_list)
# This runs but generates a warning:
# NumbaPendingDeprecationWarning: Encountered the use of a type that is
# scheduled for deprecation: type 'reflected list' found for argument
# 'x_col_list' of function 'sum_matrix_sparse2'.
# 5.69 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit sum_matrix_sparse2(x_col_list=numba.typed.List(x_col_list), x_row_list=numba.typed.List(x_row_list), x_val_list=numba.typed.List(x_val_list))
# This runs without a warning but is much slower than 'sum_matrix_dense()'.
# 4.33 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum_matrix_sparse2(x_col_list=x_col_np, x_row_list=x_row_np, x_val_list=x_val_np)
# This is very fast when using separate Numpy arrays as the arguments.
# 1.88 µs ± 7.01 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Conclusion

It seems that in the current version of Numba (0.54.1), the fastest method by far is to pass Numpy arrays instead of Python lists to Numba jit functions.

My questions are:

  • Will numba.typed.List eventually run just as fast with Python lists as Numpy arrays, so I can use a Python list or even a Python list of tuples directly with my Numba jit functions? Or will it most likely always be much faster to pass Numpy arrays to Numba jit functions?

  • Is there a better / faster way of using read-only Python lists (or even a list of tuples) with Numba jit than I have done above?

Thanks very much and Merry Christmas!

Hi @Hvass-Labs

I think you will never see a conventional list perform as fast as an array. And that is not due to Numba, but simply due to how lists and arrays work at the hardware level.

Array memory is normally contiguous, which means the computer can jump directly to a known memory address and read the value. A list always requires jumping through possibly non-sequential memory and involves one or more levels of indirection, meaning the computer first needs to find the address where a value is stored before it can read the value. This is probably made even worse by the effect on memory caching in the CPU.

However, I think you don’t really need a list here; it is perfectly fine to pass the Numpy arrays for cols, rows and values into a function and iterate over them directly. Putting the exact same rows/cols/values into tuples in a list just makes the data structure more complex than required.

Here is an example using 3 arrays, but maybe one could also use a single “structured array”, if that feels more natural, and it could be even faster.

from numba import njit
import scipy.sparse

# Generate a sparse matrix with random elements.
n = 100
x_sparse = scipy.sparse.random(n, n, density=0.1, format='coo')

# Convert the sparse matrix to a dense matrix.
x_dense = x_sparse.toarray()

# Numpy arrays for the column and row indices, and the matrix values.
x_col_np = x_sparse.col
x_row_np = x_sparse.row
x_val_np = x_sparse.data

@njit
def sum_matrix_sparse_np(rows, cols, vals):
    s = 0
    for i, j, v in zip(rows, cols, vals):
        # Here you could use i and j in your real code as well of course
        s += v
    return s

sum_matrix_sparse_np(x_row_np, x_col_np, x_val_np)
%timeit sum_matrix_sparse_np(x_row_np, x_col_np, x_val_np)
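
The single “structured array” variant mentioned above could be sketched like this (the field names are my hypothetical choice; Numba supports field access on NumPy record elements, and the try/except fallback lets the sketch run without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch runs without Numba installed
    njit = lambda f: f

# Hypothetical structured dtype holding one (row, col, val) record per
# nonzero element, so all fields of a record sit together in memory.
coo_dtype = np.dtype([('row', np.int32), ('col', np.int32), ('val', np.float64)])

x_struct = np.empty(3, dtype=coo_dtype)
x_struct['row'] = [0, 2, 9]
x_struct['col'] = [3, 7, 1]
x_struct['val'] = [0.5, 1.25, -0.75]

@njit
def sum_matrix_sparse_struct(recs):
    s = 0.0
    for k in range(len(recs)):
        # Field access on individual records; recs[k]['row'] and
        # recs[k]['col'] would be used in real code as well.
        s += recs[k]['val']
    return s
```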

One more note: it is not recommended to call %timeit immediately, because that way it also measures the compilation time incurred the very first time the function is called. If you want to benchmark the performance of your function, call it once so it gets compiled, and then use %timeit to measure the execution time.
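
Outside of IPython’s %timeit, the same warm-up pattern can be sketched with time.perf_counter (a hypothetical toy function; the fallback makes the sketch run without Numba installed):

```python
import time

try:
    from numba import njit
except ImportError:  # fallback so the sketch runs without Numba installed
    njit = lambda f: f

@njit
def square_sum(n):
    # Toy function: sum of i*i for i in [0, n).
    s = 0
    for i in range(n):
        s += i * i
    return s

# The first call triggers compilation (when Numba is installed), so we
# exclude it from the measurement.
square_sum(10)

# Only now measure the execution time of the compiled function.
t0 = time.perf_counter()
result = square_sum(10)
elapsed = time.perf_counter() - t0
```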

Thanks for the detailed reply!

In my actual Numba Python code I also use a for i, j, v in zip(...) loop like you propose.

The reason I would like to supply a list of tuples to the function is because it is much more natural to define it in that way.

I have done a few more tests that are interesting. These build on the code in my original post.

New Test 1

In the first test I pass a Numpy matrix with shape (1000, 3) instead of 3 separate Numpy arrays of length 1000, which may improve CPU/RAM caching, especially for much larger arrays, because the data is closer together in memory. The test shows it is slightly faster:

# Convert Numpy arrays to Numpy matrix.
x_sparse_stack = np.vstack([x_col_np, x_row_np, x_val_np]).T

@njit
def sum_matrix_sparse3(x_sparse_stack):
    # Sum all elements of a sparse matrix.
    s = 0
    for col, row, val in x_sparse_stack:
        s += val
    return s

%timeit sum_matrix_sparse3(x_sparse_stack=x_sparse_stack)
# This is slightly faster than using separate Numpy arrays as the arguments.
# 1.55 µs ± 46.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
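
Note, however, that np.vstack promotes the integer index arrays to the float dtype of the values, so code that actually uses the coordinates needs to cast them back to int before indexing. A small sketch of that caveat (with hypothetical data):

```python
import numpy as np

# Hypothetical index and value arrays for three nonzero elements.
col = np.array([0, 2, 9], dtype=np.int32)
row = np.array([3, 7, 1], dtype=np.int32)
val = np.array([0.5, 1.25, -0.75])

stacked = np.vstack([col, row, val]).T

# vstack follows NumPy's type-promotion rules, so the whole stacked
# array is float64 and the indices are stored as floats.
total = 0.0
for c, r, v in stacked:
    i, j = int(c), int(r)  # cast back to ints before using as indices
    total += v
```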

New Test 2

Furthermore, I wanted to test how much time is spent converting a list to numba.typed.List:

%timeit numba.typed.List(x_list)
# 146 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

x_list_numba = numba.typed.List(x_list)
%timeit sum_list(x=x_list_numba)
# 276 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it is extremely slow to convert a Python list to numba.typed.List, taking about 600x as long as the actual computation.

Question

I have lots of functions where I only need to read the Python lists, I will not write, append or update the lists in any way. Some of them are Python lists-of-lists, so they cannot easily be converted into Numpy arrays.

Would it be possible to provide an argument to numba.typed.List telling it not to copy/convert the data into Numba format, but to use the raw Python data directly? Presumably this would be a lot faster, as it would avoid the copy/convert overhead.

If you’re not a part of the Numba dev-team, perhaps we could tag someone from the dev-team who might help answer this question.

Thanks!

@Hvass-Labs operating on “raw Python data directly” sounds very similar to what Cython does. Cython operates on Python’s C-API without modifying the data structures. Have you tried writing a Cython function to perform the operation you want?

If you did that, you could compare the speed to the Numba array version and see if it does what you want. There’s a tradeoff between not converting the data to an array and being fast. As @Hannes said, a list (read-only or not) will never be as fast as an array.
If the Cython raw-list version is fast enough for you, then you can call that function from Numba (see the “High-level extension API” section of the Numba documentation).
If this is not fast enough for you, then you know that converting from list to array is an unavoidable cost to achieve what you want.

Hope this helps,
Luk

@luk-f-a Thanks for the suggestion! I did consider using Cython, but I have never used it before and it looks far more complicated than the elegance and simplicity of Numba, so I would prefer to only use Numba.

I think there is a problem with numba.typed.List for converting a Python list to a Numba list, as it is extremely slow. I have therefore opened issue #7727 on Numba’s GitHub page about this. If numba.typed.List was much faster, then there wouldn’t be a problem.

I should also mention that I have tried using Numba with Python tuples, but that is only supported for tuples with fewer than 1000 elements.

Thanks again for all the suggestions!