Performance of typed.List outside of jit functions

It seems that outside a jitted function, operations on numba.typed.List are very slow compared to the same operations on native Python lists; on my machine the difference is about 20x. In my project I had trouble finding a way to work around the problem, and I was wondering if I could get some advice on how to get the best performance.

The situation is that I had one function that accounted for the vast majority of the computational work, which I decided to optimize using numba. However, it is a graph algorithm and it requires ragged arrays that constantly have new values appended, meaning I cannot simply use numpy arrays, so I refactored my algorithm to use typed.List. This sped up the crucial function considerably. However, all the code surrounding it became much slower since it now operates on a typed.List, and that surrounding code became the bottleneck in my algorithm; it cannot be jitted because it uses other Python objects such as sets.

I’ve reproduced the dilemma in the snippet of code at the bottom of the post. The function update_large_graph performs the vast majority of the computational work. However, search_large_graph becomes slow since lst is a typed.List.

We can simulate the expected optimal performance by changing search_large_graph(lst, reached) to search_large_graph(lst2, reached), which leaves almost all the execution time inside the jitted function update_large_graph, since lst2 is a native Python list.

But that does not perform the same computation unless lst is properly duplicated to lst2. However, the commented-out code needed to ensure any updates to lst are replicated in lst2 is expensive, even with my attempt to optimize this process through the function nb_list_to_array. I can’t seem to find a way to make the algorithm perform close to the expected optimum. Is there a good solution, or is my algorithm impossible to improve unless typed.List becomes more efficient outside of jitted functions?

The replication of the dilemma:

import numpy as np
import numba
import time


def noop_decorator(func):
    return func


jit = numba.njit
NBList = numba.typed.List
# Swap in these two lines to benchmark the pure-Python baseline:
# jit = noop_decorator
# NBList = list


@jit
def update_large_graph(lst):
    updates = NBList()
    val = 0
    for i in range(100000):
        for j in range(5):
            val = (val * 295236 + 2976737) % 395687437
        lst.append(val)
        updates.append(val)
    return updates


@jit
def nb_list_to_array(lst):
    arr = np.empty(len(lst), dtype=np.int64)
    for i, v in enumerate(lst):
        arr[i] = v
    return arr


def update_large_graph_v2(lst, updates):
    updates = nb_list_to_array(updates).tolist()
    for i in range(100000):
        lst.append(updates[i])


@jit
def revert_large_graph(lst):
    # Truncate back to the original 1,000,000 elements.
    del lst[1000000:]


def revert_large_graph_v2(lst):
    del lst[1000000:]


# @jit
def search_large_graph(lst, reached):
    for i in range(5000):
        # Membership test only; exercises lst[i] (typed.List getitem) from the interpreter.
        lst[i] in reached


def workflow_iter(lst, lst2, reached):
    updates = update_large_graph(lst)
    # update_large_graph_v2(lst2, updates)
    search_large_graph(lst, reached)
    revert_large_graph(lst)
    # revert_large_graph_v2(lst2)


def workflow():
    lst = NBList(list(range(1000000)))
    lst2 = list(range(1000000))
    reached = set(list(range(0, 1000000, 2)))
    workflow_iter(lst, lst2, reached)
    start = time.time()
    for i in range(100):
        workflow_iter(lst, lst2, reached)
    print(time.time() - start)


if __name__ == '__main__':
    workflow()

As much as I adore numba, this is one of my main pet peeves with it right now. The reason for this performance issue is that the core List methods like List.append() and List.__getitem__() are implemented with numba’s just-in-time compilation pipeline, so every time one of those is called there is some overhead associated with resolving their type-specific implementation. A little overhead amounts to a big slowdown on functions like append() and __getitem__() that get called repeatedly.

In principle it doesn’t need to be implemented this way; for instance, once a List() instance is assigned a type, it could simply set those methods to type-specific entry points instead of JIT dispatchers. This would avoid the type checking on each call. Alas, that isn’t how things currently work.

To get around this issue you could perhaps just allocate a very big numpy array as a buffer, and “resize” slices of it by taking new, bigger slices when necessary. Unfortunately this approach would be a bit messy, but it should get rid of most of the overhead you’re experiencing.
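Here is a minimal sketch of that idea, assuming int64 values; the GrowableArray class and its method names are made up for illustration and are not part of numba or numpy:

import numpy as np

class GrowableArray:
    # Append-friendly wrapper around a preallocated numpy buffer (illustrative only).
    def __init__(self, capacity=1 << 20):
        self._buf = np.empty(capacity, dtype=np.int64)
        self.size = 0

    def append(self, value):
        if self.size == self._buf.shape[0]:
            # "Resize" by allocating a bigger buffer and copying the filled part.
            new_buf = np.empty(self._buf.shape[0] * 2, dtype=np.int64)
            new_buf[:self.size] = self._buf[:self.size]
            self._buf = new_buf
        self._buf[self.size] = value
        self.size += 1

    def view(self):
        # Zero-copy view of the filled portion; this plain ndarray can be passed
        # to @njit functions or searched from the interpreter at full speed.
        return self._buf[:self.size]

In the jitted update function you could pass the buffer plus the current size and return the new size, so appending reduces to writing into the preallocated array.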

Like @DannyWeitekamp alluded to, you can build all kinds of fast containers by allocating a numpy array and overlaying control structures on it. One discussion on the topic here

For the record, this is not the only factor contributing to why typed.List is slow, and I will try to shed some light on it here; I hope it helps, since I often get the impression that typed.List isn’t that well understood. There is a critical aspect that needs to be explained.

One big reason why typed.List is ‘slow’ when used outside of @njit-decorated functions is that the elements stored inside the list need to be unboxed/boxed during operations. The typed.List uses a C-style array for storage, but Python integers are stored as Python objects, with all their overhead. For example, during append on an integer-based typed.List, Numba needs to extract the value of the integer from the object representation and store that value in its underlying array; this is called “unboxing”. Conversely, on __getitem__ the value from the C-style array needs to be wrapped in a Python object; this is called “boxing”. Boxing and unboxing are both rather costly operations. Finding a shortcut for dispatching to the correct type implementation may help, but I don’t (yet?) see a way around boxing/unboxing.

It becomes quite evident when benchmarking simple typed.List creation inside and outside of @njit decorated functions:

In [1]: from numba import njit

In [2]: from numba.typed import List

In [3]: @njit
   ...: def fun(size):
   ...:     l = List()
   ...:     for i in range(size):
   ...:         l.append(i)
   ...:     return l
   ...:

In [4]: fun(10)
Out[4]: ListType[int64]([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...])

In [5]: %timeit fun(100000)
577 ”s ± 18.5 ”s per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [6]: %timeit fun.py_func(100000)
76.2 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit [i for i in range(100000)]
1.65 ms ± 4.15 ”s per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@JubilantJerry to re-iterate what @DannyWeitekamp and @nelson2005 have said: using NumPy arrays may be a better way to solve your issue. As for ragged data structures, you may be able to leverage the AwkwardArray package: Awkward Array documentation — Awkward Array 2.6.4 documentation
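For example, a ragged structure could be built with awkward’s ArrayBuilder; a rough sketch, assuming awkward 2.x (see the linked docs for the numba-aware usage):

import awkward as ak

builder = ak.ArrayBuilder()
for row in ([1, 2, 3], [4], [5, 6]):
    builder.begin_list()       # start a new variable-length row
    for value in row:
        builder.integer(value)
    builder.end_list()

ragged = builder.snapshot()    # an ak.Array of variable-length rows
print(ragged.tolist())         # [[1, 2, 3], [4], [5, 6]]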

I should also add that there is a partial typed.Set implementation in a PR. From what I can tell, that might be a solution too, since you could then @njit the searching code as well.

@esc Thanks for chiming in! I agree that the boxing/unboxing overhead is indeed something that can’t be helped; however, boxing/unboxing doesn’t need to be the limiting source of overhead here. And unless numba has especially slow boxing/unboxing machinery (which I don’t think is the case), I doubt it is the leading cause of overhead here. After all, numpy also allocates contiguous C-style array memory and must deal with boxing/unboxing in its implementation of setitem() and getitem(). Those functions are only about a factor of 2 slower on my machine than calling setitem() and getitem() directly on Python lists (see below). Your benchmark above, however, shows a ~50x slowdown when using List().

One thing worth considering is that the overhead of boxing/unboxing numerical types may in principle be less than for other types, both because there is no malloc’ing or refcount incrementing on the numpy/numba end for unboxing, and because, if I recall correctly, CPython has a special fast path for ‘int’ and ‘float’ types in its internal memory management system, which makes boxing faster than it might otherwise be.

import numpy as np
def py_setitem(size):
    l = [0]*size
    for i in range(size):
        l[i] = i
    return l

def np_setitem(size):
    l = np.zeros(size)
    for i in range(size):
        l[i] = i
    return l

def py_getitem(size):
    l = [0]*size
    s = 0
    for i in range(size):
        s = l[i]
    return s
        
def np_getitem(size):
    l = np.zeros(size)
    s = 0
    for i in range(size):
        s = l[i]
    return s 

%timeit py_setitem(100000)
2.31 ms ± 95.5 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np_setitem(100000)
5.05 ms ± 99.4 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit py_getitem(100000)
2.36 ms ± 129 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np_getitem(100000)
5.18 ms ± 57.9 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)

To illustrate what I’m suggesting: I can cut out more than 60% of the runtime of List.append() by manually extracting the entry_point of an equivalent @njit-compiled function (bypassing all the multiple-dispatch machinery).

from numba import njit, types, i8
from numba.typed import List
from numba.types import ListType

@njit(types.void(ListType(i8), i8))
def append_it(lst, other):
    lst.append(other)

append_it_ep = append_it.overloads[(ListType(i8), i8)].entry_point

def nb_append(size):
    l = List()
    for i in range(size):
        l.append(i)
    return l

def nb_append_custom(size):
    l = List.empty_list(i8)
    for i in range(size):
        append_it(l, i)
    return l

def nb_append_custom_ep(size):
    l = List.empty_list(i8)
    for i in range(size):
        append_it_ep(l, i)
    return l

def py_append(size):
    l = []
    for i in range(size):
        l.append(i)
    return l

%timeit py_append(100000)
4.1 ms ± 119 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nb_append(100000)
111 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nb_append_custom(100000)
103 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nb_append_custom_ep(100000)
40.9 ms ± 154 ”s per loop (mean ± std. dev. of 7 runs, 10 loops each)

Still off from Python’s append() by about a factor of 10, so there is clearly still some unboxing overhead, although what remains (including boxing/unboxing) doesn’t seem to be the majority of the overhead, maybe ~30-40% of it.

@DannyWeitekamp that looks great! This can probably be done on demand from the first call: the type of the list should be available as soon as something has been appended. Is this available as a PR already?
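A rough sketch of how that on-demand resolution might look from the Python side, building on the overloads/entry_point trick above (the cache and wrapper here are illustrative, not an existing numba API):

from numba import njit, typeof, types

_append_entry_points = {}

def fast_append(lst, value):
    # Resolve the type-specific entry point once, the first time this
    # (list type, value type) pair is seen, then reuse it for later appends.
    key = (typeof(lst), typeof(value))
    ep = _append_entry_points.get(key)
    if ep is None:
        @njit(types.void(*key))
        def _append(l, v):
            l.append(v)
        ep = _append.overloads[key].entry_point
        _append_entry_points[key] = ep
    ep(lst, value)

Note that typeof(lst) itself has some per-call cost, so a real implementation would presumably cache the entry points on the List instance instead.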

FWIW, I tried to generalize the benchmark a bit (now working on an IPython notebook to illustrate) and wanted to share. I thought it would be a good idea to remove the creation of the container from the benchmark function, so I ended up with:

import numpy as np
from numba import njit
from numba.typed import List as TypedList
@njit
def setitem(l):
    for i in range(len(l)):
        l[i] = 1
    return l
@njit
def getitem(l):
    c = 0
    for i in range(len(l)):
        c += l[i]
    return c
n = 100000
py_list = [0] * n
np_array = np.zeros(n)
nbty_list = TypedList(py_list)

Setitem Python

%timeit setitem.py_func(py_list)
1.67 ms ± 3.75 ”s per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit setitem.py_func(np_array)
5.31 ms ± 27.9 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit setitem.py_func(nbty_list)
76.6 ms ± 541 ”s per loop (mean ± std. dev. of 7 runs, 10 loops each)

Setitem @njit compiled

%timeit setitem(py_list)
50.9 ms ± 200 ”s per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit setitem(np_array)
18.1 ”s ± 2.62 ”s per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit setitem(nbty_list)
1.11 ms ± 3.15 ”s per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Getitem Python

%timeit getitem.py_func(py_list)
3.04 ms ± 60.6 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit getitem.py_func(np_array)
7.3 ms ± 76.6 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit getitem.py_func(nbty_list)
74.8 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Getitem @njit compiled

%timeit getitem(py_list)
51.5 ms ± 70.4 ”s per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit getitem(np_array)
93.6 ”s ± 373 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit getitem(nbty_list)
1.09 ms ± 1.11 ”s per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I can share the notebook, perhaps?

I tried to use the entry-point trick for setitem and getitem in my benchmark using:

from numba import njit, types
from numba.types import ListType, int64

@njit(types.int64(ListType(int64), int64))
def getitem_i64(lst, index):
    return lst[index]

getitem_i64_ep = getitem_i64.overloads[(ListType(int64), int64)].entry_point

and

@njit(types.void(ListType(int64), int64, int64))
def setitem_i64(lst, index, i):
    lst[index] = i

setitem_i64_ep = setitem_i64.overloads[(ListType(int64), int64, int64)].entry_point

which cuts the time in half. Still quite slow compared to NumPy, but it’s certainly worth considering adding something like this to Numba.
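For reference, the interpreter-side loops driving those entry points presumably look something like this (the wrapper names are just for illustration, mirroring the append benchmark above):

def nb_setitem_ep(l):
    # Call the cached entry point directly, skipping the dispatcher.
    for i in range(len(l)):
        setitem_i64_ep(l, i, 1)
    return l

def nb_getitem_ep(l):
    c = 0
    for i in range(len(l)):
        c += getitem_i64_ep(l, i)
    return c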

@esc I actually don’t have a clean backend implementation along these lines just yet (although I wouldn’t mind writing one). I’ve been holding off on writing a PR because I assumed it would need to be rewritten when the new AOT/PIXIE stuff came out, and I was worried that using .overloads[...].entry_point was a possibly unstable approach. I’ve read a large fraction of the numba source code and haven’t run into entry_point being used to shortcut the type checking, so I wasn’t sure if it was a trick that could be relied on. Right now multiple-dispatch JIT compilation is the only fully supported way to use numba; the new proposed AOT stuff seemed like it was going to have some standard way of expressing strict single-signature functions without multiple-dispatch overhead (the old pycc AOT worked like this).

Long story short, I’m happy to write it, I just don’t know if I should hold off because there will be some more standard way to support it in the near future.

To maybe generalize the scope of this issue: there are lots of cases similar to this where there is some proxy object on the Python side (which usually knows its type) with an interface for modifying some data allocated by the NRT. There isn’t any good reason to use multiple dispatch to implement interface methods for those proxies, since their types are immutable. Another good example is getattr() and setattr() for jitclasses and structrefs. One could imagine wanting to do a lot of setting and getting on the Python side to prepare those objects, but that would be quite slow with the current setup.
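As an illustration of the jitclass case (the class below is made up for this example), every attribute get and set from the interpreter pays a per-call overhead, much like typed.List methods do:

from numba import int64, float64
from numba.experimental import jitclass

@jitclass([('count', int64), ('total', float64)])
class Accumulator:
    def __init__(self):
        self.count = 0
        self.total = 0.0

acc = Accumulator()
for x in range(10000):
    # Each Python-side get/set on a jitclass attribute pays boxing/dispatch overhead.
    acc.count += 1
    acc.total += x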

Also there are perhaps a few more optimizations which might account for some of the discrepancy between my entry_point trick (10x slower) and numpy (2x slower), which I was hoping the new AOT stuff would make easier. At the moment List()'s methods are implemented by calling into a pre-compiled C library that implements most of its functionality so it runs something like this:

Python -> multiple dispatch -> entry point -> opaque C function call

The inefficiency is that there are multiple opaque function calls that aren’t currently inlined, but could be if the compilation pipeline could inline the C library by compiling it with clang and inlining the LLVM IR instead of calling out to it as an external library.

In addition to this I expect the unboxing phase does some refcount incrementing that could be skipped.

I’ll find out by adding this to the agenda for the Numba office hours next week. I’ll update here if I find anything.


Agreed. This point also hurts List and Dict even when users only use them in jitted functions. Inlining these C libraries (at least those implemented inside Numba itself) would be a big help for these kinds of situations.

@DannyWeitekamp I had a chat about this with some of the other developers, and yes, it’s probably OK to bypass the dispatcher when not using the typed list from a JIT-compiled function. It just needs to be done in a safe way, with an explicit type check at the Python level. PIXIE advancements should not impact this path.

Also, I did some benchmarks and it turns out that unboxing the typed.List is quite costly due to this call:

I did look into removing the check altogether and this does simplify things a bit, but looking at the git blame output there do seem to be instances where val can be something other than the typed list. Perhaps it would be possible to cache that type so that a simple call to object_getattr_string could suffice in this case.

The more I think about this and look into it, the more it becomes apparent that the typed.List has room for improvement on many levels.


Hey @esc

there do seem to be instances where val can be something other than the typed list.

This comes as a surprise to me. It looks like these checks were added to handle an issue related to segfaulting when returning typed List/Dict from objmode. However, that is essentially the same boxing situation as when using a dispatcher because you cannot guarantee that the user will not accidentally pass something of the wrong type like an untyped list, a list of the wrong type, or something else entirely.

As far as I can foresee, implementing methods on a typed proxy object is a much safer situation, and those checks should be unnecessary. The proxy object (i.e. the Python-side List/Dict/jitclass/etc.) should know its type, either because it was strongly defined (like a jitclass) or fixed after an element was added (like in List and Dict). All subsequent calls to getitem, setitem, append, etc. should work safely with just a simple Python-side isinstance(val, type) check. Beyond this I see no reason that the container type should need to be rechecked, since in principle it should be immutable after the type is settled.

For this to work, the unboxing machinery would need to work differently for the safe object methods than it does for a regular unsafe user-defined dispatcher. Perhaps one easy way around this is to have the signature of each of these methods be more like void(MeminfoPtr, val_type), and then just cast the meminfo back to the proper list within the implementation. I’ve succeeded at implementing this sort of pattern in the past without issue.

@DannyWeitekamp what you write is very interesting. I took the two suggestions from this thread and started working on a notebook. I’m currently a bit stuck on deciding if removing the type-check at the builder level is safe or not.

I uploaded the WIP notebook to a gist here:

Basically, there are two optimizations we are currently looking at:

a) when using typed.List from the interpreter, bypass the dispatcher

b) when unboxing, eliminate the type check to see if we are indeed unboxing a typed.List

Let me know what you think and if this is in-line with what you wrote above. :pray:

@esc Yes! This pretty much sums it up. It implements both of the ideas I was considering, apart from finding a way to work around needing two implementations of @unbox at once for this to work.

Your benchmarks show the new setitem being around 5x faster, but still about 3x slower than numpy. I had some new thoughts on why this might still be the case. Numpy is written as a C Python extension, so a numpy ndarray’s outermost wrapper is actually an extension of PyObject which has some extra stuff tacked on, like in this example:

This lets numpy avoid the step in numba where c.pyapi.object_getattr_string(val, '_opaque') is used to get the meminfo pointer from the proxy object. I’m pretty sure _opaque is a slot, so this is faster than a __dict__ lookup, but I bet this is still a big part of the remaining overhead. If you look at the guts of CPython’s PyObject_GetAttr, there is quite a lot going on:

In numpy I think the data pointer is immediately available as part of the wrapper’s struct, so setitem/getitem are closer to just applying load/store to an offset of that pointer.

It makes me wonder if the NRT should implement something like an NRTProxyPyObject class as a C extension of PyObject that just tacks on a meminfo pointer after the normal PyObject_HEAD stuff. Most of the current implementation of List, Dict, Jitclass, Structref, etc. could stay mostly the same, but any call to c.pyapi.object_getattr_string(val, '_opaque') would be replaced with an inlined call that dereferences the attribute holding the meminfo pointer in that object. And each of their Python-side proxy implementations would need to subclass from the new NRTProxyPyObject definition.

This would make the Python-side versions of any NRT-friendly object have a similar data-layout.

@esc Just as a sanity check to ensure that c.pyapi.object_getattr_string(val, '_opaque') is the culprit I tried this:


# Imports from Numba internals
from numba.core import cgutils
from numba.typed import listobject
from numba.core.extending import intrinsic

meminfo_void_t = types.MemInfoPointer(types.voidptr)

@intrinsic
def _list_from_meminfo(typingctx, struct_type, meminfo):
    inst_type = struct_type.instance_type

    def codegen(context, builder, sig, args):
        _, meminfo = args

        data_pointer = context.nrt.meminfo_data(builder, meminfo)
        data_pointer = builder.bitcast(
            data_pointer,
            listobject.ll_list_type.as_pointer(),
        )

        st = cgutils.create_struct_proxy(inst_type)(context, builder)
        st.data = builder.load(data_pointer)
        st.meminfo = meminfo

        #NOTE: I guess we incref because the new List on the numba side is borrowing
        #   but why is decref needed in the normal unbox then? 
        context.nrt.incref(builder, meminfo_void_t, meminfo)

        return st._getvalue()

    sig = inst_type(struct_type, types.MemInfoPointer(types.voidptr))
    return sig, codegen

i8_list_type = ListType(int64)


@njit(types.void(meminfo_void_t, int64, int64))
def _mi_setitem(meminfo, i, val):
    lst = _list_from_meminfo(i8_list_type, meminfo)
    lst[i] = val
    
# Get Entry Point
_mi_setitem_ep = _mi_setitem.overloads[(meminfo_void_t, int64, int64)].entry_point
    
def mi_setitem(l, mi):
    for i in range(len(l)):
        _mi_setitem_ep(mi, i, 1)
    return l

#Refcount should still be 1 after running
mi_setitem(nbty_list, nbty_list._opaque)
print(nbty_list._opaque.refcount)

Numpy and the most optimized numba setitem for reference

%timeit setitem.py_func(np_array)  # NumPy array setting from the interpreter
5.18 ms ± 77 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nb_setitem(nbty_list)  # Numba setting from interpreter w/ ep and no container type check
17.9 ms ± 380 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now we can try a version that bypasses getattr(obj, '_opaque') by passing the meminfo directly (this should be about as fast as if the meminfo were tacked onto the end of the List’s PyObject):

%timeit mi_setitem(nbty_list, nbty_list._opaque)
8.15 ms ± 124 ”s per loop (mean ± std. dev. of 7 runs, 100 loops each)

This knocks out most of the remaining overhead: just 1.6x worse than numpy, which is probably just because the C implementation of List.__setitem__() isn’t inlined in the jitted code.

Following up here again: this thread resulted in three potential optimizations for typed.List that can probably go into three pull requests:

a) Bypass the dispatcher and do a simple isinstance check when using typed.List from interpreter.

b) Eliminate run-time check in builder code.

c) Optimize access to the meminfo pointer in _opaque by using an intrinsic.

@DannyWeitekamp does this make sense?