QST: How to "cache" the boxing of an object?

Currently, to implement df.apply in numba, I have some code that looks like this:

        @numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
        def numba_func(values, col_names, df_index):
            results = {}
            for j in range(values.shape[1]):
                ser = Series(values[:, j], index=df_index, name=str(col_names[j]))

                results[j] = jitted_udf(ser)

            return results

(where values is a 2-D array, basically what you’d get from df.to_numpy(), col_names is df.columns, and df_index is df.index)

Notice how the index is shared among all of these Series. If the user were to specify a no-op function (e.g. lambda x: x) as the jitted_udf, most of the time ends up being spent re-boxing the objects into pandas Series and Indexes (which is to be expected).

What’s unexpected, though, is that even though the Index is the same for each of the Series, every time Numba boxes a Series it creates a new Index object when it gets to boxing the Series’ index (which is currently done via c.box).

Our boxing code for Series/Index for reference:

@box(IndexType)
def box_index(typ, val, c):
    """
    Convert a native index structure to an Index object.

    If our native index is of a numpy string dtype, we'll cast it to
    object.
    """
    # First build a NumPy array object, then wrap it in an Index
    index = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)

    # class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.pyclass))
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Index))
    array_obj = c.box(typ.as_array, index.data)
    # this is basically Index._simple_new(array_obj, name_obj) in python
    index_obj = c.pyapi.call_method(class_obj, "_simple_new", (array_obj,))

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(array_obj)
    return index_obj

@box(SeriesType)
def box_series(typ, val, c):
    """
    Convert a native series structure to a Series object.
    """
    series = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Series))
    index_obj = c.box(typ.index, series.index)
    array_obj = c.box(typ.as_array, series.values)
    name_obj = c.box(typ.namety, series.name)
    true_obj = c.pyapi.unserialize(c.pyapi.serialize_object(True))
    # This is equivalent of pd.Series(data=array_obj, index=index_obj, dtype=None, name=name_obj, copy=None, fastpath=True)
    series_obj = c.pyapi.call_function_objargs(
        class_obj,
        (
            array_obj,
            index_obj,
            c.pyapi.borrow_none(),
            name_obj,
            c.pyapi.borrow_none(),
            true_obj,
        ),
    )

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(index_obj)
    c.pyapi.decref(array_obj)
    c.pyapi.decref(name_obj)
    c.pyapi.decref(true_obj)

    return series_obj

Is there a way to cache this boxed object, and re-use it every time c.box is called?

It’s a bit hard to tell precisely what is going on without seeing your full implementation: you show the boxing code but not the unboxing code. What you’ve shown may be enough, but I’m not quite sure, since your description seems to mix boxing with unboxing, for instance in the phrasing “when it gets to unboxing its index (which is currently done via c.box)”. It would also be helpful to see the data models for IndexType and, I assume, SeriesType (which you don’t show).

Nonetheless, what I’m grokking from your question is that redundant Index objects are being recreated on the Python side when it would be desirable to generate them once and then re-reference them. I’m not 100% sure whether it is still used (I’m not one of the devs), but there is some precedent in the numba source code for jitted objects holding a reference to their Python-side counterpart and then just emitting that reference during boxing. For instance, the to-be-deprecated reflected list type (i.e. “types.List”) does something like this, and there is also a slot for something similar in the data model for types.unicode_type.

See for instance:

And its data-model:
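
Separately from the linked code, here is a rough plain-Python model of the idea, in case it helps. It is not numba code and does not use the numba API: Index here is a stand-in class that counts constructions, and NativeIndex, its parent slot, unbox_index and box_index are hypothetical names I made up to mirror the shape of an unbox/box pair. The assumption is that unboxing stores a reference to the original Python object next to the native data, and boxing returns that reference when it is present.

```python
from dataclasses import dataclass
from typing import Optional


class Index:
    """Stand-in for pandas.Index (not the real class); counts constructions."""

    created = 0

    def __init__(self, data):
        Index.created += 1
        self.data = data


@dataclass
class NativeIndex:
    """Hypothetical stand-in for the native struct behind IndexType."""

    data: list
    # Hypothetical slot: the Python object this struct was unboxed from, if any.
    parent: Optional[Index] = None


def unbox_index(obj: Index) -> NativeIndex:
    # Keep a reference to the originating Python object next to the data.
    return NativeIndex(data=obj.data, parent=obj)


def box_index(native: NativeIndex) -> Index:
    # If the struct came from a Python object, hand that object back
    # instead of constructing a new one.
    if native.parent is not None:
        return native.parent
    # No parent recorded (e.g. the index was created in jitted code):
    # fall back to building a fresh object.
    return Index(native.data)


df_index = Index(["a", "b", "c"])      # built once on the Python side
native = unbox_index(df_index)         # shared by every Series in the loop
boxed = [box_index(native) for _ in range(1000)]
```

In this model all 1000 boxed results are the same object as df_index, and no extra Index is constructed. In real boxing code the parent would presumably be a pyobject member of the data model, and the boxer would have to incref it before returning it; as far as I can tell that is what the reflected list boxer does, but please check against the source.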

Sorry, I got confused. I’ve updated my post to only mention boxing (or whatever going from numba → Python is called), since that’s what I want.

The code you linked for List looks interesting. I’ll give it a go and post back here if I’m successful.

Thanks a bunch!