QST: How to "cache" the boxing of an object?

lithomas1 · September 12, 2023, 3:17pm

Currently, to implement df.apply in numba, I have some code that looks like this:

        @numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
        def numba_func(values, col_names, df_index):
            results = {}
            for j in range(values.shape[1]):
                ser = Series(values[:, j], index=df_index, name=str(col_names[j]))

                results[j] = jitted_udf(ser)

            return results

(where values is a 2-D array, basically what you’d get if you did df.to_numpy, col_names is df.columns, and df_index is df.index)

Notice how the index is shared among all of these Series. If the user were to specify a no-op function (e.g. lambda x: x) as the jitted_udf, most of the time ends up being spent re-boxing the objects into pandas Series and Indexes (which is to be expected).

What’s unexpected, though, is that even though the Index is the same for each of the Series, Numba creates a new Index object every time it boxes a Series and gets to boxing its index (which is currently done via c.box).

Our boxing code for Series/Index for reference:

@box(IndexType)
def box_index(typ, val, c):
    """
    Convert a native index structure to an Index object.

    If our native index is of a numpy string dtype, we'll cast it to
    object.
    """
    # First build a NumPy array object, then wrap it in an Index
    index = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)

    # class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.pyclass))
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Index))
    array_obj = c.box(typ.as_array, index.data)
    # this is basically Index._simple_new(array_obj) in Python
    index_obj = c.pyapi.call_method(class_obj, "_simple_new", (array_obj,))

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(array_obj)
    return index_obj

@box(SeriesType)
def box_series(typ, val, c):
    """
    Convert a native series structure to a Series object.
    """
    series = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Series))
    index_obj = c.box(typ.index, series.index)
    array_obj = c.box(typ.as_array, series.values)
    name_obj = c.box(typ.namety, series.name)
    true_obj = c.pyapi.unserialize(c.pyapi.serialize_object(True))
    # This is the equivalent of
    # pd.Series(data=array_obj, index=index_obj, dtype=None,
    #           name=name_obj, copy=None, fastpath=True)
    series_obj = c.pyapi.call_function_objargs(
        class_obj,
        (
            array_obj,
            index_obj,
            c.pyapi.borrow_none(),
            name_obj,
            c.pyapi.borrow_none(),
            true_obj,
        ),
    )

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(index_obj)
    c.pyapi.decref(array_obj)
    c.pyapi.decref(name_obj)
    c.pyapi.decref(true_obj)

    return series_obj

Is there a way to cache this boxed object, and re-use it every time c.box is called?

DannyWeitekamp · September 12, 2023, 11:53pm

It’s a bit hard to tell precisely what is going on without seeing your full implementation: you show the boxing code but not the unboxing. What you’ve shown may be enough, but I’m not quite sure, since your description seems to mix up boxing and unboxing, for instance in the phrasing “when it gets to unboxing its index (which is currently done via c.box)”. It would also help to see the data models for IndexType and, I assume, SeriesType, which you don’t show either.

Nonetheless, what I’m grokking from your question is that redundant Index objects are being recreated (on the Python side) when it would be preferable to create them once and then re-reference them. I’m not 100% sure whether it is still used (I’m not one of the devs), but there is some precedent in the numba source code for jitted objects holding references to their Python-side counterparts and simply emitting a reference to that counterpart during boxing. For instance, the to-be-deprecated reflected list type (i.e. “types.List”) does something like this, and there is a slot for something similar in the data model for types.unicode_type.
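To make the idea concrete, here is a plain-Python model of the parent-reference pattern. It is only a conceptual sketch and does not use Numba's lowering API: PyIndex, NativeIndex, unbox_index and box_index are all hypothetical names standing in for the Python-side Index, the native struct, and the unboxing/boxing hooks.

```python
class PyIndex:
    """Hypothetical stand-in for a Python-side Index; counts constructions."""

    created = 0

    def __init__(self, data):
        PyIndex.created += 1
        self.data = data


class NativeIndex:
    """Hypothetical stand-in for the native struct: data plus a parent slot."""

    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent  # Python object this struct was unboxed from


def unbox_index(obj):
    # Remember the originating Python object while unboxing.
    return NativeIndex(obj.data, parent=obj)


def box_index(native):
    # If a Python-side counterpart is known, hand it back instead of
    # building a new object.
    if native.parent is not None:
        return native.parent
    obj = PyIndex(native.data)
    native.parent = obj  # cache for the next boxing call
    return obj


original = PyIndex([0, 1, 2])
native = unbox_index(original)

# Boxing the same native index once per column re-uses the original object.
boxed = [box_index(native) for _ in range(4)]
```

In this model every element of boxed is the original object, so only one PyIndex is ever constructed no matter how many columns are boxed.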

See for instance:

https://github.com/numba/numba/blob/596e8a55334cc46854e3192766e643767bd7c934/numba/core/boxing.py#L593C17-L593C17

@box(types.List)

And its data-model:

https://github.com/numba/numba/blob/596e8a55334cc46854e3192766e643767bd7c934/numba/core/datamodel/models.py#L782

            ('allocated', types.intp),
            # This member is only used only for reflected lists
            ('dirty', types.boolean),
            # Actually an inlined var-sized array
            ('data', fe_type.container.dtype),
        ]
        super(ListPayloadModel, self).__init__(dmm, fe_type, members)


@register_default(types.List)
class ListModel(StructModel):
    def __init__(self, dmm, fe_type):
        payload_type = types.ListPayload(fe_type)
        members = [
            # The meminfo data points to a ListPayload
            ('meminfo', types.MemInfoPointer(payload_type)),
            # This member is only used only for reflected lists
            ('parent', types.pyobject),
        ]
        super(ListModel, self).__init__(dmm, fe_type, members)

lithomas1 · September 13, 2023, 12:28am

Sorry, I got confused. I’ve updated my post to only mention boxing (or whatever going from numba → Python is called), since that’s what I want.

The code you linked for List looks interesting. I’ll give it a go and post back here if I’m successful.

Thanks a bunch!
