Currently, to implement df.apply in Numba, I have some code that looks like this:
@numba.jit(nogil=nogil, nopython=nopython, parallel=parallel)
def numba_func(values, col_names, df_index):
    results = {}
    for j in range(values.shape[1]):
        ser = Series(values[:, j], index=df_index, name=str(col_names[j]))
        results[j] = jitted_udf(ser)
    return results
(where values is a 2-D array, essentially what you’d get from df.to_numpy(), col_names is df.columns, and df_index is df.index)
Notice how the index is shared among all of these Series. If the user were to specify a no-op function (e.g. lambda x: x) as the jitted_udf, most of the time ends up being spent re-boxing the objects into pandas Series and Indexes (which is to be expected).
What’s unexpected, though, is that even though the Index is the same for every Series, Numba creates a new Index object each time it boxes a Series, at the point where it boxes the Series' index (which is currently done via c.box).
Our boxing code for Series/Index for reference:
@box(IndexType)
def box_index(typ, val, c):
    """
    Convert a native index structure to an Index object.

    If our native index is of a numpy string dtype, we'll cast it to
    object.
    """
    # First build a NumPy array object, then wrap it in an Index
    index = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)
    # class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(typ.pyclass))
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Index))
    array_obj = c.box(typ.as_array, index.data)
    # This is basically Index._simple_new(array_obj) in Python
    index_obj = c.pyapi.call_method(class_obj, "_simple_new", (array_obj,))

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(array_obj)
    return index_obj

@box(SeriesType)
def box_series(typ, val, c):
    """
    Convert a native series structure to a Series object.
    """
    series = cgutils.create_struct_proxy(typ)(c.context, c.builder, value=val)
    class_obj = c.pyapi.unserialize(c.pyapi.serialize_object(Series))
    # Boxing the index here builds a brand-new Index object on every call
    index_obj = c.box(typ.index, series.index)
    array_obj = c.box(typ.as_array, series.values)
    name_obj = c.box(typ.namety, series.name)
    true_obj = c.pyapi.unserialize(c.pyapi.serialize_object(True))
    # This is equivalent to
    # pd.Series(data=array_obj, index=index_obj, dtype=None,
    #           name=name_obj, copy=None, fastpath=True)
    series_obj = c.pyapi.call_function_objargs(
        class_obj,
        (
            array_obj,
            index_obj,
            c.pyapi.borrow_none(),
            name_obj,
            c.pyapi.borrow_none(),
            true_obj,
        ),
    )

    # Decrefs
    c.pyapi.decref(class_obj)
    c.pyapi.decref(index_obj)
    c.pyapi.decref(array_obj)
    c.pyapi.decref(name_obj)
    c.pyapi.decref(true_obj)
    return series_obj

Is there a way to cache this boxed object, and re-use it every time c.box is called?
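
To make the question concrete, here is a pure-Python model of the behaviour I am after. It is only a sketch and not Numba lowering code. It assumes the native struct could carry a reference to the Python object it was unboxed from (similar in spirit to the parent field on Numba's native array struct, as far as I understand it), so that boxing can hand back that object instead of building a new one. FakeIndex, NativeIndex, unbox_index and box_index_cached are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class FakeIndex:
    """Stand-in for pandas.Index; counts how many instances get built."""

    created = 0

    def __init__(self, data):
        FakeIndex.created += 1
        self.data = data


@dataclass
class NativeIndex:
    """Hypothetical model of the native index struct."""

    data: list
    # Python object this struct was unboxed from, if any (assumption)
    parent: Optional[FakeIndex] = None


def unbox_index(obj):
    # Remember the originating Python object alongside the raw data.
    return NativeIndex(data=obj.data, parent=obj)


def box_index_cached(native):
    # Reuse the original object when one was recorded at unboxing time;
    # only construct a new object when there is nothing to reuse.
    if native.parent is not None:
        return native.parent
    return FakeIndex(native.data)


original = FakeIndex(["a", "b", "c"])
native = unbox_index(original)

# Boxing the same native index many times returns the same Python object.
boxed = [box_index_cached(native) for _ in range(1000)]

# An index created inside jitted code has no parent, so it is built fresh.
fresh = box_index_cached(NativeIndex(data=[1, 2]))
```

In real lowering code the equivalent would presumably mean storing the object pointer in the struct during unboxing, then branching on whether it is null in the boxer and calling c.pyapi.incref before returning it. I have not verified how that interacts with the struct model used here.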