Reproducibility of pickles

Hi all,
TL;DR: I have an issue with a custom sklearn transformer class I wrote that uses numba.jit. When I pickle an object of this class, reload it from the pickle, and pickle it again, the file on disk differs at the byte level. This causes problems with version control (using dvc) of the machine learning models I create. My question is: how do I ensure the class pickles consistently across sessions?

I have added an MWE that reproduces this issue to this GitHub repo.

For anybody who wants a quick glance, here is the class mentioned above:

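import numba
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class NumbaColumnTransformer(BaseEstimator, TransformerMixin):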
    """
    A faster version of `CustomColumnTransformer`. Uses Numba. Currently supports transformation of a single column.
    
    """
    def __init__(self, func=None, func_arg=None):
        """Construct a `NumbaColumnTransformer` with `func` as the transformation function
            
            Parameters
            ----------
            func: a function, default `None`
                A scalar function.
               
            func_arg: str
                Name of the column that is to be transformed

            Returns
            -------
            None

        """
        super().__init__()
        if func is None:
            self.func = lambda x: x  # identity transformation by default
        else:
            self.func = func
        # jit the resolved function (self.func) so the identity default also works when func is None
        numba_func = numba.jit(self.func, forceobj=True)
        self.func_arg = func_arg
        def apply_func(col_a):
            n = len(col_a)
            result = np.empty(n, dtype='float64')
            for i in range(n):
                result[i] = numba_func(col_a.values[i])
            return result
        self.apply_func = numba.jit(apply_func,forceobj=True)

    def fit(self, X, y=None):
        return self
    
    def transform(self, X,*_):
        result = self.apply_func(X[self.func_arg])
        return pd.DataFrame(pd.Series(result, index=X.index, name='encoded'))

And here is the part of the notebook in the MWE that shows the non-matching pickles:
[Screenshot: numba_post]

Thanks for any help!
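
For readers who want to run the check described in the question, here is a minimal sketch of a pickle, reload, pickle round trip compared at the byte level. It is not the code from the MWE notebook; the helper names `pickle_digest` and `roundtrip_is_stable` and the `model` dict are hypothetical stand-ins, and a plain dict is used so the example runs without numba.

```python
import hashlib
import pickle


def pickle_digest(obj):
    """Return the SHA-256 hex digest of the pickled bytes of `obj`."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


def roundtrip_is_stable(obj):
    """Pickle, reload, pickle again, and compare the two byte strings."""
    first = pickle.dumps(obj, protocol=4)
    second = pickle.dumps(pickle.loads(first), protocol=4)
    return first == second


# Hypothetical stand-in for a fitted model; replace with the real object.
model = {"func_arg": "price", "scale": 2.0}

print("stable round trip:", roundtrip_is_stable(model))
print("digest:", pickle_digest(model))
```

Comparing digests rather than whole files is convenient when the two pickles are written in different sessions: only the short hex string needs to be carried over.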

Hi @jayant91089, the CPUDispatcher object (i.e. the numba function) that wraps the compiled function is a complex, stateful object. I’ve never tried what you did, but I’m not surprised that this happened. Please remember that numba compiles on demand (and saves the compiled function for future reuse), so depending on the arguments you passed in a particular session, you will get different content at the byte level.
I’m guessing that you will have to write a custom serializer to pickle your object (see pickle — Python object serialization — Python 3.9.2 documentation) and make sure that you clean all state from the CPUDispatcher. One way to do this could be to save only the Python function (the .py_func attribute of the dispatcher) while pickling, and to apply the decorator again when unpickling.

Hope this helps,
Luk
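
The suggestion above (keep only the plain Python function when pickling, re-apply the wrapper when unpickling) can be sketched without numba. Everything named here is an assumption made for illustration: `FakeDispatcher` is a hypothetical stand-in for a jit-compiled wrapper that exposes `py_func` and gathers state when called, and `Holder` and `square` are made-up names. With real numba, the wrapper would be whatever `numba.jit` returns.

```python
import pickle


def square(x):
    """Plain Python function; pickle stores it by reference (module + name)."""
    return x * x


class FakeDispatcher:
    """Hypothetical stand-in for a jit wrapper: keeps the original function
    as `py_func` and accumulates per-type state when it is called."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.seen_types = []  # grows on demand, loosely like a compilation cache

    def __call__(self, x):
        name = type(x).__name__
        if name not in self.seen_types:
            self.seen_types.append(name)
        return self.py_func(x)


class Holder:
    """Owns a wrapper but pickles only the plain function behind it."""

    def __init__(self, func):
        self.wrapped = FakeDispatcher(func)

    def __getstate__(self):
        state = self.__dict__.copy()
        # Swap the stateful wrapper for the plain function it wraps.
        state["wrapped"] = state["wrapped"].py_func
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Re-apply the wrapper, as if decorating the function again.
        self.wrapped = FakeDispatcher(self.wrapped)


holder = Holder(square)
before = pickle.dumps(holder)
holder.wrapped(3)    # an int call changes the wrapper's internal state
holder.wrapped(2.5)  # a float call changes it again
after = pickle.dumps(holder)
restored = pickle.loads(after)

print("bytes unchanged by calls:", before == after)
```

Because the wrapper's accumulated state never reaches the pickle, the bytes depend only on which function is wrapped, not on how the object was used in the session.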

@luk-f-a thank you very much for the quick response. I wrote __getstate__ and __setstate__ methods of the NumbaColumnTransformer as follows, and pickling looks consistent now! I hope this is what you meant by “write custom serializer”?

    def __getstate__(self):
        state = self.__dict__.copy()
        # remove the numba version of function
        del state['apply_func']
        return state

    def __setstate__(self,state):
        self.__dict__.update(state)
        # restore the numba function
        self.apply_func = numba.jit(self.func,forceobj=True)

Basically I am just removing apply_func, the jit-created CPUDispatcher stored as an attribute of the object, and recreating it in __setstate__.

EDIT 1: basically I am whole-heartedly throwing away the CPUDispatcher when pickling. Is there any reason to keep the CPUDispatcher and clean the stateful parts of its __dict__ instead?

yes, that’s what I meant. I don’t think you need to keep the dispatcher.

Is the pickling working now?

Yeah, it is identical between sessions now!

I was premature in declaring it solved… I had to further modify __setstate__ as follows to make sure the transformer loaded from the pickle actually worked. Basically, I had to mimic the flow in __init__ so that the CPUDispatcher is created in an identical manner.

    def __getstate__(self):
        state = self.__dict__.copy()
        # remove the numba version of function
        del state['apply_func']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # rebuild the numba functions the same way __init__ does
        numba_func = numba.jit(self.func, forceobj=True)
        def apply_func(col_a):
            n = len(col_a)
            result = np.empty(n, dtype='float64')
            for i in range(n):
                result[i] = numba_func(col_a.values[i])
            return result
        self.apply_func = numba.jit(apply_func,forceobj=True)
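
Since __setstate__ has to mirror __init__, one way to keep the two from drifting apart is to put the construction of apply_func in a single helper that both call. The sketch below is an assumption-laden illustration rather than the poster's code: `ColumnApplier`, `_build_apply_func` and `double` are hypothetical names, and the numba.jit calls are left out so that it runs with the standard library only.

```python
import pickle


def double(x):
    """Hypothetical scalar function, defined at module level so pickle can find it."""
    return 2.0 * x


class ColumnApplier:
    def __init__(self, func):
        self.func = func
        self._build_apply_func()

    def _build_apply_func(self):
        """Single place that derives `apply_func` from `self.func`."""
        # With numba, this is where the numba.jit(..., forceobj=True) calls
        # from the thread would wrap `func` and `apply_func`.
        func = self.func

        def apply_func(values):
            return [func(v) for v in values]

        self.apply_func = apply_func

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["apply_func"]  # derived attribute; rebuilt on load
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._build_apply_func()  # same code path as __init__


original = ColumnApplier(double)
data = pickle.dumps(original)
restored = pickle.loads(data)

print(restored.apply_func([1, 2, 3]))
```

With this layout, a change to how apply_func is built only has to be made once, and a freshly constructed object and a reloaded one go through the same code.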