TL;DR: I have an issue with a custom sklearn transformer class I wrote that uses numba.jit. When I pickle an object of this class, reload it from the pickle, and pickle it again, the file on disk differs at the byte level. This breaks version control (using dvc) of the machine learning models I create. My question is: how do I ensure the class pickles consistently across different sessions?
I have added an MWE that reproduces this issue to this GitHub repo.
For anybody who wants a quick glance, here is the class mentioned above:
    import numba
    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class NumbaColumnTransformer(BaseEstimator, TransformerMixin):
        """
        A faster version of `CustomColumnTransformer`. Uses Numba.
        Currently supports transformations of a single column.
        """

        def __init__(self, func=None, func_arg=None):
            """Construct a `CustomColumnTransformer` with `func` as the transformation function

            Parameters
            ----------
            func: a function, default `None`
                A scalar function.
            func_arg: str
                Name of the column that is to be transformed

            Returns
            -------
            None
            """
            super().__init__()
            if func is None:
                self.func = lambda x: x  # identity transformation by default
            else:
                self.func = func
            # Jit self.func rather than func, so the identity default is used when func is None
            numba_func = numba.jit(self.func, forceobj=True)
            self.func_arg = func_arg

            def apply_func(col_a):
                n = len(col_a)
                result = np.empty(n, dtype='float64')
                for i in range(n):
                    result[i] = numba_func(col_a.values[i])
                return result

            self.apply_func = numba.jit(apply_func, forceobj=True)

        def fit(self, X, y=None):
            return self

        def transform(self, X, *_):
            result = self.apply_func(X[self.func_arg])
            return pd.DataFrame(pd.Series(result, index=X.index, name='encoded'))
And here is the part of the notebook in the MWE that shows the non-matching pickles:
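In essence, the check does the following. This is a minimal sketch rather than the exact notebook cells: the helper names `pickle_digest` and `roundtrip_is_stable` are made up for illustration, the protocol is pinned to 4 as an assumption, and a plain dict stands in for the transformer so the snippet runs without numba or sklearn.

```python
import hashlib
import pickle


def pickle_digest(obj, protocol=4):
    """Return the SHA-256 hex digest of obj's pickle byte stream."""
    return hashlib.sha256(pickle.dumps(obj, protocol=protocol)).hexdigest()


def roundtrip_is_stable(obj, protocol=4):
    """Pickle obj, reload it, pickle the copy, and compare the two byte streams."""
    first = pickle.dumps(obj, protocol=protocol)
    reloaded = pickle.loads(first)
    second = pickle.dumps(reloaded, protocol=protocol)
    return first == second


# Stand-in object (hypothetical); in the real case this would be
# an instance of NumbaColumnTransformer.
params = {"func_arg": "price", "scale": 2.0}
stable = roundtrip_is_stable(params)
print(stable, pickle_digest(params))
```

For a plain object like this the two byte streams match; with an instance of the class above in place of `params`, the comparison fails for me.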
Thanks for any help!