_numba_unpickle takes very long

Hello

I am seeing the _numba_unpickle function take a very long time. I pinned the problem down to this function:

@njit
def kim_statistics_nb(arr, tau):
  nom = ((1-tau)*len(arr))**(-2)
  denom = (tau*len(arr))**(-2)
  break_point = int(tau*len(arr))
  
  x0 = np.arange(1, break_point+1, 1, dtype=np.float64)
  a = np.vstack((x0, np.ones(len(x0), dtype=np.float64))).T
  m0, c0 = np.linalg.lstsq(a, arr[:break_point])[0]
  # residuals of the linear fit on the first segment
  res0 = arr[:break_point] - (m0*x0 + c0)

  x1 = np.arange(break_point+1, len(arr), 1, dtype=np.float64)
  a = np.vstack((x1, np.ones(len(x1), dtype=np.float64))).T
  m1, c1 = np.linalg.lstsq(a, arr[break_point+1:])[0]
  # residuals of the linear fit on the second segment
  res1 = arr[break_point+1:] - (m1*x1 + c1)

  cusum0_sq = np.cumsum(res0)**2
  cusum1_sq = np.cumsum(res1)**2
  return (nom*np.sum(cusum1_sq))/(denom*np.sum(cusum0_sq))

I call this function many times from another njit function, which is basically a rolling calculation over it. The arrays passed into the function have size 40, and the overall array I am rolling over has roughly 100k elements.

Any ideas?

Cheers

Can you post a minimal complete program that reproduces the issue?

Here’s an example that roughly reproduces it (the actual calculation is more involved and uses parallelization instead of a for loop).

import numpy as np
from numba import njit

@njit
def kim_statistics_nb(arr, tau):
  nom = ((1-tau)*len(arr))**(-2)
  denom = (tau*len(arr))**(-2)
  break_point = int(tau*len(arr))
  
  x0 = np.arange(1, break_point+1, 1, dtype=np.float64)
  a = np.vstack((x0, np.ones(len(x0), dtype=np.float64))).T
  m0, c0 = np.linalg.lstsq(a, arr[:break_point])[0]
  # residuals of the linear fit on the first segment
  res0 = arr[:break_point] - (m0*x0 + c0)

  x1 = np.arange(break_point+1, len(arr), 1, dtype=np.float64)
  a = np.vstack((x1, np.ones(len(x1), dtype=np.float64))).T
  m1, c1 = np.linalg.lstsq(a, arr[break_point+1:])[0]
  # residuals of the linear fit on the second segment
  res1 = arr[break_point+1:] - (m1*x1 + c1)

  cusum0_sq = np.cumsum(res0)**2
  cusum1_sq = np.cumsum(res1)**2
  return (nom*np.sum(cusum1_sq))/(denom*np.sum(cusum0_sq))

@njit
def rolling_calc(arr, length, func, *args):
  result = np.full(arr.shape, np.nan)
  for i in range(length, len(arr)):
    result[i] = func(arr[i-length+1:i+1], *args)
  return result

@njit
def np_apply_along_axis(func1d, axis, arr):
  assert arr.ndim == 2
  assert axis in [0, 1]
  if axis == 0:
    result = np.empty(arr.shape[1])
    for i in range(len(result)):
      result[i] = func1d(arr[:, i])
  else:
    result = np.empty(arr.shape[0])
    for i in range(len(result)):
      result[i] = func1d(arr[i, :])
  return result

@njit
def rolling_kim_nb(arr, length=28, n_tau=10):
  taus = np.linspace(0.2, 0.8, n_tau)
  kim = np.full((len(arr), n_tau), np.nan)
  for n, tau in enumerate(taus):
    kim_tau = rolling_calc(arr, length, kim_statistics_nb, tau)
    kim[:, n] = kim_tau
  kim_stats = np_apply_along_axis(np.max, axis=1, arr=kim)
  return kim_stats

arr = np.random.random(100_000)
for _ in range(10):
  r = rolling_kim_nb(arr)

I don’t know exactly how to show the time spent in _numba_unpickle, but I can see in Google Colab that this part takes a very long time.
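One way to see where the time goes, sketched with the standard-library cProfile: the helper below runs a callable under the profiler and returns the entries sorted by cumulative time. The names `profile_top` and `busy_work` are made up for illustration and are not part of numba. As far as I understand, only Python-level calls are reported (which is presumably where _numba_unpickle would be listed); time spent inside compiled code is not broken down.

```python
import cProfile
import io
import pstats


def profile_top(func, n=15):
    """Run func() under cProfile; return (result, text report of the top n entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers that dominate the runtime come first.
    stats.sort_stats("cumulative").print_stats(n)
    return result, buffer.getvalue()


def busy_work():
    # Hypothetical stand-in workload; swap in the real call to profile.
    return sum(i * i for i in range(100_000))


result, report = profile_top(busy_work)
print(report)
```

To try it on the reproducer, something like `profile_top(lambda: rolling_kim_nb(arr))` should work, ideally after one warm-up call so that compilation time is not mixed in.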

What does _numba_unpickle actually do? What objects are unpickled?

Thanks! I instrumented numba/core/serialize.py:_numba_unpickle but am unable to reproduce what you’re seeing. I’m using numba 0.56.4 on Windows. The instrumented code is below; perhaps I missed something.

def _numba_unpickle(address, bytedata, hashed):
    import time
    start = time.perf_counter()
    key = (address, hashed)
    try:
        obj = _unpickled_memo[key]
    except KeyError:
        _unpickled_memo[key] = obj = cloudpickle.loads(bytedata)
    print(time.perf_counter() - start)
    return obj

Ok, interesting. How did you get numba to use your instrumented unpickle function?

How long does your code take to finish the 10 calculations?

Can you tell where the code spends the most time?

I am pretty new to this whole profiling stuff.

Cheers

I edited the numba library function in serialize.py. Numba already calls that function, so I didn’t need to do anything more.
On my laptop the sample code takes about 70 seconds.
I didn’t profile or look into where it spends its time, only took a look at the unpickle function since that’s what you asked about. Here is a conversation about profiling numba that may or may not be of interest to you.
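Since an @njit function is compiled the first time it is called, a total over all ten calls mixes compilation time with run time. A small sketch for timing each call separately, standard library only; `time_each_call` and `workload` are hypothetical names used for illustration.

```python
import time


def time_each_call(func, repeats=10):
    """Call func() repeats times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def workload():
    # Hypothetical stand-in; replace with e.g. lambda: rolling_kim_nb(arr).
    return sum(range(50_000))


durations = time_each_call(workload, repeats=10)
rest = durations[1:]
print(f"first call: {durations[0]:.4f}s")
print(f"mean of remaining calls: {sum(rest) / len(rest):.4f}s")
```

If the first call is far slower than the rest, most of the time is probably going into compilation rather than into the rolling calculation itself.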