Why this assignment of subarray is so slow?

Functions f0 and f1 produce the same result, but differ in method of assignment of subarray:

import numpy as np
import numba as nb
import timeit as ti

@nb.njit(fastmath=True)
def f0(a, i, j):
  a[i, j] = [4, 2]


@nb.njit(fastmath=True)
def f1(a, i, j):
  a[i, j, 0] = 4
  a[i, j, 1] = 2


@nb.njit(fastmath=True)
def g0(a):
  for i in range(a.shape[0]):
    for j in range(a.shape[1]):
      f0(a, i, j)


@nb.njit(fastmath=True)
def g1(a):
  for i in range(a.shape[0]):
    for j in range(a.shape[1]):
      f1(a, i, j)


for i in range(2):
  a = np.zeros((100, 100, 2))
  fun = f'g{i}(a)'
  t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
  print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

The performance difference is huge:

g0(a):   0.718ms   0.719ms
g1(a):   0.005ms   0.005ms

Any ideas why? Is there a more performant replacement of a[i, j] = [4, 2] other than the tedious method of f1?

I tried a[i, j, :] = [4, 2] and it is almost as slow.

I’d hazard a guess that there’s some memory allocation going on with f0, and f1 is a direct memory access through a pointer dereference. But I’m just guessing :slight_smile:

If you’re an assembly or LLVM expert (I’m not) you could dig a little deeper with inspect_asm or inspect_llvm after the timing loop, like

print(f1.inspect_asm(f1.signatures[0]))
1 Like

I don’t know enough to make sense of it. Surprisingly, a[i, j] = (4, 2) is as fast or only 2x slower (depending on the context) than the a[i, j] = [4, 2]. That should work for me.

Hi @pauljurczak

my best guess here is that the intermediate initialisation of a list in f0 is a performance trap.
Lists are inherently more complex than tuples, since mutation is expected by design.

In a tuple, my best guess is that the values or pointers to them are stored as direct neighbours in memory. Here you are even using constant literal entries in the tuple, which may allow for even more compilation time optimisations. For a list the same kind of promises cannot be made and their implementation is likely generally slower than an immutable data structure. (For f1 assigning scalar literal values directly, the same (maybe even more optimal) is true)

Cheers