# Why this assignment of subarray is so slow?

Functions `f0` and `f1` produce the same result, but differ in method of assignment of subarray:

``````import numpy as np
import numba as nb
import timeit as ti

@nb.njit(fastmath=True)
def f0(a, i, j):
a[i, j] = [4, 2]

@nb.njit(fastmath=True)
def f1(a, i, j):
a[i, j, 0] = 4
a[i, j, 1] = 2

@nb.njit(fastmath=True)
def g0(a):
for i in range(a.shape):
for j in range(a.shape):
f0(a, i, j)

@nb.njit(fastmath=True)
def g1(a):
for i in range(a.shape):
for j in range(a.shape):
f1(a, i, j)

for i in range(2):
a = np.zeros((100, 100, 2))
fun = f'g{i}(a)'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')
``````

The performance difference is huge:

``````g0(a):   0.718ms   0.719ms
g1(a):   0.005ms   0.005ms
``````

Any ideas why? Is there a more performant replacement of `a[i, j] = [4, 2]` other than the tedious method of `f1`?

I tried `a[i, j, :] = [4, 2]` and it is almost as slow.

I’d hazard a guess that there’s some memory allocation going on with f0, and f1 is a direct memory access through a pointer dereference. But I’m just guessing If you’re an assembly or LLVM expert (I’m not) you could dig a little deeper with inspect_asm or inspect_llvm after the timing loop, like

``````print(f1.inspect_asm(f1.signatures))
``````
1 Like

I don’t know enough to make sense of it. Surprisingly, `a[i, j] = (4, 2)` is as fast or only 2x slower (depending on the context) than the `a[i, j] = [4, 2]`. That should work for me.

my best guess here is that the intermediate initialisation of a list in `f0` is a performance trap.
Lists are inherently more complex than tuples, since mutation is expected by design.

In a tuple, my best guess is that the values or pointers to them are stored as direct neighbours in memory. Here you are even using constant literal entries in the tuple, which may allow for even more compilation time optimisations. For a list the same kind of promises cannot be made and their implementation is likely generally slower than an immutable data structure. (For `f1` assigning scalar literal values directly, the same (maybe even more optimal) is true)

Cheers