# @stencil kernel performance issue

``````
import numba as nb
import numpy as np

input_arr = np.linspace(1, 50000, 500000000).reshape((50000, 10000))
``````
``````
@nb.stencil
def kernel1(a):
    return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

@nb.njit
def calc1(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr

@nb.njit(parallel=True)
def calc2(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr
``````
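For reference, `@nb.stencil` kernels use relative indexing (`a[0, 1]` is the element one column to the right of the current position), and by default out-of-bounds neighbourhoods produce the stencil's `cval` (0) in the output. A tiny usage sketch:

``````
# Tiny usage sketch of the stencil via calc1.
small = np.arange(16.0).reshape(4, 4)
# Interior cells hold the 4-neighbour average;
# border cells are 0 (the stencil's default cval).
print(calc1(small))
``````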
``````
@nb.njit
def calc3(in_arr):
    out_arr = np.zeros(in_arr.shape)
    for i in range(1, in_arr.shape[0]-1):
        for j in range(1, in_arr.shape[1]-1):
            out_arr[i, j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr

@nb.njit(parallel=True)
def calc4(in_arr):
    out_arr = np.zeros(in_arr.shape)
    for i in range(1, in_arr.shape[0]-1):
        for j in range(1, in_arr.shape[1]-1):
            out_arr[i, j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr
``````
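For comparison, an explicitly parallel loop variant would use `nb.prange` for the outer loop, since under `@nb.njit(parallel=True)` only `prange` loops are explicitly marked for parallel execution (a sketch; `calc5` is a hypothetical name and is not included in the timings below):

``````
# Sketch of an explicitly parallel loop version using nb.prange.
# calc5 is a hypothetical name, not part of the original benchmark.
@nb.njit(parallel=True)
def calc5(in_arr):
    out_arr = np.zeros(in_arr.shape)
    # prange marks the outer loop for parallel execution across threads
    for i in nb.prange(1, in_arr.shape[0]-1):
        for j in range(1, in_arr.shape[1]-1):
            out_arr[i, j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr
``````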

For a large matrix, I am experiencing the following run times (average of 5 runs):

- `calc1(input_arr)` [njit + Stencil] - 2.09s
- `calc2(input_arr)` [Parallel njit + Stencil] - 2.35s
- `calc3(input_arr)` [njit + Loops] - 2.10s
- `calc4(input_arr)` [Parallel njit + Loops] - 1.08s

As per the above timings, `njit + Stencil` performs similarly to `njit + Loops`, which is expected, since stencil kernels are not executed in parallel by default.
What I am trying to understand is why `Parallel njit + Stencil` is not executing the computation in parallel and performing similarly to `Parallel njit + Loops`.
I have 16 threads available on my machine.
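As a sanity check on the setup, the thread count Numba's parallel backend will actually use can be confirmed through its threading API (a minimal sketch using `numba.get_num_threads`/`set_num_threads`):

``````
import numba as nb

# Confirm how many threads Numba's parallel backend will use
print(nb.config.NUMBA_NUM_THREADS)  # maximum threads detected at startup
print(nb.get_num_threads())         # threads currently in use

# Optionally pin the thread count explicitly
nb.set_num_threads(16)
``````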

Any help/explanation/code improvement will be highly appreciated. Thank you!

Regards,
Ankit

Unfortunately, I don’t have an answer for you, just more questions and data points. This code:

``````
import numba as nb
import numpy as np
import timeit as ti

def ker0(a):
    return 42*a

def ker1(a):
    return list(map(lambda x: 42*x, a))

@nb.stencil
def ker(a):
    return 42*a[0, 0]

@nb.njit(fastmath=True)
def ker2(a):
    return ker(a)

@nb.njit(fastmath=True, parallel=True)
def ker3(a):
    return ker(a)

a = np.arange(10000).reshape((100, 100))

for i in range(4):
    fun = f'ker{i}(a)'
    t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
    print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')
``````

produces:

``````
ker0(a):   0.005ms   0.005ms
ker1(a):   0.072ms   0.076ms
ker2(a):   0.009ms   0.009ms
ker3(a):   0.020ms   0.020ms
``````

which indicates that the parallel version of the `stencil` kernel performs poorly compared to the sequential one. I ran this code on a 6-core CPU with Python 3.10.4.
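One way to see what the auto-parallelizer actually did with the stencil call is the parallel diagnostics report on the compiled dispatcher (a sketch; the function must have been compiled, i.e. called at least once, before the report is available):

``````
# Print the auto-parallelizer's report for the parallel stencil wrapper.
ker3(a)                              # ensure ker3 is compiled first
ker3.parallel_diagnostics(level=4)   # shows parallel regions, loop fusion, hoisting
``````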

@pauljurczak There is an issue I opened and a pull request made by Dr. Todd in this respect.