```
import numba as nb
import numpy as np

# 50,000 x 10,000 float64 array (500 million elements, roughly 3.7 GiB)
input_arr = np.linspace(1, 50000, 500000000).reshape((50000, 10000))
```

```
@nb.stencil
def kernel1(a):
    # 4-point von Neumann neighbourhood average
    return 0.25 * (a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0])

@nb.njit
def calc1(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr

@nb.njit(parallel=True)
def calc2(in_arr):
    out_arr = kernel1(in_arr)
    return out_arr
```

```
@nb.njit
def calc3(in_arr):
    out_arr = np.zeros(in_arr.shape)
    for i in range(1, in_arr.shape[0] - 1):
        for j in range(1, in_arr.shape[1] - 1):
            out_arr[i, j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr

@nb.njit(parallel=True)
def calc4(in_arr):
    out_arr = np.zeros(in_arr.shape)
    # nb.prange is required here: parallel=True alone does not
    # parallelize an explicit plain-range loop.
    for i in nb.prange(1, in_arr.shape[0] - 1):
        for j in range(1, in_arr.shape[1] - 1):
            out_arr[i, j] = 0.25 * (in_arr[i, j+1] + in_arr[i+1, j] + in_arr[i, j-1] + in_arr[i-1, j])
    return out_arr
```
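As a quick sanity check that all four versions compute the same thing, the 4-point kernel can be cross-checked in pure NumPy (no Numba needed) against the explicit loop on a toy array. `stencil_numpy` and the 5x5 `small` array are names introduced here only for illustration:

```python
import numpy as np

# Vectorized slicing version of the same 4-point kernel; boundaries stay 0,
# matching the stencil's default out-of-bounds behaviour and the loop bounds.
def stencil_numpy(a):
    out = np.zeros_like(a)
    out[1:-1, 1:-1] = 0.25 * (a[1:-1, 2:] + a[2:, 1:-1]
                              + a[1:-1, :-2] + a[:-2, 1:-1])
    return out

small = np.arange(25, dtype=np.float64).reshape(5, 5)
vec = stencil_numpy(small)

# Reference: the same explicit double loop used in calc3/calc4
ref = np.zeros_like(small)
for i in range(1, 4):
    for j in range(1, 4):
        ref[i, j] = 0.25 * (small[i, j+1] + small[i+1, j]
                            + small[i, j-1] + small[i-1, j])

match = np.allclose(vec, ref)
print(match)  # the two formulations agree on the interior and boundary
```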

For a large matrix, I observe the following run times (average of 5 runs):

`calc1(input_arr)`

[njit + Stencil] - **2.09s**

`calc2(input_arr)`

[Parallel njit + Stencil] - **2.35s**

`calc3(input_arr)`

[njit + Loops] - **2.10s**

`calc4(input_arr)`

[Parallel njit + Loops] - **1.08s**
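For completeness, here is a minimal sketch of how such an average can be measured: one warm-up call first, so JIT compilation time is excluded, then several timed calls. `calc` below is a small pure-NumPy stand-in so the snippet runs without Numba; the real runs call `calc1`..`calc4` on `input_arr`:

```python
import timeit
import numpy as np

# Stand-in for the jitted functions above (pure NumPy, same 4-point kernel)
def calc(a):
    out = np.zeros_like(a)
    out[1:-1, 1:-1] = 0.25 * (a[1:-1, 2:] + a[2:, 1:-1]
                              + a[1:-1, :-2] + a[:-2, 1:-1])
    return out

arr = np.linspace(1, 500, 5000).reshape(50, 100)  # small stand-in array
calc(arr)  # warm-up call (triggers compilation for the @nb.njit versions)

avg = timeit.timeit(lambda: calc(arr), number=5) / 5  # average of 5 runs
print(f"avg of 5 runs: {avg:.6f}s")
```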

As per the timings above, `njit + Stencil` performs similarly to `njit + Loops`, since by default stencil kernels are not executed in parallel.

What I am trying to understand is why `Parallel njit + Stencil` is not executing the computation in parallel, i.e. why it performs like the serial versions rather than like `Parallel njit + Loops`.

I have 16 threads available on my machine.

Any help/explanation/code improvement will be highly appreciated. Thank you!

Regards,

Ankit