The future of kernel programming style with Numba

I did a few tests using stencil and the results are not great. This snippet:

import numba as nb
import numpy as np
import timeit as ti

def ker0(a):
  return 42*a

@nb.stencil
def ker(a):
  return 42*a[0, 0]

@nb.njit(fastmath=True)
def ker1(a):
  return ker(a)

@nb.njit(fastmath=True, parallel=True)
def ker2(a):
  return ker(a)

a = np.arange(10000).reshape((100, 100))

for i in range(3):
  fun = f'ker{i}(a)'
  # setup=fun runs the call once beforehand, so JIT compilation is excluded from the timings
  t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
  print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

produces:

ker0(a):   0.005ms   0.005ms
ker1(a):   0.009ms   0.009ms
ker2(a):   0.020ms   0.020ms

on a 6-core CPU with Python 3.10.4. Parallel mode actually slows the stencil down. Are there more performant ways to write kernels? Perhaps numba-dpex (IntelPython/numba-dpex, the Numba extension for Intel XPUs)? Anything else? Where is the development effort going these days?

@pauljurczak There is a stencil issue I opened and an ongoing pull request by Dr. Todd on this topic.
It would be great if you could add your findings to the issue thread.

It took a while, but I just did.