I ran a few tests with Numba's stencil decorator, and the results are not great. This snippet:
import numba as nb
import numpy as np
import timeit as ti

# Baseline: plain NumPy
def ker0(a):
    return 42*a

@nb.stencil
def ker(a):
    return 42*a[0, 0]

# Stencil called from serial jitted code
@nb.njit(fastmath=True)
def ker1(a):
    return ker(a)

# Stencil called from parallel jitted code
@nb.njit(fastmath=True, parallel=True)
def ker2(a):
    return ker(a)

a = np.arange(10000).reshape((100, 100))
for i in range(3):
    fun = f'ker{i}(a)'
    # setup runs the call once, so JIT compilation is excluded from the timings
    t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
    # minimum and median time in ms
    print(f'{fun}: {np.amin(t):6.3f}ms {np.median(t):6.3f}ms')
produces:
ker0(a): 0.005ms 0.005ms
ker1(a): 0.009ms 0.009ms
ker2(a): 0.020ms 0.020ms
on a 6-core CPU with Python 3.10.4. Parallel mode makes the stencil slower, not faster.
Are there more performant options for writing kernels? Perhaps numba-dpex (IntelPython/numba-dpex: Numba extension for Intel(R) XPUs)? Anything else? Where is the development effort going these days?