Any faster data structure than TypedList to hold tuples?

The performance of reflected lists holding tuples is alright. In comparison, TypedLists are twice as slow, and that's a problem in stencil-based code where iterating through the stencils (as tuples) is in the hot loop.

As reflected lists are going away, I wonder what their replacement should be.

Hi gael,

In comparison, TypedLists are twice as slow, and that's a problem in stencil-based code where iterating through the stencils (as tuples) is in the hot loop.

Could you share some code using reflected lists and typed lists where this happens?

Regards

I can’t seem to produce a small example that exhibits the problem blatantly, so excuse the large amount of code (it should be self-contained and run fine anywhere, though). You should see a difference in execution time of around 25%. With real-life data I get:

CPU times: user 5.72 s, sys: 21 µs, total: 5.72 s
Wall time: 5.72 s
CPU times: user 8.46 s, sys: 16 µs, total: 8.46 s
Wall time: 8.47 s

Which is close to 50%.

Code:

import numba as nb
import numpy as np
from numba.extending import intrinsic
from llvmlite import ir  # LLVM IR types/constants used in codegen

@intrinsic
def addmod_tuples(tyctx, tpl1, tpl2, shape):
    """Returns tuple((t1+t2)%s for t1, t2, s in zip(tpl1, tpl2, shape))"""
    ret = tpl1
    count = tpl1.count
    dtype = ret.dtype
    typ = ir.IntType(dtype.bitwidth)
    zero = ir.Constant(typ, 0)
    
    def codegen(cgctx, builder, sig, args):
        val_tpl1, val_tpl2, val_shape = args

        tup = cgctx.get_constant_undef(ret)
        
        for i in range(count):
            # Extract scalars
            t1 = builder.extract_value(val_tpl1, i)
            t2 = builder.extract_value(val_tpl2, i)
            s = builder.extract_value(val_shape, i)
            
            # Modulo
            bb_main = builder.block
            val_ = builder.srem(builder.add(t1, t2), s)
            
            is_neg = builder.icmp_signed("<", val_, zero)
            with builder.if_then(is_neg, likely=False):
                pos_val = builder.add(s, val_)
                bb_neg = builder.block
            
            # Phi node:  val = (t1+t2)%s
            val = builder.phi(typ)
            val.add_incoming(val_, bb_main)
            val.add_incoming(pos_val, bb_neg)
            
            # Assign to tuple
            tup = builder.insert_value(tup, val, i)
        return tup
    sig = ret(tpl1, tpl2, shape)
    return sig, codegen


@nb.njit(parallel=True)
def process(arr, stencils, weights):
    
    shape = arr.shape     
    direction = np.empty(arr.shape, dtype=np.int8)
    
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  # No gradient, default
        g_c = arr[index]

        for i, (stencil, w) in enumerate(zip(stencils, weights)):
            
            nindex = addmod_tuples(index, stencil, shape)
            
            g_n = arr[nindex]
            
            new_g = (g_n - g_c) * w
            
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
            
            if (new_g < 0.0) and (cdir == -1):
                # Local maximum
                cdir = -2
        direction[index] = cdir
    return direction


# Generate main data
from scipy.ndimage import gaussian_filter
shape = (259, 259, 259)
ary = np.zeros(shape, dtype=np.float64)

for _ in range(3):
    ary += (np.random.rand(*shape) > 0.95).astype(np.float64)
    ary = gaussian_filter(ary, 10.0, mode="wrap")

# Generate stencils
r = (-1, 0, 1)
stencils_refl = [(i, j, k) for i in r for j in r for k in r if (i, j, k) != (0, 0, 0)]
stencils_typed = nb.typed.List()
for s in stencils_refl:
    stencils_typed.append(s)
    
# Generate weights
weights = 1.0 / np.linalg.norm(stencils_refl, axis=1)

# JIT warm-up
nb.set_num_threads(2)
process(ary, stencils_refl, weights)
process(ary, stencils_typed, weights)

# Benchmark
dir1 = %time process(ary, stencils_refl, weights)
dir2 = %time process(ary, stencils_typed, weights)

assert np.all(dir1 == dir2)
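For reference, the intrinsic above computes an element-wise, Python-style nonnegative modulo; the `srem`-plus-fixup branch exists because LLVM's `srem` keeps the sign of the dividend. A plain-Python equivalent (the name `addmod_tuples_py` is mine, just for illustration) is:

```python
def addmod_tuples_py(tpl1, tpl2, shape):
    """Plain-Python equivalent of the addmod_tuples intrinsic:
    element-wise (t1 + t2) % s with a nonnegative result, i.e.
    periodic (wrap-around) indexing for stencil offsets."""
    # Python's % already returns a result with the sign of the divisor,
    # so no explicit negative-value fixup is needed here.
    return tuple((t1 + t2) % s for t1, t2, s in zip(tpl1, tpl2, shape))

# Wrap-around at both ends of a (259, 259, 259) grid:
print(addmod_tuples_py((0, 5, 258), (-1, 0, 1), (259, 259, 259)))  # -> (258, 5, 0)
```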

So I tried using an explicit loop instead of iterators…

Interestingly:

  • The loop version is 10% slower than the iterator version for the reflected list.
  • The loop version is 10-15% faster than the iterator version for the typed list (using getitem_unchecked()).
  • In their loop versions, typed lists are 2-3% slower than reflected lists (and still 10-13% slower than reflected lists used as iterators).
  • All this is consistent with the 25% difference I’m seeing when using iterators.

Edit: disregard what’s below… It turns out bounds checking was the culprit. It’s likely removed when using iterators.


Oh god this is crazy! Change

for i, (stencil, w) in enumerate(zip(stencils, weights)):

into

for i in range(N):
    stencil = stencils[i]
    w = weights[i]

and add N = len(stencils) somewhere near the top.
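As a quick sanity check (plain Python, no JIT; the toy data below is illustrative only), the two loop forms visit exactly the same (i, stencil, w) triples, so the rewrite by itself cannot change the results:

```python
# Toy data standing in for the stencils/weights of the real kernel
stencils = [(-1, 0, 0), (0, 1, 0), (0, 0, 1)]
weights = [1.0, 0.5, 0.25]

# Iterator form, as in the original process()
iter_form = [(i, s, w) for i, (s, w) in enumerate(zip(stencils, weights))]

# Explicit-loop form, as in the edit above
N = len(stencils)
loop_form = []
for i in range(N):
    stencil = stencils[i]
    w = weights[i]
    loop_form.append((i, stencil, w))

print(iter_form == loop_form)  # -> True
```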

I get those new timings:

CPU times: user 11.62 s, sys: 27 µs, total: 11.62 s
Wall time: 5.81 s
CPU times: user 1min 5s, sys: 22 µs, total: 1min 5s
Wall time: 32.9 s

>>> print(nb.version_info)
version_info(major=0, minor=55, patch=1, short=(0, 55), full=(0, 55, 1), string='0.55.1', tuple=('0', '55', '1'), git_revision=None)