Guvectorize not writing one of its multiple outputs for a stateful recursive algorithm

Hi,

I am working on stateful recursive functions, e.g. an exponential moving average or a first-order IIR filter. I’d like to build them with guvectorize so that I can also use them with dask and get all the broadcasting semantics. My problem is that all of these functions have two outputs, the signal and the last state, so that one can pass the last state to the next function call. Yet, it seems the state is never written to the output in my numba implementation. Here is a minimal example which compares against a pure numpy implementation.

import numpy as np
from numba import guvectorize

def ema_numpy(x, state, alpha):
    y = np.empty_like(x)
    N = x.shape[-1]
    for i in range(0, N):
        y[..., i] = state + (x[..., i] - state) * alpha
        state = y[..., i]
    final_state = state
    return y, final_state

@guvectorize(
    [
        "void(f8[:], f8, f8, f8[:], f8)",
        "void(f4[:], f4, f4, f4[:], f4)",
    ],
    "(k), (), () -> (k), ()",
    nopython=True,
    target="cpu",
)
def ema_numba(x, state, alpha, y, final_state):
    N = x.shape[-1]
    for i in range(0, N):
        y[..., i] = state + (x[..., i] - state) * alpha
        state = y[..., i]
    final_state = state # this line seems not to do anything

init_state = 0.0
alpha = 0.01
np.random.seed(1337)
x = np.random.normal(size=(12,))
x1 = x[:6]
x2 = x[6:]

rmse = lambda x,y: np.sqrt(np.mean(np.abs(x-y)**2))
# compare running stateless
np_out, _ = ema_numpy(x, init_state, alpha)
numba_out, _ = ema_numba(x, init_state, alpha)
print("rmse stateless:", rmse(np_out, numba_out.squeeze()))
np.testing.assert_almost_equal(np_out, numba_out.squeeze())

# run numpy stateful implementation
np_out1, state = ema_numpy(x1, init_state, alpha)
np_out2, _ = ema_numpy(x2, state, alpha)
print("init_state", init_state, "state", state)
print("rmse numpy stateful:", rmse(np_out, np.concatenate((np_out1, np_out2))))
np.testing.assert_almost_equal(np_out, np.concatenate((np_out1, np_out2)))

# run numba stateful implementation
numba_out1, state = ema_numba(x1, init_state, alpha)
numba_out2, _ = ema_numba(x2, state, alpha)
print("init_state", init_state, "state", state)
print("rmse numba stateful:", rmse(np_out, np.concatenate((numba_out1, numba_out2))))
np.testing.assert_almost_equal(np_out, np.concatenate((numba_out1, numba_out2)))

Output:

rmse stateless: 0.0

init_state 0.0 state -0.04978762936563727
rmse numpy stateful: 0.0

init_state 0.0 state 0.0
rmse numba stateful: 0.03399832826443144

As one can see, for the numba implementation init_state == state, which is strange, as the state should change.

Thanks!

Do you see the same result if you use more explicit slicing rather than the ‘…’ notation?

I think the main issue is that you need to assign the state to the output, instead of rebinding the name final_state to a new local value, which leaves the actual output (the one that’s returned) unchanged. This is certainly one of the bigger gotchas when working with Numba.

You could try it with:

@guvectorize(
    [
        "void(f8[:], f8, f8, f8[:], f8[:])",
        "void(f4[:], f4, f4, f4[:], f4[:])",
    ],
    "(k), (), () -> (k), ()",
    nopython=True,
    target="cpu",
)
def ema_numba(x, state, alpha, y, final_state):
    N = x.shape[-1]
    for i in range(0, N):
        y[..., i] = state + (x[..., i] - state) * alpha
        state = y[..., i]
    final_state[:] = state

The final assignment is changed to final_state[:] = state and this also requires it to be defined as an array in the signature (f8[:]).

I think this pattern makes more sense if you initialize the outputs outside the function and pass them as arguments explicitly. Because you specified the types in the signature in this case, Numba is clever enough to initialize the output arrays itself (on-the-fly) when they are omitted from the inputs. But it still requires you to follow the same pattern “as if” they were input arguments that you need to modify.
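For example, a minimal sketch of that explicit-output pattern (the shapes here are just for illustration, not from the original post):

x = np.random.normal(size=(4, 12))        # 4 independent signals of length 12
y_out = np.empty_like(x)                  # one (k)-sized output per signal
final_state_out = np.empty(x.shape[:-1])  # one () output (final state) per signal
ema_numba(x, 0.0, 0.01, y_out, final_state_out)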

If you would omit the datatypes in the signature, Numba can’t initialize the outputs and insists you also provide those yourself, which also sort of hints at what’s going on.

@guvectorize("(k), (), () -> (k), ()", nopython=True, target="cpu")
def ema_numba(x, state, alpha, y, final_state):
    ....

# TypeError: Too few arguments for function 'ema_numba'. 
# Note that the pattern `out = gufunc(Arg1, Arg2, ..., ArgN)` is not allowed. 
# Use `gufunc(Arg1, Arg2, ..., ArgN, out) instead.

@nelson2005 thanks, I tried, but it does not solve my problem. Actually, the ... is not necessary at all and one can directly use [i] everywhere.

@Rutger that works, thanks a lot. Why is the other part of the signature not updated?

"(k), (), () -> (k), ()" changed to "(k), (), () -> (k), (k)" as f8 became f8[:]

I always thought:
() must be []
and (k) must be [:]
and (m,k) must be [:,:]

Yes, it’s a bit confusing. Even though you need to declare the type of final_state as an array (e.g. f8[:]), it’s still effectively a scalar from the perspective of (inside) your function. To be honest, I’m not sure what the best way of thinking about it is; I guess it makes it a 1D array of size 1, since you can slice it (but certainly not size k!). I suspect this syntax comes from the fact that you can’t assign/set the value of a scalar.
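Because it behaves as a length-1 array inside the kernel, writing the single element directly should be equivalent to the full-slice assignment (a small sketch, not from the original posts):

final_state[0] = state  # should have the same effect as final_state[:] = state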

The other output y, on the other hand, has the same size as the input x, hence they share the same label k in the signature.

In case you haven’t, you might gain some additional intuition by vectorizing over some additional dimensions: make the input x 2D or 3D, etc. Suppose your x input has shape [128, 64, 12]; the output y will then have the same shape, and the final_state output will be [128, 64].
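A quick sketch of those shapes (sizes assumed for illustration, using the fixed ema_numba from above):

x = np.random.normal(size=(128, 64, 12))
y, final_state = ema_numba(x, 0.0, 0.01)  # typed signature, so Numba allocates the outputs
print(y.shape)            # (128, 64, 12), same as x
print(final_state.shape)  # (128, 64), one final state per vectorized signal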

This also relates to why you don’t need the ... for any extra dimensions in your code: Numba will already “pre-slice” the dimension (labeled k), and the function will only ever get 1D arrays for x & y (both of length k). Such a scenario is also where you would see the real performance benefit, since it’s the additional dimensions that can be parallelized.

Also note that Numba assumes that the function should be applied over the last dimension. When using guvectorize you’ll automatically get the axis keyword as well, but if you omit that, you’re effectively doing:

ema_numba(x, init_state, alpha, y_out, final_state_out, axis=-1)

But suppose your x input has shape [12, 128, 64], and you want to apply it over the first dimension, you can do:

ema_numba(x, init_state, alpha, y_out, final_state_out, axis=0)

And final_state_out will again have shape [128, 64].
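Putting that together, a sketch with explicitly allocated outputs (shapes assumed as above):

x = np.random.normal(size=(12, 128, 64))
y_out = np.empty_like(x)
final_state_out = np.empty((128, 64))
ema_numba(x, 0.0, 0.01, y_out, final_state_out, axis=0)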

edit:
The docs explain it for the input case, see the “Note” box at:
https://numba.pydata.org/numba-doc/dev/user/vectorize.html#the-guvectorize-decorator

I guess what you now have is the same, but for the output case. And so instead of “getting” you’re dealing with “setting”.

Really cool! Thanks a lot. I am indeed working with larger N-D arrays, but I’m religiously keeping the time domain in the right-most dimension, with the other dimensions being data-batch or hyperparameter related. Yet, I did not know that I get the axis kwarg for free; is there somewhere I can read up on this?

A last question, related to having the time domain in the right-most dimension: do you have some thoughts on the memory layout of such data and what would be optimal? Which dimension should the time domain populate? I’m thinking about row-/column-major concepts.

Thanks a lot!
Rasmus

I’m not sure about the best documentation about this. I suspect most of the behavior from the vectorize and guvectorize functions is mimicking the Numpy ufuncs as close as possible, so that might be a start:
https://numpy.org/doc/stable/reference/ufuncs.html#optional-keyword-arguments

But Numba only supports a subset of the ufunc functionality; things like the where keyword aren’t supported as far as I know. But the axis keyword is.

I think what you’re doing now should already be optimal for most cases; I suspect that’s also why the default axis is assumed to be the last (-1). It depends on the layout of the array: for Numpy it’s common to have a c_contiguous layout (row-major), but Numpy also supports f_contiguous (column-major). You can check it by looking at the flags attached to your input array (x.flags).

If it’s suboptimal, it might be worth first creating a copy of the input with the correct layout before calling the function.
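A sketch of that check-and-copy step (assuming the vectorized dimension is the last axis):

print(x.flags['C_CONTIGUOUS'], x.flags['F_CONTIGUOUS'])
if not x.flags['C_CONTIGUOUS']:
    x = np.ascontiguousarray(x)  # row-major copy, so the last axis is contiguous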

A quick benchmark, using your function and some input as:

init_state = 0.0
alpha = 0.01
np.random.seed(1337)

other_dims = [512, 512]
vec_dim = [128]
shp = other_dims + vec_dim
x = np.random.normal(size=shp)

final_state_out = np.empty(other_dims, dtype=np.float64)

It’s possible to create both the C and F layouts, and use each of them with both the optimal and suboptimal axis. So:

# c optimal
x1 = x.copy(order="C")
axis_x1 = -1

# c subopt
x2 = np.moveaxis(x, -1, 0).copy(order="C")
axis_x2 = 0

# fortran subopt
x3 = x.copy(order="F")
axis_x3 = -1

# fortran optimal
x4 = np.moveaxis(x, -1, 0).copy(order="F")
axis_x4 = 0

This gives me:

y_out1 = np.empty_like(x1)
%timeit ema_numba(x1, init_state, alpha, y_out1, final_state_out, axis=axis_x1)
# 130 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

y_out2 = np.empty_like(x2)
%timeit ema_numba(x2, init_state, alpha, y_out2, final_state_out, axis=axis_x2)
# 243 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

y_out3 = np.empty_like(x3)
%timeit ema_numba(x3, init_state, alpha, y_out3, final_state_out, axis=axis_x3)
# 1.84 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

y_out4 = np.empty_like(x4)
%timeit ema_numba(x4, init_state, alpha, y_out4, final_state_out, axis=axis_x4)
# 149 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Especially the Fortran layout (column-major) with a suboptimal axis seems to suffer a lot. I guess that what you have and are doing is similar to the x1 case, which performs fine.

And the result should of course be the same regardless of layout/axis:

np.testing.assert_array_equal(y_out1, np.moveaxis(y_out2, 0, -1))
np.testing.assert_array_equal(y_out1, y_out3)
np.testing.assert_array_equal(y_out1, np.moveaxis(y_out4, 0, -1))