Numba access of NumPy ufunc inner-loop functions?

We need to change the inner-loop functions in NumPy, including how they are stored and their signature. While doing that, it should be possible to keep old loops working as they are (possibly at a speed penalty).

I realize that Numba is a fast-moving project and that you can probably adapt to a change in NumPy, but maybe we can work things out to make the transition smooth.

So I have two questions (currently):

  1. I know that Numba messes with the internal structure of the ufuncs it creates itself. But does it access the inner-loops of universal functions not created by Numba? E.g. extracting the loop to add two float64?

  2. We have to add an error return value to loops and wish to pass additional information into the ufunc loop, either by extending the signature or by passing a structure (or, not unlikely, both), e.g.:

    int ufunc_loop(char **args, npy_intp const *dimensions, npy_intp const *steps,
                            new_info_struct *info, void *user_data)
    

    Do you have any needs/limitations, or just ideas, on this? I am in the process of revising this, but e.g. the idea is to always pass in all dtypes, because that makes implementing parametric DTypes (e.g. Units) much easier. Even the return value could be a discussion point: does Numba need to do more than return -1 with a Python error set? (E.g. I could see returning a positive value to abort an unnecessary iteration, but would defer that in case we find a better usage.)
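
    For concreteness, a minimal sketch of what a loop under this draft signature might look like (illustrative only; new_info_struct is just the placeholder from above):

        /* Illustrative only: a float64 addition loop under the draft signature. */
        static int
        add_float64_loop(char **args, npy_intp const *dimensions, npy_intp const *steps,
                         new_info_struct *info, void *user_data)
        {
            char *in1 = args[0], *in2 = args[1], *out = args[2];
            npy_intp n = dimensions[0];

            for (npy_intp i = 0; i < n; i++) {
                *(double *)out = *(double *)in1 + *(double *)in2;
                in1 += steps[0];
                in2 += steps[1];
                out += steps[2];
            }
            return 0;  /* return -1 with a Python error set to signal failure */
        }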

I realize this is a technical deep-dive maybe, so please let me know if this is unclear. I am revising NEP 43 here: https://github.com/numpy/numpy/pull/16723, but unless you are enthusiastic, it may be best to defer detailed read (but maybe a quick read can help with the details).

@seberg, thanks for letting us know about the changes to ufunc. We have been waiting for a Ufunc and DType upgrade for so long, so thanks for doing that.

  1. I don’t think we access the inner-loops of ufuncs not created by Numba, and I couldn’t find any such access. The closest thing we have is accessing the math functions in https://github.com/numba/numba/blob/b9efc8323c4b5915be55b7a375c4b0b8fdd84fef/numba/_npymath_exports.c, but I don’t think those are affected by the NEP.
  2. The error return value is a good addition. As for passing argument dtypes for parametric dtypes: can inner-loops be registered for specific instances of dtypes (e.g. datetime64[ms])? In general, Numba has not needed additional information, but I don’t see any harm in having more information.

FYI, most of the basic ufunc support in Numba was written many years ago following https://numpy.org/doc/stable/user/c-info.ufunc-tutorial.html. The more advanced feature is our Dynamic Ufunc (DUfunc), which supports dynamically adding new inner-loops (see code starting at https://github.com/numba/numba/blob/b9efc8323c4b5915be55b7a375c4b0b8fdd84fef/numba/np/ufunc/_internal.c#L99). DUfunc uses more tricks/hacks with the ufunc API and will likely break with the update.

It’s slow, and I would love to rope people in for technical discussions to help me settle things fully; these are not easy topics to address, especially against the backdrop of a package like NumPy, where even small consistency fixes are tricky…

Thanks, that is good to know. I think for loops registered in the old way, we can wrap them in new-style loops and then basically fall back to the old loop with a single additional indirection (i.e. we will do the full old-style linear loop lookup, but wrap it into the new machinery; I had a prototype that basically did that).
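
A minimal sketch of that wrapping idea, assuming the draft signature from my first post (the info->legacy_* fields are hypothetical):

    /* Hypothetical: run a legacy (void-returning) loop via a new-style loop. */
    static int
    legacy_wrapper_loop(char **args, npy_intp const *dimensions, npy_intp const *steps,
                        new_info_struct *info, void *user_data)
    {
        /* One extra indirection: call the old-style loop stashed on info. */
        info->legacy_loop(args, dimensions, steps, info->legacy_data);
        /* Legacy loops can only report errors via the Python error indicator: */
        return PyErr_Occurred() ? -1 : 0;
    }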

On the error return: yes, it’s an obvious one :). Since you don’t access existing loops directly, I think we are fine with signature changes then.

About dispatching to datetime64[ms]: my plan is not to support that now, but I think we can extend things later to cover the same use-case!
I see registration as being only on the DType (i.e. the class); the reason is that this feels like the right abstraction for thinking about “dispatching” itself. Also, we cannot cache/register all datetime64[...] in general; e.g. a Unit DType might have an arbitrary number of specializations (not known ahead of time).

But: After dispatching, we must have one more step to get the actual inner-loop function, although in a first round that might not be exposed/configurable (I want to avoid decisions for Step 3 when I am at Step 1).

So this step, ufuncimpl.get_innerloop_function_and_setup(...), can be passed the actual dtype instance; the UFuncImpl must be resolved by type, but that function could return a specialized inner-loop.
(Let’s say you do a lot of string work with "S1" strings and want to make sure that is not bogged down by useless length iterations.)
So in that sense, it would not be part of “dispatching”, but assuming that architecture sounds good, it is something we can address later (with much less baggage of figuring all the other things out)!
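
Purely as an illustration of that resolution step (all names here are invented):

    /* All names invented: resolve a specialized loop from the actual
     * descriptors, e.g. a fast path for length-1 ("S1") strings. */
    typedef int (inner_loop_func)(char **, npy_intp const *, npy_intp const *,
                                  new_info_struct *, void *);

    static inner_loop_func string_loop_elsize1;  /* hypothetical fast "S1" loop */
    static inner_loop_func string_loop_generic;  /* hypothetical generic loop */

    static inner_loop_func *
    get_innerloop_function_and_setup(PyArray_Descr *const descrs[])
    {
        if (descrs[0]->elsize == 1) {
            return string_loop_elsize1;
        }
        return string_loop_generic;
    }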

Yeah, I guess those were the ones I am worried about. To be clear, I very much hope I don’t need to break them (it would preserve some of my sanity), but NumPy is too big to really gauge the impact of such choices…


From a technical point of view, there are a few things that we need to decide, and I admit right now the NEPs are organized by “concept” rather than by “what are the big decisions”.

I think one of the big decisions is how ufunc dispatching (or rather promotion) should work. NEP 43 is in a rough state, but let me know if you have thoughts on it, or would even just be open to discussing it for an hour or so. That is the type of decision that I feel has a huge long-term impact, but getting actual deep feedback (or even making up my own mind) is hard.

I would be extremely happy about any specific feedback; spending time on it is in no way the issue. My issue is much more about roping in those who can help solidify the technical/architectural decisions.

To bring this up again, NEP 43 is a bit further in design, and I am wondering if you have any input. Mainly on the things I mentioned before, but:

  1. The error return (returning either -1 or 0 for now). An early successful stop would be a possible option, but I am not sure it is really important (it mainly matters if casting is also involved, since casting can then be slow, e.g. np.any(float_array)).
  2. The idea of passing a struct (sketched after this list) containing:
    • A self object (so to speak)
    • The descriptors/dtypes of the input
    • The caller (e.g. the ufunc) if applicable
    • Possibly additional information in the future (I still have to decide on how we could version this), an example is the Python Thread/Interpreter state.
  3. Replacing the current data field with a user-allocatable one (at least in the future); the previous “static” data should be part of “self”
    • I am considering a little extra: if users do not use this userdata field, I could pass in an npy_intp *scratch_space pointing at 0, so that simple flags are easy. For example, if a function gives a warning, it could set a flag to give that warning only once.
  4. To be clear: This design asks for most errors/warnings to be set in the inner-loop function and not in a “teardown”. That seemed easiest to reason about to me.
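
To make points 2 to 4 a bit more tangible, a very rough sketch (all names are placeholders; versioning is unresolved):

    /* All placeholder names; how to version this struct is an open question. */
    typedef struct {
        PyObject *method;             /* "self": the loop's implementation object */
        PyArray_Descr **descriptors;  /* the actual dtypes of all operands */
        PyObject *caller;             /* e.g. the ufunc, if applicable */
        /* future: versioned extensions, thread/interpreter state, ... */
    } new_info_struct;

    /* Hypothetical warn-once flag using the scratch space from point 3: */
    static int
    some_loop(char **args, npy_intp const *dimensions, npy_intp const *steps,
              new_info_struct *info, void *user_data)
    {
        /* If the loop did not claim user_data, it points at a zero-initialized
         * npy_intp that can serve as simple per-call scratch space. */
        npy_intp *scratch = (npy_intp *)user_data;
        if (*scratch == 0) {
            if (PyErr_WarnEx(PyExc_RuntimeWarning, "example warning", 1) < 0) {
                return -1;  /* errors are set in the inner-loop, per point 4 */
            }
            *scratch = 1;  /* warn only once per call */
        }
        /* ... actual iteration over dimensions[0] elements would go here ... */
        return 0;
    }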

While it should not be impossible to extend this in the future, it seems undesirable to do that more than once.

So I am wondering if this type of signature would be convenient for Numba, or if you can think of any other thing you would like passed (or less)? Any comments on the ArrayMethod design mentioned in the NEP draft would be much appreciated! Without comments, I can only hope and assume that this is the right path to continue. (Promotion is another thing mentioned in the NEP draft. It is fairly distinct but also important and I am very open to adapting that as well.)

On the topic of promotion, quoting from the NEP listing of steps for a ufunc call:

  1. Promotion and dispatching
  • Given the DTypes of all inputs, find the correct implementation. E.g. an implementation for float64, int64, or a user-defined DType.
  • When no exact implementation exists, promotion has to be performed. For example, adding a float32 and a float64 is implemented by first casting the float32 to float64.

It would be beneficial to Numba (and other JITs) if there was a way to register a callback to be called before promotion, giving a JIT compiler the chance to generate an exact implementation as needed. Numba’s @vectorize decorator used to require a complete enumeration of type combinations up front to generate loop implementations to be registered with the NumPy ufunc object. This was tedious (and error-prone) for users, so we added an alternative code path that would generate new implementations on the fly as new type combinations appeared. Unfortunately, this approach required us to make a “fake ufunc” (which we call a DUfunc) object that would give Numba the chance to compile new type combinations.

Having an official hook for compiler callbacks on NumPy ufuncs would allow us to drop that alternative code path, which is a minor source of confusion as it isn’t a true NumPy ufunc.

@sseibert that makes sense. This is the second step: “promotion and dispatching”

My current view (as written in the NEP draft) is that we need a way to register new “promoters” anyway (i.e. so that you can define that timedelta64 / int -> timedelta64, which is impossible to write in generic terms).
This would be a generic function that must return an ArrayMethod (which could be cached). Before returning it, it could also explicitly register/add it to the full list of loops/implementations that can be dispatched. (Note that this is independent from byte-swapping, since byte-swapping is part of the ArrayMethod in terms of resolve_descriptors, but even that could be overridden for your own ufuncs easily)

@seberg,

I don’t think I understand it fully. It seems odd to ask the promoters to handle this. I will start by rephrasing @sseibert’s question with a specific situation. In Numba, we can make a ufunc with no known signature:

@numba.vectorize
def foo(x, y):
    return x + y

The foo() ufunc is viewed as a generic function that is open to any argument types that have a valid +. Let’s say we first call it with foo(np.ones(10, dtype=np.float32), np.ones(10, dtype=np.float32)). Since there is no known signature, the compiler is invoked to check whether the incoming types (float32, float32) have a valid +. They do, so Numba compiles a new loop and injects it into the ufunc, which then dispatches to the new loop.

Later, the ufunc is called with different types: foo(float64, float64). The type float64 should not be downcast to float32 if possible. Preferably, we want NumPy to ask Numba for a new loop. Is this possible?

Yes, I realize that is what you want. Why do you feel the promoter is an odd place? Promotion is exactly the place that you need to influence:

  • The promoter (whether defined on the ufunc or registered) will be called whenever an unknown signature is seen (known signatures should at least be cached).
  • The promotion step is exactly the place to decide that “the type float64 should not be downcast to float32” (or, maybe more relevant, the opposite).

Now, I can see that it might seem a bit weird to mix promotion and “compilation”, but the result of promotion (in this sense) is exactly the implementation (i.e. compiled/new function/loop).

The main limitation is that in my current thoughts the promoter returns the ArrayMethod (i.e. the “new compiled loop”, or an old loop with casting). That would seem like a perfect match for you, but it means that if you wanted users to write their own promotion rules, you might need an additional mechanism.

That may be OK, though, unless we can think of a good way to allow calling the original or “next” promoter?
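
To make that concrete, such a promoter might look roughly like this (every name here is invented; nothing is settled):

    /* Every name invented: a promoter that asks the JIT for an exact loop. */
    static ArrayMethod *
    jit_promoter(PyObject *ufunc, PyArray_DTypeMeta *const dtypes[])
    {
        ArrayMethod *impl = jit_compile_exact_loop(ufunc, dtypes);  /* into Numba */
        if (impl == NULL) {
            return NULL;  /* could fall back to the next/default promoter */
        }
        register_loop(ufunc, impl);  /* cache so future dispatch finds it directly */
        return impl;
    }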

Sounds like it will work.

Any user customization will be part of the compiler. I don’t think Numba users should do anything special just for ufunc.

I am missing some details for a full understanding. However, I can imagine the current design will work as long as the promoter can call back into Numba with the current ufunc instance to add new loops.

I reviewed all related NEPs yesterday. I don’t have any other concerns at this time. I just can’t wait to play with the proposed dtype system =).

Thanks, much appreciated! Hopefully, in a few months, some of the more annoying things will seem settled and merged; the ufunc stuff is the biggest piece still waiting. We may have to think about the right path for adding all of the new public API, but I guess it should be straightforward in the end.

Just as an update: an experimental exposure of (parts of) this API (and the DType API) is now available in NumPy main. Unfortunately, it seems the last update here did not include it (but this should be fixed within a few days at most).

The interesting stuff is in numpy/experimental_dtype_api.h on the NumPy main branch on GitHub.

You could try out new custom DTypes, of course. On the ufunc side, advantages include (a rough sketch of the new loop signature follows the list):

  1. A return value to indicate errors (stop the iteration).
  2. Access to the actual operation dtypes (e.g. to get the string sizes).
  3. Ability to indicate per-loop GIL requirement.
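
For reference, a strided loop under the experimental API looks roughly like this (based on experimental_dtype_api.h at the time of writing; details may still shift):

    #include "numpy/experimental_dtype_api.h"

    /* The context carries the actual descriptors (point 2) and the int
     * return value signals errors (point 1). */
    static int
    unary_strided_loop(PyArrayMethod_Context *context,
                       char *const data[], npy_intp const dimensions[],
                       npy_intp const strides[], NpyAuxData *auxdata)
    {
        PyArray_Descr *in_descr = context->descriptors[0];  /* e.g. string size */
        npy_intp n = dimensions[0];
        char *in = data[0], *out = data[1];

        for (npy_intp i = 0; i < n; i++, in += strides[0], out += strides[1]) {
            /* ... the actual element-wise operation ... */
        }
        return 0;  /* -1 with a Python error set stops the iteration */
    }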

The next plan is to expose (very basic!) promoters (and “common-DType”) as well (in a few days).
That should allow a DUfunc to compile and register a new loop as part of its promotion (removing the need for that ugly try: ... except TypeError: add_loop hack).

I have not (really) exposed an internal get_loop() function that could specialize even datetime[ms] as opposed to datetime[ns], etc. It could be exposed (in some form!), but the API leaks some internals right now :(. And I would really like to pass in the maximum power-of-two alignment for each array rather than a single aligned flag…

Promoters are now also available (see the files on Anaconda.org); the documentation is not that great, and getting the DType classes in C is a bit awkward (the easiest is to copy PyArray_DTypeFromTypeNum from NumPy). But I think it is ready for a test ride; just make sure you ping me as soon as it gets bumpy :)!