If that’s the case, if we are a library and want to support several targets at the same time, we have to either
- specify the types in the signature, which is not recommended for the cpu target, or
- not specify the types, which might cause problems for the parallel and cuda targets.
There’s another possibility: our assessment is wrong, all targets have equal type inference capabilities, and we can simply let numba do the work.
The targets have equal type inference capabilities, but the impact of the typing differs between targets. CUDA kernels are extremely sensitive to register usage, so typing everything as 64-bit types has little impact on the CPU variant of a function but a big impact on the CUDA variant - the performance improvements that I understand @s-m-e has observed from providing signatures are likely down to this.
The general advice is not to provide signatures - however, I think your use case is a more advanced one for which I’d deviate from the standard advice, and instead I’d suggest explicitly providing signatures.
Providing signatures for the CPU target isn’t an issue per se - it’s just that beginners get it wrong a lot of the time, and unlike your particular situation it is unnecessary in many cases, so we generally advise against it. However, there does come a point at which it makes sense to do so, which in my opinion you have reached with this Poliastro work.
Thanks a lot for the detailed answer @gmarkall! My main concern has always been losing generality by using float64 instead of letting numba figure out the types, getting errors if someone accidentally passes an integer, etc. But these are minor things that can be compensated by other means.
It’s partially performance, yes, but also failures of the compile process itself. If I do not provide signatures for the parallel and cuda targets, the JIT will spit out various fun tracebacks, in some cases not even related to types at all. The “fix”, from experience, is to provide types, and suddenly the JIT is happy again. I had run into this before on smaller scales, but big time in poliastro. I can provide a few examples if I find the time - I distinctly remember three tracebacks I was seeing frequently. My impression was that, thanks to the (automatic) parallelization that happens in those cases for both the CPU and GPU, the code generation differs enough to make the type inference fail. At least this is how I pictured it.
Were any of these for the @guvectorize decorator? That still requires signatures for the CUDA target (I don’t know if it’s also needed for the parallel one).