Why is my numba+numpy implementation faster than C+cffi?

Hi Everyone,
I am currently working on image processing code for a project; it was previously written in C and called from Python using cffi. My implementation uses numba+numpy, and I am surprised to find it 4x faster than the existing C + cffi version.
I previously encountered similar behaviour when a numba+numpy k-means implementation turned out slightly faster than scikit-learn's k-means, but I did not pay much heed to it then, as my implementation was not exactly the same.
Can someone guide me here? I am unable to explain how this implementation is faster than the C one. Here are some questions currently going through my mind:

  • Is it theoretically possible that a numba+numpy implementation is faster than the corresponding C/C++ code?
  • Is it something to do with llvmlite?
  • Is there some overhead due to the foreign function interface that could explain this difference in execution time?

I shall be grateful if someone can throw some light on this matter.

Thanks and Regards,
Ankit

hi, interesting question!

a few things off the top of my head:

  • if you completely rewrote the code, it’s easy to introduce differences that lead to better performance.
  • depending on how you call the code, you might be going through Python and adding some overhead (I don’t know much about cffi, though)
  • many C compilers default to -O0, while Numba’s default is -O3. If the C code was not compiled with optimizations enabled, it can be many times slower (see the first sketch after this list).
  • JIT compilers like Numba get very specific information about the objects they work with, and in some cases they can use this information to generate more efficient code. Think of the difference between std::vector (variable size) and std::array (fixed size), or between n-D arrays and 2-D arrays: code that supports n-D arrays is more general and less efficient than code that only supports 2-D arrays. Numba compiles functions for the exact number of dimensions of the inputs, so it can be more efficient in some cases (see the second sketch below). In other cases, the time spent on additional compilations might not pay off, and the generic code ends up faster than the specific code once compilation time is taken into account.
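
To test the optimization-flag point, one option is to rebuild the C extension with explicit flags and compare timings. Below is a minimal sketch using cffi’s API mode with a made-up toy function (the module and function names are just for illustration, not from your project):

```python
# build_example.py -- hypothetical sketch: build the C side of a cffi
# extension with optimizations enabled, then compare against an -O0 build.
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("int add(int a, int b);")
ffibuilder.set_source(
    "_example",  # name of the generated extension module (made up)
    "int add(int a, int b) { return a + b; }",
    extra_compile_args=["-O3"],  # without this, the compiler may fall back to -O0
)

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)
```

If your C code is built through some other mechanism, the equivalent check is to make sure -O2/-O3 actually appears in the compile command that gets run.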
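
To illustrate the specialization point, here is a second minimal sketch (the function is a toy reduction, not your code) showing that Numba compiles one specialization per input dtype/dimensionality:

```python
# specialization_example.py -- sketch showing per-signature compilation in Numba.
import numpy as np
from numba import njit

@njit
def total(a):
    # Generic reduction; Numba specializes it for the exact dtype and ndim of `a`.
    s = 0.0
    for x in a.flat:
        s += x
    return s

total(np.ones((4, 4)))       # compiles a float64, 2-D specialization
total(np.ones((2, 3, 4)))    # compiles a separate float64, 3-D specialization

print(total.signatures)      # one entry per compiled specialization
```

A C function compiled once for the general n-D case cannot make the same per-shape assumptions, which is one plausible source of the difference.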

Without seeing the code it is impossible to say what the reason is, and even with the code it can be very hard. But I’m guessing you care less about the exact reason and more about the potential reasons why this can happen.

Luk

Hi Luk,

Thanks for your answer; I really appreciate the insights you have put forward.

Regards,
Ankit