Hello, I’m a Numba user, a co-founder of the PyDataLondon meetup and conference series, and a semi-regular conference speaker. Earlier in the year I spoke at a couple of conferences on high performance Python (EuroPython, Remote Pizza Python, PyData Amsterdam), and at one of them Intel also spoke and introduced Intel SDC (Scalable Dataframe Compiler). I had some questions for them and wasn’t clear on their answers.
In chat they noted that they had extended Numba to work efficiently on strings and on datetimes (IIRC, my memory may be faulty) and to work with Pandas. As best I can see, such changes aren’t in Numba, so I’m confused.
Does anyone know about the relationship between Intel SDC and Numba, whether improvements are shared back to Numba (and to other projects), and whether SDC does indeed extend Numba to work with Pandas? The SDC documentation isn’t brilliant at this stage.
No worries if nobody here has a clear answer, I figured this might be a sensible first place to check on the topic. Cheers, Ian (UK)
The Intel SDC team has been working with the Numba team and has been upstreaming many enhancements, including but not limited to string features. SDC’s work on supporting the Pandas API has also sponsored many improvements to Numba around extending the compiler to support new datatypes and compiler passes.
Not all of SDC’s features are shared back to Numba and, IMO, that is a good thing. SDC is pushing the limit of what a compiler can do in auto-parallelization of Pandas operations, but Numba needs to be more cautious to ensure stability for its users. The separation also helps ease our maintenance burden, as Numba is becoming large and complicated.
Also, the complexity of Pandas support (which requires reimplementing many algorithms in a compiler-friendly way) means that it would not be a good idea to upstream Pandas support into Numba itself. Instead, that feature set should live as an extension to Numba, which is how SDC is implemented. This work has pushed the Numba team to increase focus on extensibility by external code bases, which also enables tools like numba-scipy, Awkward Array, and others.
Incidentally, the NumPy support in Numba probably ought to be moved to an extension (in principle, though there is much entanglement currently due to the age of this part of the code), as should the GPU targets. There are historical reasons not to do that now, but conceptually it should be possible.
@sseibert @sklam apologies for being slow to acknowledge your replies - many thanks to you both (I’m a new first-time father and a baby makes…the world much slower!). I’ll keep an eye on Intel SDC and I look forward to trying it, I’m glad there’s good cooperation back to Numba. Thanks both! Ian.
@ianozsvald Bodo provides comprehensive Pandas support (uses Numba underneath) so you might be interested in trying it. It’s similar to SDC except that it provides more Pandas coverage, more optimizations, and MPI parallelization. I agree with @sklam that not all Numba-based acceleration functionality should be inside Numba itself. Packages evolving and being maintained separately can enable faster development.
Hello @ehsantn @sklam (sorry as new user I can name only two users)
Could we tabulate the SDC vs Bodo features so users can see the differences at a glance?
Can we have SDC and Bodo in the same project without conflict, given that SDC is currently at 3.7.9?
Thank you
@scheung38 you can look at the documentation to compare API coverage:
https://docs.bodo.ai/latest/source/pandas.html
https://intelpython.github.io/sdc-doc/latest/apireference.html
There will probably be conflicts if you try to use both at the same time, since they both define dataframe/series/… data structures.
@ianozsvald, from my perspective, Intel SDC seems to have been unmaintained since December 2021. I have read and modified a lot of the SDC source code, and I think it’s a really good starting project that focuses on using Numba to accelerate Pandas. However, SDC still has many unresolved internal technical issues.
BTW, Pandas has already integrated Numba into its own source code to accelerate some kinds of Pandas operations. I think that’s a good sign: it uses Numba to accelerate selected Pandas operations rather than implementing a whole extension project (i.e., making Numba aware of most Pandas types and related operations), which is time-consuming.
PS, I found that some Intel people who previously worked on the SDC project have since focused on another Intel Numba extension project, called dpex (GitHub - IntelPython/numba-dpex: Data Parallel Extension for Numba).
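For anyone who hasn’t seen the Pandas side of this, here is a small sketch of the `engine="numba"` option on `Rolling.apply`. The function `window_range` and the sample data are made up for illustration. Note that `raw=True` is required with the Numba engine, and the `ImportError` fallback is only there so the snippet still runs if Numba isn’t installed.

```python
import numpy as np
import pandas as pd


def window_range(window):
    # With raw=True each window arrives as a NumPy array, which Numba can compile.
    return window.max() - window.min()


s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

# Default (non-Numba) engine, used as a baseline for comparison.
baseline = s.rolling(3).apply(window_range, raw=True)

try:
    # Ask Pandas to JIT-compile the rolling apply loop with Numba.
    jitted = s.rolling(3).apply(window_range, raw=True, engine="numba")
except ImportError:
    # Numba is an optional dependency of Pandas; fall back if it is missing.
    jitted = baseline
```

Both calls should give the same values; the first call with the Numba engine pays a compilation cost, so any benefit only shows up on larger data or repeated calls.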