Intel SDC, Pandas and contributions to Numba?

ianozsvald · September 19, 2020, 4:41pm

Hello, I’m a Numba user, a co-founder of the PyDataLondon meetup and conference series, and a semi-regular conference speaker. Earlier in the year I spoke at a couple of conferences on high performance Python (EuroPython, Remote Pizza Python, PyData Amsterdam), and at one of them Intel also spoke and introduced Intel SDC (Scalable Dataframe Compiler). I had some questions for them and wasn’t clear on their answers.
In chat they noted that they had extended Numba to work efficiently on strings and on datetimes (IIRC, my memory may be faulty) and to work with Pandas. As far as I can see, such changes aren’t in Numba, so I’m confused.
Does anyone know about the relationship between Intel SDC and Numba, whether improvements are shared back to Numba (and to other projects), and whether SDC does indeed extend Numba to work with Pandas? The SDC documentation isn’t brilliant at this stage.
No worries if nobody here has a clear answer, I figured this might be a sensible first place to check on the topic. Cheers, Ian (UK)

sklam · September 21, 2020, 4:17pm

The Intel SDC team has been working with the Numba team and has been upstreaming many enhancements, including but not limited to string features. SDC’s work on supporting the Pandas API has also sponsored many improvements to Numba around extending the compiler to support new datatypes and compiler passes.

Not all of SDC’s features are shared back to Numba and, IMO, that is a good thing. SDC is pushing the limit of what a compiler can do in auto-parallelization of Pandas operations, but Numba needs to be more cautious to ensure stability for its users. The separation also helps ease our maintenance burden, as Numba is becoming large and complicated.

sseibert · September 21, 2020, 4:22pm

Also, the complexity of Pandas support (which requires reimplementing many algorithms in a compiler-friendly way) means that it would not be a good idea to upstream Pandas support into Numba itself. Instead, that feature set should live as an extension to Numba, which is how SDC is implemented. This work has pushed the Numba team to increase focus on extensibility by external code bases, which also enables tools like numba-scipy, Awkward Array, and others.

Incidentally, the NumPy support in Numba probably ought to be moved to an extension (in principle, though there is much entanglement currently due to the age of this part of the code), as should the GPU targets. There are historical reasons not to do that now, but conceptually it should be possible.

ianozsvald · October 22, 2020, 1:11pm

@sseibert @sklam apologies for being slow to acknowledge your replies, and many thanks to you both (I’m a new first-time father and a baby makes… the world much slower!). I’ll keep an eye on Intel SDC and I look forward to trying it. I’m glad there’s good cooperation back to Numba. Thanks both! Ian.

ehsantn · March 12, 2021, 8:48pm

@ianozsvald Bodo provides comprehensive Pandas support (uses Numba underneath) so you might be interested in trying it. It’s similar to SDC except that it provides more Pandas coverage, more optimizations, and MPI parallelization. I agree with @sklam that not all Numba-based acceleration functionality should be inside Numba itself. Packages evolving and being maintained separately can enable faster development.

scheung38 · May 28, 2021, 12:42pm

Hello @ehsantn @sklam (sorry, as a new user I can only mention two users)

Could we tabulate the SDC vs Bodo features so users can see the differences at a glance?

Can we have SDC and Bodo in the same project without conflict, given that SDC is currently at 3.7.9?

Thank you

ehsantn · May 28, 2021, 9:20pm

@scheung38 you can look at the documentation to compare API coverage:
https://docs.bodo.ai/latest/source/pandas.html
https://intelpython.github.io/sdc-doc/latest/apireference.html

There will probably be conflicts if you try to use both at the same time, since they both define dataframe/series/… data structures.

dlee992 · January 29, 2023, 6:59am

@ianozsvald, from my perspective, Intel SDC seems to have been unmaintained since December 2021. I have read and modified a lot of the SDC source code, and I think it’s a really good starting project focused on using Numba to accelerate Pandas. However, SDC still has many unsolved internal technical issues.

BTW, Pandas has in fact already integrated Numba into its own source code to accelerate some kinds of Pandas operations. That’s a good sign too: it uses Numba to accelerate a subset of Pandas operations, rather than implementing a whole extension project (i.e., making Numba aware of most Pandas types and related operations), which is time-consuming.

PS: I found that some of the Intel people previously on the SDC project have moved their focus to another Intel Numba extension project called dpex (IntelPython/numba-dpex: Data Parallel Extension for Numba).
