I regularly use numba, and in general attain massive speedups for my code. Love it
Recently I wanted to try and learn more about numba features, e.g. jitting classes etc.
So as a small project to learn this, I tried to numbafy this small chess engine: github/thomasahle/sunfish
I have rewritten all functions to work in nopython mode, and everything compiles. Unfortunately, I see essentially no speedup: perhaps 1%, but no more.
I would very much like to understand why this code did not achieve more of a speedup.
Could it be that the code uses unicode strings, and that these will never see a speedup? Or are typed.Dicts generally slow?
Hi @julius, have you profiled the application with and without numba? With such a large amount of code, it’s very hard to just look at it and tell what’s going on.
You should profile with and without Numba, and take note of which functions take longest, which functions are run the most times, and how much time is spent in compilation vs at runtime. If you post that here, someone might be able to help.
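For the plain-python side, the stdlib profiler works well; something like this, where `search` is just a stand-in for the engine’s real entry point:

```python
import cProfile
import pstats

def search():
    # stand-in for your real entry point; replace with the engine's search
    total = 0
    for i in range(1000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
search()
profiler.disable()

# Sort by cumulative time so recursive call trees bubble up; the
# "ncalls" column gives the invocation counts you want to compare.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)  # show the top 10 entries
```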
As you can see, much of the time is spent operating on strings (e.g. isupper()). The call counts are very large, and I thought most of the time was going to function-call overhead, which is why I hoped numba would help a lot. Also note that the code is heavily recursive (e.g. bound() calls moves(), which calls bound(), etc.), so the total times have to be read with that in mind. Nevertheless, gen_moves() is not recursive, and a large fraction of the time is spent there, mostly operating on strings. Dictionary getitem does not seem to use much time.
I don’t really have a good way of profiling my numba code. Any pointers?
I just get that the non-jitted python function (search) calling the jitted functions spends all the time:
I also tried to replace all strings with numpy arrays, which can be made to work as well. But for some reason this made the runtime worse, so I do not think the strings are the source of the slowness.
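For reference, this is roughly the kind of mapping I used (the exact piece codes here are made up for illustration, not the ones in my repo):

```python
import numpy as np

# Hypothetical encoding: 0 = empty/off-board, positive = white pieces,
# negative = black pieces. Sunfish itself uses single characters.
PIECE_CODES = {'.': 0, ' ': 0, '\n': 0,
               'P': 1, 'N': 2, 'B': 3, 'R': 4, 'Q': 5, 'K': 6,
               'p': -1, 'n': -2, 'b': -3, 'r': -4, 'q': -5, 'k': -6}

def board_to_array(board):
    """Convert a sunfish-style board string to an int8 array."""
    return np.array([PIECE_CODES[c] for c in board], dtype=np.int8)

# "is this square a white piece?" becomes a sign test instead of isupper()
arr = board_to_array("rnbqkbnr" + "." * 8 + "PPPPPPPP")
white_mask = arr > 0
```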
Regarding “how long it’s spent in compilation vs in runtime”: compilation takes about 30 seconds, while runtime is about 2 seconds. Naturally, I exclude compilation time when comparing the two scripts.
Any ideas on how to proceed would be fantastic.
Thanks again.
Just tried running cProfile directly (rather than through PyCharm as posted above). I ran the code 250 times to dilute the compilation time. This gives the following:
(where string upper does not account for most of the time; in fact, no single function seems to dominate)
If the performance of the python and numba versions is similar, let’s assume for now that the per-function timings are similar too. They won’t be exactly, but let’s focus on gen_moves. It relies heavily on string operations.
I don’t have a lot of experience with high performance string operations. If there’s someone around that does, they might be able to help better. My intuition (biased from working normally with numerical arrays) is to wonder whether the board representation based on strings could be replaced by a representation based on numbers. I guess it won’t be as close to the original problem domain, and therefore harder to debug, but I’m betting that numerical operations are at least an order of magnitude faster than string operations.
Another open question, probably for one of the core devs, is whether string operations in Numba have been optimized, i.e. is something like isupper using a naive or an optimized algorithm.
I agree with all your thoughts.
As mentioned, I did also try a numerical version of the code. This actually led to worse performance.
I guess I could try again though, to rule out silly mistakes.
when you said that you “replaced strings with numpy arrays” I assumed you meant numpy arrays with strings inside. did you test with arrays of integers?
In your first post you asked about typed.Dict. I haven’t done extensive comparisons, but in some tests I ran a while ago, typed.Dict was slower than a python dict in interpreted mode, but faster than a python dict in jitted code. Both are 10-100x slower than a numpy array, so I had to refactor an application from dictionaries to arrays. I gained 50x doing that.
Sure! Pushed it to here now: sunfish/numnunfish.py at master · juliusbierk/sunfish · GitHub
This version uses no string operations (e.g. isupper(), etc…), but it still converts the array into a string now and then to use as a key in the dictionaries. I will try and see if I can get around this.
This numerical version is about two times slower than the string version.
Okay @luk-f-a, you’re making me think about getting rid of all string operations. I found this clever hash function for chess boards (Zobrist hashing - Wikipedia) and quickly tried implementing it. Now all string operations are gone, and I finally see a speedup! So thank you!!
The speedup is only 6x, but definitely better than nothing.
New version here if you are interested sunfish/numnunfish_zobrist.py at master · juliusbierk/sunfish · GitHub
I am still longing for those 100x / 1000x speedups I am used to from numba
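For anyone curious, the core of the Zobrist idea is tiny. This is a toy version of what I implemented (the names and the board encoding here are made up for illustration, not the ones in my repo):

```python
import numpy as np

# One random 64-bit number per (square, piece) pair; the board hash is the
# XOR of the numbers for all occupied squares.
N_SQUARES = 64   # assumption: plain 8x8 board with no padding
N_PIECES = 12    # 6 piece types x 2 colours, encoded 1..12; 0 = empty

rng = np.random.default_rng(0)
ZOBRIST = rng.integers(0, 2**63, size=(N_SQUARES, N_PIECES + 1),
                       dtype=np.int64)

def zobrist_hash(board):
    h = np.int64(0)
    for sq in range(N_SQUARES):
        piece = board[sq]
        if piece:               # skip empty squares
            h ^= ZOBRIST[sq, piece]
    return h

board = np.zeros(N_SQUARES, dtype=np.int64)
board[0] = 4                    # say, a rook on the first square
h1 = zobrist_hash(board)
```

The nice property is the incremental update: moving a piece is just XOR-ing out the old (square, piece) number and XOR-ing in the new one, so the hash never has to be recomputed from scratch.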
is it not possible to remove the dictionaries? that would create a massive speed improvement. if that’s not possible, then not converting to string would be the second best alternative.
Something else to look into is the heavy use of yield, @stuartarchibald would you say using yield in jitted code is efficient? any pitfalls?
Thanks @luk-f-a !
In the zobrist version I just posted I got rid of many of the dictionaries, but two still remain (tp_score, tp_move). These are harder to remove: I could do it, but then I would lose algorithmic performance. It might be worth it though; I will think about it.
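One option I’m considering is replacing them with a fixed-size array indexed by the Zobrist key; a rough sketch (my own names, not the actual tp_score structure):

```python
import numpy as np

# A dictionary-free transposition table: a fixed-size "always replace"
# hash table indexed by the low bits of the Zobrist key.
TABLE_SIZE = 2 ** 20                              # power of two -> cheap masking
tp_key = np.zeros(TABLE_SIZE, dtype=np.int64)     # stored Zobrist keys
tp_val = np.zeros(TABLE_SIZE, dtype=np.float64)   # stored scores

def tp_store(zobrist, score):
    idx = zobrist & (TABLE_SIZE - 1)
    tp_key[idx] = zobrist        # overwrite on collision ("always replace")
    tp_val[idx] = score

def tp_lookup(zobrist):
    idx = zobrist & (TABLE_SIZE - 1)
    if tp_key[idx] == zobrist:   # note: a real table must handle key 0
        return tp_val[idx]
    return np.nan                # miss

tp_store(123456789, 1.5)
```

Unlike the dict, this forgets entries on collisions, which is exactly the algorithmic-performance trade-off I mentioned.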
I can definitely get rid of yield if you think that might be a good idea.
@luk-f-a but most of the 6x speedup might actually come from just doing numerical evaluations instead of string operations. I don’t seem to gain much from numba
However, it has an overhead per function call, and you have so many function calls that I’m not sure it will be useful for you. That overhead might distort the results too much.
It can make it harder to optimise but LLVM can sometimes “see” through it. For this sort of application, finding the hot spots and using benchmarks to assess improvements seems like it could be an effective approach.
Thanks @stuartarchibald. Any recommendations on a benchmarking/profiling approach? The method suggested by @luk-f-a, as he also points out, doesn’t work well due to the many function calls.
For now, your options are probably to collect individual function execution times and obtain information about the number of invocations and infer where there’s potential for optimisation from that. The info @luk-f-a referenced above might help with the invocation count part.
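For example, a simple timing decorator around the python-level entry points can collect both numbers (a sketch; it measures across jitted-call boundaries, it won’t see inside jitted code):

```python
import time
from collections import defaultdict
from functools import wraps

call_counts = defaultdict(int)
call_times = defaultdict(float)

def timed(fn):
    # accumulate wall time and invocation count per function name
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            call_times[fn.__name__] += time.perf_counter() - t0
            call_counts[fn.__name__] += 1
    return wrapper

@timed
def bound(x):   # stand-in for one of the engine's functions
    return x + 1
```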