No speedup when porting small chess engine

Hi,

I regularly use numba, and in general attain massive speedups for my code. Love it :slight_smile:

Recently I wanted to try and learn more about numba features, e.g. jitting classes etc.
So as a small project to learn this, I tried to numbafy this small chess engine: github/thomasahle/sunfish

I have rewritten all functions to nopython mode, and it all compiles. Unfortunately, I do not see any speedup. Perhaps 1 % speedup, but no more.
I would very much like to understand why this code did not achieve more of a speedup.
Could it be that the code uses unicode strings, and that for these there will never be a speedup? Or are typed.Dicts generically slow?

My code is here: sunfish/nunfish.py at master · juliusbierk/sunfish · GitHub
I realize this is quite a long code, but there are only a few places where numba is used.
If anyone wants to run it, the run time should be compared to sunfish/sunfish.py at master · juliusbierk/sunfish · GitHub

I hope someone can give me some insight.

Thank you!
/ Julius

hi @julius, have you profiled the application with and without numba? with such a large code, it’s very hard to just look at it and be able to tell what is going on.
You should profile with and without Numba, and take note of which functions take longest, which functions are run the most times, and how much time is spent in compilation vs in runtime. If you post that here, someone might be able to help.
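A minimal sketch of the kind of profile meant here, using the standard-library cProfile and pstats modules. The functions `gen_moves` and `search` below are hypothetical stand-ins, not the sunfish code; the point is only the pattern of sorting by internal time and reading off call counts.

```python
import cProfile
import io
import pstats


def gen_moves(n):
    # Hypothetical stand-in for a hot function; not the sunfish implementation.
    return [i for i in range(n) if chr(65 + i % 26).isupper()]


def search(depth):
    # Hypothetical stand-in for the top-level search loop.
    total = 0
    for _ in range(depth):
        total += len(gen_moves(200))
    return total


profiler = cProfile.Profile()
profiler.enable()
result = search(50)
profiler.disable()

# Sort by internal time (tottime) and print only the ten most expensive entries.
# The ncalls column shows which functions are run most often.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(10)
report = buffer.getvalue()
print(report)
```

Running the same script once with the jitted version and once with the plain Python version gives two reports that can be compared side by side.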

Cheers,
Luk

Hi @luk-f-a

Thanks for the reply.
I understand, thanks for wanting to take a look nonetheless! Appreciate it.

The non-numba version (sunfish.py) has the following profile


As you can see, much of the time is spent operating on strings (e.g. isupper()). The call counts are very large, and I thought that most of the time was spent on function call overhead, which is why I hoped numba would help out a lot. Also note that the code has a lot of recursion (e.g. bound() calls moves(), which calls bound(), etc.), so the total times have to be read with that in mind. Nevertheless, gen_moves() is not recursive, and a large fraction of the time is spent there, mostly operating on strings. Dictionary getitem does not seem to take much time.

I don’t really have a good way of profiling my numba code. Any pointers?
I just get that the non-jitted python function (search) calling the jitted functions spends all the time:

I also tried to replace all strings with numpy arrays, which can be made to work as well. But for some reason this made the runtime worse, so I do not think that this is the source of the tardiness.

Regarding “how long it’s spent in compilation vs in runtime”: compilation takes about 30 seconds, while runtime is about 2 seconds. Naturally, I don’t compare with compilation time when comparing the two scripts.
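A small sketch of how compilation time can be kept out of a timing comparison: call the jitted function once as a warm-up, then time a second call. `count_upper_codes` is a hypothetical workload, and the `njit` fallback is only there so the snippet also runs where Numba is not installed (in which case both calls take about the same time).

```python
import time

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no compilation, no speedup).
    def njit(func):
        return func


@njit
def count_upper_codes(n):
    # Hypothetical workload: count values whose code falls in the ASCII range 'A'..'Z'.
    count = 0
    for i in range(n):
        code = i % 128
        if 65 <= code <= 90:
            count += 1
    return count


def timed(func, *args):
    # Return the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    value = func(*args)
    return value, time.perf_counter() - start


# The first call includes JIT compilation when Numba is present.
first_value, first_call = timed(count_upper_codes, 200_000)
# The second call reuses the compiled code, so it measures runtime only.
second_value, second_call = timed(count_upper_codes, 200_000)

print(f"first call:  {first_call:.4f} s")
print(f"second call: {second_call:.4f} s")
```

Only the second number should be compared against the pure Python script.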

Any ideas on how to proceed would be fantastic.
Thanks again.

Just tried to run cProfile directly (rather than through PyCharm as posted above). Ran the code 250 times to dilute compilation time. This gives the following:
(where string upper does not take up most of the time. In fact, no single function seems to take up most of the time)

         41567999 function calls (38831721 primitive calls) in 735.185 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   120000  689.553    0.006  722.449    0.006 nunfish.py:292(search)
   115813   25.314    0.000   27.774    0.000 ffi.py:149(__call__)
994946/178002    0.713    0.000    1.031    0.000 ir.py:313(_rec_list_vars)
  6148163    0.643    0.000    0.983    0.000 {built-in method builtins.isinstance}
  1220528    0.588    0.000    1.528    0.000 event.py:227(notify)
        1    0.532    0.532  732.177  732.177 nunfish.py:500(timeit)
699457/371671    0.380    0.000    1.208    0.000 {method 'format' of 'str' objects}
    29750    0.351    0.000    5.759    0.000 boxing.py:58(wrapper)
766049/334170    0.349    0.000    0.742    0.000 abstract.py:117(__hash__)
   109579    0.319    0.000    0.825    0.000 instructions.py:13(__init__)
   242416    0.279    0.000    1.834    0.000 event.py:193(broadcast)
   610264    0.268    0.000    0.373    0.000 event.py:254(on_end)
    92006    0.252    0.000    0.520    0.000 utils.py:418(unified_function_type)
  1220528    0.231    0.000    0.231    0.000 event.py:122(is_start)
113115/76315    0.219    0.000    2.152    0.000 abstract.py:60(__call__)
   139086    0.208    0.000    0.292    0.000 _utils.py:24(deduplicate)
    29750    0.207    0.000    0.237    0.000 <string>:2(method)
   653502    0.206    0.000    0.211    0.000 {built-in method _abc._abc_instancecheck}
383914/373579    0.201    0.000    0.278    0.000 abstract.py:120(__eq__)
190206/69874    0.191    0.000    3.968    0.000 functools.py:872(wrapper)
785464/353311    0.187    0.000    0.465    0.000 {built-in method builtins.hash}
   610264    0.177    0.000    0.232    0.000 event.py:249(on_start)
   242579    0.172    0.000    0.296    0.000 weakref.py:404(__getitem__)
60038/30287    0.166    0.000    3.637    0.000 typeof.py:164(_typeof_tuple)
    30012    0.166    0.000    2.678    0.000 typeddict.py:153(__setitem__)
   144038    0.162    0.000    0.547    0.000 values.py:232(_set_name)
371331/145477    0.153    0.000    1.038    0.000 _utils.py:44(__str__)
    33867    0.152    0.000    0.163    0.000 containers.py:233(__init__)
   113550    0.150    0.000    0.905    0.000 values.py:219(_to_string)
    62046    0.146    0.000    4.224    0.000 dispatcher.py:677(typeof_pyval)
304819/303345    0.142    0.000    0.344    0.000 {method 'join' of 'str' objects}
    32083    0.140    0.000    0.545    0.000 containers.py:290(__init__)
915242/913088    0.133    0.000    0.160    0.000 {built-in method builtins.getattr}
   190206    0.130    0.000    0.282    0.000 functools.py:816(dispatch)
   653502    0.129    0.000    0.340    0.000 abc.py:96(__instancecheck__)
    60218    0.128    0.000    2.709    0.000 containers.py:144(from_types)
     1837    0.126    0.000    0.777    0.000 analysis.py:23(compute_use_defs)
268696/268280    0.123    0.000    0.256    0.000 _utils.py:54(get_reference)
   216827    0.120    0.000    0.583    0.000 {method 'get' of 'dict' objects}
    33678    0.115    0.000    0.161    0.000 typeof.py:60(_typeof_buffer)
    32083    0.115    0.000    0.884    0.000 containers.py:313(__init__)
    69874    0.114    0.000    4.155    0.000 typeof.py:25(typeof)
  1182786    0.113    0.000    0.113    0.000 {built-in method time.perf_counter}
   121208    0.113    0.000    1.308    0.000 event.py:388(end_event)
   113115    0.110    0.000    0.650    0.000 abstract.py:48(_intern)
    23651    0.108    0.000    0.225    0.000 instructions.py:415(__init__)
   121208    0.105    0.000    0.958    0.000 event.py:374(start_event)
   610264    0.105    0.000    0.105    0.000 event.py:132(is_end)
   223404    0.102    0.000    0.103    0.000 {built-in method builtins.hasattr}
   242416    0.102    0.000    0.215    0.000 event.py:84(__init__)
116180/103736    0.099    0.000    0.150    0.000 {built-in method builtins.sorted}
    32122    0.099    0.000    0.291    0.000 containers.py:300(__new__)
   138627    0.095    0.000    0.645    0.000 values.py:212(__init__)
      700    0.093    0.000    0.340    0.000 postproc.py:175(_patch_var_dels)
    30000    0.093    0.000    0.127    0.000 nunfish.py:304(searcher_search)
    21720    0.093    0.000    0.422    0.000 builder.py:964(extract_value)
177766/176959    0.091    0.000    0.331    0.000 {built-in method builtins.any}
   155530    0.090    0.000    0.090    0.000 {built-in method __new__ of type object at 0x00007FFB3E5C3C60}
   719364    0.087    0.000    0.087    0.000 abstract.py:95(key)
5092/2029    0.085    0.000   14.209    0.007 functions.py:283(get_call_type)
     3119    0.078    0.000    0.078    0.000 {built-in method nt.stat}
   209762    0.077    0.000    0.077    0.000 serialize.py:140(_numba_unpickle)
    60218    0.077    0.000    0.232    0.000 containers.py:128(is_homogeneous)
   728819    0.076    0.000    0.076    0.000 {method 'append' of 'list' objects}
   248988    0.076    0.000    0.118    0.000 event.py:50(_guard_kind)
   144038    0.075    0.000    0.384    0.000 _utils.py:16(register)
60038/30287    0.075    0.000    1.836    0.000 typeof.py:166(<listcomp>)
   115813    0.074    0.000    1.383    0.000 ffi.py:73(__exit__)
    30012    0.074    0.000    0.087    0.000 typeddict.py:32(_setitem)
111637/110978    0.072    0.000    0.566    0.000 <frozen importlib._bootstrap>:1033(_handle_fromlist)
   115813    0.072    0.000    1.051    0.000 ffi.py:67(__enter__)
   404536    0.068    0.000    0.068    0.000 {method 'startswith' of 'str' objects}
    81492    0.067    0.000    0.081    0.000 containers.py:756(key)
    61924    0.067    0.000    0.154    0.000 typeof.py:121(_typeof_int)
     1825    0.065    0.000    0.155    0.000 analysis.py:118(compute_dead_maps)
     4296    0.064    0.000    0.077    0.000 analysis.py:81(def_reach)
    71051    0.064    0.000    0.074    0.000 contextlib.py:86(__init__)
     3301    0.063    0.000    0.065    0.000 values.py:554(__init__)
   389782    0.063    0.000    0.063    0.000 typeddict.py:131(_numba_type_)
202243/174517    0.063    0.000    0.521    0.000 {built-in method builtins.next}
   790173    0.062    0.000    0.062    0.000 {method 'extend' of 'list' objects}
      472    0.059    0.000    0.059    0.000 {built-in method io.open_code}
696462/692524    0.059    0.000    0.060    0.000 {built-in method builtins.len}
    33678    0.057    0.000    0.284    0.000 typeof.py:39(typeof_impl)
   109579    0.056    0.000    0.071    0.000 builder.py:351(_insert)
     3630    0.055    0.000    0.197    0.000 numpy_support.py:354(ufunc_find_matching_loop)
     3854    0.055    0.000    0.073    0.000 analysis.py:91(liveness)
    61924    0.053    0.000    0.087    0.000 utils.py:294(bit_length)
   115810    0.051    0.000    1.296    0.000 base.py:1224(exit_fn)
    25191    0.051    0.000    0.188    0.000 context.py:532(_rate_arguments)
    97064    0.051    0.000    0.093    0.000 values.py:239(_get_reference)
28338/14291    0.050    0.000    0.508    0.000 context.py:180(get_meminfos)
   115810    0.049    0.000    0.959    0.000 base.py:1221(enter_fn)
   120444    0.048    0.000    0.133    0.000 containers.py:133(<genexpr>)
     4113    0.047    0.000    0.258    0.000 context.py:582(resolve_overload)
   410802    0.044    0.000    0.048    0.000 {method 'add' of 'set' objects}
   143574    0.043    0.000    0.904    0.000 ir.py:346(list_vars)
   262309    0.043    0.000    0.043    0.000 _utils.py:13(is_used)
    30039    0.042    0.000    0.472    0.000 containers.py:170(_make_homogeneous_tuple)
    23651    0.041    0.000    0.304    0.000 builder.py:768(store)
     6572    0.040    0.000    0.072    0.000 event.py:321(install_timer)
11623/10863    0.040    0.000    5.690    0.001 lowering.py:321(lower_inst)
    66675    0.039    0.000    0.039    0.000 npytypes.py:453(key)
     4637    0.037    0.000    0.158    0.000 analysis.py:235(compute_cfg_from_blocks)
    27002    0.036    0.000    0.061    0.000 functions.py:80(add_error)
     1829    0.036    0.000    0.050    0.000 analysis.py:190(compute_live_variables)
    37615    0.036    0.000    0.073    0.000 values.py:120(__init__)
71051/57197    0.036    0.000    0.139    0.000 contextlib.py:121(__exit__)
 3588/198    0.035    0.000   39.220    0.198 compiler_machinery.py:257(_runPass)
    96353    0.035    0.000    0.043    0.000 containers.py:319(<genexpr>)
31042/8641    0.035    0.000   19.923    0.002 templates.py:343(apply)
      472    0.034    0.000    0.034    0.000 {built-in method marshal.loads}
     6048    0.033    0.000    0.097    0.000 instructions.py:65(__init__)
    26340    0.033    0.000    0.610    0.000 value.py:143(name)
    36454    0.032    0.000    0.074    0.000 typeconv.py:43(check_compatible)
    14291    0.032    0.000    0.596    0.000 context.py:197(_call_incref_decref)
   190500    0.032    0.000    0.032    0.000 containers.py:243(key)
    22857    0.032    0.000    0.227    0.000 builder.py:755(load)
    70576    0.032    0.000    0.093    0.000 abstract.py:123(__ne__)
    71051    0.031    0.000    0.106    0.000 contextlib.py:242(helper)
    21720    0.031    0.000    0.065    0.000 instructions.py:639(descr)
71051/57182    0.031    0.000    0.396    0.000 contextlib.py:112(__enter__)
    23651    0.030    0.000    0.109    0.000 instructions.py:419(descr)
3795/3788    0.030    0.000    0.075    0.000 {built-in method builtins.__build_class__}
37044/37043    0.030    0.000    0.030    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     6025    0.029    0.000    0.093    0.000 instructions.py:113(_descr)
    53898    0.029    0.000    0.043    0.000 errors.py:724(new_error_context)
    87689    0.029    0.000    0.458    0.000 ir.py:594(list_vars)
    21720    0.029    0.000    0.311    0.000 instructions.py:624(__init__)
    45169    0.029    0.000    0.151    0.000 ffi.py:309(close)
    40432    0.029    0.000    0.044    0.000 typeinfer.py:896(__getitem__)
45356/43622    0.029    0.000    0.130    0.000 context.py:507(can_convert)
   190207    0.029    0.000    0.029    0.000 {built-in method _abc.get_cache_token}
  897/821    0.028    0.000    5.792    0.007 lowering.py:220(lower_block)
    31124    0.027    0.000    0.032    0.000 controlflow.py:366(_add_edge)
    34751    0.027    0.000    0.829    0.000 module.py:213(__next__)
   131400    0.026    0.000    0.026    0.000 controlflow.py:116(successors)
 4744/142    0.026    0.000   33.772    0.238 typeinfer.py:568(resolve)
   287857    0.026    0.000    0.026    0.000 {method 'values' of 'dict' objects}
    32919    0.026    0.000    0.036    0.000 controlflow.py:381(_dfs)
   242416    0.026    0.000    0.026    0.000 event.py:92(kind)
    43041    0.026    0.000    0.118    0.000 ffi.py:352(__del__)
     7834    0.026    0.000    0.404    0.000 cgutils.py:362(alloca_once)
    29845    0.026    0.000    1.457    0.000 containers.py:174(_make_heterogeneous_tuple)
     3096    0.026    0.000    0.066    0.000 inspect.py:2150(_signature_from_function)
    22857    0.026    0.000    0.064    0.000 instructions.py:399(descr)
    37272    0.025    0.000    0.075    0.000 numpy_support.py:336(ufunc_can_cast)
   180114    0.025    0.000    0.025    0.000 typeof.py:167(<genexpr>)
    21558    0.024    0.000    0.078    0.000 types.py:77(__call__)
   195908    0.024    0.000    0.024    0.000 abstract.py:114(__repr__)
    31316    0.024    0.000    0.026    0.000 {built-in method _functools.reduce}
      330    0.024    0.000    0.525    0.002 registry.py:54(apply)
61048/52445    0.023    0.000    0.220    0.000 misc.py:47(unliteral)
13576/5825    0.023    0.000   13.757    0.002 templates.py:592(generic)
    17264    0.023    0.000    0.393    0.000 value.py:206(is_declaration)
19282/14158    0.023    0.000    0.042    0.000 pprint.py:529(_safe_repr)
   116060    0.023    0.000    0.024    0.000 ffi.py:93(__getattr__)
   361/36    0.023    0.000   34.471    0.958 typeinfer.py:141(propagate)
    53825    0.023    0.000    0.036    0.000 ir.py:1211(find_insts)
    68809    0.022    0.000    0.052    0.000 types.py:30(__ne__)
     7860    0.022    0.000    1.046    0.000 values.py:799(<listcomp>)
    50/49    0.022    0.000    0.022    0.000 {built-in method _imp.create_dynamic}
     6272    0.021    0.000    0.092    0.000 ir.py:575(__repr__)
    22857    0.021    0.000    0.175    0.000 instructions.py:394(__init__)
52373/52060    0.021    0.000    0.201    0.000 manager.py:22(lookup)
   121233    0.021    0.000    0.021    0.000 {method 'acquire' of '_thread.RLock' objects}
   113076    0.021    0.000    0.021    0.000 abstract.py:92(__init__)
10587/10443    0.021    0.000    0.043    0.000 numpy_support.py:124(as_dtype)
     2143    0.021    0.000    0.082    0.000 templates.py:185(fold_arguments)
    19594    0.020    0.000    0.431    0.000 module.py:233(_next)
     4744    0.020    0.000    0.042    0.000 typeinfer.py:496(fold_arg_vars)
    30580    0.020    0.000    0.103    0.000 ssa.py:177(_run_ssa_block_pass)
     2670    0.020    0.000    0.028    0.000 inspect.py:2926(_bind)
   161334    0.020    0.000    0.020    0.000 builder.py:195(block)
    19836    0.020    0.000    0.072    0.000 errors.py:437(catch_warnings)
    52739    0.019    0.000    0.048    0.000 {built-in method builtins.all}
     6895    0.019    0.000    0.060    0.000 interpreter.py:692(store)
     1825    0.019    0.000    1.719    0.001 postproc.py:68(run)
    64254    0.019    0.000    0.025    0.000 containers.py:308(<genexpr>)
    61924    0.018    0.000    0.018    0.000 {built-in method builtins.bin}
     2615    0.018    0.000    0.039    0.000 controlflow.py:570(_find_back_edges)
   127674    0.018    0.000    0.018    0.000 {method 'insert' of 'list' objects}
     3628    0.018    0.000    0.300    0.000 npydecl.py:96(generic)
   102670    0.018    0.000    0.018    0.000 controlflow.py:124(predecessors)
    35072    0.018    0.000    0.027    0.000 value.py:81(__init__)
    14189    0.018    0.000    0.023    0.000 ntpath.py:124(splitdrive)
    50660    0.018    0.000    0.026    0.000 ir.py:379(__getattr__)
    27795    0.018    0.000    0.029    0.000 values.py:538(__iter__)
13576/5825    0.018    0.000   13.619    0.002 templates.py:681(_get_impl)
    15668    0.018    0.000    0.068    0.000 builder.py:281(goto_entry_block)
   104893    0.017    0.000    0.022    0.000 _utils.py:71(_stringify_metadata)
   128134    0.017    0.000    0.017    0.000 {method 'rstrip' of 'str' objects}
36401/36257    0.017    0.000    0.037    0.000 values.py:130(_get_reference)
    12444    0.017    0.000    0.140    0.000 values.py:729(__init__)
    37024    0.017    0.000    0.050    0.000 <__array_function__ internals>:2(can_cast)
    46015    0.017    0.000    0.017    0.000 ffi.py:319(detach)
    11816    0.016    0.000    0.037    0.000 analysis.py:66(fix_point_progress)
   104640    0.016    0.000    0.016    0.000 abstract.py:480(initial_value)
    15884    0.016    0.000    0.300    0.000 models.py:626(get)
    15157    0.016    0.000    0.346    0.000 module.py:244(_next)
     3301    0.016    0.000    0.252    0.000 values.py:593(__init__)
46238/46218    0.016    0.000    0.189    0.000 manager.py:33(__getitem__)
    76492    0.015    0.000    0.045    0.000 templates.py:355(<genexpr>)
    32973    0.015    0.000    0.226    0.000 functions.py:232(_unlit_non_poison)
      341    0.015    0.000    1.024    0.003 codegen.py:570(_optimize_functions)
   126250    0.015    0.000    0.015    0.000 values.py:229(_get_name)
    32123    0.015    0.000    0.042    0.000 containers.py:282(is_types_iterable)
     5600    0.015    0.000    0.362    0.000 lowering.py:1282(delvar)
    13657    0.015    0.000    0.015    0.000 __init__.py:509(cast)
    99659    0.014    0.000    0.014    0.000 containers.py:327(key)
    96018    0.014    0.000    0.019    0.000 analysis.py:69(<genexpr>)
      883    0.014    0.000    0.115    0.000 inline_closurecall.py:1327(_inline_const_arraycall)
     4808    0.014    0.000    0.052    0.000 controlflow.py:394(_eliminate_dead_blocks)
     2389    0.014    0.000    0.080    0.000 packer.py:73(__init__)
     5157    0.014    0.000    0.958    0.000 lowering.py:1258(storevar)
     3628    0.014    0.000    0.040    0.000 npydecl.py:24(_handle_inputs)
 2692/118    0.014    0.000   33.576    0.285 typeinfer.py:558(__call__)
 5169/150    0.013    0.000   33.768    0.225 context.py:231(_resolve_user_function_type)
      171    0.013    0.000    0.135    0.001 byteflow.py:78(run)
     5348    0.013    0.000    0.155    0.000 builder.py:297(if_then)
   1643/9    0.013    0.000   39.311    4.368 dispatcher.py:864(compile)
    57780    0.013    0.000    0.022    0.000 __init__.py:1412(debug)
    30870    0.013    0.000    0.017    0.000 values.py:250(function_type)
      171    0.013    0.000    0.099    0.001 interpreter.py:285(peep_hole_delete_with_exit)
   121233    0.013    0.000    0.013    0.000 {method 'release' of '_thread.RLock' objects}
    64248    0.013    0.000    0.013    0.000 containers.py:291(<lambda>)
7570/7562    0.013    0.000    0.056    0.000 typeinfer.py:1077(add_type)
     9923    0.013    0.000    0.014    0.000 warnings.py:458(__enter__)
      522    0.013    0.000    0.013    0.000 {method 'read' of '_io.BufferedReader' objects}
3180/3096    0.013    0.000    0.089    0.000 inspect.py:2244(_signature_from_callable)
    43511    0.013    0.000    0.013    0.000 ffi.py:302(__init__)
      660    0.013    0.000    0.032    0.000 compiler_machinery.py:341(dependency_analysis)
     5031    0.013    0.000    0.028    0.000 ntpath.py:180(split)
    13642    0.013    0.000    0.018    0.000 ir.py:1471(get_definition)
    10095    0.012    0.000    0.022    0.000 inspect.py:2515(__init__)
     3906    0.012    0.000    0.146    0.000 cgutils.py:873(gep)
 8919/426    0.012    0.000    0.105    0.000 pprint.py:163(_format)
     3946    0.012    0.000    0.102    0.000 instructions.py:493(__init__)
    99532    0.012    0.000    0.012    0.000 {method 'pop' of 'list' objects}
     6048    0.012    0.000    0.114    0.000 builder.py:874(call)
      911    0.012    0.000    0.106    0.000 ir.py:1244(dump)
   107569    0.012    0.000    0.012    0.000 {method 'items' of 'dict' objects}
     7860    0.012    0.000    1.072    0.000 values.py:797(descr)
    14543    0.012    0.000    0.012    0.000 {built-in method builtins.repr}
     2106    0.012    0.000    0.867    0.000 module.py:11(parse_assembly)
     8059    0.012    0.000    0.017    0.000 ir.py:1200(find_exprs)
    31346    0.012    0.000    0.012    0.000 typeof.py:144(_typeof_str)
    14015    0.011    0.000    0.022    0.000 context.py:248(_get_attribute_templates)
     7962    0.011    0.000    0.059    0.000 instructions.py:475(__init__)
22354/18422    0.011    0.000    0.012    0.000 {built-in method _abc._abc_subclasscheck}
     9962    0.011    0.000    0.091    0.000 ir.py:862(__str__)
    29404    0.011    0.000    0.015    0.000 ir.py:702(__init__)
       32    0.011    0.000    0.011    0.000 {built-in method _ctypes.LoadLibrary}
    10593    0.011    0.000    0.053    0.000 base.py:108(_match_arglist)
    23078    0.011    0.000    0.014    0.000 containers.py:449(key)
     1444    0.011    0.000    0.392    0.000 types.py:33(_get_ll_pointer_type)
    12444    0.011    0.000    0.015    0.000 values.py:683(__init__)
    11935    0.011    0.000    0.034    0.000 base.py:128(_match)
    15679    0.011    0.000    0.015    0.000 controlflow.py:593(push_state)
      535    0.011    0.000    0.016    0.000 ir_utils.py:513(remove_dels)
     6282    0.011    0.000    0.107    0.000 context.py:364(resolve_value_type)
    31124    0.011    0.000    0.043    0.000 controlflow.py:101(add_edge)
      320    0.011    0.000    0.619    0.002 codegen.py:1063(<setcomp>)
     6467    0.011    0.000    0.068    0.000 byteflow.py:260(dispatch)
     5778    0.011    0.000    0.032    0.000 instructions.py:325(descr)
     2181    0.011    0.000    0.026    0.000 ir_utils.py:1541(find_callname)
42260/21797    0.011    0.000    0.013    0.000 packer.py:144(rec)
24928/23795    0.011    0.000    0.013    0.000 types.py:127(__eq__)
     5778    0.011    0.000    0.042    0.000 instructions.py:309(__init__)
     3417    0.011    0.000    0.016    0.000 inspect.py:2798(__init__)
    18575    0.011    0.000    0.233    0.000 functions.py:298(<listcomp>)
13990/8650    0.011    0.000    1.304    0.000 utils.py:274(__get__)
     7962    0.010    0.000    0.077    0.000 builder.py:737(alloca)
    24455    0.010    0.000    0.020    0.000 utils.py:305(stream_list)
     6516    0.010    0.000    0.024    0.000 numpy_support.py:384(choose_types)
  865/135    0.010    0.000    0.027    0.000 sre_parse.py:493(_parse)
      319    0.010    0.000    0.010    0.000 {built-in method builtins.compile}
      168    0.010    0.000    0.108    0.001 ssa.py:154(_run_block_rewrite)
     9923    0.010    0.000    0.011    0.000 warnings.py:477(__exit__)
    11113    0.010    0.000    0.275    0.000 models.py:706(getter)
     2332    0.010    0.000    0.050    0.000 controlflow.py:653(_find_loops)
    80055    0.010    0.000    0.010    0.000 templates.py:46(args)
    83058    0.010    0.000    0.010    0.000 analysis.py:40(<genexpr>)
    12444    0.010    0.000    0.031    0.000 values.py:716(_to_list)
      501    0.010    0.000    0.014    0.000 controlflow.py:503(_find_dominators_internal)
     6871    0.010    0.000    0.026    0.000 types.py:423(wrap_constant_value)
18454/16387    0.010    0.000    0.256    0.000 ir_utils.py:1495(guard)
5429/2389    0.010    0.000    0.014    0.000 packer.py:175(rec)
    22729    0.010    0.000    0.013    0.000 types.py:116(__init__)
     6446    0.010    0.000    0.054    0.000 instructions.py:167(descr)
      911    0.010    0.000    0.016    0.000 interpreter.py:538(_remove_unused_temporaries)
     2665    0.010    0.000    0.052    0.000 module.py:45(add_metadata)
     5771    0.010    0.000    0.056    0.000 builder.py:568(_icmp)
      344    0.009    0.000    0.012    0.000 ir_utils.py:1516(build_definitions)
     2120    0.009    0.000    0.107    0.000 containers.py:693(__init__)
    82189    0.009    0.000    0.009    0.000 postproc.py:186(<genexpr>)
    67308    0.009    0.000    0.020    0.000 {built-in method builtins.issubclass}
    58269    0.009    0.000    0.009    0.000 compiler.py:269(__getattr__)
     6373    0.009    0.000    0.112    0.000 interpreter.py:668(_dispatch)
    18439    0.009    0.000    0.022    0.000 os.py:674(__getitem__)
    23242    0.009    0.000    0.011    0.000 types.py:240(format_constant)
    70967    0.009    0.000    0.009    0.000 {method 'pop' of 'dict' objects}
    32012    0.009    0.000    0.009    0.000 enum.py:354(__getitem__)
  518/516    0.009    0.000    0.531    0.001 lowering.py:562(lower_binop)
2488/2327    0.009    0.000    4.015    0.002 lowering.py:997(lower_expr)
      341    0.009    0.000    0.009    0.000 __init__.py:48(create_string_buffer)

if the overall performance of the python and numba versions is similar, let’s assume for now that the per-function timings are similar too. They won’t be, for sure, but let’s focus on gen_moves. It relies heavily on string operations.

I don’t have a lot of experience with high performance string operations. If there’s someone around that does, they might be able to help better. My intuition (biased from working normally with numerical arrays) is to wonder whether the board representation based on strings could be replaced by a representation based on numbers. I guess it won’t be as close to the original problem domain, and therefore harder to debug, but I’m betting that numerical operations are at least an order of magnitude faster than string operations.
Another open question, probably to one of the core devs, is whether string operations in Numba have been optimized, ie is something like isupper using a naive or optimized algorithm.

I agree with all your thoughts.
As mentioned, I did also try a numerical version of the code. This actually led to worse performance.
I guess I could try again though, to rule out silly mistakes.

when you said that you “replaced strings with numpy arrays” I assumed you meant numpy arrays with strings inside. did you test with arrays of integers?

in your first post you asked about typed.Dict. I haven’t done extensive comparisons but for some tests I ran some time ago, I remember that in interpreted mode typed.Dict is slower than python dict, but in jitted code typed.Dict is faster than python dict. They are both 10-100x slower than a numpy array, so I had to re-factor an application from dictionaries to arrays. I gained 50x doing that.

@julius could you share the numerical version? I’m curious to see how it can be slower

Sure! Pushed it to here now: sunfish/numnunfish.py at master · juliusbierk/sunfish · GitHub
This version uses no string operations (e.g. isupper(), etc…), but it still makes the array into a string now and then to use as a key in dictionaries. I will try and see if I can get around this.
This numerical version is about two times slower than the string version.

Okay @luk-f-a you’re making me think about getting rid of all string operations. I found this smart hash function for chess boards (Zobrist hashing - Wikipedia) and quickly tried implementing it. Now all string operations are gone, and finally I see a speedup! So thank you!!
The speedup is only 6x, but definitely better than nothing.
New version here if you are interested sunfish/numnunfish_zobrist.py at master · juliusbierk/sunfish · GitHub
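For readers unfamiliar with Zobrist hashing, here is a minimal sketch of the idea, not the code from the linked file. Each (piece, square) pair gets a fixed random 64-bit number, and the board hash is the XOR of the numbers for all occupied squares, so a move can update the hash with a few XORs instead of rehashing the board. The piece indexing (0..11, with -1 for empty) is an assumption, and a complete key would normally also cover side to move, castling rights and en passant, which are omitted here.

```python
import random

N_SQUARES = 64
N_PIECES = 12  # assumed indexing: 6 piece types x 2 colours, -1 means empty

rng = random.Random(0)  # fixed seed so the table is reproducible
ZOBRIST = [[rng.getrandbits(64) for _ in range(N_SQUARES)] for _ in range(N_PIECES)]


def full_hash(board):
    # XOR together the random numbers of every occupied square.
    h = 0
    for square, piece in enumerate(board):
        if piece >= 0:
            h ^= ZOBRIST[piece][square]
    return h


def update_hash(h, piece, src, dst, captured=-1):
    # Incremental update: remove the piece from src, remove any captured
    # piece from dst, then add the moving piece on dst.
    h ^= ZOBRIST[piece][src]
    if captured >= 0:
        h ^= ZOBRIST[captured][dst]
    h ^= ZOBRIST[piece][dst]
    return h


board = [-1] * N_SQUARES
board[12] = 0  # hypothetical white pawn
board[28] = 7  # hypothetical black knight
h0 = full_hash(board)

# The pawn captures the knight; update the hash incrementally.
h1 = update_hash(h0, piece=0, src=12, dst=28, captured=7)
board[12] = -1
board[28] = 0
print(h1 == full_hash(board))
```

The resulting integer can be used directly as a table key, which is what removes the array-to-string conversion.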

I am still longing for those 100x / 1000x speedups I am used to from numba :slight_smile:

Many thanks again!

is it not possible to remove the dictionaries? that would create a massive speed improvement. if that’s not possible, then not converting to string would be the second best alternative.

Something else to look into is the heavy use of yield, @stuartarchibald would you say using yield in jitted code is efficient? any pitfalls?

cheers,
Luk

Thanks @luk-f-a !
In the zobrist version I just posted I got rid of many of the dictionaries, but two still remain (tp_score, tp_move). These are harder to remove. Could do it, but then I lose algorithmic performance. Might be worth it though. I will think about it.
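One common way to replace a dictionary like this is a fixed-size, array-backed table indexed by the low bits of an integer hash. The sketch below is an assumption about how that could look, not the thread's code: the names `tt_store` and `tt_lookup`, the table size and the always-replace policy are all hypothetical. The algorithmic cost mentioned above shows up as collisions, where a new entry evicts an older one that maps to the same slot.

```python
TABLE_SIZE = 1 << 16  # power of two, so masking acts as a cheap modulo
EMPTY = -1            # hashes are assumed non-negative, so -1 marks a free slot

keys = [EMPTY] * TABLE_SIZE
scores = [0] * TABLE_SIZE


def tt_store(key, score):
    # Always-replace policy: a colliding entry simply overwrites the old one.
    slot = key & (TABLE_SIZE - 1)
    keys[slot] = key
    scores[slot] = score


def tt_lookup(key):
    # Return (found, score); the stored key guards against slot collisions.
    slot = key & (TABLE_SIZE - 1)
    if keys[slot] == key:
        return True, scores[slot]
    return False, 0


tt_store(123456789, 42)
hit = tt_lookup(123456789)
miss = tt_lookup(987654321)

# A second key mapping to the same slot evicts the first entry.
tt_store(123456789 + TABLE_SIZE, 7)
evicted = tt_lookup(123456789)
print(hit, miss, evicted)
```

In jitted code the two lists would presumably become preallocated NumPy arrays passed into the search functions.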

I can definitely get rid of yield if you think that might be a good idea.
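If it helps, this is roughly what removing yield could look like: the generator is replaced by a function that writes results into a caller-supplied buffer and returns how many entries it filled. Both functions are hypothetical simplifications (they ignore board-edge wrap-around), and whether the buffer version is actually faster in jitted code would need to be benchmarked.

```python
KNIGHT_OFFSETS = (-17, -15, -10, -6, 6, 10, 15, 17)


def gen_targets(square, offsets):
    # Generator version: yields each target square that is on the board.
    for offset in offsets:
        target = square + offset
        if 0 <= target < 64:
            yield target


def fill_targets(square, offsets, out):
    # Buffer version: writes targets into `out` and returns the count,
    # so no generator object is needed.
    n = 0
    for offset in offsets:
        target = square + offset
        if 0 <= target < 64:
            out[n] = target
            n += 1
    return n


buffer = [0] * len(KNIGHT_OFFSETS)  # allocated once and reused between calls
count = fill_targets(1, KNIGHT_OFFSETS, buffer)
from_generator = list(gen_targets(1, KNIGHT_OFFSETS))
print(buffer[:count], from_generator)
```

The caller then loops over `buffer[:count]` instead of iterating the generator.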

@luk-f-a but most of the 6x speedup might actually come from just doing numerical evaluations instead of string operations. I don’t seem to gain much from numba :frowning:

Have you got any idea of a method to profile numba-compiled code? I have not been able to find any good resources for this. Thanks again :smiley:

there’s no native way to do it. you can hack an approximation with this recipe: Profiling — Numba-how-to documentation

however, it has an overhead per function call, and you have so many function calls that I’m not sure it will be useful for you. That overhead might distort the results too much.

Using yield can make the code harder to optimise, but LLVM can sometimes “see” through it. For this sort of application, finding the hot spots and using benchmarks to assess improvements seems like it could be an effective approach.

Thanks @stuartarchibald. Any recommendations on a benchmarking/profiling approach? The method suggested by @luk-f-a, as he also points out, doesn’t work well due to the many function calls.

@julius I’m afraid function-level and line-level profiling are not really supported in Numba yet. This issue is tracking it: Feature request: line profiler/performance profiler · Issue #5028 · numba/numba · GitHub. Perhaps subscribe to that?

For now, your options are probably to collect individual function execution times and obtain information about the number of invocations and infer where there’s potential for optimisation from that. The info @luk-f-a referenced above might help with the invocation count part.
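A minimal sketch of collecting execution times and invocation counts by hand, using only the standard library. The decorator name `instrumented` and the example function are hypothetical. Note the assumption that the wrapper sits on the Python side: it only sees calls made from interpreted code, not calls between jitted functions, and it adds its own overhead to every call.

```python
import time
from collections import defaultdict
from functools import wraps

call_counts = defaultdict(int)
total_seconds = defaultdict(float)


def instrumented(func):
    # Record how often `func` is called and the cumulative time spent in it.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_counts[func.__name__] += 1
            total_seconds[func.__name__] += time.perf_counter() - start
    return wrapper


@instrumented
def evaluate(position):
    # Hypothetical stand-in for a function called from the Python-level search loop.
    return sum(position)


results = [evaluate([1, 2, 3]) for _ in range(3)]

for name in call_counts:
    print(f"{name}: {call_counts[name]} calls, {total_seconds[name]:.6f} s total")
```

Dividing the total time by the call count gives a rough per-call cost for deciding which function to optimise first.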