Performance question / issue on particular problem

Hi all,

I have a question rather than an issue - I was hoping that numba would make my code much faster than it actually does, but maybe my expectations are unrealistic (or I did something wrong). It all runs fine, it's just slower than I'd like.
A bit of background:
I am co-developing a python package for whole-brain simulations and am currently working on heterogeneous modelling. Typically, a brain network has something like ~80 “nodes” with internal dynamics, plus a coupling between them. This means that you usually write your code like this:

@njit()
def integrate(y, time, nodes):
    for t in time:
        # resolve all coupling here
        for n in nodes:
            # compute internal dynamics for each node
            ...

However, the whole idea of heterogeneity is that each node can have different internal dynamics, hence you cannot use the inner for loop. The idea I came up with is: each node defines its dynamics as symbolic mathematics (think sympy) in a list (each equation is one symbolic expression), and when integrating, all nodes' equations are simply collected into one rather big list of symbolic objects. What I then did was prepare a string that looks like an Euler integration scheme:

NUMBA_EULER_TEMPLATE = """
def integrate(dt, n, max_delay, t_max, y0, input_y):
    y = np.empty((n, t_max + max_delay + 1))
    y[:] = np.nan
    y[:, :max_delay + 1] = y0
    for i in range(1, t_max + 1):
        dy = np.array({dy_eqs})
        y[:, max_delay + i] = y[:, max_delay + i - 1] + dt*dy

    return y[:, max_delay + 1:]
"""

and finally, {dy_eqs} gets formatted using the list of symbolic equations - you literally print them into this string using sympy, hence you get something like:

def integrate(dt, n, max_delay, t_max, y0, input_y):
    y = np.empty((n, t_max + max_delay + 1))
    y[:] = np.nan
    y[:, :max_delay + 1] = y0
    for i in range(1, t_max + 1):
        dy = np.array([1.0 + 0.508152554938578*(-y[0, max_delay + i - 1] + y[2, max_delay + i - 1]) + 4.0*y[0, max_delay + i - 1]**2 - 3.0*y[0, max_delay + i - 1]**3 - 1.5*y[0, max_delay + i - 1] - y[1, max_delay + i - 1] + input_y[0, i], 0.0 + 0.05*(y[0, max_delay + i - 1] - 0.5*y[1, max_delay + i - 1]) + input_y[1, i], 1.0 + 0.219292220770658*(y[0, max_delay + i - 1] - y[2, max_delay + i - 1]) + 4.0*y[2, max_delay + i - 1]**2 - 3.0*y[2, max_delay + i - 1]**3 - 1.5*y[2, max_delay + i - 1] - y[3, max_delay + i - 1] + input_y[2, i], 0.0 + 0.05*(y[2, max_delay + i - 1] - 0.5*y[3, max_delay + i - 1]) + input_y[3, i]])
        y[:, max_delay + i] = y[:, max_delay + i - 1] + dt*dy

    return y[:, max_delay + 1:]

(The above example is with 2 nodes; of course, when I go to 80 the definition is much longer, but it is still just basic mathematical operations - the hardest thing is an exponential.) This function is then compiled using:

numba.njit(<signature here>, nopython=True)(FunctionType(numba_string))
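
For reference, a minimal sketch of one way such a generated source string can be turned into a jitted function (assuming numba_string holds the source produced from the template above; the actual package code may differ):

import numpy as np
from numba import njit

# numba_string holds the generated integrator source produced from the template above
namespace = {"np": np}
exec(compile(numba_string, "<generated-integrator>", "exec"), namespace)
integrate = njit(namespace["integrate"])  # an explicit signature can optionally be passed to njit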

It works all right, but it is very slow. Now, my question is why it is slow: everything inside the function is already defined (np.array, int, float) and, moreover, I am providing a proper function signature, so numba knows all the types. When I do %prun in jupyter, the functions with the longest times are: ffi.py:111(__call__), {built-in method builtins.isinstance}, ir.py:313(_rec_list_vars), os.py:676(__getitem__), {built-in method _abc._abc_instancecheck}, etc. In short, 9 of the 10 longest function calls are within numba - if I am not mistaken. My question is whether I am doing something wrong or suboptimal here, or whether this slowness is to be expected when a single integration step has such a long definition.
Thank you very much!

Hi @jajcayn,

This looks like an interesting problem. Just to check, are you measuring the compilation time or the run time, or both? Thanks.

Hi @stuartarchibald,

thanks for your interest. At this point I just run

%prun mm.run()

where the mm object is a class built on top of the scheme I was describing above. Behind the scenes it forms the initial vector (y0 in my numba function), pre-integrates the noise (the input_y parameter), etc. In the end it returns an array with the results (the integrated y from my function).
The results of %prun are:

36279192 function calls (34047145 primitive calls) in 32.482 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1801    3.694    0.002    3.694    0.002 ffi.py:111(__call__)
1436892/223312    1.715    0.000    2.512    0.000 ir.py:313(_rec_list_vars)
8297092/8297038    1.591    0.000    2.379    0.000 {built-in method builtins.isinstance}
   121905    0.910    0.000    1.698    0.000 instructions.py:13(__init__)
734565/379397    0.677    0.000    2.075    0.000 {method 'format' of 'str' objects}
   964930    0.517    0.000    0.528    0.000 {built-in method _abc._abc_instancecheck}
    19374    0.506    0.000    1.742    0.000 numpy_support.py:349(ufunc_find_matching_loop)
    12977    0.482    0.000    8.242    0.001 functions.py:280(get_call_type)
    83541    0.327    0.000    1.228    0.000 context.py:528(_rate_arguments)
222464/222460    0.313    0.000    0.317    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
327671/327344    0.305    0.000    0.440    0.000 abstract.py:120(__eq__)
396376/170437    0.304    0.000    1.818    0.000 _utils.py:44(__str__)
       44    0.303    0.007    2.015    0.046 analysis.py:23(compute_use_defs)
421176/270125    0.303    0.000    0.640    0.000 abstract.py:117(__hash__)
    19374    0.294    0.000    0.550    0.000 npydecl.py:24(_handle_inputs)
     7580    0.280    0.000    0.369    0.000 instructions.py:651(__init__)
    78346    0.268    0.000    1.135    0.000 ansitowin32.py:73(__init__)
   964930    0.263    0.000    0.791    0.000 abc.py:137(__instancecheck__)
675032/670609    0.244    0.000    0.265    0.000 {built-in method builtins.getattr}
   122060    0.243    0.000    1.525    0.000 values.py:219(_to_string)
   222672    0.241    0.000    0.663    0.000 numpy_support.py:331(ufunc_can_cast)
   134348    0.237    0.000    0.546    0.000 typeconv.py:43(check_compatible)
291165/291138    0.234    0.000    0.481    0.000 _utils.py:54(get_reference)

As I was saying, the significantly long calls are related to numba. Should I maybe profile my code by some other means? I don't know, maybe this is expected :) However, when I use the first method (non-heterogeneous network = identical nodes) with the two loops, the same code runs in a few seconds instead of 32.

Hi @jajcayn,

It looks like you are timing compilation along with execution, is this intended? Numba is a Just-In-Time (JIT) compiler: it compiles on the first call made to a function and then runs the compiled code; so long as the argument types are the same on the next call, it just runs the compiled version. If you are new to Numba, perhaps take a look at the 5 minute guide, which also contains some hints on timings etc.

Once the expectation over timings is pinned down, it'll be easier to make some suggestions.
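
For example, a generic way to separate the two is to time a first call (which includes compilation) separately from a second call with identical argument types - a minimal sketch, not specific to your code:

import time
import numpy as np
from numba import njit

@njit
def demo(x):
    return (x * x).sum()  # stand-in for the real integrator

arr = np.ones(1000)

t0 = time.perf_counter()
demo(arr)  # first call: compiles, then runs
t1 = time.perf_counter()
demo(arr)  # second call: runs the already-compiled code
t2 = time.perf_counter()
print("compile + run:", t1 - t0)
print("run only:     ", t2 - t1)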

Thanks :slight_smile:

Hi @stuartarchibald,

thanks for the response. prun was my first choice because I can see individual function calls (at first I thought that the boilerplate around the integration was actually causing the overhead). When I run the same code multiple times (so it is compiled the first time and then just run from the compiled code), the time difference is minimal - the compilation is fast, the runtime is not. By the way, the %prun result in the post above was measured on the second run of the function, i.e. it should already have been compiled.

Thanks

hi @jajcayn , a few observations based on my experience:

  • don't pass signatures. Numba can figure out the types better than the user. This is the opposite of Cython; there's no upside in writing the signatures yourself.
  • replace the vectorized expressions like
    for i in range(1, t_max + 1):
        dy = np.array({dy_eqs})
        y[:, max_delay + i] = y[:, max_delay + i - 1] + dt*dy

with a loop

for j in range(....):
    for i in range(1, t_max + 1):
        dy = {dy_eqs}
        y[j, max_delay + i] = y[j, max_delay + i - 1] + dt*dy

I think that np.array() is probably slow.

Something you should take into account is that if you are generating those text functions every time you execute the calculations, then the compilation is being done over and over. Compiling takes time, and if you don’t save the functions you generated, then you’re wasting the compilation time. I think this might be happening because I see _rec_list_vars in the %prun list which looks like the type of internal function Numba uses for compilation (what do you think @stuartarchibald?).

hope this helps

Luk


Hey @luk-f-a,

thanks for chiming in! I’ve just tried various possibilities:

  • signature vs. no signature
  • np.array vs. no np.array
  • loop over j vs. numpy vectorized
  • init empty y vector inside numba vs. outside and just pass to numba

all of them lead to very very similar compilation + run times (the full run), so nothing to do there I guess…

Also, I'm not sure I understand your point about compilation - let's say the model definition (i.e. the sympy-generated text) does not change. However, the way it works now is that when you call run() on your model, the equations are printed into the template and then this function is compiled and wrapped with njit(). Does that mean it is compiled every time? I know numba caches compilation results, but I do not know the internals, hence I have no idea how it works in my case - the python function itself is compiled from the string every time, so I guess numba also compiles every time? Is it somehow possible to tell numba that it has already compiled this code (since the model definition has not changed)? In other words, can I force caching?
Thanks a lot!

hey @jajcayn , what I meant about compilation is the following:

from numba import njit

def make_function():
    ns = {}
    exec("def foo():\n    return 1", ns, ns)
    return njit(ns['foo'])

def my_calculations():
    my_fc = make_function()
    results = my_fc()
    return results

The code above generates the function (in make_function) and executes the function as part of one block. Using prun on my_calculations will always include the compilation time, because the function is generated fresh every time so it must be compiled again. The function objects are not recognized as being the same (even if the code is the same) so there is no caching. In the future, this will improve with this PR https://github.com/numba/numba/pull/6406, but this is how it is at the moment.

If you need to run my_calculations many times you have two options:

  • if those executions are sequential, as part of one single python execution, you can do
def my_calculations():
    my_fc = make_function()
    results = my_fc(args)
    results2 = my_fc(args2)
    results3 = my_fc(args3)
    return results

or

def main():
    my_fc = make_function()
    results = my_calculations(my_fc, args)
    results2 = my_calculations(my_fc, args2)
    results3 = my_calculations(my_fc, args3)

In these cases the function will be compiled only once, not 3 times.

  • If you need to use the function across python executions, you can save it in a text file and then import it the usual way (see the sketch below).
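
A minimal sketch of that second option (the file and module names are illustrative, and it assumes numba_string holds the generated source):

import importlib
from pathlib import Path

# write the generated source into a real module file so it can be imported normally
Path("generated_model.py").write_text(
    "import numpy as np\n"
    "from numba import njit\n\n"
    "@njit(cache=True)\n"  # cache=True lets Numba reuse the compiled result across runs
    + numba_string.lstrip()
)

generated_model = importlib.import_module("generated_model")  # assumes the cwd is on sys.path
integrate = generated_model.integrate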

cheers,
Luk

Hi,

My assessment of this is approximately the same as @luk-f-a's: compilation is being included in the execution time because there are exec'd strings involved and Numba sees these as new functions every time. I imagine the execution-to-compilation time ratio for your function is heavily skewed towards compilation at present.

Questions:

  1. Could you create a small self contained reproducing example? That would make it easier to help with this.
  2. Is dy scalar? Looks like it might be a float?
  3. What influences the “dynamics” expression? Is it the number of nodes or some property of the nodes too? If it’s both, can they be separated into an expression over the “number of nodes” with variables left unresolved for the property part, e.g. in the above:
    alpha = 0.508152554938578
    [1.0 + alpha*(-y[0, max_delay + i - 1] + y[2, max_delay + i - 1]) + 4.0*y[0, max_delay + i - 1]**2 ...
    
  4. Are the “dynamic” expressions actually constant expressions that will likely never/rarely change?

Hey @stuartarchibald,

understood, that makes sense. I thought (wrongly) that if the string itself is not changing, the compilation is done only once… As per your questions:

  1. I’ll try
  2. dy is, in my initial implementation, a list of length n (the system size) which I cast to a 1D array… then, based on @luk-f-a's suggestions, I removed the casting to an array since he mentioned it might be slow, so currently it is a list in which each item is one symbolic expression of the dynamics (practically the right-hand side of a delayed differential equation)
  3. it's both, actually. The number of nodes changes the length of the list and the expressions as well (since I have coupling, adding just one node as a rule of thumb changes all equations, because that node has to be added to all the other couplings). The properties of the nodes change too (those are the parameters of the brain models). Typically you would set up all the parameters for one simulation run, but then you usually want to explore the parameter space, so you run your model many times with different sets of parameters (in my current implementation this means recompilation every time, even if only one parameter has changed)

Now I think the way to go would be:

  • cache the compiled function so that the string is not compiled again and again on every call to run(), which causes numba to compile the function every time [this would be pretty easy to achieve]
  • make all model parameters also function parameters of my string function - together with caching, this would result in compilation happening only once, and then numba won't compile again even if intrinsic parameters change… however, there is a problem: some brain models have approx. 10-15 parameters per node (!), and in theory each node can have a different set of parameters, so in total a large simulation would have something like 80 * 15 = 1200 function parameters… Now, since I am doing everything programmatically, this is easily done with loops and list comprehensions. However, I have no idea about other potential drawbacks - is it ok to have a python function with ~1000 parameters? Will this work in numba?

Thanks!

short update:
you were right about the compiling!
I adapted the code; for a 20-node system (40 DDEs), the first run with compilation takes around 30 seconds, while running from the cache afterwards takes less than a second…

Now the only thing left is to adapt the code so that I pass all the parameters to the numba-compiled function, so that when I change a parameter the model is not compiled again…
So one question remains: is it ok to programmatically create functions with ~1000 parameters? Will it be a problem for numba?

I’ve never tried creating such a large function. I wouldn’t be surprised if the compilation time suffers a bit or even a lot. Could you group some inputs into an array, named tuple or structured array?
Stuart will know better what is more or less time consuming for something so large.

Are these “parameters” just float constants like alpha in my post above (Performance question / issue on particular problem - #9 by stuartarchibald)?

yes, each parameter is just one float… I can group them into lists, namedtuples, dicts, arrays, almost anything. Since I am doing everything programmatically, this shouldn't be a problem. My go-to would be dicts, but I was playing around with dicts in numba a while ago and the performance wasn't ideal. What would be the ideal way to pass such a high number of floats?
Thanks!

The fastest would probably be a single NumPy array of floats, but this also comes with the cost of it not being structured (like you can’t say data_instance.alpha to get alpha).


thanks! By structured you mean something like a namedtuple? Wouldn't passing a namedtuple then be faster?

Yes, by structured I mean something like a namedtuple, and the cost I am referring to here is the “human/developer cost”, in that without structure the data is just a bunch of floats that are addressed by numerical index. Passing a namedtuple across a function boundary is in general more computationally expensive than passing an array.
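
A minimal illustration of the two options (the parameter names alpha and beta are made up):

from collections import namedtuple

import numpy as np
from numba import njit

Params = namedtuple("Params", ["alpha", "beta"])

@njit
def step_array(y, p):
    # parameters addressed by position in a plain float array
    return p[0] * y + p[1]

@njit
def step_namedtuple(y, p):
    # parameters addressed by name; slightly more costly to pass across the function boundary
    return p.alpha * y + p.beta

y = np.ones(4)
step_array(y, np.array([0.5, 1.0]))
step_namedtuple(y, Params(alpha=0.5, beta=1.0))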

oh yeah, I get it now. So the computationally cheapest option is a numpy array of floats, it's just hard for me to write it with indices up to 1000. A namedtuple would be easy for me to write, it's just computationally more expensive…
Thanks a lot for all your help!

Perhaps go for making something that is correct and easy to debug, add a load of unit/regression tests so you know you are getting the right answers, and then do a more complicated refactor to using a NumPy array? It might turn out that a namedtuple/dictionary/structured array/whatever you choose, is sufficient for your needs and provides a good balance between performance and maintainability.

in the end it's not that hard with arrays… so imagine I have a string with all the symbolic equations that I will put into my numba integrator template (let's call it derivatives_str). Then I have a list of all parameters and I can simply do:

for i, param in enumerate(param_list):
    derivatives_str = derivatives_str.replace(param, f"param[{i}]")

and then I simply make param a function parameter, which will be a np.array of the same length as my param_list.
Since lists are always iterated in the same order, this should just work :slight_smile:
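
For completeness, a sketch of how the matching array could be built in the same order (param_values is a hypothetical mapping from parameter name to value, and the extra param argument on integrate is assumed):

import numpy as np

# build the parameter vector in exactly the same order used for the string replacement above
param = np.array([param_values[name] for name in param_list], dtype=np.float64)

# param is then passed as an extra argument to the generated integrate() function,
# so changing a value only means rebuilding this array, not recompiling
result = integrate(dt, n, max_delay, t_max, y0, input_y, param)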