`guvectorize`: No performance difference between targets `cpu` and `parallel`? `cuda` even slower. `vectorize` faster?

Long story short: Am I using guvectorize wrong?

I have just benchmarked a relatively simple workload for both vectorize and guvectorize, for all targets (cpu, parallel and cuda) and for different sizes of input arrays.

On the vectorize side, things are as expected: for sufficiently large input arrays, I can saturate all my cores with target parallel, and performance scales accordingly. cuda is a little faster still than all the CPU cores combined, again as expected.

guvectorize for targets cpu and parallel is basically as fast as vectorize for target cpu. It does not benefit from multiple cores, with or without the use of prange. Even more interestingly, when I switch guvectorize to target cuda, it becomes more than a solid order of magnitude slower than a single CPU core?

Test workload looks somewhat like this:

from math import cos, sin

import numba as nb
from numba import guvectorize, jit, vectorize

COMPLEXITY = 2 ** 11

@jit(*args, **kwargs)
def helper(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
            res -= cos(idx)
    return res

@vectorize(*args, **kwargs)
def v_main(d: float) -> float:
    return helper(d)

@guvectorize(*args, **kwargs)
def gu_main(d, r):
    for idx in range(d.shape[0]):  # also benchmarked with nb.prange instead of range
        r[idx] = helper(d[idx])

Full self-contained notebook with all implementations, benchmark and plots of the results.

Summary: I did not understand how to use it correctly …

Assuming that I have something along the following lines:

@numba.guvectorize('void(f8[:],f8[:])', '(n)->(n)', target = 'parallel')
def gu_main(d, r):
    for idx in range(d.shape[0]):
        r[idx] = helper(d[idx])

My mistake was to give it an array of ndim == 1, i.e.

numpy.arange(0, size, dtype = 'f8')

If I, on the other hand, give it the following array with ndim > 1

numpy.arange(0, size, dtype = 'f8').reshape(size, 1)

… the parallelization kicks in just as expected for both targets parallel and cuda.

Now I am beginning to understand.

Notebook with complete test

Absolutely not the most intuitive thing to do, but …

@numba.guvectorize('void(f8[:],f8[:])', '()->()', target = 'parallel')
def gu_main(d, r):
    r[0] = helper(d[0])

… scales just fine for input arrays of ndim == 1. For reference, see numba#2935.

Whatever loops one places inside a function decorated with guvectorize do not get parallelized. That's key. So my workaround simply restricts the loop in the original function to a single iteration, allowing the parallelization over the other dimensions to kick in. Before, there was nothing to parallelize: everything was handled by a sequential for loop inside the function, which perfectly explains the bad benchmark results I initially observed.

Perhaps this could be made a bit more clear in the documentation … ?

The warning that Numba emitted was also pointing to this:

/home/ernst/Desktop/PROJEKTE/prj.TST2/github.poliastro/env310/lib/python3.10/site-packages/numba/cuda/dispatcher.py:488: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.

Perhaps for gufuncs this warning could also point a bit more specifically to the relationship between grid size and the input shape.

I think so - would you be happy to create a PR to add a short section to the gufunc documentation that would help to make this clear please? (Perhaps as a subsection of the "Creating NumPy universal functions" page of the Numba documentation.)

If I can find the right words, I will. I just got a numba dev environment working and sent a little PR (on another subject) your way.
