`guvectorize`: No performance difference between targets `cpu` and `parallel`? `cuda` even slower. `vectorize` faster?

Long story short: Am I using `guvectorize` wrong?


I have just benchmarked a relatively simple workload with both `vectorize` and `guvectorize`, across all targets (`cpu`, `parallel` and `cuda`) and across different input array sizes.

On the `vectorize` side, things are as expected: for sufficiently large input arrays, I can saturate all my cores with target `parallel`, and performance scales accordingly. `cuda` is yet a little faster than all the CPU cores together, again as expected.

`guvectorize` for targets `cpu` and `parallel` is basically as fast as `vectorize` for target `cpu`. It does not benefit from multiple cores, with or without the use of `prange`? Even more interestingly, when I switch `guvectorize` to target `cuda`, it gets more than a solid order of magnitude slower than a single CPU core?


The test workload looks roughly like this:

from math import cos, sin

import numba as nb
from numba import guvectorize, jit, vectorize

COMPLEXITY = 2 ** 11

@jit(*args, **kwargs)  # placeholder arguments; they vary per benchmarked target
def helper(scalar: float) -> float:
    res: float = 0.0
    for idx in range(COMPLEXITY):
        if idx % 2 == round(scalar) % 2:
            res += sin(idx)
        else:
            res -= cos(idx)
    return res

@vectorize(*args, **kwargs)
def v_main(d: float) -> float:
    return helper(d)

@guvectorize(*args, **kwargs)
def gu_main(d, r):
    for idx in range(d.shape[0]):  # also tried nb.prange here; made no difference
        r[idx] = helper(d[idx])

Full self-contained notebook with all implementations, benchmark and plots of the results.
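To make the placeholders above concrete: for the `parallel` runs, the decorators were instantiated roughly as in the comments below, and timed with a plain `timeit` loop. This is a minimal sketch; the concrete sizes, repeat counts and per-target decorator arguments are in the notebook.

import timeit

import numpy as np

# assumes helper, v_main and gu_main from above were compiled with e.g.
#   @vectorize('f8(f8)', target='parallel')
#   @guvectorize('void(f8[:], f8[:])', '(n)->(n)', target='parallel')
for size in (2 ** e for e in range(10, 24, 2)):  # illustrative sizes
    data = np.arange(0, size, dtype='f8')
    t_v = timeit.timeit(lambda: v_main(data), number=10)
    t_gu = timeit.timeit(lambda: gu_main(data), number=10)
    print(f'size={size}: vectorize {t_v:.4f}s, guvectorize {t_gu:.4f}s')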

Summary: I did not understand how to use it correctly …


Assuming that I have something along the following lines:

import numba

@numba.guvectorize('void(f8[:], f8[:])', '(n)->(n)', target='parallel')
def gu_main(d, r):
    for idx in range(d.shape[0]):
        r[idx] = helper(d[idx])

My mistake was to give it an array with ndim == 1, i.e.

numpy.arange(0, size, dtype='f8')

If, on the other hand, I give it the following array with ndim > 1

numpy.arange(0, size, dtype='f8').reshape(size, 1)

… the parallelization kicks in just as expected for both targets `parallel` and `cuda`.
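Why this helps, as far as I understand it: with layout `(n)->(n)`, Numba only parallelizes across the dimensions outside the core dimension `n`, so the input shape decides how many independent gufunc calls there are to distribute. A sketch with illustrative shapes:

import numpy as np

size = 2 ** 20  # illustrative

a1 = np.arange(0, size, dtype='f8')                   # shape (size,): ONE core slice with n == size
a2 = np.arange(0, size, dtype='f8').reshape(size, 1)  # shape (size, 1): size core slices with n == 1

r1 = gu_main(a1)  # a single gufunc call; the loop inside gu_main runs sequentially
r2 = gu_main(a2)  # size independent calls, which Numba can spread across cores / CUDA threads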

Now I am beginning to understand.


Notebook with complete test

Absolutely not the most intuitive thing to do, but …

@numba.guvectorize('void(f8[:], f8[:])', '()->()', target='parallel')
def gu_main(d, r):
    r[0] = helper(d[0])

… scales just fine for input arrays of ndim == 1. For reference, see numba#2935.
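With the `()->()` layout, a plain 1-d input already decomposes into one gufunc call per element, so no reshape trick is needed. A sketch, assuming the decorator above:

import numpy as np

a = np.arange(0, 2 ** 20, dtype='f8')  # plain ndim == 1 input
r = gu_main(a)  # one gufunc call per element, distributed across cores by target='parallel'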

Whatever loops one places inside a function decorated with `guvectorize` do not get parallelized; that's the key point. Parallelization only happens across the dimensions outside the declared core dimensions. So my workaround simply restricts the loop in my initial function to one iteration, allowing the parallelization across the other dimensions to kick in. Before, there was nothing to parallelize: everything was handled by a sequential for loop inside the function, which perfectly explains the bad benchmark results I initially observed.

Perhaps this could be made a bit clearer in the documentation … ?

The warning that Numba emitted also pointed to this:

/home/ernst/Desktop/PROJEKTE/prj.TST2/github.poliastro/env310/lib/python3.10/site-packages/numba/cuda/dispatcher.py:488: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))

Perhaps, for gufuncs, this warning could also point a bit more specifically to the relationship between grid size and input shape.
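For illustration, with the `(n)->(n)` variant compiled for `target='cuda'`, the connection between input shape and grid size looks roughly like this (exact grid sizes depend on Numba's launch configuration):

import numpy as np

size = 2 ** 20  # illustrative

a1 = np.arange(0, size, dtype='f8')  # one core slice -> grid size 1 -> the warning above
a2 = a1.reshape(size, 1)             # size core slices -> the grid scales with size, no warning

gu_main(a1)
gu_main(a2)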

I think so - would you be happy to create a PR adding a short section to the gufunc documentation that would help make this clear, please? (Perhaps as a subsection of the "Creating NumPy universal functions" page in the Numba documentation.)

If I can find the right words, I will. I just got a Numba dev environment working and sent a small PR (on another subject) your way.
