Why does @njit(parallel=True) seem to be faster than @vectorize(target='parallel')?

Hi!

I'm new to Numba and I would like to use it to write code that can run both on multiple CPU cores and on GPUs.

In particular, I'm currently learning the basic differences between the @njit and @vectorize decorators, but I have trouble understanding their performance difference.
I wrote the same Python function twice, once with @njit(parallel=True) and once with @vectorize(..., target='parallel').
It turned out that the former is much faster than the latter.
The code is reported below.

import numba as nb
import numpy as np
from numba import float64

@nb.njit
def logdiffexp(x, y):
    return x + np.log1p(-np.exp(y - x))
    
@nb.njit
def _logTPL(x, alpha, mmin, mmax):
    log_norm_cost = -np.log(alpha - 1) + logdiffexp((1 - alpha) * np.log(mmin), (1 - alpha) * np.log(mmax))
    if (mmin < x) and (x < mmax):
        result = -alpha * np.log(x) - log_norm_cost
    else:
        result = -np.inf
    return result

@nb.njit
def _logSmoothing(m, delta_m, ml):
    if m <= ml:
        result = -np.inf
    elif m >= (ml + delta_m):
        result = 0.0
    else:
        result = -np.logaddexp(0.0, (delta_m / (m - ml) + delta_m / (m - ml - delta_m)))
    return result

@nb.njit
def _logPLm2(m2, beta, ml):
    return beta * np.log(m2) if m2 >= ml else -np.inf

@nb.njit
def _logC_PL(m1, beta, ml):
    return np.log((1 + beta) / (m1**(1 + beta) - ml**(1 + beta)))

@nb.njit(parallel=True)
def log_PL(m1, m2, alpha, beta, ml, mh):
    result = np.empty_like(m1)
    for i in nb.prange(len(m1)):
        if ml < m2[i] < m1[i] < mh:
            result[i] = _logTPL(m1[i], alpha, ml, mh) + _logPLm2(m2[i], beta, ml) + _logC_PL(m1[i], beta, ml)
        else:
            result[i] = -np.inf
    return result

@nb.vectorize([float64(float64, float64, float64, float64, float64, float64)], target='parallel')
def log_PLvec(m1, m2, alpha, beta, ml, mh):
    if ml < m2 < m1 < mh:
        result = _logTPL(m1, alpha, ml, mh) + _logPLm2(m2, beta, ml) + _logC_PL(m1, beta, ml)
    else:
        result = -np.inf
    return result

The execution times are shown in the picture below.
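For anyone who wants to reproduce the comparison, this is roughly how I timed the two versions. The array size and parameter values here are illustrative placeholders, not the exact ones behind the picture, and the snippet reuses the imports and functions defined above.

import time

n = 10_000_000
rng = np.random.default_rng(0)
m1 = rng.uniform(5.0, 100.0, n)        # hypothetical primary masses
m2 = m1 * rng.uniform(0.1, 1.0, n)     # hypothetical secondary masses, m2 < m1

# warm-up calls so that JIT compilation time is excluded from the measurement
log_PL(m1, m2, 2.0, 1.5, 5.0, 100.0)
log_PLvec(m1, m2, 2.0, 1.5, 5.0, 100.0)

t0 = time.perf_counter()
log_PL(m1, m2, 2.0, 1.5, 5.0, 100.0)
t1 = time.perf_counter()
log_PLvec(m1, m2, 2.0, 1.5, 5.0, 100.0)
t2 = time.perf_counter()
print(f"@njit(parallel=True): {t1 - t0:.3f} s")
print(f"@vectorize(target='parallel'): {t2 - t1:.3f} s")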

Can someone explain to me why this happens?
I'm interested in the @vectorize decorator because, as far as I understand, it can take 'cuda' as a target and thus run on the GPU (am I wrong?).
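For context, this is the kind of CUDA version I have in mind. It is an untested sketch (I don't have it running yet): I'm assuming the helper functions would have to be rewritten as CUDA device functions and that math.* replaces np.* for scalar math on the device. Only two helpers are shown; _logTPL would need the same treatment, and the names with a _dev suffix are made up for this sketch.

import math
from numba import cuda, vectorize, float64

@cuda.jit(device=True)
def _logPLm2_dev(m2, beta, ml):
    # same power-law term as _logPLm2, using math.log for the CUDA target
    return beta * math.log(m2) if m2 >= ml else -math.inf

@cuda.jit(device=True)
def _logC_PL_dev(m1, beta, ml):
    # same normalisation constant as _logC_PL
    return math.log((1 + beta) / (m1**(1 + beta) - ml**(1 + beta)))

@vectorize([float64(float64, float64, float64, float64, float64, float64)], target='cuda')
def log_PLvec_cuda(m1, m2, alpha, beta, ml, mh):
    if ml < m2 < m1 < mh:
        # _logTPL would also need a device-function version; it is omitted in this sketch
        return _logPLm2_dev(m2, beta, ml) + _logC_PL_dev(m1, beta, ml)
    return -math.inf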