Trying to understand why Numba is 4x faster than C++ code on the same task

Hello,
I’m trying to understand why Numba is 4x faster than C++ (compiled with the -O3 flag) on the same algorithm.
Is there any way to look at Numba’s compiled code and see what optimizations it applies that the C++ version is missing?

Example:
The task is to find the minimum and maximum over each sliding window of size k in the arrays, then compute a value for each index.

Code in python-numba:

import numba
import numpy as np

@numba.njit(fastmath=True)
def get_value(d1: np.ndarray, d2: np.ndarray, k: int):
    n = len(d1)
    result = np.zeros(n)

    for i in range(k - 1, n):
        val_min = np.inf
        val_max = -np.inf
        # scan the k-element window ending at index i
        # (range(i - k, i) would start at j = -1 for the first window,
        # silently wrapping around to the last element)
        for j in range(i - k + 1, i + 1):
            val_max = max(val_max, d1[j])
            val_min = min(val_min, d2[j])

        result[i] = val_min / (val_max - val_min) * 100

    return result

Code in c++:

#include <cfloat>
#include <cmath>
#include <vector>

// returned by value: the original by-reference signature returned
// a dangling reference to the local vector
std::vector<double> get_value(const std::vector<double> &d1,
                              const std::vector<double> &d2, int k) {
    int n = static_cast<int>(d1.size());
    std::vector<double> result(n);

    for (int i = k - 1; i < n; ++i) {
        // -DBL_MAX, not DBL_MIN: DBL_MIN is the smallest *positive* double
        double val_max = -DBL_MAX;
        double val_min = DBL_MAX;

        // scan the k-element window ending at index i
        for (int j = i - k + 1; j <= i; ++j) {
            val_max = std::fmax(val_max, d1[j]);
            val_min = std::fmin(val_min, d2[j]);
        }
        result[i] = val_min / (val_max - val_min) * 100;
    }
    return result;
}

You can configure Numba to dump the LLVM IR it generates. I would suspect the std::vector bounds checking could be slowing it down, while that might be eliminated in Numba. What C++ compiler are you using?

You could test this by getting a pointer to the vector’s underlying storage with std::vector::data() and using that pointer directly, instead of operator[], to read and mutate the array elements. You would of course need to check the bounds at the start of the function.

Hello @nt-KeBugCheck,

Thank you for your reply and help.

what C++ compiler are you using?

I’ve been using Clang++ from llvm@14.

It’s really hard for me to dig into the LLVM IR. Is there any way to convert or reconstruct it into something closer to C++ code?

In addition, I’ve tried using raw pointers, but the performance is slower than with operator[] (6x slower than Numba). The vectors’ size (n) is about 10,000.

#include <cfloat>
#include <cmath>
#include <vector>

// returned by value; the by-reference signature would dangle
std::vector<double> get_value(const std::vector<double> &d1,
                              const std::vector<double> &d2, int k) {
    int n = static_cast<int>(d1.size());
    std::vector<double> result(n);
    double *result_pointer = result.data();
    const double *d1_pointer = d1.data();
    const double *d2_pointer = d2.data();

    for (int i = k - 1; i < n; ++i) {
        double val_max = -DBL_MAX;  // not DBL_MIN (smallest positive double)
        double val_min = DBL_MAX;

        // scan the k-element window ending at index i
        for (int j = i - k + 1; j <= i; ++j) {
            val_max = std::fmax(val_max, d1_pointer[j]);
            val_min = std::fmin(val_min, d2_pointer[j]);
        }
        result_pointer[i] = val_min / (val_max - val_min) * 100;
    }
    return result;
}

Apart from the suggestions above, it may also make sense to try equivalent compilation flags
(-O3 -ffast-math -march=native).

It turned out that there was a mistake in my CMake file: the -O3 flag had never been passed to the compiler. The C++ -O3 performance is now as fast as Numba’s!
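For anyone hitting the same issue, a minimal sketch of a CMake setup that makes sure the optimization flags actually reach the compiler (target and file names here are placeholders):

```cmake
cmake_minimum_required(VERSION 3.16)
project(sliding_window CXX)

# With single-config generators, an unset CMAKE_BUILD_TYPE passes
# no -O flag at all, so the code silently compiles at -O0.
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)  # Release implies -O3 with Clang/GCC
endif()

add_executable(bench main.cpp)

# Or attach the flags explicitly to match the Numba settings:
target_compile_options(bench PRIVATE -O3 -ffast-math -march=native)
```

Checking the actual compiler invocations (e.g. building with `make VERBOSE=1` or inspecting compile_commands.json) is a quick way to confirm the flags were applied.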

I’m really sorry and thank you for your help @nt-KeBugCheck and @max9111 .