# Trying to understand why Numba is 4x faster than C++ code on the same task

Hello,
I’m trying to understand why Numba is 4x faster than C++ (compiled with the -O3 flag) on the same algorithm.
Is there any way to look into Numba’s compiled code and see which optimizations it applies that my C++ build does not?

Example:
The task is to find the minimum and maximum over each sliding window of size k in the arrays, then perform a calculation for each index.

Code in Python/Numba:

```python
import numba
import numpy as np


@numba.njit(fastmath=True)
def get_value(d1: np.ndarray, d2: np.ndarray, k: int):
    n = len(d1)
    result = np.zeros(n)

    for i in range(k - 1, n):
        val_min = np.inf
        val_max = -np.inf
        # window of k samples ending at index i (inclusive)
        for j in range(i - k + 1, i + 1):
            val_max = max(val_max, d1[j])
            val_min = min(val_min, d2[j])

        result[i] = val_min / (val_max - val_min) * 100

    return result
```

Code in C++:

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

std::vector<double> get_value(const std::vector<double> &d1, const std::vector<double> &d2, int k) {
    int n = d1.size();
    std::vector<double> result(n);

    for (int i = k - 1; i < n; ++i) {
        // -DBL_MAX, not DBL_MIN: DBL_MIN is the smallest *positive* double
        double val_max = -DBL_MAX;
        double val_min = DBL_MAX;

        // window of k samples ending at index i (inclusive)
        for (int j = i - k + 1; j <= i; ++j) {
            val_max = fmax(val_max, d1[j]);
            val_min = fmin(val_min, d2[j]);
        }
        result[i] = val_min / (val_max - val_min) * 100;
    }
    // return by value: returning a reference to a local would dangle
    return result;
}
```
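As a sanity check, the same windowed computation can be written in vectorized plain NumPy (a sketch, not from the original post, assuming 1-D arrays of equal length n ≥ k and a window of k samples ending at each index i):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def get_value_ref(d1: np.ndarray, d2: np.ndarray, k: int) -> np.ndarray:
    """Plain-NumPy reference: max of d1 / min of d2 over each window of k samples."""
    n = len(d1)
    result = np.zeros(n)
    # sliding_window_view gives shape (n - k + 1, k); row m covers indices m .. m + k - 1,
    # so row i - k + 1 is the window ending at index i
    val_max = sliding_window_view(d1, k).max(axis=1)
    val_min = sliding_window_view(d2, k).min(axis=1)
    result[k - 1:] = val_min / (val_max - val_min) * 100
    return result
```

This is slower than the loop versions for large n per-element work, but it is handy for verifying that the Numba and C++ implementations agree.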

You can configure Numba to dump the LLVM IR it generates. I would suspect `std::vector` element-access overhead could be slowing the C++ version down, while anything comparable might be eliminated in Numba. What C++ compiler are you using?

You could test this by getting a pointer to the memory underlying the vector with `std::vector::data()` and using that pointer directly, instead of `operator[]`, to read and write the elements. Of course, you would then need to check the bounds at the start of the function.

Hello @nt-KeBugCheck,

> what C++ compiler are you using?

I’ve been using Clang++ from llvm@14.

It’s really hard for me to dig into the LLVM IR. Is there any way to convert or reconstruct it into C++ code?

In addition, I’ve tried using raw pointers, but the performance is even slower than with `operator[]` (6x slower than Numba). The vectors’ size (n) is about 10,000.

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

std::vector<double> get_value(const std::vector<double> &d1, const std::vector<double> &d2, int k) {
    int n = d1.size();
    std::vector<double> result(n);
    auto result_pointer = result.data();
    auto d1_pointer = d1.data();
    auto d2_pointer = d2.data();

    for (int i = k - 1; i < n; ++i) {
        double val_max = -DBL_MAX;
        double val_min = DBL_MAX;

        for (int j = i - k + 1; j <= i; ++j) {
            val_max = fmax(val_max, *(d1_pointer + j));
            val_min = fmin(val_min, *(d2_pointer + j));
        }
        *(result_pointer + i) = val_min / (val_max - val_min) * 100;
    }
    return result;
}
```

Apart from the suggestions above, it may also make sense to try compilation flags equivalent to Numba’s settings:
(`-O3 -ffast-math -march=native`)

It turned out that there was a mistake in my CMake file: the -O3 flag had never been passed to the compiler. The C++ -O3 performance is now as fast as Numba’s!
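For anyone hitting the same issue, one way to make sure optimization flags actually reach the compiler in CMake (a hypothetical sketch; the original CMakeLists is not shown, and `bench` is a made-up target name):

```cmake
# Default to a Release build so -O3 is passed to Clang/GCC
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)
endif()

# Optional extras roughly matching Numba's fastmath settings
target_compile_options(bench PRIVATE -ffast-math -march=native)
```

You can verify the actual command line the compiler receives with `cmake --build . -- VERBOSE=1` (or `make VERBOSE=1`).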

I’m really sorry, and thank you for your help, @nt-KeBugCheck and @max9111.