# Trying to understand why Numba is 4x faster than C++ code on the same task

Hello,
I’m trying to understand why Numba is 4x faster than C++ (compiled with the -O3 flag) on the same algorithm.
Is there any way to look into Numba’s compiled code and see which optimizations it applies that my C++ build does not?

Example:
The task is to find the minimum and maximum over each sliding window of size k in the arrays, then perform a calculation for each index.

Code in Python/Numba:

```python
import numba
import numpy as np


@numba.njit(fastmath=True)
def get_value(d1: np.ndarray, d2: np.ndarray, k: int):
    n = len(d1)
    result = np.zeros(n)

    for i in range(k - 1, n):
        val_min = np.inf
        val_max = -np.inf
        # window of k samples ending at index i (inclusive)
        for j in range(i - k + 1, i + 1):
            val_max = max(val_max, d1[j])
            val_min = min(val_min, d2[j])

        result[i] = val_min / (val_max - val_min) * 100

    return result
```

Code in C++:

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

std::vector<double> get_value(const std::vector<double> &d1, const std::vector<double> &d2, int k) {
    int n = d1.size();
    std::vector<double> result(n);

    for (int i = k - 1; i < n; ++i) {
        // -DBL_MAX, not DBL_MIN: DBL_MIN is the smallest *positive* double
        double val_max = -DBL_MAX;
        double val_min = DBL_MAX;

        // window of k samples ending at index i (inclusive)
        for (int j = i - k + 1; j <= i; ++j) {
            val_max = fmax(val_max, d1[j]);
            val_min = fmin(val_min, d2[j]);
        }
        result[i] = val_min / (val_max - val_min) * 100;
    }
    // return by value: returning a reference to a local would dangle
    return result;
}
```
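As a sanity check, the same windowed computation can be written in vectorized plain NumPy (a sketch, not from the original post, assuming 1-D arrays of equal length n ≥ k and a window of k samples ending at each index i):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def get_value_ref(d1: np.ndarray, d2: np.ndarray, k: int) -> np.ndarray:
    """Plain-NumPy reference: max of d1 / min of d2 over each window of k samples."""
    n = len(d1)
    result = np.zeros(n)
    # sliding_window_view gives shape (n - k + 1, k); row m covers indices m .. m + k - 1,
    # so row i - k + 1 is the window ending at index i
    val_max = sliding_window_view(d1, k).max(axis=1)
    val_min = sliding_window_view(d2, k).min(axis=1)
    result[k - 1:] = val_min / (val_max - val_min) * 100
    return result
```

This is slower than the loop versions for large n per-element work, but it is handy for verifying that the Numba and C++ implementations agree.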

You can configure Numba to dump the LLVM IR it generates. I would suspect `std::vector` element-access overhead could be slowing the C++ version down, while anything comparable might be eliminated in Numba. What C++ compiler are you using?

You could test this by getting a pointer to the memory underlying the vector with `std::vector::data()` and using that pointer directly, instead of `operator[]`, to read and write the elements. Of course, you would then need to check the bounds at the start of the function.

Hello @nt-KeBugCheck,

> what C++ compiler are you using?

I’ve been using Clang++ from llvm@14.

It’s really hard for me to dig into the LLVM IR. Is there any way to convert or reconstruct it into C++ code?

In addition, I’ve tried using raw pointers, but the performance is even slower than with `operator[]` (6x slower than Numba). The vectors’ size (n) is about 10,000.

```cpp
#include <cfloat>
#include <cmath>
#include <vector>

std::vector<double> get_value(const std::vector<double> &d1, const std::vector<double> &d2, int k) {
    int n = d1.size();
    std::vector<double> result(n);
    auto result_pointer = result.data();
    auto d1_pointer = d1.data();
    auto d2_pointer = d2.data();

    for (int i = k - 1; i < n; ++i) {
        double val_max = -DBL_MAX;
        double val_min = DBL_MAX;

        for (int j = i - k + 1; j <= i; ++j) {
            val_max = fmax(val_max, *(d1_pointer + j));
            val_min = fmin(val_min, *(d2_pointer + j));
        }
        *(result_pointer + i) = val_min / (val_max - val_min) * 100;
    }
    return result;
}
```

Apart from the suggestions above, it may also make sense to try compilation flags equivalent to Numba’s settings:
(`-O3 -ffast-math -march=native`)

It turned out that there was a mistake in my CMake file: the -O3 flag had never been passed to the compiler. The C++ -O3 performance is now as fast as Numba’s!
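For anyone hitting the same issue, one way to make sure optimization flags actually reach the compiler in CMake (a hypothetical sketch; the original CMakeLists is not shown, and `bench` is a made-up target name):

```cmake
# Default to a Release build so -O3 is passed to Clang/GCC
if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)
endif()

# Optional extras roughly matching Numba's fastmath settings
target_compile_options(bench PRIVATE -ffast-math -march=native)
```

You can verify the actual command line the compiler receives with `cmake --build . -- VERBOSE=1` (or `make VERBOSE=1`).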

I’m really sorry, and thank you for your help, @nt-KeBugCheck and @max9111.