I run Lattice Boltzmann simulations. Using numba (specifically @jit with parallel=True), I was able to speed them up a lot, for example going from 13 hours to 4 hours on my home computer, which is amazing.
Then I ran my code on a cluster, and the simulation took longer than on my home computer.
When I checked the processors, I saw that the simulation was using only one of the cluster's processors, and not, as I was expecting, the number of threads I had set with numba.set_num_threads(64).
So I would like to understand better how this works and whether it is possible to fix it.
On my home computer I have a Ryzen 7 3700X with 16 threads. Experimenting with numba.set_num_threads(), I tried running with 8 threads (half), but looking at my processors it seems all of them are in use, each at about 50% capacity. This also confused me, because I thought it would literally use 8 threads, and that isn't what happened.
A friend of mine has access to one, but essentially he ran it the same way I do on my home computer, from the terminal. I think my biggest issue right now, apart from the cluster, is why, when I set the number of threads to a given value, all the threads remain in use but at a lower % of usage.
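In case it matters, this is roughly what I would check on the cluster side (a sketch; I'm assuming a SLURM-style scheduler, which may not match the actual cluster):

```shell
# Schedulers like SLURM pin a job to the CPUs it requested (e.g. via
# "#SBATCH --cpus-per-task=64" in the job script), so a job that asked for
# only 1 CPU would explain numba running on a single processor.
nproc                                  # logical CPUs visible to this shell/job
echo "${SLURM_CPUS_PER_TASK:-unset}"   # set by SLURM inside a job, if present
```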