Issue with Parallel Execution of Numba prange

Context:

I am working with the code from the Facebook Research FFCV-SSL repository (specifically `ffcv/transforms/colorjitter.py`). I am trying to accelerate image transformations using Numba’s `prange`.

Problem Description:
On my workstation, the `prange` loop runs in parallel, but on the GPU cluster machines the same loop executes sequentially. I cannot figure out why. I need the loop to run in parallel on the cluster to make full use of the available cores.

Code Snippets:

The core function is `apply_cj`, which is decorated with `@njit(parallel=False, fastmath=True, inline="always")`.

``````
import numba as nb
import numpy as np
from numba import njit


@njit(parallel=False, fastmath=True, inline="always")
def apply_cj(im, apply_bri, bri_ratio, apply_cont, cont_ratio, apply_sat, sat_ratio, apply_hue, hue_factor):
    ...
    if apply_hue:
        ...
        for row in nb.prange(im.shape[0]):
            im[row] = im[row] @ hue_matrix
    return np.clip(im, 0, 255).astype(np.uint8)
``````

Later, this `apply_cj` function is called inside the `color_jitter` function:

``````
# Excerpt: defined inside an enclosing function that is not shown here.
def color_jitter(images, _):
    for i in my_range(images.shape[0]):
        if np.random.rand() > jitter_prob:
            continue

        images[i] = apply_cj(
            images[i].astype("float64"),
            apply_bri,
            np.random.uniform(bri[0], bri[1]),
            apply_cont,
            np.random.uniform(cont[0], cont[1]),
            apply_sat,
            np.random.uniform(sat[0], sat[1]),
            apply_hue,
            np.random.uniform(hue[0], hue[1]),
        )
    return images

color_jitter.is_parallel = True
return color_jitter
``````

Observations and Questions:

• The workstation has Intel CPUs, while the cluster machine has AMD CPUs. Could this be the root of the issue?
• The `numba` configurations seem to be similar on both machines, yet the behavior is different. Why might this be?
• Are there any environment variables or configurations that I might be missing for the cluster machines?
• Any suggestions to force `numba` to parallelize the loop on the cluster machines would be highly appreciated.

Working system:

``````
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2023-08-18 21:23:50.484663
UTC start time                                : 2023-08-18 19:23:50.484668
Running time (s)                              : 5.238415

__Hardware Information__
Machine                                       : x86_64
CPU Count                                     : 36
Number of accessible CPUs                     : 36
List of accessible CPUs cores                 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 avx512bw
avx512cd avx512dq avx512f avx512vl
avx512vnni bmi bmi2 clflushopt
clwb cmov crc32 cx16 cx8 f16c fma
fsgsbase fxsr invpcid lzcnt mmx
movbe pclmul popcnt prfchw rdrnd
rdseed sahf sse sse2 sse3 sse4.1
sse4.2 ssse3 xsave xsavec xsaveopt
xsaves

Memory Total (MB)                             : 128499
Memory Available (MB)                         : 55419

__OS Information__
Platform Name                                 : Linux-5.15.0-71-generic-x86_64-with-glibc2.31
Platform Release                              : 5.15.0-71-generic
OS Name                                       : Linux
OS Version                                    : #78~20.04.1-Ubuntu SMP Wed Apr 19 11:26:48 UTC 2023
OS Specific Version                           : ?
Libc Version                                  : glibc 2.31

__Python Information__
Python Compiler                               : GCC 12.3.0
Python Implementation                         : CPython
Python Version                                : 3.9.17
Python Locale                                 : en_US.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.57.1
llvmlite Version                              : 0.40.1

__LLVM Information__
LLVM Version                                  : 14.0.6

__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : 11.6
CUDA Runtime Version                          : 11.8
CUDA NVIDIA Bindings Available                : False
CUDA NVIDIA Bindings In Use                   : False
CUDA Minor Version Compatibility Available    : False
CUDA Minor Version Compatibility Needed       : True
CUDA Minor Version Compatibility In Use       : False
CUDA Detect Output:
Found 2 CUDA devices
id 0      b'Quadro RTX 5000'                              [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 23
UUID: GPU-26c26bec-5df3-851f-07f2-4e176e266b50
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
id 1      b'Quadro RTX 5000'                              [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 101
UUID: GPU-74d37b39-0f91-e9a8-c003-d9ccc49a5b32
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
Summary:
2/2 devices are supported

CUDA Libraries Test Output:
Finding driver from candidates: libcuda.so, libcuda.so.1, /usr/lib/libcuda.so, /usr/lib/libcuda.so.1, /usr/lib64/libcuda.so, /usr/lib64/libcuda.so.1...
Finding nvvm from Conda environment
named  libnvvm.so.4.0.0
trying to open library... ok
Finding cudart from Conda environment
named  libcudart.so.11.8.89
trying to open library... ok
Finding libdevice from Conda environment
trying to open library... ok

__NumPy Information__
NumPy Version                                 : 1.24.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512VL', 'AVX512BW', 'AVX512DQ', 'AVX512VNNI', 'AVX512_SKX', 'AVX512_CLX')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : True

__SVML Information__
SVML State, config.USING_SVML                 : False
llvmlite Using SVML Patched LLVM              : False
SVML Operational                              : False

TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.21.5
Conda Env                                     : 4.10.3
Conda Platform                                : linux-64
Conda Python Version                          : 3.9.7.final.0
Conda Root Writable
``````

Not working system:

``````
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2023-08-18 21:15:08.766025
UTC start time                                : 2023-08-18 19:15:08.766032
Running time (s)                              : 4.064849

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : znver2
CPU Count                                     : 256
Number of accessible CPUs                     : 16
List of accessible CPUs cores                 : 56 57 58 59 60 61 62 63 184 185 186 187 188 189 190 191
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 bmi bmi2
clflushopt clwb clzero cmov crc32
cx16 cx8 f16c fma fsgsbase fxsr
lzcnt mmx movbe mwaitx pclmul
popcnt prfchw rdpid rdrnd rdseed
sahf sha sse sse2 sse3 sse4.1
sse4.2 sse4a ssse3 wbnoinvd xsave
xsavec xsaveopt xsaves

Memory Total (MB)                             : 1031954
Memory Available (MB)                         : 807107

__OS Information__
Platform Name                                 : Linux-5.4.0-148-generic-x86_64-with-glibc2.31
Platform Release                              : 5.4.0-148-generic
OS Name                                       : Linux
OS Version                                    : #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023
OS Specific Version                           : ?
Libc Version                                  : glibc 2.31

__Python Information__
Python Compiler                               : GCC 12.3.0
Python Implementation                         : CPython
Python Version                                : 3.9.17
Python Locale                                 : en_US.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.57.1
llvmlite Version                              : 0.40.1

__LLVM Information__
LLVM Version                                  : 14.0.6

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__NumPy Information__
NumPy Version                                 : 1.24.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : False

__SVML Information__
SVML State, config.USING_SVML                 : False
llvmlite Using SVML Patched LLVM              : False
SVML Operational                              : False

TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.22.0
Conda Env                                     : 22.9.0
Conda Platform                                : linux-64
Conda Python Version                          : 3.9.13.final.0
Conda Root Writable                           : False
``````

In your sample code, you say “parallel=False”.