Issue with Parallel Execution of Numba prange

Context:

I am working with the code from the Facebook Research FFCV-SSL repository (specifically ffcv/transforms/colorjitter.py). I am trying to accelerate image transformations using Numba’s prange.

Problem Description:
On my workstation, the prange loop parallelizes well, but when I run the same code on our GPU cluster machines, the loop executes sequentially. I cannot figure out why. I need the loop to run in parallel on the cluster machines to take full advantage of the available cores.

Code Snippets:

The core function is apply_cj, which is decorated with @njit(parallel=False, fastmath=True, inline="always").

@njit(parallel=False, fastmath=True, inline="always")
def apply_cj(im, apply_bri, bri_ratio, apply_cont, cont_ratio, apply_sat, sat_ratio, apply_hue, hue_factor):
    ...
    if apply_hue:
        ...
        for row in nb.prange(im.shape[0]):
            im[row] = im[row] @ hue_matrix
    return np.clip(im, 0, 255).astype(np.uint8)

Later, this apply_cj function is called inside the color_jitter function:

def color_jitter(images, _):
    for i in my_range(images.shape[0]):
        if np.random.rand() > jitter_prob:
            continue

        images[i] = apply_cj(
            images[i].astype("float64"),
            apply_bri,
            np.random.uniform(bri[0], bri[1]),
            apply_cont,
            np.random.uniform(cont[0], cont[1]),
            apply_sat,
            np.random.uniform(sat[0], sat[1]),
            apply_hue,
            np.random.uniform(hue[0], hue[1]),
        )
    return images

color_jitter.is_parallel = True
return color_jitter

Observations and Questions:

  • The workstation has an Intel CPU (cascadelake) while the cluster machine has an AMD CPU (znver2). Could this be the root of the issue?
  • The Numba configurations look similar on both machines (system reports below), yet the behavior differs. Why might that be?
  • Are there any environment variables or configurations that I might be missing for the cluster machines?
  • Any suggestions to force numba to parallelize the loop on the cluster machines would be highly appreciated.

Thank you for your help!

Working system:

System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2023-08-18 21:23:50.484663
UTC start time                                : 2023-08-18 19:23:50.484668
Running time (s)                              : 5.238415

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : cascadelake
CPU Count                                     : 36
Number of accessible CPUs                     : 36
List of accessible CPUs cores                 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 avx512bw
                                                avx512cd avx512dq avx512f avx512vl
                                                avx512vnni bmi bmi2 clflushopt
                                                clwb cmov crc32 cx16 cx8 f16c fma
                                                fsgsbase fxsr invpcid lzcnt mmx
                                                movbe pclmul popcnt prfchw rdrnd
                                                rdseed sahf sse sse2 sse3 sse4.1
                                                sse4.2 ssse3 xsave xsavec xsaveopt
                                                xsaves

Memory Total (MB)                             : 128499
Memory Available (MB)                         : 55419

__OS Information__
Platform Name                                 : Linux-5.15.0-71-generic-x86_64-with-glibc2.31
Platform Release                              : 5.15.0-71-generic
OS Name                                       : Linux
OS Version                                    : #78~20.04.1-Ubuntu SMP Wed Apr 19 11:26:48 UTC 2023
OS Specific Version                           : ?
Libc Version                                  : glibc 2.31

__Python Information__
Python Compiler                               : GCC 12.3.0
Python Implementation                         : CPython
Python Version                                : 3.9.17
Python Locale                                 : en_US.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.57.1
llvmlite Version                              : 0.40.1

__LLVM Information__
LLVM Version                                  : 14.0.6

__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : 11.6
CUDA Runtime Version                          : 11.8
CUDA NVIDIA Bindings Available                : False
CUDA NVIDIA Bindings In Use                   : False
CUDA Minor Version Compatibility Available    : False
CUDA Minor Version Compatibility Needed       : True
CUDA Minor Version Compatibility In Use       : False
CUDA Detect Output:
Found 2 CUDA devices
id 0      b'Quadro RTX 5000'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 0
                              PCI Bus ID: 23
                                    UUID: GPU-26c26bec-5df3-851f-07f2-4e176e266b50
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
id 1      b'Quadro RTX 5000'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 0
                              PCI Bus ID: 101
                                    UUID: GPU-74d37b39-0f91-e9a8-c003-d9ccc49a5b32
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
 2/2 devices are supported

CUDA Libraries Test Output:
Finding driver from candidates: libcuda.so, libcuda.so.1, /usr/lib/libcuda.so, /usr/lib/libcuda.so.1, /usr/lib64/libcuda.so, /usr/lib64/libcuda.so.1...
Using loader <class 'ctypes.CDLL'>
 trying to load driver...  ok, loaded from libcuda.so
Finding nvvm from Conda environment
 named  libnvvm.so.4.0.0
 trying to open library... ok
Finding cudart from Conda environment
 named  libcudart.so.11.8.89
 trying to open library... ok
Finding cudadevrt from Conda environment
 named  libcudadevrt.a
Finding libdevice from Conda environment
 trying to open library... ok


__NumPy Information__
NumPy Version                                 : 1.24.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512VL', 'AVX512BW', 'AVX512DQ', 'AVX512VNNI', 'AVX512_SKX', 'AVX512_CLX')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : True

__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : True
llvmlite Using SVML Patched LLVM              : False
SVML Operational                              : False

__Threading Layer Information__
TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.21.5
Conda Env                                     : 4.10.3
Conda Platform                                : linux-64
Conda Python Version                          : 3.9.7.final.0
Conda Root Writable           

Not working system:

System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2023-08-18 21:15:08.766025
UTC start time                                : 2023-08-18 19:15:08.766032
Running time (s)                              : 4.064849

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : znver2
CPU Count                                     : 256
Number of accessible CPUs                     : 16
List of accessible CPUs cores                 : 56 57 58 59 60 61 62 63 184 185 186 187 188 189 190 191
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 bmi bmi2
                                                clflushopt clwb clzero cmov crc32
                                                cx16 cx8 f16c fma fsgsbase fxsr
                                                lzcnt mmx movbe mwaitx pclmul
                                                popcnt prfchw rdpid rdrnd rdseed
                                                sahf sha sse sse2 sse3 sse4.1
                                                sse4.2 sse4a ssse3 wbnoinvd xsave
                                                xsavec xsaveopt xsaves

Memory Total (MB)                             : 1031954
Memory Available (MB)                         : 807107

__OS Information__
Platform Name                                 : Linux-5.4.0-148-generic-x86_64-with-glibc2.31
Platform Release                              : 5.4.0-148-generic
OS Name                                       : Linux
OS Version                                    : #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023
OS Specific Version                           : ?
Libc Version                                  : glibc 2.31

__Python Information__
Python Compiler                               : GCC 12.3.0
Python Implementation                         : CPython
Python Version                                : 3.9.17
Python Locale                                 : en_US.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.57.1
llvmlite Version                              : 0.40.1

__LLVM Information__
LLVM Version                                  : 14.0.6

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Minor Version Compatibility Available    : ?
CUDA Minor Version Compatibility Needed       : ?
CUDA Minor Version Compatibility In Use       : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__NumPy Information__
NumPy Version                                 : 1.24.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : False

__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : True
llvmlite Using SVML Patched LLVM              : False
SVML Operational                              : False

__Threading Layer Information__
TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.22.0
Conda Env                                     : 22.9.0
Conda Platform                                : linux-64
Conda Python Version                          : 3.9.13.final.0
Conda Root Writable                           : False

In your sample code, you say `parallel=False`. prange only parallelizes when the enclosing @njit decorator is given parallel=True; with parallel=False it silently falls back to an ordinary sequential range, which matches the behavior you are seeing.