Context:
I am working with the code from the Facebook Research FFCV-SSL repository (specifically `ffcv/transforms/colorjitter.py`). I am trying to accelerate image transformations using Numba's `prange`.
Problem Description:
On my workstation the `prange` loop parallelizes well, but when I run the same code on the GPU cluster machines the loop executes sequentially, and I cannot figure out why. I need the loop to run in parallel on the cluster machines to take full advantage of the available cores.
Code Snippets:
The core function is `apply_cj`, which is decorated with `@njit(parallel=False, fastmath=True, inline="always")`:

```python
@njit(parallel=False, fastmath=True, inline="always")
def apply_cj(im, apply_bri, bri_ratio, apply_cont, cont_ratio,
             apply_sat, sat_ratio, apply_hue, hue_factor):
    ...
    if apply_hue:
        ...
        for row in nb.prange(im.shape[0]):
            im[row] = im[row] @ hue_matrix
    return np.clip(im, 0, 255).astype(np.uint8)
```
Later, `apply_cj` is called inside the `color_jitter` function (which is built and returned by an enclosing factory, hence the trailing `return color_jitter`):

```python
def color_jitter(images, _):
    for i in my_range(images.shape[0]):
        if np.random.rand() > jitter_prob:
            continue
        images[i] = apply_cj(
            images[i].astype("float64"),
            apply_bri,
            np.random.uniform(bri[0], bri[1]),
            apply_cont,
            np.random.uniform(cont[0], cont[1]),
            apply_sat,
            np.random.uniform(sat[0], sat[1]),
            apply_hue,
            np.random.uniform(hue[0], hue[1]),
        )
    return images

color_jitter.is_parallel = True
return color_jitter
```
Observations and Questions:
- The workstation has Intel CPUs, while the cluster machine has AMD CPUs. Could this be the root of the issue?
- The `numba` configurations appear similar on both machines, yet the behavior differs. Why might that be?
- Are there any environment variables or configurations that I might be missing on the cluster machines?
- Any suggestions to force `numba` to parallelize the loop on the cluster machines would be highly appreciated.
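For what it's worth, I also compare raw CPU visibility with a stdlib-only check on Linux (no Numba involved), since cluster schedulers often pin jobs to a subset of cores:

```python
import os

# Logical CPUs physically present on the machine
total = os.cpu_count()
# CPUs this process is actually allowed to run on
# (respects taskset / cgroup / scheduler affinity masks)
accessible = len(os.sched_getaffinity(0))
print(f"total={total} accessible={accessible}")
```

On the cluster this matches the `numba -s` report below (256 CPUs present, only 16 accessible to the process).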
Thank you for your help!
Working system:
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time) : 2023-08-18 21:23:50.484663
UTC start time : 2023-08-18 19:23:50.484668
Running time (s) : 5.238415
__Hardware Information__
Machine : x86_64
CPU Name : cascadelake
CPU Count : 36
Number of accessible CPUs : 36
List of accessible CPUs cores : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
CFS Restrictions (CPUs worth of runtime) : None
CPU Features : 64bit adx aes avx avx2 avx512bw
avx512cd avx512dq avx512f avx512vl
avx512vnni bmi bmi2 clflushopt
clwb cmov crc32 cx16 cx8 f16c fma
fsgsbase fxsr invpcid lzcnt mmx
movbe pclmul popcnt prfchw rdrnd
rdseed sahf sse sse2 sse3 sse4.1
sse4.2 ssse3 xsave xsavec xsaveopt
xsaves
Memory Total (MB) : 128499
Memory Available (MB) : 55419
__OS Information__
Platform Name : Linux-5.15.0-71-generic-x86_64-with-glibc2.31
Platform Release : 5.15.0-71-generic
OS Name : Linux
OS Version : #78~20.04.1-Ubuntu SMP Wed Apr 19 11:26:48 UTC 2023
OS Specific Version : ?
Libc Version : glibc 2.31
__Python Information__
Python Compiler : GCC 12.3.0
Python Implementation : CPython
Python Version : 3.9.17
Python Locale : en_US.UTF-8
__Numba Toolchain Versions__
Numba Version : 0.57.1
llvmlite Version : 0.40.1
__LLVM Information__
LLVM Version : 14.0.6
__CUDA Information__
CUDA Device Initialized : True
CUDA Driver Version : 11.6
CUDA Runtime Version : 11.8
CUDA NVIDIA Bindings Available : False
CUDA NVIDIA Bindings In Use : False
CUDA Minor Version Compatibility Available : False
CUDA Minor Version Compatibility Needed : True
CUDA Minor Version Compatibility In Use : False
CUDA Detect Output:
Found 2 CUDA devices
id 0 b'Quadro RTX 5000' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 23
UUID: GPU-26c26bec-5df3-851f-07f2-4e176e266b50
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
id 1 b'Quadro RTX 5000' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 0
PCI Bus ID: 101
UUID: GPU-74d37b39-0f91-e9a8-c003-d9ccc49a5b32
Watchdog: Enabled
FP32/FP64 Performance Ratio: 32
Summary:
2/2 devices are supported
CUDA Libraries Test Output:
Finding driver from candidates: libcuda.so, libcuda.so.1, /usr/lib/libcuda.so, /usr/lib/libcuda.so.1, /usr/lib64/libcuda.so, /usr/lib64/libcuda.so.1...
Using loader <class 'ctypes.CDLL'>
trying to load driver... ok, loaded from libcuda.so
Finding nvvm from Conda environment
named libnvvm.so.4.0.0
trying to open library... ok
Finding cudart from Conda environment
named libcudart.so.11.8.89
trying to open library... ok
Finding cudadevrt from Conda environment
named libcudadevrt.a
Finding libdevice from Conda environment
trying to open library... ok
__NumPy Information__
NumPy Version : 1.24.4
NumPy Supported SIMD features : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512VL', 'AVX512BW', 'AVX512DQ', 'AVX512VNNI', 'AVX512_SKX', 'AVX512_CLX')
NumPy Supported SIMD dispatch : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected : True
__SVML Information__
SVML State, config.USING_SVML : False
SVML Library Loaded : True
llvmlite Using SVML Patched LLVM : False
SVML Operational : False
__Threading Layer Information__
TBB Threading Layer Available : True
+-->TBB imported successfully.
OpenMP Threading Layer Available : True
+-->Vendor: GNU
Workqueue Threading Layer Available : True
+-->Workqueue imported successfully.
__Numba Environment Variable Information__
None found.
__Conda Information__
Conda Build : 3.21.5
Conda Env : 4.10.3
Conda Platform : linux-64
Conda Python Version : 3.9.7.final.0
Conda Root Writable
Non-working system:
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time) : 2023-08-18 21:15:08.766025
UTC start time : 2023-08-18 19:15:08.766032
Running time (s) : 4.064849
__Hardware Information__
Machine : x86_64
CPU Name : znver2
CPU Count : 256
Number of accessible CPUs : 16
List of accessible CPUs cores : 56 57 58 59 60 61 62 63 184 185 186 187 188 189 190 191
CFS Restrictions (CPUs worth of runtime) : None
CPU Features : 64bit adx aes avx avx2 bmi bmi2
clflushopt clwb clzero cmov crc32
cx16 cx8 f16c fma fsgsbase fxsr
lzcnt mmx movbe mwaitx pclmul
popcnt prfchw rdpid rdrnd rdseed
sahf sha sse sse2 sse3 sse4.1
sse4.2 sse4a ssse3 wbnoinvd xsave
xsavec xsaveopt xsaves
Memory Total (MB) : 1031954
Memory Available (MB) : 807107
__OS Information__
Platform Name : Linux-5.4.0-148-generic-x86_64-with-glibc2.31
Platform Release : 5.4.0-148-generic
OS Name : Linux
OS Version : #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023
OS Specific Version : ?
Libc Version : glibc 2.31
__Python Information__
Python Compiler : GCC 12.3.0
Python Implementation : CPython
Python Version : 3.9.17
Python Locale : en_US.UTF-8
__Numba Toolchain Versions__
Numba Version : 0.57.1
llvmlite Version : 0.40.1
__LLVM Information__
LLVM Version : 14.0.6
__CUDA Information__
CUDA Device Initialized : False
CUDA Driver Version : ?
CUDA Runtime Version : ?
CUDA NVIDIA Bindings Available : ?
CUDA NVIDIA Bindings In Use : ?
CUDA Minor Version Compatibility Available : ?
CUDA Minor Version Compatibility Needed : ?
CUDA Minor Version Compatibility In Use : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None
__NumPy Information__
NumPy Version : 1.24.4
NumPy Supported SIMD features : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected : False
__SVML Information__
SVML State, config.USING_SVML : False
SVML Library Loaded : True
llvmlite Using SVML Patched LLVM : False
SVML Operational : False
__Threading Layer Information__
TBB Threading Layer Available : True
+-->TBB imported successfully.
OpenMP Threading Layer Available : True
+-->Vendor: GNU
Workqueue Threading Layer Available : True
+-->Workqueue imported successfully.
__Numba Environment Variable Information__
None found.
__Conda Information__
Conda Build : 3.22.0
Conda Env : 22.9.0
Conda Platform : linux-64
Conda Python Version : 3.9.13.final.0
Conda Root Writable : False