Strange behaviour on HPC cluster

Hi guys!
I’m trying to run my Numba code on our cluster, which is composed of several nodes managed by PBS.
I installed Anaconda in user space and created an environment with all the libraries I need. The shell is tcsh, so I configured the base environment by sourcing conda.csh from profile.d in my .tcshrc.
I run my script by typing:

setenv NUMBA_NUM_THREADS 12
conda activate myenv
python myscript.py
conda deactivate
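One thing worth ruling out is a mismatch between the thread count Numba is given and the CPUs the job may actually use. The sketch below is a hypothetical launcher helper (`numba_env` and `usable_cpus` are illustrative names, not part of Numba). It assumes a Linux node, where `os.sched_getaffinity` reflects any CPU affinity or cpuset the batch system applies (whether PBS does so depends on the site configuration). It clamps `NUMBA_NUM_THREADS` to those CPUs and can optionally pin `NUMBA_THREADING_LAYER`. Both variables are read when `numba` is imported, so they must be set before the script starts; in tcsh the equivalent is another `setenv` line, e.g. `setenv NUMBA_THREADING_LAYER omp`.

```python
import os

# Threading layers that can be requested by name via NUMBA_THREADING_LAYER.
KNOWN_LAYERS = {"tbb", "omp", "workqueue"}

def usable_cpus():
    """Number of CPUs this process is allowed to run on.

    On Linux this honours CPU affinity / cpusets (which a batch system may
    apply to a job); elsewhere it falls back to os.cpu_count()."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

def numba_env(num_threads, layer=None, base=None):
    """Return a copy of `base` (default: os.environ) with NUMBA_NUM_THREADS
    clamped to the usable CPUs and, optionally, NUMBA_THREADING_LAYER set.

    Numba reads these variables at import time, so the returned dict must be
    passed to the *child* process that runs the Numba code."""
    env = dict(os.environ if base is None else base)
    env["NUMBA_NUM_THREADS"] = str(max(1, min(num_threads, usable_cpus())))
    if layer is not None:
        if layer not in KNOWN_LAYERS:
            raise ValueError(f"unknown threading layer: {layer!r}")
        env["NUMBA_THREADING_LAYER"] = layer
    return env
```

A possible use inside a job script is `subprocess.run([sys.executable, "myscript.py"], env=numba_env(12), check=True)`. The clamp only helps if the affinity mask really reflects the allocation, which is itself an assumption to verify on the cluster.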

When I run it on the cluster front end (for testing only) it works fine: 24 CPUs are used and the script terminates.

This is where the problems start!
When I submit the job via qsub, it seems to use all of the allocated resources without ever finishing, so I tried running the code directly on a free node of the cluster. I found that, regardless of the NUMBA_NUM_THREADS variable, all of the CPUs on the node are saturated! Furthermore, the script never completes!
What’s going on?
Thanks for the help.

Ivan

[EDIT]
On the front end:

(py39) frontend /home/user 105 % numba -s
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2021-07-03 13:42:37.380803
UTC start time                                : 2021-07-03 11:42:37.381489
Running time (s)                              : 7.433437

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : broadwell
CPU Count                                     : 20
Number of accessible CPUs                     : 32
List of accessible CPUs cores                 : 0-191
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 bmi bmi2
                                                cmov cx16 cx8 f16c fma fsgsbase
                                                fxsr invpcid lzcnt mmx movbe
                                                pclmul popcnt prfchw rdrnd rdseed
                                                rtm sahf sse sse2 sse3 sse4.1
                                                sse4.2 ssse3 xsave xsaveopt

Memory Total (MB)                             : 257541
Memory Available (MB)                         : 243705

__OS Information__
Platform Name                                 : Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
Platform Release                              : 3.10.0-957.el7.x86_64
OS Name                                       : Linux
OS Version                                    : #1 SMP Thu Oct 4 20:48:51 UTC 2018
OS Specific Version                           : ?
Libc Version                                  : glibc 2.17

__Python Information__
Python Compiler                               : GCC 7.5.0
Python Implementation                         : CPython
Python Version                                : 3.9.5
Python Locale                                 : en_US.UTF-8

__LLVM Information__
LLVM Version                                  : 10.0.1

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__ROC information__
ROC Available                                 : False
ROC Toolchains                                : None
HSA Agents Count                              : 0
HSA Agents:
None
HSA Discrete GPUs Count                       : 0
HSA Discrete GPUs                             : None

__SVML Information__
SVML State, config.USING_SVML                 : True
SVML Library Loaded                           : True
llvmlite Using SVML Patched LLVM              : True
SVML Operational                              : True

__Threading Layer Information__
TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.21.4
Conda Env                                     : 4.10.3
Conda Platform                                : linux-64
Conda Python Version                          : 3.8.10.final.0
Conda Root Writable                           : True

__Installed Packages__
_libgcc_mutex             0.1                        main
_openmp_mutex             4.5                       1_gnu
blas                      1.0                         mkl
ca-certificates           2021.5.25            h06a4308_1
certifi                   2021.5.30        py39h06a4308_0
cudatoolkit               11.0.221             h6bb024c_0
cycler                    0.10.0           py39h06a4308_0
dbus                      1.13.18              hb2f20db_0
expat                     2.4.1                h2531618_2
fontconfig                2.13.1               h6c09931_0
freetype                  2.10.4               h5ab3b9f_0
glib                      2.68.2               h36276a3_0
gst-plugins-base          1.14.0               h8213a91_2
gstreamer                 1.14.0               h28cd5cc_2
icu                       58.2                 he6710b0_3
intel-openmp              2021.2.0           h06a4308_610
jpeg                      9b                   h024ee3a_2
kiwisolver                1.3.1            py39h2531618_0
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.35.1               h7274673_9
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.3.0               h5101ec6_17
libgfortran-ng            7.5.0               ha8ba4b0_17
libgfortran4              7.5.0               ha8ba4b0_17
libgomp                   9.3.0               h5101ec6_17
libllvm10                 10.0.1               hbcb73fb_5
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.3.0               hd4cf53a_17
libtiff                   4.2.0                h85742a9_0
libuuid                   1.0.3                h1bed415_2
libwebp-base              1.2.0                h27cfd23_0
libxcb                    1.14                 h7b6447c_0
libxml2                   2.9.12               h03d6c58_0
llvmlite                  0.36.0           py39h612dafd_4
lz4-c                     1.9.3                h2531618_0
matplotlib                3.3.4            py39h06a4308_0
matplotlib-base           3.3.4            py39h62a2d02_0
mkl                       2021.2.0           h06a4308_296
mkl-service               2.3.0            py39h27cfd23_1
mkl_fft                   1.3.0            py39h42c9631_2
mkl_random                1.2.1            py39ha9443f7_2
ncurses                   6.2                  he6710b0_1
numba                     0.53.1           py39ha9443f7_0
numpy                     1.20.2           py39h2d18471_0
numpy-base                1.20.2           py39hfae3a4d_0
olefile                   0.46                       py_0
openssl                   1.1.1k               h27cfd23_0
pcre                      8.45                 h295c915_0
pillow                    8.2.0            py39he98fc37_0
pip                       21.1.3           py39h06a4308_0
pyparsing                 2.4.7              pyhd3eb1b0_0
pyqt                      5.9.2            py39h2531618_6
python                    3.9.5                h12debd9_4
python-dateutil           2.8.1              pyhd3eb1b0_0
qt                        5.9.7                h5867ecd_1
readline                  8.1                  h27cfd23_0
scipy                     1.6.2            py39had2a1c9_1
setuptools                52.0.0           py39h06a4308_0
sip                       4.19.13          py39h2531618_0
six                       1.16.0             pyhd3eb1b0_0
sqlite                    3.36.0               hc218d9a_0
tbb                       2020.3               hfd86e86_0
tk                        8.6.10               hbc83047_0
tornado                   6.1              py39h27cfd23_0
tzdata                    2021a                h52ac0ba_0
wheel                     0.36.2             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.9                haebb681_0

No errors reported.


__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (roc): Error initialising ROC: No ROC toolchains found.
Warning (roc): No HSA Agents found, encountered exception when searching: Error at driver init:
NUMBA_HSA_DRIVER /opt/rocm/lib/libhsa-runtime64.so is not a valid file path.  Note it must be a filepath of the .so/.dll/.dylib or the driver:
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
--------------------------------------------------------------------------------
If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as
appropriate.

=============================================================
IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.
=============================================================

On the cluster node:

(py39) [user@cn013 ~]$ numba -s
System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2021-07-03 13:43:15.883785
UTC start time                                : 2021-07-03 11:43:15.884534
Running time (s)                              : 4.449655

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : broadwell
CPU Count                                     : 36
Number of accessible CPUs                     : 32
List of accessible CPUs cores                 : 0-191
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 bmi bmi2
                                                cmov cx16 cx8 f16c fma fsgsbase
                                                fxsr invpcid lzcnt mmx movbe
                                                pclmul popcnt prfchw rdrnd rdseed
                                                rtm sahf sse sse2 sse3 sse4.1
                                                sse4.2 ssse3 xsave xsaveopt

Memory Total (MB)                             : 257845
Memory Available (MB)                         : 253311

__OS Information__
Platform Name                                 : Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
Platform Release                              : 3.10.0-957.el7.x86_64
OS Name                                       : Linux
OS Version                                    : #1 SMP Thu Oct 4 20:48:51 UTC 2018
OS Specific Version                           : ?
Libc Version                                  : glibc 2.17

__Python Information__
Python Compiler                               : GCC 7.5.0
Python Implementation                         : CPython
Python Version                                : 3.9.5
Python Locale                                 : en_US.UTF-8

__LLVM Information__
LLVM Version                                  : 10.0.1

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__ROC information__
ROC Available                                 : False
ROC Toolchains                                : None
HSA Agents Count                              : 0
HSA Agents:
None
HSA Discrete GPUs Count                       : 0
HSA Discrete GPUs                             : None

__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : False
llvmlite Using SVML Patched LLVM              : True
SVML Operational                              : False

__Threading Layer Information__
TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.21.4
Conda Env                                     : 4.10.3
Conda Platform                                : linux-64
Conda Python Version                          : 3.8.10.final.0
Conda Root Writable                           : True

__Installed Packages__
_libgcc_mutex             0.1                        main
_openmp_mutex             4.5                       1_gnu
blas                      1.0                         mkl
ca-certificates           2021.5.25            h06a4308_1
certifi                   2021.5.30        py39h06a4308_0
cudatoolkit               11.0.221             h6bb024c_0
cycler                    0.10.0           py39h06a4308_0
dbus                      1.13.18              hb2f20db_0
expat                     2.4.1                h2531618_2
fontconfig                2.13.1               h6c09931_0
freetype                  2.10.4               h5ab3b9f_0
glib                      2.68.2               h36276a3_0
gst-plugins-base          1.14.0               h8213a91_2
gstreamer                 1.14.0               h28cd5cc_2
icu                       58.2                 he6710b0_3
intel-openmp              2021.2.0           h06a4308_610
jpeg                      9b                   h024ee3a_2
kiwisolver                1.3.1            py39h2531618_0
lcms2                     2.12                 h3be6417_0
ld_impl_linux-64          2.35.1               h7274673_9
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.3.0               h5101ec6_17
libgfortran-ng            7.5.0               ha8ba4b0_17
libgfortran4              7.5.0               ha8ba4b0_17
libgomp                   9.3.0               h5101ec6_17
libllvm10                 10.0.1               hbcb73fb_5
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.3.0               hd4cf53a_17
libtiff                   4.2.0                h85742a9_0
libuuid                   1.0.3                h1bed415_2
libwebp-base              1.2.0                h27cfd23_0
libxcb                    1.14                 h7b6447c_0
libxml2                   2.9.12               h03d6c58_0
llvmlite                  0.36.0           py39h612dafd_4
lz4-c                     1.9.3                h2531618_0
matplotlib                3.3.4            py39h06a4308_0
matplotlib-base           3.3.4            py39h62a2d02_0
mkl                       2021.2.0           h06a4308_296
mkl-service               2.3.0            py39h27cfd23_1
mkl_fft                   1.3.0            py39h42c9631_2
mkl_random                1.2.1            py39ha9443f7_2
ncurses                   6.2                  he6710b0_1
numba                     0.53.1           py39ha9443f7_0
numpy                     1.20.2           py39h2d18471_0
numpy-base                1.20.2           py39hfae3a4d_0
olefile                   0.46                       py_0
openssl                   1.1.1k               h27cfd23_0
pcre                      8.45                 h295c915_0
pillow                    8.2.0            py39he98fc37_0
pip                       21.1.3           py39h06a4308_0
pyparsing                 2.4.7              pyhd3eb1b0_0
pyqt                      5.9.2            py39h2531618_6
python                    3.9.5                h12debd9_4
python-dateutil           2.8.1              pyhd3eb1b0_0
qt                        5.9.7                h5867ecd_1
readline                  8.1                  h27cfd23_0
scipy                     1.6.2            py39had2a1c9_1
setuptools                52.0.0           py39h06a4308_0
sip                       4.19.13          py39h2531618_0
six                       1.16.0             pyhd3eb1b0_0
sqlite                    3.36.0               hc218d9a_0
tbb                       2020.3               hfd86e86_0
tk                        8.6.10               hbc83047_0
tornado                   6.1              py39h27cfd23_0
tzdata                    2021a                h52ac0ba_0
wheel                     0.36.2             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.9                haebb681_0

No errors reported.


__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (roc): Error initialising ROC: No ROC toolchains found.
Warning (roc): No HSA Agents found, encountered exception when searching: Error at driver init:
NUMBA_HSA_DRIVER /opt/rocm/lib/libhsa-runtime64.so is not a valid file path.  Note it must be a filepath of the .so/.dll/.dylib or the driver:
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
--------------------------------------------------------------------------------
If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as
appropriate.

=============================================================
IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.
=============================================================

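The three CPU figures in each report do not agree with one another: CPU Count is 20 on the front end and 36 on the node, Number of accessible CPUs is 32 on both, and the accessible-core list is 0-191 on both. The core list appears to use the same range notation as Linux CPU lists, so a small parser makes the comparison explicit. This is a minimal sketch (`parse_cpu_list` and `cpu_summary` are hypothetical helpers) that handles only the simple comma/range form.

```python
def parse_cpu_list(spec):
    """Expand CPU-list notation such as "0-191" or "0-3,8,10-11" into a
    sorted list of CPU ids (simple comma/range form only)."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def cpu_summary(cpu_count, accessible, core_list):
    """Put the three CPU figures from a `numba -s` report side by side."""
    cores = parse_cpu_list(core_list)
    return {
        "cpu_count": cpu_count,
        "accessible": accessible,
        "listed_cores": len(cores),
        # True only if all three figures describe the same set of CPUs.
        "consistent": cpu_count == accessible == len(cores),
    }
```

For the node above, `cpu_summary(36, 32, "0-191")` reports 192 listed cores against 32 accessible CPUs and a CPU count of 36.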
Hi @krono86

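The CPU figures in the hardware information above look a little suspicious (for example, CPU Count differs between the two machines, Number of accessible CPUs is 32 on both, and the accessible-core list is 0-191 on both). However, ignoring this, two questions: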

  1. Which threading backend are you using? It looks like it’ll be TBB, given that numba -s reports it as available.
  2. Does your code have nested parallelism (i.e. a prange which calls another function that is decorated with @njit(parallel=True))?

If the answer to the above is “TBB” and “yes”, then the saturation-without-job-completion issue is likely due to nested parallelism with insufficient work.

Hi @stuartarchibald!
Yes, the threading backend is TBB, and yes, there’s nested parallelism; however, although the higher-level prange calls jit-compiled functions, those functions are not compiled with parallel=True.
I found that it works with OpenMP instead of TBB.
That could be sufficient for me, but why do I have the problem?
Thanks a lot!

Ivan

Hi @krono86

Without seeing the code and being able to reproduce this it’s quite hard to guess at what the issue is. If you can create a minimal working example of the problem someone can take a look. It might be worth looking at the active thread counts whilst the program is running under TBB and OpenMP and seeing how that compares to the number of available cores on the machine.
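One way to do that comparison is sketched below, assuming a Linux node (`thread_report` is a hypothetical helper). It counts OS-level threads from `/proc/<pid>/task`, which includes threads started by native runtimes such as TBB or OpenMP. `threading.active_count()` would miss these, since it only sees Python-created threads. It then compares that count with the CPUs the process is allowed to use.

```python
import os

def thread_report(pid=None):
    """Compare a process's OS-level thread count with the CPUs it may use.

    Linux only: threads are counted from /proc/<pid>/task, so native threads
    (TBB, OpenMP, ...) are included, not just Python-created ones."""
    proc = "self" if pid is None else str(pid)
    threads = len(os.listdir(f"/proc/{proc}/task"))
    # Affinity of the target process (0 means the calling process).
    cpus = len(os.sched_getaffinity(0 if pid is None else pid))
    return {
        "threads": threads,
        "usable_cpus": cpus,
        # Rough signal only: the main thread and helper threads count too.
        "more_threads_than_cpus": threads > cpus,
    }
```

From a second shell on the node, calling `thread_report(<pid of myscript.py>)` a few times under each threading layer gives numbers to compare. `ps -o nlwp= -p <pid>` reports the same thread count without Python.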

I’m also not sure that you have nested parallelism if:

the higher level prange call jit-compiled functions, but the latters are not compiled with parallel=True.

This implies you have a top-level parallel function driving a prange loop that calls other, non-parallel functions?