Terminating: Nested parallel kernel launch detected, the workqueue threading layer does not supported nested parallelism

I have a suite of unit tests for the STUMPY package that are executed in Github Actions where, for every PR/commit, we test on Linux, Windows, and MacOS and for Python 3.7, 3.8, 3.9, and 3.10. Up until recently, all tests were passing locally as well as on Github Actions. However, a few days ago, we found that the tests were failing on Github Actions for MacOS only (various Python versions):

Terminating: Nested parallel kernel launch detected, the workqueue threading layer does not supported nested parallelism. Try the TBB threading layer.

Oddly, the only change committed to the repo was the addition of a Jupyter notebook that is not involved in unit testing at all (i.e., the code base didn’t change, so the outcome of the tests should not have been affected). Between the passing and failing attempts, some other package dependencies had been updated, and I’ve tried pinning those dependencies to the versions from the passing attempts. Unfortunately, this didn’t change anything.

I understand that the error says to Try the TBB threading layer but these tests were passing previously without TBB on MacOS and this error message doesn’t seem to be new in the numba code base (i.e., it’s been around for a while). I’m confused as to what could be triggering this error given that the STUMPY code base has not changed since last passing all tests on Github Actions. In fact, I’ve isolated a single unit test and ran it 100 times in each of 20 rounds, and it fails with the above error only after the 12th round (see here). Additionally, I understand that this error message arises when there are nested parallel=True function calls but I’ve checked and this shouldn’t be the issue. I’ve also wondered if it’s because we are using dask distributed but I couldn’t see why that would be an issue as it was never a problem previously, and using an older (passing) version did not resolve the issue.

Any thoughts/suggestions would be greatly appreciated!

@stuartarchibald I believe that you worked on this and so I was wondering if anything comes to mind. Note that the parallel=True function being called is also being executed inside a dask distributed (LocalCluster) for testing. Again, I’m not getting this error when testing locally on my Mac and, on Github Actions, this error comes up after repeating the test multiple times (i.e., it doesn’t get triggered on the first call). So it’s been exceptionally hard to debug.

@seanlaw Thanks for raising this. A couple of thoughts:

  1. I agree that the fact that it apparently always worked before some unrelated commit was made suggests something changed in the execution environment. Typically this would be packages but could also be something OS level, e.g. provision of an OpenMP runtime library. Maybe something has changed/updated and this led to Numba changing which threading layer to use based on something now not being present/detected/working. The default selection order is try TBB first, then OpenMP, then workqueue. Perhaps OpenMP used to work and now it does not?
  2. The use of dask.distributed’s LocalCluster is probably resulting in some sort of thread-pool being active to perform the execution of tasks. Threads concurrently accessing Numba’s workqueue threading layer is a violation of what is supported in that threading layer (the threading layer itself is not threadsafe). Such concurrent access typically comes from nested parallelism (i.e. a Numba parallel=True function calling another parallel=True function from a parallel region), however concurrent access can also come from Python threads. I’ve opened Numba project issue #8563 to demonstrate and also track updating the error message to be more general with respect to the potential causes of the problem.

As to fixing the issue present in STUMPY, in the situation described in the OP, I would guess it is unlikely that Numba’s workqueue has always worked by luck until recently, and suspect, as noted, something somewhere has changed. The numba -s command can be very useful for showing what Numba “detects” should work (it has a specific “threading layers” section) along with the package contents of the current Python environment. The function numba.threading_layer() will also return a string corresponding to the threading layer used in execution once execution completes, maybe see what that was in a working case? Perhaps some combination of these things could help you debug?
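For CI logs, a minimal sketch of pulling just the threading layer section out of the report might look like the following. It assumes section headers are lines of the form __Name__, as in the numba -s output shown in this thread; extract_section and numba_sysinfo are hypothetical helper names, not part of Numba.

```python
import subprocess
import sys


def extract_section(report, header="__Threading Layer Information__"):
    """Return the lines of one section of a `numba -s` report.

    Assumption: every section starts with a line like `__Name__` and
    runs until the next such line.
    """
    section = []
    in_section = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped == header:
            in_section = True
            section.append(line)
            continue
        if in_section:
            # The next `__Header__` line marks the end of the section.
            if stripped.startswith("__") and stripped.endswith("__"):
                break
            section.append(line)
    return section


def numba_sysinfo():
    """Run `python -m numba -s` in a subprocess and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-m", "numba", "-s"],
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print("\n".join(extract_section(numba_sysinfo())))
```

Printing this on every CI run would leave a record of which threading layers were detected at the time, which is exactly the information that was missing for the “working case” here.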

Thanks for the response @stuartarchibald and for the suggestions!

Perhaps OpenMP used to work and now it does not?

Does numba automatically log which threading layer was used without explicitly checking numba.threading_layer()? Given that it might be something changing in the OS level, it’s not obvious how we might recover the threading layer that was used in the “working case”. See below.

I’m not sure how to go about this. The “working case” happened on Github Actions a week ago and now I only have the option to “re-run all jobs”. In doing so, the related tests are now failing even though the code/commit has not changed. I can certainly do numba -s using that commit but it won’t tell me what the Python environment+package contents was like in the “working case”. It would only tell me what the failing/erroring environment is like. Is numba -s still going to be useful in this regard? Anyhow, here is the result from the currently failing case:

System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2022-11-03 00:26:42.195959
UTC start time                                : 2022-11-03 00:26:42.195966
Running time (s)                              : 6.873599
__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : ivybridge
CPU Count                                     : 3
Number of accessible CPUs                     : ?
List of accessible CPUs cores                 : ?
CFS Restrictions (CPUs worth of runtime)      : None
CPU Features                                  : 64bit aes avx cmov cx16 cx8 f16c
                                                fsgsbase fxsr mmx pclmul popcnt
                                                rdrnd sahf sse sse2 sse3 sse4.1
                                                sse4.2 ssse3 xsave
Memory Total (MB)                             : 14336
Memory Available (MB)                         : 11451
__OS Information__
Platform Name                                 : macOS-10.16-x86_64-i386-64bit
Platform Release                              : 21.6.0
OS Name                                       : Darwin
OS Version                                    : Darwin Kernel Version 21.6.0: Thu Sep 29 20:12:57 PDT 2022; root:xnu-8020.240.7~1/RELEASE_X86_64
OS Specific Version                           : 10.16   x86_64
Libc Version                                  : ?
__Python Information__
Python Compiler                               : Clang 12.0.0 (clang-1200.0.32.29)
Python Implementation                         : CPython
Python Version                                : 3.9.14
Python Locale                                 : en_US.UTF-8
__Numba Toolchain Versions__
Numba Version                                 : 0.56.3
llvmlite Version                              : 0.39.1
__LLVM Information__
LLVM Version                                  : 11.1.0
__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None
__NumPy Information__
NumPy Version                                 : 1.23.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_KNL', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : False
__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : False
llvmlite Using SVML Patched LLVM              : True
SVML Operational                              : False
__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : False
+--> Disabled due to Unknown import problem.
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.
__Numba Environment Variable Information__
None found.
__Conda Information__
Conda Build                                   : not installed
Conda Env                                     : 4.12.0
Conda Platform                                : osx-64
Conda Python Version                          : 3.9.12.final.0
Conda Root Writable                           : True
__Installed Packages__
brotlipy                  0.7.0           py39h9ed2024_1003  
ca-certificates           2022.3.29            hecd8cb5_1  
certifi                   2021.10.8        py39hecd8cb5_2  
cffi                      1.15.0           py39hc55c11b_1  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
colorama                  0.4.4              pyhd3eb1b0_0  
conda                     4.12.0           py39hecd8cb5_0  
conda-content-trust       0.1.1              pyhd3eb1b0_0  
conda-package-handling    1.8.1            py39hca72f7f_0  
cryptography              36.0.0           py39hf6deb26_0  
idna                      3.3                pyhd3eb1b0_0  
libcxx                    12.0.0               h2f01273_0  
libffi                    3.3                  hb1e8313_2  
ncurses                   6.3                  hca72f7f_2  
openssl                   1.1.1n               hca72f7f_0  
pip                       21.2.4           py39hecd8cb5_0  
pycosat                   0.6.3            py39h9ed2024_0  
pycparser                 2.21               pyhd3eb1b0_0  
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pysocks                   1.7.1            py39hecd8cb5_0  
python                    3.9.12               hdfd78df_0  
python.app                3                py39hca72f7f_0  
readline                  8.1.2                hca72f7f_1  
requests                  2.27.1             pyhd3eb1b0_0  
ruamel_yaml               0.15.100         py39h9ed2024_0  
setuptools                61.2.0           py39hecd8cb5_0  
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.38.2               h707629a_0  
tk                        8.6.11               h7bc2e8c_0  
tqdm                      4.63.0             pyhd3eb1b0_0  
tzdata                    2022a                hda174b7_0  
urllib3                   1.26.8             pyhd3eb1b0_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.5                h1de35cc_0  
yaml                      0.2.5                haf1e3a3_0  
zlib                      1.2.12               h4dc903c_1  
No errors reported.
__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
--------------------------------------------------------------------------------
If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as
appropriate.
=============================================================
IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.
=============================================================

I noticed that a lot of the packages that I’ve installed (e.g., numpy, scipy, and even numba) are not in the list of __Installed Packages__ since I used pip install when setting up my environment in Github Actions.
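Since the __Installed Packages__ list above only covers conda packages, one way to record the pip-installed ones as well is to snapshot the environment from Python itself. This is a minimal sketch using only the standard library (importlib.metadata, Python 3.8+); installed_versions and diff_versions are hypothetical helper names.

```python
from importlib import metadata


def installed_versions():
    """Map each installed distribution name (lower-cased) to its version."""
    versions = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta["Name"] if meta is not None else None
        if name:  # skip distributions with broken or missing metadata
            versions[name.lower()] = dist.version
    return versions


def diff_versions(old, new):
    """Return {name: (old_version, new_version)} for every package that differs.

    A version of None means the package is absent from that snapshot.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    for name, version in sorted(installed_versions().items()):
        print(f"{name}=={version}")
```

Saving this output from a passing run and a failing run, then passing both snapshots to diff_versions, would show which packages appeared, disappeared, or changed version between the two.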

Additionally, the parallel function is “submitted” to dask and we wait for an asynchronous future result to be returned, which is then processed when complete. So, it isn’t clear how to retrieve the numba.threading_layer() within the dask worker’s thread(s). Right now, if I try to print numba.threading_layer() right after the dask.submit call then I get:

    print(numba.threading_layer())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def threading_layer():
        """
        Get the name of the threading layer in use for parallel CPU targets
        """
        if _threading_layer is None:
>           raise ValueError("Threading layer is not initialized.")
E           ValueError: Threading layer is not initialized.

I’m guessing it’s because the threading layer is being handled within dask and it is opaque to the parent calling function.

Also, I have now isolated the issue to a single unit test. However, that unit test doesn’t fail after it is executed once. Instead, it errors out after calling the same test 501 consecutive times (i.e., it passes the same test 500 times right before failure).

Thanks for the extra information @seanlaw.

Numba does not log or persist which threading layer was used. The threading layer is determined at runtime and could reasonably be changed between executions (for example by setting the NUMBA_THREADING_LAYER environment variable).
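As a rough sketch of using that environment variable in a test run: the helper below builds an environment for a child process with NUMBA_THREADING_LAYER set. The helper name env_with_threading_layer is hypothetical; the accepted values listed are the ones I understand Numba’s documentation to describe, where "threadsafe" asks Numba to pick a layer that tolerates concurrent access from threads and should fail with an error rather than fall back to workqueue.

```python
import os

# Values Numba documents for NUMBA_THREADING_LAYER, to the best of my knowledge.
VALID_LAYERS = {
    "default", "safe", "threadsafe", "forksafe", "tbb", "omp", "workqueue",
}


def env_with_threading_layer(layer="threadsafe", base=None):
    """Return a copy of the environment with NUMBA_THREADING_LAYER set."""
    if layer not in VALID_LAYERS:
        raise ValueError(f"unknown threading layer request: {layer!r}")
    env = dict(os.environ if base is None else base)
    env["NUMBA_THREADING_LAYER"] = layer
    return env


# Example use (not executed here): run the test suite in a child process
# so the variable is in place before Numba is imported.
#
#   import subprocess, sys
#   subprocess.run([sys.executable, "-m", "pytest"],
#                  env=env_with_threading_layer("threadsafe"))
```

Requesting "threadsafe" in CI would turn a silent fallback to workqueue into an explicit failure at the first parallel call, which is easier to diagnose than an intermittent termination.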

Thanks for mentioning that the pip-installed packages are missing from the list, I think numba -s might need an update to better accommodate this situation.

I think the ValueError is expected: the threading layer is only initialised once a compilation takes place for the parallel=True target. It is deliberately lazy so as to not start a thread pool in the most common use case of not needing one.
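A small sketch of this lazy behaviour, assuming a fresh Python process in which no parallel=True function has run yet. The function names are made up for illustration, and the helper returns None when Numba or NumPy is not installed so that it can be dropped into a CI script safely.

```python
def threading_layer_before_and_after():
    """Return (state_before, layer_after), or None if Numba/NumPy is missing."""
    try:
        import numba
        import numpy as np
    except ImportError:
        return None

    @numba.njit(parallel=True)
    def total(a):
        acc = 0.0
        for i in numba.prange(a.shape[0]):
            acc += a[i]  # reduction over a parallel loop
        return acc

    # Before any parallel function has been compiled and run in this
    # process, numba.threading_layer() raises ValueError.
    try:
        before = numba.threading_layer()
    except ValueError as exc:
        before = f"not initialised ({exc})"

    total(np.ones(1_000))  # compiles and executes, initialising the layer

    # Now a layer name such as "tbb", "omp" or "workqueue" is available.
    return before, numba.threading_layer()


if __name__ == "__main__":
    print(threading_layer_before_and_after())
```

Note that this reports the layer for the process it runs in. With dask.distributed the parallel function may execute in a different worker process, so the same check would have to run on the worker to be meaningful.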

Failing only after exactly 500 passes does sound a little strange, particularly in that it is a specific number of execution attempts before failure. The issue in the OP that the workqueue threading layer is running in to is effectively due to a concurrent access condition so I would have guessed it’d be “random” to some degree.


Moving on to debugging and how to resolve this…

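The output from numba -s above reports:

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : False
+--> Disabled due to Unknown import problem.
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

which suggests that the only threading layer available on the system is workqueue, which explains why it is running.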

In the originally working environment I am going to guess it was the OpenMP threading layer in use due to the following.

  1. There are only three options a) tbb, b) OpenMP c) workqueue.
  2. If it were tbb it’s reasonably likely it would either have been actively installed, and you’d be able to find that spelled out in e.g. CI setup scripts, or it would be a dependency of some other package. If it were installed through any means via conda it would appear in the numba -s output and it is not there. Further, IIRC, there are issues with finding a pip installed tbb library so I would guess it wouldn’t have been working via that either.
  3. The workqueue threading layer has the issues as noted/demonstrated above in relation to concurrent access. This means it either was not in use, has been exceedingly lucky in not hitting problems or something has changed in another package (e.g. dask.distributed) that means concurrent access now occurs.
  4. This leaves OpenMP, which is often found through some other package depending on or linking to it; it can also be present simply as part of a relatively standard OS runtime environment. The OpenMP threading layer is thread-safe, therefore should be fine in the case described above.

A couple of suggestions.

  1. As a preventative measure, in future perhaps run numba -s or some other appropriate tools to print out the environment in use in CI systems? Should something like this happen again, it is reasonably likely it will be traceable through a change in the reported package versions in the environment. Numba’s CI systems do this, and just this week Numba issue #8548 was found and quickly resolved by virtue of being able to check the changes to the execution environment across CI runs.
  2. To try and “fix” the current issue, perhaps install a suitable OpenMP package? The requirements for each threading layer are in the table in the “The Threading Layers” section of the Numba documentation. The numba -s command can be run to see if the OpenMP threading layer is working; the information is in the __Threading Layer Information__ section as noted above.
  3. The OpenMP threading layer status in the numba -s command is detected through doing:
    from numba.np.ufunc import omppool
    
    which is wrapped in a try... except ImportError such that any traceback that might hint at a problem (should there be one) is not visible. Therefore attempting this import directly might yield some information about a problem as part of the traceback.
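The direct import described in the last suggestion can be scripted so that the full traceback lands in the CI log. This is only a sketch: probe_import is a hypothetical helper, omppool is the module named above, and the tbbpool and workqueue module paths are my understanding of where the other two backends live in recent Numba versions.

```python
import importlib
import traceback

# Backend modules for the three threading layers. Only omppool is named in
# this thread; the other two paths are assumptions about Numba's layout.
THREADING_LAYER_MODULES = {
    "tbb": "numba.np.ufunc.tbbpool",
    "omp": "numba.np.ufunc.omppool",
    "workqueue": "numba.np.ufunc.workqueue",
}


def probe_import(module_name):
    """Try to import a module; return None on success or the traceback text."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Keep the whole traceback: on macOS it includes the dlopen search
        # paths, which is the useful part.
        return traceback.format_exc()
    return None


if __name__ == "__main__":
    for layer, module_name in THREADING_LAYER_MODULES.items():
        problem = probe_import(module_name)
        print(f"{layer}: {'OK' if problem is None else 'FAILED'}")
        if problem:
            print(problem)
```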

Hope this helps.

Indeed, the number of iterations before a failure is random. The point that I was trying to emphasize was that the failure rarely happens on the very first iteration and that it happens at random. It sounds like this is what you would’ve expected with concurrent accesses.

In the originally working environment I am going to guess it was the OpenMP threading layer in use due to the following.

I believe that your hypothesis is correct. On my laptop (MacOS), the tests all pass and the threading layer appears to be OpenMP:

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : True
+-->Vendor: Intel
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

This explains why things work locally as my laptop is able to use OpenMP instead of workqueue. However, on Github Actions, when I attempt to brew install libomp before executing the same tests, I get:

Warning: libomp 15.0.3 is already installed and up-to-date.
To reinstall 15.0.3, run:
  brew reinstall libomp

but numba -s doesn’t seem to be able to find openmp:

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : False
+--> Disabled due to Unknown import problem.
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

I noticed that the docs for selecting a threading layer specify installing intel-openmp for MacOS. Thus, I switched over to installing it with pip:

Collecting intel-openmp
  Downloading intel_openmp-2022.2.0-py2.py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.whl (751 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.4/751.4 kB 9.3 MB/s eta 0:00:00
Installing collected packages: intel-openmp
Successfully installed intel-openmp-2022.2.0

and yet numba -s still doesn’t seem to be finding OpenMP:

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : False
+--> Disabled due to Unknown import problem.
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

Alas, by adding python -c "from numba.np.ufunc import omppool" to the Github Actions workflow, we see

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: dlopen(/Users/runner/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/numba/np/ufunc/omppool.cpython-38-darwin.so, 0x0002): Library not loaded: '@rpath/libomp.dylib'
  Referenced from: '/Users/runner/hostedtoolcache/Python/3.8.14/x64/lib/python3.8/site-packages/numba/np/ufunc/omppool.cpython-38-darwin.so'
  Reason: tried: '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/Users/ci/miniconda3/envs/numba-ci/envs/testenv_a127991e-060b-42ba-872c-ce53451ba269/lib/libomp.dylib' (no such file), '/usr/local/lib/libomp.dylib' (no such file), '/usr/lib/libomp.dylib' (no such file)
Error: Process completed with exit code 1.

I think I figured out the problem and am testing my solution. I’ll report back once I’ve ironed things out!

TL;DR: Add the following to your Github Actions:

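- name: Link OpenMP
  # libomp is pre-installed on the MacOS runner but not linked into /usr/local/lib
  run: |
    if [ "$RUNNER_OS" == "macOS" ]; then
      brew link --force libomp
    fi
  shell: bash
- name: Show Full Numba Environment
  # Print what Numba detects, including the threading layers, on every run
  run: python -m numba -s
  shell: bash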

In Github Actions + MacOS, we discovered through brew install libomp that OpenMP was actually already pre-installed. However, numba -s could not locate the libomp.dylib library in the standard locations. It turns out that all that was needed was to instruct Homebrew to link the library over into the relevant directories via brew link --force libomp. After this, libomp.dylib is linked into the /usr/local/lib directory, which numba -s is able to discover and we have:

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : True
+-->Vendor: Intel
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.
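To double-check from Python that the link step put the library where the loader looks, a short sketch like this can list which candidate locations actually contain libomp.dylib. The first two directories come from the dlopen message earlier in the thread; /usr/local/opt/libomp/lib is my assumption about where Homebrew keeps the unlinked copy on Intel Macs, and locate_library is a hypothetical helper.

```python
from pathlib import Path

# Directories to check. The first two are the system paths listed in the
# dlopen error; the third is an assumed Homebrew location (Intel Macs).
CANDIDATE_DIRS = [
    "/usr/local/lib",
    "/usr/lib",
    "/usr/local/opt/libomp/lib",
]


def locate_library(filename, directories):
    """Return the paths under `directories` where `filename` exists.

    Path.exists() follows symlinks, so a dangling link does not count.
    """
    found = []
    for directory in directories:
        candidate = Path(directory) / filename
        if candidate.exists():
            found.append(candidate)
    return found


if __name__ == "__main__":
    hits = locate_library("libomp.dylib", CANDIDATE_DIRS)
    if hits:
        for path in hits:
            print(f"found: {path}")
    else:
        print("libomp.dylib not found in any candidate directory")
```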

See the relevant Github issue and subsequent PR here.

@stuartarchibald Thank you SO much for assisting me with this and directing me towards the light! I appreciate your patience, guidance, and support. Without your help, I would’ve been in over my head and given up sooner! :pray:

@seanlaw no problem, many thanks for your persistence in resolving this, I’m pleased that you have found a way to fix the issue based on the above discussions!

Should you ever find the cause of the original problem (probably OpenMP libraries going missing via a dependency change) I’d be interested in hearing what happened!
