When we start a long-running computation, we typically start it in a non-main thread, since the main thread is the GUI thread and we keep the GUI thread free to show things like a progress bar with updates. The long-running computations typically use numba.
In the past few months, we started seeing this message when running these:
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
I guess there is some TBB backend that is forking, and since it is running in a non-main thread, it produces this warning? I have a couple of questions:
Why did this message only start appearing recently for us? Is it due to something new?
Is there any way we can continue to call numba functions from non-main threads? This is preferable for us since, as I mentioned, the main thread is the GUI thread.
Thank you,
Patrick
Some system details:
OS: Ubuntu-22.04 (although I believe the warning is occurring on Mac as well)
python==3.10.6
numba==0.56.4
llvmlite==0.39.1
numpy==1.22.4
I don’t think it will be TBB itself forking (or at least it wouldn’t be to my knowledge). That message appears because a fork(2) call has been made from a thread which is not the thread on which the TBB backend/library was initialised.
Not sure why it only started appearing recently. Have you tried some different versions of Numba? I think that PR #6324 added both the detection code for this problem and the error message; it would have first appeared in Numba 0.52, to fix issue #5973.
Whether you can keep calling Numba functions from non-main threads will depend on what is causing the fork(2) call in the first place and whether the child processes need to use TBB. Maybe try to trace the cause of the fork(2) call; once it’s found, it might be possible to suggest workarounds. Standard Linux tooling such as strace and gdb should help with finding where the call is made.
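As a complement to strace and gdb, forks that go through Python's os.fork() can also be traced from inside the interpreter. What follows is only a minimal sketch: fork_events, install_fork_tracer and dump_fork_events are names made up for illustration, and hooks registered with os.register_at_fork do not fire for processes launched through the subprocess module, so a fork made that way would not show up here.

```python
import os
import threading
import traceback

# Each entry: (thread name, is_main_thread, formatted Python stack).
fork_events = []


def _record_fork():
    """Runs in the parent process just before a fork made via os.fork()."""
    thread = threading.current_thread()
    fork_events.append((
        thread.name,
        thread is threading.main_thread(),
        "".join(traceback.format_stack()),
    ))


def install_fork_tracer():
    # os.register_at_fork is available on Unix from Python 3.7 onwards.
    os.register_at_fork(before=_record_fork)


def dump_fork_events():
    """Print which thread asked for each fork, with its Python stack."""
    for name, is_main, stack in fork_events:
        print(f"fork requested from thread {name!r} (main thread: {is_main})")
        print(stack)
```

Calling install_fork_tracer() early in the application and dump_fork_events() after the warning has appeared should show the Python-level frames that lead to the fork, which the stripped gdb backtrace cannot.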
I went back and tested a few different numba versions, and it looks like the message starts occurring at numba==0.54.0.
I also ran my python program with gdb and created a breakpoint via break fork. When the relevant functions get called, here is the backtrace at break fork. I’m not sure if it is helpful (let me know if I should try something else):
#0 __libc_fork () at ./posix/fork.c:41
#1 0x00005555557e193e in ()
#2 0x00005555556ad64e in ()
#3 0x000055555569ea72 in _PyEval_EvalFrameDefault ()
#4 0x00005555556b03ac in _PyFunction_Vectorcall ()
#5 0x000055555569914a in _PyEval_EvalFrameDefault ()
#6 0x00005555556a5964 in _PyObject_FastCallDictTstate ()
#7 0x00005555556ba594 in ()
#8 0x00005555556a677c in _PyObject_MakeTpCall ()
#9 0x000055555569ee39 in _PyEval_EvalFrameDefault ()
#10 0x00005555556b03ac in _PyFunction_Vectorcall ()
#11 0x000055555569ea72 in _PyEval_EvalFrameDefault ()
#12 0x00005555556b03ac in _PyFunction_Vectorcall ()
#13 0x000055555569914a in _PyEval_EvalFrameDefault ()
#14 0x00005555556b03ac in _PyFunction_Vectorcall ()
#15 0x000055555569914a in _PyEval_EvalFrameDefault ()
#16 0x00005555556b03ac in _PyFunction_Vectorcall ()
#17 0x000055555569914a in _PyEval_EvalFrameDefault ()
#18 0x00005555556b03ac in _PyFunction_Vectorcall ()
#19 0x000055555569914a in _PyEval_EvalFrameDefault ()
#20 0x00005555556be4de in ()
#21 0x000055555569b3b0 in _PyEval_EvalFrameDefault ()
#22 0x00005555556b03ac in _PyFunction_Vectorcall ()
#23 0x0000555555699005 in _PyEval_EvalFrameDefault ()
#24 0x00005555556be391 in ()
#25 0x000055555569a2fc in _PyEval_EvalFrameDefault ()
#26 0x00005555556b03ac in _PyFunction_Vectorcall ()
#27 0x000055555569914a in _PyEval_EvalFrameDefault ()
#28 0x00005555556be391 in ()
#29 0x00005555556bf032 in PyObject_Call ()
#30 0x000055555569b3b0 in _PyEval_EvalFrameDefault ()
#31 0x00005555556be391 in ()
#32 0x000055555569a2fc in _PyEval_EvalFrameDefault ()
#33 0x00005555556be391 in ()
#34 0x00005555556bf032 in PyObject_Call ()
#35 0x000055555569b3b0 in _PyEval_EvalFrameDefault ()
#36 0x00005555556be5f1 in ()
#37 0x000055555569b3b0 in _PyEval_EvalFrameDefault ()
#38 0x00005555556be5f1 in ()
#39 0x00007fff7ca1f117 in QRunnableWrapper::run() ()
at /home/patrick/virtualenvs/hexrd/lib/python3.10/site-packages/PySide2/QtCore.abi3.so
#40 0x00007fff7bcb766a in QThreadPoolThread::run() ()
at /home/patrick/virtualenvs/hexrd/lib/python3.10/site-packages/PySide2/Qt/lib/libQt5Core.so.5
#41 0x00007fff7bcb3b35 in QThreadPrivate::start(void*) ()
at /home/patrick/virtualenvs/hexrd/lib/python3.10/site-packages/PySide2/Qt/lib/libQt5Core.so.5
#42 0x00007ffff7cdbb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#43 0x00007ffff7d6da00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
By the way, we are using a ProcessPoolExecutor for this. That may be the cause of the fork() call. But I believe it is important for us to use the ProcessPoolExecutor on a non-main thread, since the main thread is the GUI thread and we don’t want to block the application.
The numba functions are called within the separate processes of the ProcessPoolExecutor.
RE the Numba version. I think the reporting of this condition was added in #6324 which is part of Numba 0.52.
RE the use of fork(2). This note is made in the above PR:
Maybe it helps assess whether the occurrence of this message is “safe”? From what I can recall, the “main” thread in this case is the one that launched the TBB pool; this launch typically occurs through calling a @njit(parallel=True) function.
I’ve also taken a look at the CPython source for concurrent.futures.ProcessPoolExecutor. It takes an mp_context kwarg for supplying a multiprocessing context and, looking at how it is used, the implementation calls into that context to start workers (processes) and also, e.g., to adjust the process count. I would imagine that if the default context is used on Unix (fork), that could be the source of the fork(2) calls.
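To make the mp_context point concrete, here is a rough sketch of passing an explicit context to ProcessPoolExecutor while still driving the pool from a non-main thread. The helper names (make_context, run_in_background), the worker count and the choice of "spawn" are assumptions for illustration, not something tested against the application in question. With "spawn" each worker starts from a fresh interpreter rather than a forked copy of the parent's memory, but I have not verified whether that silences the warning in every setup.

```python
import math
import multiprocessing
import threading
from concurrent.futures import ProcessPoolExecutor


def make_context(start_method="spawn"):
    # "spawn" and "forkserver" are standard multiprocessing start methods;
    # which one suits a given application is an assumption to verify.
    return multiprocessing.get_context(start_method)


def run_in_background(func, items, on_done, start_method="spawn"):
    """Run func over items in worker processes, driven from a helper thread."""
    def _drive_pool():
        ctx = make_context(start_method)
        with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
            # on_done is called in the helper thread, not the GUI thread.
            on_done(list(pool.map(func, items)))

    thread = threading.Thread(target=_drive_pool, name="pool-driver")
    thread.start()
    return thread


if __name__ == "__main__":
    # math.sqrt stands in for a Numba-compiled function defined in an
    # importable module (spawned workers must be able to import it).
    run_in_background(math.sqrt, [1.0, 4.0, 9.0], print).join()
```

Note that spawned workers re-import the modules they need, so any Numba functions would be compiled (or loaded from cache) again in each worker, and the entry point needs the usual `if __name__ == "__main__":` guard.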
To debug this, perhaps try to find the thread that launches the TBB pool? There’s a symbol in the threading layers, launch_threads, which is called to launch the thread pool; I would guess that it can be used as a breakpoint in e.g. gdb to help with this.
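If a gdb breakpoint is awkward to set up, a Python-level approximation is to record which thread first calls each function suspected of launching the pool. This is only a sketch: record_first_caller and first_callers are hypothetical names, heavy_kernel is a plain stand-in for a @njit(parallel=True) function, and the approach assumes (as recalled above) that the first call to such a function is what launches the pool.

```python
import functools
import threading

# Maps function name -> (thread name, is_main_thread) of its first caller.
first_callers = {}


def record_first_caller(func):
    """Remember which thread first called func (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        thread = threading.current_thread()
        # setdefault keeps the first recorded caller and ignores later ones.
        first_callers.setdefault(
            func.__name__,
            (thread.name, thread is threading.main_thread()),
        )
        return func(*args, **kwargs)
    return wrapper


@record_first_caller
def heavy_kernel(n):
    # Stand-in for a Numba parallel function; plain Python here.
    return sum(range(n))
```

Inspecting first_callers after the warning appears would show whether the first call came from the GUI thread or from a worker thread.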