Thanks a lot for another suggestion, @sschaer.
```python
@nb.njit(cache=True)
def test(addr, v, z):
    return kv(addr, v, z)

v, z = 1.5, 10.5
test(addr, 1.5, 10.5)  # outputs 6.260706303843087
```

Your code clearly works.
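For context, here is a minimal sketch of the function-pointer pattern the snippet above relies on, with libm's `cos` standing in for the actual `kv` wrapper (that substitution, and the names below, are my illustration only; the real `addr`/`kv` setup is not shown here):

```python
import ctypes
import ctypes.util

# Load the POSIX math library in this process.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Raw integer address of `cos` as mapped into *this* process:
addr = ctypes.cast(libm.cos, ctypes.c_void_p).value

# Rebuild a callable from that raw address.
cos_from_addr = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(addr)
print(cos_from_addr(0.0))  # 1.0
```

The key point is that `addr` is just an integer naming a location in the current process's address space.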
I then set up a local scheduler with distributed's `LocalCluster()`. This still runs on the same machine, but makes it possible to distribute tasks to local worker processes (in this case, 4 workers):
```python
from distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
client.run(test, addr, v, z)  # raises CommClosedError
```

This outputs:
```
2023-11-22 12:34:28,781 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:39139 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:47412 remote=tcp://127.0.0.1:39139>: Stream is closed
2023-11-22 12:34:28,788 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34545 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:35376 remote=tcp://127.0.0.1:34545>: Stream is closed
2023-11-22 12:34:28,792 - distributed.nanny - WARNING - Restarting worker
2023-11-22 12:34:28,794 - distributed.nanny - WARNING - Restarting worker
2023-11-22 12:34:28,809 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33269 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:52468 remote=tcp://127.0.0.1:33269>: Stream is closed
2023-11-22 12:34:28,814 - distributed.nanny - WARNING - Restarting worker
2023-11-22 12:34:28,853 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:33645 failed: CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:35752 remote=tcp://127.0.0.1:33645>: Stream is closed
2023-11-22 12:34:28,859 - distributed.nanny - WARNING - Restarting worker
---------------------------------------------------------------------------
CommClosedError                           Traceback (most recent call last)
Cell In[40], line 1
----> 1 client.run(test, addr, v, z)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2901, in Client.run(self, function, workers, wait, nanny, on_error, *args, **kwargs)
   2818 def run(
   2819     self,
   2820     function,
   (...)
   2826     **kwargs,
   2827 ):
   2828     """
   2829     Run a function on all workers outside of task scheduling system
   2830
   (...)
   2899     >>> c.run(print_state, wait=False)  # doctest: +SKIP
   2900     """
-> 2901     return self.sync(
   2902         self._run,
   2903         function,
   2904         *args,
   2905         workers=workers,
   2906         wait=wait,
   2907         nanny=nanny,
   2908         on_error=on_error,
   2909         **kwargs,
   2910     )
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337     return future
    338 else:
--> 339     return sync(
    340         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341     )
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
    404 if error:
    405     typ, exc, tb = error
--> 406     raise exc.with_traceback(tb)
    407 else:
    408     return result
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:379, in sync.<locals>.f()
    377     future = asyncio.wait_for(future, callback_timeout)
    378 future = asyncio.ensure_future(future)
--> 379 result = yield future
    380 except Exception:
    381     error = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/gen.py:769, in Runner.run(self)
    766 exc_info = None
    768 try:
--> 769     value = future.result()
    770 except Exception:
    771     exc_info = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2806, in Client._run(self, function, nanny, workers, wait, on_error, *args, **kwargs)
   2803     continue
   2805 if on_error == "raise":
-> 2806     raise exc
   2807 elif on_error == "return":
   2808     results[key] = exc
CommClosedError: in <TCP (closed) Scheduler Broadcast local=tcp://127.0.0.1:52468 remote=tcp://127.0.0.1:33269>: Stream is closed
```
So when running `test(addr, v, z)` in a distributed fashion, the scheduler logs show that the worker processes are killed by signal 11 (SIGSEGV), resulting in the `CommClosedError` above. Fate of the 4 worker processes:
```
{'tcp://127.0.0.1:33565': (('INFO',
   "2023-11-22 12:33:05,043 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:43347'"),
  ('INFO',
   "2023-11-22 12:33:05,045 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:45013'"),
  ('INFO',
   "2023-11-22 12:33:05,050 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:44561'"),
  ('INFO',
   "2023-11-22 12:33:05,054 - distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:38947'"),
  ('INFO',
   '2023-11-22 12:33:36,404 - distributed.nanny - INFO - Worker process 1353 was killed by signal 11'),
  ('INFO',
   '2023-11-22 12:33:36,408 - distributed.nanny - INFO - Worker process 1360 was killed by signal 11'),
  ('INFO',
   '2023-11-22 12:33:36,412 - distributed.nanny - INFO - Worker process 1357 was killed by signal 11'),
  ('WARNING',
   '2023-11-22 12:33:36,419 - distributed.nanny - WARNING - Restarting worker'),
  ('WARNING',
   '2023-11-22 12:33:36,423 - distributed.nanny - WARNING - Restarting worker'),
  ('WARNING',
   '2023-11-22 12:33:36,428 - distributed.nanny - WARNING - Restarting worker'),
  ('INFO',
   '2023-11-22 12:33:36,461 - distributed.nanny - INFO - Worker process 1362 was killed by signal 11'),
  ('WARNING',
   '2023-11-22 12:33:36,465 - distributed.nanny - WARNING - Restarting worker'),
  ('INFO',}
```
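One hypothesis I'm considering: `addr` is a raw in-process function pointer, and the `LocalCluster` workers are separate processes, so dereferencing the client's address inside a worker would be exactly the kind of thing that produces a signal 11. A sketch of the process-local alternative, using libm's `cos` as a stand-in for the real wrapper and `multiprocessing` in place of distributed, purely to illustrate the idea (the `fork` start method assumes Linux/macOS):

```python
import ctypes
import ctypes.util
from multiprocessing import get_context

def call_with_local_addr(x):
    # Each process resolves its *own* address instead of receiving one
    # that was looked up in another process.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    addr = ctypes.cast(libm.cos, ctypes.c_void_p).value
    f = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(addr)
    return f(x)

if __name__ == "__main__":
    with get_context("fork").Pool(2) as pool:  # fork: POSIX only
        print(pool.map(call_with_local_addr, [0.0, 0.0]))  # [1.0, 1.0]
```

In distributed terms that would mean computing `addr` on each worker (e.g. via `client.run` of the lookup itself) rather than broadcasting the client-side value, but I have not verified that this is actually the cause here.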
I would be interested in your thoughts on why this could be happening. Since this is moving away from Numba topics, though, I should perhaps consider moving this post to another forum.