Track down test isolation issue

I have a series of unit tests of device functions, each of which runs kernels. When I run them separately they are fine. However, when I run them all together, at a certain test I start to get memory errors.

When I run under compute-sanitizer, it's the same story: no errors when run separately; errors when run together.

In fact, I've now narrowed it down to two kernels: running (1) then (2) results in an error, while running them separately or in the opposite order does not. (Kernel 1 has to run several different test cases, but then kernel 2 fails immediately on its first case.) AFAICT, (1) doesn't seem to do anything nefarious. Perhaps there is an array bound violated somewhere, but compute-sanitizer doesn't see it in (1) when run alone. (2) is (after commenting stuff out) pretty short.

I have no session fixtures - everything should be recreated per test. But I do have relatively complex data structures, which I pass in (nested) named tuples of device arrays.

I am wondering if later tests are being run before memory has been freed for them? Is there anything I can do to check and intervene?

NB - the Python error looks like:

Exception ignored in: <finalize object at 0x7f4450566e80; dead>
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2186, in deref
    mem.free()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1975, in free
    self._finalizer()
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1627, in core
    deallocations.add_item(driver.cuMemFree, ptr, size)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1193, in add_item
    self.clear()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1204, in clear
    dtor(handle)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 340, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 408, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemFree results in CUDA_ERROR_LAUNCH_FAILED

(then repeated)

The compute-sanitizer error starts out like this (and then gets very long):

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 4 bytes
=========     at 0x15b0 in cudapy::tests::context::cuda::adapt::cluster::accumulate::kernels::ssc_accumulate_stats_kernel:
:ssc_accumulate_stats_kernel[abi:v52,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8
oxrNQE_3d](HostContext_28TrainingConfig_28UniTuple_28float32_20x_202_29_2c_20float32_2c_20float32_2c_20int64_2c_20int64_2c
_20int64_2c_20int64_2c_20int64_2c_20int64_2c_20UniTuple_28float32_20x_203_29_2c_20ClusterParams_28int16_2c_20int16_2c_20in
t16_2c_20float32_2c_20int16_2c_20int16_2c_20int16_2c_20int16_29_2c_20int64_2c_20UniTuple_28int64_20x_202_29_2c_20UniTuple_
28int64_20x_202_29_29_2c_20ModelData_28int64_2c_20int32_2c_20array_28float64_2c_201d_2c_20C_29_2c_20array_28int64_2c_201d_
2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28bool_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c
_20array_28int32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_
28int32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32
_2c_201d_2c_20C_29_2c_20array_28uint8_2c_201d_2c_20C_29_2c_20array_28uint8_2c_201d_2c_20C_29_2c_20array_28float64_2c_201d_
2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20
C_29_2c_20array_28int8_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_
20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20arr
ay_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28in
t32_2c_201d_2c_20C_29_29_2c_20TrainingHostState_28TrainingData_28array_28float32_2c_202d_2c_20C_29_2c_20array_28float32_2c
_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float64_2c_201d_2c_20C_29_29_2c_20ModelState_28array_2
8float32_2c_202d_2c_20C_29_20x_207_29_2c_20array_28Record_28s0_5btype_3duint64_3boffset_3d0_5d_2cs1_5btype_3duint64_3boffs
et_3d8_5d_3b16_3bTrue_29_2c_201d_2c_20C_29_2c_20Overlay_28array_28float32_2c_201d_2c_20C_29_2c_20array_28int8_2c_201d_2c_2
0C_29_2c_20array_28uint8_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_29_29
_29, short, SetSampleClusterPos_28int64_2c_20int64_2c_20int64_2c_20bool_29, long long)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f42f99a4f88 is out of bounds
=========     and is 5,052,412,024 bytes before the nearest allocation at 0x7f4426c00000 of size 20 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x306526]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:ffi_call_unix64 [0x69dd]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7
=========     Host Frame:ffi_call_int [0x6067]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7
=========     Host Frame:/usr/local/src/conda/python-3.9.12/Modules/_ctypes/callproc.c:1263:_ctypes_callproc [0x140f6]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so
=========     Host Frame:/usr/local/src/conda/python-3.9.12/Modules/_ctypes/_ctypes.c:4212:PyCFuncPtr_call [0x1441f]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Objects/call.c:283:_PyObject_Call [0x145ac9]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python

It seems that if I run the "right" number of kernels, I get just the "ignored" error above as a warning (I also get compute-sanitizer errors). When I then try to start the next kernel, it fails on launch. The following traceback shows it failing while creating device arrays in a fixture:

    return f(*itertools.chain(preset_pos_args, args), **kwargs)
tests/context/cuda/fixtures.py:821: in case_cluster_2_levels
    case = TrainCase.create(problem, cu_params, nodes)
tests/context/cuda/fixtures.py:476: in create
    cu_mod = context.CUModel.from_model(
src/timbuktu/context/cu_model.py:246: in from_model
    is_iv_feature=cuda.to_device(
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/devices.py:232: in _require_cuda_context
    return fn(*args, **kws)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/api.py:128: in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy,
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py:876: in auto_device
    devobj = from_array_like(obj, stream=stream)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py:797: in from_array_like
    return DeviceNDArray(ary.shape, ary.strides, ary.dtype, stream=stream,
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py:103: in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:1385: in memalloc
    return self.memory_manager.memalloc(bytesize)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:1077: in memalloc
    self._attempt_allocation(allocator)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:864: in _attempt_allocation
    return allocator()
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:1075: in allocator
    driver.cuMemAlloc(byref(ptr), size)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:340: in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <numba.cuda.cudadrv.driver.Driver object at 0x7f626d245670>, fname = 'cuMemAlloc', retcode = 719

    def _check_ctypes_error(self, fname, retcode):
        if retcode != enums.CUDA_SUCCESS:
            errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
            msg = "Call to %s results in %s" % (fname, errname)
            _logger.error(msg)
            if retcode == enums.CUDA_ERROR_NOT_INITIALIZED:
                self._detect_fork()
>           raise CudaAPIError(retcode, msg)
E           numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemAlloc results in CUDA_ERROR_LAUNCH_FAILED

CUDA kernel launches run asynchronously, so you will see errors that occurred during earlier kernel launches returned from later calls to API functions. You can catch the initial error with compute-sanitizer, as you have observed, or you can set the environment variable CUDA_LAUNCH_BLOCKING=1 to have kernel launches run synchronously, which will return the error from the kernel launch in which it originated. (Generally you should not run with blocking launches, though, because it serializes / synchronizes a lot of API execution.)
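For example, the placement matters: the variable must be in the environment before the CUDA context is created, so either export it when invoking pytest, or set it at the very top of conftest.py before anything touches the device:

```python
import os

# Must run before the first kernel launch / device allocation creates the
# CUDA context - e.g. at the very top of conftest.py, or exported in the
# shell before starting pytest.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```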

The error is identified by compute-sanitizer:

========= Invalid __global__ read of size 4 bytes
=========     at 0x15b0 in cudapy::tests::context::cuda::adapt::cluster::accumulate::kernels::ssc_accumulate_stats_kernel:
:ssc_accumulate_stats_kernel[abi:v52,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8
oxrNQE_3d] ...

Somehow there's an invalid memory access in that kernel. If you add debug=True and opt=False to the @cuda.jit decorator and run again, it should report the source file and line number at which the invalid access occurred, which might help pinpoint the next steps for debugging.


Thanks! … hmm … I tried with debug=True, opt=False and CUDA_LAUNCH_BLOCKING=1.

Currently I'm running 5 tests, each of which calls a kernel (3 call one kernel, 2 the other). If I comment out any one test, they all pass. Run all together with the above options, however, the session seg-faults during the last test. Also, if I run only the 2nd kernel's tests, all 4 of its test variants pass (that is, the two shown and another 2).

NB: I tried both the branch you produced for kernel caching with cooperative groups, and 0.57.0rc1.

numba = {git = "https://github.com/gmarkall/numba.git", branch= "issue-8888-cg"}
# or
numba = "0.57.0rc1"

Console output:

CUDA_LAUNCH_BLOCKING=1 compute-sanitizer \
>   --log-file compute-sanitizer.out pytest -s --cuda \
>     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2] \
>     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2_x_samples] \
>     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cluster_2_levels] \
>     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[2_levels] \
>     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2] \
>     -vv --durations 5 -x
================================================== test session starts ===================================================
platform linux -- Python 3.9.12, pytest-7.3.1, pluggy-1.0.0 -- /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
cachedir: .pytest_cache
rootdir: /home/shauncutts/src/factfiber.ai/learn/timbuktu
configfile: pyproject.toml
plugins: snapshot-0.9.0, cov-4.0.0, cases-3.6.14, timeout-2.1.0
collected 7 items                                                                                                        

tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2_x_samples] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cluster_2_levels] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[2_levels] STREAM 0
i_cluster_cuts 0 i_cut_off 0 i_cut 0 0 1
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2] STREAM 0
Fatal Python error: Segmentation fault

Thread 0x00007f90d9270500 (most recent call first):
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 339 in safe_cuda_api_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2569 in launch_kernel
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 329 in launch
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 677 in call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 541 in __call__
  File "/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/cuda/fixtures.py", line 393 in wrapped
  File "/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py", line 158 in test_ssc_accumulate_stats
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/python.py", line 1799 in runtest
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/pytest", line 8 in <module>

Compute sanitizer output:

========= COMPUTE-SANITIZER
========= Error: process didn't terminate successfully
=========     The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or a host debugger to catch host side errors.
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

More info: under numba 0.56.4 it still crashes, but doesn't segfault. Also, under 0.56.4 it takes 5 minutes to compile the kernels (!) vs ~10 sec with 0.57.0rc1 or the bugfix branch, and ~1-2 sec if the kernels are cached.

$ pip install numba==0.56.4
Collecting numba==0.56.4
  Downloading numba-0.56.4-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.5/3.5 MB 6.7 MB/s eta 0:00:00
Collecting llvmlite<0.40,>=0.39.0dev0 (from numba==0.56.4)
  Downloading llvmlite-0.39.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.6/34.6 MB 7.5 MB/s eta 0:00:00
Collecting numpy<1.24,>=1.18 (from numba==0.56.4)
  Using cached numpy-1.23.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Requirement already satisfied: setuptools in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages (from numba==0.56.4) (67.6.1)
WARNING: Error parsing requirements for timbuktu: Invalid URL: ../../mlops/lamarck
Installing collected packages: numpy, llvmlite, numba
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.2
    Uninstalling numpy-1.24.2:
      Successfully uninstalled numpy-1.24.2
  Attempting uninstall: llvmlite
    Found existing installation: llvmlite 0.40.0rc1
    Uninstalling llvmlite-0.40.0rc1:
      Successfully uninstalled llvmlite-0.40.0rc1
  Attempting uninstall: numba
    Found existing installation: numba 0.57.0rc1
    Uninstalling numba-0.57.0rc1:
      Successfully uninstalled numba-0.57.0rc1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ingolstadt 0.1.0 requires llvmlite==0.40.0rc1, but you have llvmlite 0.39.1 which is incompatible.
lamarck 0.1.0 requires xarray<2023.0.0,>=2022.6.0, but you have xarray 2023.4.1 which is incompatible.
Successfully installed llvmlite-0.39.1 numba-0.56.4 numpy-1.23.5
(timbuktu-py3.9) (base) shauncutts@silk:~/.../learn/timbuktu$ poetry exec clear-pycache:test
Exec: find tests -name '__pycache__' -type d -exec rm -rf {} + 

(timbuktu-py3.9) (base) shauncutts@silk:~/.../learn/timbuktu$ poetry exec clear-pycache
Exec: find . -name '__pycache__' -type d -exec rm -rf {} + 

(timbuktu-py3.9) (base) shauncutts@silk:~/.../learn/timbuktu$ CUDA_LAUNCH_BLOCKING=1 compute-sanitizer   --log-file compute-sanitizer.out pytest -s --cuda     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2]     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2_x_samples]     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cluster_2_levels]     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[2_levels]     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2]     -vv --durations 5 -x
================================================== test session starts ===================================================
platform linux -- Python 3.9.12, pytest-7.3.1, pluggy-1.0.0 -- /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
cachedir: .pytest_cache
rootdir: /home/shauncutts/src/factfiber.ai/learn/timbuktu
configfile: pyproject.toml
plugins: snapshot-0.9.0, cov-4.0.0, cases-3.6.14, timeout-2.1.0
collected 5 items                                                                                                        

tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2_x_samples] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cluster_2_levels] STREAM 0
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[2_levels] STREAM 0
i_cluster_cuts 0 i_cut_off 0 i_cut 0 0 1
PASSED
tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2] STREAM 0
FAILED

======================================================== FAILURES ========================================================
___________________________________________ test_ssc_accumulate_stats[cross_2] ___________________________________________

case = TrainCase(rand=RandomState(MT19937) at 0x7F8088EDCA40, model=<timbuktu.context.cu_model.CUModel object at 0x7f8088f0e8...3411, grid=(1, 1), block=(2, 20)), _cluster_state=None, _cut_state=None, _assess_state=None, _host_global_overlay=None)

    @pytest.mark.cuda
    # @pytest.mark.skip(reason="Temporarily disabled")
    def test_ssc_accumulate_stats(case: TrainCase) -> None:
        i_level = np.int16(2)
        p_current = accumulate.create_sscp(
            i_set=0, i_off_sample=0, i_cluster=0, valid=True
        )
        i_off = 0
>       case.prep_kernel(
            kernels.ssc_accumulate_stats_kernel, grid=(1, 1), block=(1, 1)
        )(case.host_context, i_level, p_current, i_off)

tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py:158: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/context/cuda/fixtures.py:393: in wrapped
    inst(*args)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py:491: in __call__
    return self.dispatcher.call(args, self.griddim, self.blockdim,
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py:627: in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/dispatcher.py:279: in launch
    driver.launch_kernel(cufunc.handle,
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:2545: in launch_kernel
    driver.cuLaunchCooperativeKernel(cufunc_handle,
../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:320: in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)

[SNIP]

Start of rest of err w/ numba 0.56.4:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <numba.cuda.cudadrv.driver.Driver object at 0x7f819d257bb0>, fname = 'cuLaunchCooperativeKernel', retcode = 719

    def _check_ctypes_error(self, fname, retcode):
        if retcode != enums.CUDA_SUCCESS:
            errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
            msg = "Call to %s results in %s" % (fname, errname)
            _logger.error(msg)
            if retcode == enums.CUDA_ERROR_NOT_INITIALIZED:
                self._detect_fork()
>           raise CudaAPIError(retcode, msg)
E           numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuLaunchCooperativeKernel results in CUDA_ERROR_LAUNCH_FAILED

../../../../.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py:388: CudaAPIError
================================================== slowest 5 durations ===================================================
349.08s call     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[2_levels]
293.38s call     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2]
2.85s call     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2]
0.20s call     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cross_2_x_samples]
0.20s call     tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_calculate_cluster_entropy[cluster_2_levels]
================================================ short test summary info =================================================
FAILED tests/context/cuda/adapt/cluster/accumulate/test_accumulate.py::test_ssc_accumulate_stats[cross_2] - numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuLaunchCooperativeKernel results in CUDA_ERROR_LAUNCH_FAILED
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================== 1 failed, 4 passed in 648.22s (0:10:48) =========================================
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 667, in _exitfunc
    f()
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2162, in deref
    mem.free()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1951, in free
    self._finalizer()
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1603, in core
    deallocations.add_item(driver.cuMemFree, ptr, size)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1169, in add_item
    self.clear()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1180, in clear
    dtor(handle)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 320, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 388, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemFree results in CUDA_ERROR_LAUNCH_FAILED
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 667, in _exitfunc
    f()
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 2162, in deref
    mem.free()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1951, in free
    self._finalizer()
  File "/usr/local/anaconda3/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1603, in core
    deallocations.add_item(driver.cuMemFree, ptr, size)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1169, in add_item
    self.clear()
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 1180, in clear
    dtor(handle)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 320, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 388, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemFree results in CUDA_ERROR_LAUNCH_FAILED
[SNIP - the same traceback repeats a third time]

Start of compute-sanitizer output (numba 0.56.4):

========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 4 bytes
=========     at 0x8e0 in timbuktu::context::cuda::adapt::cluster::accumulate::ssc_accumulate_stats[abi:v53,cw51cXTLSUwHBinCqbbgUAAGBlq82ILSCEQYkgSQBFCjFSaBZJtttTo4sahbKRjoKKgDRYEsYyALFiRALtiJtZoA](ClusterLevelContext_28GridGroup_2c_20MiniBatchState_28int64_2c_20int64_2c_20float64_2c_20float64_29_2c_20TrainingConfig_28UniTuple_28float32_20x_202_29_2c_20float32_2c_20float32_2c_20int64_2c_20int64_2c_20int64_2c_20int64_2c_20int64_2c_20int64_2c_20UniTuple_28float32_20x_203_29_2c_20ClusterParams_28int16_2c_20int16_2c_20int16_2c_20float32_2c_20int16_2c_20int16_2c_20int16_2c_20int16_29_2c_20int64_2c_20UniTuple_28int64_20x_202_29_2c_20UniTuple_28int64_20x_202_29_29_2c_20ModelData_28int64_2c_20int32_2c_20array_28float64_2c_201d_2c_20C_29_2c_20array_28int64_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28bool_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28uint8_2c_201d_2c_20C_29_2c_20array_28uint8_2c_201d_2c_20C_29_2c_20array_28float64_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int8_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_29_2c_20TrainingData_28array_28float32_2c_202d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float64_2c_201d_2c_20C_29_29_2c_20ModelState_28array_28float32_2c_202d_2c_20C_29_20x_207
_29_2c_20array_28Record_28s0_5btype_3duint64_3boffset_3d0_5d_2cs1_5btype_3duint64_3boffset_3d8_5d_3b16_3bTrue_29_2c_201d_2c_20C_29_2c_20ClusterState_28ClusterPrepareState_28array_28int16_2c_201d_2c_20C_29_2c_20array_28int8_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28int8_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_29_2c_20ClusterFindState_28array_28int16_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_29_2c_20ClusterOptimizeState_28array_28int16_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int32_2c_201d_2c_20C_29_2c_20array_28int16_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_2c_20array_28float32_2c_201d_2c_20C_29_29_29_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_2c_20int16_29, SetSampleClusterPos_28int64_2c_20int64_2c_20int64_2c_20bool_29, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f81f4c06628 is out of bounds
=========     and is 1,780,507,689 bytes after the nearest allocation at 0x7f818aa00000 of size 512 bytes
=========     Device Frame:/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/cuda/adapt/cluster/accumulate/kernels/ssc_accumulate_stats_kernel.py:34:tests::context::cuda::adapt::cluster::accumulate::kernels::ssc_accumulate_stats_kernel::ssc_accumulate_stats_kernel[abi:v51,...](... [SNIP - long mangled type signature] ..., short, SetSampleClusterPos_28int64_2c_20int64_2c_20int64_2c_20bool_29, long long) [0xf96f0]
=========     Device Frame:/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/cuda/adapt/cluster/accumulate/kernels/ssc_accumulate_stats_kernel.py:10:cudapy::tests::context::cuda::adapt::cluster::accumulate::kernels::ssc_accumulate_stats_kernel::ssc_accumulate_stats_kernel[abi:v51,...](... [SNIP - long mangled type signature] ..., short, SetSampleClusterPos_28int64_2c_20int64_2c_20int64_2c_20bool_29, long long) [0x15cf0]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x306526]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:ffi_call_unix64 [0x69dd]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7
=========     Host Frame:ffi_call_int [0x6067]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7
=========     Host Frame:/usr/local/src/conda/python-3.9.12/Modules/_ctypes/callproc.c:1263:_ctypes_callproc [0x140f6]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so
=========     Host Frame:/usr/local/src/conda/python-3.9.12/Modules/_ctypes/_ctypes.c:4212:PyCFuncPtr_call [0x1441f]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Objects/call.c:283:_PyObject_Call [0x145ac9]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:3582:_PyEval_EvalFrameDefault [0x1d79f0]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4338:_PyEval_EvalCode [0x196663]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Objects/call.c:396:_PyFunction_Vectorcall [0x197354]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:3489:_PyEval_EvalFrameDefault.cold.2984 [0xff755]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python


[SNIP]


=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4338:_PyEval_EvalCode [0x196663]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4377:PyEval_EvalCodeEx [0x24345c]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:834:PyEval_EvalCode [0x19745b]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1222:run_eval_code_obj [0x24350b]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1243:run_mod [0x273f75]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1140:pyrun_file.cold.3080 [0x114987]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:483:PyRun_SimpleFileExFlags [0x279a2f]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:683:Py_RunMain [0x27a10b]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:/opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:1130:Py_BytesMain [0x27a309]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
=========     Host Frame:../csu/libc-start.c:342:__libc_start_main [0x24083]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x2010a0]
=========                in /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.9/bin/python
========= 
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuLaunchCooperativeKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x3063f8]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:ffi_call_unix64 [0x69dd]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7
=========     Host Frame:ffi_call_int [0x6067]
=========                in /usr/local/anaconda3/lib/python3.9/lib-dynload/../../libffi.so.7