I’ve implemented a function that performs kernels 1 and 2 of the CCL algorithm found here: Traversing a matrix on a thread per row basis
The data is a binary 3D matrix (719x701x900) containing many particles as seen in the image below (I’ve left some orange noise for clarity):
Kernel 1 returns the spans and compressed label matrices as seen in the following sample:
Spans Matrix: Rows 698 to 718, Slice 0
Compressed Labels Matrix: Rows 698 to 718, Slice 0
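
For reference, here is a minimal CPU sketch of the kind of output kernel 1 produces for a single row. It assumes that the spans matrix stores inclusive (start, end) column pairs for each run of non-zero voxels and that the compressed labels matrix stores one initial label per span. The names `RowSpans` and `extractSpans` and the initial labelling scheme are hypothetical, chosen for illustration only; they are not taken from the linked algorithm.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical container: spans holds consecutive (start, end) column pairs,
// and labels[i] is the label of the i-th span in the row.
struct RowSpans {
    std::vector<int> spans;   // start0, end0, start1, end1, ...
    std::vector<int> labels;  // one label per span
};

// Scan one binary row and record each run of non-zero voxels as a span.
// Assumption: initial labels are simply firstLabel, firstLabel + 1, ...
RowSpans extractSpans(const std::vector<std::uint8_t>& row, int firstLabel) {
    RowSpans out;
    const int n = static_cast<int>(row.size());
    int col = 0;
    while (col < n) {
        if (row[col] == 0) {  // background voxel, skip it
            ++col;
            continue;
        }
        const int start = col;
        while (col < n && row[col] != 0) ++col;  // walk to the end of the run
        out.spans.push_back(start);
        out.spans.push_back(col - 1);  // inclusive end column
        out.labels.push_back(firstLabel + static_cast<int>(out.labels.size()));
    }
    return out;
}
```

Running the same routine on the CPU for a few rows of the large volume gives a reference to compare against whatever the GPU manages to return.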

Kernel 2 updates labels by contiguity and requires several iterations. It’s confirmed to work with very simple data. Below is a simple structure made up of two 50x50x50 cubes and one 250x25x25 parallelepiped, together with its final spans and compressed label matrices. As can be seen, the function correctly identifies the spans and assigns a single label to the whole object.
Spans Matrix: Rows 39 to 59, Slice 0
Compressed Labels Matrix: Rows 39 to 59, Slice 0
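
The iterative update described above can be sketched on the CPU as follows. This is a simplified, sequential 2D illustration, not the kernel itself: it assumes that two spans are contiguous when they lie in adjacent rows and their column ranges overlap, and that contiguous spans take the smaller of their two labels. The names `Span`, `Row`, `relabelPass`, `relabelUntilStable`, and the `maxPasses` cap are hypothetical. A 3D version would also compare spans in neighbouring slices.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical layout: one span is an inclusive column range plus its label.
struct Span { int start; int end; int label; };
using Row = std::vector<Span>;

// One pass: every pair of overlapping spans in adjacent rows takes the
// smaller of the two labels. Returns true if any label changed.
bool relabelPass(std::vector<Row>& rows) {
    bool changed = false;
    for (std::size_t r = 0; r + 1 < rows.size(); ++r) {
        for (Span& a : rows[r]) {
            for (Span& b : rows[r + 1]) {
                const bool overlap = a.start <= b.end && b.start <= a.end;
                if (overlap && a.label != b.label) {
                    a.label = b.label = std::min(a.label, b.label);
                    changed = true;
                }
            }
        }
    }
    return changed;
}

// Repeat passes until nothing changes or maxPasses is reached.
// Returns the number of passes that changed at least one label.
int relabelUntilStable(std::vector<Row>& rows, int maxPasses) {
    int passes = 0;
    while (passes < maxPasses && relabelPass(rows)) ++passes;
    return passes;
}
```

The explicit pass cap is a design choice for debugging: if the cap is reached, the loop is either converging slowly or not converging at all, and the intermediate labels can still be inspected.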
However, the kernel doesn’t seem to work for larger data. No errors are reported; the kernel call simply blocks and never returns. In fact, it isn’t possible to retrieve any information at all, e.g. I can’t get back either the spans or the compressed labels matrix, even after only 1 iteration.
Does anyone have any idea why this happens? Could it be that the kernel is still running and, if so, is there any way to tell?