Weird CUDA invalid configuration argument error

I’m getting an error after making some small changes to this package: pytorch-softdtw-cuda/ at master · Maghoumi/pytorch-softdtw-cuda · GitHub

The number of threads per block is 16. The launch works with 3675 blocks but fails with 5310. Both numbers are far too low for the problem to be that I’m using too many blocks.
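To make the "too low" claim concrete, here is a rough sketch of the check I have in mind. It assumes the commonly documented limits for devices of compute capability 3.0 and up (1024 threads per block, a grid of up to 2^31 - 1 blocks in x and 65535 in y and z); the function name `launch_config_problems` is made up for illustration, and the real limits should be confirmed on the actual device.

```python
# Assumed limits (compute capability 3.0+); confirm them on the actual device.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def launch_config_problems(grid, block):
    """Return the reasons a (grid, block) launch would be rejected.

    `grid` and `block` are tuples of 1 to 3 ints; missing axes default to 1.
    An empty list means the configuration is within the assumed limits.
    """
    grid = tuple(grid) + (1,) * (3 - len(grid))
    block = tuple(block) + (1,) * (3 - len(block))
    problems = []

    for i, axis in enumerate("xyz"):
        # A dimension of zero is out of range, as is one above the maximum.
        if not 1 <= grid[i] <= MAX_GRID_DIM[i]:
            problems.append(f"grid.{axis}={grid[i]} outside 1..{MAX_GRID_DIM[i]}")
        if not 1 <= block[i] <= MAX_BLOCK_DIM[i]:
            problems.append(f"block.{axis}={block[i]} outside 1..{MAX_BLOCK_DIM[i]}")

    threads = block[0] * block[1] * block[2]
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(f"{threads} threads per block exceeds {MAX_THREADS_PER_BLOCK}")

    return problems


# Both of my configurations pass this check.
print(launch_config_problems((3675,), (16,)))
print(launch_config_problems((5310,), (16,)))
# For contrast: the y axis of the grid has a much lower limit than x.
print(launch_config_problems((1, 70000), (16,)))
```

Under these assumptions neither 3675 nor 5310 blocks of 16 threads comes anywhere near a limit, so if the error really is a launch-configuration problem, it would have to come from a kernel launched with different dimensions than the ones I am looking at.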

What I don’t understand is that it works on the forward pass but fails on the backward pass. If I only run the forward pass, I can use more than 100,000 blocks.

Edit: this seems to have been caused by torch.cdist, although I can’t reproduce it when I call torch.cdist on its own in a five-line script.
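One possible reason it is hard to pin the failure on a single call: CUDA work is queued asynchronously, so an error can be reported at a later call than the one that actually caused it. Below is a minimal sketch of how the failing step could be narrowed down by synchronizing after every step. The helper `first_failing_step` and the step names are hypothetical; the `sync` argument is meant to be something like `torch.cuda.synchronize`. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before running the script is a commonly suggested alternative.

```python
def first_failing_step(steps, sync):
    """Run named callables in order, synchronizing after each one.

    `steps` is a list of (name, callable) pairs and `sync` is a callable
    that blocks until queued GPU work has finished. Returns the name of
    the first step that raises (in the call or in the sync), else None.
    """
    for name, fn in steps:
        try:
            fn()
            sync()  # force any pending error to surface at this step
        except Exception as err:
            print(f"{name} failed: {err}")
            return name
    return None


# Hypothetical usage with PyTorch (x, y and loss_fn are placeholders):
#
#   import torch
#   state = {}
#   steps = [
#       ("cdist",    lambda: state.update(d=torch.cdist(x, y))),
#       ("forward",  lambda: state.update(loss=loss_fn(state["d"]).sum())),
#       ("backward", lambda: state["loss"].backward()),
#   ]
#   first_failing_step(steps, torch.cuda.synchronize)
```

Note that if the failure only appears in the backward pass of one operation, the "backward" step will still be the one reported; splitting the backward further would need per-operation checks.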

I’ve made a PR that shows the diff between the original and my changes: diff lens by RuABraun · Pull Request #1 · RuABraun/pytorch-softdtw-cuda · GitHub

I can’t spot anything in the diff that would cause a problem, but obviously there must be something.