CUDA: cache compiled device functions separately

I have CUDA kernels which take over a minute, sometimes two, to compile. I am using caching, but when I am debugging a kernel, any change to a library function requires invalidating the cache. This slows debugging down considerably.

I’m wondering if, under the hood, there might be some way to reuse (relink) compiled device functions, so that it would be possible to invalidate only the particular device function I used when building a kernel?
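
To make the idea concrete, here is a minimal sketch in plain Python of the kind of per-function cache I have in mind. It is only an illustration: `DeviceFunctionCache`, `compile_fn` and `build_kernel` are hypothetical names, not part of any CUDA toolchain, and the "link" step is just a list of artifacts standing in for whatever relinking the compiler would really do.

```python
import hashlib


class DeviceFunctionCache:
    """Cache one compiled artifact per device function, keyed by source hash."""

    def __init__(self, compile_fn):
        # compile_fn is a hypothetical stand-in for the expensive compile step:
        # it takes the source text of one device function and returns an artifact.
        self._compile_fn = compile_fn
        self._artifacts = {}      # source hash -> compiled artifact
        self.compile_count = 0    # how many real compilations have happened

    def get(self, source):
        # Hash the source so that an edited function gets a new key,
        # while untouched functions keep hitting their old entries.
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._artifacts:
            self._artifacts[key] = self._compile_fn(source)
            self.compile_count += 1
        return self._artifacts[key]

    def build_kernel(self, sources):
        # Placeholder for relinking: collect the per-function artifacts.
        return [self.get(src) for src in sources]
```

With something like this, editing one library function would trigger one recompilation instead of invalidating everything. A real version would presumably also have to fold the hashes of each function's callees into the key, since inlining can make a caller depend on the body of the function it calls.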

At this point I am probably willing to mess around with the internals some, but I probably don’t have the ability and/or time to rewrite the compiler, if that is what it really requires.