Gufunc with cache=True and parallel mode is still slow

I have a function compiled with guvectorize using cache=True and target='parallel'. The same function is also compiled with target='cpu'. With a warm cache, the cpu version loads very fast, but the parallel version still takes ~4 seconds to load. Is this a bug? Is there any way to get faster loading times?