CUDA: how to run kernels concurrently using multiprocessing?

Can anyone suggest how to run async kernels concurrently using multiprocessing?

The CUDA docs say:

> A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

It seems I need to import the same CUDA context into all processes, but I’m really stuck…

Any help much appreciated.

This sounds like quite out of date documentation - is this in the Numba documentation?

Kernels can run concurrently in different streams in the same context, or in different contexts - this was a limitation only in very early versions of CUDA.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution

My GPU has compute capability 8.6, but no matter what I do I can’t make the kernels run concurrently, so I thought the different contexts were the main bottleneck.

That does surprise me - I’m enquiring as to the interpretation of this sentence and will get back to you about that.

In the meantime, can you post some example code illustrating the pattern you’re using to try to execute kernels concurrently, so we can see what might be blocking concurrent execution?

OK, it looks like I had a long-standing misunderstanding about the nature of concurrent execution from different processes: kernels from multiple contexts / processes won’t overlap, and will only be interleaved.

Can your application use CUDA streams to overlap kernel execution within one process instead?
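For reference, here is a minimal sketch of the single-process, multi-stream pattern in Numba. The `scale` kernel and the `scale_on_streams` helper are hypothetical names made up for illustration, and the NumPy fallback is only there so the snippet runs on a machine without a GPU. Whether the kernels really overlap depends on how much of the GPU each launch occupies, so treat it as a starting point rather than a guarantee.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:  # Numba is not installed
    HAVE_CUDA = False

if HAVE_CUDA:
    @cuda.jit
    def scale(arr, factor):
        # Hypothetical kernel: multiply each element in place.
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] *= factor


def scale_on_streams(chunks, factor):
    """Scale each (non-empty) chunk on its own CUDA stream."""
    if not HAVE_CUDA:
        # CPU fallback so the sketch still runs without a GPU.
        return [np.asarray(c, dtype=np.float32) * factor for c in chunks]

    threads = 256
    streams = [cuda.stream() for _ in chunks]
    device_arrays = []
    for chunk, stream in zip(chunks, streams):
        host = np.asarray(chunk, dtype=np.float32)
        d_arr = cuda.to_device(host, stream=stream)
        blocks = (host.size + threads - 1) // threads
        # The third launch parameter is the stream; the launch is asynchronous.
        scale[blocks, threads, stream](d_arr, np.float32(factor))
        device_arrays.append(d_arr)

    # Queue the copies back, then wait for every stream to finish.
    results = [d.copy_to_host(stream=s) for d, s in zip(device_arrays, streams)]
    for s in streams:
        s.synchronize()
    return results
```

All the streams here belong to one context, which is the case where the GPU is allowed to overlap kernels. Using pinned host memory (for example via `cuda.pinned_array`) should also help the transfers overlap with compute.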

Unfortunately not… And we are talking about the Ampere architecture, which was obviously built to process really huge amounts of data. A single process leads to a CPU bottleneck, unfortunately.

Can you suggest any workaround?

Perhaps it would work if I found a way to fork the same context into all subprocesses.

You may be able to use MPS (Multi-Process Service) for this; see the Multi-Process Service page in NVIDIA’s GPU Deployment and Management Documentation.

I find MPS quite buggy: it often freezes threads for no reason, and there is no easy way to debug it.
Being able to export/import a GPU context directly in Numba would really help and would avoid the need for MPS.

As another workaround, can you write your code so that it uses threads instead of processes? (It’s a bit hard to give good concrete suggestions with such a general description of your application.)
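To make the threads idea a bit more concrete, here is a rough sketch under some assumptions: `prepare`, `add_one`, `worker` and `run_threaded` are hypothetical names, and the CPU-side work is assumed to release the GIL (NumPy calls, I/O, or Numba functions compiled with `nogil=True`), otherwise threads won’t relieve the CPU bottleneck. As far as I understand, threads in one process share the device’s context in Numba, so kernels launched on separate streams are at least eligible to overlap. The NumPy fallback is only there so the snippet runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:  # Numba is not installed
    HAVE_CUDA = False

if HAVE_CUDA:
    @cuda.jit
    def add_one(arr):
        # Hypothetical kernel: increment each element in place.
        i = cuda.grid(1)
        if i < arr.size:
            arr[i] += 1.0


def prepare(chunk):
    # Stand-in for the CPU-heavy preprocessing step (assumed to release the GIL).
    return np.sqrt(np.asarray(chunk, dtype=np.float32))


def worker(chunk):
    host = prepare(chunk)
    if not HAVE_CUDA:
        # CPU fallback so the sketch still runs without a GPU.
        return host + np.float32(1.0)

    threads = 256
    blocks = (host.size + threads - 1) // threads
    stream = cuda.stream()  # one stream per task
    d_arr = cuda.to_device(host, stream=stream)
    add_one[blocks, threads, stream](d_arr)
    out = d_arr.copy_to_host(stream=stream)
    stream.synchronize()  # blocks only this worker thread
    return out


def run_threaded(chunks, workers=4):
    # Threads replace processes, so every launch comes from the same process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, chunks))
```

Calling `run_threaded(list_of_chunks)` returns the results in input order, with each chunk assumed to be non-empty.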