CUDA context interactions

I noticed that Numba’s context creation only uses cuDevicePrimaryCtxRetain. Is this just to make it easier to interface with other CUDA libraries? And when it comes to mixing cuDevicePrimaryCtxRetain and cuCtxCreate: if I call a runtime library, does it just grab the current active context (if one exists, primary or not), or does the runtime library always and only use the primary context? That is, if I’ve run cuCtxCreate but not cuDevicePrimaryCtxRetain, will the runtime create a second context (one that is primary) and use that?

As a follow-up, it’s possible my confusion is simply a matter of semantics. Does cuDevicePrimaryCtxRetain simply create a normal context (if one doesn’t exist) and then promote it to be primary (so that a primary context is simply a state that can be assigned to any context), or is the primary context a special context that is different from a normal context?
This page is helpful but doesn’t fully clear up my confusion:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html#:~:text=The%20specific%20context%20which%20the,its%20primary%20context%20are%20synonymous.&text=CUDA%20Runtime%20API%20calls%20operate,to%20the%20calling%20host%20thread.

The Runtime API has a 1-1 mapping between devices and primary contexts, but not host threads and primary contexts - this means that generally the runtime will use whatever the current context is for the current thread, regardless of whether it is primary. Sometimes it is required that the current context is the primary context - for example, when cudaDeviceEnablePeerAccess() is called.

The next section after the one you refer to, Context Interoperability, expands on this a bit more with details of the exceptions:

If a non-primary CUcontext created by the CUDA Driver API is current to a thread then the CUDA Runtime API calls to that thread will operate on that CUcontext, with some exceptions listed below.

I don’t know the reason why Numba was originally implemented using cuDevicePrimaryCtxRetain. However, I note that it is the recommended way to create contexts, as described in the docs for cuCtxCreate, and I would agree that using this mechanism for context creation probably does make it easier to interoperate with other libraries, particularly those using the Runtime API.

Does cuDevicePrimaryCtxRetain simply create a normal context (if one doesn’t exist) and then promote it to be primary (so a primary context is simply a state that can be assigned to any context), or is the primary context a special context that is different from a normal context?

My understanding is that primary and other contexts are similar, except for the restrictions about what must be done with the primary context only (like the cudaDeviceEnablePeerAccess() example I mentioned above). The other practical difference is that cuDevicePrimaryCtxRetain doesn’t push the newly-created context (if one is created) onto the context stack, whereas cuCtxCreate does push the newly-created context onto the stack.
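To make the push/no-push distinction concrete, here is a toy Python model of the behaviour described above. These are plain Python stand-ins with names mirroring the driver API, not a real CUDA binding, and they only encode my reading of the docs:

```python
# Toy model of the per-thread context stack (illustrative only; these are
# plain Python stand-ins, not real CUDA driver bindings).

class Device:
    def __init__(self):
        self.primary_ctx = None
        self.primary_refcount = 0

context_stack = []  # per-thread stack; the top entry is the "current" context

def cuCtxCreate(device):
    """Creates a new context AND pushes it, making it current."""
    ctx = object()
    context_stack.append(ctx)
    return ctx

def cuDevicePrimaryCtxRetain(device):
    """Creates (or re-uses) the primary context WITHOUT pushing it."""
    if device.primary_ctx is None:
        device.primary_ctx = object()
    device.primary_refcount += 1
    return device.primary_ctx

dev = Device()
a = cuCtxCreate(dev)
assert context_stack[-1] is a   # cuCtxCreate made `a` current
p = cuDevicePrimaryCtxRetain(dev)
assert context_stack[-1] is a   # the primary context was NOT pushed
```

So after retaining the primary context you still need an explicit cuCtxPushCurrent (or cuCtxSetCurrent) before it becomes the current context for the thread.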

I’m not aware of a way to take an existing context and promote it to be the primary context.

Awesome, thanks so much Graham. One quick clarifying question. If the primary context avoids going on the stack, does that mean it can never go on the stack at all, or only that you need to explicitly call cuCtxPushCurrent after creating the primary context if you want it on the stack? If it does go on the stack, how do cuDevicePrimaryCtxRetain and the stack interact with one another? If I call cuDevicePrimaryCtxRetain and then need to make the context current via cuCtxPushCurrent, can a context be on the stack multiple times? And if the primary context isn’t on the stack, how can it ever be current, so that functions that don’t take the context as an explicit argument (like memory allocation) use the primary context?

Sorry for the rambling question; it’s more or less just my stream of thought as I thought through the first part of the question, lol.

After looking through some of TensorFlow’s code (https://github.com/tensorflow/tensorflow/blob/4806cb0646bd21f713722bd97c0d0262c575f7e0/tensorflow/stream_executor/cuda/cuda_driver.cc#L517), it looks like part of my question is answered: primary contexts can be added to the context stack like any other context. However, I am still unsure of how cuDevicePrimaryCtxRetain and cuCtxPushCurrent/cuCtxSetCurrent interact with one another if the primary context is on the stack but not current (i.e. can the same context be added to the stack multiple times?). It seems like it should be possible.

However, I am still unsure of how cuDevicePrimaryCtxRetain and cuCtxPushCurrent/cuCtxSetCurrent interact with one another if the primary context is on the stack but not current (i.e. can the same context be added to the stack multiple times).

I don’t think they do interact with each other as such - cuDevicePrimaryCtxRetain will create a primary context if none exists, or increment its reference count if it does exist.

I can’t picture the scenario you’re thinking of - perhaps you could post an example of a sequence of calls involving cuDevicePrimaryCtxRetain / cuDevicePrimaryCtxRelease and cuCtxPushCurrent / cuCtxPopCurrent / etc… that illustrates your concerns / thoughts about interactions?

I just tried to code something up and it results in the GPU hanging (not sure if it’s me or the code), so it’s possible what I am imagining isn’t possible.

With a normal context, you can only readily access whichever context is current (via cuCtxGetCurrent) from the context stack. This means that calling cuCtxDestroy on a context that is on the stack but not current is difficult (and if non-current stack contexts are impossible to access until they become current, it isn’t possible to destroy them at all). A primary context, however, can be accessed at any point, regardless of whether it is current. If a primary context is on the stack but not current, can it be destroyed, and if so, does this result in a segfault (or something similar) when it later becomes current? Or can the ref count simply not be reduced below 1 while the context is on the stack?

To add to this, can the primary context be added to the stack multiple times? For example, when calling an external library that requires the primary context, it would make the most sense for the library to acquire the primary context, push it onto the stack, run its code, and then pop it from the stack, so that it doesn’t disturb the context stack of the main program (which might already have the primary context somewhere within it).
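The retain/push/pop pattern being asked about can be sketched in the same toy-model style as above (plain Python stand-ins, not real bindings; whether real drivers allow the same context on the stack twice is exactly the open question, so this only illustrates the intended call sequence):

```python
# Toy model of the "retain, push, work, pop, release" pattern a library
# might use so it leaves the caller's context stack untouched
# (illustrative only; not a real CUDA binding).

stack = []
primary = {"ctx": "primary", "refcount": 0}

def retain():
    primary["refcount"] += 1
    return primary["ctx"]

def push(ctx):
    stack.append(ctx)

def pop():
    return stack.pop()

def release():
    primary["refcount"] -= 1

# The caller already has the primary context somewhere on its stack, with
# a non-primary context current on top of it:
push(retain())
push("user-ctx")

# Library entry point: make the primary context current for its own work,
# then restore the stack exactly as it found it.
ctx = retain()
push(ctx)            # primary now appears on the stack twice
# ... library work runs with the primary context current ...
assert pop() is ctx
release()

assert stack == [primary["ctx"], "user-ctx"]  # caller's stack untouched
```

In this model the double appearance is harmless because pop only removes the top entry; the caller’s lower copy of the primary context is never disturbed.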