Equivalent of tex2D or tex3D and texture memory in current Numba?

Hey,

I come from medical imaging, where I do a lot of ray tracing through 2D and 3D images, which involves summing and interpolating between pixels/voxels over many, many points. Traditionally I've used raw CUDA via PyCUDA with the texture memory feature, calling tex2D(image, xcoord, ycoord), where image is my texture memory object. I know this hasn't been supported in Numba, but a lot has changed in CUDA over the years.

I'm wondering if anyone is doing something similar using Numba - I'd sure love to use Numba for my work instead. I implemented my functions in Numba, but without the texture memory feature my code runs 3-4x slower. Has anyone been using Numba with newer CUDA features to achieve this?
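
To give a feel for the structure, here's a stripped-down sketch of the kind of Numba kernel I mean - simplified, with nearest-neighbour sampling standing in for the hardware bilinear fetch:

    import math

    from numba import cuda

    @cuda.jit
    def forward_project(rays, image, sinogram, steps):
        # One ray per thread; rays[i] = (x0, y0, x1, y1) in pixel coordinates
        i = cuda.grid(1)
        if i >= rays.shape[0]:
            return

        x0 = rays[i, 0]
        y0 = rays[i, 1]
        x1 = rays[i, 2]
        y1 = rays[i, 3]
        line_distance = math.hypot(x1 - x0, y1 - y0)

        integral = 0.0
        if line_distance > 0.1:
            for j in range(steps):
                x = x0 + j * (x1 - x0) / steps
                y = y0 + j * (y1 - y0) / steps
                # Nearest-neighbour read; the texture version gets a hardware
                # bilinear fetch (plus caching) for the same line of code
                xi = min(max(int(x), 0), image.shape[1] - 1)
                yi = min(max(int(y), 0), image.shape[0] - 1)
                integral += image[yi, xi]

        sinogram[i] = integral * line_distance / steps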

Now that it’s possible to extend the CUDA target to some degree, I’m wondering if it’s possible to implement support for this as an extension, but I’m not well-versed in the texture memory features.

Have you got a short example of the use of tex2D in a PyCUDA kernel you can point to?

Certainly, yes - this is really widely used in medical imaging, so if it's at all possible it would be amazing.

Here is a short example using image data. It sums up a line integral along a ray, which is unique to each GPU thread. The tex2D memory fetch is really well suited to this kind of (in this case bilinear) interpolation.

    // 2D float texture, bound from the host side by PyCUDA
    texture<float, 2> image;

    __global__ void forward_project(const float *rays, float *sinogram, int sz,
                                    const float minx, const float maxx,
                                    const float miny, const float maxy,
                                    const int steps)
    {
        const int cu_x = threadIdx.x + blockIdx.x * blockDim.x;
        const int cu_y = threadIdx.y + blockIdx.y * blockDim.y;

        // One ray per thread, via a flattened 2D thread index
        int offset = cu_x + cu_y * blockDim.x * gridDim.x;

        if (offset >= sz)
        {
            return;
        }

        // Ray endpoints: (x0, y0, x1, y1)
        float ray[4];
        ray[0] = rays[offset*4 + 0];
        ray[1] = rays[offset*4 + 1];
        ray[2] = rays[offset*4 + 2];
        ray[3] = rays[offset*4 + 3];

        float line_distance = hypotf(ray[2] - ray[0], ray[3] - ray[1]);
        float integral = 0;

        if (line_distance > 0.1)
        {
            for (unsigned int j = 0; j < steps; j++)
            {
                // Sample point along the ray, normalized to [0, 1] for the
                // texture fetch (TRSF_NORMALIZED_COORDINATES is set on the host)
                float x = ray[0] + j * (ray[2] - ray[0]) / steps;
                float xnormed = (x - minx) / (maxx - minx);

                float y = ray[1] + j * (ray[3] - ray[1]) / steps;
                float ynormed = (y - miny) / (maxy - miny);

                // Hardware bilinear interpolation
                integral += tex2D(image, xnormed, ynormed);
            }
        }

        // Riemann-sum approximation of the line integral
        sinogram[offset] = integral * line_distance / steps;
    }
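
To spell out the numerics: each thread forms a Riemann-sum approximation of the line integral of the image along its ray,

$$
\mathrm{sinogram}[o] \;\approx\; \frac{L}{N} \sum_{j=0}^{N-1} f(x_j, y_j),
\qquad L = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2},
$$

where (x_0, y_0) and (x_1, y_1) are the ray endpoints stored in ray, N is steps, and f is the bilinearly interpolated image lookup that tex2D performs in hardware.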

Thanks for the example - does that all go into one PyCUDA SourceModule? Or does image get declared separately somehow?

Adding a couple of notes: building the example with

    nvcc -gencode arch=compute_75,code=sm_75 -ptx tex.cu

yields the following relevant parts of the PTX:

    // This is outside the kernel function
    .global .texref image;

    // tex2D calls inside the kernel function
    // %f44 is xnormed, %f49 is ynormed
    tex.2d.v4.f32.f32       {%f50, %f51, %f52, %f53}, [image, {%f44, %f49}];

Generating the texture instructions should be relatively straightforward - it looks like NVVM provides intrinsics for them: NVVM IR :: CUDA Toolkit Documentation

PTX ISA reference: PTX ISA :: CUDA Toolkit Documentation

Yes, it's one SourceModule, and image is declared in the module; on the host side a NumPy array is converted to the texref, and a few flags are set on it.

Here is the full function; the items passed into it are NumPy arrays:

    import numpy as np
    import pycuda.autoinit  # noqa: F401 - creates a context on the default device
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule


    def cuda_forward_project(rays, image, size, steps):
        mod = SourceModule("""
        texture<float, 2> image;

        __global__ void forward_project(const float *rays, float *sinogram, int sz, const float minx, const float maxx, const float miny, const float maxy, const int steps)
        {
            const int cu_x = threadIdx.x + blockIdx.x * blockDim.x;
            const int cu_y = threadIdx.y + blockIdx.y * blockDim.y;

            int offset = cu_x + cu_y * blockDim.x * gridDim.x;

            if (offset >= sz)
            {
                return;
            }

            float ray[4];
            ray[0] = rays[offset*4 + 0];
            ray[1] = rays[offset*4 + 1];
            ray[2] = rays[offset*4 + 2];
            ray[3] = rays[offset*4 + 3];

            float line_distance = hypotf(ray[2] - ray[0], ray[3] - ray[1]);
            float integral = 0;

            if (line_distance > 0.1)
            {
                for (unsigned int j = 0; j < steps; j++)
                {
                    float x = ray[0] + j * (ray[2] - ray[0]) / steps;
                    float xnormed = (x - minx) / (maxx - minx);

                    float y = ray[1] + j * (ray[3] - ray[1]) / steps;
                    float ynormed = (y - miny) / (maxy - miny);

                    integral += tex2D(image, xnormed, ynormed);
                }
            }

            sinogram[offset] = integral * line_distance / steps;
        }
        """)

        # CUDA stuff
        func = mod.get_function("forward_project")
        tex_image = mod.get_texref("image")

        # Copy the image into the texture reference (row-major ordering)
        image = image.astype(np.float32)
        cuda.matrix_to_texref(image, tex_image, order="C")

        # Coordinates in [0, 1] with hardware bilinear filtering
        tex_image.set_flags(cuda.TRSF_NORMALIZED_COORDINATES)
        tex_image.set_filter_mode(cuda.filter_mode.LINEAR)

        threads_per_blockdim = 16
        nblocks_x = np.size(rays, 0) // threads_per_blockdim**2 + 1
        nblocks_y = 1

        block = (threads_per_blockdim, threads_per_blockdim, 1)
        grid = (nblocks_x, nblocks_y, 1)

        # setup vars
        rays_shape = rays.shape

        rays_ = rays.reshape(-1, 4).astype(np.float32)
        sinogram_ = np.zeros(np.prod(rays_shape[:-1]), dtype=np.float32)
        sz_ = np.int32(np.prod(rays_shape[:-1]))
        minx_ = np.float32(size[0])
        maxx_ = np.float32(size[1])
        miny_ = np.float32(size[2])
        maxy_ = np.float32(size[3])
        steps_ = np.int32(steps)

        print(block, grid, rays_shape, sz_)

        func(cuda.In(rays_), cuda.Out(sinogram_), sz_, minx_, maxx_, miny_, maxy_, steps_,
             block=block, grid=grid, texrefs=[tex_image])

        return sinogram_
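
And an illustrative call might look something like this (the shapes, bounds, and step count here are made up purely for demonstration):

    # 1000 rays, each stored as (x0, y0, x1, y1)
    rays = np.random.rand(1000, 4).astype(np.float32) * 512.0
    image = np.random.rand(512, 512)
    size = (0.0, 512.0, 0.0, 512.0)  # (minx, maxx, miny, maxy)

    sinogram = cuda_forward_project(rays, image, size, steps=256)
    print(sinogram.shape)  # (1000,)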

I think supporting this in Numba is doable, but I’m going to struggle to get round to it in the near-term.

However, I’d be happy to help guide and support someone interested in working towards a PR for Numba that implements it.
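
To give a flavour of the mechanism, here's the general shape of a small CUDA target extension using numba.extending.intrinsic with inline PTX. This toy example reads the %clock64 register rather than doing a texture fetch - real texture support would also need typing for a texture object - but the code generation side is the same kind of thing:

    from llvmlite import ir
    from numba import cuda, types
    from numba.extending import intrinsic

    @intrinsic
    def clock64(typingctx):
        # No arguments; returns an int64
        sig = types.int64()

        def codegen(context, builder, signature, args):
            fnty = ir.FunctionType(ir.IntType(64), [])
            asm = ir.InlineAsm(fnty, "mov.u64 $0, %clock64;", "=l",
                               side_effect=True)
            return builder.call(asm, [])

        return sig, codegen

    @cuda.jit
    def kernel(out):
        out[0] = clock64()

With that in place, clock64() compiles down to a single PTX instruction inside the kernel; a tex.2d implementation would emit the texture intrinsic in the same way.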

I can provide ample enthusiasm for this - it's the single hang-up for an entire class of algorithms in medical imaging that perform this operation constantly. They all fall under the umbrella of "Model-Based Iterative Reconstruction" routines, used whenever you get a CT scan 😄

I've used C/C++ for 10 years, and I have enough enthusiasm to give it a shot if there's an existing example of extending Numba I could study as a starting point. Otherwise, I'd definitely encourage anyone else to step up.

Appreciate your understanding and support of this!


I received some helpful input on the NVIDIA forum here.

So texture memory may no longer be as important a feature as it used to be. I'm going to explore some benchmarks on recent hardware to see how much more efficient I can make my code, and compare that against Numba.

If there’s still a difference I’ll try to put together a PR.

Great - a few references for some background:

If you want to take a look at these to get a feel for what's generally required, and to see whether it's something that would fit your interest, then I can provide a bit more guidance (I have to be brief now as it's getting a little late) - do let me know what you think!

Any news on this front? I would love a clean and fast python (numba) implementation of CT forward-/backprojection.

Since this post I've switched to doing the interpolation myself, implemented in a CUDA device function. The low-precision interpolation that texture filtering performs (the fractional weights are 8-bit fixed point, so only 256 levels) is a big drawback for medical imaging today.

I haven't experimented too much, but I'd imagine that if you're smart about using CUDA intrinsic/atomic functions, you can get very similar speeds with your own interpolation as with texture interpolation. The drawback is that not all of CUDA's atomic functions are implemented in Numba, so if speed is your #1 concern you may want to use NVIDIA's low-level CUDA Python bindings directly: GitHub - NVIDIA/cuda-python: CUDA Python Low-level Bindings
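
For reference, a minimal sketch of what that hand-rolled bilinear fetch can look like as a Numba CUDA device function - the name bilinear_sample is mine, and it assumes a C-ordered float32 image, unnormalized (pixel) coordinates, and clamp-to-edge addressing rather than the full set of texture addressing modes:

    from numba import cuda

    @cuda.jit(device=True)
    def bilinear_sample(image, x, y):
        # Clamp the sample point so the 2x2 neighbourhood stays in bounds
        h, w = image.shape
        if x < 0.0:
            x = 0.0
        elif x > w - 1.0:
            x = w - 1.0
        if y < 0.0:
            y = 0.0
        elif y > h - 1.0:
            y = h - 1.0

        x0 = int(x)
        y0 = int(y)
        x1 = min(x0 + 1, w - 1)
        y1 = min(y0 + 1, h - 1)

        # Fractional weights at full float32 precision, unlike the 8-bit
        # fixed-point weights used by the texture hardware
        fx = x - x0
        fy = y - y0

        top = image[y0, x0] * (1.0 - fx) + image[y0, x1] * fx
        bottom = image[y1, x0] * (1.0 - fx) + image[y1, x1] * fx
        return top * (1.0 - fy) + bottom * fy

Unlike the hardware path, the weights here are full float32 precision, at the cost of four separate global-memory reads per sample instead of one cached texture fetch.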


What atomic functions do you need? We can add them relatively easily if there are some missing.

The CUDA Python API provides access to the host-side driver and runtime APIs; it doesn't enable you to write kernels. So if you want to keep your code in Python, you still need to use something else for the kernels themselves (e.g. Numba 🙂).