Equivalent of tex2D or tex3D and texture memory in current Numba?

Hey,

I come from medical imaging, where I do a lot of ray tracing through 2D and 3D images, which involves summing and interpolating between pixels/voxels over many, many points. Traditionally I've used raw CUDA via PyCUDA with the texture memory feature, calling tex2D(image, xcoord, ycoord), where image is my texture memory object. I know this hasn't been supported in Numba, but a lot has changed in CUDA over the years.

I'm wondering if anyone is doing something similar using Numba - I'd sure love to use Numba for my work instead. I implemented my functions in Numba, but without the texture memory feature my code runs 3-4x slower. Has anyone been using Numba with newer CUDA features to achieve this?
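
To give a feel for the structure, here's a stripped-down sketch of the kind of Numba kernel I mean - simplified, with nearest-neighbour sampling standing in for the hardware bilinear fetch:

    import math

    from numba import cuda

    @cuda.jit
    def forward_project(rays, image, sinogram, steps):
        # One ray per thread; rays[i] = (x0, y0, x1, y1) in pixel coordinates
        i = cuda.grid(1)
        if i >= rays.shape[0]:
            return

        x0 = rays[i, 0]
        y0 = rays[i, 1]
        x1 = rays[i, 2]
        y1 = rays[i, 3]
        line_distance = math.hypot(x1 - x0, y1 - y0)

        integral = 0.0
        if line_distance > 0.1:
            for j in range(steps):
                x = x0 + j * (x1 - x0) / steps
                y = y0 + j * (y1 - y0) / steps
                # Nearest-neighbour read; the texture version gets a hardware
                # bilinear fetch (plus caching) for the same line of code
                xi = min(max(int(x), 0), image.shape[1] - 1)
                yi = min(max(int(y), 0), image.shape[0] - 1)
                integral += image[yi, xi]

        sinogram[i] = integral * line_distance / steps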

Now that it’s possible to extend the CUDA target to some degree, I’m wondering if it’s possible to implement support for this as an extension, but I’m not well-versed in the texture memory features.

Have you got a short example of the use of tex2D in a PyCUDA kernel you can point to?

Certainly, yes - this is really widely used in medical imaging, so if it's at all possible it would be amazing.

Here is a short example using image data. It sums up a line integral along a ray, which is unique to each GPU thread. The tex2D memory fetch is really well suited to this kind of (in this case bilinear) interpolation.

    // 2D float texture, bound from the host side by PyCUDA
    texture<float, 2> image;

    __global__ void forward_project(const float *rays, float *sinogram, int sz,
                                    const float minx, const float maxx,
                                    const float miny, const float maxy,
                                    const int steps)
    {
        const int cu_x = threadIdx.x + blockIdx.x * blockDim.x;
        const int cu_y = threadIdx.y + blockIdx.y * blockDim.y;

        // One ray per thread, via a flattened 2D thread index
        int offset = cu_x + cu_y * blockDim.x * gridDim.x;

        if (offset >= sz)
        {
            return;
        }

        // Ray endpoints: (x0, y0, x1, y1)
        float ray[4];
        ray[0] = rays[offset*4 + 0];
        ray[1] = rays[offset*4 + 1];
        ray[2] = rays[offset*4 + 2];
        ray[3] = rays[offset*4 + 3];

        float line_distance = hypotf(ray[2] - ray[0], ray[3] - ray[1]);
        float integral = 0;

        if (line_distance > 0.1)
        {
            for (unsigned int j = 0; j < steps; j++)
            {
                // Sample point along the ray, normalized to [0, 1] for the
                // texture fetch (TRSF_NORMALIZED_COORDINATES is set on the host)
                float x = ray[0] + j * (ray[2] - ray[0]) / steps;
                float xnormed = (x - minx) / (maxx - minx);

                float y = ray[1] + j * (ray[3] - ray[1]) / steps;
                float ynormed = (y - miny) / (maxy - miny);

                // Hardware bilinear interpolation
                integral += tex2D(image, xnormed, ynormed);
            }
        }

        // Riemann-sum approximation of the line integral
        sinogram[offset] = integral * line_distance / steps;
    }
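
To spell out the numerics: each thread forms a Riemann-sum approximation of the line integral of the image along its ray,

$$
\mathrm{sinogram}[o] \;\approx\; \frac{L}{N} \sum_{j=0}^{N-1} f(x_j, y_j),
\qquad L = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2},
$$

where (x_0, y_0) and (x_1, y_1) are the ray endpoints stored in ray, N is steps, and f is the bilinearly interpolated image lookup that tex2D performs in hardware.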

Thanks for the example - does that all go into one PyCUDA SourceModule? Or does image get declared separately somehow?

Adding a couple of notes: building the example with

    nvcc -gencode arch=compute_75,code=sm_75 -ptx tex.cu

yields the following relevant parts of the PTX:

    // This is outside the kernel function
    .global .texref image;

    // tex2D calls inside the kernel function
    // %f44 is xnormed, %f49 is ynormed
    tex.2d.v4.f32.f32       {%f50, %f51, %f52, %f53}, [image, {%f44, %f49}];

Generating the texture instructions should be relatively straightforward - it looks like NVVM provides intrinsics for them: NVVM IR :: CUDA Toolkit Documentation

PTX ISA reference: PTX ISA :: CUDA Toolkit Documentation

Yes, it's one SourceModule, and image is declared in the module; on the host side a NumPy array is converted to the texref, and a few flags are set on it.

Here is the full function; the items passed into it are NumPy arrays:

    import numpy as np
    import pycuda.autoinit  # noqa: F401 - creates a context on the default device
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule


    def cuda_forward_project(rays, image, size, steps):
        mod = SourceModule("""
        texture<float, 2> image;

        __global__ void forward_project(const float *rays, float *sinogram, int sz, const float minx, const float maxx, const float miny, const float maxy, const int steps)
        {
            const int cu_x = threadIdx.x + blockIdx.x * blockDim.x;
            const int cu_y = threadIdx.y + blockIdx.y * blockDim.y;

            int offset = cu_x + cu_y * blockDim.x * gridDim.x;

            if (offset >= sz)
            {
                return;
            }

            float ray[4];
            ray[0] = rays[offset*4 + 0];
            ray[1] = rays[offset*4 + 1];
            ray[2] = rays[offset*4 + 2];
            ray[3] = rays[offset*4 + 3];

            float line_distance = hypotf(ray[2] - ray[0], ray[3] - ray[1]);
            float integral = 0;

            if (line_distance > 0.1)
            {
                for (unsigned int j = 0; j < steps; j++)
                {
                    float x = ray[0] + j * (ray[2] - ray[0]) / steps;
                    float xnormed = (x - minx) / (maxx - minx);

                    float y = ray[1] + j * (ray[3] - ray[1]) / steps;
                    float ynormed = (y - miny) / (maxy - miny);

                    integral += tex2D(image, xnormed, ynormed);
                }
            }

            sinogram[offset] = integral * line_distance / steps;
        }
        """)

        # CUDA stuff
        func = mod.get_function("forward_project")
        tex_image = mod.get_texref("image")

        # Copy the image into the texture reference (row-major ordering)
        image = image.astype(np.float32)
        cuda.matrix_to_texref(image, tex_image, order="C")

        # Coordinates in [0, 1] with hardware bilinear filtering
        tex_image.set_flags(cuda.TRSF_NORMALIZED_COORDINATES)
        tex_image.set_filter_mode(cuda.filter_mode.LINEAR)

        threads_per_blockdim = 16
        nblocks_x = np.size(rays, 0) // threads_per_blockdim**2 + 1
        nblocks_y = 1

        block = (threads_per_blockdim, threads_per_blockdim, 1)
        grid = (nblocks_x, nblocks_y, 1)

        # setup vars
        rays_shape = rays.shape

        rays_ = rays.reshape(-1, 4).astype(np.float32)
        sinogram_ = np.zeros(np.prod(rays_shape[:-1]), dtype=np.float32)
        sz_ = np.int32(np.prod(rays_shape[:-1]))
        minx_ = np.float32(size[0])
        maxx_ = np.float32(size[1])
        miny_ = np.float32(size[2])
        maxy_ = np.float32(size[3])
        steps_ = np.int32(steps)

        print(block, grid, rays_shape, sz_)

        func(cuda.In(rays_), cuda.Out(sinogram_), sz_, minx_, maxx_, miny_, maxy_, steps_,
             block=block, grid=grid, texrefs=[tex_image])

        return sinogram_
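
And an illustrative call might look something like this (the shapes, bounds, and step count here are made up purely for demonstration):

    # 1000 rays, each stored as (x0, y0, x1, y1)
    rays = np.random.rand(1000, 4).astype(np.float32) * 512.0
    image = np.random.rand(512, 512)
    size = (0.0, 512.0, 0.0, 512.0)  # (minx, maxx, miny, maxy)

    sinogram = cuda_forward_project(rays, image, size, steps=256)
    print(sinogram.shape)  # (1000,)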

I think supporting this in Numba is doable, but I’m going to struggle to get round to it in the near-term.

However, I’d be happy to help guide and support someone interested in working towards a PR for Numba that implements it.
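
To give a flavour of the mechanism, here's the general shape of a small CUDA target extension using numba.extending.intrinsic with inline PTX. This toy example reads the %clock64 register rather than doing a texture fetch - real texture support would also need typing for a texture object - but the code generation side is the same kind of thing:

    from llvmlite import ir
    from numba import cuda, types
    from numba.extending import intrinsic

    @intrinsic
    def clock64(typingctx):
        # No arguments; returns an int64
        sig = types.int64()

        def codegen(context, builder, signature, args):
            fnty = ir.FunctionType(ir.IntType(64), [])
            asm = ir.InlineAsm(fnty, "mov.u64 $0, %clock64;", "=l",
                               side_effect=True)
            return builder.call(asm, [])

        return sig, codegen

    @cuda.jit
    def kernel(out):
        out[0] = clock64()

With that in place, clock64() compiles down to a single PTX instruction inside the kernel; a tex.2d implementation would emit the texture intrinsic in the same way.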

I can provide ample enthusiasm for this - it's the single hang-up for an entire class of algorithms in medical imaging that perform this operation constantly. They all fall under the umbrella of "Model-Based Iterative Reconstruction" routines, used whenever you get a CT scan 😄

I've used C/C++ for 10 years, and I have enough enthusiasm to give it a shot if there's an existing example of extending Numba I could study as a starting point. Otherwise, I'd definitely encourage anyone else to step up.

Appreciate your understanding and support of this!


I received some helpful input on the NVIDIA forum here.

So texture memory may no longer be as important a feature as it used to be. I'm going to explore some benchmarks on recent hardware to see how much more efficient I can make my code, and compare that against Numba.

If there’s still a difference I’ll try to put together a PR.

Great - a few references for some background:

If you want to take a look at these to get a feel for what's generally required, and to see whether it's something that would fit your interest, then I can provide a bit more guidance (I have to be brief now as it's getting a little late) - do let me know what you think!

Any news on this front? I would love a clean and fast python (numba) implementation of CT forward-/backprojection.

Since this post I've switched to doing the interpolation myself, implemented in a CUDA device function. The low-precision interpolation that texture filtering performs (the fractional weights are 8-bit fixed point, so only 256 levels) is a big drawback for medical imaging today.

I haven't experimented too much, but I'd imagine that if you're smart about using CUDA intrinsic/atomic functions, you can get very similar speeds with your own interpolation as with texture interpolation. The drawback is that not all of CUDA's atomic functions are implemented in Numba, so if speed is your #1 concern you may want to use NVIDIA's low-level CUDA Python bindings directly: GitHub - NVIDIA/cuda-python: CUDA Python Low-level Bindings
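
For reference, a minimal sketch of what that hand-rolled bilinear fetch can look like as a Numba CUDA device function - the name bilinear_sample is mine, and it assumes a C-ordered float32 image, unnormalized (pixel) coordinates, and clamp-to-edge addressing rather than the full set of texture addressing modes:

    from numba import cuda

    @cuda.jit(device=True)
    def bilinear_sample(image, x, y):
        # Clamp the sample point so the 2x2 neighbourhood stays in bounds
        h, w = image.shape
        if x < 0.0:
            x = 0.0
        elif x > w - 1.0:
            x = w - 1.0
        if y < 0.0:
            y = 0.0
        elif y > h - 1.0:
            y = h - 1.0

        x0 = int(x)
        y0 = int(y)
        x1 = min(x0 + 1, w - 1)
        y1 = min(y0 + 1, h - 1)

        # Fractional weights at full float32 precision, unlike the 8-bit
        # fixed-point weights used by the texture hardware
        fx = x - x0
        fy = y - y0

        top = image[y0, x0] * (1.0 - fx) + image[y0, x1] * fx
        bottom = image[y1, x0] * (1.0 - fx) + image[y1, x1] * fx
        return top * (1.0 - fy) + bottom * fy

Unlike the hardware path, the weights here are full float32 precision, at the cost of four separate global-memory reads per sample instead of one cached texture fetch.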


What atomic functions do you need? We can add them relatively easily if there are some missing.

The CUDA Python API provides access to the host-side driver and runtime APIs; it doesn't enable you to write kernels. So if you want to keep your code in Python, you still need to use something else for the kernels themselves (e.g. Numba 🙂).