Hi... I'm eager to convert the following code to use cuda.jit. What is the correct way to do that?

import os
import cv2
import time
import numpy as np
import numba as nb
from numba import njit
from google.colab import drive


@njit('f8[:,:](u1[:,:])', parallel=True, cache=True)
def normalize_mat(depth_src):
    depth_min = depth_src.min()
    depth_max = depth_src.max()
    depth = (depth_src - depth_min) / (depth_max - depth_min)

    return depth

def generate_stereo(depth_dir, depth_prefix, filename):
    print("=== Start processing:", filename, "===")
    depth_src = cv2.imread(os.path.join(depth_dir, depth_prefix + filename + ".jpg"))

    if len(depth_src.shape) == 3:
        depth_src = cv2.cvtColor(depth_src, cv2.COLOR_BGR2GRAY)

    depth = normalize_mat(depth_src)

    depth = np.round(depth * 255).astype(np.uint8)

    cv2.imwrite(os.path.join(depth_dir, "normalized_depth_" + filename + ".jpg"), depth)

def file_processing_im(depth_dir, depth_prefix):
    for f in os.listdir(depth_dir):
        filename = f.split(".")[0]
        generate_stereo(depth_dir, depth_prefix, filename)

def main():
    start_time = time.time()

    depth_dir = 'gdrive/MyDrive/depth/'
    depth_prefix = 'Depth_'

    file_processing_im(depth_dir, depth_prefix)

    print(time.time() - start_time, "seconds for base generation")

if __name__ == "__main__":
    main()

If you want to convert this code to use CUDA, it's not clear to me that Numba is the right tool for the job. I'm not familiar with OpenCV, but I understand it has some CUDA functionality already - is that usable for your use case? I wonder if a combination of that and CuPy (to replace the functionality in normalize_mat()) would be more appropriate.
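For illustration, here's a rough sketch of what normalize_mat() could look like on the CuPy route. CuPy mirrors the NumPy API, so the sketch below falls back to NumPy when CuPy isn't installed; the function name normalize_mat_gpu is just a placeholder I've chosen, not anything from your code.

```python
# Sketch of normalize_mat() via CuPy's NumPy-compatible API.
# CuPy is a drop-in replacement for NumPy on the GPU, so the same
# code runs on CPU through the NumPy fallback below.
try:
    import cupy as xp       # GPU arrays if CuPy is available
except ImportError:
    import numpy as xp      # CPU fallback so the sketch still runs

def normalize_mat_gpu(depth_src):
    # Promote to float so the division doesn't truncate uint8 input
    depth = xp.asarray(depth_src, dtype=xp.float64)
    dmin = depth.min()
    dmax = depth.max()
    return (depth - dmin) / (dmax - dmin)
```

With real CuPy you'd bring the result back to the host with cp.asnumpy() before handing it to cv2.imwrite().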

If you really want to replace the functionality with CUDA-jitted kernels in Numba, then I think the main steps you'll need are:

  • Replace the njit decorators with cuda.jit.
  • In normalize_mat, you need to implement the max and min calculations and normalization using scalar operations indexed by thread ID. You’ll also need to change it so the output array is passed in, because you won’t be able to create the depth array in the function.
  • In generate_stereo, you’ll need to replace the call to cv2.cvtColor() with a kernel you write yourself that provides the same functionality. You’ll also need to replace the call to np.round() with an implementation that operates on scalars.
  • You’ll need to move the data loading out of generate_stereo(), and allocate space on the device and transfer data to it before calling generate_stereo().

Thank you so much…will try and update