Numba CUDA implementation producing different results and taking longer time

original function:

import numpy as np

class ReplayBuffer(object):
    def _encode_sample(self, idxes):
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
            obses_t.append(np.array(obs_t, copy=False))
            actions.append(np.array(action, copy=False))
            rewards.append(reward)
            obses_tp1.append(np.array(obs_tp1, copy=False))
            dones.append(done)
        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)

In this code, idxes is a list of 1024 random indices. The for loop iterates over the 1024 elements of idxes, retrieves the corresponding transition from self._storage, and appends each field to its own list.
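For reference, if each field were kept as one pre-stacked contiguous array instead of a list of tuples, the whole loop would collapse into a single fancy-indexing gather on the CPU. This is only a sketch; the `obs_store` and `rew_store` names are hypothetical and not part of the original buffer:

```python
import numpy as np

# Hypothetical pre-stacked storage: one rectangular array per field.
obs_store = np.arange(10, dtype=np.float64).reshape(10, 1)  # 10 obs of dim 1
rew_store = np.arange(10, dtype=np.float32)                 # 10 rewards

idxes = np.array([3, 7, 1])

# Fancy indexing gathers all sampled rows at once -- no Python loop.
obses_t = obs_store[idxes]   # shape (3, 1)
rewards = rew_store[idxes]   # shape (3,)
```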

What I’m trying to do:

In my new implementation, I'm trying to write a gather kernel with Numba that collects the data from self._storage using the random index list idxes, and then returns the gathered samples.


  1. The sublists in self._storage have different shapes.
  2. The idxes list is always 1024 in length.
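The first constraint matters for the GPU path: cuda.to_device needs a rectangular array, so ragged per-transition data has to be flattened or padded first. One possible workaround is padding each field to a common length and carrying the true lengths alongside; this is a hypothetical scheme, not something the original code does:

```python
import numpy as np

# Sketch (assumption): pad ragged observations to a common length so they
# fit in one rectangular array that cuda.to_device can accept.
ragged = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]
max_len = max(len(a) for a in ragged)

padded = np.zeros((len(ragged), max_len), dtype=np.float64)
lengths = np.empty(len(ragged), dtype=np.int64)  # true length of each row
for i, a in enumerate(ragged):
    padded[i, :len(a)] = a
    lengths[i] = len(a)
```

A kernel can then index `padded` as a 2-D device array and use `lengths` to ignore the padding.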

Here’s my code:

from numba import cuda
import numpy as np

@cuda.jit
def gather_kernel(storage, idxes, obses_t, actions, rewards, obses_tp1, dones):
    idx = cuda.grid(1)
    if idx < idxes.shape[0]:
        data = storage[idxes[idx]]
        obs_t, action, reward, obs_tp1, done = data
        cuda.atomic.add(obses_t, idx, obs_t)
        cuda.atomic.add(actions, idx, action)
        cuda.atomic.add(rewards, idx, reward)
        cuda.atomic.add(obses_tp1, idx, obs_tp1)
        cuda.atomic.add(dones, idx, done)

class ReplayBuffer(object):

    def _encode_sample(self, idxes):
        num_elements = len(idxes)
        obses_t = np.zeros(num_elements, dtype=np.float64)
        actions = np.zeros(num_elements, dtype=np.float32)
        rewards = np.zeros(num_elements, dtype=np.int64)
        obses_tp1 = np.zeros(num_elements, dtype=np.float64)
        dones = np.zeros(num_elements, dtype=np.float64)
        block_size = 256
        grid_size = (num_elements + block_size - 1) // block_size

        d_storage = cuda.to_device(self._storage)
        d_idxes = cuda.to_device(np.array(idxes))

        gather_kernel[grid_size, block_size](
            d_storage, d_idxes, obses_t, actions, rewards, obses_tp1, dones
        )

        # Synchronize to ensure GPU computations are completed
        cuda.synchronize()

        return obses_t, actions, rewards, obses_tp1, dones

Can anyone point out the mistake I'm making here when trying to gather the data on the GPU?

The code has two problems:

  1. The optimized code takes longer to run than the original.
  2. The optimized code and the original code produce different outputs; in particular, the rewards from the optimized code are wrong.
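One plausible contributor to the wrong rewards, assuming the stored rewards are floating-point (which the original code does not state), is visible in the buffer dtypes: the new code allocates the reward output as int64, so any fractional reward is silently truncated when written into it. A minimal CPU illustration of that casting behavior:

```python
import numpy as np

# Writing float rewards into an int64 buffer truncates toward zero.
rewards_int = np.zeros(3, dtype=np.int64)
rewards_int[:] = np.array([0.9, -0.5, 1.7])  # fractional parts are lost
```

The original `_encode_sample` lets `np.array(rewards)` infer a float dtype instead, so the two paths cannot agree whenever rewards are non-integer.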

Any help on this issue is appreciated.