Numba CUDA implementation producing different results and taking longer time

kailashg26 · May 13, 2023, 10:25pm

Original function:

class ReplayBuffer(object):
----------------------
----------------------
    def _encode_sample(self, idxes):
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
            obses_t.append(np.array(obs_t, copy=False))
            actions.append(np.array(action, copy=False))
            rewards.append(reward)
            obses_tp1.append(np.array(obs_tp1, copy=False))
            dones.append(done)
        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)

In the code, idxes is a list of 1024 random indices. The for loop visits each of the 1024 indices once, retrieves the corresponding entry from self._storage, and appends its fields (observation, action, reward, next observation, done flag) to the output lists.
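To make the loop concrete, here is a minimal, CPU-only sketch of the same gather on toy data. The `make_storage` helper, the 4-element observations, and the scalar integer actions are assumptions for illustration only; the real buffer holds whatever the environment produces.

```python
import numpy as np

def make_storage(n, obs_dim=4, seed=0):
    """Build a toy storage: a list of (obs_t, action, reward, obs_tp1, done) tuples.

    Hypothetical stand-in for self._storage; shapes and dtypes are assumptions.
    """
    rng = np.random.default_rng(seed)
    storage = []
    for _ in range(n):
        obs_t = rng.standard_normal(obs_dim)
        obs_tp1 = rng.standard_normal(obs_dim)
        action = int(rng.integers(0, 3))
        reward = float(rng.standard_normal())
        done = bool(rng.integers(0, 2))
        storage.append((obs_t, action, reward, obs_tp1, done))
    return storage

def encode_sample(storage, idxes):
    """Same gather as _encode_sample, written as a free function."""
    obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
    for i in idxes:
        obs_t, action, reward, obs_tp1, done = storage[i]
        obses_t.append(np.asarray(obs_t))
        actions.append(np.asarray(action))
        rewards.append(reward)
        obses_tp1.append(np.asarray(obs_tp1))
        dones.append(done)
    return (np.array(obses_t), np.array(actions), np.array(rewards),
            np.array(obses_tp1), np.array(dones))

storage = make_storage(5000)
idxes = np.random.default_rng(1).integers(0, len(storage), size=1024).tolist()
batch = encode_sample(storage, idxes)
print([a.shape for a in batch])
```

With these toy shapes, the observation arrays come out as (1024, 4) and the action, reward, and done arrays as (1024,).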

What I’m trying to do:

In my new implementation, I’m trying to write a gather kernel with Numba that collects the data from self._storage using the list of random indices idxes and returns the gathered samples.

Notes:

The sublists in self._storage have different shapes.
The idxes list always has length 1024.
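Since each field has its own shape, it may help to picture the gather one field at a time. The sketch below assumes a layout that is not the one used above: each field is kept in its own contiguous NumPy array rather than in a list of tuples. Under that assumption, the gather reduces to integer-array indexing. All array names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
capacity, obs_dim, batch_size = 5000, 4, 1024  # assumed sizes

# One contiguous array per field (hypothetical layout, not the list of tuples above).
obs_buf = rng.standard_normal((capacity, obs_dim))
reward_buf = rng.standard_normal(capacity)
done_buf = rng.integers(0, 2, size=capacity).astype(np.float64)

idxes = rng.integers(0, capacity, size=batch_size)

# Gather: integer-array indexing picks row idxes[k] for output row k.
obs_batch = obs_buf[idxes]
reward_batch = reward_buf[idxes]
done_batch = done_buf[idxes]

# The same result written as an explicit loop, for comparison.
loop_rewards = np.array([reward_buf[i] for i in idxes])
print(np.array_equal(reward_batch, loop_rewards))
```

Each output row k is a copy of input row idxes[k], which is what a gather kernel would compute with one thread per k.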

Here’s my code:

@cuda.jit(nopython=True)
def gather_kernel(storage, idxes, obses_t, actions, rewards, obses_tp1, dones):
    idx = cuda.grid(1)
    if idx < idxes.shape[0]:
        data = storage[idxes[idx]]
        obs_t, action, reward, obs_tp1, done = data
        cuda.atomic.add(obses_t, idx, obs_t)
        cuda.atomic.add(actions, idx, action)
        cuda.atomic.add(rewards, idx, reward)
        cuda.atomic.add(obses_tp1, idx, obs_tp1)
        cuda.atomic.add(dones, idx, done)

class ReplayBuffer(object):
------------------------
------------------------

    def _encode_sample(self, idxes):
        num_elements = len(idxes)
        obses_t = np.zeros(num_elements, dtype=np.float64)
        actions = np.zeros(num_elements, dtype=np.float32)
        rewards = np.zeros(num_elements, dtype=np.int64)
        obses_tp1 = np.zeros(num_elements, dtype=np.float64)
        dones = np.zeros(num_elements, dtype=np.float64)

        block_size = 256
        grid_size = (len(idxes) + block_size - 1) // block_size

        d_storage = cuda.to_device(self._storage)
        d_idxes = cuda.to_device(np.array(idxes))

        gather_kernel[grid_size, block_size](
            d_storage,
            d_idxes,
            obses_t,
            actions,
            rewards,
            obses_tp1,
            dones
        )

        # Synchronize to ensure GPU computations are completed
        cuda.synchronize()

        return obses_t, actions, rewards, obses_tp1, dones
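
For the launch configuration above, the grid size is a ceiling division of the number of indices by the block size. A quick check in plain Python (no GPU needed); `blocks_needed` is a hypothetical helper name:

```python
def blocks_needed(n_items, block_size):
    """Ceiling division: the smallest number of blocks that covers n_items threads."""
    return (n_items + block_size - 1) // block_size

grid_for_1024 = blocks_needed(1024, 256)
grid_for_1025 = blocks_needed(1025, 256)
print(grid_for_1024, grid_for_1025)
print(grid_for_1024 * 256)  # total threads launched for 1024 indices
```

With 1024 indices and 256 threads per block, exactly 4 blocks are launched and no thread is idle. For a length that is not a multiple of the block size, the `if idx < idxes.shape[0]` guard in the kernel is what keeps the extra threads from indexing out of range.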

Can anyone tell me what mistake I’m making when trying to gather the data on the GPU?

The code has two problems:

The optimized code takes longer to run than the original.
The optimized code and the original code produce different outputs. The rewards obtained with the optimized code are worse.
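
For reference, this is roughly how the two symptoms can be checked. The helpers `batches_match` and `time_call` are hypothetical names, and the toy tuples below stand in for the outputs of the two `_encode_sample` versions.

```python
import time
import numpy as np

def batches_match(batch_a, batch_b, rtol=1e-6, atol=1e-8):
    """Per-field check: same shape and numerically close values."""
    results = []
    for a, b in zip(batch_a, batch_b):
        a = np.asarray(a, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        same = a.shape == b.shape and np.allclose(a, b, rtol=rtol, atol=atol)
        results.append(bool(same))
    return results

def time_call(fn, *args, repeats=10):
    """Best wall-clock time in seconds over `repeats` calls, after one warm-up call."""
    fn(*args)  # warm-up, so one-time costs such as JIT compilation are excluded
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Toy stand-ins for the outputs of the original and the optimized version.
reference = (np.ones((8, 4)), np.arange(8.0))
candidate = (np.ones((8, 4)), np.zeros(8))
print(batches_match(reference, candidate))  # [True, False]
print(time_call(np.sort, np.arange(1000.0)) >= 0.0)
```

The warm-up call matters for Numba in particular, because the first call of a jitted function includes compilation time.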

Any help on this issue is appreciated.

Thanks
