Random Number Generation on the GPU

I’m trying to create a coin flip function on the GPU. From what I understood, I should use the functionality described here: https://numba.pydata.org/numba-doc/latest/cuda/random.html

However, create_xoroshiro128p_states(n=someint, seed=1) is returning ‘TypeError: object of type ‘int’ has no len()’.
I don’t know why I’m getting this. Shouldn’t ‘n’ be an int, as stated in the documentation?

If I do:

from numba.cuda import random
arr = random.create_xoroshiro128p_states(10, seed=1)

then I don’t see any issue - can you post some executable code that reproduces the issue please?

Also, what version of Numba is this? What is the output of the command numba -s?

It might have been a temporary bug. I’ve tried it since then and got no errors. This is version 0.51.2.
That being said, I do have a follow-up question, though I’m not sure if I should create a new topic.

I have a 3D array of size 701x900x719 (X, Y, Z) and a certain number of voxels considered “grey” that I have to change into “black” or “white”.

I’m not sure how to use create_xoroshiro128p_states in this situation. I think I only need at most as many RNG states as there are grey voxels, so, with ending_greys being the number of grey voxels, I’m doing:

create_xoroshiro128p_states(ending_greys.item(), seed=1)

I’m using grid(3) to get the thread’s position and xoroshiro128p_uniform_float32 to get the next value. The example in the documentation calls xoroshiro128p_uniform_float32(rng_states, thread_id) with thread_id = cuda.grid(1).
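To be concrete, this is roughly the 1D version of the coin flip I have in mind, following that pattern (just a sketch of my understanding - the names like coin_flip_1d and the sizes are my own, not from the docs):

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)
import numpy as np


@cuda.jit
def coin_flip_1d(out, rng_states):
    # One thread per element, one RNG state per thread
    tid = cuda.grid(1)
    if tid < out.shape[0]:
        # "Heads" (1) if the uniform draw is >= 0.5, "tails" (0) otherwise
        if xoroshiro128p_uniform_float32(rng_states, tid) >= 0.5:
            out[tid] = 1
        else:
            out[tid] = 0


n = 100000
threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block

rng_states = create_xoroshiro128p_states(blocks * threads_per_block, seed=1)
out = np.zeros(n, dtype=np.uint8)
coin_flip_1d[blocks, threads_per_block](out, rng_states)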
How would I go about doing this with grid(3) instead?

NOTE: When I first wrote this I had a typo in the calculation of tid - this is now corrected, and was spotted by JRibeiro in the following posts.

To use the RNG with a 3D grid, you need to linearize the thread indices into a single thread ID. The following example does this to create a 3D array of random numbers:

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)
import numpy as np


@cuda.jit
def random_3d(arr, rng_states):
    # Per-dimension thread indices and strides
    startx, starty, startz = cuda.grid(3)
    stridex, stridey, stridez = cuda.gridsize(3)

    # Linearized thread index
    tid = (startz * stridey * stridex) + (starty * stridex) + startx

    # Use strided loops over the array to assign a random value to each entry
    for i in range(startz, arr.shape[0], stridez):
        for j in range(starty, arr.shape[1], stridey):
            for k in range(startx, arr.shape[2], stridex):
                arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)


# Array dimensions
X, Y, Z = 701, 900, 719

# Block and grid dimensions
bx, by, bz = 8, 8, 8
gx, gy, gz = 16, 16, 16

# Total number of threads
nthreads = bx * by * bz * gx * gy * gz

# Initialize a state for each thread
rng_states = create_xoroshiro128p_states(nthreads, seed=1)

# Generate random numbers
arr = cuda.device_array((X, Y, Z), dtype=np.float32)
random_3d[(gx, gy, gz), (bx, by, bz)](arr, rng_states)

A couple of notes on the example:

  • Since the array is so large, it wouldn’t be a good strategy to assign one thread to each element here - creating the initial random states for each thread would take a long time, and the overhead of the kernel launch would be larger than necessary.
  • A fixed number of threads stride across the array, generating random numbers using the linearized index tid to index into the RNG states.

Does the example provide a starting point for something you can adapt to your use case?

Also, a PR to add this example to the documentation is at: https://github.com/numba/numba/pull/6430 - I thought the question was a good one and the answer useful to others.


I think it does. Let me try and process the information bit by bit. I’m still new to these environments and there are a few things I don’t fully understand:

  1. Block and grid dimensions. The first issue is partially about terminology. So far I’ve been specifying the execution configuration as (blockspergrid, threadsperblock), so threadsperblock and blockspergrid are analogous to block dimensions and grid dimensions, right?
    The second part is: as per the documentation, blockspergrid should be initialized taking into account the dimensions of the given array, e.g.
    blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
    Therefore, why is blockspergrid being declared independently in this case? Shouldn’t it be
    gx = math.ceil(X / bx) ?

  2. Linearized Thread Index. Despite this not being critical to the topic, I would appreciate it if you could explain the formula or point me to a resource that does, though I did find this: https://cs.calvin.edu/courses/cs/374/CUDA/CUDA-Thread-Indexing-Cheatsheet.pdf.
    I can’t wrap my head around it. From what I understand:
    thread_index =
    position of the current thread in the entire grid (z) * grid size.y * grid size.z +
    position of the current thread in the entire grid (y) * grid size.x +
    position of the current thread in the entire grid (x)
    Shouldn’t it be something akin to, if
    position of the current thread in the entire grid (x) = column
    position of the current thread in the entire grid (y) = row
    position of the current thread in the entire grid (z) = plane
    gridsize.y = height (number of rows) and gridsize.x = width (number of columns):
    thread_index =
    plane * height * width +
    row * width +
    column ?
    in other words,

tid = (startz * stridey * stridex) + (starty * stridex) + startx

I’ll continue on the next post.

My function. I’m implementing an algorithm known as hysteresis that has the following steps:

  1. Given a 3D matrix made up of values between 0 and 255, start by submitting the matrix to a segmentation procedure with two thresholds, t1 and t2, with t2 > t1 (arbitrarily chosen).
    I’ve implemented this function as follows:
@cuda.jit
def numba_seg(arr, t1, t2, out):
    x, y, z = cuda.grid(3)

    if x < arr.shape[0] and y < arr.shape[1] and z < arr.shape[2]:
        value = arr[x, y, z]
        if value > t2:
            out[x, y, z] = 255
        elif value < t1:
            out[x, y, z] = 0
        else:
            out[x, y, z] = 128

With your suggestion of using grid-strided loops I believe this becomes:

@cuda.jit
def numba_stride_seg(arr, t1, t2, out):
    x, y, z = cuda.grid(3)
    stride_x, stride_y, stride_z = cuda.gridsize(3)

    for i in range(x, arr.shape[0], stride_x):
        for j in range(y, arr.shape[1], stride_y):
            for k in range(z, arr.shape[2], stride_z):
                value = arr[i, j, k]
                if value > t2:
                    out[x, y, z] = 255
                elif value < t1:
                    out[x, y, z] = 0
                else:
                    out[x, y, z] = 128

threadsperblock and blockspergrid are analogous to blockdimensions and griddimensions right?

That is correct.
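To make the mapping concrete, here is a small throwaway check (my own example - the kernel name is made up) showing that the tuples passed at launch appear inside the kernel as cuda.blockDim and cuda.gridDim:

from numba import cuda
import numpy as np


@cuda.jit
def show_dims(out):
    # threadsperblock appears inside the kernel as cuda.blockDim,
    # and blockspergrid appears as cuda.gridDim
    if cuda.grid(1) == 0:
        out[0] = cuda.blockDim.x
        out[1] = cuda.gridDim.x


out = np.zeros(2, dtype=np.int32)
threadsperblock = 64
blockspergrid = 4
show_dims[blockspergrid, threadsperblock](out)
print(out)  # expected: [64, 4] (block dimension, grid dimension)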

Therefore, why is blockspergrid being declared independently in this case? Shouldn’t it be
gx = math.ceil(X / bx) ?

For getting started and testing, creating one thread per element is fine.

Once you have an array with a lot of elements, it’s inefficient to create as many threads as you have elements (it takes more time for the launch because blocks have to be tracked by the device, and the runtime of a block can be short compared to the time taken to instantiate it). For large enough data, this strategy will eventually require more threads than the device supports. So for any application with enough data to make good use of a GPU, it’s usually best instead to choose a launch configuration that provides good occupancy on the device, and then loop inside the kernel within each thread. The striding of the loops inside the kernel helps to ensure that memory accesses of adjacent threads coalesce (see e.g. this Stackoverflow post if you’re unfamiliar with coalescing).
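As a rough sketch of what I mean by sizing the launch for the device rather than the data (the specific numbers here, like 8 blocks per multiprocessor and 512 threads per block, are assumptions to tune, not fixed rules):

from numba import cuda

# Query the device rather than the data to size a fixed launch configuration
device = cuda.get_current_device()
sm_count = device.MULTIPROCESSOR_COUNT

threadsperblock = (8, 8, 8)        # 512 threads per block
target_blocks = sm_count * 8       # aim for several resident blocks per SM

# Split the target roughly evenly across the three grid dimensions
gdim = max(1, round(target_blocks ** (1 / 3)))
blockspergrid = (gdim, gdim, gdim)

# The grid-stride loops inside the kernel then cover the whole array
# regardless of its shape.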

Shouldn’t it be…

tid = (startz * stridey * stridex) + (starty * stridex) + startx

Argh, yes! There was a typo in my code sample above; it should be exactly as you’ve written.
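As a quick sanity check (plain Python on the host, nothing CUDA-specific), the corrected linearization maps every (x, y, z) position in a grid to a unique index:

# Check that tid = z * sy * sx + y * sx + x is a bijection from the
# (x, y, z) positions onto range(sx * sy * sz) for a small example grid
sx, sy, sz = 4, 3, 2

tids = [z * sy * sx + y * sx + x
        for z in range(sz) for y in range(sy) for x in range(sx)]

assert sorted(tids) == list(range(sx * sy * sz))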

Regarding the function you posted (it took me a little while to look into it): as well as adding the stride loops, you also need to index the out array with i, j, and k. Also, I find that I get better performance if I use x for the innermost loop and z for the outermost loop (x is the fastest-varying thread index, so mapping it to the contiguous last dimension of the array means adjacent threads access adjacent elements, keeping the accesses coalesced). So I think your function should look like this:

@cuda.jit
def numba_stride_seg(arr, t1, t2, out):
    x, y, z = cuda.grid(3)
    stride_x, stride_y, stride_z = cuda.gridsize(3)

    for i in range(z, arr.shape[0], stride_z):
        for j in range(y, arr.shape[1], stride_y):
            for k in range(x, arr.shape[2], stride_x):
                value = arr[i, j, k]
                if value > t2:
                    out[i, j, k] = 255
                elif value < t1:
                    out[i, j, k] = 0
                else:
                    out[i, j, k] = 128

A short example that runs both the unstrided and strided versions and compares their output and performance follows:

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)
from time import perf_counter
import numpy as np
import math


@cuda.jit
def random_3d(arr, rng_states):
    # Per-dimension thread indices and strides
    startx, starty, startz = cuda.grid(3)
    stridex, stridey, stridez = cuda.gridsize(3)

    # Linearized thread index
    tid = (startz * stridey * stridex) + (starty * stridex) + startx

    # Use strided loops over the array to assign a random value to each entry
    for i in range(startz, arr.shape[0], stridez):
        for j in range(starty, arr.shape[1], stridey):
            for k in range(startx, arr.shape[2], stridex):
                arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)


@cuda.jit
def numba_seg(arr, t1, t2, out):
    x, y, z = cuda.grid(3)

    if x < arr.shape[0] and y < arr.shape[1] and z < arr.shape[2]:
        value = arr[x, y, z]
        if value > t2:
            out[x, y, z] = 255
        elif value < t1:
            out[x, y, z] = 0
        else:
            out[x, y, z] = 128


@cuda.jit
def numba_stride_seg(arr, t1, t2, out):
    x, y, z = cuda.grid(3)
    stride_x, stride_y, stride_z = cuda.gridsize(3)

    for i in range(z, arr.shape[0], stride_z):
        for j in range(y, arr.shape[1], stride_y):
            for k in range(x, arr.shape[2], stride_x):
                value = arr[i, j, k]
                if value > t2:
                    out[i, j, k] = 255
                elif value < t1:
                    out[i, j, k] = 0
                else:
                    out[i, j, k] = 128


# Array dimensions
X, Y, Z = 701, 900, 719

# Block and grid dimensions
bx, by, bz = 8, 8, 8
# For the one-element-per-thread version
gx, gy, gz = math.ceil(X / bx), math.ceil(Y / by), math.ceil(Z / bz)
# For the stride-loop version
gxs, gys, gzs = 16, 16, 16

# Total number of threads
nthreads = bx * by * bz * gxs * gys * gzs

# Initialize a state for each thread
print("Initializing RNG states")
rng_states = create_xoroshiro128p_states(nthreads, seed=1)

# Generate random numbers
print("Generating random numbers")
arr = cuda.device_array((X, Y, Z), dtype=np.float32)
random_3d[(gxs, gys, gzs), (bx, by, bz)](arr, rng_states)

# Test versions
out = cuda.device_array_like(arr)
out_stride = cuda.device_array_like(arr)
cuda.synchronize()

print("Running one element per thread version")
start = perf_counter()
numba_seg[(gx, gy, gz), (bx, by, bz)](arr, 0.25, 0.75, out)
cuda.synchronize()
end = perf_counter()
one_element_time = end - start

print("Running strided version")
start = perf_counter()
numba_stride_seg[(gxs, gys, gzs), (bx, by, bz)](arr, 0.25, 0.75, out_stride)
cuda.synchronize()
end = perf_counter()
strided_time = end - start

print("Copying results to host")
out_host = out.copy_to_host()
out_stride_host = out_stride.copy_to_host()

print("Sanity checking output")
np.testing.assert_equal(out_host, out_stride_host)

print("Sanity check OK!")
print(f"One element per thread time: {one_element_time}")
print(f"Strided loop time:           {strided_time}")

Note that the cuda.synchronize() calls are not necessary for correctness; I just have them there to ensure that I’m timing the whole kernel execution (and not just the launch). With this benchmark I find that the strided version actually has similar performance to (or is maybe a few percent slower than) the one-thread-per-element version. The output for me is as follows:

Initializing RNG states
Generating random numbers
Running one element per thread version
Running strided version
Copying results to host
Sanity checking output
Sanity check OK!
One element per thread time: 0.14170214999467134
Strided loop time:           0.14913468800659757

No worries. Thanks for all the help so far :smile:
Regarding the link on coalesced memory, I think I got the gist of it, though I’m not so sure how it relates to the chosen grid dimension values.
Let me break down the data first. With 701x900x719 we have 453,617,100 elements.
I think we have 458,219,520 threads for the one-element-per-thread version. Makes sense.
We have 2,097,152 threads for the stride-loop version. I don’t really get this one. From what was said before, I think the idea is to look at something other than threads. I don’t know what.

Ah! Of course. My bad. A few things still concerning this function:

  1. Is the section below correct? Don’t we need to change where to stop too? E.g.
    for i in range(z, arr.shape[2], stride_z)
  2. I’m still getting considerably slower times for the strided version even after making the changes, though it’s now obvious this is tied to the launch configuration issue.
    By the way, it takes roughly 0.224 seconds on average for the non-strided version and 0.302 seconds for your strided version.
    If I make the changes mentioned above, 0.295 seconds, and 0.309 if left in the original order.
    If I change the grid dimensions to (16, 16, 16) I get 0.227 seconds - pretty much the same as the non-strided version, as expected - and, if I add the changes above, I even get 0.218 seconds, slightly faster than the non-strided version.
    Below is the configuration I’m using, i.e. the one for the non-strided version. I suppose this will be solved once I understand the very first question.
threadsperblock = (2, 8, 32)  # 512 threads - ordering made no significant change
blockspergrid_x = ceil(arr.shape[0] / threadsperblock[0])
blockspergrid_y = ceil(arr.shape[1] / threadsperblock[1])
blockspergrid_z = ceil(arr.shape[2] / threadsperblock[2])
  3. Assuming the previous issue is because of the immense number of threads launched, do you think there is merit in choosing the strided version over the non-strided one given these execution times? Do you think the difference in execution time will change considerably with the size of the data?