Solving ODE on CUDA

Hello,
I’m taking my first steps in parallel computing on a CUDA GPU, and I really appreciate how easy Numba makes it. After a few days of reading the docs and examples, and with a budget NVIDIA card, I was able to compute some of my scientific work an order of magnitude faster than I could with C++ on my CPU.
While browsing the web I ran into this forum, and I’m wondering if I could get some general feedback on a piece of code I wrote. Please send me away if this isn’t the right place :slight_smile:

My goal is to solve the same set of equations, with random initial conditions, as fast as possible. I thought I would make a big array of initial conditions and then have each thread work on its own piece of the array. Below is the fastest version I was able to write for my GT 710 GPU. Did I make any obvious mistakes? How could I optimize it further? Thank you for your time!

from numba import cuda # for parallel gpu computations
import numpy as np # for random samples and array management
import time # for timing
import math # for sine
import matplotlib.pyplot as plt # for scatter plot


@cuda.jit
def solve_ode(x, time):
    """
    Solve 2DoF ode on gpu, given the initial conditions. The result will be
    stored in the input array.

    Parameters
    ----------
    x : np.array
        Contains initial conditions for the simulations.
        The elements are arranged in pairs:
        [x1_sim1, x2_sim1, x1_sim2, x2_sim2, ...]
    time : np.array
        Three-element array with the start time, end time, and time step.

    Returns
    -------
    None.

    """
    # time variables
    t = time[0]
    t_end = time[1]
    dt = time[2]
    
    # index of thread on GPU
    pos = cuda.grid(1)
    # mapping index to access every
    # second element of the array
    pos = pos * 2
    
    # condition to avoid threads
    # accessing indices out of the array
    if pos < x.size:
        # execute until the time reaches t_end
        while t < t_end:
            # compute derivatives
            dxdt0 = x[pos+1]
            dxdt1 = np.float32(10.0)*math.sin(t) - np.float32(0.1)*x[pos+1] - x[pos]**3
            
            # update state vector
            x[pos] += dxdt0 * dt
            x[pos+1] += dxdt1 * dt
            
            # update time
            t += dt
            
        
# number of independent oscillators
# to simulate
trials = 10_000

# time variables
t0 = 0
t_end = 100
dt = 0.01
t = np.array([t0, t_end, dt], dtype='float32')

# generate random initial conditions
init_states = np.random.random_sample(2 * trials).astype('float32')

# configure the number of threads per block and blocks per grid
threads_per_block = 32
blocks_per_grid = \
    (init_states.size + (threads_per_block - 1)) // threads_per_block

# start timer
start = time.perf_counter()

# start parallel simulations
solve_ode[blocks_per_grid, threads_per_block](init_states, t)

# measure time elapsed
end = time.perf_counter()
print(f'The result was computed in {end-start} s')

# reshape the array into 2D
x = init_states.reshape((trials, 2))

# plot the phase space
plt.scatter(x[:, 0], x[:, 1], s=1)

Hi Konrad,

Apologies for the slow reply on this one. In general I think you did a pretty good job of writing code optimized for the GPU. The only thing I could spot that would improve performance is replacing the computation of x[pos]**3 with xp * xp * xp, where xp = x[pos]; this compiles down to a couple of multiply instructions instead of calling a more general routine for computing arbitrary powers.
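
To make that concrete, here is the relevant excerpt of the integration loop before and after the change (the complete kernels are listed further down; xp and xp3 are just local temporaries introduced for the cube):

            # before: x[pos]**3 is computed with a more general power routine
            dxdt1 = np.float32(10.0) * math.sin(t) - np.float32(0.1) * x[pos + 1] - x[pos]**3

            # after: the cube is computed with two multiplies
            xp = x[pos]
            xp3 = xp * xp * xp
            dxdt1 = np.float32(10.0) * math.sin(t) - np.float32(0.1) * x[pos + 1] - xp3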

There are a couple of things to note about the timing:

  • Because you only run once and have lazy compilation enabled (no signature is provided to the JIT decorator - see “Compiling Python code with @jit” in the Numba documentation), you are also including the time spent JIT-compiling the function. I added a signature to the decorator so that it becomes @cuda.jit(void(float32[::1], float32[::1])), which forces compilation immediately. Note that float32[::1] describes a contiguous array of float32s; using just float32[:] describes an array that may not be contiguous, which leads to less efficient code generation.
  • Because you’re passing in NumPy arrays, Numba has to transfer them to the device before execution and transfer them back afterwards. This is an overhead that wouldn’t usually exist in a program with multiple kernels operating on data that already resides on the GPU, so you wouldn’t normally want to time it; I moved the data transfers to/from the device outside of the timed region. When launching kernels on data that resides on the device, kernel launches are asynchronous, so it’s important to call cuda.synchronize() before recording the end time - otherwise you only record the CPU time taken to launch the kernel, rather than its execution time on the GPU (see the short timing sketch after this list).

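To illustrate that last point, here is a minimal timing sketch. It assumes the device arrays (d_init_states, d_t) and the launch configuration from the full listing below; without the synchronize, only the CPU-side launch overhead is measured:

# time the kernel launch alone vs. launch plus execution
launch_start = time.perf_counter()
solve_ode[blocks_per_grid, threads_per_block](d_init_states, d_t)
launch_end = time.perf_counter()   # the kernel may still be running on the GPU here

cuda.synchronize()                 # block until the kernel has finished
finish = time.perf_counter()

print(f"Launch overhead only:  {launch_end - launch_start} s")
print(f"Launch plus execution: {finish - launch_start} s")
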
Code comparing the two versions follows:

from numba import cuda, void, float32  # for parallel gpu computations
import numpy as np  # for random samples and array management
import time  # for timing
import math  # for sine


@cuda.jit(void(float32[::1], float32[::1]))
def solve_ode(x, time):
    """
    Solve 2DoF ode on gpu, given the initial conditions. The result will be
    stored in the input array.


    Parameters
    ----------
    x : np.array
        Contains initial conditions for the simulations.
        The elements are arranged in pairs:
        [x1_sim1, x2_sim1, x1_sim2, x2_sim2, ...]
    time : np.array
        Three-element array with the start time, end time, and time step.

    Returns
    -------
    None.

    """
    # time variables
    t = time[0]
    t_end = time[1]
    dt = time[2]

    # index of thread on GPU
    pos = cuda.grid(1)
    # mapping index to access every
    # second element of the array
    pos = pos * 2

    # condition to avoid threads
    # accessing indices out of the array
    if pos < x.size:
        # execute until the time reaches t_end
        while t < t_end:
            # compute derivatives
            dxdt0 = x[pos + 1]
            dxdt1 = (
                np.float32(10.0) * math.sin(t)
                - np.float32(0.1) * x[pos + 1]
                - x[pos]**3
            )

            # update state vector
            x[pos] += dxdt0 * dt
            x[pos + 1] += dxdt1 * dt

            # update time
            t += dt


@cuda.jit(void(float32[::1], float32[::1]))
def solve_ode_opt(x, time):
    """
    Solve 2DoF ode on gpu, given the initial conditions. The result will be
    stored in the input array.


    Parameters
    ----------
    x : np.array
        Contains initial conditions for the simulations.
        The elements are arranged in pairs:
        [x1_sim1, x2_sim1, x1_sim2, x2_sim2, ...]
    time : np.array
        Three-element array with the start time, end time, and time step.

    Returns
    -------
    None.

    """
    # time variables
    t = time[0]
    t_end = time[1]
    dt = time[2]

    # index of thread on GPU
    pos = cuda.grid(1)
    # mapping index to access every
    # second element of the array
    pos = pos * 2

    # condition to avoid threads
    # accessing indices out of the array
    if pos < x.size:
        # execute until the time reaches t_end
        while t < t_end:
            # compute derivatives
            dxdt0 = x[pos + 1]
            xp = x[pos]
            xp3 = xp * xp * xp
            dxdt1 = (
                np.float32(10.0) * math.sin(t)
                - np.float32(0.1) * x[pos + 1]
                - xp3
            )

            # update state vector
            x[pos] += dxdt0 * dt
            x[pos + 1] += dxdt1 * dt

            # update time
            t += dt


# number of independent oscillators
# to simulate
trials = 1_000_000

# time variables
t0 = 0
t_end = 200
dt = 0.001
t = np.array([t0, t_end, dt], dtype="float32")

# generate random initial conditions
init_states = np.random.random_sample(2 * trials).astype("float32")

# configure the number of threads per block and blocks per grid
threads_per_block = 32
blocks_per_grid = (init_states.size + (threads_per_block - 1)) \
        // threads_per_block

d_init_states = cuda.to_device(init_states)

d_init_states_opt = cuda.to_device(init_states)

d_t = cuda.to_device(t)

# start timer
start = time.perf_counter()

# start parallel simulations
solve_ode[blocks_per_grid, threads_per_block](d_init_states, d_t)
cuda.synchronize()

# measure time elapsed
end = time.perf_counter()


# start timer
start_opt = time.perf_counter()

# start parallel simulations
solve_ode_opt[blocks_per_grid, threads_per_block](d_init_states_opt, d_t)
cuda.synchronize()

# measure time elapsed
end_opt = time.perf_counter()

# Copy results back to host
original_result = d_init_states.copy_to_host()
optimized_result = d_init_states_opt.copy_to_host()

# Sanity check

np.testing.assert_equal(original_result, optimized_result)

print(f"The result was computed in {end-start} s")
print(f"The optimized result was computed in {end_opt-start_opt} s")

(I also added some code to check that the two versions produce identical output, to avoid ending up with something that computes the wrong answer incredibly quickly :slight_smile: ). For me the output (with a Quadro RTX 8000) is:

$ python repro.py 
The result was computed in 0.9238016649906058 s
The optimized result was computed in 0.8137022639857605 s

so the change suggested above shaves about 12% off the kernel runtime.

Hi Graham,
Thanks again for your time and the feedback! I really enjoyed this whole CUDA exercise, and I’m still impressed by how fast I could do my computations with Numba’s interface and a 40 € GPU :slightly_smiling_face: