Hello,
I’m taking my first steps in parallel computing with CUDA and I really appreciate how easy Numba makes it. After a few days of reading the docs and examples, and with a budget NVIDIA card, I was able to compute some of my scientific stuff an order of magnitude faster than I could with C++ on my CPU.
While browsing the web I ran into this forum, and I’m wondering if I could get some general feedback on a piece of code I wrote. Please send me away if this isn’t the right place.
My goal is to solve the same set of equations, with random initial conditions, as fast as possible. My idea was to make one big array of initial conditions and have each thread work on its own piece of that array. Below is the fastest version I was able to write for my GT 710 GPU. Did I make any obvious mistakes? How could I optimize further? Thank you for your time!
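For reference, the system being integrated (read off from the kernel below) is a sinusoidally forced oscillator with a cubic stiffness term, i.e. a Duffing-type equation:

```latex
\begin{aligned}
\dot{x}_1 &= x_2 \\
\dot{x}_2 &= 10\sin(t) - 0.1\,x_2 - x_1^{3}
\end{aligned}
```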
from numba import cuda  # for parallel GPU computations
import numpy as np  # for random samples and array management
import time  # for timing
import math  # for the sine function
import matplotlib.pyplot as plt  # for the scatter plot
@cuda.jit
def solve_ode(x, time):
    """
    Solve a 2-DoF ODE on the GPU, given
    the initial conditions. The result
    is stored back in the input array.

    Parameters
    ----------
    x : np.ndarray
        Contains initial conditions for the simulations.
        The elements are arranged in pairs:
        [x1_sim1, x2_sim1, x1_sim2, x2_sim2, ...]
    time : np.ndarray
        Three-element array with time details:
        [t_start, t_end, dt].

    Returns
    -------
    None.
    """
    # time variables
    t = time[0]
    t_end = time[1]
    dt = time[2]
    # index of this thread on the GPU
    pos = cuda.grid(1)
    # map the thread index to every
    # second element of the array
    pos = pos * 2
    # guard against threads
    # accessing indices out of the array
    if pos < x.size:
        # integrate until the time reaches t_end
        while t < t_end:
            # compute derivatives
            dxdt0 = x[pos + 1]
            dxdt1 = np.float32(10.0) * math.sin(t) - np.float32(0.1) * x[pos + 1] - x[pos] ** 3
            # update the state vector (explicit Euler step)
            x[pos] += dxdt0 * dt
            x[pos + 1] += dxdt1 * dt
            # update time
            t += dt
# number of independent oscillators
# to simulate
trials = 10_000
# time variables
t0 = 0
t_end = 100
dt = 0.01
t = np.array([t0, t_end, dt], dtype='float32')
# generate random initial condiotions
init_states = np.random.random_sample(2 * trials).astype('float32')
# manage nr of threads (threads)
threads_per_block = 32
blocks_per_grid = \
(init_states.size + (threads_per_block - 1)) // threads_per_block
# start timer
start = time.perf_counter()
# start parallel simulations
solve_ode[blocks_per_grid, threads_per_block](init_states, t)
# kernel launches are asynchronous; wait for the kernel
# (and the implicit copy back) before stopping the timer
cuda.synchronize()
# measure time elapsed
end = time.perf_counter()
print(f'The result was computed in {end - start} s')
# reshape the array into 2D
x = init_states.reshape((trials, 2))
# plot the phase space
plt.scatter(x[:, 0], x[:, 1], s=1)
plt.show()
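To sanity-check the kernel without a GPU, I compare it against a pure NumPy version of the same explicit-Euler update (the function name `euler_cpu` is just for this sketch; it uses float64 and a vectorized state array of shape `(n_sim, 2)` instead of the flat float32 layout, so results should only agree to within float32 round-off):

```python
import numpy as np

def euler_cpu(states, t0, t_end, dt):
    """Integrate the same forced Duffing-type ODE for all
    simulations at once with explicit Euler on the CPU.
    states has shape (n_sim, 2): columns are x1 and x2."""
    x = states.copy()
    t = t0
    while t < t_end:
        # same derivatives as in the CUDA kernel
        dx0 = x[:, 1]
        dx1 = 10.0 * np.sin(t) - 0.1 * x[:, 1] - x[:, 0] ** 3
        # explicit Euler step
        x[:, 0] += dx0 * dt
        x[:, 1] += dx1 * dt
        t += dt
    return x

# small smoke test: two simulations, short horizon
init = np.array([[0.1, 0.0], [0.2, 0.1]])
out = euler_cpu(init, 0.0, 1.0, 0.01)
print(out.shape)  # (2, 2)
```

The GPU result, reshaped to `(trials, 2)`, should approximately match this for the same initial conditions.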