A program of mine produces a large amount of temporary data, causing the GPU to run out of memory. I therefore have to shrink the temporary buffer to match the number of parallel threads, but now the memory contents get corrupted.
The simulation code is below. Its intent is to copy data set A to data set B, using a temporary buffer that is smaller than A and B.
If you use the function "testThreadsPerBlockMemoryNormal", the result is correct. The only difference from "testThreadsPerBlockMemory" is that the assignment operations are in different loops. (In the real algorithm, the two loops must be separated.)
I need some advice on how to achieve the same result as "testThreadsPerBlockMemoryNormal".
----code----
import numpy as np
from numba import cuda


@cuda.jit()
def testThreadsPerBlockMemoryNormal(dataA, dataB, dataBuff, randomBuff):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + bx * bw
    # each thread handles one chunk of dataBuff.shape[0] elements
    if pos < dataA.shape[0] // dataBuff.shape[0]:
        for i in range(dataBuff.shape[0]):
            dataBuff[tx] = dataA[pos * dataBuff.shape[0] + i]
            # delay begin
            b = 0
            for j in range(randomBuff[pos * dataBuff.shape[0] + i]):
                b += 1
            # delay end
            dataB[pos * dataBuff.shape[0] + i] = dataBuff[tx]


@cuda.jit()
def testThreadsPerBlockMemory(dataA, dataB, dataBuff, randomBuff):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + bx * bw
    # each thread handles one chunk of dataBuff.shape[0] elements
    if pos < dataA.shape[0] // dataBuff.shape[0]:
        for i in range(dataBuff.shape[0]):
            dataBuff[tx] = dataA[pos * dataBuff.shape[0] + i]
            # delay begin
            b = 0
            for j in range(randomBuff[pos * dataBuff.shape[0] + i]):
                b += 1
            # delay end
        for i in range(dataBuff.shape[0]):
            dataB[pos * dataBuff.shape[0] + i] = dataBuff[tx]


def doTest():
    data_A_host = np.zeros(100, dtype=np.int32)
    data_B_host = np.array(data_A_host, dtype=np.int32)
    dataBuff_host = np.zeros(10, dtype=np.int32)
    randomBuff_host = np.array(data_A_host, dtype=np.int32)
    for i in range(data_A_host.shape[0]):
        data_A_host[i] = i
    for i in range(randomBuff_host.shape[0]):
        randomBuff_host[i] = np.random.randint(1, 100000000)
    dataA = cuda.to_device(data_A_host)
    dataB = cuda.to_device(data_B_host)
    dataBuff = cuda.to_device(dataBuff_host)
    randomBuff = cuda.to_device(randomBuff_host)
    testThreadsPerBlockMemory[10, 10](dataA, dataB, dataBuff, randomBuff)
    dataB.copy_to_host(data_B_host)
    cuda.synchronize()
    dataBuff.copy_to_host(dataBuff_host)
    cuda.synchronize()
    print('-----data_A_host------')
    print(data_A_host)
    print('-----data_B_host------')
    print(data_B_host)
    print('-----dataBuff_host------')
    print(dataBuff_host)
    for i in range(data_A_host.shape[0]):
        if data_A_host[i] != data_B_host[i]:
            print(data_A_host[i], data_B_host[i])


if __name__ == "__main__":
    doTest()
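The buffer-overwrite effect of separating the two loops can be reproduced without a GPU at all. Here is a single-threaded NumPy sketch (illustrative only, not part of the original program) of what one thread of testThreadsPerBlockMemory does with its single buffer slot: the first loop overwrites the slot on every iteration, so the second loop only ever sees the last value of the chunk.

```python
import numpy as np

CHUNK = 10  # elements handled per thread, as in the kernels above


def two_loops_one_slot(dataA):
    """Mimic testThreadsPerBlockMemory for a single thread: one buffer slot,
    loaded in one loop and stored in a separate loop."""
    dataB = np.zeros_like(dataA)
    for pos in range(dataA.shape[0] // CHUNK):
        for i in range(CHUNK):
            buff_slot = dataA[pos * CHUNK + i]  # each load overwrites the slot
        for i in range(CHUNK):
            dataB[pos * CHUNK + i] = buff_slot  # only the last load survives
    return dataB


a = np.arange(100, dtype=np.int32)
print(two_loops_one_slot(a)[:10])  # [9 9 9 9 9 9 9 9 9 9], not [0 1 ... 9]
```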
Sorry, I reviewed my simulation code: the two functions were not equivalent. I've now uploaded new code that makes them equivalent for a single thread, but with multiple threads the memory reads and writes still get out of sync.
import numpy as np
from numba import cuda


@cuda.jit()
def testThreadsPerBlockMemoryNormal(dataA, dataB, dataBuff, randomBuff):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + bx * bw
    # each block handles one chunk of dataBuff.shape[1] elements
    if bx < dataA.shape[0] // dataBuff.shape[1]:
        for i in range(dataBuff.shape[1]):
            dataBuff[tx, i] = dataA[bx * dataBuff.shape[1] + i]
            # delay begin
            b = 0
            for j in range(randomBuff[bx * dataBuff.shape[1] + i]):
                b += 1
            # delay end
            dataB[bx * dataBuff.shape[1] + i] = dataBuff[tx, i]


@cuda.jit()
def testThreadsPerBlockMemory(dataA, dataB, dataBuff, randomBuff):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + bx * bw
    # each block handles one chunk of dataBuff.shape[1] elements
    if bx < dataA.shape[0] // dataBuff.shape[1]:
        for i in range(dataBuff.shape[1]):
            dataBuff[tx, i] = dataA[bx * dataBuff.shape[1] + i]
            # delay begin
            b = 0
            for j in range(randomBuff[bx * dataBuff.shape[1] + i]):
                b += 1
            # delay end
        for i in range(dataBuff.shape[1]):
            dataB[bx * dataBuff.shape[1] + i] = dataBuff[tx, i]


def doTest():
    data_A_host = np.zeros(100, dtype=np.int32)
    data_B_host = np.array(data_A_host, dtype=np.int32)
    dataBuff_host = np.zeros((10, 10), dtype=np.int32)
    randomBuff_host = np.array(data_A_host, dtype=np.int32)
    for i in range(data_A_host.shape[0]):
        data_A_host[i] = i
    for i in range(randomBuff_host.shape[0]):
        randomBuff_host[i] = np.random.randint(1, 100000000)
    dataA = cuda.to_device(data_A_host)
    dataB = cuda.to_device(data_B_host)
    dataBuff = cuda.to_device(dataBuff_host)
    randomBuff = cuda.to_device(randomBuff_host)
    testThreadsPerBlockMemory[10, 10](dataA, dataB, dataBuff, randomBuff)
    dataB.copy_to_host(data_B_host)
    cuda.synchronize()
    dataBuff.copy_to_host(dataBuff_host)
    cuda.synchronize()
    print('-----data_A_host------')
    print(data_A_host)
    print('-----data_B_host------')
    print(data_B_host)
    print('-----dataBuff_host------')
    print(dataBuff_host)
    for i in range(data_A_host.shape[0]):
        if data_A_host[i] != data_B_host[i]:
            print(data_A_host[i], data_B_host[i])


if __name__ == "__main__":
    doTest()
I’m now finding it hard to see how the output differs from what you expect - with the new version of the code, what is wrong with the output and what should it be?
It should copy data_A_host to data_B_host; dataBuff is a cache for each thread, and I want to know why the structure of testThreadsPerBlockMemory() makes the threads of the same block go out of sync.
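One way the corruption can arise, shown as a deterministic single-threaded sketch (illustrative only, not the original kernel): dataBuff rows are indexed by threadIdx.x alone, so thread tx of one block and thread tx of another block share the same row, and the second block can overwrite the row before the first block has written it back.

```python
import numpy as np

a = np.arange(100, dtype=np.int32)
b = np.zeros_like(a)
buff_row = np.empty(10, dtype=np.int32)  # dataBuff[tx]: one row shared by thread tx of EVERY block

buff_row[:] = a[0:10]    # block 0, thread tx: load its chunk into row tx
buff_row[:] = a[10:20]   # block 1, same tx, running concurrently: overwrite the row
b[0:10] = buff_row       # block 0 writes back, but the row now holds block 1's data

print(b[:10])  # [10 11 12 13 14 15 16 17 18 19]
```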
Thank you gmarkall. By "out of sync" I mean: I expected all threads of a block to be synchronized once their execution completes, and only then would the threads of the next block launch; but judging by the results, there is no such synchronization within a block.
As the video said, "a block has a fixed number of threads which are guaranteed to be running simultaneously on the same SM". Maybe I misunderstood this as meaning the threads of each block have one synchronization point, because the differing results confuse me.
I now have a solution that doesn't look perfect, but it works (code follows): launch only the threads of a single block, synchronize after each launch, and step through the data one block-sized chunk at a time in a host-side loop.
import numpy as np
from numba import cuda


@cuda.jit()
def testThreadsPerBlockMemory(dataA, dataB, dataBuff, randomBuff, cuda_block):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    bx = bx + cuda_block  # offset the block index by the current host-side step
    if bx < dataA.shape[0] // dataBuff.shape[1]:
        for i in range(dataBuff.shape[1]):
            dataBuff[tx, i] = dataA[bx * dataBuff.shape[1] + i]
            # delay begin
            b = 0
            for j in range(randomBuff[bx * dataBuff.shape[1] + i]):
                b += 1
            # delay end
        for i in range(dataBuff.shape[1]):
            dataB[bx * dataBuff.shape[1] + i] = dataBuff[tx, i]


def doTest():
    data_A_host = np.zeros(100, dtype=np.int32)
    data_B_host = np.array(data_A_host, dtype=np.int32)
    dataBuff_host = np.zeros((10, 10), dtype=np.int32)
    randomBuff_host = np.array(data_A_host, dtype=np.int32)
    for i in range(data_A_host.shape[0]):
        data_A_host[i] = i
    for i in range(randomBuff_host.shape[0]):
        randomBuff_host[i] = np.random.randint(1, 100000000)
    dataA = cuda.to_device(data_A_host)
    dataB = cuda.to_device(data_B_host)
    dataBuff = cuda.to_device(dataBuff_host)
    randomBuff = cuda.to_device(randomBuff_host)
    # launch the threads of a single block and synchronize after each launch,
    # stepping through the data one block-sized chunk at a time
    for i in range(data_A_host.shape[0] // dataBuff_host.shape[1]):
        testThreadsPerBlockMemory[1, dataBuff_host.shape[0]](dataA, dataB, dataBuff, randomBuff, i)
        cuda.synchronize()
    dataB.copy_to_host(data_B_host)
    cuda.synchronize()
    dataBuff.copy_to_host(dataBuff_host)
    cuda.synchronize()
    print('-----data_A_host------')
    print(data_A_host)
    print('-----data_B_host------')
    print(data_B_host)
    print('-----dataBuff_host------')
    print(dataBuff_host)
    for i in range(data_A_host.shape[0]):
        if data_A_host[i] != data_B_host[i]:
            print(data_A_host[i], data_B_host[i])


if __name__ == "__main__":
    doTest()