Why does numba.cuda give different results from a CPU for loop?

I’ve been working on CT reconstruction recently, which amounts to solving a linear system Ax=b. The weight matrix A has to be computed first, and because A is very large, it needs to be computed on the GPU.

I use two different methods to compute A, and their results are identical when run as a CPU for loop. But when I change the for loop to a parallel numba.cuda kernel, the two GPU results differ from each other, and neither matches the CPU result.

So I’m asking for help: how can I rewrite my code so that the GPU result is correct?

I use the following operators and functions in the CUDA kernel:

* / + -
math.cos(), math.sin()
max(), min()

In addition, I have two questions:

  1. Why is the result of math.ceil() an int on the CPU, but a float in numba.cuda?
  2. The CPU computation has precision problems (see below), so why does the CPU still get the correct result in the end?
-1.3 - 0.1 = -1.4000000000000001
1.3 + 0.1 = 1.4000000000000001
math.floor(-2.7755575615628914e-14) = -1

Should I use the round function on the GPU to truncate the result?
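To make the two questions concrete, here is a minimal CPU-only sketch in plain Python. The helper names `snapped_floor` and `snapped_ceil` and the tolerance `EPS` are my own illustrative assumptions, not part of numba or of the original code. The idea is to treat a value that lies within a small tolerance of an integer as that integer before flooring, and to wrap the result in `int()` so the return type is the same regardless of what `math.ceil()` or `math.floor()` returns. I have not checked whether these helpers compile unchanged inside a numba.cuda kernel.

```python
import math

EPS = 1e-9  # hypothetical tolerance; pick it from the scale of your geometry


def snapped_floor(x, eps=EPS):
    """Floor x, treating values within eps of an integer as that integer."""
    nearest = round(x)
    if abs(x - nearest) < eps:
        return int(nearest)
    return int(math.floor(x))


def snapped_ceil(x, eps=EPS):
    """Ceil x, treating values within eps of an integer as that integer."""
    nearest = round(x)
    if abs(x - nearest) < eps:
        return int(nearest)
    return int(math.ceil(x))


tiny = -2.7755575615628914e-14      # rounding noise that should have been 0
print(math.floor(tiny))             # -1
print(snapped_floor(tiny))          # 0
print(snapped_ceil(2.0000000000001))  # 2
```

Whether a tolerance like this is appropriate depends on how large legitimate fractional offsets can be in the projection geometry, so treat the value of `EPS` as something to tune rather than a recommendation.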

It was my fault. I forgot to use atomic.add instead of + in the back projection.
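For anyone hitting the same problem: in back projection many threads add into the same output element, and a plain `+` is a read-modify-write that other threads can interleave with, so contributions get lost. The sketch below is a pure-Python model of that effect, not GPU code; the function names and the `contribs` data are made up for illustration. It assumes the worst case where every "thread" reads the old value before any of them writes. In the real kernel the fix is to replace `image[p] += w` with `cuda.atomic.add(image, p, w)` (`image`, `p`, and `w` are hypothetical names here).

```python
def backproject_plus(contribs, n_pixels):
    """Model of `image[p] += w` with all threads in lockstep and no atomics."""
    image = [0.0] * n_pixels
    # Phase 1: every "thread" reads the current value of its target pixel.
    reads = [image[p] for p, _ in contribs]
    # Phase 2: every "thread" writes back its own stale read plus its weight,
    # so writes to the same pixel overwrite each other.
    for (p, w), old in zip(contribs, reads):
        image[p] = old + w
    return image


def backproject_atomic(contribs, n_pixels):
    """Model of an atomic add: each read-modify-write finishes before the next."""
    image = [0.0] * n_pixels
    for p, w in contribs:
        image[p] += w
    return image


# (pixel index, weight) pairs; three "threads" hit pixel 0.
contribs = [(0, 1.0), (0, 2.0), (1, 0.5), (0, 4.0)]
print(backproject_plus(contribs, 2))    # [4.0, 0.5]
print(backproject_atomic(contribs, 2))  # [7.0, 0.5]
```

On a real GPU the outcome of the non-atomic version depends on thread scheduling, which would explain why the two GPU results differed from each other as well as from the CPU.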