I’ve been researching CT reconstruction recently, which comes down to solving a linear system Ax = b. The weight matrix A needs to be computed first, and because A is very large, it has to be computed on the GPU.

I use two different methods to compute A, and their results are identical when run in a CPU for loop. But when I change the for loop to a parallel kernel with numba.cuda, the two GPU results differ from each other, and neither of them matches the CPU result.

So I’m here for your help: how can I rewrite my code so the GPU result is correct?

I have used the following operators and functions in the CUDA kernel:

```
math.ceil()
int()
<
!=
* / + -
**
math.cos(), math.sin()
max(), min()
```

In addition, I have two questions:

- Why does `math.ceil()` return an int on the CPU but a float in numba.cuda?
- The CPU computation also has floating-point precision problems (see below), so why does the CPU still get the correct result in the end?

```
-1.3 - 0.1 = -1.4000000000000001
1.3 + 0.1 = 1.4000000000000001
math.floor(-2.7755575615628914e-14) = -1
```
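These CPU-side behaviors are reproducible in plain Python, which is what I mean by the two questions above:

```python
import math

# On the CPU (CPython), math.ceil returns an int for a float argument.
print(type(math.ceil(1.2)))  # <class 'int'>

# Standard IEEE-754 double artifacts: 0.1 and 1.3 are not exactly
# representable in binary, so the sums carry a tiny error.
print(1.3 + 0.1)    # 1.4000000000000001
print(-1.3 - 0.1)   # -1.4000000000000001

# A value that should be 0 but is a tiny negative number floors to -1.
print(math.floor(-2.7755575615628914e-14))  # -1
```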

Should I use a round function on the GPU to truncate the result?
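By "use round" I mean something like the helper below, which snaps values within a tolerance of an integer before flooring. This is only a sketch of the idea; `safe_floor` is a name I made up, and the `eps` value is an assumption I have not tuned for my data:

```python
import math

EPS = 1e-9  # assumed tolerance, not validated against my actual data

def safe_floor(x, eps=EPS):
    """Floor after snapping values that sit within eps of an integer,
    so tiny negative noise like -2.8e-14 floors to 0 instead of -1."""
    nearest = round(x)
    if abs(x - nearest) < eps:
        return nearest
    return math.floor(x)

print(safe_floor(-2.7755575615628914e-14))  # 0, instead of floor's -1
print(safe_floor(2.3))                      # 2, same as plain floor
```

Is this kind of snapping the right fix inside a CUDA kernel, or does it just hide the real precision problem?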