I am benchmarking accelerators for Python. In the example below, TensorFlow runs in about 56% of the Numba runtime:

```
import numpy as np
import tensorflow as tf
from numba import njit, prange
@tf.function
def compute_tf(m, n):
    # Use tf.range(m) / tf.range(n) so the output has shape (m, n), matching
    # the Numba version; tf.range(0, m - 1, 1) would drop the last element.
    x1 = tf.range(m) ** 2
    x2 = tf.range(n) ** 2
    return x1[:, None] + x2[None, :]

compute_tf(tf.constant(1), tf.constant(1))  # warm-up: trace and compile
m = 50000
n = 10000
%timeit compute_tf(m, n)
```

557 ms ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
@njit(parallel=True)
def compute_numba(m, n):
    x = np.empty((m, n))
    for i in prange(m):  # Numba parallelizes only the outermost prange
        for j in prange(n):  # this one runs as a plain sequential range
            x[i, j] = i**2 + j**2
    return x

compute_numba(1, 1)  # warm-up: JIT compile
%timeit compute_numba(m, n)
```

995 ms ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a very simple computation, so I don't really see why the TF version should run any faster. Do you have any idea how I can get the Numba version on par with TF?