Your code performs poorly for very small arrays because of the parallelism which causes overhead.
I assume the sawtooth-like pattern you observe comes from the fact that your implementation is broken for cases where “log-two of length” is even. Run for example this:
import numpy as np
import matplotlib.pyplot as plt
def fft(T):
...
for i in range(5, 18):
x = np.linspace(0, 10*np.pi, num=2**i)
y = np.sin(x)
plt.figure()
plt.title(str(i))
plt.plot(fft(y))
plt.plot(np.fft.rfft(y), ls="--")
plt.xlim(0, 12)
plt.show()
PS: There is support for scipy.fft.rfft
in Numba compiled code: Rocket-FFT — a Numba extension supporting numpy.fft and scipy.fft
However, it may be a bit slower for 1D transforms since multi-threading is only supported over multiple dimensions.