Dear `numba` community members,

For one of my OSS projects - `PyFixest` - I have implemented an alternating projections algorithm via `numba`. Unfortunately, I cannot seem to beat the (fully vectorized) numpy reference implementation in the `PyHDFE` package. I'd be very grateful for any hints on how to further optimize my code.

First some context: `PyFixest` implements linear regression estimators for models with high-dimensional fixed effects. To circumvent the problem of having to invert a very large design matrix of dimension N x k, it instead demeans both the design matrix and the dependent variable prior to fitting the model on the demeaned data via ordinary least squares (based on the Frisch-Waugh-Lovell theorem). The demeaning is done via an "alternating projections" algorithm, which I have attached below:

```
import numpy as np
from numba import njit, prange


@njit(parallel=True, cache=False, fastmath=False)
def demean(cx, flist, weights, tol=1e-10, maxiter=2000):
    '''
    Demean a matrix cx by the fixed effects in flist.
    Observations are weighted by weights. Convergence is declared when
    the sum of absolute differences between successive iterations falls
    below tol.
    Args:
        cx: Matrix to be demeaned
        flist: Matrix of fixed effects
        weights: Weights for the observations
        tol: Convergence tolerance. 1e-10 by default.
        maxiter: Maximum number of iterations. 2000 by default.
    Returns:
        res: Demeaned matrix of dimension cx.shape
    '''
    N = cx.shape[0]
    fixef_vars = flist.shape[1]
    K = cx.shape[1]
    res = np.zeros((N, K))

    # loop over all variables to demean, in parallel
    for k in prange(K):
        cxk = cx[:, k].copy()  # copy so the input matrix is not modified in place
        oldxk = cxk - 1
        converged = False
        for _ in range(maxiter):
            # project out each fixed effect in turn
            for i in range(fixef_vars):
                fmat = flist[:, i]
                weighted_ave = _ave3(cxk, fmat, weights)
                cxk -= weighted_ave
            if np.sum(np.abs(cxk - oldxk)) < tol:
                converged = True
                break
            # update
            oldxk = cxk.copy()
        res[:, k] = cxk

    return res


@njit
def _ave3(x, f, w):
    '''Group-wise weighted mean of x over the groups in f, broadcast back to length N.'''
    N = len(x)
    wx_dict = {}
    w_dict = {}

    # Compute weighted sums per group using a dictionary
    for i in range(N):
        j = f[i]
        if j in wx_dict:
            wx_dict[j] += w[i] * x[i]
        else:
            wx_dict[j] = w[i] * x[i]
        if j in w_dict:
            w_dict[j] += w[i]
        else:
            w_dict[j] = w[i]

    # Broadcast the group sums back to arrays of length N
    wx = np.zeros_like(f, dtype=x.dtype)
    wsum = np.zeros_like(f, dtype=w.dtype)
    for i in range(N):
        j = f[i]
        wx[i] = wx_dict[j]
        wsum[i] = w_dict[j]

    # Compute the weighted average per observation
    wxw_long = wx / wsum
    return wxw_long
```
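
For readers less familiar with the estimator, here is a minimal sketch of the Frisch-Waugh-Lovell two-step that `demean` feeds into: demean both the dependent variable and the design matrix, then run OLS on the demeaned data. The variable names and the use of `np.linalg.lstsq` below are purely illustrative, not PyFixest's actual API.

```
# Illustrative FWL two-step using the demean() function defined above.
# All names here (y, X, fe, beta) are hypothetical and only for this sketch.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
X = rng.normal(size=(N, 2))             # regressors
fe = rng.integers(0, 50, size=(N, 2))   # two fixed-effect factors
y = X @ np.array([1.0, -0.5]) + rng.normal(size=N)
weights = np.ones(N)

# Step 1: demean y and X by the fixed effects
X_dm = demean(X, fe, weights)
y_dm = demean(y.reshape(-1, 1), fe, weights)

# Step 2: OLS on the demeaned data yields the slope coefficients
beta, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
```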

To conclude, here is a benchmark against `PyHDFE`:

```
%load_ext autoreload
%autoreload 2

import time

import numpy as np
import pyhdfe

np.random.seed(1238)

N = 1_000_000
x = np.random.normal(0, 1, 4 * N).reshape((N, 4))
f1 = np.random.choice(list(range(1000)), N).reshape((N, 1))
f2 = np.random.choice(list(range(1000)), N).reshape((N, 1))
flist = np.concatenate((f1, f2), axis=1)
weights = np.ones(N)

# PyHDFE reference implementation
start_time = time.time()
algorithm = pyhdfe.create(flist)
res_pyhdfe = algorithm.residualize(x)
end_time = time.time()
print(end_time - start_time)
# 2.168398380279541

# numba implementation
start_time = time.time()
algorithm = pyhdfe.create(flist)
res_pyfixest = demean(x, flist, weights, tol=1e-08)
end_time = time.time()
print(end_time - start_time)
# 3.359426736831665

np.allclose(res_pyhdfe, res_pyfixest)
# True
```

Any tips on how I could further optimize my implementation?

Best, Alex