You use a sub-optimal algorithm. For each group, you go through the entire data. So your algorithm scales linearly with respect to the number of groups, of which you have many.
Is your example representative of your actual use case? If so, I think it is possible to write a function that is much faster than the very generic version of Pandas.
By representative, I meant the size of your data set and the number of groups. You have 100 groups within your 5 million samples. Your algorithm scales unfavorably with a large number of groups and therefore does not perform well compared to pandas. With, say, 5 groups, it’s much less dramatic. Below is a simple yet relatively fast way to process your data, assuming you have enough memory to copy the full at once data.
For the example you gave, it is about 30 times faster than Pandas on my computer. Most of the runtime is spent copying data, so there’s little to gain unless you relax some of the constraints.
You can improve further by, for example, not copying your data, but just passing the appropriate row indexes to my_func for each group. You may also want the groups to be in a Fortran ordered array to facilitate later column-wise processing in my_func. I suspect this may be the case because you are splitting your data by columns.
I hope this can be a good starting point for you.
@numba.njit
def _groupby(data, ids):
n_groups = ids.max() + 1
# Get the size of each group
sizes = np.zeros(n_groups, dtype=np.uint32)
for id in ids:
sizes[id] += 1
# Allocate space for all groups
groups = [np.empty((n, 2), dtype=data.dtype) for n in sizes]
# Copy data to the appropriate groups
ptrs = np.zeros(n_groups, dtype=np.uint32)
for id, (e1, e2) in zip(ids, data):
groups[id][ptrs[id]] = (e1, e2)
ptrs[id] += 1
return groups
@numba.njit
def groupby_apply(data, group_ids, function):
# If possible, make `group_ids` of dtype uint32 because
# they are later used for indexing
groups = _groupby(data, group_ids)
res = 0
for group in groups:
res += function(group[:, 0], group[:, 1])
return res / len(groups)
Thank you for the suggestion! A 30x speed up is massive! What is the key difference? Is it because you have another step that returns the group to be processed?
I’ve just done a quick calculation and the data:group ratio is around 4, which means there will be ~1.2 million groups for 5 million samples.
I iterate over group_ids to get the number of groups
I iterate over group_ids again to get the size of each group
I iterate over data to copy them into the groups
Your function:
iterates over group_ids to get the number of groups
Iterates ngroups times over data to build the groups
That’s a gigantic number of operations in your particular use case. In this case, I would even suggest slightly rewriting my function as it assigns a list of n_groups arrays. You can also store the data in a single array and keep track of where each group is located within that array. For a small number of groups this does not matter, but for 1.2 million groups it is again very worthwhile. For 1e5 samples and 2.5e4 groups (1/4 ratio), it’s more than 1000 times faster than Pandas.
And again you may want to use and F-array.
@numba.njit
def groupby_apply(data, group_ids, function):
n_groups = group_ids.max() + 1
# Get the size of each group
sizes = np.zeros(n_groups, dtype=np.uint32)
for id in group_ids:
sizes[id] += 1
# Compute offset of each group within new array
ptrs = np.zeros(n_groups, dtype=np.uint32)
for i in range(1, n_groups):
ptrs[i] = ptrs[i-1] + sizes[i-1]
# Copy data to the appropriate location
grouped_data = np.empty(data.shape, dtype=data.dtype)
for id, (e1, e2) in zip(group_ids, data):
grouped_data[ptrs[id]] = (e1, e2)
ptrs[id] += 1
# Get the slice of each group and apply the function
res = 0
for end, size in zip(ptrs, sizes):
group_data = grouped_data[end-size:end]
res += function(group_data[:, 0], group_data[:, 1])
return res / n_groups
I meant that you store the row indices for each group and pass them to function along with data. Then inside function you read the relevant rows from data. But I wouldn’t do this. It adds complexity with likely very little benefit because your data has only two columns. And it assumes function does not modify the data.
What you can probably do, however, is make sure that data from the same groups are close together within the data array. Most of the runtime comes from reading data in a non-contiguous fashion. If you can increase the probability that data from the same group is already close to each other, this can provide significant performance benefits. Unfortunately, this depends on your particular case (how you obtain the data) and not on the function itself. Here is an example of what I mean:
import numba as nb
import numpy as np
func = nb.njit(lambda a, b: 1)
n_samples = 5_000_000
n_groups = n_samples // 4
data = np.empty((n_samples, 2))
group_ids_random = np.random.randint(0, n_groups, size=n_samples)
group_ids_sorted = np.sort(group_ids_random)
group_ids_shuffled = group_ids_sorted.copy()
n_splits = 4
n = n_samples // n_splits
for i in range(n_splits):
np.random.shuffle(group_ids_shuffled[i*n:(i+1)*n])
# Warum up
groupby_apply(data, group_ids_random, func)
groupby_apply(data, group_ids_sorted, func)
groupby_apply(data, group_ids_shuffled, func)
# Random distributed ids
# 468 ms ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit groupby_apply(data, group_ids_random, func)
# Data is already grouped
# 54.8 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit groupby_apply(data, group_ids_sorted, func)
# Data is not grouped but rows from the same group are likely to be close together
# 137 ms ± 8.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit groupby_apply(data, group_ids_shuffled, func)