How can I correctly use (or improve) numba for pandas apply?

Hello,

I’m new to numba and am trying to use it to optimise a pandas groupby apply, but my numba version turns out to be significantly slower.

Here’s my code:

from time import time

import numba
import numpy as np
import pandas as pd

n_examples = 5_000_000

dummy_dataset = pd.DataFrame({
    'dummy_grouping_attribute_1': np.random.randint(0, 100, n_examples),
    'dummy_grouping_attribute_2': np.random.randint(0, 100, n_examples).astype(str),
    'col_a': np.random.rand(n_examples),
    'col_b': np.random.randint(0, 2, n_examples),
})

group_keys = ['dummy_grouping_attribute_1', 'dummy_grouping_attribute_2']


def my_func(col_a, col_b):
    return 1


### Pandas apply: 1.4s
tic = time()
dummy_dataset.groupby(group_keys).apply(lambda x: my_func(x['col_b'], x['col_a'])).mean()
print('apply', time() - tic) # 1.4s

### Numba: 7s
group_ids = (
    dummy_dataset
    .groupby(group_keys)
    .ngroup()
    .to_numpy(dtype='int64')
)

data = dummy_dataset[['col_a', 'col_b']].to_numpy(dtype='float64')  # value columns, not the group keys


@numba.njit(parallel=True)
def groupby_apply_njit(data, group_ids, function):
    ngroups = group_ids.max() + 1   # Number of groups
    res = 0

    for group_id in numba.prange(ngroups):
        group_data = data[group_ids == group_id]

        res += function(group_data[:, 0], group_data[:, 1])

    return res / ngroups


numba_func = numba.njit(my_func)

groupby_apply_njit(data, group_ids, numba_func)  # warm-up call to trigger compilation

tic = time()
groupby_apply_njit(data, group_ids, numba_func)
print('numba', time() - tic) # 7s

Any directions would be much appreciated! Thanks.

Hi @q999

You are using a sub-optimal algorithm: for each group, you scan the entire data array, so the runtime scales linearly with the number of groups, and you have many of them.
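
To put rough numbers on it, a back-of-the-envelope with the values from your snippet (illustrative only, assuming essentially all key combinations actually occur):

n_examples = 5_000_000
ngroups = 100 * 100   # two key columns with values 0..99 -> ~10,000 combinations

# The prange loop evaluates `group_ids == group_id` over the full array once
# per group, so the selection alone costs about:
mask_comparisons = n_examples * ngroups   # ~5e10 comparisons

# A grouping that makes only a few linear passes over the data needs about:
linear_work = 3 * n_examples              # ~1.5e7 element visits

print(mask_comparisons / linear_work)     # ~3300x more raw work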

Is your example representative of your actual use case? If so, I think it is possible to write a function that is much faster than the very generic version of Pandas.

Thank you. This is representative of my use case, in the sense that I want to apply a custom function to each group in the data.

What would be a more optimal solution for this use case?

By representative, I meant the size of your data set and the number of groups. With two key columns that each take 100 values, you have roughly 10,000 groups within your 5 million samples. Your algorithm scales unfavorably with a large number of groups and therefore does not perform well compared to pandas. With, say, 5 groups, it would be much less dramatic. Below is a simple yet relatively fast way to process your data, assuming you have enough memory to copy the full data at once.
For the example you gave, it is about 30 times faster than Pandas on my computer. Most of the runtime is spent copying data, so there’s little to gain unless you relax some of the constraints.
You can improve further by, for example, not copying your data, but just passing the appropriate row indexes to my_func for each group. You may also want the groups to be in a Fortran-ordered array to facilitate later column-wise processing in my_func; I suspect that applies here because you are splitting your data by columns.

I hope this can be a good starting point for you.

@numba.njit
def _groupby(data, ids):
    n_groups = ids.max() + 1  
    
    # Get the size of each group
    sizes = np.zeros(n_groups, dtype=np.uint32)
    for id in ids:
        sizes[id] += 1
                
    # Allocate space for all groups 
    groups = [np.empty((n, 2), dtype=data.dtype) for n in sizes]
    
    # Copy data to the appropriate groups
    ptrs = np.zeros(n_groups, dtype=np.uint32)
    for id, (e1, e2) in zip(ids, data):
        groups[id][ptrs[id]] = (e1, e2)
        ptrs[id] += 1

    return groups

@numba.njit
def groupby_apply(data, group_ids, function):
    # If possible, make `group_ids` of dtype uint32 because 
    # they are later used for indexing
    groups = _groupby(data, group_ids)
    
    res = 0
    for group in groups:
        res += function(group[:, 0], group[:, 1])
    
    return res / len(groups)
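
For completeness, here is roughly how this would be wired up with the data from your first post (a sketch reusing your imports and DataFrame; note the float array should hold the value columns col_a/col_b, not the group keys):

group_ids = (
    dummy_dataset
    .groupby(group_keys)
    .ngroup()
    .to_numpy(dtype='int64')  # uint32 would be slightly better for indexing, as noted above
)

# The value columns, as a plain 2-column float array
values = dummy_dataset[['col_a', 'col_b']].to_numpy(dtype='float64')

numba_func = numba.njit(my_func)

groupby_apply(values, group_ids, numba_func)  # first call compiles

tic = time()
groupby_apply(values, group_ids, numba_func)
print('numba grouped', time() - tic)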

Thank you for the suggestion! A 30x speed up is massive! What is the key difference? Is it because you have another step that returns the group to be processed?

I’ve just done a quick calculation and the data:group ratio is around 4, which means there will be ~1.2 million groups for 5 million samples.

It is so much faster because:

  • I iterate over group_ids to get the number of groups
  • I iterate over group_ids again to get the size of each group
  • I iterate over data to copy them into the groups

Your function:

  • iterates over group_ids to get the number of groups
  • iterates ngroups times over data to build the groups

That’s a gigantic number of operations in your particular use case: roughly 5e6 samples × 1.2e6 groups ≈ 6e12 mask comparisons, versus a few linear passes over the data. In this case, I would even suggest slightly rewriting my function, since it allocates a list of n_groups separate arrays. You can instead store the data in a single array and keep track of where each group is located within it. For a small number of groups this does not matter, but for 1.2 million groups it is again very worthwhile. For 1e5 samples and 2.5e4 groups (the same 4-samples-per-group ratio), it’s more than 1000 times faster than Pandas.
And again, you may want to use an F-ordered array.

@numba.njit
def groupby_apply(data, group_ids, function):
    n_groups = group_ids.max() + 1  
    
    # Get the size of each group
    sizes = np.zeros(n_groups, dtype=np.uint32)
    for id in group_ids:
        sizes[id] += 1
    
    # Compute offset of each group within new array
    ptrs = np.zeros(n_groups, dtype=np.uint32)
    for i in range(1, n_groups):
        ptrs[i] =  ptrs[i-1] + sizes[i-1]
                
    # Copy data to the appropriate location
    grouped_data = np.empty(data.shape, dtype=data.dtype)
    for id, (e1, e2) in zip(group_ids, data):
        grouped_data[ptrs[id]] = (e1, e2)
        ptrs[id] += 1
    
    # Get the slice of each group and apply the function
    res = 0
    for end, size in zip(ptrs, sizes):
        group_data = grouped_data[end-size:end]
        res += function(group_data[:, 0], group_data[:, 1])
    
    return res / n_groups
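
As for the F-ordering: the transpose of a C-ordered (n_cols, n_rows) buffer is an F-ordered (n_rows, n_cols) array, which keeps each column slice contiguous. A minimal, untested sketch in plain NumPy (the 8 and 2 are arbitrary; the same one-liner could replace the np.empty allocation inside groupby_apply):

import numpy as np

# Allocate a (2, 8) C-ordered buffer and view it transposed: the result is an
# F-ordered (8, 2) array, so grouped_data[:, j] is a contiguous column.
n_rows, n_cols = 8, 2
grouped_data = np.empty((n_cols, n_rows), dtype=np.float64).T

print(grouped_data.shape)                   # (8, 2)
print(grouped_data.flags['F_CONTIGUOUS'])   # True

Whether that actually pays off depends on what my_func does with the columns.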

This is amazing, thank you so much!

@sschaer I have been thinking about your suggestion over the past few days:

You can improve further by, for example, not copying your data, but just passing the appropriate row indexes to my_func for each group.

and

You can also store the data in a single array and keep track of where each group is located within that array

I thought this is what this piece of code was doing: finding the relevant rows to pass to the calculation.

    for group_id in numba.prange(ngroups):
        group_data = data[group_ids == group_id]
        res += function(group_data[:, 0], group_data[:, 1])

Is there a more efficient way to do this?

Thank you

I meant that you store the row indices for each group and pass them to function along with data. Then inside function you read the relevant rows from data. But I wouldn’t do this. It adds complexity with likely very little benefit because your data has only two columns. And it assumes function does not modify the data.
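
Just so it is clear what I mean, here is a rough, untested sketch. Note that function would need a different signature here (it takes the full data plus the row indices of one group), so it is not a drop-in replacement for my_func:

@numba.njit
def groupby_apply_by_index(data, group_ids, function):
    # Variant that passes row indices instead of copying rows.
    # `function` must accept (data, row_indices) and read the rows it needs,
    # e.g. something like numba.njit(lambda data, rows: 1).
    n_groups = group_ids.max() + 1

    # Size of each group
    sizes = np.zeros(n_groups, dtype=np.uint32)
    for id in group_ids:
        sizes[id] += 1

    # Offset of each group within the index array
    ptrs = np.zeros(n_groups, dtype=np.uint32)
    for i in range(1, n_groups):
        ptrs[i] = ptrs[i - 1] + sizes[i - 1]

    # Row indices, grouped contiguously (counting sort on indices)
    indices = np.empty(group_ids.size, dtype=np.uint32)
    for row, id in enumerate(group_ids):
        indices[ptrs[id]] = row
        ptrs[id] += 1

    # Hand each group's indices to the function instead of copied data
    res = 0
    for end, size in zip(ptrs, sizes):
        res += function(data, indices[end - size:end])

    return res / n_groups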

What you can probably do, however, is make sure that data from the same groups are close together within the data array. Most of the runtime comes from reading data in a non-contiguous fashion. If you can increase the probability that data from the same group is already close to each other, this can provide significant performance benefits. Unfortunately, this depends on your particular case (how you obtain the data) and not on the function itself. Here is an example of what I mean:

import numba as nb 
import numpy as np 

func = nb.njit(lambda a, b: 1)

n_samples = 5_000_000
n_groups = n_samples // 4 

data = np.empty((n_samples, 2))
group_ids_random = np.random.randint(0, n_groups, size=n_samples)
group_ids_sorted = np.sort(group_ids_random)
group_ids_shuffled = group_ids_sorted.copy()
n_splits = 4
n = n_samples // n_splits
for i in range(n_splits):  
    np.random.shuffle(group_ids_shuffled[i*n:(i+1)*n])

# Warm up (groupby_apply is the single-array version from my previous post)
groupby_apply(data, group_ids_random, func)
groupby_apply(data, group_ids_sorted, func)
groupby_apply(data, group_ids_shuffled, func)

# Randomly distributed ids
# 468 ms ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit groupby_apply(data, group_ids_random, func)

# Data is already grouped
# 54.8 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit groupby_apply(data, group_ids_sorted, func)

# Data is not grouped but rows from the same group are likely to be close together 
# 137 ms ± 8.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit groupby_apply(data, group_ids_shuffled, func)

Really clear explanation, thank you!