Speeding up random array generation compared to numpy

Hi all, I'm trying to use Numba to speed up random array generation. I'm generating a random dataset of labels: I need an array filled mostly with NaN values, with a specific percentage of entries randomly set to 0 or 1. Here are my implementations in NumPy and Numba:

from numba import jit
import numpy as np

@jit
def numba_random_label_gen(
    seed: int, size: int, num_labels_per_batch: int
) -> tuple[np.ndarray, np.ndarray]:
    np.random.seed(seed)
    # Create a 1D NumPy array filled with NaN
    label_data = np.full(size, np.nan, dtype=np.float64)
    # Choose random indexes without replacement
    label_indexes = np.random.choice(size, num_labels_per_batch, replace=False)
    # Create labels for those indexes, e.g. assign 0/1 randomly
    label_data[label_indexes] = np.random.randint(0, 1, num_labels_per_batch)

    return label_data, label_indexes

def np_random_label_gen(
    rng: np.random.Generator, size: int, num_labels_per_batch: int
) -> tuple[np.ndarray, np.ndarray]:
    # Create a 1D Numpy array with all nan
    label_data = np.full(size, np.nan, dtype=np.float64)
    # Choose random indexes
    label_indexes = rng.choice(size, num_labels_per_batch, replace=False)
    # Create labels for those indexes, e.g. assign 0/1 randomly
    label_data[label_indexes] = rng.integers(2, size=num_labels_per_batch)

    return label_data, label_indexes

When running the code on large inputs, and also when benchmarking with timeit (after calling the Numba function once first so it gets compiled), my Numba implementation is roughly 4x slower than the pure NumPy one:

import timeit

# Warm-up call so the timing below does not include JIT compilation
numba_labels = numba_random_label_gen(42, size=10000, num_labels_per_batch=100)

timeit.timeit(
    "numba_random_label_gen(42, size=10000, num_labels_per_batch=100)",
    number=100000,
    globals=globals(),
)
# 11.433971279999241 sec

np_setup = """
import numpy as np
rng = np.random.default_rng(seed=42)
"""

timeit.timeit(
    "np_random_label_gen(rng, size=10000, num_labels_per_batch=100)",
    number=100000,
    setup=np_setup,
    globals=globals(),
)
# 2.6677552140317857 sec

Any ideas on how I could speed up the Numba implementation? Could the difference come from using np.random instead of np.random.Generator? My versions are:

>>> np.__version__
'1.26.3'
>>> numba.__version__
'0.59.0'

Hi @tvas

Numba is slower because it uses a different algorithm. Numba's implementation is based on a permutation of the candidate values, which is very slow if only a small subset is selected (e.g. 1%, as in your case). I assume NumPy does the same if you use np.random.choice(size, num_labels_per_batch, replace=False). Strangely, it seems to switch to a different algorithm if you use rng.choice(size, num_labels_per_batch, replace=False).
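To make that concrete, here is a minimal sketch of a permutation-based draw (an illustration of the general shape of such an algorithm, not Numba's actual source). The shuffle touches all n candidates no matter how few you keep:

import numpy as np

def permutation_choice(n, k):
    # Shuffle all n candidates, then keep the first k.
    # This is O(n) work even when k << n: picking 100 indexes
    # out of 10,000 still moves all 10,000 elements around.
    candidates = np.arange(n)
    np.random.shuffle(candidates)
    return candidates[:k]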

The best implementation strongly depends on the requirements. Here are two solutions that work well for the example you gave. The first follows an algorithm from the linked sansreplace repository:

import numba as nb
import numpy as np

@nb.njit
def choice_without_replacement_sorted(n, k):
    # Algorithm from https://github.com/ciphergoth/sansreplace
    # Draws k distinct values in O(k log k), independent of n.
    t = n - k + 1
    d = np.empty(k, dtype=np.int64)
    for i in range(k):
        r = np.random.randint(t + i)
        if r < t:
            d[i] = r
        else:
            d[i] = d[r - t]
    d.sort()
    for i in range(k):
        d[i] += i
    return d


The second is simple rejection sampling, which is also fast as long as k is much smaller than n:

@nb.njit
def choice_without_replacement_rejection(n, k):
    # Draw random indexes until k distinct ones are found.
    # Collisions are rare when k << n, so few draws are wasted.
    slots = np.zeros(n, dtype=np.byte)
    choices = np.empty(k, dtype=np.int64)
    count = 0
    while count < k:
        r = np.random.randint(n)
        if not slots[r]:
            slots[r] = 1
            choices[count] = r
            count += 1
    return choices
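For reference, here is a sketch of how one of these could be dropped into your original generator (numba_random_label_gen_fast is a made-up name, and I fill the labels in a plain loop with randint(0, 2), which draws 0 or 1 given the exclusive upper bound noted in the PS below):

@nb.njit
def numba_random_label_gen_fast(seed, size, num_labels_per_batch):
    np.random.seed(seed)
    label_data = np.full(size, np.nan, dtype=np.float64)
    # Sample k distinct indexes without the O(n) permutation
    label_indexes = choice_without_replacement_rejection(size, num_labels_per_batch)
    for i in range(num_labels_per_batch):
        # Exclusive upper bound: randint(0, 2) yields 0 or 1
        label_data[label_indexes[i]] = np.random.randint(0, 2)
    return label_data, label_indexes

labels, idx = numba_random_label_gen_fast(42, 10000, 100)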

PS: Check that np.random.randint(0, 1, num_labels_per_batch) is actually doing what you want. For me, the docstring says that the upper bound is inclusive, but it is not, so that call can only ever produce 0 and your labels will never be 1.
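A quick way to verify this in plain NumPy (where the upper bound is documented as exclusive):

import numpy as np

print(np.unique(np.random.randint(0, 1, 1000)))  # [0]   -- only zeros
print(np.unique(np.random.randint(0, 2, 1000)))  # [0 1] -- the intended 0/1 labels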

Thanks @sschaer, this closed most of the gap between the two implementations! I'll keep situations like this in mind when trying out Numba use cases.