@kartiksubbarao it looks like you are trying to accelerate finding unique values in a Pandas Series? Bodo uses Numba to accelerate Pandas and may support your use case automatically. I added Bodo to your code to demonstrate. Also, placing the timers outside the call measures compilation time as well, so I moved the timers inside the functions:
```python
import random
import string
import pandas as pd
import numba
from time import time
import bodo

def seen_name(names):
    start = time()
    seen = {}
    for i in range(len(names)):
        if names[i] not in seen:
            seen[names[i]] = True
    print(f'\nseen_name => {time() - start:.4} seconds\n')

ntypes = numba.core.types
ndict = numba.typed.Dict

@numba.njit
def seen_name_numba(names):
    start = time()
    seen = ndict.empty(key_type=ntypes.unicode_type,
                       value_type=ntypes.boolean)
    for i in range(len(names)):
        if names[i] not in seen:
            seen[names[i]] = True
    print(f'\nseen_name_numba => {time() - start} seconds')

@bodo.jit
def seen_name_bodo(names):
    start = time()
    res = names.nunique()
    print(f'\nseen_name_bodo Series => {time() - start} seconds')
    return res

allnames = []
# Generate a long list of random 10-letter strings
for i in range(1000000):
    allnames.append(''.join(
        random.choices(string.ascii_letters, k=10)))

# The real-world scenario stores the data in a dataframe
df = pd.DataFrame(allnames, columns=['name'])
seen_name(df.name.tolist())
seen_name_numba(df.name.tolist())
seen_name_numba(numba.typed.List(df.name))
seen_name_bodo(df.name)
```
Here are results on my MacBook Pro (2019, 2.3 GHz Intel Core i9) laptop:
```
seen_name => 0.2311 seconds
seen_name_numba => 0.270217 seconds
seen_name_numba => 0.344879 seconds
seen_name_bodo Series => 0.561972 seconds
```
Using a regular Series in Bodo seems to have some overhead (which we need to investigate), but it lets you parallelize your code and scale linearly with more cores and data.
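For reference, plain pandas can produce the same distinct count without any JIT machinery; a minimal single-core baseline sketch (not from the original benchmark) that could be timed alongside the variants above:

```python
import random
import string
import pandas as pd
from time import time

# Same workload: one million random 10-letter strings in a Series
names = pd.Series([''.join(random.choices(string.ascii_letters, k=10))
                   for _ in range(1000000)])

start = time()
n = names.nunique()  # pandas' built-in distinct-value count
print(f'Series.nunique => {time() - start:.4} seconds ({n} unique)')
```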