@kartiksubbarao it looks like you are trying to accelerate finding unique values in a Pandas Series? Bodo uses Numba to accelerate Pandas and may support your use case automatically. I added Bodo to your code to demonstrate. Also, placing the timers outside the call measures compilation time as well, so I moved the timers inside the functions:
```python
import random
import string
import pandas as pd
import numba
from time import time
import bodo

def seen_name(names):
    start = time()
    seen = {}
    for i in range(len(names)):
        if names[i] not in seen:
            seen[names[i]] = True
    print(f'\nseen_name => {time() - start:.4} seconds\n')

ntypes = numba.core.types
ndict = numba.typed.Dict

@numba.njit
def seen_name_numba(names):
    start = time()
    seen = ndict.empty(key_type=ntypes.unicode_type,
                       value_type=ntypes.boolean)
    for i in range(len(names)):
        if names[i] not in seen:
            seen[names[i]] = True
    print(f'\nseen_name_numba => {time() - start} seconds')

@bodo.jit
def seen_name_bodo(names):
    start = time()
    res = names.nunique()
    print(f'\nseen_name_bodo Series => {time() - start} seconds')
    return res

allnames = []
# Generate a long list of random 10-letter strings
for i in range(1000000):
    allnames.append(''.join(
        random.choices(string.ascii_letters, k=10)))

# The real-world scenario stores the data in a dataframe
df = pd.DataFrame(allnames, columns=['name'])
seen_name(df.name.tolist())
seen_name_numba(df.name.tolist())
seen_name_numba(numba.typed.List(df.name))
seen_name_bodo(df.name)
```
Here are results on my MacBook Pro (2019, 2.3 GHz Intel Core i9) laptop:
```
seen_name => 0.2311 seconds
seen_name_numba => 0.270217 seconds
seen_name_numba => 0.344879 seconds
seen_name_bodo Series => 0.561972 seconds
```
Using a regular Series in Bodo seems to have some overhead (which we need to investigate), but it lets you parallelize your code and scale linearly with more cores and data.
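For reference, plain pandas can produce the same distinct count without any JIT machinery; a minimal single-core baseline sketch (not from the original benchmark) that could be timed alongside the variants above:

```python
import random
import string
import pandas as pd
from time import time

# Same workload: one million random 10-letter strings in a Series
names = pd.Series([''.join(random.choices(string.ascii_letters, k=10))
                   for _ in range(1000000)])

start = time()
n = names.nunique()  # pandas' built-in distinct-value count
print(f'Series.nunique => {time() - start:.4} seconds ({n} unique)')
```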