Numba Dict function not compiling


I am new to Numba,

I am trying to create a dictionary with uint as keys and a list of uint as values and the size (number of keys) of the dictionary will be a few hundred million or a few 10s of billions. So to gain speed and efficiency over the python dictionary I came across Numba and would like to use it.

Python = 3.7.7
Numba = 0.53.1

from numba import types, typed, njit

def numba_dict(list_of_lists):
key_dtype = types.uint64
value_dtype = types.uint8
bf_sparse_dict = typed.Dict.empty(
    key_type=key_dtype, value_type=types.ListType(value_dtype)

for idx, inner_list in enumerate(list_of_lists):
    for key in inner_list:
        if key not in bf_sparse_dict:
            bf_sparse_dict[key] = typed.List.empty_list(value_dtype)

return bf_sparse_dict

When I try to run this function from a python file I get the following error

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'numba.core.types.containers.ListType'>) found for 

 >>> ListType(class(uint8))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'ListType': File: numba/core/ Line 39.
    With argument(s): '(class(uint8))':
   Rejected as the implementation raised a specific error:
     TypingError: List() argument must be iterable
  raised from /home/ssk/anaconda3/envs/zarrtest/lib/python3.7/site-packages/numba/typed/

During: resolving callee type: Function(<class 'numba.core.types.containers.ListType'>)
During: typing of call at (270)

File "", line 270:
def numba_dict(bloomfilters_list):
    <source elided>
    bf_sparse_dict = typed.Dict.empty(
        key_type=types.uint64, value_type=types.ListType(types.uint8)

But the same piece of code works completely fine on an interpreter from command line (i.e without compilation with @njit decorator)

Any pointer to what I am doing wrong would be really helpful.

Thank you!

Hi @sanjaysrikakulam and welcome to the board! :slight_smile:

1st of all, regarding your bug: I think this is caused by your use of types.ListType inside of an njitted function. It may be a bit counter intuitive, but not all of the typing stuff that numba ships, can itself be used in compiled functions, as far as I know. This however only concerns the type definition itself. The resulting type is of course intended to be used with numba. A good way around this is to simply push the type definitions outside of your function. Since your types are truly constant and not inferred from the incoming data, that should not be an issue.

2nd: It seems to me, that the slight modification I proposed then raises a different bug, related to the fact that you are putting a python list of lists into the function (I hope I am correct on this). These are so called reflected lists to numba, since it has to report all mutations that happened inside of compiled code back to the Python interpreter. (You are not mutating the incoming list here as far as I can tell, but due to the nature of and promises around reflected lists numba probably still enforces the same criteria - maybe someone can fact-check me on this). It seems that numba cannot deal with reflected lists of lists at this point (at least even the most simple scenario possible - immediately returning list_of_lists - failed for me).
If at all possible you might be better of doing the conversion from Python into numba datatypes outside the njitted function. I also received a deprecation warning, announcing that reflected list support inside compiled functions may be dropped in the long run, so implementing a new solution on reflected lists may not be the way to go.

3rd: Here is where my comments become a bit hazy. But you mentioned that your dict may have 10s of billions of entries. Even if we just account 2 bytes per item (very optimistic) in the dict, that would equate 10s of GBs of data. I wonder if dicts are the way to go for such a massive dataset. That is of course closely linked to what you actually want to do with this data afterwards, and what the exact format of your input data is. But to me it looks a bit as if something like a numpy array may be a more efficient container.