Support LiteralList type for homogeneous data

kozlov-alexey · November 26, 2020, 4:25pm

Hi,

I would like to discuss topics related to uses of LiteralList type in certain use-cases here.
In SDC we wanted to use homogeneous literal lists in scenarios like the following, arising when trying to implement pandas df.drop function:

    def test_impl(df):
        drop_cols = ['A', 'C']
        return df.drop(columns=drop_cols)

The columns argument has to be a list of columns to drop from the dataframe, but it requires for column names to be known at compile time, since the type of returned dataframe depends on it (for simplicity the DataFrame type can be thought of as a Tuple of types.Array objects with different dtypes).

Currently we can implement DataFrameType drop overload this way that supports List of columns as argument:

@overload(DataFrameType, 'drop')
def sdc_pandas_dataframe_drop(df, labels=None, axis=0, index=None, columns=None, level=None, inplace=False,
                          errors='raise'):
   ...
   if not (isinstance(columns, types.List) and isinstance(columns.dtype, types.UnicodeType)):
       return None

   if columns.initial_value is None:
       raise TypingError('{} Unsupported use of parameter columns:'
                      ' expected list of constant strings. Given: {}'.format('Method drop()', columns))
   else:
       # this works because global tuple of strings is captured as Tuple of StringLiterals
       columns_as_tuple = tuple(columns.initial_value)
       def _sdc_pandas_dataframe_drop_wrapper_impl(df, labels=None, axis=0, index=None,
                                                columns=None, level=None, inplace=False, errors="raise"):
           # below function actually does the main work and can handle columns as tuple of StringLiterals
           return df_drop_internal(labels=labels,
                       axis=axis,
                       index=index,
                       columns=columns_as_tuple,
                       level=level,
                       inplace=inplace,
                       errors=errors)

       return _sdc_pandas_dataframe_drop_wrapper_impl

However, since this uses reflected list (i.e. non-literal one) there’s a use-case when this list can be mutated after it’s first defined, but before it’s passed to df.drop() call:

def test_df_drop_columns_list_mutation_unsupported(self):
    def test_impl(df):
        drop_cols = ['A', 'C']                  
        drop_cols.pop(-1)                       # this would ban LiteralList modification
        return df.drop(columns=drop_cols)       # this could ban all other types except LiteralList
    sdc_func = self.jit(test_impl)

    df = pd.DataFrame({
        'A': [1.0, 2.0, np.nan, 1.0],
        'B': [4, 5, 6, 7],
        'C': [1.0, 2.0, np.nan, 1.0],
    })

    res = sdc_func(df)
    assert 'C' in res.columns, "Column 'C' should not be dropped"

Such a test compiles successfully but fails as captured initial_value becomes stale as the list gets mutated. We can obviously incorporate runtime check in the df.drop implementation and raise an exception if it’s detected that list was mutated, but in the ideal case we wanted to ban this scenario from being compiled at all.

It seemed natural that such a list [‘A’, ‘C’] should be typed as LiteralList[StringLiteral['A'], StringLiteral['C']], but it’s not, since the rule in BuildListConstraint here is to unify elements type if that is possible and infer an non-literal list. In this case StringLiteral['A'] and StringLiteral['C'] types are unified to be types.unicode_type, so the type of [‘A’, ‘C’] happens to be:
# drop_cols = $drop_cols.16 :: list(unicode_type)<iv=['A', 'C']>.

I was wondering is it possible to infer LiteralList type in such use-cases first and if that does not compile try with non-literal type?
Or is there any other way around this that would allow homogeneous LiteralLists to be supported?

I’m going to add some simple reproducer in a couple of hours.

Thanks in advance!

kozlov-alexey · November 30, 2020, 6:14pm

OK, here’s the script that can be used as a reproducer:

import numba
from numba import types
from numba.extending import overload, intrinsic
from numba.core.typing.templates import signature

@intrinsic
def inner_df_drop(typingctx, data, drop_idxs):

    print("DEBUG: inner_df_drop: data", data)
    print("DEBUG: inner_df_drop: drop_idxs", drop_idxs)
    if (not isinstance(data, types.Tuple)
            or not isinstance(drop_idxs, types.Tuple)):
        return None, None

    old_ncols = len(data)
    unlit_drop_idxs = [a.literal_value for a in drop_idxs]
    keep_col_idxs, new_typs = zip(*((i, t) for i, t in enumerate(data) if i not in unlit_drop_idxs))
    def codegen(context, builder, signature, args):
        old_data = args[0]
        new_data = [
            builder.extract_value(
                old_data,
                types.intp(i)
            ) for i in range(old_ncols) if i in keep_col_idxs
        ]

        res = context.make_tuple(builder, ret_typ, new_data)
        return res

    ret_typ = types.Tuple(new_typs)
    sig = signature(ret_typ, data, drop_idxs)
    return sig, codegen

def df_drop(data, drop_idxs):
    pass

@overload(df_drop, prefer_literal=True)
def df_drop_ovld(data, drop_idxs):

    print(f"DEBUG: df_drop_ovld: data={data}")
    print(f"DEBUG: df_drop_ovld: drop_idxs={drop_idxs}")
    if not (isinstance(drop_idxs, types.List)
                and drop_idxs.initial_value is not None
            ):
        return None

    columns_as_tuple = tuple(drop_idxs.initial_value)
    def df_drop_impl(data, drop_idxs):
        return inner_df_drop(data, columns_as_tuple)

    return df_drop_impl


@numba.njit
def test_func_1():
    data = ('a', [2, 3], 4.2, 600)
    drop_list = [1, 3]
    drop_list.pop()                     # comment for test to pass
    return df_drop(data, drop_list)


expected_res = ('a', 4.2, 600)
# expected_res = ('a', 4.2)             # uncomment for test to pass
res = test_func_1()
print(f"DEBUG: res={res}")
print(f"DEBUG: expected_res={expected_res}")

assert res == expected_res

sklam · December 2, 2020, 4:41pm

The challenges to inferring LiteralList will be in interprocedural situation because Numba do not keep track of mutations. So a situation like:

def callee(lst):
    lst.append(1)


def caller():
    lst = [1, 2]
    callee(lst)

will not be possible. If we force the lst to be literal in the first place and rely on error in callee() to stop using LiteralList, we will end up with a lot of recompilation. It is more common for list to be mutable.

I think the best approach to to opt-in from the overload implementation using literally(). Any mutation to the LiteralList should then result in a compilation error.

stuartarchibald · December 2, 2020, 5:58pm

When working on this I tried a bunch of tricks to work around the interprocedural mutation issue @sklam describes, none of which work reliably. I think type inference would need a massive overhaul to make this work.

There’s also a technical oddity born from making LiteralList the default type from a build_list opcode. Essentially type inference would first try “all types are non-literal, with the exception of lists” and if that fails it would fall back to “all types are literal where possible, apart from lists, which are now… something else? perhaps non-literal?!”

I also have a feeling that directly adding literal support to the List type for the purposes of getting literally working could well result in strange things happening too and it might be necessary to do something custom.

Further, as @sklam notes, there’s also the performance issue that arises from there being far more uses of mutable lists than immutable ones. This could potentially lead to things being repeatedly compiled due to a new default literal path being tried first and then fallback being required, this may be ok for smaller call graphs, but not so for larger ones. This also hints at literal list being an unlikely candidate for the default?

Hope this helps?

kozlov-alexey · December 4, 2020, 7:07pm

Thanks for the answers @sklam, @stuartarchibald, I understand why inferring Literal types for lists by default is a bad idea now. I think the ability to dispatch based on list values in overloads will always be important for us though, so having literally support Lists and Tuples seems like a good idea. Can you give more insights on what those strange things may be?

stuartarchibald · December 10, 2020, 1:25pm

I think that due to the way types.Literal and literally are implemented, copying the path that other literal types take is likely to not work in the case presented here. This case is a special “opt in” for this behaviour opposed to something that should just happen. Were the standard path to achieving literal type support implemented for list I think that when type inference fails with the non-literal types it will try unliteraling a list and a literal list would appear, this list having different semantics to a non-literal list. This is why I think something special is potentially needed, as a third state in the system is required.

Topic		Replies	Views
Unexpected behaviour with literally Development	2	643	August 12, 2020
Problem with Typed List function argument Community Support	1	689	October 12, 2023
List mistaken as list when creating list of function references Community Support	4	629	April 21, 2021
Performance of typed.List outside of jit functions Community Support	20	719	June 5, 2024
Best practices for using read-only Python lists Community Support	4	2038	January 12, 2022

Support LiteralList type for homogeneous data

Related topics