Hi,
I would like to discuss how the LiteralList type can be used in certain use cases.
In SDC we wanted to use homogeneous literal lists in scenarios like the following, which came up while implementing the pandas df.drop function:
def test_impl(df):
    drop_cols = ['A', 'C']
    return df.drop(columns=drop_cols)
The columns argument has to be a list of columns to drop from the dataframe, and the column names must be known at compile time, since the type of the returned dataframe depends on them (for simplicity, the DataFrame type can be thought of as a Tuple of types.Array objects with different dtypes).
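To make that dependency concrete, here is a minimal pure-Python sketch. It is not SDC code: the function name dropped_frame_signature and the representation of a frame "type" as a tuple of (name, dtype) pairs are assumptions made only for illustration.

```python
def dropped_frame_signature(column_names, column_dtypes, drop_cols):
    """Return the (name, dtype) pairs that remain after dropping drop_cols.

    Hypothetical stand-in for the compile-time computation of the result type:
    the output depends on the *values* in drop_cols, not just on their type.
    """
    unknown = [name for name in drop_cols if name not in column_names]
    if unknown:
        raise KeyError("columns not found: {}".format(unknown))
    return tuple((name, dtype)
                 for name, dtype in zip(column_names, column_dtypes)
                 if name not in drop_cols)


names = ('A', 'B', 'C')
dtypes = ('float64', 'int64', 'float64')
print(dropped_frame_signature(names, dtypes, ('A', 'C')))  # (('B', 'int64'),)
```

Two calls that differ only in the strings inside drop_cols give different result signatures, which is why a plain list(unicode_type) without known values is not enough for typing.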
Currently we can implement the DataFrameType drop overload so that it supports a List of columns as the argument, like this:
@overload_method(DataFrameType, 'drop')
def sdc_pandas_dataframe_drop(df, labels=None, axis=0, index=None, columns=None, level=None, inplace=False,
                              errors='raise'):
    ...
    if not (isinstance(columns, types.List) and isinstance(columns.dtype, types.UnicodeType)):
        return None
    if columns.initial_value is None:
        raise TypingError('{} Unsupported use of parameter columns:'
                          ' expected list of constant strings. Given: {}'.format('Method drop()', columns))
    else:
        # this works because global tuple of strings is captured as Tuple of StringLiterals
        columns_as_tuple = tuple(columns.initial_value)
        def _sdc_pandas_dataframe_drop_wrapper_impl(df, labels=None, axis=0, index=None,
                                                    columns=None, level=None, inplace=False, errors="raise"):
            # below function actually does the main work and can handle columns as tuple of StringLiterals
            return df_drop_internal(labels=labels,
                                    axis=axis,
                                    index=index,
                                    columns=columns_as_tuple,
                                    level=level,
                                    inplace=inplace,
                                    errors=errors)
        return _sdc_pandas_dataframe_drop_wrapper_impl
However, since this relies on a reflected list (i.e. a non-literal one), the list can be mutated after it is first defined but before it is passed to the df.drop() call:
def test_df_drop_columns_list_mutation_unsupported(self):
    def test_impl(df):
        drop_cols = ['A', 'C']
        drop_cols.pop(-1)  # this would ban LiteralList modification
        return df.drop(columns=drop_cols)  # this could ban all other types except LiteralList
    sdc_func = self.jit(test_impl)
    df = pd.DataFrame({
        'A': [1.0, 2.0, np.nan, 1.0],
        'B': [4, 5, 6, 7],
        'C': [1.0, 2.0, np.nan, 1.0],
    })
    res = sdc_func(df)
    assert 'C' in res.columns, "Column 'C' should not be dropped"
This test compiles successfully but fails at run time, because the captured initial_value becomes stale once the list is mutated. We could of course add a runtime check to the df.drop implementation and raise an exception if it detects that the list was mutated, but ideally we would like to prevent this scenario from compiling at all.
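For reference, the runtime check mentioned above could look roughly like the sketch below. It is plain Python rather than code from SDC; the helper name check_columns_unchanged is hypothetical, and inside a Numba overload the same comparison would have to be written with operations Numba supports in nopython mode.

```python
def check_columns_unchanged(columns, captured):
    """Raise ValueError if the runtime list differs from the compile-time snapshot.

    columns  -- the list actually passed to drop() at run time
    captured -- tuple built from columns.initial_value at typing time
    """
    if len(columns) != len(captured):
        raise ValueError("columns list was mutated after it was defined")
    for i in range(len(captured)):
        if columns[i] != captured[i]:
            raise ValueError("columns list was mutated after it was defined")


drop_cols = ['A', 'C']
captured = tuple(drop_cols)   # what initial_value would have recorded
drop_cols.pop(-1)             # mutation between definition and use
try:
    check_columns_unchanged(drop_cols, captured)
except ValueError as exc:
    print("rejected:", exc)
```

This only turns a silently wrong result into an exception at run time; it does not give the compile-time rejection we are after.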
It seemed natural that a list such as ['A', 'C'] would be typed as LiteralList[StringLiteral['A'], StringLiteral['C']], but it is not, since the rule in BuildListConstraint is to unify the element types if that is possible and infer a non-literal list. In this case the StringLiteral['A'] and StringLiteral['C'] types are unified to types.unicode_type, so the type of ['A', 'C'] ends up being:
# drop_cols = $drop_cols.16 :: list(unicode_type)<iv=['A', 'C']>
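As a rough illustration of that rule, here is a toy model in plain Python. It does not use Numba and is not the actual BuildListConstraint code; the classes StrLiteral, ToyList and ToyLiteralList and the simplified unify function are all made up for this sketch, on the assumption that literal types decay to their base type during unification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StrLiteral:
    """Stand-in for StringLiteral['...']."""
    value: str


@dataclass(frozen=True)
class ToyList:
    """Stand-in for list(dtype)<iv=...>, i.e. a non-literal list."""
    dtype: str
    initial_value: Optional[tuple]


@dataclass(frozen=True)
class ToyLiteralList:
    """Stand-in for LiteralList[...]."""
    items: tuple


def unify(element_types):
    # Toy unification: string literals decay to 'unicode_type';
    # anything else is kept as-is. Mixed base types do not unify.
    bases = {"unicode_type" if isinstance(t, StrLiteral) else t
             for t in element_types}
    return bases.pop() if len(bases) == 1 else None


def type_build_list(element_types):
    unified = unify(element_types)
    if unified is not None:
        # Unification succeeded: non-literal list, values kept only as a hint.
        all_literal = all(isinstance(t, StrLiteral) for t in element_types)
        iv = tuple(t.value for t in element_types) if all_literal else None
        return ToyList(unified, iv)
    # Only when unification fails does the list become a literal list.
    return ToyLiteralList(tuple(element_types))


print(type_build_list([StrLiteral('A'), StrLiteral('C')]))
print(type_build_list([StrLiteral('A'), 'int64']))
```

In this model a homogeneous list of string literals can never reach the ToyLiteralList branch, which mirrors the behaviour described above.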
I was wondering whether it would be possible to infer the LiteralList type first in such use cases and, if that does not compile, fall back to the non-literal type?
Or is there some other way around this that would allow homogeneous LiteralLists to be supported?
I'm going to add a simple reproducer in a couple of hours.
Thanks in advance!