Code generation vs literal unroll

Hi, I have a comparison of unrolling a loop via code generation vs via literal_unroll. The second option takes longer to compile (understandable to an extent), but it's also much slower to execute (around 1500x slower). I'd prefer to avoid code generation, but it's proving to be the best option so far.

I was wondering if there's anything I could do here to improve the compilation speed of either option, and/or the execution speed of the literal_unroll option. The latter seems to suffer from reference-counting cost (based on a quick look at the LLVM IR).

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning


@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    timing_funs = tuple([foo for _ in range(367)])
    coo_fs = COOFunctions(timing_funs, tuple(range(367)), tuple(np.repeat(0, 367)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(367)])
    vecs = COOVecs(vecs_data, tuple(range(367)), tuple(np.repeat(0, 367)))
    fs = tuple([(foo, foo, foo) for _ in range(125)])
    sparse_pos = tuple([((1, 0), (1, 123), (1, 124)) for _ in range(125)])
    vecs_2 = np.zeros((125, 125, 1, 10000))
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

I tracked it down to a problem inside the new refprune pass: it is too slow walking the entire subgraph for an incref. I can work around the problem by banning revisiting nodes, but that disables some valid cases for the pruner. Tracking at PR https://github.com/numba/llvmlite/pull/644.

I think it would be fair to challenge whether such a large literal_unroll is a targeted use case at all. I wouldn’t like to disable other valid cases to make room for my example. I just thought it’d be interesting to look at the comparison.

Again and again I find code gen the best option in many cases. Since you are more experienced, what's your take on this? Is it better to improve things like literal_unroll until it can beat code gen, or is it better to speed up compilation of generated code? Other people have also reported slow compilation times on large functions: Tips or tricks for speeding up compilation time on first call of large Numba-jitted NumPy-containing functions?

btw @sklam, I just realized that I ran the examples above on 0.51. I now tried it with 0.52.0rc1 and the looper in the second example never finishes compiling (I killed it after 5 minutes). There may be a regression.

I opened an issue about this last part https://github.com/numba/numba/issues/6411

I thought the 1500x slowdown was against 0.52.0rc1. My workaround would allow the compilation to complete, but it does not address the 1500x slowdown for literal_unroll. A related observation is that literal_unroll produces a lot of reference-counting operations for the refprune pass to work on. Because the graph is so wide, it takes a long time for the pruner to find prunable reference-counting operations. The reference-counting ops are likely the reason for the 1500x slowdown, and trying to prune them is the reason for the extremely long compilation.

I think some of this is tuple unpacking too; wrapping the tuple in a StructRef:

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning


@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2

from numba import types
from numba.experimental import structref

@structref.register
class WrappedTupleType(types.StructRef):
    def preprocess_fields(self, fields):
        return tuple((name, typ) for name, typ in fields)

class WrappedTuple(structref.StructRefProxy):
    def __new__(cls, tuple_holder):
        return structref.StructRefProxy.__new__(cls, tuple_holder)

structref.define_proxy(WrappedTuple, WrappedTupleType, ["tuple_holder"])


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])


def make_looper_deref(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data.tuple_holder[idx][fc(t)]
        return out

    return get_values



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    sz = 5
    timing_funs = tuple([foo for _ in range(sz)])
    coo_fs = COOFunctions(timing_funs, tuple(range(sz)), tuple(np.repeat(0, sz)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, tuple(range(sz)), tuple(np.repeat(0, sz)))
    fs = tuple([(foo, foo, foo) for _ in range(125)])
    sparse_pos = tuple([((1, 0), (1, 123), (1, 124)) for _ in range(125)])
    vecs_2 = np.zeros((125, 125, 1, 10000))
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

    x = WrappedTuple(tuple(range(sz)))
    y = WrappedTuple(tuple(np.repeat(0, sz)))
    tmp = WrappedTuple(vecs_data)
    hide_vecs = COOVecs(tmp, x, y)
    with Timer("StructRef Literal unroll compilation: {:0.10f} seconds"):
        looper = make_looper_deref(coo_fs)
        res5 = looper(out, 0, hide_vecs)

    with Timer("StructRef Literal unroll without compilation: {:0.10f} seconds"):
        res5 = looper(out, 0, hide_vecs)

gives me:

$ NUMBA_LLVM_REFPRUNE_PASS=0 python di22.py 
Text: 25.6103645280 seconds
Text without compilation: 0.0000258130 seconds
Literal unroll: 0.6655537720 seconds
Literal unroll without compilation: 0.0005317160 seconds
StructRef Literal unroll compilation: 0.3853233850 seconds
StructRef Literal unroll without compilation: 0.0001672180 seconds

Thanks @stuartarchibald, interesting use of StructRef. I cannot claim to really understand what's going on. Why is this helping with the tuple unpacking?

I found something interesting. I replaced the outer named tuple with a structref and the inner tuples with a typed list, and I am getting really good results. Now the literal unroll is only a factor of 2 away from the text-based loop unrolling.

Some time ago I had discarded typed lists as too slow for iteration, but there’s been improvements since then so I’m exploring what else I could use typed lists for, especially replacing tuples that are expensive to unpack.

import os

os.environ["NUMBA_LLVM_REFPRUNE_PASS"] = "0"

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning
from numba.typed import List

@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2

from numba import types
from numba.experimental import structref

@structref.register
class WrappedTupleType(types.StructRef):
    def preprocess_fields(self, fields):
        return tuple((name, typ) for name, typ in fields)

class WrappedTuple(structref.StructRefProxy):
    def __new__(cls, tuple_holder):
        return structref.StructRefProxy.__new__(cls, tuple_holder)

structref.define_proxy(WrappedTuple, WrappedTupleType, ["tuple_holder"])


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])


def make_looper_deref(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data.tuple_holder[idx][fc(t)]
        return out

    return get_values



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    sz = 750
    assert sz % 3 == 0
    timing_funs = tuple([foo for _ in range(sz)])
    coo_fs = COOFunctions(timing_funs, tuple(range(sz)), tuple(np.repeat(0, sz)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, tuple(range(sz)), tuple(np.repeat(0, sz)))
    fs = tuple([(foo, foo, foo) for _ in range(sz//3)])
    sparse_pos = tuple([((1, 0), (1, sz//3-2), (1, sz//3-1)) for _ in range(sz//3)])
    vecs_2 = List([List([np.zeros((1, 10000)) for _ in range(3)]) for _ in range(sz//3)])
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

    x = WrappedTuple(tuple(range(sz)))
    y = WrappedTuple(tuple(np.repeat(0, sz)))
    tmp = WrappedTuple(vecs_data)
    hide_vecs = COOVecs(tmp, x, y)
    with Timer("StructRef Literal unroll compilation: {:0.10f} seconds"):
        looper = make_looper_deref(coo_fs)
        res5 = looper(out, 0, hide_vecs)

    with Timer("StructRef Literal unroll without compilation: {:0.10f} seconds"):
        res5 = looper(out, 0, hide_vecs)

    vecs_data = List([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, List(range(sz)), List(np.repeat(0, sz)))

    with Timer("Literal unroll lists: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res6 = looper(out, 0, vecs)

    with Timer("Literal unroll lists without compilation: {:0.10f} seconds"):
        res6 = looper(out, 0, vecs)


    ######################
    @structref.register
    class COOStructType(types.StructRef):
        def preprocess_fields(self, fields):
            # This method is called by the type constructor for additional
            # preprocessing on the fields.
            # Here, we don't want the struct to take Literal types.
            return tuple((name, types.unliteral(typ)) for name, typ in fields)


    class COOStruct(structref.StructRefProxy):
        def __new__(cls, data, rows, cols):
            # Overriding the __new__ method is optional, doing so
            # allows Python code to use keyword arguments,
            # or add other customized behavior.
            # The default __new__ takes `*args`.
            # IMPORTANT: Users should not override __init__.
            return structref.StructRefProxy.__new__(cls, data, rows, cols)


    structref.define_proxy(COOStruct, COOStructType, ["data", "rows", "cols"])


    vecs_data = List([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOStruct(vecs_data, List(range(sz)), List(np.repeat(0, sz)))

    with Timer("Literal unroll lists+StructRef: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res7 = looper(out, 0, vecs)

    with Timer("Literal unroll lists+StructRef without compilation: {:0.10f} seconds"):
        res7 = looper(out, 0, vecs)

and for a large example (sz = 750) I got:

Text: 44.5729698250 seconds
Text without compilation: 0.0001041110 seconds

Literal unroll: 219.5337678840 seconds
Literal unroll without compilation: 0.0224145050 seconds

StructRef Literal unroll compilation: 74.2100051170 seconds
StructRef Literal unroll without compilation: 0.0118298770 seconds

Literal unroll lists: 51.4629602010 seconds
Literal unroll lists without compilation: 0.0003306740 seconds

Literal unroll lists+StructRef: 56.6343746760 seconds
Literal unroll lists+StructRef without compilation: 0.0002141190 seconds