Code generation vs literal unroll

Hi, I have a comparison of unrolling a loop via code generation vs via literal_unroll. The second option takes longer to compile (understandable to an extent), but it's also much slower to execute (around 1500x slower). I'd prefer to avoid code generation, but it's proving to be the best option so far.

I was wondering if there's anything I could do here to improve the compilation speed of either option, and/or the execution speed of the literal_unroll option. The latter seems to suffer from reference-counting cost (based on a quick look at the LLVM IR).

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning


@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    timing_funs = tuple([foo for _ in range(367)])
    coo_fs = COOFunctions(timing_funs, tuple(range(367)), tuple(np.repeat(0, 367)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(367)])
    vecs = COOVecs(vecs_data, tuple(range(367)), tuple(np.repeat(0, 367)))
    fs = tuple([(foo, foo, foo) for _ in range(125)])
    sparse_pos = tuple([((1, 0), (1, 123), (1, 124)) for _ in range(125)])
    vecs_2 = np.zeros((125, 125, 1, 10000))
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

I tracked it down to a problem inside the new refprune pass: it is too slow walking the entire subgraph for an incref. I can work around the problem by banning revisiting nodes, but that disables some valid cases for the pruner. Tracking at PR https://github.com/numba/llvmlite/pull/644.

I think it would be fair to challenge whether such a large literal_unroll is a targeted use case at all. I wouldn’t like to disable other valid cases to make room for my example. I just thought it’d be interesting to look at the comparison.

Again and again I find code gen the best option in many cases. Since you are more experienced, what's your take on this? Is it better to improve things like literal_unroll until it can beat code gen, or is it better to speed up compilation of generated code? Other people have also reported slow compilation times on large functions: Tips or tricks for speeding up compilation time on first call of large Numba-jitted NumPy-containing functions?

btw @sklam, I just realized that I ran the examples above on 0.51. I now tried it with 0.52.0rc1 and the looper in the second example never finishes compiling (I killed it after 5 minutes). There may be a regression.

I opened an issue about this last part https://github.com/numba/numba/issues/6411

I thought the 1500x slowdown was against 0.52.0rc1. My workaround would allow the compilation to complete, but it does not address the 1500x slowdown for literal_unroll. A related observation is that literal_unroll produces a lot of reference-counting operations for the refprune pass to work on. Because the graph is so wide, it takes a long time for the pruner to find prunable reference-counting operations. The reference-counting ops are likely the reason for the 1500x slowdown, and trying to prune them is the reason for the extremely long compilation.

I think some of this is tuple unpacking too; wrapping the tuple in a StructRef:

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning


@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2

from numba import types
from numba.experimental import structref

@structref.register
class WrappedTupleType(types.StructRef):
    def preprocess_fields(self, fields):
        return tuple((name, typ) for name, typ in fields)

class WrappedTuple(structref.StructRefProxy):
    def __new__(cls, tuple_holder):
        return structref.StructRefProxy.__new__(cls, tuple_holder)

structref.define_proxy(WrappedTuple, WrappedTupleType, ["tuple_holder"])


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])


def make_looper_deref(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data.tuple_holder[idx][fc(t)]
        return out

    return get_values



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    sz = 5
    timing_funs = tuple([foo for _ in range(sz)])
    coo_fs = COOFunctions(timing_funs, tuple(range(sz)), tuple(np.repeat(0, sz)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, tuple(range(sz)), tuple(np.repeat(0, sz)))
    fs = tuple([(foo, foo, foo) for _ in range(125)])
    sparse_pos = tuple([((1, 0), (1, 123), (1, 124)) for _ in range(125)])
    vecs_2 = np.zeros((125, 125, 1, 10000))
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

    x = WrappedTuple(tuple(range(sz)))
    y = WrappedTuple(tuple(np.repeat(0, sz)))
    tmp = WrappedTuple(vecs_data)
    hide_vecs = COOVecs(tmp, x, y)
    with Timer("StructRef Literal unroll compilation: {:0.10f} seconds"):
        looper = make_looper_deref(coo_fs)
        res5 = looper(out, 0, hide_vecs)

    with Timer("StructRef Literal unroll without compilation: {:0.10f} seconds"):
        res5 = looper(out, 0, hide_vecs)

gives me:

$ NUMBA_LLVM_REFPRUNE_PASS=0 python di22.py 
Text: 25.6103645280 seconds
Text without compilation: 0.0000258130 seconds
Literal unroll: 0.6655537720 seconds
Literal unroll without compilation: 0.0005317160 seconds
StructRef Literal unroll compilation: 0.3853233850 seconds
StructRef Literal unroll without compilation: 0.0001672180 seconds

Thanks @stuartarchibald, interesting use of StructRef. I cannot claim to really understand what's going on. Why is this helping with the tuple unpacking?

I found something interesting. I replaced the outer named tuple with a structref and the inner tuples with a typed list, and I am getting really good results. Now the literal unroll is only a factor of 2 away from the text-based loop unrolling.

Some time ago I had discarded typed lists as too slow for iteration, but there’s been improvements since then so I’m exploring what else I could use typed lists for, especially replacing tuples that are expensive to unpack.

import os

os.environ["NUMBA_LLVM_REFPRUNE_PASS"] = "0"

import time
import warnings
from typing import Tuple, Callable, NamedTuple, Union

import numpy as np
from numba import njit, literal_unroll, NumbaExperimentalFeatureWarning
from numba.typed import List

@njit
def id_fc(x):
    return x


class COOFunctions(NamedTuple):
    data: Tuple[Callable[[int], Union[int, Tuple[int, ...]]], ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


class COOVecs(NamedTuple):
    data: Tuple[np.ndarray, ...]
    rows: Tuple[int, ...]
    cols: Tuple[int, ...]


def make_looper(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data[idx][fc(t)]
        return out

    return get_values


@njit
def foo(t):
    return 0, t * 2

from numba import types
from numba.experimental import structref

@structref.register
class WrappedTupleType(types.StructRef):
    def preprocess_fields(self, fields):
        return tuple((name, typ) for name, typ in fields)

class WrappedTuple(structref.StructRefProxy):
    def __new__(cls, tuple_holder):
        return structref.StructRefProxy.__new__(cls, tuple_holder)

structref.define_proxy(WrappedTuple, WrappedTupleType, ["tuple_holder"])


def make_looper_txt(timing_fs, sparse_pos):
    namespace = {'np': np}
    fc_idx = 0

    def add_fc(fc, fc_idx, namespace):
        exec(f"fc_{fc_idx} = fc", locals(), namespace)
        return f"fc_{fc_idx}"

    fc_text = ""
    fc_text += "def get_values(out, t, vecs):" + "\n"
    enum_gen = enumerate(zip(timing_fs, sparse_pos))
    for state_idx, (row_time_fc, row_pos) in enum_gen:
        if len(row_time_fc) > 0:
            for dec_idx, (time_fc, pos) in enumerate(zip(row_time_fc, row_pos)):
                time_fc_name = add_fc(time_fc, fc_idx, namespace)
                fc_idx += 1  # ensure each function gets a unique name
                in_, out_ = pos
                fc_text += f"    out[{in_}, {out_}] = vecs[{in_}][{dec_idx}][{time_fc_name}(t)]" + "\n"
    fc_text += "    return out"
    exec(fc_text, namespace, namespace)
    return njit(namespace['get_values'])


def make_looper_deref(coo_fs):
    fs_enum = tuple(enumerate(coo_fs.data))

    @njit
    def get_values(out, t, vecs) -> np.ndarray:
        for x in literal_unroll(fs_enum):
            idx, fc = x
            out[coo_fs.rows[idx], coo_fs.cols[idx]] = vecs.data.tuple_holder[idx][fc(t)]
        return out

    return get_values



class Timer:
    def __init__(self, text="Elapsed time: {:0.10f} seconds"):
        self.text = text
        self._start_time: float = 0

    def __enter__(self):
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        end = time.perf_counter()
        print(self.text.format(end - self._start_time))


if __name__ == '__main__':
    sz = 750
    assert sz % 3 == 0
    timing_funs = tuple([foo for _ in range(sz)])
    coo_fs = COOFunctions(timing_funs, tuple(range(sz)), tuple(np.repeat(0, sz)))
    vecs_data = tuple([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, tuple(range(sz)), tuple(np.repeat(0, sz)))
    fs = tuple([(foo, foo, foo) for _ in range(sz//3)])
    sparse_pos = tuple([((1, 0), (1, sz//3-2), (1, sz//3-1)) for _ in range(sz//3)])
    vecs_2 = List([List([np.zeros((1, 10000)) for _ in range(3)]) for _ in range(sz//3)])
    out = np.zeros((1000, 1000), np.float64)

    warnings.filterwarnings("ignore", category=NumbaExperimentalFeatureWarning)

    with Timer("Text: {:0.10f} seconds"):
        looper_txt = make_looper_txt(fs, sparse_pos)
        res1 = looper_txt(out, 0, vecs_2)

    with Timer("Text without compilation: {:0.10f} seconds"):
        res2 = looper_txt(out, 0, vecs_2)

    with Timer("Literal unroll: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res3 = looper(out, 0, vecs)

    with Timer("Literal unroll without compilation: {:0.10f} seconds"):
        res4 = looper(out, 0, vecs)

    x = WrappedTuple(tuple(range(sz)))
    y = WrappedTuple(tuple(np.repeat(0, sz)))
    tmp = WrappedTuple(vecs_data)
    hide_vecs = COOVecs(tmp, x, y)
    with Timer("StructRef Literal unroll compilation: {:0.10f} seconds"):
        looper = make_looper_deref(coo_fs)
        res5 = looper(out, 0, hide_vecs)

    with Timer("StructRef Literal unroll without compilation: {:0.10f} seconds"):
        res5 = looper(out, 0, hide_vecs)

    vecs_data = List([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOVecs(vecs_data, List(range(sz)), List(np.repeat(0, sz)))

    with Timer("Literal unroll lists: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res6 = looper(out, 0, vecs)

    with Timer("Literal unroll lists without compilation: {:0.10f} seconds"):
        res6 = looper(out, 0, vecs)


    ######################
    @structref.register
    class COOStructType(types.StructRef):
        def preprocess_fields(self, fields):
            # This method is called by the type constructor for additional
            # preprocessing on the fields.
            # Here, we don't want the struct to take Literal types.
            return tuple((name, types.unliteral(typ)) for name, typ in fields)


    class COOStruct(structref.StructRefProxy):
        def __new__(cls, data, rows, cols):
            # Overriding the __new__ method is optional, doing so
            # allows Python code to use keyword arguments,
            # or add other customized behavior.
            # The default __new__ takes `*args`.
            # IMPORTANT: Users should not override __init__.
            return structref.StructRefProxy.__new__(cls, data, rows, cols)


    structref.define_proxy(COOStruct, COOStructType, ["data", "rows", "cols"])


    vecs_data = List([np.zeros((1, 1441)) for _ in range(sz)])
    vecs = COOStruct(vecs_data, List(range(sz)), List(np.repeat(0, sz)))

    with Timer("Literal unroll lists+StructRef: {:0.10f} seconds"):
        looper = make_looper(coo_fs)
        res7 = looper(out, 0, vecs)

    with Timer("Literal unroll lists+StructRef without compilation: {:0.10f} seconds"):
        res7 = looper(out, 0, vecs)

and for a large example (sz = 750) I got:

Text: 44.5729698250 seconds
Text without compilation: 0.0001041110 seconds

Literal unroll: 219.5337678840 seconds
Literal unroll without compilation: 0.0224145050 seconds

StructRef Literal unroll compilation: 74.2100051170 seconds
StructRef Literal unroll without compilation: 0.0118298770 seconds

Literal unroll lists: 51.4629602010 seconds
Literal unroll lists without compilation: 0.0003306740 seconds

Literal unroll lists+StructRef: 56.6343746760 seconds
Literal unroll lists+StructRef without compilation: 0.0002141190 seconds