Speed tips for mapping all values of an array

:wave:

I’m trying to remap all values in an array according to some 1-1 correspondence. This can be accomplished by the skimage.util.map_array function, inner-loop implementation here. (There is a pure Python wrapper that takes care of the array shapes and array allocation here.)

Here’s how it looks in practice:

In [10]: values = np.random.randint(0, 5, size=10)
In [11]: inval = np.arange(5)
In [12]: outval = np.random.random(5)
In [13]: values
Out[13]: array([0, 0, 4, 0, 3, 2, 0, 2, 0, 2])
In [14]: inval
Out[14]: array([0, 1, 2, 3, 4])
In [15]: outval
Out[15]: array([0.595442  , 0.22325946, 0.16452037, 0.70457358, 0.37474462])
In [16]: map_array(values, inval, outval)
Out[16]: 
array([0.595442  , 0.595442  , 0.37474462, 0.595442  , 0.70457358,
       0.16452037, 0.595442  , 0.16452037, 0.595442  , 0.16452037])

This works well but it’s about 4x slower than using array indexing, as in outval[values]:

In [39]: image = np.random.randint(0, 5, size=(2048, 2048))

In [40]: %timeit map_array(image, inval, outval)
35.6 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [41]: %timeit outval[image]
9.48 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And the problem with the NumPy indexing approach is that you end up with a really huge outval array if the values in image are large — even if you don’t actually have many of them. (e.g. to map 2**32 and 2**32+1 to 0.5 and 1, you need to allocate a 4GB array!)

I thought I’d give Numba a go since dictionaries were implemented “recently”. (Thank you! :pray:) That turns out to be ~2x slower still than the C++ unordered_map approach.

import numba


@numba.jit
def _map_array(inarr, outarr, inval, outval):
    lut = {}
    for i in range(len(inval)):
        lut[inval[i]] = outval[i]
    for i in range(len(inarr)):
        outarr[i] = lut[inarr[i]]

Measurement:

In [30]: nd._map_array(image.ravel(), outarr.ravel(), inval, outval)                   

In [31]: %timeit nd._map_array(image.ravel(), outarr.ravel(), inval, outval)           
69.6 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Any ideas on how to speed this up?

Thank you!

The 2x slower than C++ is surprising to me. I compared the LLVM output of the following C++ code:

#include <unordered_map>


void foo(const size_t N, const size_t inarr[], size_t outarr[], const size_t inval[], const size_t outval[]) {
    std::unordered_map<size_t, size_t> lut;

    for (size_t i=0; i<N; ++i){
        lut[inval[i]] = outval[i];
    }


    for (size_t i=0; i<N; ++i){
        outarr[i] = lut[inarr[i]];
    }

}

to the Numba generated one and they are quite similar. Both have loops consisting of mainly calling the hashtable methods. The Numba one do not have reference-count operations in the loops. So, we will need to do a more detailed profiling on both Numba and C++ to find out the difference.

I will say that I was compiling with gcc, not clang. Not sure whether that makes a difference. (iirc clang goes via LLVM but gcc doesn’t?)

btw note that N should be different for inarr/outarr and inval/outval, though I realise that’s not the issue here. :blush:

I did a benchmark and Numba is always ~2x faster than C++. See this notebook: unordered_map C++ vs Numba.ipynb · GitHub. I made the C++ version as close to the Python one as possible. I am curious to see your C++ approach.

Hi again @sklam! As linked above, my “C++” is actually Cython with the C++ bindings, link:

One issue with your benchmarks is that they don’t really follow my use case (see code above), which is that the number of keys is typically much smaller than the size of the image. So I suspect your benchmark is dominated by dictionary building, whereas my benchmark is completely about dictionary lookup.

Above, inval/outval have size 5 while inarr/outarr have size 4M. That’s the kind of regime I’m usually operating in.