Speed tips for mapping all values of an array

jni · May 12, 2021, 12:42pm

I’m trying to remap all values in an array according to some 1-1 correspondence. This can be done with the skimage.util.map_array function (inner-loop implementation here). There is also a pure Python wrapper that takes care of the array shapes and array allocation here.

Here’s how it looks in practice:

In [10]: values = np.random.randint(0, 5, size=10)
In [11]: inval = np.arange(5)
In [12]: outval = np.random.random(5)
In [13]: values
Out[13]: array([0, 0, 4, 0, 3, 2, 0, 2, 0, 2])
In [14]: inval
Out[14]: array([0, 1, 2, 3, 4])
In [15]: outval
Out[15]: array([0.595442  , 0.22325946, 0.16452037, 0.70457358, 0.37474462])
In [16]: map_array(values, inval, outval)
Out[16]: 
array([0.595442  , 0.595442  , 0.37474462, 0.595442  , 0.70457358,
       0.16452037, 0.595442  , 0.16452037, 0.595442  , 0.16452037])

This works well but it’s about 4x slower than using array indexing, as in outval[values]:

In [39]: image = np.random.randint(0, 5, size=(2048, 2048))

In [40]: %timeit map_array(image, inval, outval)
35.6 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [41]: %timeit outval[image]
9.48 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And the problem with the NumPy indexing approach is that outval has to be huge if the values in image are large, even if there are only a few distinct values. (E.g. to map 2**32 and 2**32+1 to 0.5 and 1, outval needs more than 2**32 entries: over 4 GB even at one byte per entry, and about 32 GiB as float64!)
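To make the memory cost concrete, here is a minimal sketch (not taken from skimage) that computes the size of a dense lookup table for that example and, for comparison, remaps the same sparse keys with np.searchsorted over sorted keys. The searchsorted route is a different technique from map_array's hash map. `dense_lut_nbytes` and `remap_sorted` are hypothetical helper names, and the sketch assumes every value in the input occurs in inval.

```python
import numpy as np


def dense_lut_nbytes(max_key, dtype=np.float64):
    """Bytes needed for a table that can be indexed by every key up to max_key."""
    return (max_key + 1) * np.dtype(dtype).itemsize


def remap_sorted(values, inval, outval):
    """Remap `values` by binary search over sorted keys (hypothetical helper).

    Assumes every element of `values` occurs in `inval`.
    """
    order = np.argsort(inval)
    sorted_keys = inval[order]
    sorted_out = outval[order]
    # Position of each value within the sorted keys.
    idx = np.searchsorted(sorted_keys, values)
    return sorted_out[idx]


inval = np.array([2**32, 2**32 + 1], dtype=np.uint64)
outval = np.array([0.5, 1.0])
values = np.array([2**32 + 1, 2**32, 2**32], dtype=np.uint64)

lut_bytes = dense_lut_nbytes(2**32 + 1)  # table indexed directly by key
remapped = remap_sorted(values, inval, outval)
print(f"dense float64 table: {lut_bytes / 2**30:.1f} GiB")
print(remapped)
```

The sorted-key version only stores as many entries as there are keys, at the cost of a binary search per element instead of a single index operation.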

I thought I’d give Numba a go, since dictionaries were implemented “recently”. (Thank you!) That turns out to be ~2x slower again than the C++ unordered_map approach.

import numba


@numba.jit
def _map_array(inarr, outarr, inval, outval):
    lut = {}
    for i in range(len(inval)):
        lut[inval[i]] = outval[i]
    for i in range(len(inarr)):
        outarr[i] = lut[inarr[i]]

Measurement:

In [30]: nd._map_array(image.ravel(), outarr.ravel(), inval, outval)                   

In [31]: %timeit nd._map_array(image.ravel(), outarr.ravel(), inval, outval)           
69.6 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
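The measurement above does not show how outarr was created. Below is a minimal sketch of the setup, assuming outarr has the shape of image and the dtype of outval (an assumption about what the wrapper does, not a copy of it). `map_array_sketch` and `_map_array_py` are hypothetical names, and the kernel is a plain-Python stand-in with the same loop structure as the jitted function.

```python
import numpy as np


def _map_array_py(inarr, outarr, inval, outval):
    """Plain-Python stand-in for the jitted kernel (same loop structure)."""
    lut = {}
    for i in range(len(inval)):
        lut[inval[i]] = outval[i]
    for i in range(len(inarr)):
        outarr[i] = lut[inarr[i]]


def map_array_sketch(image, inval, outval, kernel=_map_array_py):
    """Allocate the output, run the 1-D kernel on flat views, return the result."""
    image = np.ascontiguousarray(image)
    # Assumption: the output takes the input's shape and the mapped values' dtype.
    outarr = np.empty(image.shape, dtype=outval.dtype)
    # reshape(-1) on a contiguous array returns a view, so the kernel
    # writes straight into outarr.
    kernel(image.reshape(-1), outarr.reshape(-1), inval, outval)
    return outarr


inval = np.arange(5)
outval = np.random.random(5)
image = np.random.randint(0, 5, size=(64, 64))
result = map_array_sketch(image, inval, outval)
```

Passing the jitted _map_array as `kernel` should work the same way, since it takes the same four 1-D arrays.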

Any ideas on how to speed this up?

Thank you!

sklam · May 13, 2021, 3:01pm

The 2x slowdown relative to C++ is surprising to me. I compared the LLVM output of the following C++ code:

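#include <cstddef>
#include <unordered_map>

// Note: a single N is used for both the map size and the array size.
void foo(const std::size_t N, const std::size_t inarr[], std::size_t outarr[],
         const std::size_t inval[], const std::size_t outval[]) {
    std::unordered_map<std::size_t, std::size_t> lut;

    // Build the lookup table.
    for (std::size_t i = 0; i < N; ++i) {
        lut[inval[i]] = outval[i];
    }

    // Apply the lookup table to the input array.
    for (std::size_t i = 0; i < N; ++i) {
        outarr[i] = lut[inarr[i]];
    }
}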

to the Numba-generated one, and they are quite similar. Both have loops that consist mainly of calls to the hashtable methods. The Numba one does not have reference-count operations in the loops. So we will need to do more detailed profiling of both Numba and C++ to find the difference.

jni · May 14, 2021, 1:56am

I will say that I was compiling with gcc, not clang. Not sure whether that makes a difference. (iirc clang goes via LLVM but gcc doesn’t?)

jni · May 14, 2021, 1:58am

btw note that N should be different for inarr/outarr and inval/outval, though I realise that’s not the issue here.

sklam · May 14, 2021, 4:42pm

I did a benchmark and Numba is always ~2x faster than C++. See this notebook: unordered_map C++ vs Numba.ipynb · GitHub. I made the C++ version as close to the Python one as possible. I am curious to see your C++ approach.

jni · May 15, 2021, 12:52pm

Hi again @sklam! As linked above, my “C++” is actually Cython with the C++ bindings, link:

github.com/scikit-image/scikit-image/blob/4397765d5c842af3d1e590f2738475b47d6e4e95/skimage/util/_remap.pyx#L3-L24

from libcpp.unordered_map cimport unordered_map
cimport cython
from .._shared.fused_numerics cimport np_numeric, np_anyint


@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing
def _map_array(np_anyint[:] inarr, np_numeric[:] outarr,
               np_anyint[:] inval, np_numeric[:] outval):
    # build the map from the input and output vectors
    cdef size_t i, n_map, n_array
    cdef unordered_map[np_anyint, np_numeric] lut
    n_map = inval.shape[0]
    for i in range(n_map):
        lut[inval[i]] = outval[i]
    # apply the map to the array
    n_array = inarr.shape[0]
    # The prange option gave some compilation warnings
    #  "Unsigned index type not allowed before OpenMP 3.0"
    # and didn't seem to be any faster
    # for i in prange(n_array, nogil=True): #

This file has been truncated.

One issue with your benchmarks is that they don’t really follow my use case (see code above), which is that the number of keys is typically much smaller than the size of the image. So I suspect your benchmark is dominated by dictionary building, whereas my benchmark is completely about dictionary lookup.

Above, inval/outval have size 5 while inarr/outarr have size 4M. That’s the kind of regime I’m usually operating in.
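One way to check which phase dominates is to time the two loops separately. The sketch below does that with a plain Python dict standing in for both hash-map implementations, so the absolute numbers say nothing about Numba or C++. `time_build_and_lookup` is a hypothetical helper, and the sizes are scaled down from the 5-key, 4M-element case.

```python
import time

import numpy as np


def time_build_and_lookup(n_keys, n_array, seed=0):
    """Return (build_seconds, lookup_seconds, outarr) for a dict-based remap."""
    rng = np.random.default_rng(seed)
    inval = np.arange(n_keys)
    outval = rng.random(n_keys)
    inarr = rng.integers(0, n_keys, size=n_array)
    outarr = np.empty(n_array, dtype=outval.dtype)

    t0 = time.perf_counter()
    lut = {}
    for i in range(n_keys):      # build phase: one insert per key
        lut[inval[i]] = outval[i]
    t1 = time.perf_counter()
    for i in range(n_array):     # lookup phase: one lookup per element
        outarr[i] = lut[inarr[i]]
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, outarr


# Few keys, many lookups (the regime described above).
build_s, lookup_s, out_few = time_build_and_lookup(n_keys=5, n_array=200_000)
# As many keys as elements (closer to a benchmark that uses a single N).
build_eq, lookup_eq, out_eq = time_build_and_lookup(n_keys=50_000, n_array=50_000)

print(f"5 keys / 200k lookups:  build {build_s:.6f}s, lookup {lookup_s:.6f}s")
print(f"50k keys / 50k lookups: build {build_eq:.6f}s, lookup {lookup_eq:.6f}s")
```

With only a handful of keys the build phase should be negligible next to the lookups, whereas with equal sizes the two phases do a comparable number of hash-map operations.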
