Fastest Numba code for computing the Euclidean distance between a vector and every row of a matrix

I have this NumPy code to compute the Euclidean distance between b and EVERY row of a. I can't get it faster than about 3.5 s, but the best Julia solution I found can do the same in about 75 ms, which is almost 50x faster!

I have to admit I have never used Numba, but I have heard lots of good things about it. I will start looking into how to write this in Numba, but given that it's a simple problem, can anyone give me some example code to work from? Do I just write a loop and let Numba @jit it?

My NumPy code:
import numpy as np

# generate
a = np.random.rand(4000000, 128)

b = np.random.rand(128)

print(a.shape)
print(b.shape)

def lin_norm_ever(a, b):
    return np.linalg.norm(a - b, axis=1)


import time
t = time.time()
res = lin_norm_ever(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)
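(Side note for readers: the broadcasted a - b above materializes a full 4,000,000 x 128 temporary before the norm is taken. A lower-allocation NumPy variant is sketched below; it assumes squared distances are acceptable and uses the expansion ||a_r - b||^2 = ||a_r||^2 - 2 a_r.b + ||b||^2 so the heavy lifting goes through BLAS. dist_sq_expanded is a hypothetical name, not from the thread.)

import numpy as np

def dist_sq_expanded(a, b):
    # Hypothetical helper, not from the thread.
    # Row norms, one BLAS matrix-vector product, and a scalar; no big temporary.
    # Take np.sqrt of the result if true (non-squared) distances are needed.
    # Caveat: the subtraction can lose precision for rows nearly equal to b.
    return np.einsum('ij,ij->i', a, a) - 2.0 * (a @ b) + b @ b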

Here's my Numba attempt. It's not bad, but it's still 10x slower than the Julia implementation (at the end).

###################### numba
import numba

@numba.jit(nopython=True, parallel=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist


import time
t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

Julia Implementation:


a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)

using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for i in 1:128
            buffer[j] += (a[i,j] - b[i])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a, b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)

@benchmark comp_gpu($gpu_a, $gpu_c)

@evalparse @njit is not running on CUDA cores; your Julia code uses CUDA acceleration, so it's normal for it to be much faster than the CPU version. You need to use the @cuda.jit decorator: https://numba.pydata.org/numba-doc/dev/cuda/kernels.html
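For reference, here is a minimal sketch of what such a kernel could look like with numba.cuda (one thread per row; untested here, and it assumes a CUDA-capable GPU plus float32 arrays as above — dist_kernel is a hypothetical name):

import numpy as np
from numba import cuda

@cuda.jit
def dist_kernel(a, b, out):
    # Hypothetical example kernel, not from the thread.
    i = cuda.grid(1)  # global thread index
    if i < a.shape[0]:
        d = 0.0
        for c in range(a.shape[1]):
            diff = b[c] - a[i, c]
            d += diff * diff
        out[i] = d  # squared distance for row i

a = np.random.rand(4000000, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)
d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array(a.shape[0], dtype=np.float32)
threads = 256
blocks = (a.shape[0] + threads - 1) // threads
dist_kernel[blocks, threads](d_a, d_b, d_out)
res = d_out.copy_to_host()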

Also, your benchmark result is skewed. With Numba you have to execute the function at least twice: the first call will be a lot slower than the second because of the JIT compilation. To avoid paying that cost on every run, you can cache the compiled version:

@numba.jit(cache=True, nopython=True, parallel=True, fastmath=True, boundscheck=False, nogil=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist

I included both CUDA and non-CUDA julia implementations.

Sorry, I didn't notice that…

Thanks for the tips. I included cache=True and ran it twice just to be safe. The second run took 430 ms, which is still 6x slower than the Julia Tullio.jl implementation.

For me, the Julia GPU implementation on my RTX 2080 Ti is about 50% faster: 50 ms, vs 75 ms on the CPU.

import numba

@numba.jit(cache=True, nopython=True, parallel=True, fastmath=True, boundscheck=False, nogil=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist


import time
t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

@evalparse you can squeeze even more perf with SVML. Not sure if you have the lib installed or not, but here is the link: https://numba.pydata.org/numba-doc/latest/user/performance-tips.html
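One way to check whether Numba actually picked SVML up (a sketch; the numba -s system report includes an SVML section):

import subprocess

# Print Numba's system report and look for the SVML information lines
subprocess.run(["numba", "-s"])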

I did a pip install icc_rt but it made no difference to the speed.

I don't know, I never used it myself… you can try reinstalling Numba after installing icc_rt.
Another thing: numpy.random generates float64 by default. You probably want to convert it to float32:

a = np.random.rand(4000000, 128).astype('float32')
print(type(a[0][0]))

Yep, I realised that and have been testing on that basis. Same performance.

@evalparse It seems that the performance is coming from Tullio, not Julia itself. Here's a plain Julia loop version for comparison:

function comp_tullio(a, b)
    dist = zeros(Float64, size(a, 2))
    for r = 1:size(a,2)
        for c = 1:128
            dist[r] += (b[c] - a[c, r])^2
        end
    end
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)
  0.620265 seconds (41.01 k allocations: 33.955 MiB, 0.88% gc time)
BenchmarkTools.Trial: 
  memory estimate:  31.69 MiB
  allocs estimate:  2
  --------------
  minimum time:     593.268 ms (0.50% GC)
  median time:      604.132 ms (0.06% GC)
  mean time:        601.816 ms (0.16% GC)
  maximum time:     610.214 ms (0.07% GC)
  --------------
  samples:          9
  evals/sample:     1

which compares to

@numba.jit(nopython=True, parallel=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist

%timeit numba_dist(a,b)
439 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tullio uses threads and SIMD, so I changed your code to:

import numba
from numba import prange
import numpy as np

@numba.njit(fastmath=True, parallel=True)
def numba_dist4(a, b):
    dist = np.zeros(a.shape[0])
    for r in prange(a.shape[0]):  # parallel outer loop
        d = 0.0  # local accumulator keeps the inner loop out of memory
        for c in range(128):
            d += (b[c] - a[r, c])**2
        dist[r] = d
    return dist

which, when passed float32 arrays, yields:

%timeit numba_dist4(a,b)
102 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I know Julia can use SIMD and threads, but I don’t know how to do it myself, so I cannot give you a direct comparison to this number.

I haven't been able to install Tullio, so I can't check how fast it would run on my computer. How many cores do you have? If more than 4, it's worth running my code on your machine to check what the speed is there.
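(For an apples-to-apples CPU comparison across machines, Numba's thread count can be pinned; a sketch, assuming a recent enough Numba that exposes these helpers:)

import numba

print(numba.get_num_threads())  # threads Numba will use by default (usually all cores)
numba.set_num_threads(4)        # e.g. pin to 4 to mimic a 4-core machine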

cheers,
Luk

UPDATE: I had to update to Julia 1.5 to use Tullio.

These are the results:

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)
0.481017 seconds (1.07 M allocations: 70.214 MiB)
BenchmarkTools.Trial: 
  memory estimate:  15.84 MiB
  allocs estimate:  2
  --------------
  minimum time:     143.370 ms (0.00% GC)
  median time:      144.133 ms (0.00% GC)
  mean time:        144.305 ms (0.00% GC)
  maximum time:     148.624 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1

So, unless I’m doing something wrong, Numba seems to be 30% faster than Julia+Tullio on my machine (4 cores).

I might have gotten something wrong, since I did all this very quickly and I'm not a Julia expert. But it was fun anyway.

Cheers,
Luk

Tullio.jl is a 100%-Julia library that just uses Julia code. See https://github.com/mcabbott/Tullio.jl

In Julia just do

using Pkg
Pkg.add("Tullio")
using Tullio

I can't replicate it:

import numba
from numba import prange
import numpy as np

# generate
a = np.random.rand(4153344, 128).astype("float32")
b = np.random.rand(128).astype("float32")

print(type(a))

print(a.shape)
print(b.shape)


@numba.njit(fastmath=True, parallel=True)
def numba_dist4(a, b):
    dist = np.zeros(a.shape[0])
    for r in prange(a.shape[0]):
        d = 0.0  # local accumulator
        for c in range(128):
            d += (b[c] - a[r, c])**2
        dist[r] = d
    return dist


import time
t = time.time()
res = numba_dist4(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

t = time.time()
res = numba_dist4(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

On my computer it's on par with Tullio.jl, maybe a bit slower. For such a simple problem, I'd expect optimized Numba and Julia to be about the same speed, which is kind of confirmed: I get about 90 ms, close to the 75 ms for Julia.

This is really nice to know about Numba. I assume it parallelizes the prange loop.
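(If you're curious what Numba actually parallelized, jitted functions compiled with parallel=True expose a diagnostics report; a quick sketch:)

# After numba_dist4 has been called once (so it is compiled),
# dump Numba's parallel transform report:
numba_dist4.parallel_diagnostics(level=4)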

I agree. In most simple cases they will have the same speed. As I said, I'm not a Julia expert, so there might be some tricks that I missed.

@evalparse, beyond confirming our a priori expectations that Numba and Julia yield similar performance, has this been useful to you?

Yes. I learnt some Numba tricks like prange and some decorator options. Also, setting aside a local variable like d is another nice trick I learned.
