Fastest Numba code for computing the Euclidean distance between a vector and every row of a matrix

I have this NumPy code to compute the Euclidean distance between b and EVERY row of a. I can't get it faster than about 3.5 s, but the best Julia solution I found can do the same in about 75 ms, which is almost 50x faster!

I have to admit I have never used Numba, but I have heard lots of good things about it. I will start looking into how to write this in Numba, but given that it's a simple problem, can anyone give me some example code to work from? Do I just write a loop and let Numba @jit it?

My NumPy code:
import numpy as np

# generate
a = np.random.rand(4000000, 128)

b = np.random.rand(128)

print(a.shape)
print(b.shape)

def lin_norm_ever(a, b):
    return np.linalg.norm(a - b, axis=1)


import time
t = time.time()
res = lin_norm_ever(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)
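(Side note for readers: the broadcasted a - b above materializes a full 4,000,000 x 128 temporary before the norm is taken. A lower-allocation NumPy variant is sketched below; it assumes squared distances are acceptable and uses the expansion ||a_r - b||^2 = ||a_r||^2 - 2 a_r.b + ||b||^2 so the heavy lifting goes through BLAS. dist_sq_expanded is a hypothetical name, not from the thread.)

import numpy as np

def dist_sq_expanded(a, b):
    # Hypothetical helper, not from the thread.
    # Row norms, one BLAS matrix-vector product, and a scalar; no big temporary.
    # Take np.sqrt of the result if true (non-squared) distances are needed.
    # Caveat: the subtraction can lose precision for rows nearly equal to b.
    return np.einsum('ij,ij->i', a, a) - 2.0 * (a @ b) + b @ b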

Here's my Numba attempt. It's not bad, but it's still 10x slower than the Julia implementation (at the end).

###################### numba
import numba

@numba.jit(nopython=True, parallel=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist


import time
t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

Julia Implementation:


a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)

using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for i in 1:128
            buffer[j] += (a[i,j] - b[i])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a, b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)

@benchmark comp_gpu($gpu_a, $gpu_c)

@evalparse @njit is not running on CUDA cores; your Julia code uses CUDA acceleration, so it's normal for it to be much faster than the CPU version. You need to use the @cuda.jit decorator: https://numba.pydata.org/numba-doc/dev/cuda/kernels.html
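For reference, here is a minimal sketch of what such a kernel could look like with numba.cuda (one thread per row; untested here, and it assumes a CUDA-capable GPU plus float32 arrays as above — dist_kernel is a hypothetical name):

import numpy as np
from numba import cuda

@cuda.jit
def dist_kernel(a, b, out):
    # Hypothetical example kernel, not from the thread.
    i = cuda.grid(1)  # global thread index
    if i < a.shape[0]:
        d = 0.0
        for c in range(a.shape[1]):
            diff = b[c] - a[i, c]
            d += diff * diff
        out[i] = d  # squared distance for row i

a = np.random.rand(4000000, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)
d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array(a.shape[0], dtype=np.float32)
threads = 256
blocks = (a.shape[0] + threads - 1) // threads
dist_kernel[blocks, threads](d_a, d_b, d_out)
res = d_out.copy_to_host()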

Also, your benchmark result is skewed. With Numba you have to execute the function at least twice: the first call will be a lot slower than the second because of the JIT compilation. To avoid paying that cost on every run, you can cache the compiled version:

@numba.jit(cache=True, nopython=True, parallel=True, fastmath=True, boundscheck=False, nogil=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist

I included both CUDA and non-CUDA julia implementations.

Sorry, I didn't notice that…

Thanks for the tips. I included cache=True and ran it twice just to be safe. The second run took 430 ms, which is still 6x slower than the Julia Tullio.jl implementation.

For me, the Julia GPU implementation on my RTX 2080 Ti is about 50% faster: 50 ms, vs 75 ms on the CPU.

import numba

@numba.jit(cache=True, nopython=True, parallel=True, fastmath=True, boundscheck=False, nogil=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist


import time
t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

t = time.time()
res = numba_dist(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)

@evalparse you can squeeze even more perf with SVML. Not sure if you have the lib installed or not, but here is the link: https://numba.pydata.org/numba-doc/latest/user/performance-tips.html
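One way to check whether Numba actually picked SVML up (a sketch; the numba -s system report includes an SVML section):

import subprocess

# Print Numba's system report and look for the SVML information lines
subprocess.run(["numba", "-s"])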

I did a pip install icc_rt but it made no difference to the speed.

I don't know, I never used it myself… you can try reinstalling Numba after installing icc_rt.
Another thing: numpy.random generates float64 by default. You probably want to convert it to float32:

a = np.random.rand(4000000, 128).astype('float32')
print(type(a[0][0]))

Yep, I realised that and have been testing on that basis. Same performance.

@evalparse It seems that the performance is coming from Tullio, not Julia itself. Here's a plain Julia loop version for comparison:

function comp_tullio(a, b)
    dist = zeros(Float64, size(a, 2))
    for r = 1:size(a,2)
        for c = 1:128
            dist[r] += (b[c] - a[c, r])^2
        end
    end
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)
  0.620265 seconds (41.01 k allocations: 33.955 MiB, 0.88% gc time)
BenchmarkTools.Trial: 
  memory estimate:  31.69 MiB
  allocs estimate:  2
  --------------
  minimum time:     593.268 ms (0.50% GC)
  median time:      604.132 ms (0.06% GC)
  mean time:        601.816 ms (0.16% GC)
  maximum time:     610.214 ms (0.07% GC)
  --------------
  samples:          9
  evals/sample:     1

which compares to

@numba.jit(nopython=True, parallel=True)
def numba_dist(a, b):
    dist = np.zeros(a.shape[0])
    for r in range(a.shape[0]):
        for c in range(128):
            dist[r] += (b[c] - a[r, c])**2

    return dist

%timeit numba_dist(a,b)
439 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tullio uses threads and SIMD, so I changed your code to:

import numba
from numba import prange
import numpy as np

@numba.njit(fastmath=True, parallel=True)
def numba_dist4(a, b):
    dist = np.zeros(a.shape[0])
    for r in prange(a.shape[0]):  # parallel outer loop
        d = 0.0  # local accumulator keeps the inner loop out of memory
        for c in range(128):
            d += (b[c] - a[r, c])**2
        dist[r] = d
    return dist

which, when passed float32 arrays, yields:

%timeit numba_dist4(a,b)
102 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I know Julia can use SIMD and threads, but I don’t know how to do it myself, so I cannot give you a direct comparison to this number.

I haven't been able to install Tullio, so I can't check how fast it would run on my computer. How many cores do you have? If more than 4, it's worth running my code on your machine to check what the speed is there.
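(For an apples-to-apples CPU comparison across machines, Numba's thread count can be pinned; a sketch, assuming a recent enough Numba that exposes these helpers:)

import numba

print(numba.get_num_threads())  # threads Numba will use by default (usually all cores)
numba.set_num_threads(4)        # e.g. pin to 4 to mimic a 4-core machine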

cheers,
Luk

UPDATE: I had to update to Julia 1.5 to use Tullio.

These are the results:

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)
@benchmark comp_tullio($a, $c)
0.481017 seconds (1.07 M allocations: 70.214 MiB)
BenchmarkTools.Trial: 
  memory estimate:  15.84 MiB
  allocs estimate:  2
  --------------
  minimum time:     143.370 ms (0.00% GC)
  median time:      144.133 ms (0.00% GC)
  mean time:        144.305 ms (0.00% GC)
  maximum time:     148.624 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1

So, unless I’m doing something wrong, Numba seems to be 30% faster than Julia+Tullio on my machine (4 cores).

I might have gotten something wrong, since I did all this very quickly and I'm not a Julia expert. But it was fun anyway.

Cheers,
Luk

Tullio.jl is a 100%-Julia library that just uses Julia code. See https://github.com/mcabbott/Tullio.jl

In Julia just do

using Pkg
Pkg.add("Tullio")
using Tullio

I can't replicate it:

import numba
from numba import prange
import numpy as np

# generate
a = np.random.rand(4153344, 128).astype("float32")
b = np.random.rand(128).astype("float32")

print(type(a))

print(a.shape)
print(b.shape)


@numba.njit(fastmath=True, parallel=True)
def numba_dist4(a, b):
    dist = np.zeros(a.shape[0])
    for r in prange(a.shape[0]):
        d = 0.0  # local accumulator
        for c in range(128):
            d += (b[c] - a[r, c])**2
        dist[r] = d
    return dist


import time
t = time.time()
res = numba_dist4(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

t = time.time()
res = numba_dist4(a, b)
elapsed = time.time() - t
print(res.shape)
print(elapsed)

On my computer it's on par with Tullio.jl, maybe a bit slower. For such a simple problem, I'd expect optimized Numba and Julia to be about the same speed, which is kind of confirmed: I get about 90 ms, close to the 75 ms for Julia.

This is really nice to know about Numba. I assume it parallelizes the prange loop.
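(If you're curious what Numba actually parallelized, jitted functions compiled with parallel=True expose a diagnostics report; a quick sketch:)

# After numba_dist4 has been called once (so it is compiled),
# dump Numba's parallel transform report:
numba_dist4.parallel_diagnostics(level=4)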

I agree. In most simple cases they will have the same speed. As I said, I'm not a Julia expert, so there might be some tricks that I missed.

@evalparse, beyond confirming our a priori expectations that Numba and Julia yield similar performance, has this been useful to you?

Yes. I learnt some Numba tricks like prange and some decorator options. Also, setting aside a local variable like d is another nice trick I learned.
