Array of jitclass with low memory usage

I have some data that is awkwardly structured. In Fortran, I can easily contain this data with a 2-D array of derived types (i.e. structs in C).

program main
  use iso_fortran_env, only: dp => real64
  implicit none

  type :: ContainerType
    real(dp), allocatable :: arr1(:)
    real(dp), allocatable :: arr2(:)
  end type

  type(ContainerType), allocatable :: dat(:,:)

  ! `dat` can easily contain all my data

end program

My goal is to create a similar data structure in Numba. My current solution uses nested lists of jitclasses:

import numba as nb
from numba import types

spec = [
    ('arr1', types.double[:]),
    ('arr2', types.double[:]),
]
@nb.experimental.jitclass(spec)
class Container_0():
    def __init__(self, arr1, arr2):
        self.arr1 = arr1
        self.arr2 = arr2

spec = [
    ("c0", nb.types.ListType(Container_0.class_type.instance_type)),
]
@nb.experimental.jitclass(spec)
class Container_1():
    def __init__(self, c0):
        self.c0 = c0
        
spec = [
    ("c1", nb.types.ListType(Container_1.class_type.instance_type)),
]
@nb.experimental.jitclass(spec)
class Container_2():
    def __init__(self, c1):
        self.c1 = c1

While this approach seems to work in Numba, it takes up massive amounts of memory. For example, if I store 60,000 double-precision numbers in this data structure, it takes ~1 GB of memory. The raw data is only 480 KB (60,000 × 8 bytes), so in Fortran or C it should take at most ~1 MB including overhead.

MY QUESTION: Is there an alternative way to store my data in Numba which doesn't use so much memory?

Hi!

Have you considered structured arrays? Structured arrays — NumPy v1.24 Manual

luk

I have not considered structured arrays. But I just looked at the documentation, and it seems like they cannot hold data in the way I describe: I need an array of objects whose members are variable-length arrays, which structured arrays do not support.
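
For illustration, a minimal sketch of the limitation (the dtype below is hypothetical):

import numpy as np

# Structured dtypes can hold fixed-size subarrays...
fixed = np.dtype([('arr1', np.float64, (3,)), ('arr2', np.float64, (3,))])
dat = np.zeros((100, 100), dtype=fixed)  # fine: every arr1/arr2 has length 3

# ...but there is no way to give dat[i, j]['arr1'] a different
# length for each element, which is what my data needs.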

A C example would help make the problem more concrete for me.

Below is an example in C. Container ** is effectively a 2-D array of Container structs, which in the example is 100 by 100. Each element of this 2-D array is a struct which itself holds two separate arrays (arr1 and arr2). In the example I always allocate arrays of length 3, but in principle arr1 and arr2 could be any length. In my real application, arr1 and arr2 would have different lengths for each element of the 2-D array.

Of course, I don't free the memory in the example. But it's just an example.

The script stores 60,000 doubles using about 800 KB of memory.

#include <stdlib.h>
#include <stdio.h>

typedef struct Container
{
  double *arr1;
  int arr1_n;
  double *arr2;
  int arr2_n;
} Container;

int main()
{ 
  int n = 100;
  int m = 100;
  int total_memory = 0;

  /* 2-D array of structs: n row pointers, then n rows of m structs */
  Container **c = (Container **) malloc(sizeof(Container*)*n);
  total_memory += sizeof(void*)*n;
  for (int i = 0; i < n; i++){
    c[i] = (Container *) malloc(sizeof(Container)*m);
    total_memory += sizeof(Container)*m;
    for (int j = 0; j < m; j++){
      c[i][j].arr1 = (double *) malloc(sizeof(double)*3);
      total_memory += sizeof(double)*3;
      c[i][j].arr1_n = 3;

      c[i][j].arr2 = (double *) malloc(sizeof(double)*3);
      total_memory += sizeof(double)*3;
      c[i][j].arr2_n = 3;
    }
  }

  printf("double precision numbers stored: %i\n", n*m*6);
  printf("total memory: %i bytes\n",total_memory);
  return 0;
}

Thanks, that's quite helpful.

How are you measuring the memory usage in Python?

You used length 3; is that an expected 'average' value? Having some idea about that could inform solutions.

A runnable, minimal Python reproducer would be a useful starting point. I'd think a Python solution with memory usage reasonably comparable to the C example may well be possible.

I'm using the memory_profiler package to measure memory usage in Python. I've also used the "Activity Monitor" application that comes with Macs to see how much more memory a Python process takes when I create an object.

You used length 3; is that an expected 'average' value? Having some idea about that could inform solutions.

The length is random, anywhere from 0 up to maybe 10.

Below is a minimal example. Copy-paste it into a .py file and run it with python -m memory_profiler profile_script.py. The object c2 is ~215 MB on my computer.

I've created the equivalent script in C, which takes 32 MB to store the same amount of information. In both programs about 2,500,000 doubles are stored, which should take about 20 MB. In C, it takes an additional 12 MB of memory to define the structure of the data (i.e. the pointers and sizes of the arrays). In Numba, it is taking 215 − 20 = 195 MB to define the structure of the data, a factor of 195/12 ≈ 16 more than in C.

import numpy as np
from numba import types
import numba as nb
from memory_profiler import profile

spec = [
    ('arr1', types.double[:]),
    ('arr2', types.double[:]),
]
@nb.experimental.jitclass(spec)
class Container_0():
    def __init__(self, arr1, arr2):
        self.arr1 = arr1
        self.arr2 = arr2

spec = [
    ("c0", nb.types.ListType(Container_0.class_type.instance_type)),
]
@nb.experimental.jitclass(spec)
class Container_1():
    def __init__(self, c0):
        self.c0 = c0
        
spec = [
    ("c1", nb.types.ListType(Container_1.class_type.instance_type)),
]
@nb.experimental.jitclass(spec)
class Container_2():
    def __init__(self, c1):
        self.c1 = c1
        
@nb.njit()
def make_Container_2():
    np.random.seed(0)
    n = 1000
    m = 500
    N = np.random.randint(0,6,size=(n,m))
    
    list_2 = nb.typed.List()
    for i in range(n):
        list_1 = nb.typed.List()
        for j in range(m):
            arr1 = np.random.uniform(0.0, 1.0, size=N[i,j])
            arr2 = np.random.uniform(0.0, 1.0, size=N[i,j])
            
            c0 = Container_0(arr1, arr2)
            list_1.append(c0)
        c1 = Container_1(list_1)
        list_2.append(c1)
    c2 = Container_2(list_2)
    
    return c2

@profile
def main():
    c2 = make_Container_2()

if __name__ == "__main__":
    make_Container_2() # compile stuff
    main()

Thanks for a great starting point!


If the inner arrays can have different lengths, then a structured array does not work, but maybe an "awkward" array can help you: Awkward Array documentation — Awkward Array 2.0.6 documentation

It's an array-like container of arrays, so it might work for your case.
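
For example, something like this might express your ragged records (a sketch, assuming Awkward Array 2.x):

import awkward as ak

# ragged records: each element holds arr1/arr2 of arbitrary length
dat = ak.Array([
    [{"arr1": [0.1, 0.2], "arr2": [0.3]},
     {"arr1": [], "arr2": [0.4, 0.5, 0.6]}],
    [{"arr1": [0.7], "arr2": []},
     {"arr1": [0.8, 0.9], "arr2": [1.0]}],
])
print(dat[0][1]["arr2"])  # [0.4, 0.5, 0.6]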

Luk


Would you consider a solution where all the arr1, arr2 values are stored in a single ndarray and the 2-D control structure is simply indexes into that storage array? For example, the control structure could be something like below, with every record pointing into the single storage array.

import numpy as np

index_type = np.dtype([
    ('arr1_begin_index', np.uint64),
    ('arr1_end_index', np.uint64),
    ('arr2_begin_index', np.uint64),
    ('arr2_end_index', np.uint64)
])
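
A rough usage sketch, continuing from the dtype above (the numbers are illustrative): all values live in one flat storage array, and each record just slices into it.

n, m = 2, 2
idx = np.zeros((n, m), dtype=index_type)
storage = np.arange(12, dtype=np.float64)  # toy flat storage

# say element (0, 1) owns storage[3:5] as arr1 and storage[5:9] as arr2
idx[0, 1] = (3, 5, 5, 9)

rec = idx[0, 1]
arr1 = storage[rec['arr1_begin_index']:rec['arr1_end_index']]  # a view, no copy
arr2 = storage[rec['arr2_begin_index']:rec['arr2_end_index']]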

That would work in practice. But it would be a little clumsy, I guess.

It gives me an idea. It might take much less memory to store the 2-D array of objects in a single flat list instead of a list of lists (see the sketch below). I'm going to try this tomorrow at some point.
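
The bookkeeping for the flat list would be something like this (a sketch; flat_index is a made-up helper):

import numba as nb

# map a 2-D (i, j) index onto one flat typed list of length n*m
@nb.njit
def flat_index(i, j, m):
    return i * m + j

# hypothetical usage: c0 = flat_list[flat_index(i, j, m)]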

An awkward/ragged-array-like solution, as luk and nelson suggest, would definitely be more clumsy. Although if your aim is low memory and speed, I doubt you would find a better solution; I also wouldn't be surprised if this had a better runtime than the C or Fortran options. It's definitely going to be faster and less memory-intensive to allocate a handful of big arrays than 10,000 individual ones. Keep in mind that there is a little more memory overhead per array in Numba than in C/Fortran: NumPy arrays have extra members like a shape, and they also have a refcount so they can get cleaned up by the Numba Runtime. With that in mind, these extra things don't seem big enough to account for the 1 MB → 1 GB jump you're seeing… so I'm a little bit curious about the results of your idea to use a single list.

I tried a single list of jitclasses instead of a list of lists. Roughly the same memory is used either way. I've done some experimenting with C, Fortran and Numba.

The following struct in C is 24 bytes (sizeof(Container) returns 24):

typedef struct Container_t
{
  int n;        /* 4 bytes + 4 bytes padding before the first pointer */
  double *arr1; /* 8 bytes */
  double *arr2; /* 8 bytes => 24 bytes total on a typical 64-bit ABI */
} Container;

The following, somewhat equivalent derived type in Fortran is 128 bytes (allocatable components carry large array descriptors):

type :: Container
  real(dp), allocatable :: arr1(:)
  real(dp), allocatable :: arr2(:)
end type

The following, somewhat equivalent Python jitclass is about 400 bytes:

import numba as nb
from numba import types

@nb.experimental.jitclass
class Container_0:
    arr1 : types.double[:]
    arr2 : types.double[:]
    def __init__(self, arr1, arr2):
        self.arr1 = arr1
        self.arr2 = arr2

So it looks like C uses a factor of ~17 less memory than Numba, and Fortran a factor of ~3 less, using the array-of-structs technique.

The solution proposed by @nelson2005 will be the most memory-efficient in my case by far. It should only take four 4-byte integers (= 16 bytes) to store the indexes for one "struct" shown above. Therefore this solution should be even more efficient than an array of structs in C.

Edit:

It is possible to write a lower-memory struct in Fortran that is equivalent to the C struct in memory and usage:

! requires: use, intrinsic :: iso_c_binding, only: c_ptr
type :: Container_1
  integer :: n
  type(c_ptr) :: arr1
  type(c_ptr) :: arr2
end type

I'd think a single 4-byte int for the base offset plus two one-byte integers for the arr1, arr2 lengths should be sufficient, for a total of 6 bytes of indexes per struct.
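
A quick check of that layout (a sketch; the field names are made up):

import numpy as np

# unaligned (packed) structured dtypes have no padding by default
packed = np.dtype([('base', np.uint32), ('len1', np.uint8), ('len2', np.uint8)])
print(packed.itemsize)  # 6 bytes per struct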

I think the interface needn't be worse than the C interface… something like below should be pretty straightforward to use:

cls.set(i, j, arr1, arr2)
arr1, arr2 = cls.get(i, j)
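
A minimal sketch of what such a jitclass could look like (everything here is hypothetical: one flat storage array, per-cell offset/length tables, and cells written once in order):

import numpy as np
import numba as nb
from numba import types

@nb.experimental.jitclass([
    ('storage', types.double[:]),
    ('arr1_off', types.int64[:, :]),
    ('arr1_len', types.int64[:, :]),
    ('arr2_off', types.int64[:, :]),
    ('arr2_len', types.int64[:, :]),
    ('cursor', types.int64),
])
class RaggedContainer:
    def __init__(self, n, m, capacity):
        # one big allocation instead of n*m*2 small ones
        self.storage = np.empty(capacity, dtype=np.float64)
        self.arr1_off = np.zeros((n, m), dtype=np.int64)
        self.arr1_len = np.zeros((n, m), dtype=np.int64)
        self.arr2_off = np.zeros((n, m), dtype=np.int64)
        self.arr2_len = np.zeros((n, m), dtype=np.int64)
        self.cursor = 0

    def set(self, i, j, arr1, arr2):
        # record where each array lands, then copy it into the flat storage
        self.arr1_off[i, j] = self.cursor
        self.arr1_len[i, j] = arr1.size
        self.storage[self.cursor:self.cursor + arr1.size] = arr1
        self.cursor += arr1.size
        self.arr2_off[i, j] = self.cursor
        self.arr2_len[i, j] = arr2.size
        self.storage[self.cursor:self.cursor + arr2.size] = arr2
        self.cursor += arr2.size

    def get(self, i, j):
        # zero-copy views into the flat storage
        o1 = self.arr1_off[i, j]
        o2 = self.arr2_off[i, j]
        return (self.storage[o1:o1 + self.arr1_len[i, j]],
                self.storage[o2:o2 + self.arr2_len[i, j]])

Usage would be e.g. c = RaggedContainer(100, 100, 60_000); c.set(0, 0, np.ones(3), np.zeros(5)); a1, a2 = c.get(0, 0).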

I may have missed something in the above, but thought I'd check… my question is: is the memory use being attributed correctly?

Depending on which version of Numba is in use, at some point between importing Numba and there being an instance of a jitclass (or a typed.List), the library bindings to LLVM have to be loaded and initialised (these are not small). Then a lot of code has to be compiled to make the jitclass actually work (think accessors, constructors, destructors, boxing/unboxing code etc.), along with things like the Numba runtime, all of which adds to memory use. Only once all this is done is a jitclass/typed.List at a point where it can be instantiated. Perhaps it would be worth seeing what the memory cost of creating a one-element list with a single jitclass instance is, to make sure that the measured cost isn't also including Numba/LLVM/compiled objects etc.?
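
That check could look something like this (a sketch built from the reproducer above; compile first, then measure):

import numpy as np
import numba as nb
from numba import types
from memory_profiler import profile

@nb.experimental.jitclass([('arr1', types.double[:]), ('arr2', types.double[:])])
class Container_0:
    def __init__(self, arr1, arr2):
        self.arr1 = arr1
        self.arr2 = arr2

@nb.njit
def make_one():
    # one typed list holding a single jitclass instance
    lst = nb.typed.List()
    lst.append(Container_0(np.zeros(1), np.zeros(1)))
    return lst

@profile
def main():
    lst = make_one()

if __name__ == "__main__":
    make_one()  # compile first, so JIT/LLVM costs are excluded from main()
    main()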

As @DannyWeitekamp noted above, there is some additional information associated with the Numba equivalents of many data structures, but the difference is usually something like a couple of void* members or similar, so I wouldn't expect it to cause the extreme differences noted above.

Hope this helps?

I think I'm measuring the memory correctly.

  • A Numba typed list of int64 seems to use 8 bytes per element.
  • A Numba typed list of a jitclass containing only a single int64 seems to take about 64 bytes per element.
  • A Numba typed list of a jitclass containing a zero-length 1-D array of doubles seems to take about 200 bytes per element.

So there is a little extra memory baggage with jitclass, but it's not as terrible as I was originally thinking.
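
For reference, the per-element numbers above come from measurements along these lines (a sketch; run with python -m memory_profiler overhead_check.py):

import numpy as np
import numba as nb
from numba import types
from memory_profiler import profile

@nb.experimental.jitclass([('arr1', types.double[:])])
class OneArray:
    def __init__(self, arr1):
        self.arr1 = arr1

@nb.njit
def build(n):
    lst = nb.typed.List()
    for _ in range(n):
        lst.append(OneArray(np.empty(0)))  # zero-length 1-D array of doubles
    return lst

@profile
def main():
    n = 1_000_000
    lst = build(n)  # memory increment on this line / n = bytes per element

if __name__ == "__main__":
    build(1)  # compile first so JIT overhead is excluded
    main()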