Hacker News
Efficient Jagged Arrays (zeux.io)
59 points by ingve on July 3, 2023 | hide | past | favorite | 16 comments


This representation of jagged arrays is literally the "Compressed Sparse Row" (CSR) format for sparse matrices [1]:

- a single array of floats with all the nonzero matrix entries, ordered row-by-row:

  double values[nnz];
- a single array of ints with the corresponding column index:

  int columns[nnz];
- a single array of ints, of length (m+1), with the start of each row in the two other arrays (the (m+1)th element stores nnz, which will be useful later):

  int start[m + 1];
So to print a whole matrix, you do this:

  for (int i = 0; i < m; i++) {
    int index = start[i];                // where row i begins
    int count = start[i + 1] - start[i]; // number of nonzeros in row i

    for (int t = 0; t < count; t++) {
      int j = columns[index + t];
      double v = values[index + t];
      printf("element (%d, %d) has value %g\n", i, j, v);
    }
  }
To a modern programmer, this may look very old-fashioned (Fortran-ish), but I have tried many alternatives (like grouping columns and values together in a struct, or using pointers), and in my numerical codes, nothing gets even close in terms of performance. It must be said that CSR and CSC (its transpose) are very compact in memory, which is often a bottleneck. Also, since so many sparse numerical codes in benchmarks have used this representation for decades, it could be that CPU designers make sure that it works well.

[1] https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_spars...


Thanks for this comment, was super interesting! Apparently Matlab uses a similar format (Compressed Sparse Column) to represent sparse matrices.


Have you tried using precomputed byte offsets rather than indices for all the integers? Those would definitely speed up the non-power-of-2-sized case, but I'd be curious if they might speed this up a little bit too, since they should remove some bit shifting instructions.


You often need the indices for other operations (indexing into other arrays). CSC (the transpose of CSR) and the doubly compressed forms DCSC/DCSR are, for the most part, optimized for linear algebra.


A very nice (and fast) C implementation is CSparse (now part of SuiteSparse): https://people.engr.tamu.edu/davis/suitesparse.html


I’m fairly sure anecdotes like this inform the conversation about structs of arrays versus arrays of structs.

There are a couple of languages that let you treat them the same way. Having list comprehensions (plural) over multidimensional data would be handy.


How do you insert elements at the end of a row?


There's a whole ecosystem in Python, originally developed for high-energy-physics data processing, all because NumPy demands rectangular N-dimensional arrays: https://github.com/scikit-hep/awkward

Same technique used everywhere, here's a simple Julia pkg for the same thing: https://github.com/JuliaArrays/ArraysOfArrays.jl/blob/3a6f5b...

But Julia at least has the decency to just support ragged Vector{Vector} out of the box, and it's not that slow


Nice post.

Personally I find the way rows are added in this post a bit confusing. It's done to illustrate a data source where the rows of the jagged array are not filled sequentially. That's reasonable for the purpose of the post, but for someone not familiar with jagged arrays, it can be confusing. A more traditional way of building a jagged array is as follows:

    struct JaggedArray<T> {
        offsets: Vec<u32>, // starts as vec![0]
        data: Vec<T>,
    }
    // impl<T> JaggedArray<T>
    fn add_row(&mut self, row: Vec<T>) {
        self.data.extend(row);
        self.offsets.push(self.data.len() as u32);
    }
    fn get_row(&self, row: usize) -> &[T] {
        let start = self.offsets[row] as usize;
        let end = self.offsets[row + 1] as usize;
        &self.data[start..end]
    }
Now this could be improved to reduce allocations, but it's fine as an illustration.

I personally prefer to store only the end offsets in the 'offsets' array, since the first offset is always zero. With this, the 'offsets' array has exactly as many elements as the jagged array has rows, which I find elegant:

    fn get_row(&self, row: usize) -> &[T] {
        let start = if row == 0 { 0 } else { self.offsets[row - 1] as usize };
        let end = self.offsets[row] as usize;
        &self.data[start..end]
    }
Jagged arrays are super useful as lists of lists, with better perf characteristics than the naive Iliffe vector. I've used them extensively lately. It's also extra fun when you combine them with a bitset.

With a vector as backing storage, jagged arrays are also not limited to a fixed size! You can trivially add new rows or extend the last row. This comes in handy in a parsing context.


A perfect song to pair with this article is "Jagged Roots", by British electronic trio "Ivy Lab"[1]

1. https://www.youtube.com/watch?v=xJp4MB9PK8A


The author listed some reasons why using jagged arrays can be bad, all having something to do with std::vector.

Why not use raw pointers and plain arrays like in good old C, since the concern here is efficiency, not memory safety?


in C++

This is the first time I've heard the term "jagged arrays." Is this common parlance and I'm just a hermit?


It is, although at both schools I went to "ragged arrays" was the more common term.

https://en.wikipedia.org/wiki/Jagged_array

I think to some degree it's fallen out of favor as this is the only kind of array you will ever see if you stick to, say, bootcamp-level Python and JavaScript.


It's becoming common. You can also hear "jagged tensors". Alternatively "ragged".


It's common parlance in C# where the language has different syntax for dense and jagged arrays (i.e., A[i,j,k] vs A[i][j][k]).


I've heard the term before, but I have never needed to use jagged arrays.



