Programming languages come with sorting functions by default. We can often do much better. For example, Downs has showed that radix sort can greatly surpass default sort functions in C++. Radix sort is you friend if you want to sort large arrays of integers.

What about Java? Richard Startin and Gareth Andrew Lloyd have been working hard to improve the sorting function used inside the RoaringBitmap library. Though we use a custom radix sort function, it is not difficult to make it more generic, so that it can sort any array of integers. I came up with the following code:

public static void radixSort(int[] data) { int[] copy = new int[data.length]; int[] level0 = new int[257]; int[] level1 = new int[257]; int[] level2 = new int[257]; int[] level3 = new int[257]; for (int value : data) { value -= Integer.MIN_VALUE; level0[(value & 0xFF) + 1] ++; level1[((value >>> 8) & 0xFF) + 1] ++; level2[((value >>> 16) & 0xFF) + 1] ++; level3[((value >>> 24) & 0xFF) + 1] ++; } for (int i = 1; i < level0.length; ++i) { level0[i] += level0[i - 1]; level1[i] += level1[i - 1]; level2[i] += level2[i - 1]; level3[i] += level3[i - 1]; } for (int value : data) { copy[level0[(value - Integer.MIN_VALUE) & 0xFF]++] = value; } for (int value : copy) { data[level1[((value - Integer.MIN_VALUE)>>>8) & 0xFF]++] = value; } for (int value : data) { copy[level2[((value - Integer.MIN_VALUE)>>>16) & 0xFF]++] = value; } for (int value : copy) { data[level3[((value - Integer.MIN_VALUE)>>>24) & 0xFF]++] = value; } }

It is about as unsophisticated as it looks. We compute four histograms, one per byte in an integer: Java stores integers using 4-byte words. Then we do 4 passes through the data. We could make it more sophisticated by examining the histogram: if the higher-level histograms are trivial, we can skip some passes. We could extend it to Java longs though we would then need 4 extra passes. It is also possible to generalize to floating-point numbers.

The strange subtraction with MIN_VALUE are to accommodate the fact that Java has signed integers (positive and negative) under a two complement’s format.

Let us compare it against the default `Arrays.sort` function in Java. We want to sort 1 million integers, generated uniformly at random. Using Java 8 on an Apple M1 processor, we get that RadixSort is ten times faster than `Arrays.sort`.

Arrays.sort | 60 ms |

RadixSort | 5 ms |

There are some caveats. The radix sort function is likely to use more memory. Furthermore, the results are sensitive to the input data (both its size and its distribution). Nevertheless, for some systems, radix sort can be a net win.

FWIW for large data arrays I’d recommend a 3-pass radix sort (using 11, 11 and 10 bits of the input values). This requires a bit more space, with the histogram/offset array adding up to 20 KB but that still fits within L1 reasonably comfortably and as such usually wins significantly on moderately large inputs.

At least with a C++ LSD radix sort, I found 8 bits generally better than 11 bits: performance started dropping off heavily right after 8 bits as there are too many buckets to be easily held in the L1 cache (typically has 512 cache lines on x86) and the L1 TLB.

Of course, this depends a lot on the hardware, the size of the data being sorted, and other details of the implementation (e.g., if there is an intermediate buffering step where smaller buckets are accumulated before being copied out to the final buckets, larger radices may perform better).

Especially, it depends on the data distribution: if many fewer than available the

`2^radix`

buckets are actually used, larger radices are better since the caching penalty is reduced or eliminated. In principle, one could look at the histogram and try to pick the radix dynamically based on what’s likely to be good for that distribution.I want to revoke my use of “generally better” in the first sentence. I should really say “I found 8 bits better in my specific scenario of sorting uniformly random values”.

This is my experience also with 11-bit radix. It was surprisingly bad (on random input). I’m surprised to see it recommended, honestly.

I think you’re better off using 8-bits and relying on column-skipping to shave off (likely high-order bits) for when the data isn’t random.

You’re right; for arrays of random integers indeed 4-pass sort seems to often win by a bit. I need to update my priors; my data was mostly based on experience with in-order Cell systems, and I haven’t rebenchmarked this specific problem on modern Intel chips until now…

fwiw source https://gist.github.com/zeux/2e4edeef8a09ee0bfab3ec76858aaec1

FWIW I’ve ran the benchmark from https://gist.github.com/zeux/2e4edeef8a09ee0bfab3ec76858aaec1 on an AMD Ryzen 5900X and I am getting different results from Intel – on Intel 4-pass is winning by a fair margin, but on Ryzen 3-pass is noticeably faster on sorting ints (and still a touch slower on floats but the difference isn’t very significant).

It’s possible to save some memory and only keep two levels’ histograms in memory at one time by creating the next level’s histogram as you’re distributing the current one.

But this probably isn’t worth it. Each histogram requires only 1KiB (it’s not hard to adjust the prefix sum loop to get rid of the 257th element), but you need 256 cache lines (16 KiB) for the active head of each bucket.

That’s probably why going to 11 bits is a disaster; a typical 64K L1D only has 1024 lines. So randomly writing to 2048 buffers is going to thrash the hell out of it.