Best Compression algorithm for a sequence of integers

Algorithm Problem Overview

I have a large array with a range of integers that are mostly continuous, eg 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm but that gives me only 50% compression. Note that the algorithm cannot be lossy.

All numbers are unique and progressively increasing.

Also if you can point me to the java implementation of such algorithm that would be great.

Algorithm Solutions

Solution 1 - Algorithm

We have written recent research papers that survey the best schemes for this problem. Please see:

Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization,Software: Practice & Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137

Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience (to appear) http://arxiv.org/abs/1401.6399

They include an extensive experimental evaluation.

You can find a complete implementation of all techniques in C++11 online: https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection

There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte

If you prefer Java, please see https://github.com/lemire/JavaFastPFOR

Solution 2 - Algorithm

First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). This should in your case give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.

This is how the PNG format does to improve its compression (it does one of several difference methods followed by the same compression algorithm used by gzip).

Solution 3 - Algorithm

Well, i'm voting for smarter way. All you have to store is [int:startnumber][int/byte/whatever:number of iterations] in this case, you'll turn your example array into 4xInt value. After it you can compress as you want :)

Solution 4 - Algorithm

While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off the shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression rates for a sequence of one million consecutive integers:

None        1.0
Deflate     0.50
Filtered    0.34
BZip2       0.11
Lzma        0.06

Solution 5 - Algorithm

What size are the numbers? In addition to the other answers, you could consider base-128 variant-length encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - this is described here.

Combine this with the other techniques so you are storing "skip size", "take size", "skip size", "take size" - but noting that neither "skip" nor "take" will ever be zero, so we'll subtract one from each (which lets you save an extra byte for a handful of values)

So:

1-100, 110-160

is "skip 1" (assume start at zero as it makes things easier), "take 100", "skip 9", "take 51"; subtract 1 from each, giving (as decimals)

0,99,8,50

which encodes as (hex):

00 63 08 32

If we wanted to skip/take a larger number - 300, for example; we subtract 1 giving 299 - but that goes over 7 bits; starting with the little end, we encode blocks of 7 bits and an MSB to indicate continuation:

299 = 100101100 = (in blocks of 7): 0000010 0101100

so starting with the little end:

1 0101100 (leading one since continuation)
0 0000010 (leading zero as no more)

giving:

AC 02

So we can encode large numbers easily, but small numbers (which sound typical for skip/take) take less space.

You could try running this through "deflate", but it might not help much more...

If you don't want to deal with all that messy encoding cruff yourself... if you can create the integer-array of the values (0,99,8,60) - you could use protocol buffers with a packed repeated uint32/uint64 - and it'll do all the work for you ;-p

I don't "do" Java, but here's a full C# implementation (borrowing some of the encoding bits from my protobuf-net project):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
static class Program
{
    static void Main()
    {
        var data = new List<int>();
        data.AddRange(Enumerable.Range(1, 100));
        data.AddRange(Enumerable.Range(110, 51));
        int[] arr = data.ToArray(), arr2;

        using (MemoryStream ms = new MemoryStream())
        {
            Encode(ms, arr);
            ShowRaw(ms.GetBuffer(), (int)ms.Length);
            ms.Position = 0; // rewind to read it...
            arr2 = Decode(ms);
        }
    }
    static void ShowRaw(byte[] buffer, int len)
    {
        for (int i = 0; i < len; i++)
        {
            Console.Write(buffer[i].ToString("X2"));
        }
        Console.WriteLine();
    }
    static int[] Decode(Stream stream)
    {
        var list = new List<int>();
        uint skip, take;
        int last = 0;
        while (TryDecodeUInt32(stream, out skip)
            && TryDecodeUInt32(stream, out take))
        {
            last += (int)skip+1;
            for(uint i = 0 ; i <= take ; i++) {
                list.Add(last++);
            }
        }
        return list.ToArray();
    }
    static int Encode(Stream stream, int[] data)
    {
        if (data.Length == 0) return 0;
        byte[] buffer = new byte[10];
        int last = -1, len = 0;
        for (int i = 0; i < data.Length; )
        {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            len += EncodeUInt32((uint)gap, buffer, stream)
                + EncodeUInt32((uint)size, buffer, stream);
        }
        return len;
    }
    public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
    {
        int count = 0, index = 0;
        do
        {
            buffer[index++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
            count++;
        } while (value != 0);
        buffer[index - 1] &= 0x7F;
        stream.Write(buffer, 0, count);
        return count;
    }
    public static bool TryDecodeUInt32(Stream source, out uint value)
    {
        int b = source.ReadByte();
        if (b < 0)
        {
            value = 0;
            return false;
        }

        if ((b & 0x80) == 0)
        {
            // single-byte
            value = (uint)b;
            return true;
        }

        int shift = 7;

        value = (uint)(b & 0x7F);
        bool keepGoing;
        int i = 0;
        do
        {
            b = source.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            i++;
            keepGoing = (b & 0x80) != 0;
            value |= ((uint)(b & 0x7F)) << shift;
            shift += 7;
        } while (keepGoing && i < 4);
        if (keepGoing && i == 4)
        {
            throw new OverflowException();
        }
        return true;
    }
}

Solution 6 - Algorithm

TurboPFor: Fastest Integer Compression

for C/C++ including Java Critical Natives/JNI Interface
SIMD accelerated integer compression
Scalar + Integrated (SIMD) differential/Zigzag encoding/decoding for sorted/unsorted integer lists
Full range 8/16/32/64 bits interger lists
Direct access
Benchmark app

Solution 7 - Algorithm

compress the string "1-100, 110-160" or store the string in some binary representation and parse it to restore the array

Solution 8 - Algorithm

I'd combine the answers given by CesarB and Fernando Miguélez.

First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.

Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.

Solution 9 - Algorithm

In addition to the other solutions:

You could find "dense" areas and use a bitmap to store them.

So for example:

If you have 1000 numbers in 400 ranges between 1000-3000, you could use a single bit to denote the existence of a number and two ints to denote the range. Total storage for this range is 2000 bits + 2 ints, so you can store that info in 254bytes, which is pretty awesome since even short integers will take up two bytes each, so for this example you get 7X savings.

The denser the areas the better this algorithm will do, but at some point just storing start and finish will be cheaper.

Solution 10 - Algorithm

I'd suggest taking a look at Huffman Coding, a special case of Arithmetic Coding. In both cases you analyse your starting sequence to determine the relative frequencies of different values. More-frequently-occurring values are encoded with fewer bits than the less-frequently-occurring ones.

Solution 11 - Algorithm

I know this is an old message thread, but I am including my personal PHP test of the SKIP/TAKE idea I found here. I'm calling mine STEP(+)/SPAN(-). Perhaps someone might find it helpful.

NOTE: I implemented the ability to allow duplicate integers as well as negative integers even though the original question involved positive, non-duplicated integers. Feel free to tweak it if you want to try and shave a byte or two.

CODE:

  // $integers_array can contain any integers; no floating point, please. Duplicates okay.
  $integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73, -35,
  					68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88,
  					113, -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24,
  					75, 64, 127, 109, 38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13];

  // Order from least to greatest... This routine does NOT save original order of integers.
  sort($integers_array, SORT_NUMERIC); 

  // Start with the least value... NOTE: This removes the first value from the array.
  $start = $current = array_shift($integers_array);    

  // This caps the end of the array, so we can easily get the last step or span value.
  array_push($integers_array, $start - 1);
  
  // Create the compressed array...
  $compressed_array = [$start];
  foreach ($integers_array as $next_value) {
  	// Range of $current to $next_value is our "skip" range. I call it a "step" instead.
  	$step = $next_value - $current;
  	if ($step == 1) {
		// Took a single step, wait to find the end of a series of seqential numbers.
  		$current = $next_value;
	} else {
		// Range of $start to $current is our "take" range. I call it a "span" instead.
		$span = $current - $start;
		// If $span is positive, use "negative" to identify these as sequential numbers. 
		if ($span > 0) array_push($compressed_array, -$span);
		// If $step is positive, move forward. If $step is zero, the number is duplicate.
		if ($step >= 0) array_push($compressed_array, $step);
		// In any case, we are resetting our start of potentialy sequential numbers.
  		$start = $current = $next_value;
  	}
  }
  
  // OPTIONAL: The following code attempts to compress things further in a variety of ways.
  
  // A quick check to see what pack size we can use.
  $largest_integer = max(max($compressed_array),-min($compressed_array));
  if ($largest_integer < pow(2,7)) $pack_size = 'c';
  elseif ($largest_integer < pow(2,15)) $pack_size = 's';
  elseif ($largest_integer < pow(2,31)) $pack_size = 'l';
  elseif ($largest_integer < pow(2,63)) $pack_size = 'q';
  else die('Too freaking large, try something else!');

  // NOTE: I did not implement the MSB feature mentioned by Marc Gravell.
  // I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it.
  $packed_string = $pack_size;

  // Save compressed array to compressed string and binary packed string.
  $compressed_string = '';
  foreach ($compressed_array as $value) {
	  $compressed_string .= ($value < 0) ? $value : '+'.$value;
	  $packed_string .= pack($pack_size, $value);
  }

  // We can possibly compress it more with gzip if there are lots of similar values.	  
  $gz_string = gzcompress($packed_string);

  // These were all just size tests I left in for you.
  $base64_string = base64_encode($packed_string);
  $gz64_string = base64_encode($gz_string);
  $compressed_string = trim($compressed_string,'+');  // Don't need leading '+'.
  echo "<hr>\nOriginal Array has "
  	.count($integers_array)
  	.' elements: {not showing, since I modified the original array directly}';
  echo "<br>\nCompressed Array has "
  	.count($compressed_array).' elements: '
  	.implode(', ',$compressed_array);
  echo "<br>\nCompressed String has "
  	.strlen($compressed_string).' characters: '
  	.$compressed_string;
  echo "<br>\nPacked String has "
  	.strlen($packed_string).' (some probably not printable) characters: '
  	.$packed_string;
  echo "<br>\nBase64 String has "
  	.strlen($base64_string).' (all printable) characters: '
  	.$base64_string;
  echo "<br>\nGZipped String has "
  	.strlen($gz_string).' (some probably not printable) characters: '
  	.$gz_string;
  echo "<br>\nBase64 of GZipped String has "
  	.strlen($gz64_string).' (all printable) characters: '
  	.$gz64_string;

  // NOTICE: The following code reverses the process, starting form the $compressed_array.
  	
  // The first value is always the starting value.
  $current_value = array_shift($compressed_array);
  $uncompressed_array = [$current_value];
  foreach ($compressed_array as $val) {
  	if ($val < -1) {
  	  // For ranges that span more than two values, we have to fill in the values.
	  $range = range($current_value + 1, $current_value - $val - 1);
	  $uncompressed_array = array_merge($uncompressed_array, $range);
  	}
  	// Add the step value to the $current_value
	$current_value += abs($val); 
  	// Add the newly-determined $current_value to our list. If $val==0, it is a repeat!
  	array_push($uncompressed_array, $current_value);	  
  }

  // Display the uncompressed array.
  echo "<hr>Reconstituted Array has "
  	.count($uncompressed_array).' elements: '
  	.implode(', ',$uncompressed_array).
  	'<hr>';

OUTPUT:

--------------------------------------------------------------------------------
Original Array has 63 elements: {not showing, since I modified the original array directly}
Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4
Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4
Packed String has 54 (some probably not printable) characters: cØÿÿÿÿ ÿõ ÿÿÿÿÿÿ
Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE
GZipped String has 53 (some probably not printable) characters: xœ Ê» ÑÈÎ¯€)YšE¨MŠ“^qçºR¬m&Òõ‹%Ê&TFÊ‰ùÀ6ÿÁÁ Æ
Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY=
--------------------------------------------------------------------------------
Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142
--------------------------------------------------------------------------------

Solution 12 - Algorithm

The basic idea you should probably use is, for each range of consecutive integers (I will call these ranges), to store the starting number and the size of the range. For example, if you have a list of 1000 integers, but there are only 10 separate ranges, you can store a mere 20 integers (1 start number and 1 size for each range) to represent this data which would be a compression rate of 98%. Fortunately, there are some more optimizations you can make which will help with cases where the number of ranges is larger.

Store the offset of the starting number relative to the previous starting number, rather than the starting number itself. The advantage here is that the numbers you store will generally require less bits (this may come in handy in later optimization suggestions). Additionally, if you only stored the starting numbers, these numbers would all be unique, while storing the offset gives a chance that the numbers are close or even repeat which may allow for even further compression with another method being applied after.
Use the minimum number of bits possible for both types of integers. You can iterate over the numbers to obtain the largest offset of a starting integer as well as the size of the largest range. You can then use a datatype that most efficiently stores these integers and simply specify the datatype or number of bits at the start of the compressed data. For example, if the largest offset of a starting integer is only 12,000, and the largest range is 9,000 long, then you can use a 2 byte unsigned integer for all of these. You could then cram the pair 2,2 at the start of the compressed data to show that 2 bytes is used for both integers. Of course you can fit this information into a single byte using a little bit of bit manipulation. If you are comfortable with doing a lot of heavy bit manipulation you could store each number as the minimum possible amount of bits rather than conforming to 1, 2, 4, or 8 byte representations.

With those two optimizations lets look at a couple of examples (each is 4,000 bytes):

1,000 integers, biggest offset is 500, 10 ranges
1,000 integers, biggest offset is 100, 50 ranges
1,000 integers, biggest offset is 50, 100 ranges

WITHOUT OPTIMIZATIONS

20 integers, 4 bytes each = 80 bytes. COMPRESSION = 98%
100 integers, 4 bytes each = 400 bytes. COMPRESSION = 90%
200 integers, 4 bytes each = 800 bytes. COMPRESSION = 80%

WITH OPTIMIZATIONS

1 byte header + 20 numbers, 1 byte each = 21 bytes. COMPRESSION = 99.475%
1 byte header + 100 numbers, 1 byte each = 101 bytes. COMPRESSION = 97.475%
1 byte header + 200 numbers, 1 byte each = 201 bytes. COMPRESSION = 94.975%

Solution 13 - Algorithm

Your case is very similar to compression of indices in search engines. The popular compression algorithm used is the PForDelta algorithm and Simple16 algorithm. You can use the kamikaze library for your compression needs.

Solution 14 - Algorithm

I couldn't get my compression to be much better than about .11 for this. I generated my test data via python interpreter and it's a newline delimited list of integers from 1-100, and 110-160. I use the actual program as a compressed representation of the data. My compressed file is as follows:

main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]

It's just a Haskell script that generates the the file you can run via:

$ runhaskell generator.hs >> data

The size of the g.hs file is 54 bytes, and the python generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or you could compress the compressed file (i.e. the haskell file).

One other approach might be to save 4 bytes of data. The min and max of each sequence, then give those to a generating function. Albeit, the loading of files will add more characters to the decompresser adding more complexity and more bytes to the decompresser. Again, I represented this very specific sequence via a program and it doesn't generalize, it's a compression that's specific to the data. Furthermore, adding generality makes the decompresser larger.

Another concern is that one must have the Haskell interpreter to run this. When I compiled the program it made it much larger. I don't really know why. There is the same problem with python, so maybe the best approach is to give the ranges, so that a some program could decompress the file.

Solution 15 - Algorithm

If you have series of repeated values RLE is the easiest to implement and could give you a good result. Nontheless other more advanced algorithms that take into account the entrophy such as LZW, which is now patent-free, can usually achive a much better compression.

You can take a look at these and other lossless algorithms here.

Solution 16 - Algorithm

Compress data with Blockchain

End of 2020, the local NAS is annoying again because the storage space is running out. Shit! The raid has been deactivated for almost a year and compression with zlib, zip or xz is no alternative. A new network storage is needed, the last network storage!

Eleven months later, a good dozen of PoC, the beN algorithm is near. A modified blockchain algorithm, Micro-Blockchain, not designed to mine Bitcoin or Cryptocurrencies but to compress data. No algorithm for single-core or multi-core computers, many-core capable, GPU says hello!

Technical details: The beN algorithm enables recursively an almost unlimited compression of any data stream. Based on established technologies such as modified blockchain, sorting algorithms and hash collisions calculation, lossy becomes lossless compression. The GPGPU and the parallel execution of hundreds and thousands of calculations allow an algorithm that would have taken an unacceptable amount of time just a few years ago. As a recursive algorithm, beN is also capable of streaming and allows you to compress and decompress a data stream from the beginning, even with highly compressed data, without knowing the entire data. Also tables as with Huffman based compressions are not needed.

Content Type	Original Author	Original Content on Stackoverflow
Question	pdeva	View Question on Stackoverflow
Solution 1 - Algorithm	Daniel Lemire	View Answer on Stackoverflow
Solution 2 - Algorithm	CesarB	View Answer on Stackoverflow
Solution 3 - Algorithm	Tamir	View Answer on Stackoverflow
Solution 4 - Algorithm	brianegge	View Answer on Stackoverflow
Solution 5 - Algorithm	Marc Gravell	View Answer on Stackoverflow
Solution 6 - Algorithm	powturbo	View Answer on Stackoverflow
Solution 7 - Algorithm	Ray Tayek	View Answer on Stackoverflow
Solution 8 - Algorithm	Michael Dorfman	View Answer on Stackoverflow
Solution 9 - Algorithm	Sam Saffron	View Answer on Stackoverflow
Solution 10 - Algorithm	Martin	View Answer on Stackoverflow
Solution 11 - Algorithm	Deen Foxx	View Answer on Stackoverflow
Solution 12 - Algorithm	MahlerFive	View Answer on Stackoverflow
Solution 13 - Algorithm	Arun Iyer	View Answer on Stackoverflow
Solution 14 - Algorithm	Antithesis	View Answer on Stackoverflow
Solution 15 - Algorithm	Fernando Miguélez	View Answer on Stackoverflow
Solution 16 - Algorithm	Sébastian	View Answer on Stackoverflow