Comparing various compression algorithms for a particular project

We are building a hobby search engine for a blog platform, and one challenging problem is fitting the whole search index and archival dataset into a single server node.

There are 1.3 billion posts and comments, and the whole uncompressed dataset takes about 2 TB.
We generate *many* indexes to make very tricky searches fast: we cannot afford to wait 20+ seconds to find all comments Alice made in Bob's blog, which is how long Sphinx takes to solve this task by enumerating all comments and filtering them by author/blog fields. Instead, we have to answer in a fraction of a second.

Our algorithms produce about 3x as much uncompressed data as the input, and it generally has to be compressed. One of those 3x is the original text content (slightly reduced by language filtering, dropped HTML markup and so on); the rest is a set of binary indexes whose content is essentially timestamp-like ids. Although these ids are monotonically increasing, they are not fixed-interval timestamps, so we cannot use Facebook's Gorilla without serious modifications.
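To illustrate why such monotonic id streams are tempting to compress specially, here is a minimal delta + varint sketch. This is not the Greylock index format, just an assumption-laden toy (the `pack()` helper and the sample ids are hypothetical) showing that consecutive ids collapse into a few small bytes each:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Delta + LEB128-style varint sketch for monotonically increasing ids.
// Consecutive ids differ by small numbers, and small numbers need few bytes.
std::vector<uint8_t> pack(const std::vector<uint64_t> &ids) {
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    for (uint64_t id : ids) {
        uint64_t d = id - prev;                   // non-negative: ids are sorted
        prev = id;
        while (d >= 0x80) {                       // 7 payload bits per byte,
            out.push_back(0x80 | (d & 0x7f));     // high bit = "more bytes follow"
            d >>= 7;
        }
        out.push_back(static_cast<uint8_t>(d));
    }
    return out;
}

int main() {
    // Hypothetical timestamp-like ids: only the first delta is large.
    std::vector<uint64_t> ids = {1500000000001ULL, 1500000000007ULL, 1500000000042ULL};
    printf("%zu ids -> %zu bytes\n", ids.size(), pack(ids).size());
}
```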

As a test we decided to check how well common compression algorithms handle this data. We tried zlib, snappy, lz4, lz4hc and zstd. The test dataset was about 220 MB uncompressed; we measured the time our tool needed to index this text with exactly the same settings. It produced about 660 MB of data, which was compressed at the storage level during the test, and we recorded the size of the final database. We did not test bz2, since its compression/decompression time is much higher, which is unacceptable for realtime use.
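For reference, a self-contained sketch of this kind of measurement using the codecs' one-shot C APIs could look roughly like the following. In the real test compression happened inside the storage engine while indexing, not in a loop like this; the input buffer, compression levels and buffer sizes here are arbitrary:

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

#include <lz4.h>
#include <lz4hc.h>
#include <zlib.h>
#include <zstd.h>

// Measure compressed size and wall-clock time for one buffer with several codecs.
struct Result { const char *name; size_t size; double ms; };

template <typename F>
Result run(const char *name, const std::string &in, F codec) {
    auto t0 = std::chrono::steady_clock::now();
    size_t out = codec(in);
    auto t1 = std::chrono::steady_clock::now();
    return {name, out, std::chrono::duration<double, std::milli>(t1 - t0).count()};
}

int main() {
    std::string in(1 << 20, 'x');   // replace with real index/text data
    std::vector<char> buf(ZSTD_compressBound(in.size()) + (1 << 16));

    std::vector<Result> results = {
        run("zlib", in, [&](const std::string &s) {
            uLongf n = buf.size();
            compress2((Bytef *)buf.data(), &n, (const Bytef *)s.data(), s.size(), 6);
            return (size_t)n;
        }),
        run("lz4", in, [&](const std::string &s) {
            return (size_t)LZ4_compress_default(s.data(), buf.data(), s.size(), buf.size());
        }),
        run("lz4hc", in, [&](const std::string &s) {
            return (size_t)LZ4_compress_HC(s.data(), buf.data(), s.size(), buf.size(), 9);
        }),
        run("zstd", in, [&](const std::string &s) {
            return ZSTD_compress(buf.data(), buf.size(), s.data(), s.size(), 3);
        }),
    };
    for (const auto &r : results)
        printf("%-6s %8zu bytes  %.2f ms\n", r.name, r.size, r.ms);
}
```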

Here are the results
[chart: index compaction time and final database size for each compression algorithm]

It is a bit surprising that zstd is comparable in speed with zlib, yet the final database takes about 10% more space.

Arithmetic coding vs common compression

Arithmetic coding is a type of lossless compression where more frequent bytes are encoded with fewer bits than less frequent ones.

Arithmetic coding is commonly conflated with Huffman coding, but they are not the same. Huffman coding assigns each symbol a whole number of bits, so it can only approach the Shannon limit (the theoretical compression limit for given data) by grouping symbols into larger and larger blocks, which quickly becomes impractical. Arithmetic coding, which transforms the whole string into a packed binary stream at once, effectively spends a fractional number of bits per symbol and approaches that limit much faster.
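To make the "Shannon limit" concrete: for a memoryless (order-0) model it is just the entropy of the byte histogram, which no entropy coder, Huffman, arithmetic or rANS, can beat without exploiting correlations between bytes. A quick sketch of that estimate (the sample input is arbitrary):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Order-0 Shannon entropy of a byte stream: H = -sum p_i * log2(p_i) bits/byte.
// This is the floor for any per-byte entropy coder on this data.
double shannon_bits_per_byte(const std::string &data) {
    std::vector<uint64_t> count(256, 0);
    for (unsigned char c : data)
        ++count[c];
    double h = 0;
    for (uint64_t n : count) {
        if (!n) continue;
        double p = (double)n / data.size();
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    std::string s = "abracadabra abracadabra abracadabra";
    double h = shannon_bits_per_byte(s);
    printf("entropy: %.3f bits/byte, limit: %.1f bytes (input %zu bytes)\n",
           h, h * s.size() / 8.0, s.size());
}
```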

We work on the Greylock search engine, where a major task is to implement a full-text search index with as small a disk footprint as possible. In our tests, 200k uncompressed input events occupy about 220 MB of disk space, and the whole index takes close to 450 MB. The index includes the original content, the common index and a per-author index (which basically doubles the original index). Elasticsearch with morphology takes about 539 MB, without morphology about 317 MB; Sphinx (plus MySQL to store the original content) is about 250 MB, but that index is heavily truncated: there is still a notion of stop-words, and morphology is broken.

We compress our binary index entries as well as the original content using lz4hc or gzip, but I wondered whether arithmetic coding could do better. I skipped a Huffman encoder because of its theoretical problems approaching the Shannon limit (not to mention practical issues), and decided to check the rANS entropy coder – a range variant of Asymmetric Numeral Systems which combines the speed of Huffman coding with the compression ratio of arithmetic coding. Well, at least that's what its author claims in the paper.
Here is an image from the paper showing the Huffman coder approaching the Shannon limit.
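Before the numbers, the state transition at the heart of rANS is small enough to show in full. Below is a toy, non-streaming sketch (no renormalization, a hypothetical two-symbol alphabet, the state kept in a single 64-bit integer); it is not the coder used in the benchmark:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy rANS: the whole message is folded into one integer state x.
// Encoding symbol s with frequency freq[s] out of a total M grows x by
// roughly log2(M / freq[s]) bits, i.e. frequent symbols cost fractional bits.
int main() {
    const uint64_t freq[2] = {3, 1};   // hypothetical alphabet: 'a' is 3x likelier than 'b'
    const uint64_t cum[2]  = {0, 3};   // cumulative frequencies
    const uint64_t M = 4;              // total frequency

    std::vector<int> msg = {0, 0, 1, 0, 0, 0, 1, 0, 0, 0};   // mostly 'a'

    // Encode: x' = floor(x / freq[s]) * M + cum[s] + (x mod freq[s])
    uint64_t x = 1;
    for (int s : msg)
        x = (x / freq[s]) * M + cum[s] + (x % freq[s]);
    printf("state after %zu symbols: %llu\n", msg.size(), (unsigned long long)x);

    // Decode (symbols come out in reverse): recover s from x mod M, then invert the step.
    std::vector<int> out;
    for (size_t i = 0; i < msg.size(); ++i) {
        uint64_t r = x % M;
        int s = (r < cum[1]) ? 0 : 1;
        out.push_back(s);
        x = freq[s] * (x / M) + r - cum[s];
    }
    printf("round-trip ok: %d\n", std::equal(out.rbegin(), out.rend(), msg.begin()));
}
```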

It turned out to be worse than any common compression algorithm, both for binary and for text data.

Text data from original library: 768771 bytes
bz2: 232598
gzip: 312281
lz4: 363236
rans: 435113

Binary data from our index: 9678
bz2: 3652
gzip: 3665
lz4: 5262
rans: 7795

We are not dropping the idea of using arithmetic coding to pack binary data, since Dropbox showed with its Lepton project that JPEG images can be losslessly compressed by about 20% using clever sorting and prediction of DCT indices (which accounts for about 3-4% of that 20%) plus VP8 arithmetic coding of those indices.

A naive comparison of Lepton and rANS yields a dramatic difference:
Original JPEG file size: 71517
Lepton encoded size: 55700
gzip: 71171
bz2: 71586
lz4: 71414
rANS encoded size: 71336

This is of course an unfair comparison, since Lepton does not encode the binary stream of the original image, but specially crafted DCT indices.

Since our indexes are basically vectors of document IDs, which are in fact part increasing timestamp and part sequence number, it might be a good idea to use the timestamp compression algorithm described in Facebook's Gorilla time-series database – I wrote about it here, but that's another story.
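Still, the core idea, delta-of-delta encoding with variable-width buckets, is easy to sketch. The estimator below is hypothetical: the bucket boundaries only roughly follow the Gorilla paper, and our ids would need different ones (that is the "serious modifications" part):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Delta-of-delta sketch in the spirit of Gorilla's timestamp compression:
// ids that grow almost regularly produce tiny second-order deltas, which can
// be stored in a handful of bits each.
size_t estimate_bits(const std::vector<int64_t> &ts) {
    size_t bits = 64;                        // first value stored verbatim
    int64_t prev = ts[0], prev_delta = 0;
    for (size_t i = 1; i < ts.size(); ++i) {
        int64_t delta = ts[i] - prev;
        int64_t dod = delta - prev_delta;    // delta of delta
        prev = ts[i];
        prev_delta = delta;
        if (dod == 0)                bits += 1;
        else if (llabs(dod) < 64)    bits += 9;    // short prefix + 7 bits
        else if (llabs(dod) < 256)   bits += 12;   // prefix + 9 bits
        else if (llabs(dod) < 2048)  bits += 16;   // prefix + 12 bits
        else                         bits += 37;   // escape + 32 bits
    }
    return bits;
}

int main() {
    // Hypothetical, almost-regular id stream.
    std::vector<int64_t> ids = {1000, 1060, 1120, 1185, 1245, 1305, 1500};
    printf("~%zu bits for %zu ids (vs %zu bits raw)\n",
           estimate_bits(ids), ids.size(), ids.size() * 64);
}
```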