Compression in ClickHouse
Nov 21, 2017
It might not be obvious from the start, but ClickHouse supports different kinds of compression, namely two: LZ4 and ZSTD.
There are evaluations for both of these methods: https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
In short, LZ4 is faster but provides a lower compression ratio than ZSTD. While ZSTD is slower than LZ4, it is often faster than traditional Zlib and compresses better, so it can be considered a replacement for Zlib compression.
To get some real numbers using ClickHouse, let’s review a table compressed with both methods.
For this, we will take the lineorder table from the benchmark described in https://www.altinity.com/blog/2017/6/16/clickhouse-in-a-general-analytical-workload-based-on-star-schema-benchmark
The uncompressed data size of the lineorder table at scale 1000 is 680G.
And now let’s load this table into ClickHouse.
With the default compression (LZ4), we have
And with ZSTD
Here we should mention how to make ClickHouse use ZSTD. For this, we add the following lines to the config:
<compression incl="clickhouse_compression">
    <case>
        <method>zstd</method>
    </case>
</compression>
So the compression ratio for this table
What about performance? For this, let’s run the following query:
SELECT toYear(LO_ORDERDATE) AS yod, sum(LO_REVENUE) FROM lineorder GROUP BY yod;
We will execute this query in a “cold” run (no data is cached), followed by a “hot” run, when some data is already cached in OS memory after the first run.
The query results for LZ4 compression (cold run first, then hot run):
7 rows in set. Elapsed: 19.131 sec. Processed 6.00 billion rows, 36.00 GB (313.63 million rows/s., 1.88 GB/s.)
7 rows in set. Elapsed: 4.531 sec. Processed 6.00 billion rows, 36.00 GB (1.32 billion rows/s., 7.95 GB/s.)
For ZSTD compression:
7 rows in set. Elapsed: 20.990 sec. Processed 6.00 billion rows, 36.00 GB (285.85 million rows/s., 1.72 GB/s.)
7 rows in set. Elapsed: 7.965 sec. Processed 6.00 billion rows, 36.00 GB (753.26 million rows/s., 4.52 GB/s.)
While there is practically no difference in cold run times (as the I/O time dominates the decompression time), in hot runs LZ4 is much faster, about 4.5 seconds versus 8.0 seconds (as there are far fewer I/O operations, and decompression performance becomes the major factor).
ClickHouse offers two compression methods, LZ4 and ZSTD, so you can choose what is suitable for your case, hardware setup and workload.
- ZSTD is preferable where I/O is the bottleneck, in queries with huge range scans.
- LZ4 is preferable when I/O is fast enough that decompression speed becomes the bottleneck.
- For ultra-fast disk subsystems, e.g. NVMe SSD arrays, even LZ4 may be slow, so ClickHouse has an option to specify 'none' compression.
It is possible to configure different compression depending on part size. For example, use the faster LZ4 for smaller parts, which usually hold hot data, and the better-compressing ZSTD for historical data, which is usually merged into bigger parts.
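As a rough sketch of such a part-size-based setup, a <case> block in the server config can carry size conditions next to the method. The threshold values below are illustrative assumptions, not tuned recommendations, and should be adjusted to your data:

```xml
<compression incl="clickhouse_compression">
    <case>
        <!-- Assumed threshold: part is at least ~10 GB (value is in bytes) -->
        <min_part_size>10000000000</min_part_size>
        <!-- Assumed threshold: part is at least 1% of the total table size -->
        <min_part_size_ratio>0.01</min_part_size_ratio>
        <!-- Parts meeting both conditions are compressed with ZSTD -->
        <method>zstd</method>
    </case>
</compression>
```

With a configuration like this, parts that satisfy both conditions are written with ZSTD, while smaller parts that match no case keep the default LZ4.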
The ClickHouse team has column-specific compression on the roadmap, which would provide a much more flexible way to deal with compression and encoding settings in the future.