New Encodings to Improve ClickHouse Efficiency
July 10, 2019
Modern analytical databases would not exist without efficient data compression. Storage gets cheaper and faster, but data sizes typically grow even faster: Moore’s Law for big data outpaces its hardware counterpart. In our blog we have already written about ClickHouse compression (https://www.altinity.com/blog/2017/11/21/compression-in-clickhouse) and the LowCardinality data type wrapper (https://www.altinity.com/blog/2019/3/27/low-cardinality). In this article we will describe and test the most advanced ClickHouse encodings, which especially shine for time series data. We are proud that some of these encodings have been contributed to ClickHouse by Altinity.
This article presents an early preview of new encoding functionality targeted for ClickHouse release 19.11. As of the time of writing, release 19.11 is not yet available. In order to test the new encodings, ClickHouse can be built from source, or a testing build can be installed. We expect release 19.11 to be publicly available in a few weeks.
Encoding and Compression Overview
Encoding and compression serve the same purpose -- storing data more compactly. They transform data differently, however. Encoding maps source data to a different (encoded) form, element by element; you can think of it as a mathematical mapping applied value by value. For example, dictionary encoding maps strings to dictionary indices. Encoding is aware of the data type being encoded. Compression, on the other hand, applies a generic algorithm that operates on bytes rather than on typed values.
For many years ClickHouse supported two compression algorithms: LZ4 (used by default) and ZSTD. These algorithms sometimes apply encoding-like transformations internally, but they are not aware of ClickHouse data types. Knowing the data type and the nature of the data allows us to encode data more efficiently. Over the last several months the ClickHouse syntax has been extended to accommodate column-level encodings, and new encodings have been added to ClickHouse:
- Delta. Delta encoding stores the difference between consecutive values. The difference typically has smaller byte-size and cardinality, especially for sequences. It can be efficiently compressed later with LZ4 or ZSTD.
- DoubleDelta. With this encoding ClickHouse stores the difference between consecutive deltas. It gives even better results for slowly changing sequences. To use an analogy from physics, Delta encodes speed, and DoubleDelta encodes acceleration.
- Gorilla. This one is inspired by a Facebook article [http://www.vldb.org/pvldb/vol8/p1816-teller.pdf] and is named after Facebook’s Gorilla time series database rather than by any academic name for the algorithm. Gorilla encoding is very efficient for values that do not change often. It is applicable to both float and integer data types.
- T64. This encoding is unique to ClickHouse. It calculates the max and min values for the encoded range, and then strips the higher bits by transposing a 64-bit matrix (which is where the T64 name comes from). The result is a more compact bit representation of the same data. The encoding is universal for integer data types and does not require any special properties from the data other than locality of values.
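To make the difference between Delta and DoubleDelta concrete, here is a minimal Python sketch of the two transforms. This is an illustration only: the actual ClickHouse codecs operate on fixed-width byte representations and, in DoubleDelta’s case, use variable-length bit packing.

```python
def delta_encode(values):
    """Keep the first value as-is, then store each difference from the previous value."""
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def double_delta_encode(values):
    """Delta-of-delta: apply delta encoding twice."""
    return delta_encode(delta_encode(values))

# A monotonic sequence with a constant increment: deltas are constant,
# and double deltas collapse to zeros, which compress extremely well.
seq = [0, 1000, 2000, 3000, 4000]
print(delta_encode(seq))         # [0, 1000, 1000, 1000, 1000]
print(double_delta_encode(seq))  # [0, 1000, 0, 0, 0]
```

Note how a sequence with a constant increment turns into mostly zeros under DoubleDelta -- exactly the pattern that LZ4 or ZSTD can then squeeze down to almost nothing.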
Encodings are specified as part of the column definition using the ‘Codec’ keyword. The example below defines DoubleDelta encoding for a column. Note that this turns off the default compression.
ts DateTime Codec(DoubleDelta) -- encoded but NOT compressed
In order to have both encoding and compression, codecs can be chained, as shown below:
ts DateTime Codec(DoubleDelta, LZ4) -- encoded AND compressed
Delta, DoubleDelta and Gorilla encodings can be found in many time series databases. Delta and DoubleDelta are traditionally used for timestamps, while Gorilla encoding is used for values; InfluxDB and Prometheus storage are good examples. As we demonstrated a few months ago [https://www.altinity.com/blog/clickhouse-for-time-series], ClickHouse competes well against InfluxDB in terms of performance, but lagged behind where compression efficiency is concerned. The new encodings close that gap. Let’s test them out!
We are going to test encodings using several criteria:
- Efficiency of encoding without compression for different data types
- Efficiency of encoding with LZ4 and ZSTD compressions applied
- Efficiency of LZ4 and ZSTD compression applied to encoded data
We used 4 different generated datasets with 1,000,000 rows each that we store in different tables:
- Monotonic sequence of numbers with constant increments (codec_test1_seq).
Sample data: 0,1000,2000,3000,4000,5000,6000,7000,8000,9000, ...
- Monotonic sequence with random increment (codec_test2_mon).
Sample data: 22,1072,2034,3076,4007,5094,6061,7074,8061,9058, ...
- Constant sequence with random spikes (codec_test3_var).
Sample data: 1825,1000,1000,1000,1000,1000,1000,1703,1000,1000, ...
- Randomized set of values below 1B (codec_test4_rand).
Sample data: 608745218, 167444263, 620949842, 956686395, 66062863, ...
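The four generators can be reproduced outside ClickHouse with a short Python sketch. The variable names below are our own, and Python’s random module stands in for ClickHouse’s rand(), so the exact values will differ from the samples shown above, but the shape of each dataset is the same.

```python
import random

N = 1_000_000

# Monotonic sequence with constant increment (codec_test1_seq)
seq = [i * 1000 for i in range(N)]

# Monotonic sequence with random increment (codec_test2_mon)
mon = [i * 1000 + random.randrange(100) for i in range(N)]

# Constant sequence with random spikes (codec_test3_var):
# roughly one value in ten deviates from the 1000 baseline
var = [1000 + (random.randrange(1000) if random.randrange(10) == 0 else 0)
       for _ in range(N)]

# Randomized set of values below 1B (codec_test4_rand)
rand = [random.randrange(1_000_000_000) for _ in range(N)]
```

The exact SQL used to populate the tables is in the Appendix.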
The first three datasets are typical for time series applications. The randomized set is a worst case scenario. Every dataset table has multiple columns representing different combinations of data type, encoding and compression. We also store an unencoded column as a reference, which allows us to compare combinations both within and across datasets. To summarize, we test the following combinations:
- Int32, Int64 data types*
- None, Delta, DoubleDelta, Gorilla, T64 encodings
- None, LZ4, ZSTD compression**
* Float32 and Float64 columns were also included, with Gorilla encoding and without any encoding (see the Appendix).
** It is possible to adjust the compression level, but default compression levels were used.
Compression and encoding efficiency can be inspected via the ClickHouse system.columns table, which displays the compressed and uncompressed size of each column, as well as the encodings and compression that were applied to it.
Readers interested in detailed SQL examples can find the scripts in the Appendix at the end of the article.
Encoding and Compression Efficiency Results: Integer types
When looking at the results below, keep in mind the size of the source data, unencoded and uncompressed: 4,000,000 bytes for Int32 and 8,000,000 bytes for Int64. The unencoded data size is included in the results as a reference.
First let’s look at uncompressed data.
As you can see, Delta does not shrink the data at all, since the default delta has the same size as the data type itself. We will see that it works much better once LZ4 is applied. DoubleDelta is the best for uniform and monotonic sequences, but not that good for random data.
It is also very interesting to see how Gorilla and T64 perform. Random data is the clearest example. Since we generated random values below 1B, every value fits in 30 bits. T64 detects this and strips the extra bits perfectly. You can clearly see it for both Int32 and Int64 -- the data size is about the same, and is close to 3,750,000 bytes (1,000,000 * 30 / 8).
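The arithmetic behind T64’s savings can be checked directly. The sketch below estimates only the post-stripping size; it does not model the codec’s actual 64x64 bit-matrix transpose, and the helper name is our own.

```python
# T64 works on blocks of values: it subtracts the block minimum, keeps only
# the bits needed for the remaining range, and transposes a 64x64 bit matrix
# so that the unused high bits vanish from the output.
def t64_estimated_bytes(n_values, max_value, min_value=0):
    bits = (max_value - min_value).bit_length()  # bits per value after stripping
    return n_values * bits / 8

# 1,000,000 random values below 1B fit in 30 bits each:
print(t64_estimated_bytes(1_000_000, 10**9 - 1))  # 3750000.0
```

This matches the ~3,750,000 bytes observed for the random dataset regardless of whether the column is Int32 or Int64.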
Gorilla is slightly worse on the random dataset (32 bits per value), but it is fantastic for the ‘var’ dataset. ‘Gauge’ style data is the primary use case for this encoding, though we were surprised by Gorilla’s performance on integer data types in general.
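It is easy to see why Gorilla loves rarely changing values: it XORs each value’s bit pattern with the previous one, and an unchanged value produces an all-zero XOR result that encodes into almost nothing. The sketch below shows only this XOR step; the real codec additionally encodes leading/trailing zero counts of each XOR result into a compact bit stream.

```python
import struct

def xor_stream(values):
    """XOR each 64-bit float bit pattern with its predecessor, as Gorilla does."""
    bits = [struct.unpack('<Q', struct.pack('<d', v))[0] for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

# A gauge that rarely changes: most XOR results are zero
gauge = [1000.0, 1000.0, 1000.0, 1703.0, 1000.0, 1000.0]
print(xor_stream(gauge))
```

Runs of identical values yield runs of zeros, which is exactly the ‘var’ dataset pattern where Gorilla shines.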
Next, let's look at LZ4 compression.
With LZ4 compression, Delta encoding is already good for everything except the random dataset. DoubleDelta is evidently the best for the time series cases.
And now let’s look at ZSTD.
It is interesting that ZSTD compresses Delta-encoded sequences even more efficiently than DoubleDelta-encoded ones. The compression ratio for sequences is an impressive 800:1 or better with either delta encoding plus ZSTD. T64 is the best for random data again.
Let’s look at this from a different angle. For the report below we combined all time series datasets in order to compare the compression and encoding effects together.
To summarize, Delta and DoubleDelta work well for time series specific data. DoubleDelta is very efficient with LZ4, so adding ZSTD does not improve it. T64 and Gorilla are great general purpose codecs that can be recommended for all cases where the data pattern is not known. Gorilla is better for time series, but T64 is more ‘compression-friendly’: even if Gorilla is more efficient before compression, compressed sizes are smaller with T64. The difference is not that big, however. Note that Gorilla can encode floats with the same efficiency, so for float values it is a no-brainer.
ClickHouse codecs help a lot to improve general compression, reduce storage, and increase performance due to less I/O. It is important to understand the nature of the data and choose the right codec: Delta encoding is the best for time columns, DoubleDelta should compress increasing counters very well, and Gorilla is the best for gauges. T64 can be used for any integer data, provided you do not store random hashes. Using codecs also allows you to stick with fast LZ4 and reduce CPU load when decompressing data.
There is room for improvement left, however. In particular, it looks like DoubleDelta can be improved for Int32, a new version of the T64 codec was added last week, and there is an article by VictoriaMetrics claiming significant improvements from a modified Gorilla algorithm [https://medium.com/faun/victoriametrics-achieving-better-compression-for-time-series-data-than-gorilla-317bc1f95932]. We expect that even more efficient codecs will be available in ClickHouse over the next few months.
One might complain that this article lacks performance tests. Performance tests on synthetic data can be misleading, so we are going to measure the real effect using TSBS benchmarks. Stay tuned.
Appendix: Test Data Preparation
Below are the scripts to reproduce our benchmarks.
DROP TABLE IF EXISTS codec_test1_seq;
CREATE TABLE codec_test1_seq (
    n Int32,
    /* No compression */
    n32 Int32 default n Codec(NONE),
    n32_delta Int32 default n Codec(Delta),
    n32_doubledelta Int32 default n Codec(DoubleDelta),
    n32_t64 Int32 default n Codec(T64),
    n32_gorilla Int32 default n Codec(Gorilla),
    n64 Int64 default n Codec(NONE),
    n64_delta Int64 default n Codec(Delta),
    n64_doubledelta Int64 default n Codec(DoubleDelta),
    n64_t64 Int64 default n Codec(T64),
    n64_gorilla Int64 default n Codec(Gorilla),
    f32 Float32 default n Codec(NONE),
    f64 Float64 default n Codec(NONE),
    f32_gorilla Float32 default n Codec(Gorilla),
    f64_gorilla Float64 default n Codec(Gorilla),
    /* LZ4 compression */
    l_n32 Int32 default n Codec(LZ4),
    l_n32_delta Int32 default n Codec(Delta, LZ4),
    l_n32_doubledelta Int32 default n Codec(DoubleDelta, LZ4),
    l_n32_t64 Int32 default n Codec(T64, LZ4),
    l_n32_gorilla Int32 default n Codec(Gorilla, LZ4),
    l_n64 Int64 default n Codec(LZ4),
    l_n64_delta Int64 default n Codec(Delta, LZ4),
    l_n64_doubledelta Int64 default n Codec(DoubleDelta, LZ4),
    l_n64_t64 Int64 default n Codec(T64, LZ4),
    l_n64_gorilla Int64 default n Codec(Gorilla, LZ4),
    l_f32 Float32 default n Codec(LZ4),
    l_f64 Float64 default n Codec(LZ4),
    l_f32_gorilla Float32 default n Codec(Gorilla, LZ4),
    l_f64_gorilla Float64 default n Codec(Gorilla, LZ4),
    /* ZSTD compression */
    z_n32 Int32 default n Codec(ZSTD),
    z_n32_delta Int32 default n Codec(Delta, ZSTD),
    z_n32_doubledelta Int32 default n Codec(DoubleDelta, ZSTD),
    z_n32_t64 Int32 default n Codec(T64, ZSTD),
    z_n32_gorilla Int32 default n Codec(Gorilla, ZSTD),
    z_n64 Int64 default n Codec(ZSTD),
    z_n64_delta Int64 default n Codec(Delta, ZSTD),
    z_n64_doubledelta Int64 default n Codec(DoubleDelta, ZSTD),
    z_n64_t64 Int64 default n Codec(T64, ZSTD),
    z_n64_gorilla Int64 default n Codec(Gorilla, ZSTD),
    z_f32 Float32 default n Codec(ZSTD),
    z_f64 Float64 default n Codec(ZSTD),
    z_f32_gorilla Float32 default n Codec(Gorilla, ZSTD),
    z_f64_gorilla Float64 default n Codec(Gorilla, ZSTD)
) Engine = MergeTree PARTITION BY tuple() ORDER BY tuple();

DROP TABLE IF EXISTS codec_test2_mon;
CREATE TABLE codec_test2_mon AS codec_test1_seq;
DROP TABLE IF EXISTS codec_test3_var;
CREATE TABLE codec_test3_var AS codec_test1_seq;
DROP TABLE IF EXISTS codec_test4_rand;
CREATE TABLE codec_test4_rand AS codec_test1_seq;

insert into codec_test1_seq (n) select number*1000
  from numbers(1000000) settings max_block_size=1000000;
insert into codec_test2_mon (n) select number*1000+(rand()%100)
  from numbers(1000000) settings max_block_size=1000000;
insert into codec_test3_var (n) select 1000 + (rand(1)%1000) * (rand(2)%10 = 0)
  from numbers(1000000) settings max_block_size=1000000;
insert into codec_test4_rand (n) select rand()%(1000000*1000)
  from numbers(1000000) settings max_block_size=1000000;

/* Sample rows */
select table, groupArray(n) samples
from (
    select 'codec_test1_seq' table, n from codec_test1_seq limit 10
    union all
    select 'codec_test2_mon' table, n from codec_test2_mon limit 10
    union all
    select 'codec_test3_var' table, n from codec_test3_var limit 10
    union all
    select 'codec_test4_rand' table, n from codec_test4_rand limit 10
)
group by table
order by table;

/* Validation. It is complicated because of the mix of Int and Float types */
select arrayDistinct(avgForEach(CAST(replaceOne(replaceOne(toString(tuple(*)), '(', '['), ')', ']'), 'Array(Int64)'))) avg from codec_test1_seq
union all
select arrayDistinct(avgForEach(CAST(replaceOne(replaceOne(toString(tuple(*)), '(', '['), ')', ']'), 'Array(Int64)'))) avg from codec_test2_mon
union all
select arrayDistinct(avgForEach(CAST(replaceOne(replaceOne(toString(tuple(*)), '(', '['), ')', ']'), 'Array(Int64)'))) avg from codec_test3_var
union all
select arrayDistinct(avgForEach(CAST(replaceOne(replaceOne(toString(tuple(*)), '(', '['), ')', ']'), 'Array(Int64)'))) avg from codec_test4_rand;

/* Encoding benchmark results */
select
    multiIf(table like '%rand', 'Random', 'Time-Series') dataset_type,
    table,
    type as data_type,
    multiIf(compression_codec like '%ZSTD%', 'ZSTD',
            compression_codec like '%LZ4%', 'LZ4',
            ' None') compression,
    multiIf(compression_codec like '%DoubleDelta%', 'DoubleDelta',
            compression_codec like '%Delta%', 'Delta',
            compression_codec like '%T64%', 'T64',
            compression_codec like '%Gorilla%', 'Gorilla',
            ' None') encoding,
    compression_codec codec,
    sum(data_uncompressed_bytes) uncompressed,
    sum(data_compressed_bytes) compressed,
    round(uncompressed/compressed, 1) ratio
from system.columns
where table like 'codec_test%' and name != 'n'
group by dataset_type, table, data_type, compression, encoding, codec
order by dataset_type, table, data_type, compression, encoding, codec
FORMAT CSVWithNames;