ClickHouse Primary Keys

Nov 27, 2017

Our guest author is ClickHouse evangelist Yegor Andreenko. Follow him on Twitter @f1yegor.
This is a cross-post from https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324.

Special thanks to Alexey Milovidov, ClickHouse developer, for providing material for this article.


Recently I dived deep into ClickHouse. ClickHouse is a column-oriented database by Yandex with great performance for analytical queries. For example, check the benchmark and the post by Mark Litwintschik.

This is a translation of the answer given by Alexey Milovidov (creator of ClickHouse) about composite primary keys.

Questions:

  1. How many columns can a primary key have? And what is the layout of the data on the storage drive? Are there any theoretical/practical limits?
  2. Can columns with missing data in some rows be part of the primary key?

Data in a table of the MergeTree type is stored as a set of multiple parts. On average, you can expect a small number of parts (from a few to tens per month).

Within every part, data is stored sorted lexicographically by primary key. For example, if your primary key is (CounterID, Date), then rows are sorted by CounterID, and rows with the same CounterID are sorted by Date.
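
A minimal sketch of such a table, assuming illustrative names and the newer MergeTree syntax in which ORDER BY defines the primary key:

    CREATE TABLE hits
    (
        CounterID String,   -- kept as a string to match the single-letter values in the examples below
        Date      Date,
        UserID    UInt64
    )
    ENGINE = MergeTree
    ORDER BY (CounterID, Date);   -- sorting key; the primary key defaults to it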

The data structure of the primary key looks like an array of marks: the values of the primary key for every index_granularity rows.

index_granularity is a setting of the MergeTree engine; it defaults to 8192.
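
You can check the parts of a table and how many marks each of them carries through system.parts (shown here against the hypothetical hits table sketched above; roughly, marks ≈ rows / index_granularity per part):

    SELECT name, rows, marks
    FROM system.parts
    WHERE table = 'hits' AND active;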

We say that the primary key is a sparse index over the sorted data. Let's visualise it with only one part. (The marks should be equally spaced, but it's a bit imperfect to draw ASCII art here):
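
Roughly, for a single part with CounterID values a–l, Date values 1–3 and marks numbered 0–10, the picture is:

    Whole data: [-------------------------------------------------------------------------]
    CounterID:  [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
    Date:       [1111111222222233331233211111222222333211111112122222223111112223311122333]
    Marks:       |      |      |      |      |      |      |      |      |      |      |
                a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
    Mark number: 0      1      2      3      4      5      6      7      8      9      10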

It's convenient to think of the marks as marks on a ruler. The primary key makes it possible to read ranges of data efficiently: for a SELECT, ClickHouse chooses the set of mark ranges that could contain the target data.

This way:

  • if you select CounterID IN ('a', 'h'), the server reads data in the mark ranges [0, 3) and [6, 8);
  • if you select CounterID IN ('a', 'h') AND Date = 3, the server reads data in the mark ranges [1, 3) and [7, 8).
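
As a sketch against the hypothetical hits table above (with a real date standing in for the simplified Date = 3 of the example), those conditions look like:

    SELECT count() FROM hits WHERE CounterID IN ('a', 'h');
    SELECT count() FROM hits WHERE CounterID IN ('a', 'h') AND Date = '2017-11-03';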

Sometimes the primary key helps even if only the condition on the second column is present in the SELECT: if you select Date = 3, the server reads data in the mark range [1, 10).

In our example that's every mark except mark 0, i.e. 90% of the data. In this case the index isn't very effective, but it still allows part of the data to be skipped.
On the other hand, if we had more data per CounterID, the index would allow wider ranges of Date to be skipped.

In any case, using the index can never be less efficient than a full scan.

A sparse index can read unnecessary rows: when reading a single range of the primary key, up to index_granularity * 2 unnecessary rows may be read in each part. This is normal, and you shouldn't try to reduce index_granularity: ClickHouse is designed to process data efficiently in large batches of rows, so a few extra rows per read don't hurt performance. index_granularity = 8192 is a good value for most cases.
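
For instance, with index_granularity = 8192 and, say, 10 active parts, a query touching one primary key range may read up to about 8192 * 2 * 10 ≈ 164 thousand extra rows, which is negligible for an engine that processes data in large batches.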

A sparse index makes it possible to work with tables that have an enormous number of rows, and it always fits in RAM.

The primary key isn't unique: you can insert many rows with the same primary key value.
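
For example, against the hypothetical hits table above, all of these rows are accepted even though they share the same (CounterID, Date):

    INSERT INTO hits VALUES
        ('a', '2017-11-01', 1),
        ('a', '2017-11-01', 2),
        ('a', '2017-11-01', 2);   -- exact duplicates are stored too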

The primary key can also contain functional expressions.
Example: (CounterID, EventDate, intHash32(UserID))

Here the hash is used to shuffle the data of each UserID within every (CounterID, EventDate) tuple. In turn, this is used for sampling (the SAMPLE clause, https://clickhouse.yandex/reference_en.html#SAMPLE).
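
A sketch of such a table (names are illustrative; note that the SAMPLE BY expression has to be part of the primary key):

    CREATE TABLE visits
    (
        CounterID String,
        EventDate Date,
        UserID    UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(EventDate)
    ORDER BY (CounterID, EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID);

    -- reads an approximate 10% subset of users
    SELECT count() FROM visits SAMPLE 0.1 WHERE CounterID = 'a';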

Let's sum up what the choice of primary key affects:

1) The most important and obvious: the primary key allows less data to be read during SELECT queries. As shown in the examples above, it usually doesn't make sense to include many columns in the primary key for this purpose.

Let's say you have the primary key (a, b). Adding one more column c, i.e. (a, b, c), makes sense only if both of these conditions hold:

  • you have queries that filter on this column;
  • your data can contain fairly long ranges (several times longer than index_granularity) of rows with the same values of (a, b).

In other words, adding the extra column must allow large enough ranges of data to be skipped.

2) Data is sorted by primary key, which makes it more compressible. Sometimes adding one more column to the primary key allows the data to be compressed better.

3) When you use MergeTree variants with additional merge logic, such as CollapsingMergeTree or SummingMergeTree, the primary key affects how data is merged. For that reason it might be necessary to use more columns in the primary key even when point 1 doesn't require them.
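
For example, in a SummingMergeTree the primary key defines which rows are collapsed into one during merges, so every column you want to group by has to be part of the key (a sketch with illustrative names):

    CREATE TABLE daily_totals
    (
        CounterID String,
        Date      Date,
        Hits      UInt64
    )
    ENGINE = SummingMergeTree
    ORDER BY (CounterID, Date);   -- rows with equal (CounterID, Date) are summed during merges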

The number of columns in the primary key isn't explicitly limited, but a long primary key is usually useless. The maximum I have seen in a real use case was ~20 columns (for a SummingMergeTree), but I don't recommend this.
A long primary key will negatively affect insert performance and memory usage.

A long primary key will not negatively affect the performance of SELECT queries.

During an insert, missing values for any columns are replaced with default values and written to the table.
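
A sketch of this against the hypothetical hits table from above:

    -- Date and UserID are omitted, so they are written as their
    -- default values (a zero date and 0 respectively)
    INSERT INTO hits (CounterID) VALUES ('a');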

3 Comments

  1. "Sometimes it happens that by adding one more column into primary key data could be compressed better."
    Could you give an example of how this would play?

    1. For example, you log prices of products daily and have PRIMARY KEY product_id. If you change the definition to PRIMARY KEY (product_id, date), the rows will be sorted by date within a product. Often the prices of products don’t change from day to day. So, sequent duplicate values can be compressed efficiently.
