Custom partitioning in ClickHouse 1.1.54310

 

Nov 8, 2017

ClickHouse introduced a new major feature in version 1.1.54310 – it is a custom partitioning for tables with MergeTree engine. Before it enforced to use Date field for partitioning and the only choice to partition by month.

The new version allows much more flexibility and we can choose the partition schema not only based on Date fields.

The feature still marked as experimental and we need to start the server with experimental_allow_extended_storage_definition_syntax flag.

Let’s review how to use this custom partitioning. We will take the table from ontime benchmark mentioned on the page
https://clickhouse.yandex/docs/en/getting_started/example_datasets/ontime.html

The table is created as:

) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)

With new syntax it can be rewritten as:

) ENGINE = MergeTree PARTITION BY toYYYYMM(FlightDate) Order By (Year, FlightDate);

Now it is clear how to change the definition for daily partitioning:

) ENGINE = MergeTree PARTITION BY FlightDate Order By (Year, FlightDate);

Or to use partitioning by different non-Date column:

) ENGINE = MergeTree PARTITION BY AirlineID Order By FlightDate;

Now, we can drop data for one specific date (e.g. for 1987-10-01)

ALTER TABLE ontime DROP PARTITION '1987-10-01';

Why is this feature useful?
It provides better granularity to manage data. We can perform data maintenance for the period which is required for business tasks, and not just by prescribed monthly periods.

One thing to be aware, that more fine partitioning may affect the compression.
For example with the default monthly partitioning and LZ4 compression, the full ontime table takes 7.7GB on disk, and when we switch to daily partition the size growth to 15GB. This is because the data for compression becomes much shorter, so the compression is less efficient.

 
Share

7 Comments

  1. I wonder if without the "Order By …",) ENGINE = MergeTree PARTITION BY FlightDate;We can still drop data for one specific date (e.g. for 1987-10-01) ?Thanks

    1. Hi Tim. ‘order by’ does not affect partitioning, it is sorting order of merge tree. You can not omit it, it is required. But it is not related to partitioning functionality, so partitions do work with any sort order.

  2. Hi Zaytsev,I’m just wondering what is the different between Partition by week and by day?I mean how is the performance of two kinds of Partition? Is Partition by Day make the ClickHouse database larger or slower than Partition by Week?
    Thank you.

    1. It depends on your queries. If you queries usually query one or two days of data, daily partitions may be more efficient for query performance. If you queries typically query weeks or months, then small partitions are bad.

  3. ??? ????? ??????????, ??? ? ???????, ?????? ??????, ???? ????????? ????? ? ??????????. ???? ? ?????????? ????? ????????? ? ???????: ) ENGINE = MergeTree PARTITION BY toYYYYMM(FlightDate) Order By (Year, FlightDate);

  4. При таком синтаксисе, как в примере, выдает ошибку, если несколько полей в сортировке. Поля в сортировке нужно указывать в скобках: ) ENGINE = MergeTree PARTITION BY toYYYYMM(FlightDate) Order By (Year, FlightDate);

Comments are closed.