ClickHouse Materialized Views Illuminated, Part 2
Sep 9, 2019
In the previous blog post on materialized views, we introduced a way to construct ClickHouse materialized views that compute sums and counts using the SummingMergeTree engine. The SummingMergeTree can use normal SQL syntax for both types of aggregates. We also let the materialized view definition create the underlying table for data automatically. Both of these techniques are quick but have limitations for production systems.
In the current post we will show how to create a materialized view with a range of aggregate types on an existing table. This appproach is suitable when you need to compute more than simple sums. It’s also handy for cases where your table has large amounts of arriving data or has to deal with schema changes.
Using State Functions and TO Tables to Create More Flexible Views
In the following example we are going to measure readings from devices. Let’s start with a table definition.
CREATE TABLE counter ( when DateTime DEFAULT now(), device UInt32, value Float32 ) ENGINE=MergeTree PARTITION BY toYYYYMM(when) ORDER BY (device, when)
Next we add sufficient data to make query times slow enough to be interesting: 1 billion rows of synthetic data for 10 devices. Note: If you are trying these out you can just put in a million rows to get started. The examples work regardless of the amount of data.
INSERT INTO counter SELECT toDateTime('2015-01-01 00:00:00') + toInt64(number/10) AS when, (number % 10) + 1 AS device, (device * 3) + (number/10000) + (rand() % 53) * 0.1 AS value FROM system.numbers LIMIT 1000000
Now let’s look at a sample query we would like to run regularly. It summarizes all data for all devices over the entire duration of sampling. In this case that means 3.25 years worth of data from the table, all of it prior to 2019.
SELECT device, count(*) AS count, max(value) AS max, min(value) AS min, avg(value) AS avg FROM counter GROUP BY device ORDER BY device ASC . . . 10 rows in set. Elapsed: 2.709 sec. Processed 1.00 billion rows, 8.00 GB (369.09 million rows/s., 2.95 GB/s.)
The preceding query is slow because it must read all of the data in the table to get answers. We want to design a materialized view that reads a lot less data. It turns out that if we define a view that summarizes data on a daily basis, ClickHouse will correctly aggregate the daily totals across the entire interval.
Unlike our previous simple example we will define the target table ourselves. This has the advantage that the table is now visible, which makes it easier to load data as well as do schema migrations. Here’s the target table definition.
CREATE TABLE counter_daily ( day DateTime, device UInt32, count UInt64, max_value_state AggregateFunction(max, Float32), min_value_state AggregateFunction(min, Float32), avg_value_state AggregateFunction(avg, Float32) ) ENGINE = SummingMergeTree() PARTITION BY tuple() ORDER BY (device, day)
The table definition introduces a new datatype, called an aggregate function, which holds partially aggregated data. The type is required for aggregates other than sums or counts. Next we create the corresponding materialized view. It selects from counter (the source table) and sends data to counter_daily (the target table) using special TO syntax in the CREATE statement. Where the table has aggregate functions, the SELECT statement has matching functions like ‘maxState’. We’ll get into how these are related when we discuss aggregate functions in detail.
CREATE MATERIALIZED VIEW counter_daily_mv TO counter_daily AS SELECT toStartOfDay(when) as day, device, count(*) as count, maxState(value) AS max_value_state, minState(value) AS min_value_state, avgState(value) AS avg_value_state FROM counter WHERE when >= toDate('2019-01-01 00:00:00') GROUP BY device, day ORDER BY device, day
The TO keyword lets us point to our target table but has a disadvantage. ClickHouse does not allow use of the POPULATE keyword with TO. To begin with the materialized view therefore has no data. We’re going to load data manually. But we’ll also use a nice trick that enables us to avoid problems in case there is active data loading going on at the same time.
Notice that the view definition has a WHERE clause. This says that any data prior to 2019 should be ignored. We now have a way to handle data loading in a way that does not lose data. The view will take care of new data arriving in 2019. Meanwhile we can load old data from 2018 and before with an INSERT.
Let’s demonstrate how this works by loading new data into the counter table. The new data will start in 2019 and should load into the view automatically.
INSERT INTO counter SELECT toDateTime('2019-01-01 00:00:00') + toInt64(number/10) AS when, (number % 10) + 1 AS device, (device * 3) + (number / 10000) + (rand() % 53) * 0.1 AS value FROM system.numbers LIMIT 10000000
Now let’s manually load the older data using the following INSERT. It loads all data from 2018 and before.
INSERT INTO counter_daily SELECT toStartOfDay(when) as day, device, count(*) AS count, maxState(value) AS max_value_state, minState(value) AS min_value_state, avgState(value) AS avg_value_state FROM counter WHERE when < toDateTime('2019-01-01 00:00:00') GROUP BY device, day ORDER BY device, day
We are finally ready to select data out of the view. As with the target table and materialized view, ClickHouse uses specialized syntax to select from the view.
SELECT device, sum(count) AS count, maxMerge(max_value_state) AS max, minMerge(min_value_state) AS min, avgMerge(avg_value_state) AS avg FROM counter_daily GROUP BY device ORDER BY device ASC ┌─device─┬─────count─┬────────max─┬─────min─┬────────────────avg─┐ │ 1 │ 101000000 │ 100008.17 │ 3.008 │ 49515.50042561026 │ │ 2 │ 101000000 │ 100011.164 │ 6.0031 │ 49518.500627177054 │ │ 3 │ 101000000 │ 100014.17 │ 9.0062 │ 49521.50087863756 │ │ 4 │ 101000000 │ 100017.04 │ 12.0333 │ 49524.5006612177 │ │ 5 │ 101000000 │ 100020.19 │ 15.0284 │ 49527.50092650661 │ │ 6 │ 101000000 │ 100023.15 │ 18.0025 │ 49530.50098047898 │ │ 7 │ 101000000 │ 100026.195 │ 21.0326 │ 49533.50099656529 │ │ 8 │ 101000000 │ 100029.18 │ 24.0297 │ 49536.50119239665 │ │ 9 │ 101000000 │ 100031.984 │ 27.0258 │ 49539.50119958179 │ │ 10 │ 101000000 │ 100035.17 │ 30.0229 │ 49542.501308345716 │ └────────┴───────────┴────────────┴─────────┴────────────────────┘ 10 rows in set. Elapsed: 0.003 sec. Processed 11.70 thousand rows, 945.49 KB (3.76 million rows/s., 304.25 MB/s.)
This query properly summarizes all data including the new rows. You can check the math by rerunning the original SELECT on the counter table. The difference is that the materialized view returns data around 900 times faster. It’s worth learning a bit of new syntax to get this!!
At this point we can circle back and explain what’s going on under the covers.
Aggregate functions are like collectors that allow ClickHouse to build aggregates from data spread across many parts. The following diagram shows how this works to compute averages. We start with a selectable value in the source table. The materialized view converts the data into a partial aggregate using the avgState function, which is an internal structure. Finally, when selecting data out, apply avgMerge to total up the partial aggregates into the resulting number.
Partial aggregates enable materialized views to work with data spread across many parts on multiple nodes. The merge function properly assembles the aggregates even if you change the group by variables. It would not work just to combine simple average values, because they would be lacking the weights necessary to scale each partial average as it added to the total. This behavior has an important consequence.
Remember above when we mentioned that ClickHouse could answer our sample query using a materialized view with summarized daily data? That’s a consequence of how aggregate functions work. It means that our daily view can also answer questions about the week, month, year, or entire interval.
ClickHouse is somewhat unusual that it directly exposes partial aggregates in the SQL syntax, but the way they work to solve problems is extremely powerful. When you design materialized views try to use tricks like daily summarization to solve multiple problems with a single view. A single view can answer a lot of questions.
Table Engines for Materialized Views
ClickHouse has multiple engines that are useful for materialized views. The AggregatingMergeTree engine works with aggregate functions only. If you want to do counts or sums you’ll need to define them using AggregateFunction datatypes in the target table. You’ll also need to use state and merge functions in the view and select statements. For example, to process counts you would need to use countState(count) and countMerge(count) in our worked examples above.
We recommend the SummingMergeTree engine to do aggregates in materialized views. It can handle aggregate functions perfectly well. However it hides them for sums and counts, which is handy for simple cases. It does not prevent you from using the state and merge functions in this case; it’s just you don’t have to. Meanwhile it does everything that AggregatingMergeTree does.
Database schema tends to change in production systems, especially those that are under active development. You can manage such changes relatively easily when using materialized views with an explicit target table.
Let’s take a simple example. Suppose the name of the counter table changes to counter_replicated. The materialized view won’t work once this change is applied. Even worse, the failures will block INSERTs to the counter table. You can deal with the change as follows.
-- Delete view prior to schema change. DROP TABLE counter_daily_mv -- Rename source table. RENAME TABLE counter TO counter_replicated -- Recreate view with correct source table name. CREATE MATERIALIZED VIEW counter_daily_mv TO counter_daily AS SELECT toStartOfDay(when) as day, device, count(*) as count, maxState(value) AS max_value_state, minState(value) AS min_value_state, avgState(value) AS avg_value_state FROM counter_replicated GROUP BY device, day ORDER BY device, day
Depending on the actual steps in schema migration you may have to work around missed data that arrives while the materialized view definition is being changed. You can handle that using filter conditions and manual loading as we showed in the main example.
Materialized View Plumbing and Data Sizes
Finally, let’s look again at the relationship between the data tables and the materialized view. The target table is a normal table. You can select data from either the target table or the materialized view. There is no difference. Moreover, if you drop the materialized view, the table remains. As we just showed, you can make schema changes to the view by simply dropping and recreating it. If you need to change the target table itself, run ALTER TABLE commands as you would for any other table.
The diagram also shows the data size of the source and target tables. Materialized views are often vastly smaller than the tables whose data they aggregate. That’s certainly the case here. The following query shows the difference in sizes for this example.
SELECT table, formatReadableSize(sum(data_compressed_bytes)) AS tc, formatReadableSize(sum(data_uncompressed_bytes)) AS tu, sum(data_compressed_bytes) / sum(data_uncompressed_bytes) AS ratio FROM system.columns WHERE database = currentDatabase() GROUP BY table ORDER BY table ASC ┌─table────────────┬─tc─────────┬─tu─────────┬──────────────ratio─┐ │ counter │ 6.52 GiB │ 11.29 GiB │ 0.5778520850660066 │ │ counter_daily │ 210.35 KiB │ 422.75 KiB │ 0.4975675675675676 │ │ counter_daily_mv │ 0.00 B │ 0.00 B │ nan │ └──────────────────┴────────────┴────────────┴────────────────────┘
As the calculations show, the materialized view target table is approximately 30,000 times smaller than the source data from which the materialized view derives. This difference speeds up queries enormously. As we showed earlier our test query runs about 900x faster when using data from the materialized view.
ClickHouse materialized views are extremely flexible, thanks to powerful aggregate functions as well as the simple relationship between source table, materialized view, and target table. The fact that materialized views allow an explicit target table is a useful feature that makes schema migration simpler. You can also mitigate potential lost view updates by adding filter conditions to the view SELECT definition and manually loading missed data.
There are many other ways that materialized views can help transform data. We have already described some of them, such as last point queries, and plan to write about others in future on this blog. For more information, check out our recent webinar entitled ClickHouse and the Magic of Materialized Views. We cover several use case examples there.
Finally, if you are using materialized views in a way you think would be interesting to other users, write an article or present at a local ClickHouse meetup. We gladly host content from community users on the Altinity Blog and are always looking for speakers at future meetups. Please let us know if you have something you would like to share with the community.
Receive news and updates about ClickHouse technology