Introducing ClickHouse IPv4 and IPv6 Domains for IP Address Handling

One of our customers recently had a problem using CickHouse: the simple workflow of load-analyze-present wasn’t as efficient as they were expecting. The body of the problem was with loading and presenting IPv4 and IPv6 addresses, which are traditionally stored in ClickHouse as UInt32 and FixedString(16) columns. These types have many advantages, like compact footprint and ease of comparing values. But they also have shortcomings that prompted us to seek a better solution.

For example, imagine that you want to load your dataset with IPv4-values into a table. It generally looks like the following. We just show IPv4, but IPv6 values are treated analogously.

INSERT INTO hits_old (url, from) VALUES ('https://wikipedia.org', 
IPv4StringToNum('116.253.40.133'))('https://clickhouse.yandex', 
IPv4StringToNum('183.247.232.58'))('https://clickhouse.yandex/docs/en/', 
IPv4StringToNum('116.106.34.242'));

Not too bad, but you can not insert too many values this way. You may prefer to load data from the file instead:

$ head -n3 data.csv
"https://clickhouse.yandex","183.247.232.58"
"https://clickhouse.yandex/docs/en/","116.106.34.242"
"https://wikipedia.org","116.253.40.133"

Unfortunately loading this file is not that easy. Before our new feature arrived there were two options:

  • Use a staging table and perform type conversion when ‘INSERT SELECT’ from the staging table
  • Use materialized columns or materialized views for type conversions

Both approaches require an additional data structure, hence add overhead to processing.

In order to overcome these problems, we introduced two new data types to ClickHouse: IPv4 and IPv6. With these new data types conversions are automatic. User code is now more straightforward and easier to read:

INSERT INTO hits_new (url, from) VALUES ('https://wikipedia.org', '116.253.40.133')('https://clickhouse.yandex', 
'183.247.232.58')('https://clickhouse.yandex/docs/en/', '116.106.34.242');

And the best thing: you can now load your dataset with INSERT INTO ... FORMAT directly:

$ clickhouse client -q 'INSERT INTO hits_new FORMAT CSV' < data.csv

IPv4 values have nice human-friendly textual representation, which is what you see SELECTing from a client console andor exporting values into a file.

Now let’s check out detailed differences between the traditional and new approach of handling IP address values.


Creating tables

Imagine that you need to analyze user activity. Previously you would have designed the following table:

CREATE TABLE hits_old (
  url String, 
  from UInt32
) ENGINE = MergeTree() ORDER BY url;

The new way of doing exactly the same is:

CREATE TABLE hits_new (
  url String, 
  from IPv4
) ENGINE = MergeTree() ORDER BY url;

Now let’s check out how third-party software sees the old table:

DESCRIBE TABLE hits_old;
url     String
from    UInt32

The new way looks as follows:

DESCRIBE TABLE hits_new;
url     String
from    IPv4

The new data type has a distinct type name. Now any third-party software has a chance of recognizing the column as IPv4 right away, without even looking at stored values.


Inserting values

Let’s fill our table with data. As you have already seen, inserting data into a table with new IPv4 columns is easy:

INSERT INTO hits_new (url, from) VALUES 
('https://ya.ru', '45.3.47.158');

You can even load data files directly that contain CSV or any other format supported by ClickHouse:

$ clickhouse client -q 'INSERT INTO hits_new FORMAT CSV' < data.csv

This is made possible with automatic conversion of the textual representation at load time, as if you were wrapping each and every value with the IPv4StringToNum() function call.

Please note that loading old-style datafiles with int-like IP addresses into table with IPv4 column does not work. The process will fail with:

Code: 441. DB::Exception: Invalid IPv4 value.

Instead you can use a UInt32-backed table and convert it later.


Working with values

You can use any function that takes a UInt32 argument and call it with an IPv4 value. In fact, you can even use basic arithmetic on IPv4 values, even though this doesn’t make much sense. It would be more meaningful to use specialized functions, like IPv4NumToStringClassC, IPv4ToIPv6, etc.

SELECT IPv4NumToStringClassC(from) FROM hits_new LIMIT 1;
┌─IPv4NumToStringClassC(from)─┐
│ 183.247.232.xxx             │
└─────────────────────────────┘

Please note that no other type conversions (other than to UInt32) are performed implicitly, so if you want to turn your IPv4 value into a string you have to do this explicitly with IPv4NumToString:

SELECT concat(url, ' was accessed from ', IPv4NumToString(from)) 
FROM hits_new LIMIT 1;
┌─concat(url, ' was accessed from ', IPv4NumToString(from))──┐
│ https://clickhouse.yandex was accessed from 183.247.232.58 │
└────────────────────────────────────────────────────────────┘

Exporting values

After you have loaded all the data and performed your magic on it, I assume you want to see the results:

SELECT * FROM hits_old LIMIT 3;
┌─url────────────────────────────────┬───────from─┐
│ https://clickhouse.yandex          │ 3086477370 │
│ https://clickhouse.yandex/docs/en/ │ 1953112818 │
│ https://wikipedia.org              │ 1962748037 │
└────────────────────────────────────┴────────────┘

The old table is quite a nuisance, as you have to wrap data in a function call to get a human-friendly textual IP address. By contrast values in the new table look like normal IP addresses straight away:

SELECT * FROM hits_new LIMIT 3;
┌─url────────────────────────────────┬───────────from─┐
│ https://clickhouse.yandex          │ 183.247.232.58 │
│ https://clickhouse.yandex/docs/en/ │ 116.106.34.242 │
│ https://wikipedia.org              │ 116.253.40.133 │
└────────────────────────────────────┴────────────────┘

Quite a bit more convenient, right? The same happens when you are exporting values from table in bulk. Here’s the old approach.

SELECT * FROM hits_old LIMIT 3 FORMAT CSV;
"https://clickhouse.yandex",3086477370
"https://clickhouse.yandex/docs/en/",1953112818
"https://wikipedia.org",1962748037

And now the new approach:

SELECT * FROM hits_new LIMIT 3 FORMAT CSV;
"https://clickhouse.yandex","183.247.232.58"
"https://clickhouse.yandex/docs/en/","116.106.34.242"
"https://wikipedia.org","116.253.40.133"

Under the hood

Even though it appears that new-style IPv4 values are stored as strings, this is actually not the case. Let’s dump a value of the hits_old as hex to see the binary representation:

SELECT toTypeName(from), hex(from) FROM hits_old LIMIT 1;
┌─toTypeName(from)─┬─hex(from)─┐
│ UInt32           │ B7F7E83A  │
└──────────────────┴───────────┘

Doing the same thing for the hits_new table produces exactly the same result:

SELECT toTypeName(from), hex(from) FROM hits_new LIMIT 1;
┌─toTypeName(from)─┬─hex(from)─┐
│ IPv4             │ B7F7E83A  │
└──────────────────┴───────────┘

Not only IPv4 values are stored as efficient as UInt32-counterparts, but both cases have the same binary layout.

To be more precise, IPv6 and IPv6 are not real types but domains. Adding a new type to ClickHouse is an elaborate and difficult task that is not normally recommended. Domains are a sort of “tag” attached to an existing base type that redefines the type name and overrides the way values are parsed from and rendered to textual form. This design resembles user-defined domains from PostgreSQL, except there are no value constraints. That said, any IPv4 value representable as UInt32 is valid and any IPv6 as FixedString(16) is valid too. Consequently there is no need to for constraints, at least for now.


Converting old columns to new ones

If you are convinced and want to convert your existing IPv4-disguised-as-UInt32 columns into true IPv4-columns, you can easily do that with ALTER TABLE:

ALTER TABLE hits_old MODIFY COLUMN from IPv4;
SELECT * FROM hits_old LIMIT 1;
┌─url───────────────────────┬───────────from─┐
│ https://clickhouse.yandex │ 183.247.232.58 │
└───────────────────────────┴────────────────┘

How expensive would that conversion be? Let’s find out! I prepared a datafile based on the dataset from The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset – 20171130. It has four million rows, so we can now have some fun.

$ wc -l ./IPv4_test_data_UInt32.tsv
4281966 ./IPv4_test_data_UInt32.tsv
$ clickhouse-client -q 'DESCRIBE  hits_old'
url     String
from    UInt32
$ time clickhouse client -q 'INSERT INTO hits_old FORMAT TSV' < ./IPv4_test_data_UInt32.tsv
real    0m23.242s
user    0m20.459s
sys     0m0.268s
$ clickhouse client -q "SELECT * FROM hits_old WHERE url != '' LIMIT 1"
 host-213-144-115-157.reverse.teknotel.com      3583013789
$ time clickhouse client -q "ALTER TABLE hits_old MODIFY COLUMN from IPv4;"
real    0m0.141s
user    0m0.099s
sys     0m0.021s
$ clickhouse client -q "SELECT * FROM hits_old WHERE url != '' LIMIT 1"
 host-213-144-115-157.reverse.teknotel.com      213.144.115.157
$ clickhouse client -q "SELECT count(*) FROM hits_old"
4281966

The conversion was nearly instant, even on my relatively old laptop! The trick is that the data itself wasn’t touched, only the column headers describing the type of the data. Thus very little work was done and execution time doesn’t depend on the size of data.


Conclusion

As this article showed, there are two issues with the traditional ClickHouse approach to IP address handling:

  • To import data from a human-friendly IPv4/6 format you have to use additional temporary storage.
  • There is no easy way of providing information about column type other than having contrived name, so integration with third-party software might be complicated.

These are fixed by the new IPv4 and IPv6 domains, which have the following traits:

  • Distinct type name
  • Light footprint
  • Compatibility with legacy IPv4/v6 storage format
  • Human-friendly text format for imported and exported values

Altinity contributed the new domain type implementation to ClickHouse. It is available in ClickHouse starting from version 19.3. It was great to see the pull request go in. We are now moving to new contributions. Stay tuned for more blog articles describing our work!

Share