MySQL to ClickHouse data migration and replication
Dec 5, 2017
Altinity has developed an open source clickhouse-mysql tool that helps to migrate existing MySQL tables to ClickHouse and setup realtime replication of changes from MySQL to ClickHouse (INSERTS only). Here we demonstrate how it can be done by a simple example.
We will be using a subset of airline ontime test dataset:
mysql> select count(*) from airline.ontime;
+----------+
| count(*) |
+----------+
| 1289551 |
+----------+
Migration process consists of three steps:
- Create CH table
- Copy existing data from MySQL to CH
- Setup MySQL-to-ClickHouse replication
Prerequisites
clickhouse-mysql is a Python script, so Python >=3.5 needs to be installed. ‘pypy’ is better from performance prospective.
Following modules are needed for MySQL and ClickHouse integrations:
pip install mysqlclient
pip install mysql-replication
pip install clickhouse-driver
MySQL should have binlog enabled for replication.
1. Create ClickHouse table.
clickhouse-mysql has several modes. We start from ClickHouse table creation helper.
pypy clickhouse-mysql.py --log-file=/dev/null --src-host=127.0.0.1 --src-user=root --src-only-tables=airline.ontime --table-templates
That will generate ClickHouse table DDL template that needs to be manually edited.
CREATE TABLE `airline`.`ontime` (
`Year` Nullable(UInt16),
... CUT ...
`Div5TailNum` Nullable(String)
) ENGINE = MergeTree(<PRIMARY_DATE_FIELD>,
(<COMMA_SEPARATED_INDEX_FIELDS_LIST>), 8192)
Following changes need to be done manually:
- Specify partitioning column (
<PRIMARY_DATE_FIELD>
). - Specify primary key columns (
<COMMA_SEPARATED_INDEX_FIELDS_LIST>
) - Make sure partitioning and primary key columns are not nullable.
For example, it may look like
CREATE TABLE `airline`.`ontime` (
`Year` UInt16,
...CUT...
`Div5TailNum` Nullable(String)
) ENGINE = MergeTree(FlightDate, (FlightDate, Year, Month), 8192)
Once the table definition is ready, run clickhouse-client to create it into interactive mode or create from file.
2. Copy existing data from MySQL to ClickHouse
Data copy can be performed using standard MySQL and ClickHouse commands.
sudo mysqldump
-u root
--tz-utc
--quick
--fields-terminated-by=,
--fields-optionally-enclosed-by="
--fields-escaped-by=\
--tab="$CSV_FILES_DIR"/
airline ontime
sudo cat "$CSV_FILES_DIR"/ontime.txt | clickhouse-client --query="INSERT INTO airline.ontime FORMAT CSV"
If there are no errors you can check the table in clickhouse-client:
:) select count() from airline.ontime;
SELECT count()
FROM airline.ontime
┌─count()─┐
│ 1289551 │
└─────────┘
3. Setup MySQL-to-ClickHouse replication
Replication log reader for ontime database can be started by an example shell script:
./examples/run_airline_ontime_data_mysql_to_ch_reader.sh
If you are curious what it does, it runs ‘clickhouse-mysql’ with following parameters:
pypy clickhouse-mysql
--src-resume
--src-wait
--nice-pause=1
--log-level=info
--log-file=ontime.log
--src-host=127.0.0.1
--src-user=root
--dst-host=127.0.0.1
--csvpool
--csvpool-file-path-prefix=qwe_
--mempool-max-flush-interval=60
--mempool-max-events-num=10000
Now let’s test how inserts are propagated from MySQL to ClickHouse. We will duplicate the data in MySQL.
mysql> insert into airline.ontime select * from airline.ontime;
mysql> select count(*) from airline.ontime;
+----------+
| count(*) |
+----------+
| 2579102 |
+----------+
And check on ClickHouse side.
:) select count() from airline.ontime;
SELECT count()
FROM airline.ontime
┌─count()─┐
│ 2579102 │
└─────────┘
Inserted records were populated to ClickHouse.
Conclusion
Several months ago ProxySQL introduced ClickHouse support allowing to access ClickHouse via MySQL protocol. We took this further and demonstrated that it is possible to have MySQL data in ClickHouse in realtime using MySQL binlog reading technique. There are multiple ways to do that. Custom implementation like Altinity ‘clickhouse-mysql’ prototype is one approach, or more universal technologies can be used, for example Uber’s StorageTapper that generates Kafka stream from MySQL binlogs that can be later inserted to ClickHouse using Kafka engine or dedicated consumer.