Migration to ClickHouse
Oct 23, 2017
ClickHouse is an excellent analytics database choice not just for startups but also for companies that have already invested significant amount of resources into their analytics solutions, but are not completely satisfied with the results. In this article we will discuss how and when companies consider the ClickHouse migration project, and what challenges they may expect. We do not disclose any names, but every example has a real world prototype.
When to migrate
Migration is always a pain. It does not matter if you move to another house or to another database, it is always a lot of efforts, bothers, frustrations and so on. But the final result worths the challenge, and this is what motivates you to keep going. So what are the motivators to consider migration to ClickHouse?
One of the main motivators is performance degradations with the data grouth. It is very easy to implement a prototype and even a production system using popular open source MySQL or PostreSQL databases. It works fine until the data size is realatively small, indexes fit in memory, data cache hit ratio is high enough. Unfortunately, such an idilly eventually comes to the end, and quries become sluggish. Sometimes it is possible to add more memory, order faster disks and so on, but it only delays the main question: how can the system be scaled out?
ClickHouse may help here a lot. Even when running at the same physical server it typically shows x100-500 query performance for analytic workloads comparing to OLTP databases. If it is not enough, ClickHouse cluster can be easily rolled out to address problems of any scale.
Many companies that struggled with performance problems, tried different open source alternatives, -- they have found ClickHouse to be their silver bullet.
Cost of Scaling-Out
Before ClickHouse hit the market many companies have already deployed analytic solutions using commercial DBMSs. HP Vertica is probably the best out those. Vertica is pricey but it does not hurt too much for low number of TBs, the price is often affordable even for small companies. The picture changes when business demands to scale the data 10 or 100 times (it worths noting that increase in data size does not always mean increase in revenues). With commerical DBMSs scaling out the data means to increase costs at the close rate, though volume discounts are usually provided.
At this point some companies may consider to change the platform. ClickHouse allows to scale with only hardware costs applied with a splash of expertise.
Cost of Ownership
Cloud databases provide a quickest solution to anaylitic databases deployment. Google Big Query, Amazon RedShift or Snowflake DBs are very good products. They are very quick to setup and make the system up and running. Over the time, however, some companies wonder if the price of quick startup is justified in the long term. We have made a number of benchmarks of ClickHouse running on Amazon and dedicated servers and compared it to RedShift. For example, at this article we concluded that the cost of running ClickHouse in Amazon is several times lower comparing to similarly performing RedShift instance. Original article that inspired our tests allows to estimate cost reduction comparing to Google Big Query and Athena. ClickHouse clearly wins and may reduce costs significantly.
Another consideration here is that ClickHouse deployment is not vendor locked. Companies can deploy ClickHouse solutions to Amazon, dedicated servers, private clouds or dedicated clouds like KodiakData thus optimizing their deployment and cost structure in a most flexible way.
ClickHouse is an unusual database with a lot cool features and design decisions made for the best performance. It takes a while, however, to learn how to use ClickHouse most efficiently. The other side of the coin shows up in ClickHouse limitations. ClickHouse lacks some things that more mature databases have for years. That complicates the migration project.
Main Challenges of ClickHouse migration
When migrating to ClickHouse you may expect following challenges.
Efficient schema. Schema is the key to performance. Therefore good understanding of ClickHouse best practices is required, how table engines work, what dictionaries are and so on. Proper schema allows to use ClickHouse in the most efficient way.
Reliable data ingestion. ClickHouse allows to balance between consistency and speed, and those trade-offs are important to understand and manage, especially in the distributed environment.
Data distribution and replication. Scalable and reliable system requires those to be properly designed. ClickHouse is very flexible in how data is distributed and replicated, there is no single way.
Client interfaces. One of main ClickHouse issues to date is non-standard SQL dialect. It is very powerful with a lot of extensions, but it is not standard. That makes it sometimes hard to integrade with client tools that use SQL. This issue is well understood by ClickHouse developers, and we can expect ClickHouse to support SQL-92 or later standards at the future.
Any migration project should start from the plan. We propose the following approach to mitigate risks and ensure the migration project is successful.
Confirm you use case. It is crucially important to use ClickHouse for the use cases it has been designed for. Primarily it is any time series data that comes in a stream or batches. As Michael Stonebraker wrote in his game changing article 10 years ago -- "One size does not fit all" -- ClickHouse should not be used as a general purpose database.
Check benchmarks. There are many benchmarks available and there is a good chance that you can find the benchmark against the database you are already using. Links to some benchmarks can be found at our web site.
Run your own benchmarks. "Trust by Check" -- it is always makes a lot of sense to run the benchmark using the real data and compare to the existing production system. Sometimes it becomes the first ClickHouse hands-on experience for the team responsible for infrastructure and future migration.
Carefully analyze ClickHouse limitations, not features. Managers often pay attention only to the features but overlook the limitations. Developers tend to be more skeptical. Features do work, limitations may become show stopper or significant challenge at some point. Such ClickHouse limitations as restricted partitioning, absense of updates and deletes, transactions and so on needs to be reviewed. Don't be afraid of those, ClickHouse limitations in most cases can be worked around or even turned to the benefits.
Make a POC. Proof of concept allows to run a toy system end-to-end, learn ClickHouse, aquire hands-on expertise and so on. Such an experience will help with the real system design and also should uncover most of issues that system implementation may face.
Enter the community. ClickHouse is an open source database. So the best support is community. Yandex-backed Telegram channels and Google Group are the best places for newcomers to learn more, ask questions and so on. Even if you plan for professional support from Altinity at some point, it makes sense to learn as much as you can by yourself.
Once those pre-requisites steps are completed you will get some ClickHouse expertise applicable to your use case and will have a confidence that ClickHouse works for you (or does not). How to proceed further depends on the complexity of your system. POC may evolve to the production deployment, or something new may be required to start from scratch. Here hardware requirements for production data loads needs to be analyzed, integrations needs to be designed and so on. Proper QA and testing needs to be planned as well.
For some companies it takes just a few weeks to completely swtich-over to ClickHouse, while huge project may take a year.
We at Altinity bring our experience and expertise to help companies with their migrations to ClickHouse. We believe that ClickHouse is the best open source analytical database available today, and it should be used everywhere where big data needs to be analyzed.