Podcast: Scale Your Analytics On The ClickHouse Data Warehouse

July 8, 2019
Robert Hodges and Alexander Zaitsev on the Data Engineering Podcast with Tobias Macey
Scale Your Analytics On The ClickHouse Data Warehouse – Episode 88

The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode, Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the unique capabilities it offers, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.

Interview


  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what ClickHouse is and how you each got involved with it?
    • What are the primary use cases that ClickHouse is targeting?
    • Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
  • Can you describe how ClickHouse is architected?
  • Can you talk through the lifecycle of a given record or set of records, from when they first get inserted into ClickHouse, through the engine and storage layer, to the lookup process at query time? (A minimal sketch of this insert-to-query flow appears after this list.)
    • I noticed that ClickHouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for ClickHouse?
  • Aside from directly inserting a record via the client APIs, can you talk through the options for loading data into ClickHouse? (See the loading sketch after this list.)
    • For the MySQL/Postgres replication functionality, how do you handle schema evolution from the source DB to ClickHouse?
  • What are some of the advanced capabilities, such as SQL extensions and supported data types, that are unique to ClickHouse? (A few are illustrated in a sketch after this list.)
  • For someone getting started with ClickHouse, can you describe how they should be thinking about data modeling? (See the data modeling sketch after this list.)
  • Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, rather than using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does ClickHouse fall along that spectrum? (The materialized view sketch after this list shows the in-engine approach.)
  • How is scaling in ClickHouse implemented, and what are the edge cases that users should be aware of?
    • How are data replication and consistency managed? (See the sharding and replication sketch after this list.)
  • What is involved in deploying and maintaining an installation of ClickHouse?
    • I noticed that Altinity provides a Kubernetes operator for ClickHouse. What are the opportunities and tradeoffs presented by that platform for ClickHouse?
  • What are some of the most interesting/unexpected/innovative ways that you have seen ClickHouse used?
  • What are some of the most challenging aspects of working on ClickHouse itself, and/or of implementing systems on top of it?
  • What are the shortcomings of ClickHouse, and how do you address them at Altinity?
  • When is ClickHouse the wrong choice?
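
To make the record-lifecycle question concrete, here is a minimal sketch of the insert-to-query path, using the community clickhouse-driver Python package against a local server. The table, columns, and data are illustrative assumptions, not something from the episode: each INSERT batch lands as an immutable sorted part on disk, background merges combine parts, and queries use the sparse primary index to prune data.

```python
# A minimal sketch of the MergeTree insert-to-query path.
# Assumes a local ClickHouse server and `pip install clickhouse-driver`;
# the table and data are illustrative.
from datetime import datetime

from clickhouse_driver import Client

client = Client(host='localhost')  # native protocol, default port 9000

# ORDER BY defines the sorting key that the sparse primary index is
# built on; PARTITION BY groups parts by month.
client.execute('''
    CREATE TABLE IF NOT EXISTS events (
        event_time DateTime,
        user_id    UInt64,
        url        String
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (user_id, event_time)
''')

# Each INSERT batch is written as one or more immutable, sorted parts.
client.execute(
    'INSERT INTO events (event_time, user_id, url) VALUES',
    [(datetime(2019, 7, 1, 12, 0), 42, '/home'),
     (datetime(2019, 7, 1, 12, 1), 42, '/pricing')],
)

# Background merges fold small parts into larger ones over time; the
# currently active parts are visible in system.parts.
print(client.execute(
    "SELECT name, rows FROM system.parts WHERE table = 'events' AND active"
))

# At query time the sparse index prunes parts and row ranges by the
# sorting key before the needed columns are scanned.
print(client.execute('SELECT count() FROM events WHERE user_id = 42'))
```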
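
Besides the native protocol shown above, the loading question can be illustrated with ClickHouse's HTTP interface (default port 8123), which accepts bulk data in formats such as CSV, TSV, and JSONEachRow. A hedged sketch with the requests library, reusing the illustrative events table:

```python
# Loading data over the ClickHouse HTTP interface (default port 8123).
# Assumes the illustrative `events` table from the previous sketch.
import requests

CH_URL = 'http://localhost:8123/'

# The INSERT statement rides in the `query` parameter and the bulk
# payload goes in the request body; FORMAT picks the wire format.
csv_payload = (
    '"2019-07-01 12:02:00",43,"/docs"\n'
    '"2019-07-01 12:03:00",43,"/blog"\n'
)
resp = requests.post(
    CH_URL,
    params={'query': 'INSERT INTO events (event_time, user_id, url) FORMAT CSV'},
    data=csv_payload,
)
resp.raise_for_status()

# Reads work the same way; TabSeparated is the default output format.
print(requests.get(CH_URL, params={'query': 'SELECT count() FROM events'}).text)
```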
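
A few of the ClickHouse-specific data types and SQL extensions the capabilities question alludes to, in a small sketch (the table and columns are again illustrative): Array columns unnested with arrayJoin, dictionary-encoded LowCardinality strings, and approximate aggregates such as uniq and quantile.

```python
# Illustrative sketch of ClickHouse-specific types and SQL extensions.
from clickhouse_driver import Client

client = Client(host='localhost')

client.execute('''
    CREATE TABLE IF NOT EXISTS pageviews (
        day         Date,
        country     LowCardinality(String),  -- dictionary-encoded strings
        user_id     UInt64,
        duration_ms UInt32,
        tags        Array(String)            -- first-class array column
    ) ENGINE = MergeTree()
    ORDER BY (country, day)
''')

# arrayJoin() unnests an array into one row per element, while uniq()
# and quantile() are fast approximate aggregates -- all ClickHouse
# extensions rather than standard SQL.
rows = client.execute('''
    SELECT
        arrayJoin(tags)            AS tag,
        uniq(user_id)              AS approx_users,
        quantile(0.9)(duration_ms) AS p90_duration_ms
    FROM pageviews
    GROUP BY tag
''')
print(rows)
```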
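
On data modeling, and on inserting raw records and then transforming inside the engine, one idiomatic ClickHouse pattern is a materialized view over a raw table: raw rows land in a wide MergeTree table while the view maintains a rollup at insert time. A hedged sketch, reusing the illustrative events table:

```python
# Data modeling sketch: raw events plus an insert-time rollup via a
# materialized view. Assumes the illustrative `events` table above.
from clickhouse_driver import Client

client = Client(host='localhost')

# SummingMergeTree collapses rows that share the sorting key by summing
# the numeric columns (here `views`) during background merges.
client.execute('''
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_daily
    ENGINE = SummingMergeTree()
    PARTITION BY toYYYYMM(day)
    ORDER BY (user_id, day)
    AS SELECT
        user_id,
        toDate(event_time) AS day,
        count() AS views
    FROM events
    GROUP BY user_id, day
''')

# Rows with the same key may coexist until merges run, so aggregate at
# query time as well.
print(client.execute('''
    SELECT user_id, day, sum(views) AS views
    FROM events_daily
    GROUP BY user_id, day
'''))
```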
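
Finally, the sharding and replication questions usually resolve to one pattern in practice: ReplicatedMergeTree tables for per-shard replicas coordinated through ZooKeeper, fronted by a Distributed table that routes inserts and fans queries out across shards. A sketch assuming a cluster named my_cluster and {shard}/{replica} macros already defined in the server configuration; all names are illustrative.

```python
# Sharding and replication sketch. Assumes a configured cluster named
# `my_cluster`, ZooKeeper, and per-server {shard}/{replica} macros.
from clickhouse_driver import Client

client = Client(host='localhost')

# On every node: a replicated local table. The ZooKeeper path and
# replica name are filled in from the server's macros.
client.execute('''
    CREATE TABLE IF NOT EXISTS events_local ON CLUSTER my_cluster (
        event_time DateTime,
        user_id    UInt64,
        url        String
    ) ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/events_local', '{replica}')
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (user_id, event_time)
''')

# A Distributed table is a thin routing layer: inserts are sharded by
# the key (rand() here) and SELECTs fan out to one replica per shard.
client.execute('''
    CREATE TABLE IF NOT EXISTS events_all ON CLUSTER my_cluster
    AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, rand())
''')

# Applications read from and write to the Distributed table.
print(client.execute('SELECT count() FROM events_all'))
```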
