Altinity
ClickHouse Leading Service Provider

Blog

ClickHouse on the Cloud benchmarks

Jan 4, 2018

It's been a while since Altinity announced a partnership with Kodiak Data, a cloud infrastructure company. Despite that, we have never written about Kodiak Data and how they help with ClickHouse deployments. Now there are several companies already using ClickHouse at Kodiak Data MemCloud(TM), so it's time to explain why.

KodiakData is a cloud infrastructure company. As such they can utilize any hardware and deploy managed cloud in any datacenter. However, Kodiak Data decided to primarily target the needs of users who require highest possible performance. They offer really impressive "hardware" with NVMe SSDs with entropy encoding to survive failures, expandable RAM, 40GBe internal network and so on. I put "hardware" in quotes because what you get is a virtual box or boxes, but characteristics of those "boxes" match the high-end hardware.

Certainly, we were interested in how it performs for ClickHouse and conducted several tests using very basic Kodiak Data virtual boxes. These tests are based on NYC Taxi trips dataset and are actually extensions of our 'ClickHouse vs. RedShift' benchmark. In this benchmark, we compared ClickHouse at high-performance Amazon EC2 i3 instances to basic Kodiak Data instances, and also listed RedShift at various configurations for comparison as well.

To recap, the test dataset contains approximately 1.3 billion taxi rides records. We used 8 test queries. Sometimes queries were slightly different for ClickHouse and RedShift due to use of ClickHouse dictionaries instead of SQL JOIN. Then we summed execution time for all queries to get the final number.

Q1: 
SELECT passenger_count, avg(total_amount) 
  FROM yellow_tripdata_staging 
 GROUP BY passenger_count;

Q2: 
SELECT passenger_count, toYear(tpep_pickup_datetime) AS year, count(*)   FROM yellow_tripdata_staging 
 GROUP BY passenger_count, year

Q3:
SELECT passenger_count, toYear(tpep_pickup_datetime) AS year, round(trip_distance) AS distance, count(*) 
  FROM yellow_tripdata_staging 
 GROUP BY passenger_count, year, distance 
 ORDER BY year, count(*) DESC;

Q4:
-- ClickHouse
SELECT dictGetString('taxi_zones', 'zone', toUInt64(pickup_location_id)) AS zone, count(*)
  FROM yellow_tripdata
 GROUP BY pickup_location_id
 ORDER BY count(*) DESC
 LIMIT 10;

-- RedShift
SELECT min(zone), count(*) 
  FROM yellow_tripdata_staging td 
  left join taxi_zones tz on tz.locationid = td.pickup_location_id
 GROUP BY pickup_location_id 
 ORDER BY count(*) DESC 
 LIMIT 10;

Q5:
-- ClickHouse
SELECT count(*)
  FROM yellow_tripdata 
 WHERE dictGetString('taxi_zones', 'zone', toUInt64(pickup_location_id)) = 'Midtown East';

-- RedShift
SELECT count(*) 
  FROM yellow_tripdata_staging td 
  left join taxi_zones tz on tz.locationid = td.pickup_location_id
 WHERE zone = 'Midtown East';

Q6:

-- ClickHouse
SELECT dictGetFloat32('weather', 'min_temperature', ('GHCND:USW00094728', pickup_date)) AS t,
       count() AS c
  FROM yellow_tripdata
  GROUP BY t
  ORDER BY c DESC
  LIMIT 10;

-- RedShift:
SELECT min_temperature t, count(*) c 
  FROM yellow_tripdata_staging td 
  left join central_park_weather_observations w on w.date = trunc(td.tpep_pickup_datetime) 
 GROUP BY t 
 ORDER BY c DESC 
 LIMIT 10;

Q7:
-- ClickHouse
SELECT dictGetFloat32('weather', 'snowfall', ('GHCND:USW00094728', pickup_date)) AS t,
       avg((tpep_dropoff_datetime - tpep_pickup_datetime) / 60) AS d
  FROM yellow_tripdata_staging
  GROUP BY t
  ORDER BY t DESC
  LIMIT 10;

-- RedShift
SELECT snowfall t,
       avg(datediff(minute,tpep_pickup_datetime,tpep_dropoff_datetime)) AS d 
  FROM yellow_tripdata_staging td 
  left join central_park_weather_observations w on w.date = trunc(td.tpep_pickup_datetime) 
 GROUP BY t 
 ORDER BY t DESC 
 LIMIT 10;

Q7:
-- ClickHouse
SELECT dictGetUInt32('weather', 'snowfall', ('GHCND:USW00094728', pickup_date)) AS t,
       sum(passenger_count) AS c
  FROM yellow_tripdata
  GROUP BY t
  ORDER BY t DESC
  LIMIT 10;

-- RedShift
SELECT snowfall t, 
       sum(passenger_count) AS d 
  FROM yellow_tripdata_staging td 
  left join central_park_weather_observations w on w.date = trunc(td.tpep_pickup_datetime)
 GROUP BY t 
 ORDER BY t DESC
 LIMIT 10;

The same test queries have been executed at various hardware configurations in Amazon and Kodiak Data.

ClickHouse:

  • AWS r4.xlarge x1: 4 vCPU, 13.5 ECU, 30.5GB RAM, EBS
  • AWS i3.2xlarge x1: 8 vCPU, 27 ECU, 61GB RAM, 1 x 1990 NVMe SSD
  • AWS i3.4xlarge x1: 16 vCPU, 53 ECU, 122GB RAM, 2 x 1990 NVMe SSD
  • Kodiak A.1: 8 vCPU, 16GB vRAM, 2TB NVMe SSD vDisk
  • Kodiak A.3: 3 * (8 vCPU, 16GB vRAM, 2TB NVMe SSD vDisk) -- this is 3 node cluster that stands apart from other configs, but we have it in our test environment, so could not resist to include it in the picture

RedShift:

  • RedShift dc2.8xlarge: 36 vCPU, 244GB vRAM, 2.5TB NVMe SSD -- this is the best RedShift instance available to the moment of writing

Here is the summary of results:

Total query time per different ClickHouse and RedShift setups. Less is better

Results demonstrate that dedicated high-performance cloud even at the low-end setup outperforms most advanced Amazon instances. Also, it positions ClickHouse very well against RedShift, that has been already demonstrated by our previous benchmarks.

However, performance figures are not complete without cost. And here you will be surprised again:

ClickHouse:

  • AWS r4.xlarge x1: 4 vCPU, 13.5 ECU, 30.5GB RAM, EBS -- $45/week
  • AWS i3.2xlarge x1: 8 vCPU, 27 ECU, 61GB RAM, 1 x 1990 NVMe SSD -- $104/week
  • AWS i3.4xlarge x1: 16 vCPU, 53 ECU, 122GB RAM, 2 x 1990 NVMe SSD -- $208/week
  • Kodiak A.1: 8 vCPU, 16GB vRAM, 2TB NVMe SSD vDisk -- $40/week
  • Kodiak A.3: 3 * (8 vCPU, 16GB vRAM, 2TB NVMe SSD vDisk) -- $100/week

RedShift:

  • RedShift dc2.8xlarge: 36 vCPU, 244GB vRAM, 2.5TB NVMe SSD -- $950/week

Notes, prices are approximate, taken from :

  1. For Amazon, prices refer to https://aws.amazon.com/ec2/pricing/on-demand/
  2. RedShift prices can be found at https://aws.amazon.com/redshift/pricing/
  3. http://calculator.s3.amazonaws.com/index.html can be used to estimate discounted long term commitments with Amazon
  4. Kodiak A.1 and A.3 prices are discounted quotes for Altinity clients

ClickHouse on Kodiak Data is not only faster, but it is also several times less expensive than ClickHouse at AWS and an order of magnitude less expensive than RedShift.

Please don't take it as a marketing push, but when we see such numbers we cannot resist recommending Kodiak Data for ClickHouse if you need a cloud solution. If private hosting or on-permise is considered, Kodiak Data can also be a preferable alternative, because NVMe SSD setups with big disks are not widely available at hosting providers and are expensive. On the other hand, huge RAID arrays of spinning disks can compete with NVMe SSDs on speed and can give more capacity at less cost.

It worth mentioning that ClickHouse on the Cloud is not yet 'cloudy' enough, i.e. it cannot be started with the few mouse clicks like RedShift. It requires manual installation from AWS image or RPM and manual configuration for the cluster. This is something that Altinity is working on to make ClickHouse more user-friendly in cloud environments. We will keep everybody posted.

Meanwhile, ClickHouse on Kodiak Data MemCloud (TM) is available for everybody to try at ClickHouse-on-MemCloud demo site right now. You can get an access to our demo system at Kodiak Data MemCloud (TM) that contains three test datasets already preloaded. So test queries from this article and our other benchmarks can be tried.
Have fun!


Few links to the articles and resources mentioned above: