ClickHouse vs Amazon Redshift Benchmark #2: STAR2002 dataset
Jul 3, 2017
Preface
We continue to benchmark ClickHouse against other column-oriented databases. Here we run another test against Amazon Redshift, using a different dataset and different queries.
For this benchmark we load a 1TB CSV dataset. The dataset is based on the STAR2002 experiment data repeated 500 times. We compare the ClickHouse results with the benchmark described in the GCE BigQuery vs AWS Redshift vs AWS Athena article, where Redshift was tested in two different configurations.
The source dataset can be downloaded from here and needs to be replicated 500 times. The resulting dataset has the following properties:
- CSV size: 997GB (~1TB)
- 7 928 812 500 lines (~8 billion)
- 16 columns
- All columns are integers, floats or double-precision floats
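The replication step can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the benchmark: the function name replicate_csv and the has_header switch are hypothetical, and the sketch assumes the source file ends with a newline so that copies do not run into each other.

```python
import shutil
from pathlib import Path

# Figures from the dataset description above.
TOTAL_LINES = 7_928_812_500
COPIES = 500
LINES_PER_COPY = TOTAL_LINES // COPIES  # lines in one copy of the source file


def replicate_csv(source: Path, target: Path, copies: int, has_header: bool = False) -> None:
    """Write `copies` back-to-back copies of `source` into `target`.

    Assumption: `source` ends with a newline. Whether the real file has a
    header row is not stated in the article, so it is a parameter here;
    when set, the header is kept once and skipped in later copies.
    """
    with open(target, "wb") as out:
        for i in range(copies):
            with open(source, "rb") as src:
                if has_header and i > 0:
                    src.readline()  # drop the repeated header line
                # Stream in 16 MB chunks so the file never has to fit in memory.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
```

Dividing the final line count by 500 gives 15 857 625 lines per copy of the source file.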
The tested configuration is Amazon d2.xlarge EC2 instances with ClickHouse installed.
Data Load
The CSV file was pre-copied to the ClickHouse server. Loading the data took 1 hour and 40 minutes, which is 6 times faster than Redshift took to load the same data from S3.
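To put the load time in perspective, the figures above can be turned into a rough throughput estimate. This is back-of-the-envelope arithmetic only: it assumes decimal gigabytes for the 997GB CSV size, and the Redshift load time is simply what the stated 6x ratio implies, not a measured value.

```python
# Inputs taken from the article; the GB definition is an assumption.
CSV_BYTES = 997 * 10**9             # 997 GB, assuming decimal gigabytes
ROWS = 7_928_812_500
LOAD_SECONDS = (1 * 60 + 40) * 60   # 1 hour 40 minutes

# Average ingest rate over the whole load.
mb_per_second = CSV_BYTES / LOAD_SECONDS / 10**6
rows_per_second = ROWS / LOAD_SECONDS

# Load time implied for Redshift by the "6 times faster" statement.
implied_redshift_hours = 6 * LOAD_SECONDS / 3600

print(f"ClickHouse load: {mb_per_second:.0f} MB/s, {rows_per_second:,.0f} rows/s")
print(f"Implied Redshift load time: {implied_redshift_hours:.1f} hours")
```

Under these assumptions ClickHouse ingested roughly 166 MB/s, or about 1.3 million rows per second, and the implied Redshift load time is about 10 hours.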
Performance Benchmark
We use the same queries as in the article mentioned above:
- SELECT count(*) FROM t
- SELECT count(*) FROM t WHERE eventnumber > 1
- SELECT count(*) FROM t WHERE eventnumber > 20000
- SELECT count(*) FROM t WHERE eventnumber > 500000
- SELECT eventFile, count(*) FROM t GROUP BY eventFile
- SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile
- SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC
- SELECT MAX(runNumber) FROM t
- SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000
- SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) FROM t WHERE eventnumber > 20000 GROUP BY eventFile
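For readers who want to reproduce the measurements, a minimal timing harness might look like the sketch below. It is not the method used for this benchmark, which the article does not describe: the best-of-N timing, the function names and the use of wall-clock time (which includes client start-up) are all assumptions. The only external call is clickhouse-client with its --query option.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable

# The ten benchmark queries, copied from the list above.
QUERIES = [
    "SELECT count(*) FROM t",
    "SELECT count(*) FROM t WHERE eventnumber > 1",
    "SELECT count(*) FROM t WHERE eventnumber > 20000",
    "SELECT count(*) FROM t WHERE eventnumber > 500000",
    "SELECT eventFile, count(*) FROM t GROUP BY eventFile",
    "SELECT eventFile, count(*) FROM t WHERE eventnumber > 525000 GROUP BY eventFile",
    "SELECT eventFile, eventTime, count(*) FROM t WHERE eventnumber > 525000 "
    "GROUP BY eventFile, eventTime ORDER BY eventFile DESC, eventTime ASC",
    "SELECT MAX(runNumber) FROM t",
    "SELECT AVG(eventTime) FROM t WHERE eventnumber > 20000",
    "SELECT eventFile, AVG(eventTime), AVG(multiplicity), MAX(runNumber), count(*) "
    "FROM t WHERE eventnumber > 20000 GROUP BY eventFile",
]


def run_with_clickhouse_client(sql: str) -> None:
    """Run one query through the clickhouse-client CLI and discard its output."""
    subprocess.run(
        ["clickhouse-client", "--query", sql],
        check=True,
        stdout=subprocess.DEVNULL,
    )


def time_queries(
    queries: Iterable[str],
    run_query: Callable[[str], None],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each query over `repeats` runs."""
    timings: Dict[str, float] = {}
    for sql in queries:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)
            best = min(best, time.perf_counter() - start)
        timings[sql] = best
    return timings
```

Calling time_queries(QUERIES, run_with_clickhouse_client) on a server with the table loaded returns one timing per query. Passing the runner as a parameter keeps the harness independent of the database, so the same loop could time another system's client.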
The results demonstrate that ClickHouse is much faster than Redshift on the same hardware and is comparable to Redshift running on a much more expensive configuration.
Conclusion
ClickHouse showed great results with a different dataset and different use cases. It is also interesting to compare prices for Redshift and ClickHouse instances:
Redshift:
- $0.9 per working hour for a ds2.xlarge instance, ~$650 per month
- $5.7 per working hour for a dc1.8xlarge instance, ~$4100 per month
ClickHouse:
- $0.266 per working hour for a d2.xlarge instance, ~$190 per month
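The monthly figures follow from the hourly prices if the instance runs around the clock. The short calculation below makes that explicit. The 720-hour month (30 days) is our assumption, and the dictionary labels are just for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: instance runs 24/7 for a 30-day month

# Hourly prices in USD, as listed above.
HOURLY_USD = {
    "Redshift ds2.xlarge": 0.9,
    "Redshift dc1.8xlarge": 5.7,
    "ClickHouse on d2.xlarge": 0.266,
}

# Monthly cost for each option.
monthly_usd = {name: rate * HOURS_PER_MONTH for name, rate in HOURLY_USD.items()}

# How many times more expensive each option is than the ClickHouse instance.
baseline = HOURLY_USD["ClickHouse on d2.xlarge"]
price_ratio = {name: rate / baseline for name, rate in HOURLY_USD.items()}

for name in HOURLY_USD:
    print(f"{name}: ${monthly_usd[name]:,.0f}/month, {price_ratio[name]:.1f}x ClickHouse")
```

With a 720-hour month this gives about $648, $4,104 and $192 respectively, in line with the rounded figures above. In hourly terms the ds2.xlarge is about 3.4 times and the dc1.8xlarge about 21 times the price of the d2.xlarge instance.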