clickhouse-local: The power of ClickHouse SQL in a single command

The most interesting innovations in databases come from asking simple questions. For example: what if you could run ClickHouse queries without a server or attached storage? It would just be SQL queries and the rich ClickHouse function library. What would that look like? What problems could we solve with it?

The question of what it would look like is easy to answer: ‘clickhouse-local’! You may not know about this handy tool, as not a lot has been written about it. A simple explanation is that ‘clickhouse-local’ turns the ClickHouse SQL query processor into a command line utility.

So what problems can we solve with ‘clickhouse-local’? That requires a longer answer. We now have a simple way to query non-ClickHouse data at ClickHouse speeds as well as perform data stream preprocessing. You can substitute powerful SQL expressions for traditional commands like awk or sed. We will explore examples in the rest of this article to demonstrate the power of this technique.

Note: ‘clickhouse-local’ is just one of several useful utilities in the ClickHouse distribution besides ‘clickhouse-client’ and ‘clickhouse-server’. The ‘clickhouse-copier’ tool copies data between environments. We described it in an article a while ago, so have a look there to find out more.

Introduction

The description of ‘clickhouse-local’ in the documentation is quite short: the ‘clickhouse-local’ program enables you to perform fast processing on local files, without having to deploy and configure the ClickHouse server.

The documentation also gives a fancy example of using the tool:

#printing memory used by each Unix user:
$ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | clickhouse-local -S "user String, mem Float64" -q "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
┌──────────┬──────────┐
│ user     │ memTotal │
├──────────┼──────────┤
│ clickho+ │      8.1 │
├──────────┼──────────┤
│ root     │      1.5 │
├──────────┼──────────┤
│ telegraf │      0.1 │
├──────────┼──────────┤
│ postfix  │        0 │
└──────────┴──────────┘

Well, it looks useful. Linux shell hackers can use that in their scripts, embedding ‘clickhouse-local’ here and there and creating an alias for every stat they need. Those fancy commands will surely make an indelible impression on colleagues. 🙂
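
For instance, the memory report above can be turned into a reusable shell function and dropped into your ~/.bashrc; the name ‘memtop’ is just an illustration:

memtop() {
  # per-user memory usage, aggregated by clickhouse-local
  ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | \
    clickhouse-local -S "user String, mem Float64" \
      -q "SELECT user, round(sum(mem), 2) AS memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
}

After that, typing memtop prints the same table on demand.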

But does the tool give something to data engineers?

The answer is definitely yes. Let’s look through two major use cases.

Use Case 1. Query CSV and other file formats with SQL

Imagine you have a CSV file, and you want to extract something from that using the full power of SQL queries. Of course, you can insert that CSV file into ClickHouse (or some other DBMS), but for a one-time query it sounds like overkill.

So how can ‘clickhouse-local’ help in that case? Let’s check.

First of all, you need to have ‘clickhouse-local’ installed. It is distributed as part of the ‘clickhouse-client’ package, so if you have ‘clickhouse-client’ on your system, you already have ‘clickhouse-local’. Otherwise, install the ‘clickhouse-client’ package following the official installation instructions. It is a single binary with minimal dependencies.
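
For instance, on Ubuntu or Debian the installation boils down to a couple of apt commands; this is a sketch that assumes the official ClickHouse repository has already been added to your system (see the installation instructions for the repository setup and for other platforms):

$ sudo apt-get update
$ sudo apt-get install clickhouse-client

# verify the tool is available
$ clickhouse-local --version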

Now let's check out an example. We took the CSV-formatted dataset “120 years of Olympic history: athletes and results”, which contains basic bio data on athletes and medal results from Athens 1896 to Rio 2016. The dataset is available for download from the popular machine learning resource Kaggle. It is not a big one: only 271K records, 15 columns each. What we can do with ‘clickhouse-local’ is query this dataset directly from the file using SQL, the same way as a database table! For example, let's count how many records the CSV has per sport discipline and find the maximum athlete height in each:

$ clickhouse-local \
    --file=athlete_events.csv \
    --input-format=CSVWithNames \
    --table=athlete_events \
    --structure='ID String,Name String,Sex String,Age String,Height String,Weight String,Team String,NOC String,Games String,Year String,Season String,City String,Sport String,Event String,Medal String' \
    --query='SELECT Sport, count() count, max(toFloat64OrZero(Height)) h from athlete_events GROUP BY Sport ORDER BY h FORMAT PrettyCompact'
┌─Sport─────────────────────┬─count─┬───h─┐
│ Basque Pelota             │     2 │   0 │
│ Aeronautics               │     1 │   0 │
│ Military Ski Patrol       │    24 │   0 │
...
│ Rugby                     │   162 │ 191 │
│ Short Track Speed Skating │  1534 │ 191 │
│ Figure Skating            │  2298 │ 193 │
...
│ Volleyball                │  3404 │ 219 │
│ Basketball                │  4536 │ 226 │
└───────────────────────────┴───────┴─────┘

# time: 
# real    0m0.421s
# user    0m0.233s
# sys    0m0.068s

Apparently, the tallest Olympic basketball player in the dataset was 226 cm tall, while the tallest figure skater was only 193 cm. Cool!

All the command line arguments are described in the official documentation and are also shown when you call ‘clickhouse-local --help’.

The idea of running SQL queries on CSV files isn't new, and there are several projects that implement the concept.

Let's compare ‘clickhouse-local’ with the two most popular ones: ‘q’ and ‘textql’.

‘textql’ is written in Go on top of SQLite. There is no binary distribution, so it needs to be built from source. Somewhat atypically for the Unix world, it uses long parameter names with a single dash.

$ textql -header -sql='SELECT Sport, count() count, max(CAST(Height AS decimal)) h from athlete_events GROUP BY Sport ORDER BY h' /athlete_events.csv 
Aeronautics,1,0
...
Basketball,4536,226

# time 
# real    0m4.764s
# user    0m4.877s
# sys    0m0.121s

Let’s try the same with ‘q’. It is also built on top of SQLite, this time using Python.

$ cat /athlete_events.csv | q -H 'SELECT Sport, count() count, max(CAST(Height AS decimal)) h from - GROUP BY Sport ORDER BY h' -d,
Aeronautics,1,0
...
Volleyball,3404,219
Basketball,4536,226

# time
# real    0m7.003s
# user    0m6.839s
# sys    0m0.288s

If we compare query times:

  • clickhouse-local: 0m0.421s
  • textql: 0m4.764s
  • q: 0m7.003s

‘clickhouse-local’ is certainly much faster than tools built on top of SQLite. ClickHouse’s outstanding performance, rich SQL dialect, and abundance of input/output formats shine here as well. The main disadvantage of ‘clickhouse-local’ compared to those tools is the lack of schema auto-detection.

Usage of ‘clickhouse-local’ is not limited to reading files from the local file system. You can access alternative data sources through ClickHouse table functions such as ‘hdfs’ and ‘url’, and you can also generate data with the ‘numbers’ function, etc.

Some examples:

$ clickhouse-local --query='select *, rand() from numbers(10)'
0    1587455344
1    2935372231
2    3928708432
3    3640071348
4    1777407868
5    2705362273
6    2100998939
7    3720239753
8    4128495323
9    3397399832

$ clickhouse-local --query="select * from url('http://www.geoplugin.net/csv.gp?ip=3.3.3.3','CSV','property String,value String') limit 10" --format=PrettyCompact
┌─property──────────────┬─value──────┐
│ geoplugin_request     │ 3.3.3.3    │
│ geoplugin_status      │ 200        │
│ geoplugin_delay       │ 2ms        │
│ geoplugin_city        │ Seattle    │
│ geoplugin_region      │ Washington │
│ geoplugin_regionCode  │ WA         │
│ geoplugin_regionName  │ Washington │
│ geoplugin_areaCode    │            │
│ geoplugin_dmaCode     │ 819        │
│ geoplugin_countryCode │ US         │
└───────────────────────┴────────────┘
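
Reading from HDFS follows the same pattern through the ‘hdfs’ table function; the namenode address, path, and column structure below are hypothetical placeholders to adapt to your environment:

$ clickhouse-local --query="SELECT count() FROM hdfs('hdfs://namenode:8020/data/events/*.csv', 'CSVWithNames', 'id UInt64, ts String, value Float64')"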

Use Case 2. ‘clickhouse-local’ as input stream preprocessor

Sometimes data needs to be preprocessed before it is inserted into ClickHouse. Of course, you can do scripting magic with tools like sed, awk, Python, etc., but with that approach, before the data reaches ClickHouse, where everything works with amazing speed, it has to pass through good old but much slower tools. ‘clickhouse-local’ enables another approach to preprocessing.

Suppose we have the following CSV file:

"x","y","z"
"x","y","z"
"1","35676000","VG9reW8="
"2","19354922.0","TmV3IFlvcms="
"3","19028000","TWV4aWNvIENpdHk="
"4","18978000","TXVtYmFp"
"5","18845000"," U8OjbyBQYXVsbw=="
"6","15926000","RGVsaGk="
"7","14987000","U2hhbmdoYWk="
"8a","14787000","S29sa2F0YQ=="
"9","12815475.0","TG9zIEFuZ2VsZXM="
"10","12797394","RGhha2E="
"11","12795000","QnVlbm9zIEFpcmVz"
"12","12130000","S2FyYWNoaQ=="
"13","11893000","Q2Fpcm8="
"14","NA","UmlvIGRlIEphbmVpcm8="
"15","11294000","xYxzYWth"

The CSV has several issues: a duplicated header row, base64-encoded strings in the third column (it would be better to decode them before loading into ClickHouse), and rows with bad values like “NA” and “8a”.

We want to load the data from that file into a ClickHouse table of the following structure:

CREATE TABLE target_table
(
    `a` UInt64, 
    `b` Float64, 
    `c` String
)
ENGINE = Log
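
If you want to follow along, the table can be created with ‘clickhouse-client’, assuming a ClickHouse server is running locally:

$ clickhouse-client --query='CREATE TABLE target_table (a UInt64, b Float64, c String) ENGINE = Log'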

If we try to load it directly, ClickHouse will fail because of the issues described above.

$ clickhouse-client --query='INSERT INTO target_table FORMAT CSVWithNames' <data.csv

Code: 27. DB::Exception: Cannot parse input: … (at row 1)

We could probably cut off the first two lines using the tail -n +3 command, but that doesn't help with the other issues in the input file:

$ cat data.csv | tail -n+3 | clickhouse-client --query='INSERT INTO target_table FORMAT CSV'
Code: 27. DB::Exception: Cannot parse input: … (at row 8)

So how to deal with this? Let’s use ‘clickhouse-local’!

The columns in the input file contain some bad values, so instead of reading them directly with the proper types, we will read and process them all as strings.

$ cat data.csv | tail -n +3 | clickhouse-local --table='input' \
    --structure="col1 String, col2 String, col3 String" \
    --input-format='CSV' \
    --query='SELECT * FROM input'

1    35676000    VG9reW8=
2    19354922.0    TmV3IFlvcms=
3    19028000    TWV4aWNvIENpdHk=
4    18978000    TXVtYmFp
5    18845000     U8OjbyBQYXVsbw==
6    15926000    RGVsaGk=
7    14987000    U2hhbmdoYWk=
8a    14787000    S29sa2F0YQ==
9    12815475.0    TG9zIEFuZ2VsZXM=
10    12797394    RGhha2E=
11    12795000    QnVlbm9zIEFpcmVz
12    12130000    S2FyYWNoaQ==
13    11893000    Q2Fpcm8=
14    NA    UmlvIGRlIEphbmVpcm8=
15    11294000    xYxzYWth

Once we have all the data ‘loaded’, we can apply necessary conversions using a SQL query:

$ cat data.csv | tail -n +3 | clickhouse-local --table='input' \
    --structure="col1 String, col2 String, col3 String" \
    --input-format='CSV' \
    --query='SELECT toUInt64OrZero(col1), toFloat64OrZero(col2), tryBase64Decode(col3) FROM input'

1    35676000    Tokyo
2    19354922    New York
3    19028000    Mexico City
4    18978000    Mumbai
5    18845000    
6    15926000    Delhi
7    14987000    Shanghai
0    14787000    Kolkata
9    12815475    Los Angeles
10    12797394    Dhaka
11    12795000    Buenos Aires
12    12130000    Karachi
13    11893000    Cairo
14    0    Rio de Janeiro
15    11294000    ?saka

We can also filter out bad rows (the expectation being that valid values should not be zero or empty):

$ cat data.csv | tail -n +3 | clickhouse-local --table='input' \
    --structure="col1 String, col2 String, col3 String" \
    --input-format='CSV' \
    --query="SELECT toUInt64OrZero(col1) a, toFloat64OrZero(col2) b, tryBase64Decode(col3) c FROM input WHERE a<>0 and b<>0 and c<>''"

1    35676000    Tokyo
2    19354922    New York
3    19028000    Mexico City
4    18978000    Mumbai
6    15926000    Delhi
7    14987000    Shanghai
9    12815475    Los Angeles
10    12797394    Dhaka
11    12795000    Buenos Aires
12    12130000    Karachi
13    11893000    Cairo
15    11294000    ?saka

Now we can put the result into ClickHouse:

$ cat data.csv | tail -n +3 | clickhouse-local --table='input' \
    --structure="col1 String, col2 String, col3 String" \
    --input-format='CSV' \
    --query="SELECT toUInt64OrZero(col1) a, toFloat64OrZero(col2) b, tryBase64Decode(col3) c FROM input WHERE a<>0 and b<>0 and c<>''" \
    | clickhouse-client --query='INSERT INTO target_table FORMAT TSV'

That works, but there was a lot of extra work: ‘clickhouse-local’ first created a ClickHouse Native datastream, then converted it to TSV. The TSV was sent via a pipe to ‘clickhouse-client’, which had to convert it back to a Native datastream and only then send it over to the server. Those are excessive transformations.

We can avoid the extra data transformations by passing the Native stream directly from ‘clickhouse-local’ to ‘clickhouse-client’; note the format on both sides:

$ cat data.csv | tail -n +3 | clickhouse-local --table='input' \
    --structure="col1 String, col2 String, col3 String" \
    --input-format='CSV' \
    --query="SELECT toUInt64OrZero(col1) a, toFloat64OrZero(col2) b, tryBase64Decode(col3) c FROM input WHERE a<>0 and b<>0 and c<>''" \
    --format=Native \
    | clickhouse-client --query='INSERT INTO target_table FORMAT Native'

Conclusion

This article showed how ‘clickhouse-local’ can be installed and used in two major scenarios: for executing analytical SQL queries on CSV files and as a preprocessor in your shell scripts.

‘clickhouse-local’ handles such tasks with outstanding performance, leaving other tools created for the purpose far behind. Being a part of the ClickHouse toolset, ‘clickhouse-local’ integrates seamlessly and efficiently with ‘clickhouse-client’ using the Native format.

If you already use ClickHouse, you definitely should try ‘clickhouse-local’. You will love it. It’s handy, simple and helpful.

If you are not using ClickHouse yet, ‘clickhouse-local’ is another good reason to start. It's worth mentioning that the ClickHouse server is even more capable and much faster than ‘clickhouse-local’, so after checking how ‘clickhouse-local’ works for your case, take a look at the ClickHouse server as well.

As you start using ‘clickhouse-local’ you will likely find many more use cases. Please share your ideas. And don’t forget to subscribe to our blog!

  
