Altinity
ClickHouse Leading Service Provider

Blog

ClickHouse and Python: Getting to Know the Clickhouse-driver Client

 

Feb 1, 2019

Python is a force in the world of analytics due to powerful libraries like numpy along with a host of machine learning frameworks. ClickHouse is an increasingly popular store of data. As a Python data scientist you may wonder how to connect them.

Fortunately the Altinity Blog is here to solve mysteries, at least those that involve ClickHouse. This post contains a review of the clickhouse-driver client. It’s a solidly engineered module that is easy to use and integrates easily with standard tools like Jupyter Notebooks and Anaconda. Clickhouse-driver is a great way to jump into ClickHouse Python connectivity.

So Many Python Choices

The first hurdle for Python users is just picking a suitable driver. Even a quick search on pypi.org shows 22 projects with ClickHouse references. They include SQLAlchemy drivers (3 choices), async clients (also 3), and a Pandas-to-ClickHouse interface among others.

Clickhouse-driver offers a straightforward interface that enables Python clients to connect to ClickHouse, issue SELECT and DDL commands, and process results. It’s a good choice for direct Python connectivity with 16 published releases on pypi.org. The latest version is 0.0.17, published on January 10, 2019. If you want to connect to the data warehouse, issue SQL commands, and fetch back data, clickhouse-driver is a great place to start.

Code and Community

The clickhouse-driver source code is published on Github under an MIT license. The main committer is Konstantin Lebedev (@xzkostyan) though there have been a few contributions from others.

Konstantin is very responsive to questions about the driver, which you can register as issues. Much of my understanding of the wire protocol started from Konstantin’s comprehensive responses to an issue related to CSV loading that I filed early on in my use of the code. He has helped a number of other users as well.

Installation

You can of course install clickhouse-driver straight from Github but since releases are posted on pypi.org it’s far easier to use pip, like the example below. Just a note: examples are based on Python 3.7. This installation command includes lz4 compression, which can reduce data transfer sizes enormously.

pip3 install clickhouse-driver[lz4]

For testing purposes it’s a best practice to use a virtual environment, which means the installation usually looks like the following example:

python3 -m venv test 
. test/bin/activate
pip3 install clickhouse-driver[lz4]

If you use Anaconda there is conveniently a clickhouse package in Anaconda Cloud. You can install it with the following command:

conda install -c conda-forge clickhouse-driver

After doing this you can use clickhouse-driver in Jupyter Notebooks served up by Anaconda. We will dig more deeply into Anaconda integration in a future blog article. Meanwhile this should get you started.

Documentation

One of the strengths of clickhouse-driver is excellent documentation. The docs provide a nice introduction to the code as well as detailed descriptions of the API. In fact, it was somewhat challenging to make useful code-level observations for this article because the documentation already covered API behavior so well.

The docs should probably be the first stop for new clickhouse-driver users but are easy to overlook initially since they are referenced at the bottom of the project README.md. I only noticed them after writing a couple of test programs. It would be nice if docs were published in future using Github pages, which puts a prominent link on the top of the Github project. Once you find them though you’ll refer to them regularly.

Basic Operation

Clickhouse-driver is very simple to use. The main interface is the Client class, which most programs import directly.

from clickhouse_driver import Client

To set up a connection you instantiate the class with appropriate arguments. Here’s the simplest example for a connection to a localhost server using the default ClickHouse user and unencrypted communications. This is sufficient for trivial tests.

client = Client('localhost')

Of course real applications are more demanding. It’s typical to see something akin to the sample code below. It has a non-default user on a secure connection with self-signed certificates. The database is also different from the usual ‘default’. To top it off we are compressing data.

client = Client('ch01.yoyodyne.com',
                user='python',
                password='secret',
                secure=True,
                verify=False,
                database='marketing',
                compression=True)

The option flexibility is great. In particular security options are robust and include basic features corporate InfoSec teams expect. With the foregoing options clickhouse-driver auto-negotiates to TLSv1.2 on a properly configured ClickHouse server. That meets current PCI standards among others. I was also very pleased to find easy support for self-signed certificates, which are common in test scenarios.

Creating a client sets up the connection information but does not actually touch the ClickHouse server. The connection is established when you invoke the Client.execute() method. Here’s an example of a simple SELECT, followed by some code to iterate through the query result so we can see how it is put together.

result = client.execute('SELECT now(), version()')
print("RESULT: {0}: {1}".format(type(result), result))
for t in result:
    print(" ROW: {0}: {1}".format(type(t), t))
    for v in t:
            print("  COLUMN: {0}: {1}".format(type(v), v))

The output is shown below. It’s a list of tuples containing column values.

RESULT: <class 'list'>: [(datetime.datetime(2019, 1, 26, 2, 4, 49), '19.1.6')]
 ROW: <class 'tuple'>: (datetime.datetime(2019, 1, 26, 2, 4, 49), '19.1.6')
  COLUMN: <class 'datetime.datetime'>: 2019-01-26 02:04:49
  COLUMN: <class 'str'>: 19.1.6

The result format has a couple of advantages. First, it’s easy to manipulate in Python. For example you can just print any part of the output and it will show values, which is handy for debugging. Second, you can use values immediately rather than having to figure out conversions yourselves. That’s handy because Python does not automatically do even relatively simple coercions like str to int in numerical equations.

Let’s quickly tour operations to create a table, load some data, and fetch it back.

Data definition language (DDL) like CREATE TABLE uses a single string argument. The following example splits the string across lines for readability.

client.execute('CREATE TABLE iris_from_csv ('
               'sepal_length Decimal32(2), sepal_width Decimal32(2), '
               'petal_length Decimal32(2), petal_width Decimal32(2), '
               'species String) ENGINE = MergeTree '
               'PARTITION BY species ORDER BY (species)')

INSERT statements take an extra params argument to hold the values, as shown by the following example.

client.execute(
    'INSERT INTO iris (sepal_length, sepal_width, petal_length, petal_width, species) VALUES',
    [(5.1, 3.7, 1.5, 0.4, 'Iris-setosa'), (4.6, 3.6, 1.0, 0.2, 'Iris-setosa')]
)

The format for values is the same as the result format for SELECT statements. Clickhouse-driver uses a similar format in both directions. The INSERT params also support dictionary organization as well as generators, as we’ll see in a later section. See the docs for more insert examples.

We already showed an example of a SELECT statement using functions to generate output. Selecting out of a table looks pretty much the same, as shown by the following example.

result = client.execute('SELECT COUNT(*), species FROM iris '
                        'WHERE petal_length > toDecimal32(3.4, 2) '
                        'GROUP BY species ORDER BY species')
print(result)

Clickhouse-driver has a lot of useful features related to SELECTs. For instance, you can enable progress tracking using the Client.execute_with_progress() method, which is great when pulling down large result sets. Similarly the Client.execute_iter() method allows you to chunk results from large datasets to avoid overflowing memory. There’s even cancellation which covers you when somebody accidentally selects a few billion rows. Again, see the docs for examples.

One place where you need to be a little wary is prevention of SQL injection attacks. The procedure for query parameterization uses Python dictionary substitutions, as in the following example.

result = client.execute('SELECT * FROM iris '
                       'WHERE species = %(species)s LIMIT 5',
                        {'species': "Iris-setosa"})

You might try to circumvent the substitution scheme by setting ‘species’ to a string like “‘Iris-setosa’ AND evil_function() = 0”. The clickhouse-driver cleverly foils this attack by escaping strings and other common data types before doing substitutions. The query ends up looking like the following, which may break but won’t call evil_function() unexpectedly.

SELECT * FROM iris WHERE species = '\'Iris-setosa\' AND evil_function() = 0' LIMIT 5

This approach will protect you from run-of-the-mill villany with strings but there are ways around it. For instance, it appears possible to pass in Python object types that will not be escaped properly. (Check the driver code here to see why this might be so.) You should review substitution format strings carefully and also check Python parameter types at runtime to ensure something bad does not weasel through. That’s especially the case for Internet-facing applications.

A Deeper Look at the ClickHouse Wire Protocol

This is a good time to discuss what’s actually happening on the wire when communicating between the Python client and ClickHouse. To set context, ClickHouse has two wire protocols: HTTP protocol which uses simple PUT and POST operations to issue queries, and a native TCP/IP protocol that ships data as typed values. These run on different ports so there’s no confusion.

Clickhouse-driver uses the native TCP/IP protocol. This choice is better for Pythonistas because the native protocol knows about types and avoids loss of precision due to binary-to-string conversions. The implementation is correct, at least for the samples that I tried. That is an impressive accomplishment, because the documentation for the native protocol is the C++ implementation code.

As you go deeper into Python access to ClickHouse it’s helpful to understand what the TCP/IP protocol is actually doing. When you run a query, ClickHouse returns results in a binary block format that contains column results in a typed binary format. Here’s an example:

client.execute('SELECT * from iris limit 1000')

Unlike many databases ClickHouse results are column-oriented (like the storage). This means that compression works well on query results just as it does on stored values. Compression is invisible to users but can vastly reduce network traffic.

Where ClickHouse is differs from many other DBMS implementations is on upload. Let’s look at the INSERT statement again from the previous section.

client.execute(
    'INSERT INTO iris (sepal_length, sepal_width, petal_length, petal_width, species) VALUES',
    [(5.1, 3.7, 1.5, 0.4, 'Iris-setosa'), (4.6, 3.6, 1.0, 0.2, 'Iris-setosa')]
)

This format may be a little confusing if you are used to executing INSERT statements as a single string, which is typical for many DBMS types. What you are seeing is a side-effect of the native TCP/IP wire protocol, which ships typed values in both directions. The data values use a column-oriented format, just like the query output.

The TCP/IP protocol has another curious effect, which is that sending INSERTs as a single string won’t even work in clickhouse-driver. It just hangs and will eventually time out.

#Don’t do this.
client.execute(
    "INSERT INTO iris (sepal_length, sepal_width, petal_length, petal_width, species) “ 
             “VALUES (5.1, 3.7, 1.5, 0.4, 'Iris-setosa'), (4.6, 3.6, 1.0, 0.2, 'Iris-setosa')"
)

What’s going on? The server has the first part of the INSERT and is now waiting for data from the client to complete the INSERT in the native protocol. Meanwhile, the client is waiting for the server to respond. This behavior is clearly documented in the clickhouse-driver documentation so one could argue it’s not a bug: you are doing something the protocol does not expect. I don’t completely agree with that view, mostly because it’s confusing to newcomers. This seems like a nice pull request for somebody to work on in future.

But wait, you might ask. The C++ clickhouse-client binary will process an INSERT like the one shown above. How can that possibly work? Well, the trick is that clickhouse-client runs the same code as the ClickHouse server and can parse the query on the client side. It extracts and sends the INSERT statement up to the VALUES clause, waits for the server to send back data types, then converts and sends the data as column-oriented blocks.

Overall the wire protocol is quite reasonable once you understand what is going on. Problems like hanging INSERTs easy to avoid. If you have further questions I suggest firing up WireShark and watching the packets on an unencrypted, uncompressed connection. It’s relatively easy to figure out what’s happening.

Loading CSV

Armed with a better understanding of what the clickhouse-driver is doing under the covers we can tackle a final topic: how to load CSV.

As we now know you can’t just pipe raw CSV into the the driver the way that the clickhouse-client program does it. Fortunately, there’s an easy solution. You can parse CSV into a list of tuples as shown in the following example.

import csv

#Create a generator to fetch parsed rows.
def row_reader():
     with open('iris.csv') as iris_csv:
     for line in csv.reader(iris_csv):
         yield line

#Use a generator expression to load values as a list of tuples.
client.execute("INSERT INTO iris VALUES", (line for line in row_reader()))

This code works for the Iris dataset values used in this sample, which are relatively simple and automatically parse into types that load properly. For more diverse tables you may need to add additional logic to coerce types. Here’s another approach that works by assigning values in each line to a dictionary. It’s more complex but ensures types are correctly assigned. You can also rearrange the order of columns in the input and do other manipulations to clean up data.

import csv

#Create a generator to fetch parsed rows. CSV must have variable names in header row.
def row_reader():
    with open('iris_with_names.csv') as iris_csv:
    #Use DictReader to get values as a dictionary with column names.
    for line in csv.DictReader(iris_csv):
        yield {
            'sepal_length': float(line['sepal_length']),
            'sepal_width': float(line['sepal_width']),
            'petal_length': float(line['petal_length']),
            'petal_width': float(line['petal_width']),
            'species': str(line['species']),
        }

#Use a generator expression to load values as a list of dictionaries.
client.execute("INSERT INTO iris VALUES", (line for line in row_reader()))

As files run into the 100s of megabytes or more you may want to consider alternatives to Python to get better throughput. Parsing and converting data in Python is relatively slow compared to the C++ clickhouse-client. I would recommend load testing any Python solution for large scale data ingest to ensure you don’t hit bottlenecks.

Summary and Acknowledgments

The clickhouse-driver is relatively young but it is very capable. I am impressed by the thoughtful design, quality of the implementation, and excellent documentation. It looks like a solid base for future Python work with ClickHouse. We’ll review more Python client solutions in the future but for new users clickhouse-driver is a great place to start.

Thanks to Konstantin Lebedev for reviewing a draft of this article!

 

Subscribe

Receive news and updates about ClickHouse technology