Altinity Blog

ClickTail Introduction

Mar 12, 2018

What Is ClickTail

Altinity is happy to introduce ClickTail, an easy-to-use tool for ingesting log file data into ClickHouse and making it available for exploration. Its favorite format is JSON, but it understands how to parse a range of other well-known log formats. ClickTail was created as a fork of the HoneyTail project developed by Honeycomb.io.

Project information can be found on GitHub: https://github.com/Altinity/clicktail

What Is It Made For

ClickTail can parse different types of log files, including the most popular ones such as Nginx access logs and MySQL slow and audit logs. Once a file is parsed, ClickTail loads the data into the ClickHouse DBMS for further analysis.

There are a number of similar tools available already that can parse logs and load them into various data stores, for example, Logstash, Honeycomb, Apache Logs Parser, etc. So why another one? Not only does ClickTail use ClickHouse as the data store, which makes log data analysis blazingly fast, it also normalizes text data from the log files, such as SQL queries or URLs in the respective log formats. Normalizing means replacing the actual queries/URLs with a pattern that can later be used in a filter or GROUP BY clause.
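The idea behind normalization can be sketched in a few lines of Go. This is a deliberately simplified illustration, not ClickTail's actual algorithm: it replaces string and numeric literals with `?` so that structurally identical queries collapse into one pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// normalizeQuery replaces string and numeric literals with '?' so that
// structurally identical queries map to the same pattern.
// Simplified sketch -- ClickTail's real normalizer handles many more cases.
func normalizeQuery(q string) string {
	q = regexp.MustCompile(`'[^']*'`).ReplaceAllString(q, "?") // string literals
	q = regexp.MustCompile(`\b\d+\b`).ReplaceAllString(q, "?") // numeric literals
	return q
}

func main() {
	// Two different queries reduce to the same normalized pattern.
	fmt.Println(normalizeQuery("select * from driver where id = 42"))
	fmt.Println(normalizeQuery("select * from driver where id = 7"))
}
```

Both calls print the same pattern, which is exactly what makes the normalized column useful as a GROUP BY key.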

ClickTail is written in Go, so it is natively fast, and it processes data in parallel in order to utilize the CPU as efficiently as possible.
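The parallel pipeline shape can be illustrated with a classic Go worker pool: one goroutine feeds raw lines into a channel, several workers parse them concurrently, and parsed events are collected on a single output channel. This is a minimal sketch of the pattern, not ClickTail's actual code; the "parsing" here is a trivial stand-in.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parseAll fans raw log lines out to nWorkers goroutines and collects
// the parsed events on one channel -- a simplified sketch of the
// pipeline shape that keeps all CPU cores busy.
func parseAll(lines []string, nWorkers int) []string {
	in := make(chan string)
	out := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range in {
				out <- strings.ToUpper(line) // stand-in for real parsing
			}
		}()
	}

	// Feeder: push all lines, then shut the pipeline down cleanly.
	go func() {
		for _, l := range lines {
			in <- l
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	var events []string
	for e := range out {
		events = append(events, e)
	}
	return events
}

func main() {
	events := parseAll([]string{"get /", "post /login"}, 4)
	fmt.Println(len(events)) // 2
}
```

Note that with concurrent workers the output order is not guaranteed, which is fine for log ingestion since ClickHouse sorts data by the table's primary key anyway.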

How to Use It

First of all, ClickTail can be installed from rpm/deb packages as well as built from the sources published on GitHub. The Altinity package repository is here: https://packagecloud.io/Altinity/clickhouse and here is the source code repo: https://github.com/Altinity/clicktail

# curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.deb.sh | bash
# apt-get install clicktail

Once ClickTail is installed, you can use it as a CLI tool right away or configure it as a service running in the background.

ClickTail can read from STDIN as well as from a file on disk. The complete list of parsers can be found in the parsers/ directory. At the time of this writing, ClickTail supports logs generated by:

  • ArangoDB
  • MongoDB
  • MySQL (Slow and Audit Logs)
  • PostgreSQL
  • Nginx (Access Logs)
  • regex (arbitrary formats described by regular expressions)

Let’s look at a couple of examples of how to configure ClickTail for specific log types.

Nginx Access Logs

In this example, we will use a 1GB access log file (approx. 3.4M rows) from a typical Nginx server.

We start by preparing the database schema, using the table definitions from our repository: https://github.com/Altinity/clicktail/blob/master/schema/

We assume the ClickHouse server and client are installed on the same server where we run ClickTail. First, we create a separate clicktail database:

$ cat schema/db.sql | clickhouse-client --multiline

Then we create the clicktail.nginx_log table:

$ cat schema/nginx.sql | clickhouse-client --multiline

Once the schema in ClickHouse is prepared, let’s run ClickTail, telling it to parse the existing log file using the --backfill option:

$ clicktail -p nginx -f ./access.log -d clicktail.nginx_log --nginx.conf=/etc/nginx/nginx.conf --nginx.format=combined --backfill

As you can see, there are two options not yet explained: nginx.conf and nginx.format. These point ClickTail at the Nginx configuration file and the named log format so it can identify how to parse log entries.
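For reference, combined is Nginx's standard predefined log format: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent". A minimal Go sketch of extracting fields from such a line might look like this (our own simplified regex, not ClickTail's parser):

```go
package main

import (
	"fmt"
	"regexp"
)

// combinedRe matches nginx's standard "combined" log format:
// $remote_addr - $remote_user [$time_local] "$request" $status
// $body_bytes_sent "$http_referer" "$http_user_agent"
var combinedRe = regexp.MustCompile(
	`^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"`)

// parseCombined returns the named fields of one access-log line,
// or nil if the line does not match the combined format.
func parseCombined(line string) map[string]string {
	m := combinedRe.FindStringSubmatch(line)
	if m == nil {
		return nil
	}
	return map[string]string{
		"remote_addr": m[1],
		"remote_user": m[2],
		"time_local":  m[3],
		"request":     m[4],
		"status":      m[5],
		"bytes_sent":  m[6],
		"referer":     m[7],
		"user_agent":  m[8],
	}
}

func main() {
	line := `127.0.0.1 - - [12/Mar/2018:10:00:00 +0000] "GET /services/ HTTP/1.1" 200 612 "-" "curl/7.58.0"`
	fields := parseCombined(line)
	fmt.Println(fields["status"], fields["request"])
}
```

By reading the log_format directive from nginx.conf, ClickTail can build the equivalent of this pattern for custom formats too, not just the predefined ones.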

On our test server it took about 5 minutes to load the 1GB log file into ClickHouse.

After ClickTail is done parsing the log file, we can start executing queries using clickhouse-client. Here are a few example queries.

The first one counts the requests with successful responses (2xx and 3xx HTTP response codes) versus the requests with errors (4xx and 5xx codes):

SELECT
    if(status >= 400, '40x', 'ok'),
    count()
FROM nginx_log
GROUP BY if(status >= 400, '40x', 'ok')
FORMAT TSVRaw

Results:

ok      3399777
40x     39649

2 rows in set. Elapsed: 0.027 sec. Processed 3.44 million rows, 13.76 MB (125.19 million rows/s., 500.74 MB/s.)

So here we get a result in less than a tenth of a second while processing about 3.5M records.

Here is another example query that returns the top 10 most requested pages, excluding some static content:

SELECT
    request_shape,
    count()
FROM nginx_log
WHERE (request_shape NOT LIKE '%.js%')
    AND (request_shape NOT LIKE '%.css%')
    AND (request_shape NOT LIKE '%.png%')
    AND (request_shape NOT LIKE '%.gif%')
    AND (request_shape NOT LIKE '%.jpg')
GROUP BY request_shape
ORDER BY count() DESC
LIMIT 10

┌─request_shape─────────────────────────────────────────────────────────────────────────┬─count()─┐
│ /services/                                                                            │  566945 │
│ /robots.txt                                                                           │  406599 │
│ /                                                                                     │  181174 │
│ /wp-login.php                                                                         │  171027 │
│ /blog/using-sphinx-search-engine-with-chinese-japanese-and-korean-language-documents/ │   49972 │
│ /feed/                                                                                │   46878 │
│ /joombah/index.php?format=?&option=?                                                  │   31458 │
│ /contact-us/                                                                          │   14029 │
│ /check_ivinco_contact_form.php                                                        │   11001 │
│ /software/wordpress-sphinx-search/                                                    │   10383 │
└───────────────────────────────────────────────────────────────────────────────────────┴─────────┘

10 rows in set. Elapsed: 0.097 sec. Processed 3.44 million rows, 160.58 MB (35.38 million rows/s., 1.65 GB/s.)

Here we used the request_shape column instead of the original request attribute in order to group requests by a normalized URI pattern such as /joombah/index.php?format=?&option=?, where question marks stand in the places where the actual parameter values were in the original requests.
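What this shaping does can be sketched in Go (a simplified illustration of the idea behind the request_shape column, not ClickTail's exact algorithm): every query-string value is replaced with a question mark, so requests differing only in parameter values collapse to one shape.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// shapeURL replaces every query-string value with '?' so that requests
// differing only in parameter values collapse into one shape.
// Simplified sketch: parameter names are sorted for a stable result.
func shapeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.RawQuery == "" {
		return raw
	}
	keys := make([]string, 0)
	for k := range u.Query() {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, len(keys))
	for i, k := range keys {
		parts[i] = k + "=?"
	}
	return u.Path + "?" + strings.Join(parts, "&")
}

func main() {
	// Two requests with different parameter values yield the same shape.
	fmt.Println(shapeURL("/joombah/index.php?format=json&option=com_user"))
	fmt.Println(shapeURL("/joombah/index.php?format=xml&option=com_admin"))
}
```

Both lines print /joombah/index.php?format=?&option=?, matching the shape seen in the query results above.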

MySQL Slow Logs

ClickTail parses all the attributes from MySQL slow, Nginx access, and MySQL audit log files, including some extra attributes that only Percona Server provides. That means whatever appears in the log file can be used as a filter, group, or order-by field in ClickHouse.

Here we will see how to start parsing MySQL slow log files. First of all, we need to create a schema in ClickHouse where parsed log entries will be stored:

$ cat schema/mysql.sql | clickhouse-client --multiline

Then, if we need to ingest the entries already written to the log file, we run ClickTail with the --backfill option:

$ clicktail -p mysql -f /var/lib/mysql/centos-slow.log -d clicktail.mysql_slow_log --backfill

After that, we can start ClickTail to watch for new entries simply by removing --backfill:

$ clicktail -p mysql -f /var/lib/mysql/centos-slow.log -d clicktail.mysql_slow_log

Once the log file is parsed into the ClickHouse table, we can start querying the data. Here is a simple query that returns the number of log entries and the average and maximum query time, grouped by date:

SELECT
    toDate(_date) AS t,
    count(),
    max(query_time),
    avg(query_time)
FROM mysql_slow_log
GROUP BY t
ORDER BY t ASC

┌──────────t─┬─count()─┬─max(query_time)─┬───────avg(query_time)─┐
│ 2018-02-21 │   31920 │       4.3042603 │  0.008569910840723489 │
│ 2018-02-22 │   70135 │        8.375134 │  0.013615735582736797 │
│ 2018-03-12 │    2446 │     0.058650002 │ 0.0003342101415163659 │
└────────────┴─────────┴─────────────────┴───────────────────────┘

3 rows in set. Elapsed: 0.005 sec. Processed 104.50 thousand rows, 627.01 KB (19.55 million rows/s., 117.29 MB/s.)

Here is one more query that shows the share of the most frequent normalized queries across the whole set:

SELECT
    normalized_query,
    count() AS c,
    avg(query_time)
FROM mysql_slow_log
GROUP BY normalized_query
ORDER BY c DESC
LIMIT 3

┌─normalized_query──────────────────────────────────────────────┬─────c─┬────────avg(query_time)─┐
│ select `competitionteamdriver`.* from `competitionteamdriver` │ 56805 │ 0.00009646791773688537 │
│ select `driver`.* from `driver` where lower(`link`) = ?       │ 14055 │   0.049214935139197974 │
│ select `top25class`.* from `top25class`                       │  8062 │ 0.00009259848763351935 │
└───────────────────────────────────────────────────────────────┴───────┴────────────────────────┘

3 rows in set. Elapsed: 0.011 sec. Processed 104.50 thousand rows, 9.22 MB (9.82 million rows/s., 866.00 MB/s.)

For more examples, you may refer to a recent article by Peter Zaitsev from Percona that highlights MySQL slow log analysis with ClickTail and ClickHouse: https://www.percona.com/blog/2018/02/28/analyze-raw-mysql-query-logs-clickhouse/

Create a Custom Log Format Parser

ClickTail is open source and easily extensible. It is possible to create a custom log format parser and start using all of ClickTail’s benefits and features with your own type of logs.

To achieve that, first you need to check out the project sources: https://github.com/Altinity/clicktail

The next step is to create a parser type and its package. To do so, create a subfolder under the ‘parsers’ folder and a package file there. The name of the file should match the package name and the subfolder, e.g.:

$ cd parsers
$ mkdir customlog
$ echo "package customlog" > customlog/customlog.go

See the other parsers in the ‘parsers’ folder for examples and guidance on the interfaces and structures a custom log parser needs.
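As a rough illustration, a customlog package might follow the shape below. The Event and Options types here are simplified stand-ins we invented so the sketch is self-contained and runnable; consult the parsers package in the repository for the actual interface a parser must implement. The "key=value" line format is likewise hypothetical.

```go
// A minimal sketch of what customlog/customlog.go could look like.
// The real parser implements ClickTail's parser interface from the
// parsers package; Event and Options below are simplified stand-ins.
package main

import (
	"fmt"
	"strings"
	"time"
)

// Event is a stand-in for ClickTail's event type: a timestamp plus
// a map of parsed fields that end up as ClickHouse columns.
type Event struct {
	Timestamp time.Time
	Data      map[string]interface{}
}

// Options would hold the parser's command-line flags
// (hooked up in main.go in the real project).
type Options struct{}

// Parser mirrors the general shape of a ClickTail parser.
type Parser struct {
	conf Options
}

func (p *Parser) Init(options interface{}) error {
	p.conf = options.(Options)
	return nil
}

// ProcessLines reads raw lines and emits one Event per line.
// Each line is assumed to be "key=value key=value ..." -- a
// hypothetical custom format used purely for illustration.
func (p *Parser) ProcessLines(lines <-chan string, send chan<- Event) {
	for line := range lines {
		data := map[string]interface{}{}
		for _, kv := range strings.Fields(line) {
			if i := strings.IndexByte(kv, '='); i > 0 {
				data[kv[:i]] = kv[i+1:]
			}
		}
		send <- Event{Timestamp: time.Now(), Data: data}
	}
	close(send)
}

func main() {
	lines := make(chan string, 1)
	send := make(chan Event, 1)
	lines <- "user=bob action=login"
	close(lines)

	p := &Parser{}
	if err := p.Init(Options{}); err != nil {
		panic(err)
	}
	p.ProcessLines(lines, send)

	for e := range send {
		fmt.Println(e.Data["user"], e.Data["action"])
	}
}
```

The channel-based ProcessLines signature is what lets the framework run parsers concurrently, as described earlier.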

Once the package for the custom parser is ready, the new parser needs to be enabled in order to be used at runtime. This is done in the following files (in their respective sections):

  • leash.go: import section and getParserAndOptions function;
  • main.go: import section, validParsers array and GlobalOptions struct.

The last step is to create a table schema in ClickHouse describing the attributes of the custom log format.