New Altinity Stable ClickHouse 19.11.8 Release Is Out!

Sep 2, 2019

It has been quite a while since we announced the previous ‘Altinity Stable’ ClickHouse in December 2018. Since then there have been a lot of changes and new features in ClickHouse. The core team has merged almost 1000 pull requests, and 217 contributors completed about 6000 commits. Unfortunately, during those months of active development ClickHouse suffered from stability issues. We have been working hard together with the core ClickHouse team to nail them down. Today we are happy to introduce a new ‘Altinity Stable’ release 19.11.8!

We want to thank everybody who helped to create this release: the Yandex team, the awesome ClickHouse community, and of course all of you who use ClickHouse. You help make ClickHouse better by providing feedback, bug reports and feature requests.

In this article you will find major changes since 18.14.19 as well as known issues. Some new features we have already reviewed in our blog. Others are still waiting to be discussed in detail.

Major new features in 19.11.8* since 18.14.19

* – the release number is actually 19.11.8.46, where 46 is a build number that is different for every release.

  • Multi-table JOINs with standard SQL syntax (some limitations remain: only equi-JOINs, limited condition pushdown support); see the examples after this list

  • Secondary indices – allow skipping data blocks without reading data: min/max filter, n-grams or token-based bloom filter (see the DDL sketch after this list)

  • TTL expressions for columns and tables to remove old data automatically

  • Unique ASOF JOIN – join to the most recent value known

  • Per-column custom compression codecs (Delta, T64, DoubleDelta, Gorilla): fine-tune compression for particular columns

  • Roaring Bitmaps – use bitmaps for set operations (calculate cardinality, unions, intersections, etc.)

  • Machine-learning functions: evalMLMethod, simpleLinearRegression, stochasticLinearRegression, stochasticLogisticRegression

  • Predefined per-row filter expressions for tables, which enable simple row-level security rules

  • Adaptive index granularity (setting ‘index_granularity_bytes’) – useful for tables with relatively large rows (>1280 bytes)

  • Advanced text processing functions

    • Multiple substring search: multiSearch(Any|FirstPosition|FirstIndex|AllPositions)(CaseInsensitive)?(UTF8)?

    • Multiple regex search (hyperscan-powered): multiMatch(Any|AnyIndex)

    • Fuzzy regex match (hyperscan-powered): multiFuzzyMatch(Any|AnyIndex)

    • N-gram distance for fuzzy string comparison and search (similar to q-gram metrics in R language): ngramDistance(CaseInsensitive)?(UTF8)?, ngramSearch(CaseInsensitive)?(UTF8)?

  • HDFS read/write access

  • New JSON processing functions – high performance & compliant with JSON standard

  • IPv4 and IPv6 data types

  • New formats

    • Protobuf — now fully supported with input and output plus nested data structures

    • Parquet

    • RowBinaryWithNamesAndTypes

    • JSONEachRow and TSKV – now support default expressions for missing fields (Check ‘input_format_defaults_for_omitted_fields’)

    • TSVWithNames/CSVWithNames – column order can now be determined from file header (Check ‘input_format_with_names_use_header’ parameter).

  • SQL statements with bind parameters

  • Improved MySQL integration:

    • a new database engine to access all the tables in a remote MySQL server

    • support for the MySQL wire protocol, allowing you to connect to ClickHouse using MySQL clients (see the sketch after this list).
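
To illustrate the new JOIN capabilities, here is a minimal sketch. The orders, customers, products, trades and quotes tables and their columns are hypothetical and serve only to show the syntax.

-- Standard multi-table JOIN syntax
SELECT o.order_id, c.name, p.title
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
JOIN products AS p ON o.product_id = p.product_id
WHERE c.country = 'US';

-- ASOF JOIN: for every trade, pick the most recent quote with quotes.event_time <= trades.event_time
SELECT t.symbol, t.event_time, t.price, q.bid
FROM trades AS t
ASOF JOIN quotes AS q USING (symbol, event_time);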
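
The DDL-level features (custom codecs, TTL and secondary indices) can be combined in a single table definition. Below is a minimal sketch with a hypothetical example_metrics table; note that data skipping indices are still experimental in 19.11 and must be enabled with the ‘allow_experimental_data_skipping_indices’ setting.

SET allow_experimental_data_skipping_indices = 1;

CREATE TABLE example_metrics
(
    event_date Date,
    event_time DateTime CODEC(DoubleDelta, LZ4),
    device_id  UInt64,
    value      Float64 CODEC(Gorilla),
    message    String,
    -- token-based bloom filter index to skip blocks when filtering by words in 'message'
    INDEX message_tokens message TYPE tokenbf_v1(4096, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (device_id, event_time)
TTL event_date + INTERVAL 3 MONTH;   -- rows older than 3 months are removed automatically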
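
The new MySQL database engine maps a whole remote MySQL database into ClickHouse. A minimal sketch, with placeholder host, database name and credentials:

CREATE DATABASE mysql_sales ENGINE = MySQL('mysql-host:3306', 'sales', 'reader', 'password');

SHOW TABLES FROM mysql_sales;
SELECT count() FROM mysql_sales.orders;   -- 'orders' is a table in the remote MySQL database

MySQL wire protocol support is a separate server-side feature that is enabled in the server configuration.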

We recommend upgrading to this release, but be aware of the following known issues and backward incompatible changes.

Known issues

  • All released 19.x versions have had some problems with the Kafka engine implementation due to a full re-write of Kafka support. In 19.11.8 Kafka works much better than in previous 19.x releases. However, there are still some corner cases that can lead to data duplication in certain scenarios, for example, in the event of a ClickHouse server restart. Those issues will be addressed soon.

  • Adaptive granularity is enabled by default, and index_granularity_bytes is set to 10 MB. This feature uses a different data format, and interoperability between the old and new formats has some issues. So if you are upgrading your cluster from an older version, consider disabling it before the upgrade by putting the following fragment in your config.xml:

<merge_tree>
    <index_granularity_bytes>0</index_granularity_bytes>
</merge_tree>

After the upgrade, you can turn adaptive granularity on at any convenient time. It is a very useful feature, especially when your tables have rows larger than about 1 KB. If you are doing a new installation, please leave the default setting value as is.

  • ‘enable_optimize_predicate_expression’ is now enabled by default. In some cases a condition pushed down into a subselect can lead to suboptimal or undesired behavior. If this happens, please report the issue and disable the feature for that SELECT or globally (see the example after this list).

  • Secondary indices are maturing but are still considered experimental. There is at least one severe bug: a mutation with a condition on a column that has a secondary index can affect more rows than expected. Please be careful with DELETE mutations, or upgrade to 19.13 (see issue #6224).

  • Some users have reported problems with ODBC data sources after upgrade. In most cases these were misconfigurations. Nevertheless, please do canary/staging updates and check how your ODBC connections work before moving to production.
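
As mentioned above, if predicate pushdown produces an undesired plan for a particular query, it can be switched off per session or per query. A minimal sketch (the events table and its columns are hypothetical):

-- disable for the current session
SET enable_optimize_predicate_expression = 0;

-- or disable for a single query only
SELECT count()
FROM (SELECT user_id FROM events WHERE event_date = today()) AS sub
SETTINGS enable_optimize_predicate_expression = 0;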

Backward compatibility issues

  • Due to an update of the LZ4 library, the new ClickHouse version writes parts that are not binary-identical to those written by older versions. That makes it problematic to update only one replica. Leaving the cluster in such a state for a long period of time will work, but may lead to excessive copying of parts between nodes due to checksum mismatches.

  • There is a new setting ‘max_partitions_per_insert_block’ with a default value of 100. If an inserted block contains a larger number of partitions, an exception is thrown. Set it to 0 if you want to remove the limit (not recommended).

  • If you are using unusual low cardinality combinations like LowCardinality(UInt8), the new version will prevent you from doing so. If you really know what you are doing, set ‘allow_suspicious_low_cardinality_types’ to 1 (see the example after this list).

  • This release adds a ‘max_parts_in_total’ setting for the MergeTree family of tables (default: 100,000). We hope your number of active parts is much lower than this limit. If necessary, you can raise the value.

  • The ‘system.dictionaries’ table has been changed. If you used it for monitoring purposes, you may need to change your scripts.

  • Dictionaries are loaded lazily by default. It means they have status ‘NOT_LOADED’ until the first access.
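
For reference, the safety limits mentioned above can be relaxed when you really need to. A minimal sketch with illustrative values (loosening these limits is generally not recommended):

-- allow INSERT blocks that span more than 100 partitions (0 = unlimited)
SET max_partitions_per_insert_block = 0;

-- allow LowCardinality over small fixed-size types such as UInt8
SET allow_suspicious_low_cardinality_types = 1;

These SET statements affect only the current session; the same settings can also be placed in a user profile.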

Besides that, we can confirm that 19.11.8 is well tested and stable, and we didn’t see any significant issues while operating it in our environment and several user installations. We recommend upgrading to this version if you are using an older one.

Please contact us at info@altinity.com if you experience any issues with the upgrade.  Meanwhile, enjoy the new stable release!

Appendix

As a part of these release notes, we have collected everything new that appeared in ClickHouse since 18.14.19, so if you are upgrading, take a look. You can also check the official changelog: https://github.com/yandex/ClickHouse/blob/master/CHANGELOG.md

New functions

  • New hash functions: xxHash32, xxHash64, gccMurmurHash, hiveHash, javaHash, CRC32

  • JSON processing functions: JSONExtract, JSONExtract(Bool|Float|Int|KeysAndValues|Raw|String|UInt), JSONHas, JSONKey, JSONLength, JSONType (see the example after this list)

  • Geospatial processing: geoToH3, geohashEncode, geohashDecode

  • IP address handling: IPv4CIDRtoIPv4Range, IPv6CIDRtoIPv6Range, toIPv4, toIPv6

  • New statistical aggregate functions: skewPop, skewSamp, kurtPop, kurtSamp

  • String functions: isValidUTF8, regexpQuoteMeta, trimBoth, trimLeft, trimRight, format, toValidUTF8

  • Encoding: tryBase64Decode, base64Decode, base64Encode

  • Array processing: arrayEnumerateDenseRanked, arrayEnumerateUniqRanked, flatten, arrayFlatten, arrayWithConstant

  • Date/Time processing: toStartOfInterval, addQuarters, subtractQuarters, toIntervalQuarter, toStartOfTenMinutes

  • Numerical: roundDown, toDecimal*OrZero, toDecimal*OrNull

  • Dictionary: dictGet, dictGetOrDefault

  • Roaring Bitmaps: bitmapAndCardinality, bitmapAndnot, bitmapAndnotCardinality, bitmapBuild, bitmapCardinality, bitmapContains, bitmapHasAll, bitmapHasAny, bitmapOr, bitmapOrCardinality, bitmapToArray, bitmapXor, bitmapXorCardinality, groupBitmap

  • OS level introspection: filesystemAvailable, filesystemCapacity, filesystemFree, basename

  • New aggregate functions: boundingRatio, entropy, groupBitmap, sumMap(Filtered|FilteredWithOverflow|WithOverflow), topKWeighted, groupArrayMovingAvg, groupArrayMovingSum, timeSeriesGroupRateSum, timeSeriesGroupSum

  • SQL standard compatibility aliases added: left, right, trim, ltrim, rtrim, timestampadd, timestampsub, power, replace, ln, locate, mid
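
To give a flavor of the new JSON and string-search functions, here is a minimal sketch; the logs table and its payload and message String columns are hypothetical:

SELECT
    JSONExtractString(payload, 'user', 'name')                AS user_name,
    JSONExtractUInt(payload, 'retries')                       AS retries,
    multiSearchAny(message, ['timeout', 'refused', 'reset'])  AS has_network_error,
    ngramDistance(message, 'connection timed out')            AS distance  -- 0 = identical, 1 = completely different
FROM logs;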

Other notable changes:

  • Constraints on settings that limit the possible range of a setting’s value per user profile.

  • KILL MUTATION added

  • New aggregate function combinators: -Resample

  • Added new data type SimpleAggregateFunction – light aggregation for simple functions like any, anyLast, sum, min, max

  • Ability to use a different sorting key (ORDER BY) and index (PRIMARY KEY). The sorting key can be longer than the index, and ORDER BY can be altered when adding or removing a column (see the sketch after this list)

  • HTTP interface: brotli compression support, X-ClickHouse-Query-Id and X-ClickHouse-Summary headers in response, ability to cancel query on disconnect (check ‘cancel_http_readonly_queries_on_client_close’)

  • DFA-based implementation for functions sequenceMatch and sequenceCount in case the pattern doesn’t contain time

  • Back up all partitions at once with ALTER TABLE … FREEZE

  • Comments for a column in the table description

  • Join engine: new options ‘join_use_nulls’, ‘max_rows_in_join’, ‘max_bytes_in_join’, ‘join_any_take_last_row’ and ‘join_overflow_mode’ + joinGet function that allows you to use a Join type table like a dictionary.

  • /docker-entrypoint-initdb.d for database initialization in docker

  • Graphite rollup rules reworked.

  • Query settings in asynchronous INSERTs into Distributed tables are respected now

  • Hints when a user makes a typo in a function name or data type in the command-line client

  • system.detached_parts table containing information about detached parts of MergeTree tables

  • Table function remoteSecure

  • Ability to write zookeeper part data in more compact form (‘use_minimalistic_part_header_in_zookeeper’)

  • Ability to close MySQL connections after their usage in external dictionaries

  • Support RENAME operation for Materialized View

  • Non-blocking loading of external dictionaries
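
A minimal sketch combining two of the items above: a PRIMARY KEY that is a prefix of ORDER BY, and SimpleAggregateFunction columns in an AggregatingMergeTree table (the daily_totals table and its columns are hypothetical):

CREATE TABLE daily_totals
(
    site_id    UInt32,
    event_date Date,
    hits       SimpleAggregateFunction(sum, UInt64),
    last_seen  SimpleAggregateFunction(max, DateTime)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (site_id, event_date)
PRIMARY KEY site_id;

During background merges, rows with the same (site_id, event_date) are collapsed: hits are summed and last_seen keeps the maximum, without the heavier AggregateFunction states.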

Kafka now supports SASL SCRAM authentication, new virtual columns _topic, _offset and _key are available, and there are a lot of other improvements.

Settings changed/added:

  • allow_experimental_cross_to_join_conversion (default: 1) – Convert CROSS JOIN to INNER JOIN if possible.
  • allow_experimental_data_skipping_indices (default: 0) – If set to true, data skipping indices can be used in CREATE TABLE/ALTER TABLE queries.
  • allow_experimental_low_cardinality_type (default: 1, was 0) – Obsolete setting, does nothing. Will be removed after 2019-08-13.
  • allow_experimental_multiple_joins_emulation (default: 1) – Emulate multiple joins using subselects.
  • allow_hyperscan (default: 1) – Allow functions that use the Hyperscan library. Disable to avoid potentially long compilation times and excessive resource usage.
  • allow_simdjson (default: 1) – Allow using the simdjson library in ‘JSON*’ functions if AVX2 instructions are available. If disabled, rapidjson will be used.
  • allow_suspicious_low_cardinality_types (default: 0) – In CREATE TABLE statements, allows specifying the LowCardinality modifier for types of small fixed size (8 bytes or less). Enabling this may increase merge times and memory consumption.
  • cancel_http_readonly_queries_on_client_close (default: 0) – Cancel HTTP readonly queries when a client closes the connection without waiting for the response.
  • check_query_single_value_result (default: 1) – Return CHECK query result as a single 1/0 value.
  • enable_conditional_computation (was 0) – DELETED.
  • enable_optimize_predicate_expression (default: 1, was 0) – If set to true, optimize predicates to subqueries.
  • enable_unaligned_array_join (default: 0) – Allow ARRAY JOIN with multiple arrays that have different sizes. When this setting is enabled, arrays will be resized to the longest one.
  • external_table_functions_use_nulls (default: 1) – If set to true, external table functions will implicitly use the Nullable type if needed. Otherwise NULLs will be substituted with default values. Currently supported only for the ‘mysql’ table function.
  • idle_connection_timeout (default: 3600) – Close idle TCP connections after the specified number of seconds.
  • input_format_defaults_for_omitted_fields (default: 0) – For input data, calculate default expressions for omitted fields (works for the JSONEachRow format).
  • input_format_with_names_use_header (default: 0) – For TSVWithNames and CSVWithNames input formats, controls whether the format parser assumes that column data appear in the input exactly as specified in the header.
  • join_any_take_last_row (default: 0) – When disabled (default), ANY JOIN will take the first found row for a key. When enabled, it will take the last row seen if there are multiple rows for the same key. Allows you to overwrite old values in a table with Engine=Join.
  • join_default_strictness (default: ALL) – Set default strictness in JOIN query. Possible values: empty string, ‘ANY’, ‘ALL’. If empty, a query without strictness will throw an exception.
  • low_cardinality_allow_in_native_format (default: 1) – Use the LowCardinality type in the Native format. Otherwise, convert LowCardinality columns to ordinary for SELECT queries, and convert ordinary columns to the required LowCardinality for INSERT queries.
  • max_alter_threads (default: auto) – The maximum number of threads to execute ALTER requests. By default, it is determined automatically.
  • max_execution_speed (default: 0) – Maximum number of execution rows per second.
  • max_execution_speed_bytes (default: 0) – Maximum number of execution bytes per second.
  • max_partitions_per_insert_block (default: 100) – Limit the maximum number of partitions in a single INSERTed block. Zero means unlimited. Throws an exception if the block contains too many partitions. This setting is a safety threshold, because using a large number of partitions is a common misconception.
  • max_streams_multiplier_for_merge_tables (default: 5) – Request more streams when reading from a Merge table. Streams will be spread across the tables that the Merge table uses. This allows a more even distribution of work across threads and is especially helpful when the merged tables differ in size.
  • max_threads (default: auto) – The maximum number of threads to execute the request. By default, it is determined automatically.
  • merge_tree_max_bytes_to_use_cache (default: 2013265920) – The maximum number of bytes per request for which the cache of uncompressed data is used. If the request is larger, the cache is not used (so that large queries do not flush out the cache).
  • merge_tree_min_bytes_for_concurrent_read (default: 251658240) – If at least this many bytes are read from one file, the reading can be parallelized.
  • merge_tree_min_bytes_for_seek (default: 0) – You can skip reading more than this number of bytes at the price of one seek per file.
  • min_count_to_compile_expression (default: 3) – The number of identical expressions before they are JIT-compiled.
  • min_execution_speed_bytes (default: 0) – Minimum number of execution bytes per second.
  • network_compression_method (default: LZ4) – Allows you to select the method of data compression when writing.
  • optimize_skip_unused_shards (default: 0) – Assumes that data is distributed by sharding_key. Optimization to skip unused shards if the SELECT query filters by sharding_key.
  • os_thread_priority (default: 0) – If non-zero, set the corresponding ‘nice’ value for query processing threads. Can be used to adjust query priority for the OS scheduler.
  • output_format_parquet_row_group_size (default: 1000000) – Row group size in rows.
  • stream_poll_timeout_ms (default: 500) – Timeout for polling data from streaming storages.
  • tcp_keep_alive_timeout (default: 0) – The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes.

MergeTree settings:

  • enable_mixed_granularity_parts (default: 0) – Enable parts with adaptive and non-adaptive granularity.
  • index_granularity_bytes (default: 10485760) – Approximate amount of bytes in a single granule (0 – disabled).
  • write_final_mark (default: 1) – Write a final mark after the end of a column (0 – disabled; does nothing if index_granularity_bytes=0).
  • max_parts_in_total (default: 100000) – If there are more than this number of active parts in all partitions in total, throw a ‘Too many parts …’ exception.
  • merge_with_ttl_timeout (default: 86400) – Minimal time in seconds before a merge with TTL can be repeated.
  • min_merge_bytes_to_use_direct_io (default: 10737418240, was 0) – Minimal amount of bytes to enable O_DIRECT in merges (0 – disabled).
  • replicated_max_parallel_fetches_for_host (default: 15) – Limit parallel fetches from an endpoint (actually the pool size).
  • use_minimalistic_part_header_in_zookeeper (default: 0) – Store the part header (checksums and columns) in a compact format and in a single part znode instead of separate znodes (/columns and /checksums). This can dramatically reduce snapshot size in ZooKeeper. Before enabling, check that all replicas support the new format.