Query Performance

Columnstore

The next significant performance improvement comes from storing table rows in a format optimized for analytics. Usually, databases store rows for fast lookups and data manipulation, which is perfect for CRUD workloads but not for analytics.

For simple analytical queries, such as calculating the sum of two columns, a database always loads chunks of rows at a time (called pages). Then, each column is extracted from the page one after another and added to the sum.

This approach is clearly inefficient for analytical queries, as many unnecessary rows and unnecessary columns of needed rows are loaded without being used. As a result, a significant amount of time is spent loading data from disk that is not needed for the query.

This can be optimized by using columnar storage for tables: The idea is to store batches of columns instead of rows, with each column loaded only when needed. With this new storage format, expressions (SUM(col1)), filters (WHERE col2 = 4) etc. are also significantly faster as they can act on multiple values with a single operation thanks to vectorization support in modern CPUs.

With columnar storage, you'll also need less storage space:

A value must only be stored once if it appears multiple times in a row.
Often-used strings can be shortened by an internal dictionary.
Numbers store only the difference to the last value in the smallest datatype needed to fit that value.
Boolean values can be saved with a single bit.
... and many more

By applying these optimizations, TimescaleDB often reduces a table's size by more than 90%. As a result, performance improves significantly, since less data is loaded from disk and more can be cached in memory.

Let's start by understanding the concept of the columnstore within TimescaleDB. Hypertable chunks are stored in a CRUD-optimized format for fast inserts and modifications. Once all data is inserted and no further modifications are expected, a hypertable chunk is converted into a columnstore chunk. With TimescaleDB's columnstore, you can still insert new rows or modify and delete existing ones - which is not the case with other analytical databases. However, every query will be slightly slower since the data is now optimized for analytical queries rather than modifications. Ironically, inserting into a columnstore chunk is also the fastest way to insert many rows. With the “Direct to Columnstore” option you can insert more than 5 millions rows each second - PostgreSQL can't do that.


        
            ALTER TABLE requests SET(

              timescaledb.enable_columnstore,

              timescaledb.segmentby = 'url_domain, service',

              timescaledb.orderby = 'time DESC, url_path'

            );

When activating the columnstore for a hypertable, the segmentby and orderby configurations should be tuned for your application:

The columns listed in the segmentby option are the most important configuration as it controls which rows are stored together in column batches. They should be selected based on your querying patterns to filter the hypertable rows and only keep the necessary ones. Typically, the tenancy column is used so that all column batches for other tenants than the current one can be skipped by a query. Each column batch can store up to 1,000 values. You should aim for them to be filled as much as possible to maximize storage savings and achieve optimal disk performance. So at best there should be more than 1,000 rows for every value of that column.

The orderby option controls the sorting order within the column batches. TimescaleDB then tracks the minimum and maximum values for each column batch to skip any that can't match the query's filters. As you always filter by time range, it is typically best to place the timestamp column first for optimal performance.

Activating the columnstore for a hypertable doesn't automatically convert its chunks to columnstore chunks. This misconception often leads to underwhelming performance during initial performance testing. You must transform a chunk with the convert_to_columnstore procedure.

This boring task can and should be automated by adding a Timescale policy: The policy below (add_columnstore_policy) transforms each chunk when its time range for rows is one hour in the past. You might wonder why we don't transform chunks earlier. Due to high load, queuing, and other issues, the INSERTs may be delayed some time which the one-hour window compensates for to ensure fast INSERTs. Although it's not strictly necessary because rows can also be inserted into columnstore chunks.

An important point is that a columnstore chunk no longer has indexes. This change matters because keeping indexes for billions of rows would waste storage space and undo many of the savings from all storage optimizations.

Nevertheless, TimescaleDB can still quickly find all rows for your analytics query: By including the segmentby columns in your conditions (which is recommended!), TimescaleDB can instantly ignore irrelevant column batches. Next, it excludes more column batches by checking if your query's conditions don't fit into the minimum and maximum ranges for the orderby columns. This two-step process significantly reduces the number of column batches loaded by an analytics query.

Even simple row lookups, like WHERE id = 'V1StGXR8' are fast. When an index is defined on a column, TimescaleDB builds small index-like structures for every column batch that are much smaller than real indexes:

For numeric values and timestamps: The minimum and maximum value for each column batch is used to remove batches when the query condition's value does not fit within that range.
For strings and UUIDs: A tiny bloom filter (probabilistic summary of all values) is created to remove batches that definitely can't have the value in the query's conditions and only check the batches that could have it.

However, you can also change these defaults by defining which index-like structures to create yourself:


        
            ALTER TABLE requests SET (

              timescaledb.compress_index = 'bloom(geo_country), minmax(duration_total)'

            )

Again, keep in mind that this configuration must be made before a hyperchunk is transformed into a columnstore chunk. Alternatively, the columnstore hyperchunk must be converted to a regular hyperchunk and then transformed back into a columnstore chunk again.

Limitations

Those index-like structures will speed up your queries and substitute the need for regular indexes, but they don't fully replace them:

They can only be used for equality checks, e.g. WHERE status = 'open'
They don't work when searching for many values, e.g. WHERE mode IN(2, 4, 5)