Query Performance

Reordering

Switching to hypertables with the optimized indexes of the previous module improved the performance but is still far away from the desired outcome. The bottleneck is no longer identifying the rows for the analysis but loading them. Any big analytics calculation struggles with the same issue.

So what is the problem? The rows of a table are ordered by the time they have been inserted and all rows for a specific domain must be loaded from all over the database file.

The performance problem originates from hardware limits: A disk (and cloud block storage) is much faster at loading many consecutive chunks of data (sequential read) than loading many small ones from different offsets (random read) within a file. The performance difference typically ranges from 10x to 100x.

Luckily, PostgreSQL implements functionality to benefit from this behavior. Rows are always stored and loaded together in a group (called "page") of 8KB to reduce the number of disk reads. How many rows can fit into one page? That depends on the number and size of your columns, but in most cases, it will be between 50 and 100 rows.

So why is loading the data slow if PostgreSQL already optimizes this? There is no data locality! Rows are still ordered by insertion time, so many different pages at different offsets within the database file must be loaded. The efficiency win is minimal.

The performance would improve drastically when pages would be filled with rows for the same domain. On top of that, a few fast sequential read operations could read all rows for a domain if those optimized pages would be sorted one after another.

Don't worry! You don't have to implement complex logic to insert rows in a specific order to achieve this. Chunk reordering is doing all of that for you. You can instruct Timescale to reorder the rows of a chunk so they're aligned on disk perfectly for your queries: First, you create an index that has your desired sorting order, e.g. by url_domain first and then time. After that, call the reorder_chunk function with a chunk (or all ones as in this example) and the created index.

CREATE INDEX requests_ordering_idx ON requests (url_domain, time);

SELECT reorder_chunk(c, 'requests_ordering_idx') FROM show_chunks('requests') AS c;

PostgreSQL now no longer must load pages from many random locations because all rows are sorted on disk by domain and time. Everything can be loaded sequentially, utilizing the fastest way for a disk to read data.

The former query will reorder all chunks once. But you don't have to do that manually. Like all stuff in Timescale, maintenance tasks can be executed by a policy.

SELECT add_reorder_policy('requests','requests_ordering_idx');

From now on, chunks are automatically reordered without you having to do anything - even new ones that will be created in weeks. However, they are only reordered once!

Any rows inserted will be (as before) appended to the end. You might want to re-run the reorder_chunk function after large inserts on the affected chunks for best performance. For this reason, the two newest chunks are always excluded from automatic reordering because you probably still add new rows to them.

The Columnstore is the way to go

It is essential to understand the problem of insertion-ordered rows and how reordering them will have a significant performance impact on your queries. However, TimescaleDB's Columnstore provides the same reordering behaviour with much more performance improvements. So you should, at best, use the columnstore instead of reordering.