Using TimescaleDB

Indexes & Chunk Skipping

How do you create indexes to speed up analytics queries with chunked hypertables? The same way you're used to: The good news is that it's still PostgreSQL, so you can leverage your existing knowledge. But there are some important differences for optimal performance and special behavior for primary and unique keys.

When you index a column of a hypertable, an index is added for every chunk instead of just one for all the chunks together. This is done to avoid the slowdown that would occur when deleting an old chunk: With one index, all the index entries for the dropped chunk must be removed from a gigantic index - which can take many seconds to minutes and impact the database server's performance. But with an index for every chunk, the index file can just be deleted without any maintenance work as it is no longer needed. Furthermore, it's more efficient to update a small index when inserting new rows than a large one.

Here's how searching for a specific row works: TimescaleDB checks the index for each chunk to see if there is a match, and then loads the referenced row.

This approach is inefficient: The more chunks you have, the more indexes must be checked. The ideal solution would be to use a single index like regular tables. You can add the ID's timestamp to the conditions to use TimescaleDB's chunk exclusion strategy, which you've learned about before, so only the index for that timestamp will be used. Always do that! The performance will be significantly impacted otherwise.

However, this optimization requires a timestamp in the query. Some ORMs, frameworks, or libraries assume you can look up rows by their id alone without needing other columns. And some applications aren't designed to pass two values around. So what should you do?

There is a feature to help with this called chunk skipping: With enable_chunk_skipping, you activate it for a column and TimescaleDB automatically tracks the minimum and maximum values for that column in every chunk. With this value range, some indexes can again be excluded because they can't have a match for the requested ID. It is similar to how timestamps are used to determine which chunks are relevant for a query.

Note that this feature is limited to columnstore chunks which will be covered in more detail later.

SET timescaledb.enable_chunk_skipping = on;
SELECT enable_chunk_skipping('requests', 'id');

ALTER TABLE requests SET (timescaledb.enable_columnstore);
SELECT compress_chunk(c) FROM show_chunks('requests') c;

The index usage is ideal as any chunk without this value can be skipped. However, efficiency will degrade rapidly if you regularly insert new rows into old chunks: In this example, the next row will have the ID 89.143. If inserted into the January chunk, the min/max range will expand so that again more that one index must be used.

In summary, chunk skipping is a great approach if adding a timestamp to the query's condition is too difficult to implement. However, it is not the best choice if you often add new rows to old chunks.

So standard indexes on hypertables are similar to those on regular tables but need tuning for best results. Unique indexes and primary keys, however, behave differently.

You can't create a unique index (or primary key) on a hypertable like regular tables: Indexes are created on chunks, not the hypertable itself, so it's impossible to guarantee the global uniqueness of a value. As a consequence, TimescaleDB blocks the creation of e.g. a primary key on the id column. Instead, the correct approach is to be specific that you know that uniqueness can only be guaranteed within a chunk by creating the index on your column and the partition column - an index on time and id in this case. To be more precise, the id is now only unique for every timestamp which is sufficient to find exactly this row when you include the timestamp in your query.

But do you really need a primary/unique key for your hypertable? Most analytical workloads only store rows (and run calculations) without accessing them individually. TimescaleDB doesn't require a primary/unique key, so you can omit them when not needed for your use case.

However, there's a clever way to get unique values across all chunks if you need it: You can use UUIDv7 values instead of incrementing integers! The new thing about v7 UUIDs is their combination of a timestamp and randomness, which eliminates all the database performance issues of purely random UUIDv4 values.

The idea is to insert rows where the timestamp and time part of the UUIDv7 value are identical. Usually, the same value can appear multiple times (even in a single chunk) because only the combination of the timestamp and value must be unique. But since the timestamp is also embedded in the UUIDv7 value, storing the same UUIDv7 for different timestamps is impossible. As a result, each UUIDv7 is unique within a chunk and globally as well.

You can use a check constraint in the database to prevent a critical error: Uniqueness is only guaranteed when the timestamp column and the timestamp within the UUIDv7 are identical. If they differ slightly, the same value can be stored repeatedly - which the check constraint can forbid. Another issue is the precision of the timestamps. A timestamp column can store times with microsecond precision while UUIDv7 is (with most implementations) limited to millisecond precision. So the timestamps should be truncated to millisecond precision. Luckily, the check constraint will also prevent this mistake.

CREATE TABLE requests (
  id uuid NOT NULL CHECK (uuid_timestamp(id) = time),
  time timestamptz NOT NULL,
  PRIMARY KEY (time, id)
  -- other columns...
);

The timestamp being a part of the UUIDv7 value is also helpful for your application: You must only pass the UUIDv7 around. The timestamp can be extract from the UUIDv7 within your query or application.

SELECT * FROM requests WHERE uuid = '0195f612-7b84-7f06-aee0-2c459ccd98a8'
   AND time = uuid_timestamp('0195f612-7b84-7f06-aee0-2c459ccd98a8')

UUIDv7 values are the best solution for your table design if you need guaranteed uniqueness and don't care that they take twice the storage space compared to incrementing integers. However, UUIDv7 values are not compressed at all - in contrast to integer values. But more about compression later.