Make Analytics Fast
Learn

Continuous Aggregates

A columnstore is already the state-of-the-art solution for fast analytics queries. However, there's still room for improvement. You could further speed up queries by accessing pre-computed results instead of aggregating rows each time a query is executed...

That sounds easy. For every new row inserted into the hypertable, you'll update values in some summary tables. The problem is that only one query can update the same row in a summary table at a time (concurrent locking). So each query must wait until all earlier queries have finished updating the same row.

As shown, this implementation will cause problems in production because it can't handle a high volume of requests per second. With each new query, the waiting time increases until the database becomes overwhelmed by a mass of blocked queries waiting to run.

However, the idea of summary tables is great, but it can't be implemented in a performant way by us. The TimescaleDB extension solves this problem by tapping into PostgreSQL's internal capabilities. This way, you can have TimescaleDB automatically maintain a pre-computed result (Continuous Aggregate) for your query.

CREATE MATERIALIZED VIEW requests_15min
WITH (timescaledb.continuous) AS
SELECT
url_domain, service, server_id, url_path,
time_bucket(INTERVAL '15 minutes', time) AS bucket_15min,
COUNT(*) AS requests,
MIN(duration_total) AS duration_min,
MAX(duration_total) AS duration_max
FROM requests
GROUP BY bucket_15min, url_domain, service, server_id, url_path;

SELECT add_continuous_aggregate_policy('requests_15min', schedule_interval => INTERVAL '5 minutes', start_offset => NULL, end_offset => NULL);

With the above code, you instruct TimescaleDB to update the `requests_hourly` continuous aggregate every five minutes. It's saving 15-minute pre-computed results that can be:

  • filtered to the specific tenant needed for a specific dashboard
    SELECT COUNT(*)
    FROM requests_15min
    WHERE url_domain = 'example.com' AND service = 'shop'
  • aggregated again to get e.g. daily results
    SELECT COUNT(*), time_bucket(INTERVAL '1 hour', bucket_15min) AS bucket_1h
    FROM requests_15min
    GROUP BY bucket_1h

These continuous aggregates are updated fully automatic, so there's no need for you to take any action anymore. TimescaleDB efficiently keeps them updated by identifying which time buckets have changed (due to new inserts, updates, or deletes) and only recalculates those on its periodic updates. Even with years of data, the update process remains quick because every time only a few buckets may have changed since the last run.

Some applications can't tolerate a 5-minute delay in seeing new data in their continuous aggregates. For these cases, you can use realtime aggregates. They work by always pulling the latest data from the hypertable and combining it with older pre-computed data on the fly.

CREATE MATERIALIZED VIEW requests_15min
WITH (timescaledb.continuous, timescaledb.materialized_only = FALSE) AS
SELECT
url_domain, service, server_id, url_path,
time_bucket(INTERVAL '15 minutes', time) AS bucket_15min,
COUNT(*) AS requests,
MIN(duration_total) AS duration_min,
MAX(duration_total) AS duration_max
FROM requests
GROUP BY bucket_15min, url_domain, service, server_id, url_path
WITH NO DATA;

SELECT add_continuous_aggregate_policy('requests_15min', schedule_interval => INTERVAL '5 minutes', start_offset => NULL, end_offset => INTERVAL '30 minutes');

For this magic to work, TimescaleDB remembers the timestamp up to which the data has already been pre-computed. So the realtime aggregate must be created with the WITH NO DATA option so that not all data is pre-computed on creation. Additionally, you must add the end_offset parameter to the policy, which defines the timespan that should not be pre-computed. At least two times the time interval should be used that the newest data is never pre-computed. If you have many inserts happening much later than their timestamps (e.g. queuing), you should increase the value. In the previous example, the last 30 minutes are exempt from pre-computation.

Applications often need to aggregate data across various time intervals, such as hourly, daily, weekly, monthly and yearly. To do this, you can either aggregate again the 15-minute aggregate when querying or create separate continuous/realtime aggregates for each time interval. The downside of many aggregates is that they put pressure on the database, as they all must update their results whenever there are changes. A more efficient approach is to create hierarchical aggregates that load data from other aggregates rather than directly from the hypertable. For example, the hourly aggregate can be calculated using pre-computed 15-minute aggregate results, the daily aggregate can be calculated from the hourly aggregate, and so on.

CREATE MATERIALIZED VIEW requests_hourly
WITH (timescaledb.continuous, timescaledb.materialized_only = FALSE) AS
SELECT
url_domain, service, server_id, url_path,
time_bucket(INTERVAL '1 day', bucket_15min) AS bucket_1h,
SUM(requests) AS requests,
MIN(duration_min) AS duration_min,
MAX(duration_max) AS duration_max
FROM requests_15min
GROUP BY bucket_1h, url_domain, service, server_id, url_path
WITH NO DATA;

SELECT add_continuous_aggregate_policy('requests_hourly', schedule_interval => INTERVAL '5 minutes', start_offset => NULL, end_offset => INTERVAL '2 hours');

The hierarchical realtime aggregate approach is also more efficient when querying. When you query an hourly aggregate (as configured before), TimescaleDB must load the past two hours from the hypertable. However, hierarchical aggregates can use the underlying 15-minute realtime aggregate for that time interval, which has already pre-computed everything except for the last 30 minutes. As a result, the query executes much faster.

While continuous aggregates and realtime aggregates may appear magical, they are just hypertables managed by TimescaleDB. As a consequence, you can also add indexes and activate the column store to improve the speed of finding and aggregating rows.