Query Performance

Hyperfunctions

TimescaleDB's hyperfunctions make writing analytics queries easier and more powerful. You've already been introduced to the time_bucket function, which groups rows into a specific interval in the previous chapter.

You will likely use time_bucket_gapfill most because your hypertables usually don't have data for every time interval you're aggregating for. With shorter intervals (e.g. five minutes), gaps with no data (for a specific condition) are more common. You can either write code to fill them or have TimescaleDB do it. When you let TimescaleDB handle that task by replacing time_bucket with time_bucket_gapfill in your queries, you can also choose from two optional strategies:

copy the last value (locf: Last Observation Carried Forward)
interpolation to calculate realistic values

SELECT
  time_bucket_gapfill('5 minutes', time, NOW() - INTERVAL '24h', NOW()) AS bucket,
  count(*),
  locf(count(*)),
  interpolate(count(*))
FROM requests
WHERE time >= NOW() - INTERVAL '24h'
GROUP BY bucket

The primary focus of TimescaleDB's hyperfunctions is not to provide helper functions. They're designed to address a common issue when working with continuous and realtime aggregates. How do you re-aggregate the resulting rows of the following continuous aggregate for a larger time interval?

CREATE MATERIALIZED VIEW requests_15min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '15 minutes', time) AS bucket_15min,
  AVG(duration) AS duration_avg
FROM requests
GROUP BY bucket_15min

It is not easy. Calculating SUM(duration_avg) doesn't make sense. And neither does AVG(duration_avg), because this approach doesn't take into account that each average represents a different number of values. The correct approach is to track the count and sum separately and then calculate the average from these two values.

CREATE MATERIALIZED VIEW requests_15min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '15 minutes', time) AS bucket_15min,
  SUM(duration) AS duration_sum,
  COUNT(*) AS duration_count
FROM requests
GROUP BY bucket_15min;


SELECT SUM(duration_sum) / SUM(duration_count) FROM requests_15min;

You'll end up with a complicated schema and queries if you have to do this for multiple columns. The problem is that aggregating values is a two-step process for any database which is hidden from you: First, all values are processed by calculating an internal state. After that, the result is calculated from the state.

So with any aggregate function (like AVG), you're limited to the result. But to allow re-aggregation, you need access to the internal state. The workaround (as before) is to store enough information that you can re-aggregate results later. This is a lot of work, especially for more complex aggregate functions.

TimescaleDB's innovative idea is to break down the aggregation process into manual steps. After multiple values are aggregated, hyperfunctions return the calculated state instead of the result so you can store it in a continuous aggregate:

Because of this, you now have the flexibility to rollup (merge) multiple states at any time and calculate the result from any state or rolled-up state.

The former example can now be rewritten using the stats_agg hyperfunction to calculate and store the duration state within the continuous aggregate. Later, you can combine this state with rollup and calculate the average value using the average state accessor function. Furthermore, the state can be used to calculate additional statistics such as variance and standard deviation.

CREATE MATERIALIZED VIEW requests_15min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket(INTERVAL '15 minutes', time) AS bucket_15min,
  stats_agg(duration) AS duration
FROM requests
GROUP BY bucket_15min;


SELECT
  average(rollup(duration)),
  variance(rollup(duration), 'population'),
  stddev(rollup(duration), 'population')
FROM requests_15min;

This is a game-changer for continuous aggregates, as you can now store aggregate states in rows to reuse them later. You can merge them when querying or build hierarchical continuous aggregates. The new aggregate functions with partial aggregation replaces the limited calculations you could do with continuous aggregates and offer you a new approach with unlimited possibilities.

There are many more hyperfunctions than the one shown here and it's worth exploring the hyperfunctions reference to discover their capabilities. They can be categorized into exact and approximate aggregations. Some calculations require access to all values to produce accurate results. Imagine an aggregation named nth(column, number) which returns the value of the n-th row. This aggregation can't calculate a small state because every value is essential. To address this, TimescaleDB also offers approximate aggregations that work on a small state with a known error percentage instead of producing an exact result would have to save all values. So you can save the state for even trillions of rows without running out of storage space. A few will be introduced now so you have a better idea of what's possible.

Histogram exact

With the histogram aggregate, you can calculate the distribution of values. To analyze slow page load times, for example, you can use histogram(duration_total, 2, 10, 4). This will count all page load times into four buckets ranging from two seconds (inclusive) up to ten seconds (exclusive) - each covering two seconds. The result is an array like {121,89,40,11,7,43} to know the distribution of loading times:

The first element for all values below the minimum value
The four elements requested for values from two to ten seconds
The last element for all values greater than or equal to the maximum value

SELECT histogram(duration_total, 2, 10, 4)
FROM requests
WHERE url_domain = 'example.com' AND time >= NOW() - INTERVAL '24h'

Percentile approximate

You might also want to know about typical loading performance, not just a few two-second brackets. The statistical concept you're looking for is the median, which is the value that separates all values into two equal parts, one lower and one higher. TimescaleDB supports the median, also known as the 50% percentile, through the percentile_agg aggregate which provides a highly efficient approximate implementation.

SELECT approx_percentile(0.5, percentile_agg(duration_total))
FROM requests
WHERE url_domain = 'example.com' AND time >= NOW() - INTERVAL '24h'

HyperLogLog (Count Distinct) approximate

One common query you'll run is counting all unique values, e.g. COUNT(DISTINCT userid). However, to make distinct counts work, all values must be remembered to ignore duplicates. This requires a lot of storage, which goes against the design goal of hyperfunctions.

Luckily, TimescaleDB provides the hyperloglog aggregate for this. This algorithm estimates the number of distinct elements using complex mathematical concepts. But don't worry, it's easy to use! You can fine-tune the error probability and storage requirements with the hyperloglog aggregate, or use approx_count_distinct for defaulting to sensible defaults.

SELECT distinct_count(approx_count_distinct(duration_total))
FROM requests
WHERE url_domain = 'example.com' AND time >= NOW() - INTERVAL '24h'

TimerscaleDB's hyperfunctions are ideal for maximizing the benefits of continuous aggregates, as more results can be pre-computed. It's worth considering whether all your statistics require exact accuracy or if a small margin of error is tolerable. As your data grows, the performance benefits of pre-computing approximate results will become increasingly important. Eventually, calculating exact results with every query will put significant pressure on your database.