Fill Gaps in Statistical Time Series Results

Statistical calculations on a time frame have to handle missing data. Instead of filling in the missing records in the application, the query can be changed to do that directly.

The results for some statistical calculations will have gaps because no information was saved for specific days. But instead of back-filling these holes with application code, the database query can be restructured: A sequence of gapless values is created as source for joining to the statistical data. For PostgreSQL the generate_series function might be used to create the sequence, whereas for MySQL the same needs to be performed manually using a recursive common table expression (CTE).

Usage

MySQL

WITH RECURSIVE dates_without_gaps(day) AS (
  SELECT DATE_SUB(CURRENT_DATE, INTERVAL 14 DAY) as day
  UNION ALL
  SELECT DATE_ADD(day, INTERVAL 1 DAY) as day
  FROM dates_without_gaps
  WHERE day < CURRENT_DATE
)
SELECT dates_without_gaps.day, COALESCE(SUM(statistics.count), 0)
FROM dates_without_gaps
LEFT JOIN statistics ON(statistics.day = dates_without_gaps.day)
GROUP BY dates_without_gaps.day;

PostgreSQL

SELECT dates_without_gaps.day, COALESCE(SUM(statistics.count), 0)
FROM generate_series(
  CURRENT_DATE - INTERVAL '14 days',
  CURRENT_DATE,
  '1 day'
) as dates_without_gaps(day)
LEFT JOIN statistics ON(statistics.day = dates_without_gaps.day)
GROUP BY dates_without_gaps.day;

Detailed Explanation

Many applications have statistically interesting information like user registrations, number of created blog articles or sales. These rows are often aggregated by specific time frames like days, weeks or months for reporting purposes. The result is a time series showing the change of values by equal length time periods. However, the time series misses some values when one or multiple periods did not have any events. If the data is then displayed on a chart, the graphic will be misleading because e.g. January will be directly followed by April.

Gaps in data are traditionally resolved in the application. But this adds more verbose code as the timespan handling needs to be copied from the database, sometimes with the added complexity of user-configurable timespans and periods. The programmer has to develop two different implementations for the statistical aggregation, one for building the query and then one to fill in the gaps. Building one SQL statement that produces the results without gaps would be the easiest solution.

The idea is to reverse the workflow by filling the gaps before calculating the statistics. A runtime-only relation is created, named dates_without_gaps , that will have all the dates that would be present after the calculation if the data did not have any missing values. By using this relation, the statistical data can be joined to the timeframe reference so the result will no longer have any gaps if some days do not have any data. Building this temporary relation is straightforward when working with PostgreSQL: The generate_series function creates all the data based on an interval and the starting together with an ending point. Whereas for MySQL, the same needs to be performed manually using a recursive common table expression (CTE).

Additional Resources

Author Tobias Petry

Tobias Petry

SQL for Devs Founder


Are you interested to learn more?

Be notified on future content. Never spam.