Storage Management
Data in any application grows over time and it usually grows fast. After a while, you'll encounter problems managing the size: For example, backing up the entire database regularly will take too long and produce massive files. Restoring those backups will also take an eternity. But hopefully, you will never need to do it, and you can accept a couple of hours of downtime. Additionally, you must keep an eye on your total size so you don't reach your disk's limit anytime soon. Analytics at a large scale can be challenging due to the sheer volume of data. But size isn't the only issue. Your costs will also increase a lot: Every stored backup will be large and together can be a significant factor of your monthly costs which is an often overlooked aspect in cost calculations.
However, the truth is that old data becomes less important over time. You may no longer need data that is several months or years old. This is where TimescaleDB's automatic retention feature comes in: You define a cutoff time when data should be deleted, and TimescaleDB automatically handles it.
SELECT add_retention_policy('requests', INTERVAL '3 months');
This retention policy is also more efficient than custom implementations: TimescaleDB drops entire chunks instead of single rows. Even a single big DELETE query would be slow because a chunk's table must be reorganized multiple times during the operation, whereas dropping a chunk only requires removing a file from disk. But therefore TimescaleDB's retention can only drop a chunk when all its rows are suitable for deletion.
However, assuming you no longer need some data is overly optimistic. As your rows get older, you're probably okay with getting rid of detailed information for every second - as long as you still have your aggregated data by continuous aggregates. You were only querying the continuous aggregates for anything older than a few months anyway.
This use case is already covered by continuous aggregates since they keep their data when a time range is not refreshed. All you need to do is configure a continuous aggregate to only update values after a specific date offset.
SELECT add_continuous_aggregate_policy(
'requests_15min',
schedule_interval => INTERVAL '5 minutes',
- start_offset => NULL,
+ start_offset => INTERVAL '3 months',
end_offset => NULL
);
With the adjusted continuous aggregate, changes to rows older than three months will no longer automatically update the aggregate. The retention policy can now safely drop a chunk and the aggregated results will still be there.
You must configure this correctly. If you don't match the refresh window of the continuous aggregate to the retention policy, refreshes will try to load non-existing data and delete your aggregated rows.
Likewise, you don't have to keep the aggregated results in a continuous aggregate forever as they will grow over time too. You can add a retention policy to a continuous aggregate just like you do with a hypertable. This also enables an interesting optimization: By combining retention policies with hierarchical continuous aggregates, you can gradually remove more detailed information at lower hierarchy levels as detailed data gets less interesting.