Efficiently Delete Old Rows with Partitions

Deleting hundreds of thousands of rows is slow, impacts your performance and doesn't free the allocated disk space. By using time partitioning you can do it very fast with reclaiming disk space.

Tables storing historical data like audit logs or saved webhooks will drastically increase in size over time. But deleting old rows will be slow and the disk space will only be reclaimed when rebuilding them. This can take hours with big tables. But the cleanup workflow can be optimized by transparently splitting those tables into many smaller (e.g. monthly) partitions. Dropping an entire partition is very fast because just the linked partition file on disk will be removed, reclaiming the used disk space.

Usage

MySQL

CREATE TABLE audits (
  -- more columns...
  created_at timestamp NOT NULL
) PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
  PARTITION audits_2023_01 VALUES
    LESS THAN (UNIX_TIMESTAMP('2023-02-01 00:00:00')),
  PARTITION audits_2023_02 VALUES
    LESS THAN (UNIX_TIMESTAMP('2023-03-01 00:00:00')),
  PARTITION audits_2023_03 VALUES
    LESS THAN (UNIX_TIMESTAMP('2023-04-01 00:00:00')),
  PARTITION audits_2023_04 VALUES
    LESS THAN (UNIX_TIMESTAMP('2023-05-01 00:00:00'))
);

ALTER TABLE audits DROP PARTITION audits_2023_01;

PostgreSQL

CREATE TABLE audits (
  -- more columns...
  created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE audits_2023_01 PARTITION OF audits FOR VALUES
  FROM ('2023-01-01 00:00:00+00') TO ('2023-01-31 23:59:59+00');
CREATE TABLE audits_2023_02 PARTITION OF audits FOR VALUES
  FROM ('2023-02-01 00:00:00+00') TO ('2023-02-28 23:59:59+00');
CREATE TABLE audits_2023_03 PARTITION OF audits FOR VALUES
  FROM ('2023-03-01 00:00:00+00') TO ('2023-03-31 23:59:59+00');
CREATE TABLE audits_2023_04 PARTITION OF audits FOR VALUES
  FROM ('2023-04-01 00:00:00+00') TO ('2023-04-30 23:59:59+00');

ALTER TABLE audits DETACH PARTITION audits_2023_01 CONCURRENTLY;

DROP TABLE audits_2023_01;

Detailed Explanation

Any application used for some years will grow linear or exponentially in data size. You get more users, they do more within your application and your data is growing and growing. With the constant size increase, you are experiencing new problems: Your backup process is becoming much slower, changing the table structure or adding indexes takes forever and much more you can experience.

After some time, you decide to delete old historical to free up some space by deleting data that is no longer needed to make the big database more manageable. Good cleanup candidates are audit tables, any logging information or saved webhooks. You won't need them anymore if they were created a few months ago. But just deleting hundreds of thousands of rows has several problems:

  • Deleting them will take a very long time and may severely impact the performance of your system.
  • All those changes need to be transferred and applied to your replicated databases which can create significant replication lags. You may have to implement a more complex chunking behavior to distribute the load on a longer timeframe.
  • The disk size claimed by the deleted rows will not be freed. You have to execute VACUUM FULL on PostgreSQL or OPTIMIZE TABLE on MySQL to completely rebuild the table to shrink it. But the exclusive lock by these operations will block any access with big tables for a very long time.

These drawbacks are unacceptable for most applications. So databases added a solution to many widespread problems including all stated ones. With table partitioning, one big table is split transparently into many smaller partitions. In our example, we've set up monthly partitions for January to April 2023. All queries are still executed on the parent table but the database will re-route them transparently to the affected partitions without any effort. From a SQL perspective, there is no difference between one big table and the partitioned one. You are still executing select or insert queries on the main table.

But with the partitioned approach, we can delete an entire month's data by dropping one table partition instead of deleting specific rows. The database file of that partition will be removed from the filesystem with a single fast operation and won't leave back over any wasted storage space. If you have different retain and cleanup rules you can also create partitions of different time ranges like quarters of weeks

Additional Resources

Author Tobias Petry

Tobias Petry

SQL for Devs Founder


Are you interested to learn more?

Be notified on future content. Never spam.