Delete Duplicate Rows

Every application will have duplicate rows at some point. Cleaning them up is often implemented in application logic, although the database can do it with a single query, including selecting which rows to keep.

After some time, most applications will have duplicated rows, resulting in a bad user experience, higher storage requirements and lower database performance. The cleaning process is usually implemented in application code with complex chunking logic because the data does not fit into memory completely. However, a single SQL query can do the whole job, including prioritizing which rows to keep and how many duplicates to allow.

Usage

MySQL

-- Number every row within its group of duplicates; the row to keep is
-- sorted first (here: highest age first).
WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER(
    PARTITION BY firstname, lastname, email
    ORDER BY age DESC
  ) AS rownum
  FROM contacts
)
-- Delete every row except the first one of each group.
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1;

PostgreSQL

-- Number every row within its group of duplicates; the row to keep is
-- sorted first (here: highest age first).
WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER(
    PARTITION BY firstname, lastname, email
    ORDER BY age DESC
  ) AS rownum
  FROM contacts
)
-- Delete every row except the first one of each group.
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;

Detailed Explanation

No matter how good an application is, duplicate rows will appear after some time. They may not be a big problem at first. However, the user experience quickly suffers when duplicate rows show up repeatedly in listings, and database performance degrades because of the larger data size. Furthermore, a unique index, which tells the database it can stop searching after the first matching row, can no longer be created while duplicates exist. These duplicate rows must be removed. While inserting them was easy, removing them is a much more complex task.
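As a side note, once the table is clean, a unique index could be created to keep exact duplicates from coming back. This is only a sketch under the assumption that the application never wants exact duplicates again; the index name is made up for illustration:

-- Hypothetical follow-up after the cleanup: a unique index rejects new
-- exact duplicates and lets the database stop after the first match.
CREATE UNIQUE INDEX contacts_unique_person
  ON contacts (firstname, lastname, email);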

The standard approach is to GROUP BY the duplicate-defining columns and keep a single row per group by its MIN(id) or MAX(id) value. This simple way of deleting duplicate rows will not work when additional requirements have to be met (a sketch of it follows the list below):

  • Instead of deleting all duplicate rows, some should be kept. Duplicate rows may be a valid use case for some applications, but they should be limited to e.g. the last five created.
  • The remaining row should not simply be the first or last one created. In some cases, additional columns set a priority for which row to keep: a verified user should not be deleted in favor of an unverified one.
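
For comparison, the simple approach mentioned above could look like the following sketch (PostgreSQL syntax, assuming id is the primary key). It always keeps exactly one row per group, chosen only by its id, and cannot express either of these requirements:

-- Keep only the row with the smallest id for every combination of the
-- duplicate-defining columns; delete everything else.
DELETE FROM contacts
WHERE id NOT IN (
  SELECT MIN(id)
  FROM contacts
  GROUP BY firstname, lastname, email
);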

To comply with these requirements, all rows are typically loaded into the application's memory in small chunks, and application code calculates which duplicate rows to delete. However, this is inefficient because large amounts of data are shifted between the database and the application. For best efficiency, the work should happen where the data lives, which is possible with SQL window functions:

  1. The rows are partitioned by the columns indicating a duplicate row. For every combination of values in the specified columns, a partition is automatically created to collect the duplicate rows.
  2. Every partition is sorted by one or more columns that mark a row's importance. If e.g. only the last five records should be kept, the partition's rows should be sorted by their creation date in descending order.
  3. The sorted rows within a partition are assigned an incrementing number by the ROW_NUMBER() window function.
  4. Any row can be deleted according to the desired number of remaining rows. When e.g. only the last five rows should be kept, any row with a row number greater than five can be deleted, as sketched after this list.
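
The CTE from the Usage section can also be run as a plain SELECT first to check which rows would be deleted. The following sketch applies these four steps to keep only the five most recently created rows of every duplicate group; the created_at column is an assumption and not part of the contacts example above:

-- Steps 1 and 2: partition by the duplicate-defining columns and sort the
-- newest rows first. Step 3: number the rows. Step 4: select every row
-- beyond the first five as a deletion candidate.
WITH duplicates AS (
  SELECT id, ROW_NUMBER() OVER(
    PARTITION BY firstname, lastname, email
    ORDER BY created_at DESC
  ) AS rownum
  FROM contacts
)
SELECT id
FROM duplicates
WHERE rownum > 5;

Swapping the final SELECT for the DELETE form shown in the Usage section removes exactly those rows.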

Additional Resources

  • MySQL Documentation: DELETE statements for multiple tables.
  • PostgreSQL Documentation: DELETE statements for multiple tables.
Author: Tobias Petry, SQL for Devs Founder