Every application will have duplicate rows at some point. The cleanup is often implemented in application logic, although the database can do it with a single query, including selecting which rows to keep.
After some time, most applications will have duplicated rows, resulting in a worse user experience, higher storage requirements and degraded database performance. The cleaning process is usually implemented in application code with complex chunking behavior because the data does not fit into memory completely. However, a single SQL query can do the whole process, including prioritizing which rows to keep and controlling how many duplicates remain.
-- MySQL
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1;
-- PostgreSQL
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;
No matter how good an application is, duplicate rows will appear after some time. They may not be a big problem at first, but the user experience suffers quickly when the same entries are listed repeatedly, and database performance degrades because of the larger data size. Furthermore, an efficient unique index, which tells the database it can stop searching after the first matching row, can no longer be created. These duplicate rows must be removed. While inserting them was easy, removing them is a much more complex task.
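As a side note on that unique index: once the duplicates are gone, such an index on the duplicate-defining columns prevents new ones from creeping in and lets the database stop searching after the first match. A minimal sketch, assuming the contacts table and columns from the queries above (the index name is made up):

-- Hypothetical follow-up after the cleanup: the index name is only an example,
-- the columns are the duplicate-defining ones from the queries above.
CREATE UNIQUE INDEX contacts_unique_person
ON contacts (firstname, lastname, email);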
The standard approach is to GROUP BY on the duplicate columns and keep one remaining row per group using the MIN(id) or MAX(id) value.
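A minimal sketch of that approach, assuming the contacts table from above and keeping the row with the smallest id per group:

-- Keep MIN(id) per duplicate group and delete everything else.
-- The extra derived table is needed on MySQL, which refuses to delete from
-- a table it selects from in the same statement.
DELETE FROM contacts
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM contacts
        GROUP BY firstname, lastname, email
    ) AS rows_to_keep
);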
This simple way of deleting duplicate rows no longer works when additional requirements have to be followed, such as choosing which row to keep based on a priority (the ORDER BY age DESC above keeps the row with the highest age) or keeping more than one duplicate per group.
To comply with these requirements, the rows are typically loaded into the application's memory in small chunks, and application code decides which duplicates to delete. However, this shifts a lot of data around unnecessarily. For the best efficiency, the execution should happen where the data is, which is exactly what the window-function queries above do.
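The same pattern also covers the extra requirements with minimal changes. For example, keeping the two highest-priority rows of each group instead of one only changes the final comparison; a sketch based on the PostgreSQL variant from above (the threshold of 2 is just an example):

-- Keep the two rows with the highest age per duplicate group, delete the rest.
WITH duplicates AS (
    SELECT id, ROW_NUMBER() OVER(
        PARTITION BY firstname, lastname, email
        ORDER BY age DESC
    ) AS rownum
    FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 2;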