Delete duplicate rows from small table

Sql, Postgresql

Sql Problem Overview


I have a table in a PostgreSQL 8.3.8 database, which has no keys/constraints on it, and has multiple rows with exactly the same values.

I would like to remove all duplicates and keep only 1 copy of each row.

There is one column in particular (named "key") which may be used to identify duplicates, i.e. there should only exist one entry for each distinct "key".

How can I do this? (Ideally, with a single SQL command.)
Speed is not a problem in this case (there are only a few rows).
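For concreteness, the solutions below can be tried against a small table like this one (the table name dups and the columns are illustrative; "key" matches the column named in the question):

```sql
-- Hypothetical test table matching the question's description
CREATE TABLE dups (
    key integer,
    val text
);

INSERT INTO dups (key, val) VALUES
    (1, 'a'), (1, 'a'),            -- duplicates
    (2, 'b'),
    (3, 'c'), (3, 'c'), (3, 'c');  -- duplicates
```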

Sql Solutions


Solution 1 - Sql

A faster solution is

DELETE FROM dups a USING (
    SELECT MIN(ctid) AS ctid, key
    FROM dups
    GROUP BY key
    HAVING COUNT(*) > 1
) b
WHERE a.key = b.key
AND a.ctid <> b.ctid;

Solution 2 - Sql

DELETE FROM dupes a
WHERE a.ctid <> (SELECT min(b.ctid)
                 FROM   dupes b
                 WHERE  a.key = b.key);

Solution 3 - Sql

This is fast and concise:

DELETE FROM dupes T1
    USING   dupes T2
WHERE   T1.ctid < T2.ctid  -- delete the older versions
    AND T1.key  = T2.key;  -- add more columns if needed

See also my answer at "How to delete duplicate rows without unique identifier", which includes more information.

Solution 4 - Sql

I tried this:

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

as provided by the Postgres wiki:

https://wiki.postgresql.org/wiki/Deleting_duplicates

Solution 5 - Sql

EXISTS is simple and among the fastest for most data distributions:

DELETE FROM dupes d
WHERE  EXISTS (
   SELECT FROM dupes
   WHERE  key = d.key
   AND    ctid < d.ctid
   );

From each set of duplicate rows (defined by identical key), this keeps the one row with the minimum ctid.

Result is identical to the currently accepted answer by a_horse. Just faster, because EXISTS can stop evaluating as soon as the first offending row is found, while the alternative with min() has to consider all rows per group to compute the minimum. Speed is of no concern to this question, but why not take it?

You may want to add a UNIQUE constraint after cleaning up, to prevent duplicates from creeping back in:

ALTER TABLE dupes ADD CONSTRAINT constraint_name_here UNIQUE (key);

About the system column ctid:

If any other UNIQUE NOT NULL column is defined in the table (like a PRIMARY KEY), then by all means use it instead of ctid.

If key can be NULL and you want to keep just one of those rows, too, use IS NOT DISTINCT FROM instead of =.
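Such a NULL-safe variant might look like this sketch (the same EXISTS query as above, with = swapped for IS NOT DISTINCT FROM):

```sql
-- NULL-safe duplicate removal: NULL keys also count as duplicates of each other
DELETE FROM dupes d
WHERE  EXISTS (
   SELECT FROM dupes
   WHERE  key IS NOT DISTINCT FROM d.key
   AND    ctid < d.ctid
   );
```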

As that's slower, you might instead run the above query as is, and this in addition:

DELETE FROM dupes d
WHERE  key IS NULL
AND    EXISTS (
   SELECT FROM dupes
   WHERE  key IS NULL
   AND    ctid < d.ctid
   );

And consider:

For small tables, indexes generally do not help performance. And we need not look further.

For big tables and few duplicates, an existing index on (key) can help (a lot).
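If such an index does not exist yet, it could be created like this (the index name is illustrative):

```sql
CREATE INDEX dupes_key_idx ON dupes (key);
```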

For mostly duplicates, an index may add more cost than benefit, as it has to be kept up to date concurrently. Finding duplicates without an index becomes faster anyway because there are so many, and EXISTS only needs to find one. But consider a completely different approach if you can afford it (i.e. concurrent access allows it): write the few surviving rows to a new table. That also removes table (and index) bloat in the process.

Solution 6 - Sql

I would use a temporary table:

create table tab_temp as
select distinct f1, f2, f3, fn
  from tab;

Then drop tab and rename tab_temp to tab.
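The drop-and-rename step might look like this sketch (wrapped in a transaction so the table never appears missing to other sessions):

```sql
BEGIN;
DROP TABLE tab;
ALTER TABLE tab_temp RENAME TO tab;
COMMIT;
```

Note that any indexes, constraints, and privileges on the original tab are lost by this approach and must be recreated on the new table.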

Solution 7 - Sql

I had to create my own version. The version written by @a_horse_with_no_name is way too slow on my table (21M rows), and @rapimo's simply doesn't delete dups.

Here is what I use on PostgreSQL 9.5:

DELETE FROM your_table
WHERE ctid IN (
  SELECT unnest(array_remove(all_ctids, actid))
  FROM (
         SELECT
           min(b.ctid)     AS actid,
           array_agg(ctid) AS all_ctids
         FROM your_table b
         GROUP BY key1, key2, key3, key4
         HAVING count(*) > 1) c);

Solution 8 - Sql

Another approach (it works only if you have a unique field, like id, in your table): find one id per distinct combination of columns, and remove all other ids that are not in that unique list.

DELETE
FROM users
WHERE users.id NOT IN (SELECT DISTINCT ON (username, email) id FROM users);
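Note that DISTINCT ON without an ORDER BY keeps an arbitrary row from each group. To control which row survives, add an ORDER BY that starts with the DISTINCT ON expressions, e.g. keeping the lowest id:

```sql
DELETE
FROM users
WHERE users.id NOT IN (
    SELECT DISTINCT ON (username, email) id
    FROM users
    ORDER BY username, email, id   -- keeps the row with the smallest id per group
);
```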

Solution 9 - Sql

How about:

WITH
u AS (SELECT DISTINCT * FROM your_table),
x AS (DELETE FROM your_table)
INSERT INTO your_table SELECT * FROM u;

I was concerned about the execution order (would the DELETE happen before the SELECT DISTINCT?), but it works fine: all statements in a WITH query see the same snapshot, so u captures the rows before x deletes them. It has the added bonus of not needing any knowledge of the table structure.

Solution 10 - Sql

Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:

DELETE FROM dups
USING (
  SELECT
    ctid,
    (
      ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
    ) AS is_duplicate
  FROM dups 
) dups_find_duplicates
WHERE dups.ctid = dups_find_duplicates.ctid
AND dups_find_duplicates.is_duplicate;

A subquery is used to mark all rows as duplicates or not, based on whether they share the same "key columns", but not the same ctid, as the "first" one found in the "partition" of rows sharing the same keys.

In other words, "first" is defined as:

  • min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])

Then, all rows where is_duplicate is true are deleted by their ctid.

From the documentation, ctid represents (emphasis mine):

> The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.

Solution 11 - Sql

PostgreSQL has window functions; you can use rank() to achieve your goal. Sample:

WITH ranked AS (
	SELECT
		id, column1,
		rank() OVER (
			PARTITION BY column1
			ORDER BY id  -- order by a unique column; ordering by column1 itself would tie every row at rank 1
		) AS r
	FROM
		table1
)
DELETE FROM table1 t1
USING ranked
WHERE t1.id = ranked.id AND ranked.r > 1;

Solution 12 - Sql

This worked well for me. I had a table, terms, that contained duplicate values. I ran a query to populate a temp table with the ids of all the duplicate rows, then ran a delete statement with those ids. value is the column that contained the duplicates.

CREATE TEMP TABLE dupids AS
SELECT id
FROM (
    SELECT value, id,
           row_number() OVER (PARTITION BY value ORDER BY value) AS rownum
    FROM terms
) tmp
WHERE rownum >= 2;

DELETE FROM terms WHERE id IN (SELECT id FROM dupids);

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | André Morujão | View Question on Stackoverflow
Solution 1 - Sql | rapimo | View Answer on Stackoverflow
Solution 2 - Sql | a_horse_with_no_name | View Answer on Stackoverflow
Solution 3 - Sql | isapir | View Answer on Stackoverflow
Solution 4 - Sql | Radu Gabriel | View Answer on Stackoverflow
Solution 5 - Sql | Erwin Brandstetter | View Answer on Stackoverflow
Solution 6 - Sql | Pablo Santa Cruz | View Answer on Stackoverflow
Solution 7 - Sql | expert | View Answer on Stackoverflow
Solution 8 - Sql | Logovsky Dmitry | View Answer on Stackoverflow
Solution 9 - Sql | Barrie Walker | View Answer on Stackoverflow
Solution 10 - Sql | LeoRochael | View Answer on Stackoverflow
Solution 11 - Sql | NewBee | View Answer on Stackoverflow
Solution 12 - Sql | Beanwah | View Answer on Stackoverflow