How to delete duplicate entries?

Sql, Postgresql, Duplicate Removal, Unique Constraint, Sql Delete

Sql Problem Overview


I have to add a unique constraint to an existing table. This is fine except that the table has millions of rows already, and many of the rows violate the unique constraint I need to add.

What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

Sql Solutions


Solution 1 - Sql

Some of these approaches seem a little complicated, and I generally do it like this:

Given a table named table, I want to make it unique on (field1, field2), keeping the row with the max field3:

DELETE FROM table USING table alias
  WHERE table.field1 = alias.field1 AND table.field2 = alias.field2 AND
    table.field3 < alias.field3;

For example, I have a table, user_accounts, and I want to add a unique constraint on email, but I have some duplicates. Say also that I want to keep the most recently created one (max id among duplicates).

DELETE FROM user_accounts USING user_accounts ua2
  WHERE user_accounts.email = ua2.email AND user_accounts.id < ua2.id;
  • Note: USING is not standard SQL; it is a PostgreSQL extension (a very useful one), but the original question specifically mentions PostgreSQL.
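
Once the duplicates are gone, the unique constraint from the original question can be added. A minimal sketch for the user_accounts example above (the constraint name is illustrative):

ALTER TABLE user_accounts ADD CONSTRAINT user_accounts_email_key UNIQUE (email);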

Solution 2 - Sql

For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
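
A slightly fuller sketch of the same idea, assuming the table is t and that LIKE ... INCLUDING ALL should copy its column defaults, constraints and indexes (foreign keys referencing t, and any sequences owned by its serial columns, would still need manual handling):

BEGIN;
CREATE TABLE tmp (LIKE t INCLUDING ALL);
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp RENAME TO t;
COMMIT;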

Solution 3 - Sql

Instead of creating a new table, you can also re-insert unique rows into the same table after truncating it. Do it all in one transaction.

This approach is only useful where there are lots of rows to delete from all over the table. For just a few duplicates, use a plain DELETE.

You mentioned millions of rows. To make the operation fast you want to allocate enough temporary buffers for the session. The setting has to be adjusted before any temp buffer is used in your current session. Find out the size of your table:

SELECT pg_size_pretty(pg_relation_size('tbl'));

Set temp_buffers at least a bit above that.

SET temp_buffers = '200MB';   -- example value

BEGIN;

CREATE TEMP TABLE t_tmp AS  -- retains temp for duration of session
SELECT DISTINCT * FROM tbl  -- DISTINCT folds duplicates
ORDER  BY id;               -- optionally "cluster" data

TRUNCATE tbl;

INSERT INTO tbl
SELECT * FROM t_tmp;        -- retains order (implementation detail)

COMMIT;

This method can be superior to creating a new table if depending objects exist: views, indexes, foreign keys or other objects referencing the table. TRUNCATE makes you begin with a clean slate anyway (new file in the background) and is much faster than DELETE FROM tbl with big tables (DELETE can actually be faster with small tables).

For big tables, it is regularly faster to drop indexes and foreign keys (FK), refill the table and recreate these objects. As far as FK constraints are concerned you have to be certain the new data is valid, of course, or you'll run into exceptions on trying to create the FK.
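
A hedged sketch of that drop-and-recreate dance, with hypothetical object and column names:

ALTER TABLE child_tbl DROP CONSTRAINT child_tbl_tbl_id_fkey;
DROP INDEX tbl_some_col_idx;

-- ... refill tbl with the TRUNCATE / INSERT block above ...

CREATE INDEX tbl_some_col_idx ON tbl (some_col);
ALTER TABLE child_tbl
  ADD CONSTRAINT child_tbl_tbl_id_fkey FOREIGN KEY (tbl_id) REFERENCES tbl (id);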

Note that TRUNCATE requires more aggressive locking than DELETE. This may be an issue for tables with heavy, concurrent load. But it's still less disruptive than to drop and replace the table completely.

If TRUNCATE is not an option, or generally for small to medium tables, there is a similar technique with a data-modifying CTE (Postgres 9.1+):

WITH del AS (DELETE FROM tbl RETURNING *)
INSERT INTO tbl
SELECT DISTINCT * FROM del
ORDER  BY id;               -- optionally "cluster" data while being at it

Slower for big tables, because TRUNCATE is faster there. But may be faster (and simpler!) for small tables.

If you have no depending objects at all, you might create a new table and delete the old one, but you hardly gain anything over this universal approach.

For very big tables that would not fit into available RAM, creating a new table will be considerably faster. You'll have to weigh this against possible troubles / overhead with depending objects.

Solution 4 - Sql

You can use oid or ctid, which are normally "non-visible" columns in the table:

DELETE FROM table
 WHERE ctid NOT IN
  (SELECT MAX(s.ctid)
    FROM table s
    GROUP BY s.column_has_be_distinct);

Solution 5 - Sql

PostgreSQL's window functions are handy for this problem.

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             row_number() over (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

See Deleting duplicates.

Solution 6 - Sql

Generalized query to delete duplicates:

DELETE FROM table_name
WHERE ctid NOT IN (
  SELECT max(ctid) FROM table_name
  GROUP BY column1, [column 2, ...]
);

The column ctid is a special column available for every table but not visible unless specifically mentioned. The ctid column value is considered unique for every row in a table. See PostgreSQL system columns to learn more about ctid.
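
A quick way to see this: ctid only appears in query output when you name it explicitly, e.g. (using the same table_name placeholder):

SELECT ctid, * FROM table_name LIMIT 5;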

Solution 7 - Sql

From an old postgresql.org mailing list:

create table test ( a text, b text );
-- Unique values
insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

-- Duplicate values
insert into test values ( 'x', 'y');
insert into test values ( 'x', 'x');
insert into test values ( 'y', 'y' );
insert into test values ( 'y', 'x' );

-- One more double duplicate
insert into test values ( 'x', 'y');

select oid, a, b from test;

Select duplicate rows

select o.oid, o.a, o.b from test o
    where exists ( select 'x'
                   from test i
                   where     i.a = o.a
                         and i.b = o.b
                         and i.oid < o.oid
                 );

Delete duplicate rows

Note: at the time of that post, PostgreSQL didn't support aliases on the table mentioned in the FROM clause of a DELETE (newer versions do).

delete from test
    where exists ( select 'x'
                   from test i
                   where     i.a = test.a
                         and i.b = test.b
                         and i.oid < test.oid
             );

Solution 8 - Sql

This function removes duplicates without dropping any indexes, and it works on any table.

Usage: select remove_duplicates('mytable');

--- remove_duplicates(tablename) removes duplicate records from a table (convert from set to unique set)

CREATE OR REPLACE FUNCTION remove_duplicates(text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE DISTINCT' || tablename || ' AS (SELECT DISTINCT * FROM ' || tablename || ');';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM DISTINCT' || tablename || ');';
  EXECUTE 'DROP TABLE DISTINCT' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;

Solution 9 - Sql

I just used Erwin Brandstetter's answer successfully to remove duplicates in a join table (a table lacking its own primary IDs), but found that there's one important caveat.

Including ON COMMIT DROP means the temporary table will get dropped at the end of the transaction. For me, that meant the temporary table was no longer available by the time I went to insert it!

I just did CREATE TEMPORARY TABLE t_tmp AS SELECT DISTINCT * FROM tbl; and everything worked fine.

The temporary table does get dropped at the end of the session.

Solution 10 - Sql

DELETE FROM table
  WHERE something NOT IN
    (SELECT MAX(s.something)
       FROM table AS s
      GROUP BY s.this_thing, s.that_thing);

Solution 11 - Sql

First, you need to decide on which of your "duplicates" you will keep. If all columns are equal, OK, you can delete any of them... But perhaps you want to keep only the most recent, or some other criterion?

The fastest way depends on your answer to the question above, and also on the % of duplicates on the table. If you throw away 50% of your rows, you're better off doing CREATE TABLE ... AS SELECT DISTINCT ... FROM ... ;, and if you delete 1% of the rows, using DELETE is better.

Also for maintenance operations like this, it's generally good to set work_mem to a good chunk of your RAM: run EXPLAIN, check the number N of sorts/hashes, and set work_mem to your RAM / 2 / N. Use lots of RAM; it's good for speed. As long as you only have one concurrent connection...
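
For illustration only, with made-up numbers (8 GB of RAM, a plan containing two sorts, and tbl as a placeholder table name):

EXPLAIN SELECT DISTINCT * FROM tbl;  -- count the Sort/Hash nodes in the plan (N)
SET work_mem = '2GB';                -- roughly RAM / 2 / N for 8 GB and N = 2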

Solution 12 - Sql

If you have only one or a few duplicated entries, and they are indeed duplicated (that is, they appear twice), you can use the "hidden" ctid column, as proposed above, together with LIMIT:

DELETE FROM mytable WHERE ctid=(SELECT ctid FROM mytable WHERE […] LIMIT 1);

This will delete only the first of the selected rows.

Solution 13 - Sql

I'm working with PostgreSQL 8.4. When I ran the proposed code, I found that it was not actually removing the duplicates. In running some tests, I found that adding "DISTINCT ON (duplicate_column_name)" and "ORDER BY duplicate_column_name" did the trick. I'm no SQL guru, but I found this in the PostgreSQL 8.4 SELECT...DISTINCT documentation.

CREATE OR REPLACE FUNCTION remove_duplicates(text, text) RETURNS void AS $$
DECLARE
  tablename ALIAS FOR $1;
  duplicate_column ALIAS FOR $2;
BEGIN
  EXECUTE 'CREATE TEMPORARY TABLE _DISTINCT_' || tablename || ' AS (SELECT DISTINCT ON (' || duplicate_column || ') * FROM ' || tablename || ' ORDER BY ' || duplicate_column || ' ASC);';
  EXECUTE 'DELETE FROM ' || tablename || ';';
  EXECUTE 'INSERT INTO ' || tablename || ' (SELECT * FROM _DISTINCT_' || tablename || ');';
  EXECUTE 'DROP TABLE _DISTINCT_' || tablename || ';';
  RETURN;
END;
$$ LANGUAGE plpgsql;

Solution 14 - Sql

This works very nicely and is very quick:

CREATE INDEX otherTable_idx ON otherTable (colName);
CREATE TABLE newTable AS SELECT DISTINCT ON (colName) col1, colName, col2 FROM otherTable;
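
To actually swap the deduplicated copy in place of the original, you could follow with the drop-and-rename step from Solution 2 (table names as in the example above):

DROP TABLE otherTable;
ALTER TABLE newTable RENAME TO otherTable;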

Solution 15 - Sql

CREATE TABLE test (col text);
INSERT INTO test VALUES
 ('1'),
 ('2'), ('2'),
 ('3'),
 ('4'), ('4'),
 ('5'),
 ('6'), ('6');
DELETE FROM test
 WHERE ctid in (
   SELECT t.ctid FROM (
     SELECT row_number() over (
               partition BY col
               ORDER BY col
               ) AS rnum,
            ctid FROM test
       ORDER BY col
     ) t
    WHERE t.rnum >1);

Solution 16 - Sql

DELETE FROM tablename
WHERE id IN (SELECT id
    FROM (SELECT id,ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                 FROM tablename) t
          WHERE t.rnum > 1);

Delete duplicates by column(s) and keep the row with the lowest id. The pattern is taken from the Postgres wiki.

Using CTEs, you can achieve a more readable version of the above:

WITH num_of_rows as (
    SELECT id,
        ROW_NUMBER() over (partition BY column1,
                                        column2,
                                        column3 ORDER BY id) AS rnum
    FROM tablename
),
duplicate_ids as (
    SELECT id, rnum
    FROM num_of_rows
    WHERE rnum > 1
)
DELETE FROM tablename
WHERE id IN (SELECT id FROM duplicate_ids);

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | gjrwebber | View Question on Stackoverflow
Solution 1 - Sql | Tim | View Answer on Stackoverflow
Solution 2 - Sql | just somebody | View Answer on Stackoverflow
Solution 3 - Sql | Erwin Brandstetter | View Answer on Stackoverflow
Solution 4 - Sql | Jan Marek | View Answer on Stackoverflow
Solution 5 - Sql | shekwi | View Answer on Stackoverflow
Solution 6 - Sql | naXa stands with Ukraine | View Answer on Stackoverflow
Solution 7 - Sql | Bhavik Ambani | View Answer on Stackoverflow
Solution 8 - Sql | Ole Tange | View Answer on Stackoverflow
Solution 9 - Sql | codebykat | View Answer on Stackoverflow
Solution 10 - Sql | Secko | View Answer on Stackoverflow
Solution 11 - Sql | bobflux | View Answer on Stackoverflow
Solution 12 - Sql | Skippy le Grand Gourou | View Answer on Stackoverflow
Solution 13 - Sql | CM. | View Answer on Stackoverflow
Solution 14 - Sql | Mark Cupitt | View Answer on Stackoverflow
Solution 15 - Sql | Shamseer PC | View Answer on Stackoverflow
Solution 16 - Sql | denplis | View Answer on Stackoverflow