Delete duplicate rows from a BigQuery table

Tags: Distinct, Google BigQuery

Distinct Problem Overview


I have a table with >1M rows of data and 20+ columns.

Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).

If possible, I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise, I could create a new table (tableXfinal) with the same schema but without the duplicates.

I am not proficient in SQL or any other programming language, so please excuse my ignorance. This is the delete query I attempted:

DELETE FROM Accidents.CleanedFilledCombined
WHERE Fixed_Accident_Index IN (
  SELECT Fixed_Accident_Index
  FROM Accidents.CleanedFilledCombined
  GROUP BY Fixed_Accident_Index
  HAVING COUNT(Fixed_Accident_Index) > 1
);

Distinct Solutions


Solution 1 - Distinct

You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

Here is a query that should work:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
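
If you write the result back over the original table, the helper row_number column becomes part of the schema. In standard SQL you can drop it on the way out; here is a minimal sketch, assuming you are happy to recreate the table in place (note that CREATE OR REPLACE TABLE rebuilds the table, so any partitioning or clustering settings would need to be re-specified):

#standardSQL
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined AS
SELECT * EXCEPT(row_number)  -- drop the helper column from the final schema
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1;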

Solution 2 - Distinct

UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see Solution 5 below.


An alternative to Jordan's answer - this one scales better when there are many duplicates:

#standardSQL
SELECT event.* FROM (
  SELECT ARRAY_AGG(
    t ORDER BY t.created_at DESC LIMIT 1
  )[OFFSET(0)]  event
  FROM `githubarchive.month.201706` t 
  # GROUP BY the id you are de-duplicating by
  GROUP BY actor.id
)

Or a shorter version (takes any row, instead of the newest one):

SELECT k.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
  FROM `fh-bigquery.reddit_comments.2017_01` x 
  GROUP BY id
)

To de-duplicate rows on an existing table:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
  # (to test with sample data, swap the FROM below for: SELECT id FROM UNNEST([1,1,1,2,2]) id)
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k 
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)

Solution 3 - Distinct

Not sure why nobody has mentioned a DISTINCT query.

Here is a way to clean up duplicate rows:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
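
To check in advance how much a DISTINCT rewrite would remove, a quick sketch (using the same project.dataset.table placeholder) is to compare the total row count with the distinct row count:

-- compare total vs. fully-distinct row counts before rewriting the table
SELECT
  (SELECT COUNT(*) FROM project.dataset.table) AS total_rows,
  (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM project.dataset.table)) AS distinct_rows

Keep in mind this approach only de-duplicates rows that are identical across every column.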

Solution 4 - Distinct

If your schema doesn't have any record fields, the variation of Jordan's answer below will work well enough, whether you write over the same table or a new one, etc.

SELECT <list of original fields>
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1

In the more generic case - with a complex schema that has records/nested fields, etc. - the above approach can be a challenge.

I would propose trying the tabledata.insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row. In that case, duplicate rows will be eliminated by BigQuery (its streaming de-duplication is best-effort).

Of course, this will involve some client-side coding, so it might not be relevant for this particular question. I haven't tried this approach myself either, but I feel it might be interesting to try :o)

Solution 5 - Distinct

If you have a large partitioned table and only have duplicates within a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range. Because the ON condition is FALSE, every destination row inside the range counts as "not matched by source" and is deleted, while every de-duplicated source row counts as "not matched" and is inserted back.

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table:
-- -------------------------------------------
-- de-duplicate rows within a given partition range, using surrogate_key as the unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in the partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

Solution 6 - Distinct

A simpler answer, without a subselect:

SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1

The WHERE TRUE is necessary because QUALIFY requires the query to also have a WHERE, GROUP BY, or HAVING clause.
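
A variant of the same idea (a sketch under the same assumptions) puts the window function directly in the QUALIFY clause, so the helper column never appears in the output:

SELECT *
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) = 1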

Solution 7 - Distinct

Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:

CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT 
  Fixed_Accident_Index, 
  ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;

To be safe, make sure you back up the original table before you run this ^^
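
A minimal backup sketch (the _backup table name is just an example):

-- keep a copy of the original table before rewriting it
CREATE TABLE Accidents.CleanedFilledCombined_backup AS
SELECT * FROM Accidents.CleanedFilledCombined;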

I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.

Solution 8 - Distinct

  1. Update the BigQuery schema with a new column bq_uuid, NULLABLE and of type STRING.

  2. Create duplicate rows, for example by running the same insert command 5 times:

insert into `beginner-290513.917834811114.messages` (id, type, flow, updated_at)
values (19999, "hello", "inbound", '2021-06-08T12:09:03.693646')

  3. Check whether duplicate entries exist:

select * from `beginner-290513.917834811114.messages` where id = 19999

  4. Use the GENERATE_UUID function to assign a uuid to each message:

UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
WHERE id > 0

  5. Clean up the duplicate entries:

DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN (
  SELECT bq_uuid
  FROM (
    SELECT bq_uuid,
           ROW_NUMBER() OVER (PARTITION BY updated_at ORDER BY bq_uuid) AS row_num
    FROM `beginner-290513.917834811114.messages`
  ) t
  WHERE t.row_num > 1
);

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: TheGoat (View Question on Stackoverflow)
Solution 1 - Distinct: Jordan Tigani (View Answer on Stackoverflow)
Solution 2 - Distinct: Felipe Hoffa (View Answer on Stackoverflow)
Solution 3 - Distinct: Semra (View Answer on Stackoverflow)
Solution 4 - Distinct: Mikhail Berlyant (View Answer on Stackoverflow)
Solution 5 - Distinct: Hui Zheng (View Answer on Stackoverflow)
Solution 6 - Distinct: elauser (View Answer on Stackoverflow)
Solution 7 - Distinct: Igor-S (View Answer on Stackoverflow)
Solution 8 - Distinct: Akhilesh Negi (View Answer on Stackoverflow)