Fastest way to remove duplicate documents in mongodb

Mongodb, Performance, Optimization, Duplicates

Mongodb Problem Overview


I have approximately 1.7M documents in MongoDB (10M+ in the future). Some of them are duplicate entries that I do not want. The document structure looks something like this:

{
    _id: 14124412,
    nodes: [
        12345,
        54321
        ],
    name: "Some beauty"
}

A document is a duplicate if it shares at least one node with another document that has the same name. What is the fastest way to remove the duplicates?

Mongodb Solutions


Solution 1 - Mongodb

The dropDups: true option is not available in 3.0.

This solution uses the aggregation framework to collect the duplicates and then remove them in one go.

It may be somewhat slower than index-level changes, but it gives you control over which duplicate documents are removed.

a. Remove all duplicate documents in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: { 
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: { 
    _id: { name: "$name"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }},
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of failing on the memory limit
)               // You can display result until this and check duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )
})

// Optional: print every "_id" that is about to be deleted
printjson(duplicates);     

// Remove all duplicates in one go    
db.collectionName.remove({_id:{$in:duplicates}})  

b. Alternatively, delete the duplicates one group at a time.

db.collectionName.aggregate([
  // Optional filter; you can remove the "$match" stage if you do not need it
  { $match: { 
    "source_references.key": { "$ne": '' }  // dotted field names must be quoted
  }},
  { $group: { 
    _id: { key: "$source_references.key"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets large stages spill to disk instead of failing on the memory limit
)               // You can stop here to display the result and check the duplicates 
.forEach(function(doc) {
    doc.dups.shift();      // Skip the first element so that it is kept
    db.collectionName.remove({_id : {$in: doc.dups }});  // Delete remaining duplicates
})

Solution 2 - Mongodb

Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true}) 

As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.

UPDATE

This solution is only valid through MongoDB 2.x, as the dropDups option is no longer available in 3.0.
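
On versions without dropDups, the same name + nodes rule could be approximated with an aggregation instead of an index. The following is only a minimal sketch under stated assumptions: `buildSharedNodePipeline` and `collectIdsToDelete` are hypothetical helper names (not part of MongoDB), and the `name` and `nodes` fields are taken from the question. It groups by (name, node), keeps the document with the smallest `_id` in each group, and marks the rest for deletion. In chains (A shares a node with B, and B shares another with C) it can remove more documents than strictly necessary, so check the result before deleting.

```javascript
// Hypothetical helper: for every (name, node) pair that occurs in more than
// one document, list the _ids involved and the smallest _id to keep.
function buildSharedNodePipeline() {
  return [
    { $unwind: "$nodes" },
    { $group: {
        _id: { name: "$name", node: "$nodes" },
        keep: { $min: "$_id" },          // deterministic survivor per group
        ids: { $addToSet: "$_id" }       // a set, so repeated nodes in one doc count once
    } },
    { $match: { "ids.1": { $exists: true } } }  // at least two distinct documents
  ];
}

// Hypothetical helper: turn the aggregation output into a unique list of
// _ids to delete (everything except the "keep" id of each group).
function collectIdsToDelete(groups) {
  const toDelete = new Map();            // keyed by string form to de-duplicate
  for (const g of groups) {
    for (const id of g.ids) {
      if (String(id) !== String(g.keep)) toDelete.set(String(id), id);
    }
  }
  return Array.from(toDelete.values());
}

// Possible usage in the shell (collection name is an assumption):
// var groups = db.test.aggregate(buildSharedNodePipeline(), { allowDiskUse: true }).toArray();
// db.test.deleteMany({ _id: { $in: collectIdsToDelete(groups) } });
```

Using `$min` for the survivor avoids relying on the order of `$addToSet`, which is unspecified.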

Solution 3 - Mongodb

  1. Create a dump of the collection with mongodump.
  2. Clear the collection.
  3. Add a unique index.
  4. Restore the collection with mongorestore.

Solution 4 - Mongodb

I found this solution, which works with MongoDB 3.4. I'll assume the field with duplicates is called fieldX.

db.collection.aggregate([
{
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "doc" : {"$first": "$$ROOT"}}
},
{
    $replaceRoot: { "newRoot": "$doc"}
}
],
{allowDiskUse:true})

Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.

It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).

The next stage groups documents by fieldX and keeps only the $first document in each group, using $$ROOT to capture the whole document. Finally, $replaceRoot replaces each group with the document captured by $first and $$ROOT.

I had to add allowDiskUse because my collection is large.

You can add these stages after any number of other pipeline stages. Although the documentation for $first mentions a sort stage prior to using $first, it worked for me without one.
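
Without a preceding $sort, which document $first picks in each group is not defined, so the survivor may differ between runs. If that matters, a sort stage can be placed before $group. Below is a minimal sketch; `buildDedupPipeline` is a hypothetical helper and `createdAt` in the usage comment is an assumed field name standing in for whatever ordering you care about.

```javascript
// Hypothetical helper: build the dedup pipeline for `field`, optionally
// sorting first so that $first picks a well-defined document.
function buildDedupPipeline(field, sortSpec) {
  const pipeline = [
    { $match: { [field]: { $nin: [null] } } }   // skip documents without the field
  ];
  if (sortSpec) {
    pipeline.push({ $sort: sortSpec });         // defines which duplicate survives
  }
  pipeline.push(
    { $group: { _id: "$" + field, doc: { $first: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$doc" } }
  );
  return pipeline;
}

// Possible usage in the shell (names are assumptions):
// db.collection.aggregate(buildDedupPipeline("fieldX", { createdAt: 1 }), { allowDiskUse: true })
```

With `{ createdAt: 1 }` the oldest document of each group is kept; use `-1` to keep the newest instead.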

You can save the results to a new collection by adding an $out stage...

Alternatively, if you are only interested in a few fields (e.g. field1 and field2) rather than the whole document, select them in the $group stage and omit $replaceRoot:

db.collection.aggregate([
{
    // only match documents that have this field
    $match: {"fieldX": {$nin:[null]}}  
},
{
    $group: { "_id": "$fieldX", "field1": {"$first": "$$ROOT.field1"}, "field2": { "$first": "$field2" }}
}
],
{allowDiskUse:true})

Solution 5 - Mongodb

My DB had millions of duplicate records. @somnath's answer did not work as is, so I am writing up the solution that worked for me, for people looking to delete millions of duplicate records.

/** Create an array to store the ids of all duplicate records */
var duplicates = [];

/** Start Aggregation pipeline*/
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add index for filter keys*/
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort so that the document you want to retain comes first in each group */
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
    temparray = duplicates.slice(i,i+chunk);
    db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
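
The chunked delete above can also be expressed by building the list of operations up front, which makes the chunking easy to check in isolation. This is only a sketch: `buildDeleteOps` is a hypothetical helper, not part of MongoDB, and whether one bulkWrite call with several operations is faster than several calls depends on your deployment.

```javascript
// Hypothetical helper: split `ids` into chunks and wrap each chunk in a
// deleteMany operation suitable for bulkWrite.
function buildDeleteOps(ids, chunkSize) {
  const ops = [];
  for (let i = 0; i < ids.length; i += chunkSize) {
    ops.push({
      deleteMany: { filter: { _id: { $in: ids.slice(i, i + chunkSize) } } }
    });
  }
  return ops;
}

// Possible usage in the shell, reusing the `duplicates` array from above:
// var ops = buildDeleteOps(duplicates, 100000);
// if (ops.length > 0) db.collection.bulkWrite(ops);   // bulkWrite rejects an empty list
```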

Solution 6 - Mongodb

Here is a slightly more 'manual' way of doing it:

Essentially, first get a list of all the distinct values of the key you are interested in.

Then run a search for each of those values and delete every document after the first one returned.

  db.collection.distinct("key").forEach((num)=>{
    var i = 0;
    db.collection.find({key: num}).forEach((doc)=>{
      // Keep the first match; delete each later one by its own _id
      if (i)   db.collection.remove({_id: doc._id})
      i++
    })
  });

Solution 7 - Mongodb

Tips to speed things up when only a small portion of your documents are duplicated:

  1. You need an index on the field used to detect duplicates.
  2. $group does not use the index, but it can take advantage of a preceding $sort, and $sort does use the index, so put a $sort stage at the beginning.
  3. Do an in-place delete_many() instead of using $out to write a new collection; this saves a lot of IO time and disk space.

If you use pymongo, you can do:

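import itertools

import pymongo
from pymongo import IndexModel

# `col` is assumed to be an existing pymongo Collection
index_uuid = IndexModel(
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])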
pipeline = [
    {"$sort": {"uuid":1}},
    {
        "$group": {
            "_id": "$uuid",
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1}
        }
    },
    {
        "$match": {"count": {"$gt": 1}}
    },
]
it_cursor = col.aggregate(
    pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id":{"$in": dups}})

Performance

I tested it on a database containing 30M documents, 1TB in size.

  • Without the index/sort, it takes more than an hour just to get the cursor (I did not have the patience to wait for it).
  • With the index/sort, but using $out to write to a new collection: this is safer if your filesystem does not support snapshots, but it requires a lot of disk space and takes more than 40 minutes to finish even on SSDs. It will be much slower on an HDD RAID.
  • With the index/sort and an in-place delete_many, it takes around 5 minutes in total.

Solution 8 - Mongodb

The following Mongo aggregation pipeline does the deduplication and outputs it back to the same or different collection.

collection.aggregate([
  { $group: {
    _id: '$field_to_dedup',
    doc: { $first: '$$ROOT' }
  } },
  { $replaceRoot: {
    newRoot: '$doc'
  } },
  { $out: 'collection' }
], { allowDiskUse: true })

Solution 9 - Mongodb

The following method merges documents with the same name while only keeping the unique nodes without duplicating them.

I found the $out operator to be a simple way to do this. I unwind the array and then group by name, adding the nodes to a set. The $out operator persists the aggregation result. If you give it the name of the source collection, it replaces that collection with the new data; if the name does not exist, it creates a new collection.

Hope this helps.

allowDiskUse may have to be added to the pipeline.

db.collectionName.aggregate([
  {
    $unwind:{path:"$nodes"},
  },
  {
    $group:{
      _id:"$name",
      nodes:{
        $addToSet:"$nodes"
      }
    }
  },
  {
    $project:{
      _id:0,
      name:"$_id",
      nodes:1
    }
  },
  {
    $out:"collectionNameWithoutDuplicates"
  }
])
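
To make the effect of the $unwind and $addToSet stages concrete, here is a small in-memory sketch in plain JavaScript that performs the same kind of merge on an array of documents. It only illustrates the shape of the result and is not a replacement for the pipeline; `mergeByName` is a hypothetical helper and the sample documents are made up. As with the pipeline, the original `_id` values are not carried over, and documents without any nodes are dropped.

```javascript
// Hypothetical helper: merge documents that share a name, collecting the
// union of their nodes (mirrors $unwind + $group with $addToSet).
function mergeByName(docs) {
  const byName = new Map();
  for (const doc of docs) {
    for (const node of doc.nodes || []) {
      if (!byName.has(doc.name)) byName.set(doc.name, new Set());
      byName.get(doc.name).add(node);   // a Set ignores repeated nodes
    }
  }
  return Array.from(byName, ([name, nodes]) => ({ name, nodes: Array.from(nodes) }));
}

const merged = mergeByName([
  { _id: 1, name: "Some beauty", nodes: [12345, 54321] },
  { _id: 2, name: "Some beauty", nodes: [54321, 99999] },
  { _id: 3, name: "Other", nodes: [1] }
]);
console.log(JSON.stringify(merged));
// prints [{"name":"Some beauty","nodes":[12345,54321,99999]},{"name":"Other","nodes":[1]}]
```

Unlike this sketch, the pipeline does not guarantee any particular order of the merged nodes.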

Solution 10 - Mongodb

Using pymongo this should work.

Add the fields that need to be unique for the collection in unique_field

unique_field = {"field1":"$field1","field2":"$field2"}

cursor = DB.COL.aggregate([{"$group":{"_id":unique_field, "dups":{"$push":"$uuid"}, "count": {"$sum": 1}}},{"$match":{"count": {"$gt": 1}}},{"$group":{"_id":None,"dups":{"$addToSet":{"$arrayElemAt":["$dups",1]}}}}],allowDiskUse=True)

Slice the dups array depending on the number of duplicates (here I had only one extra duplicate for each group).

items = list(cursor)
removeIds = items[0]['dups']
DB.COL.delete_many({"uuid":{"$in":removeIds}})

Solution 11 - Mongodb

I don't know whether this answers the main question, but it may be useful to others.

1. Query the duplicate row using the findOne() method and store it as an object.

const User = db.User.findOne({_id:"duplicateid"});

2. Execute the deleteMany() method to remove all the rows with the id "duplicateid".

db.User.deleteMany({_id:"duplicateid"});

3. Insert the values stored in the User object.

db.User.insertOne(User);

Easy and fast!!!!

Solution 12 - Mongodb

First, find all the duplicates, then remove them from the DB. Here we group on the id field, keep the first document of each group, and remove the rest.

db.collection.aggregate([
    { "$group": { "_id": "$id", "ids": { "$push": "$_id" }, "count": { "$sum": 1 } } },
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
    { "$sort": { "count": -1 } }
]).then(data => {
    // Keep the first _id of each group; collect the rest for deletion
    var dr = [];
    data.forEach(d => { dr = dr.concat(d.ids.slice(1)); });
    console.log("duplicate Records:: ", dr);
    db.collection.deleteMany({ _id: { $in: dr } }).then(removedD => {
        console.log("Removed duplicate Data:: ", removedD);
    })
})

Solution 13 - Mongodb

  1. The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve the id of one arbitrary document from the duplicate records in the collection.

  2. Delete all the duplicate records in the collection other than the one whose id was retrieved with findOne.

You can do something like this if you are doing it with pymongo.

# `db`, `collection` and `_logger` are assumed to be defined elsewhere
def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)

            try:
                # Pick one document to keep among the duplicates
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])

                # Delete every other document that matches the same keys
                db.collection.delete_many({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})

            except Exception as ex:
                _logger.error(" Error when retaining the record :%s Exception: %s", record, str(ex))

    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))


def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group' : {'_id': "$fieldX"}}])

From the shell:

  1. Replace find_one with findOne.
  2. Replace delete_many with deleteMany; the same filter should work.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | ewooycom | View Question on Stackoverflow
Solution 1 - Mongodb | Somnath Muluk | View Answer on Stackoverflow
Solution 2 - Mongodb | JohnnyHK | View Answer on Stackoverflow
Solution 3 - Mongodb | dhythhsba | View Answer on Stackoverflow
Solution 4 - Mongodb | Ali Abul Hawa | View Answer on Stackoverflow
Solution 5 - Mongodb | Mayank Patel | View Answer on Stackoverflow
Solution 6 - Mongodb | Fernando | View Answer on Stackoverflow
Solution 7 - Mongodb | Wang | View Answer on Stackoverflow
Solution 8 - Mongodb | Mihailoff | View Answer on Stackoverflow
Solution 9 - Mongodb | sanair96 | View Answer on Stackoverflow
Solution 10 - Mongodb | Renny | View Answer on Stackoverflow
Solution 11 - Mongodb | Rakshith HR | View Answer on Stackoverflow
Solution 12 - Mongodb | Kundan Sharma | View Answer on Stackoverflow
Solution 13 - Mongodb | amateur | View Answer on Stackoverflow