Remove duplicates in a Django query

Sql, Django

Sql Problem Overview


Is there a simple way to remove duplicates in the following basic query:

email_list = Emails.objects.order_by('email')

I tried using duplicate() but it was not working. What is the exact syntax for doing this query without duplicates?

Sql Solutions


Solution 1 - Sql

This query will not produce duplicates by itself, i.e. it will give you all the rows in the database, ordered by email.

However, I presume what you mean is that you have duplicate data within your database. Adding distinct() here won't help, because even if you have only one field, you also have an automatic id field - so the combination of id+email is not unique.
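To see why the automatic id defeats distinct(), the two cases can be reproduced in plain SQL. This is a minimal sketch using Python's built-in sqlite3 module rather than the Django ORM; the table name `emails` and the sample rows are made up for illustration, and the SQL only approximates what Django would generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO emails (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",)],
)

# DISTINCT over (id, email): every row is unique because every id is unique.
all_rows = conn.execute(
    "SELECT DISTINCT id, email FROM emails ORDER BY email"
).fetchall()

# DISTINCT over email alone: the repeated address collapses to one row.
unique_emails = [
    row[0]
    for row in conn.execute("SELECT DISTINCT email FROM emails ORDER BY email")
]

print(len(all_rows))   # 3
print(unique_emails)   # ['a@example.com', 'b@example.com']
```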

Assuming you only need one field, email, de-duplicated, you can do this:

email_list = Email.objects.values_list('email', flat=True).distinct()

However, you should really fix the root problem, and remove the duplicate data from your database.

For example, to delete duplicate Emails by the email field, keeping one row per address:

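for email in Email.objects.values_list('email', flat=True).distinct():
    # Keep the row with the lowest pk for each address and collect the ids of the rest.
    duplicate_ids = list(Email.objects.filter(email=email).order_by('pk').values_list('pk', flat=True)[1:])
    Email.objects.filter(pk__in=duplicate_ids).delete()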

Or books by name:

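for name in Book.objects.values_list('name', flat=True).distinct():
    # Keep the book with the lowest pk for each name and delete the rest.
    duplicate_ids = list(Book.objects.filter(name=name).order_by('pk').values_list('pk', flat=True)[1:])
    Book.objects.filter(pk__in=duplicate_ids).delete()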

Solution 2 - Sql

To check for duplicates, you can do a GROUP BY and HAVING in Django as below, using Django annotations.

from django.db.models import Count
from app.models import Email

duplicate_emails = Email.objects.values('email').annotate(email_count=Count('email')).filter(email_count__gt=1)

Now loop through the data above and delete every email except the first one (adjust this to your requirements).

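for data in duplicate_emails:
    email = data['email']
    # delete() cannot be called on a sliced queryset, so collect the extra pks first.
    extra_pks = list(Email.objects.filter(email=email).order_by('pk').values_list('pk', flat=True)[1:])
    Email.objects.filter(pk__in=extra_pks).delete()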
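The loop above runs separate queries for every duplicated address. The same clean-up can also be expressed as a single SQL statement. The following is a hedged sketch with Python's built-in sqlite3 module, assuming a table named `email` with columns `id` and `email` (hypothetical names; a real Django table is usually called `<app>_<model>`). It first runs the GROUP BY / HAVING query that finds duplicated addresses, then a DELETE that keeps the lowest id per address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO email (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("a@example.com",)],
)

# Roughly the question the annotate(...).filter(email_count__gt=1) queryset asks.
duplicated = conn.execute(
    "SELECT email, COUNT(email) AS email_count "
    "FROM email GROUP BY email HAVING COUNT(email) > 1"
).fetchall()

# Keep the lowest id for each address and delete every other row.
conn.execute(
    "DELETE FROM email WHERE id NOT IN (SELECT MIN(id) FROM email GROUP BY email)"
)
remaining = conn.execute("SELECT id, email FROM email ORDER BY id").fetchall()

print(duplicated)  # [('a@example.com', 3)]
print(remaining)   # [(1, 'a@example.com'), (2, 'b@example.com')]
```

Some databases restrict subqueries that read from the table being deleted from, so treat this as an illustration of the idea rather than a statement that runs unchanged everywhere.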

Solution 3 - Sql

You can chain .distinct() on the end of your queryset to filter duplicates. Check out: http://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct

Solution 4 - Sql

You may be able to use the distinct() function, depending on your model. If you only want to retrieve a single field from the model, you could do something like:

email_list = Emails.objects.values_list('email', flat=True).order_by('email').distinct()

which should give you an ordered list of emails.

Solution 5 - Sql

You can also use set()

email_list = set(Emails.objects.values_list('email', flat=True))
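Keep in mind that a set has no defined order, so any sorting from order_by() is lost, and the de-duplication happens in Python after all rows have been fetched. A small sketch, with a plain list standing in for the queryset result (the sample addresses are made up); dict.fromkeys() is shown as an alternative that keeps first-seen order.

```python
# A plain list standing in for the result of
# Emails.objects.order_by('email').values_list('email', flat=True)
emails = [
    "a@example.com",
    "a@example.com",
    "b@example.com",
    "b@example.com",
    "c@example.com",
]

# set() removes duplicates but does not guarantee any order.
unique_unordered = set(emails)

# dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(emails))

print(len(unique_unordered))  # 3
print(unique_in_order)        # ['a@example.com', 'b@example.com', 'c@example.com']
```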

Solution 6 - Sql

I used the following to actually remove the duplicate entries from the database. Hopefully this helps someone else.

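# Note: distinct() with field names is only supported on PostgreSQL.
adds = Address.objects.all()
d = adds.distinct('latitude', 'longitude')
for address in adds:
    # Delete every address that is not the row kept for its coordinates.
    if address not in d:
        address.delete()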
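Because distinct() with field names only works on PostgreSQL, the same idea can be done in Python on other backends by remembering which (latitude, longitude) pairs have already been seen. A minimal sketch: `find_duplicate_ids` is a hypothetical helper, and the tuples stand in for something like `Address.objects.order_by('pk').values_list('pk', 'latitude', 'longitude')`.

```python
def find_duplicate_ids(rows):
    """Return the ids of rows whose (latitude, longitude) pair was already seen."""
    seen = set()
    duplicate_ids = []
    for pk, latitude, longitude in rows:
        key = (latitude, longitude)
        if key in seen:
            # A row with these coordinates came earlier, so this one is a duplicate.
            duplicate_ids.append(pk)
        else:
            seen.add(key)
    return duplicate_ids


# Made-up sample data: (pk, latitude, longitude)
rows = [
    (1, 51.5, -0.12),
    (2, 48.85, 2.35),
    (3, 51.5, -0.12),
    (4, 48.85, 2.35),
    (5, 40.71, -74.0),
]
print(find_duplicate_ids(rows))  # [3, 4]
```

The returned ids could then be passed to a single `Address.objects.filter(pk__in=...).delete()` call.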

Solution 7 - Sql

Use a subquery on the same model together with queryset.annotate():

from django.db.models import Subquery, OuterRef

email_list = Emails.objects.filter(
    pk__in=Emails.objects.values('email').distinct().annotate(
        pk=Subquery(
            Emails.objects.filter(
                email=OuterRef("email")
            )
            .order_by("pk")
            .values("pk")[:1]
        )
    )
    .values_list("pk", flat=True)
)

This queryset produces the following SQL query:

 SELECT `email`.`id`,
        `email`.`title`,
        `email`.`body`,
       ...
       ...
  FROM `email`
 WHERE `email`.`id` IN (
        SELECT DISTINCT (
                SELECT U0.`id`
                  FROM `email` U0
                 WHERE U0.`email` = V0.`email`
                 ORDER BY U0.`id` ASC
                 LIMIT 1
               ) AS `pk`
         FROM `email` V0
 )

Cheat sheet

from django.db.models import Subquery, OuterRef

group_by_duplicate_col_queryset = Models.objects.filter(
    pk__in=Models.objects.values('duplicate_col').distinct().annotate(
        pk=Subquery(
            Models.objects.filter(
                duplicate_col=OuterRef('duplicate_col')
            )
            .order_by("pk")
            .values("pk")[:1]
        )
    )
    .values_list("pk", flat=True)
)

Solution 8 - Sql

You can use this raw query: your_model.objects.raw("select * from appname_Your_model group by column_name")
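Selecting * while grouping by a single non-key column is rejected by some databases (PostgreSQL, and MySQL with ONLY_FULL_GROUP_BY enabled), and where it is accepted the row returned for each group is not well defined. A more portable form picks one id per group explicitly. The sketch below uses Python's built-in sqlite3 module; the table `appname_email` and its columns are hypothetical stand-ins for your own model's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appname_email (id INTEGER PRIMARY KEY, email TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO appname_email (email, title) VALUES (?, ?)",
    [
        ("a@example.com", "first"),
        ("b@example.com", "second"),
        ("a@example.com", "third"),
    ],
)

# One row per address, with the surviving row chosen explicitly (lowest id).
rows = conn.execute(
    "SELECT * FROM appname_email "
    "WHERE id IN (SELECT MIN(id) FROM appname_email GROUP BY email) "
    "ORDER BY id"
).fetchall()

print(rows)  # [(1, 'a@example.com', 'first'), (2, 'b@example.com', 'second')]
```

The same SQL string could be passed to `your_model.objects.raw(...)`, since the selected columns include the primary key.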

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | David542 | View Question on Stackoverflow
Solution 1 - Sql | Daniel Roseman | View Answer on Stackoverflow
Solution 2 - Sql | Parag Tyagi | View Answer on Stackoverflow
Solution 3 - Sql | zeekay | View Answer on Stackoverflow
Solution 4 - Sql | Michael C. O'Connor | View Answer on Stackoverflow
Solution 5 - Sql | SuperNova | View Answer on Stackoverflow
Solution 6 - Sql | Chris Montanaro | View Answer on Stackoverflow
Solution 7 - Sql | rumbarum | View Answer on Stackoverflow
Solution 8 - Sql | Raj Kumar | View Answer on Stackoverflow