In a Django QuerySet, how to filter for "not exists" in a many-to-one relationship

DjangoDjango ModelsDjango Orm

Django Problem Overview


I have two models like this:

class User(models.Model):
    email = models.EmailField()

class Report(models.Model):
    user = models.ForeignKey(User)

In reality each model has more fields which are of no consequence to this question.

I want to filter all users who have an email which starts with 'a' and have no reports. There will be more .filter() and .exclude() criteria based on other fields.

I want to approach it like this:

users = User.objects.filter(email__like = 'a%')

users = users.filter(<other filters>)

users = ???

I would like ??? to filter out users who do not have reports associated with them. How would I do this? If this is not possible as I have presented it, what is an alternate approach?

Django Solutions


Solution 1 - Django

Note: this answer was written in 2013 for Django 1.5. See the other answers for better approaches that work with newer versions of Django

Use isnull.

users_without_reports = User.objects.filter(report__isnull=True)
users_with_reports = User.objects.filter(report__isnull=False).distinct()

When you use isnull=False, the distinct() is required to prevent duplicate results.

Solution 2 - Django

As of Django 3.0 you can now use expressions directly in a filter(), removing the unnecessary SQL clause:

User.objects.filter(
    ~Exists(Reports.objects.filter(user=OuterRef('pk'))),
    email__startswith='a'
)
SELECT user.pk, user.email
FROM user
WHERE NOT EXISTS (SELECT U0.pk FROM reports U0 WHERE U0.user = user.pk) AND email LIKE 'a%';

Docs:


For Django 1.11+ you can add EXISTS subqueries:

User.objects.annotate(
    no_reports=~Exists(Reports.objects.filter(user__eq=OuterRef('pk')))
).filter(
    email__startswith='a',
    no_reports=True
)

This generates SQL something like this:

SELECT
    user.pk,
    user.email,
    NOT EXISTS (SELECT U0.pk FROM reports U0 WHERE U0.user = user.pk) AS no_reports
FROM user
WHERE email LIKE 'a%' AND NOT EXISTS (SELECT U0.pk FROM reports U0 WHERE U0.user = user.pk);

A NOT EXISTS clause is almost always the most efficient way to do a "not exists" filter.


Solution 3 - Django

The only way to get native SQL EXISTS/NOT EXISTS without extra queries or JOINs is to add it as raw SQL in the .extra() clause:

users = users.extra(where=[
    """NOT EXISTS(SELECT 1 FROM {reports} 
                  WHERE user_id={users}.id)
    """.format(reports=Report._meta.db_table, users=User._meta.db_table)
])

In fact, it's a pretty obvious and efficient solution and I sometimes wonder why it wasn't built in to Django as a lookup. Also it allows to refine the subquery to find e.g. only users with[out] a report during last week, or with[out] an unanswered/unviewed report.

Solution 4 - Django

In addition to @OrangeDog answer. Since Django 3.0 you can use Exists subquery to directly filter a queryset:

User.objects.filter(
    ~Exists(Reports.objects.filter(user__eq=OuterRef('pk'))
)

Solution 5 - Django

Alasdair's answer is helpful, but I don't like using distinct(). It can sometimes be useful, but it's usually a code smell telling you that you messed up your joins.

Luckily, Django's queryset lets you filter on subqueries. With Django 3.0, you can also use an exists clause.

Here are a few ways to run the queries from your question:

# Tested with Django 3.0 and Python 3.6
import logging
import sys

import django
from django.apps import apps
from django.apps.config import AppConfig
from django.conf import settings
from django.db import connections, models, DEFAULT_DB_ALIAS
from django.db.models import Exists, OuterRef
from django.db.models.base import ModelBase

NAME = 'udjango'
DB_FILE = NAME + '.db'


def main():
    setup()

    class User(models.Model):
        email = models.EmailField()

        def __repr__(self):
            return 'User({!r})'.format(self.email)

    class Report(models.Model):
        user = models.ForeignKey(User, on_delete=models.CASCADE)

    syncdb(User)
    syncdb(Report)

    anne = User.objects.create(email='[email protected]')
    User.objects.create(email='[email protected]')
    alice = User.objects.create(email='[email protected]')
    User.objects.create(email='[email protected]')

    Report.objects.create(user=anne)
    Report.objects.create(user=alice)
    Report.objects.create(user=alice)

    logging.info('users without reports')
    logging.info(User.objects.filter(report__isnull=True, email__startswith='a'))

    logging.info('users with reports (allows duplicates)')
    logging.info(User.objects.filter(report__isnull=False, email__startswith='a'))

    logging.info('users with reports (no duplicates)')
    logging.info(User.objects.exclude(report__isnull=True).filter(email__startswith='a'))

    logging.info('users with reports (no duplicates, simpler SQL)')
    report_user_ids = Report.objects.values('user_id')
    logging.info(User.objects.filter(id__in=report_user_ids, email__startswith='a'))

    logging.info('users with reports (EXISTS clause, Django 3.0)')
    logging.info(User.objects.filter(
        Exists(Report.objects.filter(user_id=OuterRef('id'))),
        email__startswith='a'))

    logging.info('Done.')


def setup():
    with open(DB_FILE, 'w'):
        pass  # wipe the database
    settings.configure(
        DEBUG=True,
        DATABASES={
            DEFAULT_DB_ALIAS: {
                'ENGINE': 'django.db.backends.sqlite3',
                'NAME': DB_FILE}},
        LOGGING={'version': 1,
                 'disable_existing_loggers': False,
                 'formatters': {
                    'debug': {
                        'format': '%(asctime)s[%(levelname)s]'
                                  '%(name)s.%(funcName)s(): %(message)s',
                        'datefmt': '%Y-%m-%d %H:%M:%S'}},
                 'handlers': {
                    'console': {
                        'level': 'DEBUG',
                        'class': 'logging.StreamHandler',
                        'formatter': 'debug'}},
                 'root': {
                    'handlers': ['console'],
                    'level': 'INFO'},
                 'loggers': {
                    "django.db": {"level": "DEBUG"}}})
    app_config = AppConfig(NAME, sys.modules['__main__'])
    apps.populate([app_config])
    django.setup()
    original_new_func = ModelBase.__new__

    @staticmethod
    def patched_new(cls, name, bases, attrs):
        if 'Meta' not in attrs:
            class Meta:
                app_label = NAME
            attrs['Meta'] = Meta
        return original_new_func(cls, name, bases, attrs)
    ModelBase.__new__ = patched_new


def syncdb(model):
    """ Standard syncdb expects models to be in reliable locations.

    Based on https://github.com/django/django/blob/1.9.3
    /django/core/management/commands/migrate.py#L285
    """
    connection = connections[DEFAULT_DB_ALIAS]
    with connection.schema_editor() as editor:
        editor.create_model(model)


main()

If you put that into a Python file and run it, you should see something like this:

2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys = OFF; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) BEGIN; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.schema.execute(): CREATE TABLE "udjango_user" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "email" varchar(254) NOT NULL); (params None)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) CREATE TABLE "udjango_user" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "email" varchar(254) NOT NULL); args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_key_check; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys = ON; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys = OFF; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) BEGIN; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.schema.execute(): CREATE TABLE "udjango_report" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "user_id" integer NOT NULL REFERENCES "udjango_user" ("id") DEFERRABLE INITIALLY DEFERRED); (params None)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) CREATE TABLE "udjango_report" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT, "user_id" integer NOT NULL REFERENCES "udjango_user" ("id") DEFERRABLE INITIALLY DEFERRED); args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_key_check; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.schema.execute(): CREATE INDEX "udjango_report_user_id_60bc619c" ON "udjango_report" ("user_id"); (params ())
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) CREATE INDEX "udjango_report_user_id_60bc619c" ON "udjango_report" ("user_id"); args=()
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) PRAGMA foreign_keys = ON; args=None
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.017) INSERT INTO "udjango_user" ("email") VALUES ('[email protected]'); args=['[email protected]']
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.023) INSERT INTO "udjango_user" ("email") VALUES ('[email protected]'); args=['[email protected]']
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.022) INSERT INTO "udjango_user" ("email") VALUES ('[email protected]'); args=['[email protected]']
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.022) INSERT INTO "udjango_user" ("email") VALUES ('[email protected]'); args=['[email protected]']
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.029) INSERT INTO "udjango_report" ("user_id") VALUES (1); args=[1]
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.033) INSERT INTO "udjango_report" ("user_id") VALUES (3); args=[3]
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.033) INSERT INTO "udjango_report" ("user_id") VALUES (3); args=[3]
2019-12-06 11:45:17[INFO]root.main(): users without reports
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) SELECT "udjango_user"."id", "udjango_user"."email" FROM "udjango_user" LEFT OUTER JOIN "udjango_report" ON ("udjango_user"."id" = "udjango_report"."user_id") WHERE ("udjango_user"."email" LIKE 'a%' ESCAPE '\' AND "udjango_report"."id" IS NULL) LIMIT 21; args=('a%',)
2019-12-06 11:45:17[INFO]root.main(): <QuerySet [User('[email protected]')]>
2019-12-06 11:45:17[INFO]root.main(): users with reports (allows duplicates)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) SELECT "udjango_user"."id", "udjango_user"."email" FROM "udjango_user" INNER JOIN "udjango_report" ON ("udjango_user"."id" = "udjango_report"."user_id") WHERE ("udjango_user"."email" LIKE 'a%' ESCAPE '\' AND "udjango_report"."id" IS NOT NULL) LIMIT 21; args=('a%',)
2019-12-06 11:45:17[INFO]root.main(): <QuerySet [User('[email protected]'), User('[email protected]'), User('[email protected]')]>
2019-12-06 11:45:17[INFO]root.main(): users with reports (no duplicates)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) SELECT "udjango_user"."id", "udjango_user"."email" FROM "udjango_user" WHERE (NOT ("udjango_user"."id" IN (SELECT U0."id" FROM "udjango_user" U0 LEFT OUTER JOIN "udjango_report" U1 ON (U0."id" = U1."user_id") WHERE U1."id" IS NULL)) AND "udjango_user"."email" LIKE 'a%' ESCAPE '\') LIMIT 21; args=('a%',)
2019-12-06 11:45:17[INFO]root.main(): <QuerySet [User('[email protected]'), User('[email protected]')]>
2019-12-06 11:45:17[INFO]root.main(): users with reports (no duplicates, simpler SQL)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) SELECT "udjango_user"."id", "udjango_user"."email" FROM "udjango_user" WHERE ("udjango_user"."email" LIKE 'a%' ESCAPE '\' AND "udjango_user"."id" IN (SELECT U0."user_id" FROM "udjango_report" U0)) LIMIT 21; args=('a%',)
2019-12-06 11:45:17[INFO]root.main(): <QuerySet [User('[email protected]'), User('[email protected]')]>
2019-12-06 11:45:17[INFO]root.main(): users with reports (EXISTS clause, Django 3.0)
2019-12-06 11:45:17[DEBUG]django.db.backends.debug_sql(): (0.000) SELECT "udjango_user"."id", "udjango_user"."email" FROM "udjango_user" WHERE (EXISTS(SELECT U0."id", U0."user_id" FROM "udjango_report" U0 WHERE U0."user_id" = "udjango_user"."id") AND "udjango_user"."email" LIKE 'a%' ESCAPE '\') LIMIT 21; args=('a%',)
2019-12-06 11:45:17[INFO]root.main(): <QuerySet [User('[email protected]'), User('[email protected]')]>
2019-12-06 11:45:17[INFO]root.main(): Done.

You can see that the final query uses all inner joins.

Solution 6 - Django

To filter users who do not have reports associated with them try this:

users = User.objects.exclude(id__in=[elem.user.id for elem in Report.objects.all()])

Solution 7 - Django

The best option for finding rows where there is a joining row:
Report.objects.filter(user__isnull=False).distinct()

This uses an INNER JOIN (and then redundantly checks User.id is not null).

The best option for finding rows where there is not a joining row:
Report.objects.filter(user__isnull=True)

This makes LEFT OUTER JOIN, then checks User.id is not null.

Queries based on joins will be quicker than subqueries, so this is quicker than newly available options such as in Django >= 3, for finding rows without a joining row:

Report.objects.filter(~Exists(User.objects.filter(report=OuterRef('pk'))))

This creates a WHERE NOT EXISTS (SELECT .. FROM User..) so involves a potentially large intermediate result set (thanks @Tomasz Gandor).

This for Django <3, where filter() can't be passed subqueries, also uses a subquery so is slower:

Report.objects.annotate(
    no_users=~Exists(User.objects.filter(report=OuterRef('pk')))
).filter(no_users=True)

This can be combined with subqueries. In this example, a Textbook has a number of Versions (ie, version has textbook_id), and a version has a number of Pages (ie, page has version_id). The subquery gets the latest version of each textbook that has pages associated:

subquery = (
    Version.objects
        .filter(
            # OuterRef joins to Version.textbook in outer query below
            textbook=OuterRef('textbook'), 
            # excludes rows with no joined Page records
            page__isnull=False)
        # ordered so [:1] below gets highest (ie, latest) version number
        .order_by('-number').distinct()
)
# Only the Version.ids of the latest versions that have pages returned by the subquery
books = Version.objects.filter(pk=Subquery(subquery.values('pk')[:1])).distinct()

To return rows that have a join to one or both of two tables, use Q objects (Page and TextMarkup both have nullable foreign keys joining to File):

from django.db.models import Q

File.objects.filter(Q(page__isnull=False) | Q(textmarkup__isnull=False).distinct()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionKrystian CybulskiView Question on Stackoverflow
Solution 1 - DjangoAlasdairView Answer on Stackoverflow
Solution 2 - DjangoOrangeDogView Answer on Stackoverflow
Solution 3 - DjangoYuri ShatrovView Answer on Stackoverflow
Solution 4 - DjangoneverwalkalonerView Answer on Stackoverflow
Solution 5 - DjangoDon KirkbyView Answer on Stackoverflow
Solution 6 - DjangoLukasz KoziaraView Answer on Stackoverflow
Solution 7 - DjangoChrisView Answer on Stackoverflow