State of the ORM

Review of the past and future of django.db.models

by Simon Charette

Simon Charette

  • Montréal, Canada
  • Django contributor for ~10 years
    • Mostly to the ORM and migrations
  • Staff Eng. at Zapier

📧 charettes@gmail.com
🐙 github.com/charettes

  • No Code automation tool
  • 5000+ integrations
  • Multiple Python & Django services
  • Extensive usage of the ORM against MySQL and PostgreSQL

What is the ORM?

  • Object–relational mapping
  • Definition of model relations mapping to database tables
  • Allow for retrieval and update of data
    • QuerySet.filter(), .all(), .get()
  • Abstraction layer between Python and databases
    • Speaks SQL
    • Maps table rows to Python objects

Evolution or the ORM

  • Review of some of the most influential changes in the past 10 years
  • Technical dive into issues that remains unsolved, lacking
  • What lies ahead and how to get there

Pick of most influential features

  • A bit of Git-Trac archeology from 1.4 to 4.1
  • Perceived momentum to contributions
  • Hard to imagine the ORM without them nowadays

prefetch_related multi-valued relationship fetching

Django 1.4 (2012)
  • Refactor of relation caching in 1.5
  • Proper interactions with select_related
  • Enhanced with Prefetch, to_attr in 1.7
  • prefetch_related_objects made public in 1.10
  • Slicing support should land in Django 4.2 🤞
  • Still lacking support for QuerySet.values

transaction.atomic improved transaction management

Django 1.6 (2013)
  • Large and complex refactor
  • Made transaction handling understandable
  • Saner public API transaction.atomic thanks to @Xof/xact
  • Default to auto-commit
  • transaction.on_commit hook

Lookup, Transform, and Expression support

Django 1.7, 1.8 (2014-15)
  • Liberalized access to the ORM
  • Formalized APIs resolve_expression, as_sql
  • Most of usages of extra were made obsolete

Subquery, Exists, and FilteredExpression

Django 1.11, 2.0 (2017)

  • Push the concept of expressions further with OuterRef
  • More control into JOIN and nested queries generation
  • Made even more usages of extra usage obsoletes
  • Legitimate performance improvements
  • Necessary ways to circumvent long standing ORM limitations

Model.Meta.constraints and friends

Django 2.2-4.1 (2019-22)

  • Functional indexes in 3.2
  • Functional unique constraints in 4.0
  • Constraint.validate in 4.1
  • Eventual deprecation of unique_together
  • Possibilty to revisit model validation of constraints

Long standing issues / missing features

  • Multi-valued relationship handling (#2361)
  • Relation fetching strategies (#28586)
  • Reusable relationship definitions

Multi-valued relationship handling

class Author(Model):
    name = CharField(max_length=100)
class Book(Model):
    title = CharField(max_length=100)
    author = ForeignKey(
        Author, CASCADE, related_name="books"
    )
    coauthor = ForeignKey(
        Author, CASCADE, related_name="cobooks", null=true
    )

Multi-valued relationship handling

>>> Author.objects.filter(books__title__icontains="ring")
[
    <Author: "J. R. R. Tolkien">,
    <Author: "J. R. R. Tolkien">, ...
]
SELECT author.*
FROM author
INNER JOIN book ON (book.author_id = author.id)
WHERE book.title ILIKE '%ring%'

Multi-valued relationship handling

WHERE EXISTS(
    SELECT 1 FROM book
    WHERE book.author_id = author.id AND book.title ILIKE '%ring%'
)

instead of

INNER JOIN book ON (
    book.author_id = author.id
    AND book.title ILIKE '%ring%'
)

Multi-valued relationship handling

Desirable for aggregation purposes

>>> Author.objects.annotate(books_cnt=Count("books"))
[<Author: "J. R. R. Tolkien">, ...]
>>> _[0].books_cnt
>>> 5
SELECT author.*, COUNT(book.id)
FROM author
INNER JOIN book ON (book.author_id = author.id)
GROUP BY author.id

Multi-valued relationship handling

As long as only a single multi-valued relationship is involved (#10060)

>>> Author.objects.annotate(
    books_cnt=Count("books"),
    cobooks_cnt=Count("cobooks"),
)
[<Author: "J. R. R. Tolkien">, ...]
>>> _[0].books_cnt, _[0].cobooks_cnt  # COUNT(book.id) X COUNT(coobook.id)
SELECT author.*, COUNT(cobook.id), COUNT(book.id) FROM author
INNER JOIN book ON (book.author_id = author.id)
LEFT JOIN book cobook ON (cobook.author_id = author.id)
GROUP BY author.id

Multi-valued relationship handling

Multi-valued relationship handling

Makes aggregation through this method a 🦶 🔫

  • Possible to work around for COUNT by using DISTINCT
  • Not possible for aggregate that don't support DISTINCT
  • Possible to work around using subqueries (#28296)

Multi-valued relationship handling

Possible solutions (mostly backward incompatible)
  • Warn or error when multiple multi-valued relationships are involved
  • GROUP BY even when no aggregation is involved
  • Disallow multi-valued relationships filtering
    • e.g. require explicit __exists lookup instead
  • Teach the ORM to be smart enough to use subqueries when appropriate (some database have performance issues with subqueries e.g. MySQL)

Limited support for field fetching strategies

Current situation
  • Relations are deferred by default (opt-in via select_related)
  • Fields are selected by default (opt-out via defer / only)

Limited support for field fetching strategies

class Author(Model):
    name = CharField(max_length=100)

class Book(Model):
    title = CharField(max_length=100)
    author = ForeignKey(Author, CASCADE, related_name="books")

>>> Book.objects.all()
SELECT book.id, book.title, book.author_id FROM book

Limited support for field fetching strategies

>>> book = Book.objects.get(id=1)
>>> book.author
<Author: Tolkien>
SELECT book.id, book.title, book.author_id
FROM book WHERE id = 1
-- Assuming Book(id=1).author_id == 2
SELECT author.id, author.name
FROM author WHERE id = 2

Limited support for field fetching strategies

>>> book = Book.objects.select_related("author").get(id=1)
>>> book.author  # Attribute already fetched
<Author: Tolkien>
SELECT
    book.id, book.title, book.author_id, author.id, author.name
FROM book
INNER JOIN author ON (book.author_id = author.id)
WHERE id = 1

Limited support for field fetching strategies

>>> book = Book.objects.defer("title").get(id=1)
>>> book.title
"The Hobbit"
SELECT book.id, book.author_id
FROM book
WHERE id = 1
---
SELECT book.title
FROM book
WHERE id = 1

Limited support for field fetching strategies

N+1 query problem

>>> for book in Book.objects.defer("name"):
>>>     book.author, book.name  # Triggers two queries!
SELECT book.id, book.author_id FROM book
--- Repeat the following queries for each book (N times)
SELECT author.id, author.name FROM author WHERE id = 2
SELECT book.title FROM book WHERE id = 1
--- ...

Limited support for field fetching strategies

Current (core) mitigations

  • Extensive code coverage relying on assertNumQueries
  • Monitoring # of queries per request through middleware
  • django-debug-toolbar SQL panel

Limited support for field fetching strategies

Possible mitigations

Automatically prefetch when N+1 pattern is detected

Limited support for field fetching strategies

Automatically prefetch when N+1 pattern is detected

>>> for book in Book.objects.all():
>>>     book.author
SELECT book.id, book.author_id FROM book
---
SELECT author.id, author.name FROM author
WHERE id IN (...)

Limited support for field fetching strategies

Automatically prefetch when N+1 pattern is detected

Cons

  • prefetch_related doesn't enforce referential integrity; state can change between separate queries (#27403)
  • Encourage lax practices in terms of data access pattern

Limited support for field fetching strategies

Possible mitigations

Warn on deferred attribute fetching

Warn on deferred attribute fetching

>>> for book in Book.objects.all():
>>>     book.author

LazyAttributeAccess:
    Attempt to fetch deferred relationship "author"
>>> for book in Book.objects.defer("title").all():
>>>     book.title

LazyAttributeAccess:
    Attempt to fetch deferred field "title"

Limited support for field fetching strategies

Possible mitigations

Warn on deferred attribute fetching

Cons

  • Doesn't actually solve the problem, only points at its origin
  • Requires opt-in to maintain backward compatiblity

Limited support for field fetching strategies

Thinking of a generic solution

  • SQLAlchemy lazy field option allows for defining a per-field strategy
  • Could we have a similar strategy that allows for third-party customization. e.g. Field.on_fetch/fetch_policy?
  • Focus should be on providing an entry point at first
  • Discussed in 2013 on django-developers

Reusable relationship definitions

Reusable relationship definitions

class Book(Model):
    date = DateField()
    same_year = ManyToManyRel(
        "self", Q(date__year=F("date__year"))
    )
    translation = ManyToOneRel(
        "translations", Q(translations__language=get_language)
    )

class Translation(Model):
    book = ForeignKey(Book, CASCADE, related_name="translations")
    language = CharField(max_length=2)
    content = TextField()

Reusable relationship definitions

>>> Book.objects.annotate(
    Count("same_year")
).select_related("translation")
SELECT book.*, COUNT(same_year.id), translation.* FROM book
LEFT JOIN book same_year ON (
    EXTRACT(YEAR from book.date) = EXTRACT(YEAR from same_year.date)
)
LEFT JOIN translation ON (
    translation.book_id = book_id AND translation.language = "fr"
)
GROUP BY book.id

Reusable relationship definitions

  • Improve code reusability through model defined alias
  • Much easier to reason about complex data models
  • Support for select_related and prefetch_related
  • Inherit relationship caching

Few of many issues in a similar situation

  • Composite primary keys
  • Support for multi-valued select_related (e.g. JSON_ARRAYAGG)
  • Generated fields
  • Field.db_default
  • Field.db_on_delete

Current state of things

  • Long commitment required to see changes through
  • New features often comes with a long tail of regressions
  • Small number of individuals are able review ORM changes
  • Some issues are stalled due to lack of consensus, diversity
  • ORM is perceived as this undecipherable part of Django

Possible ways forward

Document ORM internals 📄

  • Define expected beaviour beyond test suite
  • Highlight currently undefined behaviour
  • Diagram relationships between features
  • Re-discover and formalize reasoning behind years old decisions

Possible ways forward

Add typing to django.db.models internals

  • Evaluate tradeoffs of DEP 0484: Static type checking for Django
  • Matured ecosystem, typing.Protocol supports align with resolvable and compilable duck-typing
  • IDE supports makes code base more approachable
  • Reduce long tail of TypeError, AttributeError on new feature addition

Possible ways forward

Mentoring 👨‍🏫

  • Released based mentoring
  • Focus on incremental improvements and refactors
  • Onboard contributors in a release cycle

Thanks! 🙇‍♂️

  • Happy to discuss further off stage and during sprints

Thanks Logan!

- Mission of the product is not Make Automation Work for Everyone - 700 people accross 40 countries all remotely

- Hard to imagine Django without `prefetch_related` support that Luke Plant worked on. - Single query fetching - All of the fetching and caching of multi-valued relationships had to be done manually - Excluding the great number of bugs and enhancements - Incremental delivery of feature enhancements

- Thanks to the hard work of Aymeric Augustin and inspiration of Christophe Pettus Clearly defined expectations of the public (managed) API and the off-road expectations Auto-commit usually makes more sense in the context of a web framework that has more read and defaults to

- Long standing dragons we litteraly tamed thanks to Josh Smeathon and Anssi Kääriäinen - Finally -

- Validate the concept of expressions and query resolving - Expressions support V2

Required internal machinery rework to allow resolving expressions against Mapping of IntegrityError to ConstraintViolation(constraint) could change how model validation is performed

From the Book's perspective its relationship is single-valued as it can have a single author

- `JOIN` results in duplicate rows; many books per author - Usually worked around by using `distinct()`; `SELECT DISTINCT`

- Sometimes hard to diagnose issue due as it depends on the dataset

While there are many third party solutions and inhouse tricks to tame this issue the in-core

A lazy `prefetch_related("author")` only triggered if necessary if there are multiple peers

The summary is that prefetch related is not always a good default, could cause more harm than good in some situations

- At Zapier we used this tool to help with the creation of GraphQL endpoint - We turn violations into warnings elevated to exceptions when running our test suite and direct warnings in production to a particular logging facility - This ensures we don't regress when adjusting our endpoints without litering out test suite with tons of `assertNumQueries`

- Opt-in might be required as lazy fetching might be legitimate in some circumstances (e.g. expensive JOIN)

Lot of the required machinery is already existing, it powers reverse relationships It would allow to make virtual relationships first-class citizen in the ORM instead of hacks to support generic relationships

Some issues take a while to wrap your head around, recently fixed an issue I worked on and off over 8 years

Take the diagram with a brain of salt as contribution attributions goes beyond commit author assignment (talks, tutorial, etc) but it does have the benefit of being easily countable.

- Produce more high quality documentation on the subject - Documentation with a disclaimer; it doesn't have to come with the same backward compatibility guarantee at least until things stabilize

- Previously opposed to DEP four-eighty-four - Could allow maintainers to dip their toes in `typing` code without getting our of hands gradually.

Thanks to everyone who reviewed this talk and I'm looking forward to discuss the state of the ORM further off stage and during the sprints on Thursday.