The problem with laziness: minimising performance issues caused by Django's implicit database queries

Django's object-relational mapper (ORM) is a huge part of the reason for the framework's massive success. The ORM makes talking to databases easy by abstracting away the details of connections, cursors and queries. It allows developers to think in high-level Python and focus on business logic rather than low-level database plumbing.

However, its greatest strength is also its greatest weakness. The fact that it abstracts away database queries, particularly when it comes to lazily traversing relationships between objects, means that the queries emitted by the application become difficult to reason about. It's not possible to tell, just by looking at a piece of code, which queries will be run when it is executed. This can cause huge performance problems.

Example

To illustrate, consider the following simple line of code:

user.address

This line might be found in, say, a Django template (<div>{{ user.address }}</div>). Or, if your Django project exposes an API for a single-page or mobile app, it might be specified as a field in a Django REST framework UserSerializer.

Here's the question: how many database queries are executed when this line of code is run?

The problem is that just by looking at this piece of code in isolation, you simply can't tell. Maybe address is just a field on the User model, so no extra queries are needed. Unless, of course, the field has been included in a call to defer on the queryset, meaning that accessing it will require a query. Or maybe address is actually a foreign key pointing at a separate Address model? Or maybe a user can have many addresses, and the address attribute is actually a method on the User model which fetches the address marked primary from the set of related addresses?
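To make the ambiguity concrete, here is a minimal pure-Python sketch of an attribute that looks like a plain field but quietly runs a query on first access. It does not use Django: QUERY_LOG, fetch_address and this User class are hypothetical stand-ins for illustration, not the ORM's real machinery.

```python
# Illustrative stand-in for an ORM; none of these names are Django APIs.
QUERY_LOG = []
ADDRESSES = {10: "1 High Street, Brighton"}


def fetch_address(address_id):
    """Pretend to hit the database and record that a query happened."""
    QUERY_LOG.append(f"SELECT line FROM address WHERE id = {address_id}")
    return ADDRESSES[address_id]


class User:
    def __init__(self, name, address_id):
        self.name = name
        self.address_id = address_id
        self._address_cache = None

    @property
    def address(self):
        # Looks like a plain attribute to the caller, but the first access
        # silently runs a query, much like a lazy foreign key lookup.
        if self._address_cache is None:
            self._address_cache = fetch_address(self.address_id)
        return self._address_cache


user = User("alice", address_id=10)
print(len(QUERY_LOG))  # 0: nothing has been fetched yet
print(user.address)    # first access runs one hidden query
print(user.address)    # second access is served from the cache
print(len(QUERY_LOG))  # 1
```

Nothing at the call site (user.address) distinguishes this from reading an ordinary instance attribute, which is exactly why the query is easy to miss.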

The point is that without looking at the model and/or the view, there's no way of knowing. Django's template system is designed to be used by frontend developers who only know HTML, so they can't be expected to look at and understand the Python code. Good object-oriented design encourages us to hide implementation details, so developers wouldn't naturally think to care about what happens when an attribute is accessed, or when a method is called. This makes it very easy for seemingly innocuous changes to client requirements ("can we add the user's address to each row in the user table?") to result in a serious performance hit.

It should be noted that Django does, of course, provide tools for solving the "n+1 queries" problem (one query to fetch n rows, followed by one more query for each row): select_related and prefetch_related, as well as annotate and aggregate, let you perform more complex queries to pull out the data you need in the most efficient way possible. But often you only discover that these optimisations are needed after the performance problems materialise, usually once your code is in production and your database starts to fill up with real user data (in other words, the "n" in the "n+1 queries problem" gets bigger).
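The difference in query counts can be shown without Django at all. The sketch below uses Python's built-in sqlite3 module and a hypothetical CountingConnection wrapper (not part of any library) to compare a lazy per-row lookup against a single JOIN, which is roughly the kind of SQL that select_related generates for a foreign key. The table and column names are made up for the example.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts SELECT statements sent to SQLite."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.select_count = 0

    def select(self, sql, params=()):
        self.select_count += 1
        return self.conn.execute(sql, params).fetchall()


db = CountingConnection()
db.conn.executescript(
    """
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER);
    INSERT INTO addresses VALUES (1, 'Brighton'), (2, 'Leeds'), (3, 'York');
    INSERT INTO users VALUES (1, 'ann', 1), (2, 'bob', 2), (3, 'cat', 3);
    """
)


def lazy_rows():
    # One query for the users, then one more per user: n + 1 in total.
    rows = []
    users = db.select("SELECT name, address_id FROM users ORDER BY id")
    for name, address_id in users:
        found = db.select("SELECT city FROM addresses WHERE id = ?", (address_id,))
        rows.append((name, found[0][0]))
    return rows


def joined_rows():
    # A single JOIN fetches the same data in one round trip.
    return db.select(
        "SELECT u.name, a.city FROM users u "
        "JOIN addresses a ON a.id = u.address_id ORDER BY u.id"
    )


lazy = lazy_rows()
lazy_queries = db.select_count

db.select_count = 0
joined = joined_rows()
joined_queries = db.select_count

print(lazy == joined)                # True: same data either way
print(lazy_queries, joined_queries)  # 4 1
```

With three users the lazy version costs four queries against one for the JOIN; the gap grows linearly with the number of rows.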

Avoiding implicit queries

So how can we make sure this kind of problem doesn't occur? The key insight here is that idiomatic Django code encourages the intermingling of two basic steps during request handling that should instead be kept separate. Those two steps are:

  1. Fetch the data you need
  2. Present the data to the user

Django (and Django REST framework) blurs these steps together by allowing queries to happen when model attributes are accessed (in a template or serializer) and in other places like model methods, custom template tags, or in a SerializerMethodField. The queries that happen implicitly during the "presentation" phase are the ones that cause the problems. Adding prefetch_related and select_related pushes those queries (conceptually) into the first phase.

By identifying these two steps, we can adopt programming patterns where they are kept separate. We could (and indeed should) try to enforce this separation with code review. Our team could (and should) establish a set of recommended design patterns around making the two steps explicit in our codebases.

But without a way to programmatically enforce the separation, mistakes will always be able to slip through. Ideally, we'd have a way to force queries to happen in step 1, and prevent them from happening in step 2. This means that if a frontend developer working in a Django template added a call to an attribute that incurred a query, they would be informed straight away, rather than only noticing later when the performance of the application went down the drain.

Introducing django-zen-queries

In order to achieve this, we have created a simple library that allows developers to mark explicitly the areas of code that are allowed to run database queries, and those that aren't. At the basic level, it does this using Python's context manager syntax:

def view(request):
    data = get_data_from_database()  # <- (1) data fetching phase
    with queries_disabled():  # <- django-zen-queries context manager
        return render_template("template.html", data)  # <- (2) presentation phase

(this is pseudocode rather than real code, but should illustrate the pattern).

If any attempt to execute a query is made by code running under the queries_disabled block, an exception will be raised (QueriesDisabledError). The stack trace of the exception will include the point in the code where the query was triggered, so the developer should be able to easily find the problem. The queries_disabled context manager can also be used as a function decorator (in Python 3.2+) and a method decorator (by wrapping it in Django's method_decorator utility).
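The mechanism can be sketched in plain Python. The example below is not the library's implementation: FakeDatabase, QueriesBlocked and block_queries are hypothetical names standing in for a real database connection, QueriesDisabledError and queries_disabled. It relies on the fact that, since Python 3.2, context managers built with contextlib.contextmanager can also be applied as decorators, which is presumably where the version requirement above comes from.

```python
from contextlib import contextmanager


class QueriesBlocked(Exception):
    """Hypothetical stand-in for the library's QueriesDisabledError."""


class FakeDatabase:
    """Pretend database that refuses to run queries while blocked."""

    def __init__(self):
        self.blocked = False

    def query(self, sql):
        if self.blocked:
            raise QueriesBlocked(f"query attempted while disabled: {sql}")
        return [("alice",)]


db = FakeDatabase()


@contextmanager
def block_queries():
    # Remember the previous state so that nested blocks restore correctly.
    previous = db.blocked
    db.blocked = True
    try:
        yield
    finally:
        db.blocked = previous


def render(rows):
    return f"<div>{rows[0][0]}</div>"


rows = db.query("SELECT name FROM users")  # (1) fetching: queries allowed

with block_queries():                      # (2) presenting: queries blocked
    html = render(rows)                    # fine, uses already-fetched data


@block_queries()                           # the same guard used as a decorator
def leaky_render():
    return render(db.query("SELECT name FROM users"))  # hidden query


try:
    leaky_render()
except QueriesBlocked as exc:
    caught = str(exc)

print(html)    # <div>alice</div>
print(caught)  # query attempted while disabled: SELECT name FROM users
```

Because the exception is raised at the point where the query is attempted, the traceback leads straight to the offending line rather than to the guard itself.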

The library also includes some helpers to make it easier to enforce this pattern with minimal extra effort. At DabApps, most of our backend work is building APIs for single-page apps or mobile apps (rather than using Django's template rendering system to return HTML). The most useful part of the library for us is the QueriesDisabledViewMixin. This can be mixed in to Django REST framework generic views.

When a generic view executes, the last step is to return Response(serializer.data). By adding the QueriesDisabledViewMixin to the view, accessing the .data property on the serializer is prevented from performing any queries. For example, a nested serializer that represents a related object must already have had its data retrieved from the database (e.g. select_related must have been called on the queryset passed to the serializer). By protecting generic views with this mixin when they are first written, a future developer making a change to the behaviour of the endpoint cannot accidentally cause additional queries to be executed simply by adding an entry to the fields attribute on the serializer class. By adding this mixin to an existing view, running the test suite for the application will show us where we have potential performance issues in our code.
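As a rough illustration of the idea (not the library's actual code, and not using Django REST framework), the sketch below shows one way a mixin could wrap a serializer-like .data property so that it is evaluated with queries disabled. Every name here (Database, QueriesBlocked, block_queries, Serializer, QueriesBlockedMixin) is a hypothetical stand-in, and for brevity the mixin is applied to the serializer stand-in rather than to a view.

```python
from contextlib import contextmanager


class QueriesBlocked(Exception):
    """Hypothetical stand-in for QueriesDisabledError."""


class Database:
    allowed = True

    @classmethod
    def query(cls, sql):
        if not cls.allowed:
            raise QueriesBlocked(sql)
        return "1 High Street"


@contextmanager
def block_queries():
    previous = Database.allowed
    Database.allowed = False
    try:
        yield
    finally:
        Database.allowed = previous


class User:
    def __init__(self, name, prefetched_address=None):
        self.name = name
        self._address = prefetched_address

    @property
    def address(self):
        # Lazy lookup: only queries if the address was not fetched up front.
        if self._address is None:
            self._address = Database.query("SELECT line FROM addresses")
        return self._address


class Serializer:
    """Stand-in for a serializer whose .data reads each listed field."""

    fields = ("name",)

    def __init__(self, instance):
        self.instance = instance

    @property
    def data(self):
        return {field: getattr(self.instance, field) for field in self.fields}


class QueriesBlockedMixin:
    """Hypothetical mixin: evaluates .data with queries disabled."""

    @property
    def data(self):
        with block_queries():
            return super().data


class SafeUserSerializer(QueriesBlockedMixin, Serializer):
    fields = ("name", "address")  # a newly added field that may be lazy


# Address fetched up front (phase 1), so serialization needs no queries.
ok = SafeUserSerializer(User("alice", prefetched_address="1 High Street")).data

# Address not fetched: the lazy lookup is caught instead of running silently.
try:
    SafeUserSerializer(User("bob")).data
except QueriesBlocked:
    blocked = True
else:
    blocked = False

print(ok)       # {'name': 'alice', 'address': '1 High Street'}
print(blocked)  # True
```

The point of the pattern is that adding "address" to fields fails loudly in tests when the data has not been prefetched, instead of silently adding a query per object.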

Conclusion

Django's lazy database query execution makes our code beautiful but hides performance problems that may come back to bite us later. By explicitly splitting our request handling code into two separate steps, fetching and presenting, we make those problems more visible up-front, while maintaining most of the benefits. By using django-zen-queries we can programmatically prevent the presentation phase from executing any queries, protecting our application from unexpected performance degradation.