Trac ticket #33191 was recently closed as a “wontfix”, but Carlton encouraged 
bringing the matter to this group.  The actual issue is a small one, but it 
seems that Django could do better than it is doing today with one small change, 
and avoid the need for a counter-intuitive "x = x" statement added to 
application code.
The claim is that Django is unnecessarily clearing the cached entry for an 
in-memory related object at a time when that cache does not need to be 
cleared, and that this cache clearing can result in unwanted lazy reads down 
the line.  A one-line fix for the problem is suggested.
Many apologies in advance for the extreme length of this post.  The requested 
change is small and subtle (although quite beneficial in some cases); we must 
dive deep into code to understand WHY it is important.

Background
———————
My team is developing a major retail point of sale application; it’s currently 
live in nearly 1,000 stores and we’re negotiating contracts that could bring it 
to another 10,000 stores across the US over the next three to five years before 
going world-wide.  The backend is Django and PostgreSQL.
One of the typical problems we face has to do with importing data from existing 
systems.  This data usually comes to us in large sets of flat ASCII files, 
perhaps a gigabyte or few at a time.  We have to parse this incoming data and 
fill our tables as quickly as possible.  It’s not unusual for one such data 
import to create tens of millions of rows across more than a hundred tables.  
Our processes to do this are working reasonably well.
In many cases, the incoming stream of data generates related records in several 
tables at a time.  For example, our product import populates related data 
across a dozen tables organized into parent/child hierarchies that are up to 
four levels deep.  The incoming data is grouped by product (not by functional 
data type); the rows and columns of the records for a single product will 
create ORM objects scattered across those dozen ORM models.  The top-level 
model, Product, has four child models with references to Product; and each of 
those child models may have other child tables referring back to THOSE models, 
etc.
If the database schema sounds surprisingly complex, it is.  Big box retail is 
more complex than most people realize.  Each store typically carries 30,000 to 
70,000 individual products; some are even larger.  Some of the retail chains to 
which we market our systems have more than $1 billion USD per year in sales.
To be efficient, we process this incoming data in chunks:  We may instantiate 
ORM objects representing, say, 5,000 products and all of their related child 
objects.  For example, we’ll create a new instance of a Product model, then 
instantiate the various children that reference that product.  So a very 
typical pattern is something like
     product = Product(**values)
     pv = ProductVariant(product=product, **more_values)
     upc = UPC(product_variant=pv, **upc_values)
It’s not that simple, of course; we’re reading in sequential data that 
generates multiple instances of the children in loops, etc., but essentially 
building a list of products and the multi-level hierarchy of child objects that 
depend upon each of those products.  We then use bulk_create() to create all 
of the top-level products in one operation, then bulk_create() the various 
lists of first-level children, then bulk_create the children of that second 
level, etc.  LOTS of bulk creates.

Prior to Django 3.2, we had to explicitly set the associated “_id” field for 
each child’s reference to a parent after the list of parents was bulk_created. 
 Using our examples above, we would see things like:
    Product.objects.bulk_create(list_of_products)
    for variant in list_of_product_variants:
        variant.product_id = variant.product.id
    ProductVariant.objects.bulk_create(list_of_product_variants)
    for upc in list_of_upc_entries:
        upc.product_variant_id = upc.product_variant.id
    UPC.objects.bulk_create(list_of_upc_entries)
            […]
Again, this is somewhat simplifying the code, but the key takeaway is that 
older versions of Django required us to manually pick up the primary key value 
from recently created instances and set the associated “_id” value in each 
instance pointing at the recently created objects.
As expected, setting the “_id” value of a foreign key field clears the internal 
cache entry containing the reference to the parent object.  We would fully 
expect that if we said “variant.product_id = 10”, that “variant.product” would 
be considered “not yet loaded”, and a reference to variant.product would 
generate a lazy read.  We don’t disagree with that interpretation or behavior, 
and that is the existing behavior that Carlton implicitly referenced when he 
closed the ticket in question as “wontfix”.
But…

Relatively Recent Django Changes
——————————————————
In the Django 3.2 timeframe, Ticket #29497 (“Saving parent object after setting 
on child leads to an unexpected data loss in bulk_create()”) affected code in 
this area.  Consider this sequence of code:
   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.save()
(Yes, our real business case involves use of bulk_create() for thousands of 
objects at a time, but this simpler example using only .save() illustrates the 
problem in a stripped-down manner.)
What is the expected result in the database for the child’s foreign key to the 
parent?   Well…   We think most application programmers would expect the row 
for the child object to contain a foreign key reference to the parent object.
Instead, the pre-3.2 Django behavior was that the child would be saved with a 
null value in the foreign key reference to parent…  Because, when we save the 
parent (and internally populate the parent’s primary key with a newly assigned 
ID), Django is unaware of the in-memory reference from child to parent, and 
cannot automatically set child.parent_id = child.parent.id, so it remains a 
null value.
If the foreign key from child to parent is required, the save of the child 
object will fail because of the null value, and the developer will quickly 
realize something is wrong.  However, if the foreign key to the parent is NOT 
required, then the child WILL be quietly saved to the database with a null 
foreign key to the parent... and an unwary developer will wind up with 
unexpected null values.
This was considered an unintentional loss of data and reported on ticket #29497.
The solution to this problem was clever:  It recognizes that this is a case 
where my_child.parent_id and my_child.parent.pk are out of sync with one 
another (something we normally never see in Django), and that this particular 
case can be detected and solved.  In particular, the logic implemented as part 
of the ticket effectively changes the internal priority of these two related 
values.
Prior to 3.2, the code to build the SQL INSERT statement for saving the child 
would always use the “parent_id” attribute regardless of whether “parent” was 
an ORM instance or not.  The new 3.2 code does a quick initial test:  If the 
“parent” attribute contains an ORM object with a non-null primary key AND if 
“parent_id” does not have a value, then set parent_id = parent.pk and proceed 
with the remainder of the save.
In other words, if we don’t have a value in “parent_id” but we DO have a value 
in “parent.pk”, then fix parent_id before we move on to build the SQL.   Nice!
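The effect of that check can be sketched in a few lines of plain Python.  This 
is a stand-in with illustrative names (Parent, Child, prepare_for_save), not 
Django’s actual code:

```python
# Plain-Python stand-in for the 3.2-era reconciliation check; illustrative only.
class Parent:
    def __init__(self, pk=None):
        self.pk = pk

class Child:
    def __init__(self, parent=None):
        self.parent = parent      # mimics the cached in-memory reference
        self.parent_id = None

    def prepare_for_save(self):
        # If "parent_id" is unset but the referenced parent now has a
        # primary key, copy the pk before building the INSERT.
        obj = self.parent
        if self.parent_id is None and obj is not None and obj.pk is not None:
            self.parent_id = obj.pk

parent = Parent()
child = Child(parent=parent)
parent.pk = 10            # as if parent.save() had just assigned the pk
child.prepare_for_save()
# child.parent_id has been reconciled to 10 from the cached parent
```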
But…

The change we are requesting
——————————————————
For those that want to see the code in this area, check out the method 
_prepare_related_fields_for_save() in django/db/models/base.py.  The specific 
line of
code in question is:
    setattr(self, field.attname, obj.pk)

It uses an internal “setter” to set “parent_id” to the primary key of the 
referenced object.
Note this point carefully:  The value of “obj” is the referenced instance of 
the parent in question.  Just a few lines earlier in the code, “obj” is set 
with this statement:
    obj = getattr(self, field.name, None)
When we chase down this getter, it is returning the value in 
self._state.fields_cache[fieldname].  In other words, it’s the actual instance 
of the parent referenced in memory from the child.  The application-level code 
has clearly and intentionally made this reference from child to parent, and the 
fact that we saved the parent to the database doesn’t (well, shouldn’t) affect 
that relationship.  It certainly would NOT affect the relationship if the 
parent was updated instead of inserted, and we would expect consistent behavior 
from .save() regardless of insert vs. update.
So, back to the narrow problem.  This line of code:
    setattr(self, field.attname, obj.pk)
causes the internal cached reference from child to parent to be cleared!
Well, yes, of course it does.  This setattr() operation is effectively doing 
the same thing as
    child.parent_id = 10
which we’ve already said OUGHT to invalidate an existing reference to 
“child.parent”.
But here’s the crux of the argument:  The code within 
_prepare_related_fields_for_save() ISN’T application-level code clearly 
altering the value of an ID field.  Instead, it’s internal Django code 
attempting to reconcile a mismatch between the cached ID of the parent 
(child.parent_id) and primary key value of the in-memory reference to a model 
instance (child.parent.pk).  And most importantly, it’s reconciling that 
difference by believing that the referenced instance is more accurate and 
should be preserved.  Yet the very act of preserving that referenced instance’s 
primary key is removing the reference to the very instance that is considered 
more accurate.
Wait a moment — to preserve a pointer to the referenced instance, we’re going 
to pluck one tiny value out of it AND THEN throw the referenced instance away!  
To those of my generation, this reminds us of a 50+ year old infamous 
quotation, “To save the village we had to destroy it.”

Why does it matter?
————————————
If the sole use of the parent and child objects in this example was to get data 
saved to the database, it really wouldn’t matter.
However, in our specific use case, we have to record certain types of 
operations in an internal log.  (It’s an “inventory transaction register”; 
whenever we change something in inventory that has an effect on the value of 
inventory, we must log that change.)  That logging logic actually applies to 
ORM instances that are two levels down from the top in our hierarchy of 
parent/child objects being bulk created.  We have a method to which we pass 
each of these not-top-level objects, and the log entries include pieces of 
data from those objects’ parent and grandparent objects.
When we access the parent object through the child we’re logging, we wind up 
causing lazy reads of the objects we just saved to the database (and which we 
still have in memory, and which we THOUGHT were still referenced from our child 
objects).  In our case, we actually get 2 lazy reads for each store-specific 
product being imported; 50,000 items being imported for a four-store chain 
causes an extra 400,000 unnecessary SQL SELECT statements.  Removing these 
potential lazy reads is the performance improvement we’d like to share with the 
world.
And perhaps just as importantly, it avoids differences in behavior caused when 
a referenced object is updated vs. inserted into the database during a save.  
When it is updated, there is no need to alter the primary key, so the 
referenced object remains referenced.  But if the referenced object is 
inserted, the internal code that updates the primary key for that instance 
removes the reference.  We normally don’t like these types of behavioral 
differences in Django – we want .save() to behave the same way regardless of 
inserting vs. updating.

There is a workaround
——————————————————
There is a fairly simple workaround, and it harkens back to the days of pre-3.2 
Django.
   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.parent = my_child.parent    # Add this line to work around the problem.
   my_child.save()
Yes, we can do this.  This rather unintuitive statement (which appears to be 
useless) obtains the internal cached reference to the parent object, then sets 
both child.parent_id and the internal cached reference back to that instance.  
(The code in the setter first sets child.parent_id (which clears the cached 
entry), then inserts the originally referenced parent back into the internal 
cache.)
This workaround greatly resembles the statement from days of old: 
(“child.parent_id = child.parent.id”), but has the additional benefit of 
keeping our in-memory reference to parent (and to all of the objects that 
parent itself might be referencing as well), and avoiding lazy reads in the 
future.

Our suggested change
——————————————————
After this very long-winded explanation, we are proposing a small change that 
makes the workaround unnecessary while still avoiding the potential performance 
hit from lazy queries.
Let’s look back to our friend _prepare_related_fields_for_save().
We argue there is a STRONG belief that preparing an object for saving to the 
database should not substantially alter the content of that object.  Yet this 
method does not adhere to that belief:  While the results saved to the database 
are correct, it has altered the in-memory object by discarding the internal 
reference to a parent object that the application level code has carefully set.
In general, we like what _prepare_related_fields_for_save() is doing — we like 
the fact that it’s reconciling the difference between the copy of the 
referenced object’s primary key (child.parent_id) and the in-memory reference 
(child.parent.pk).  We agree that this method is doing the right thing by using 
the primary key value from the referenced object.  We simply do not think that 
the parent instance should be removed from the internal cache as part of this 
change.
Instead of resolving this difference in keys by executing:
    setattr(self, field.attname, obj.pk)
We believe it should instead execute:
    setattr(self, field.name, obj)
This would effectively replace the pseudo-code
    child.parent_id = child.parent.pk
with
    child.parent = child.parent
It replaces the value of child.parent_id with child.parent.pk, but without the 
side effect of discarding the in-memory relation of the child and parent 
instances.
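The proposed behavior can be demonstrated with a plain-Python mimic of the two 
descriptors involved.  The class names and the cache key are illustrative 
stand-ins, not Django’s actual classes:

```python
# Illustrative plain-Python mimic of the two descriptors; not Django's code.
class FKIdAttribute:
    """Mimics the 'parent_id' descriptor: changing the id drops the cache."""
    def __set__(self, instance, value):
        if (instance.__dict__.get("parent_id") != value
                and "_cached_parent" in instance.__dict__):
            del instance.__dict__["_cached_parent"]
        instance.__dict__["parent_id"] = value

class FKAttribute:
    """Mimics the 'parent' descriptor: sets the id, then re-caches the object."""
    def __set__(self, instance, value):
        instance.parent_id = value.pk            # clears any stale cache entry
        instance.__dict__["_cached_parent"] = value

class Parent:
    def __init__(self, pk=None):
        self.pk = pk

class Child:
    parent_id = FKIdAttribute()
    parent = FKAttribute()

parent = Parent(pk=10)
child = Child()
child.__dict__["_cached_parent"] = parent   # as set by Child(parent=parent)
child.parent = parent                       # the proposed setattr(field.name, obj)
# child.parent_id is now 10 AND the parent instance is still cached
```

Setting the relation (field.name) rather than the id (field.attname) ends with 
both values in sync and the in-memory reference intact.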

Finally, there is a small performance improvement that could be made to this 
suggestion.
If we look at what the current
    setattr(self, field.attname, obj.pk)
actually does, we have to look at the __set__() method of the class 
ForeignKeyDeferredAttribute in django/db/models/fields/related_descriptors.py.
The code in this method is:
def __set__(self, instance, value):
    if instance.__dict__.get(self.field.attname) != value and self.field.is_cached(instance):
        self.field.delete_cached_value(instance)
    instance.__dict__[self.field.attname] = value
It does what we expect:  If the “parent_id” value is changing and there is a 
cached in-memory reference, then clear the cached entry before setting the 
integer value.
But note that the value of the “parent_id” attribute is really stored in the 
dictionary entry instance.__dict__[self.field.attname].
Our proposed change back in _prepare_related_fields_for_save() has the goal of 
setting the value of the integer “parent_id” while preserving the cached 
in-memory reference to the parent.
Instead of doing
    setattr(self, field.name, obj)
(which accomplishes the goal by essentially clearing the cache, then restoring 
it), we could just do this:
    self.__dict__[field.attname] = obj.pk
It is a slight denormalization of the logic that manages the value of the 
deferred foreign key attribute, but it offers a tiny performance improvement by 
not removing and then replacing a dictionary entry.
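Continuing with an illustrative mimic (again, a stand-in rather than Django’s 
actual classes), the direct dictionary write sets the id without ever touching 
the cache entry:

```python
# Illustrative mimic; the descriptor and cache key are stand-ins, not Django's.
class FKIdAttribute:
    """Changing the id through the descriptor drops the cached relation."""
    def __set__(self, instance, value):
        if (instance.__dict__.get("parent_id") != value
                and "_cached_parent" in instance.__dict__):
            del instance.__dict__["_cached_parent"]
        instance.__dict__["parent_id"] = value

class Child:
    parent_id = FKIdAttribute()

child = Child()
cached = object()                     # pretend this is the parent instance
child.__dict__["_cached_parent"] = cached

# The proposed shortcut: bypass the descriptor entirely.
child.__dict__["parent_id"] = 10

# The id is set and the cache entry was never removed or replaced
```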

What are WE doing about this problem?
—————————————————————
We’re going to be staying with Django 3.2 until the next long-term support 
release is available, so this change has no immediacy for our particular 
application.  However, we think it is a nice extension that removes a 
non-obvious behavior in Django 3.2 (the .save() method can sometimes discard 
in-memory references to other ORM objects) and seems like it enhances the magic 
of the ORM framework.
In the meantime, we’ll be plugging non-intuitive lines of code that look like
    child.parent = child.parent
into about fifty locations throughout our app, as well as training a team of 20 
server-side developers to watch out for this edge case in any future code they 
write.  It’s all doable, and we’ll do it… but we really like Django because it 
does “automagically” handle so many little database oddities and nuances.  Over 
time, we’d like to get rid of “surprise” code like these lines that an 
inexperienced developer might remove because they look like they do nothing.

IF YOU’VE REACHED THE END
—————————————————
Thank you!  I’m amazed you stuck through it all.
What do you think?
