Trac ticket #33191 was recently closed as a “wontfix”, but Carlton encouraged 
bringing the matter to this group.  The actual issue is a small one, but it 
seems that Django could do better than it is doing today with one small change, 
and avoid the need for a counter-intuitive "x = x" statement added to 
application code.
The claim is that Django is unnecessarily clearing the cached entry for an 
in-memory related object at a time when that cache does not need to be 
cleared, and that this cache clearing can result in unwanted lazy reads down 
the line.  A one-line fix for the problem is suggested.
Many apologies in advance for the extreme length of this post.  The requested 
change is small and subtle (although quite beneficial in some cases); we must 
dive deep into code to understand WHY it is important.

Background
———————
My team is developing a major retail point of sale application; it’s currently 
live in nearly 1,000 stores and we’re negotiating contracts that could bring it 
to another 10,000 stores across the US over the next three to five years before 
going world-wide.  The backend is Django and PostgreSQL.
One of the typical problems we face has to do with importing data from existing 
systems.  This data usually comes to us in large sets of flat ASCII files, 
perhaps a gigabyte or few at a time.  We have to parse this incoming data and 
fill our tables as quickly as possible.  It’s not unusual for one such data 
import to create tens of millions of rows across more than a hundred tables.  
Our processes to do this are working reasonably well.
In many cases, the incoming stream of data generates related records in several 
tables at a time.  For example, our product import populates related data 
across a dozen tables organized into parent/child hierarchies that are up to 
four levels deep.  The incoming data is grouped by product (not by functional 
data type); the rows and columns of the records for a single product will 
create ORM objects scattered across those dozen ORM models.  The top-level 
model, Product, has four child models with references to Product; and each of 
those child models may have other child tables referring back to THOSE models, 
etc.
If the database schema sounds surprisingly complex, it is.  Big box retail is 
more complex than most people realize.  Each store typically carries 30,000 to 
70,000 individual products; some are even larger.  Some of the retail chains to 
which we market our systems have more than $1 billion USD per year in sales.
To be efficient, we process this incoming data in chunks:  We may instantiate 
ORM objects representing, say, 5,000 products and all of their related child 
objects.  For example, we’ll create a new instance of a Product model, then 
instantiate the various children that reference that product.  So a very 
typical pattern is something like
     product = Product(**values)
     pv = ProductVariant(product=product, **more_values)
     upc = UPC(product_variant=pv, **upc_values)
It’s not that simple, of course; we’re reading in sequential data that 
generates multiple instances of the children in loops, etc., but essentially 
building a list of products and the multi-level hierarchy of child objects that 
depend upon each of those products.  We then use bulk_create() to create all 
of the top-level products in one operation, then bulk_create() the various 
lists of first-level children, then bulk_create the children of that second 
level, etc.  LOTS of bulk creates.

Prior to Django 3.2, we had to explicitly set the associated “_id” field for 
each child’s reference to a parent after the list of parents was bulk_created. 
 Using our examples above, we would see things like:
    Product.objects.bulk_create(list_of_products)
    for variant in list_of_product_variants:
        variant.product_id = variant.product.id
    ProductVariant.objects.bulk_create(list_of_product_variants)
    for upc in list_of_upc_entries:
        upc.product_variant_id = upc.product_variant.id
    UPC.objects.bulk_create(list_of_upc_entries)
            […]
Again, this is somewhat simplifying the code, but the key takeaway is that 
older versions of Django required us to manually pick up the primary key value 
from recently created instances and set the associated “_id” value in each 
instance pointing at the recently created objects.
As expected, setting the “_id” value of a foreign key field clears the internal 
cache entry containing the reference to the parent object.  We would fully 
expect that if we said “variant.product_id = 10”, that “variant.product” would 
be considered “not yet loaded”, and a reference to variant.product would 
generate a lazy read.  We don’t disagree with that interpretation or behavior, 
and that is the existing behavior that Carlton implicitly referenced when he 
closed the ticket in question as “wontfix”.
But…

Relatively Recent Django Changes
——————————————————
In the Django 3.2 timeframe, Ticket #29497 (“Saving parent object after setting 
on child leads to an unexpected data loss in bulk_create()”) affected code in 
this area.  Consider this sequence of code:
   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.save()
(Yes, our real business case involves use of bulk_create() for thousands of 
objects at a time, but this simpler example using only .save() illustrates the 
problem in a stripped-down manner.)
What is the expected result in the database for the child’s foreign key to the 
parent?   Well…   We think most application programmers would expect the row 
for the child object to contain a foreign key reference to the parent object.
Instead, the pre-3.2 Django behavior was that the child would be saved with a 
null value in the foreign key reference to parent…  Because, when we save the 
parent (and internally populate the parent’s primary key with a newly assigned 
ID), Django is unaware of the in-memory reference from child to parent, and 
cannot automatically set child.parent_id = child.parent.id, so it remains a 
null value.
If the foreign key from child to parent is required, the save of the child 
object will fail because of the null value, and the developer will quickly 
realize something is wrong.  However, if the foreign key to the parent is NOT 
required, then the child WILL be quietly saved to the database with a null 
foreign key to the parent... and an unwary developer will wind up with 
unexpected null values.
This was considered an unintentional loss of data and reported on ticket #29497.
The solution to this problem was clever:  It recognizes that this is a case 
where my_child.parent_id and my_child.parent.pk are out of sync with one 
another (something we normally never see in Django), and that this particular 
case can be detected and solved.  In particular, the logic implemented as part 
of the ticket effectively changes the internal priority of these two related 
values.
Prior to 3.2, the code to build the SQL INSERT statement for saving the child 
would always use the “parent_id” attribute regardless of whether “parent” was 
an ORM instance or not.  The new 3.2 code does a quick initial test:  If the 
“parent” attribute contains an ORM object with a non-null primary key AND if 
“parent_id” does not have a value, then set parent_id = parent.pk and proceed 
with the remainder of the save.
In other words, if we don’t have a value in “parent_id” but we DO have a value 
in “parent.pk”, then fix parent_id before we move on to build the SQL.   Nice!
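The effect of that check can be sketched in a few lines of plain Python.  This 
is a stand-in with illustrative names (Parent, Child, prepare_for_save), not 
Django’s actual code:

```python
# Plain-Python stand-in for the 3.2-era reconciliation check; illustrative only.
class Parent:
    def __init__(self, pk=None):
        self.pk = pk

class Child:
    def __init__(self, parent=None):
        self.parent = parent      # mimics the cached in-memory reference
        self.parent_id = None

    def prepare_for_save(self):
        # If "parent_id" is unset but the referenced parent now has a
        # primary key, copy the pk before building the INSERT.
        obj = self.parent
        if self.parent_id is None and obj is not None and obj.pk is not None:
            self.parent_id = obj.pk

parent = Parent()
child = Child(parent=parent)
parent.pk = 10            # as if parent.save() had just assigned the pk
child.prepare_for_save()
# child.parent_id has been reconciled to 10 from the cached parent
```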
But…

The change we are requesting
——————————————————
For those that want to see the code in this area, check out the method 
_prepare_related_fields_for_save() in django/db/models/base.py.  The specific 
line of
code in question is:
    setattr(self, field.attname, obj.pk)

It uses an internal “setter” to set “parent_id” to the primary key of the 
referenced object.
Note this point carefully:  The value of “obj” is the referenced instance of 
the parent in question.  Just a few lines earlier in the code, “obj” is set 
with this statement:
    obj = getattr(self, field.name, None)
When we chase down this getter, it is returning the value in 
self._state.fields_cache[fieldname].  In other words, it’s the actual instance 
of the parent referenced in memory from the child.  The application-level code 
has clearly and intentionally made this reference from child to parent, and the 
fact that we saved the parent to the database doesn’t (well, shouldn’t) affect 
that relationship.  It certainly would NOT affect the relationship if the 
parent was updated instead of inserted, and we would expect consistent behavior 
from .save() regardless of insert vs. update.
So, back to the narrow problem.  This line of code:
    setattr(self, field.attname, obj.pk)
causes the internal cached reference from child to parent to be cleared!
Well, yes, of course it does.  This setattr() operation is effectively doing 
the same thing as
    child.parent_id = 10
which we’ve already said OUGHT to invalidate an existing reference to 
“child.parent”.
But here’s the crux of the argument:  The code within 
_prepare_related_fields_for_save() ISN’T application-level code clearly 
altering the value of an ID field.  Instead, it’s internal Django code 
attempting to reconcile a mismatch between the cached ID of the parent 
(child.parent_id) and primary key value of the in-memory reference to a model 
instance (child.parent.pk).  And most importantly, it’s reconciling that 
difference by believing that the referenced instance is more accurate and 
should be preserved.  Yet the very act of preserving that referenced instance’s 
primary key is removing the reference to the very instance that is considered 
more accurate.
Wait a moment — to preserve a pointer to the referenced instance, we’re going 
to pluck one tiny value out of it AND THEN throw the referenced instance away!  
To those of my generation, this reminds us of a 50+ year old infamous 
quotation, “To save the village we had to destroy it.”

Why does it matter?
————————————
If the sole use of the parent and child objects in this example was to get data 
saved to the database, it really wouldn’t matter.
However, in our specific use case, we have to record certain types of 
operations in an internal log.  (It’s an “inventory transaction register”; 
whenever we change something in inventory that has an effect on the value of 
inventory, we must log that change.)  That logging logic actually applies to 
ORM instances that are two levels down from the top in our hierarchy of 
parent/child objects being bulk created.  We have a method to which we pass 
each of these not-top-level objects, and the log entries include pieces of 
data from those objects’ parent and grandparent objects.
When we access the parent object through the child we’re logging, we wind up 
causing lazy reads of the objects we just saved to the database (and which we 
still have in memory, and which we THOUGHT were still referenced from our child 
objects).  In our case, we actually get 2 lazy reads for each store-specific 
product being imported; 50,000 items being imported for a four-store chain 
causes an extra 400,000 unnecessary SQL SELECT statements.  Removing these 
potential lazy reads is the performance improvement we’d like to share with the 
world.
And perhaps just as importantly, it avoids differences in behavior caused when 
a referenced object is updated vs. inserted into the database during a save.  
When it is updated, there is no need to alter the primary key, so the 
referenced object remains referenced.  But if the referenced object is 
inserted, the internal code that updates the primary key for that instance 
removes the reference.  We normally don’t like these types of behavioral 
differences in Django – we want .save() to behave the same way regardless of 
inserting vs. updating.

There is a workaround
——————————————————
There is a fairly simple workaround, and it harkens back to the days of pre-3.2 
Django.
   my_parent = Parent()
   my_child = Child(parent=my_parent)
   my_parent.save()
   my_child.parent = my_child.parent    # Add this line to work around the problem.
   my_child.save()
Yes, we can do this.  This rather unintuitive statement (which appears to be 
useless) obtains the internal cached reference to the parent object, then sets 
both child.parent_id and the internal cached reference back to that instance.  
(The code in the setter first sets child.parent_id (which clears the cached 
entry), then inserts the originally referenced parent back into the internal 
cache.)
This workaround greatly resembles the statement from days of old: 
(“child.parent_id = child.parent.id”), but has the additional benefit of 
keeping our in-memory reference to parent (and to all of the objects that 
parent itself might be referencing as well), and avoiding lazy reads in the 
future.

Our suggested change
——————————————————
After this very long-winded explanation, we are proposing a small change that 
makes the workaround unnecessary while still avoiding the potential performance 
hit from lazy queries.
Let’s look back to our friend _prepare_related_fields_for_save().
We argue there is a STRONG belief that preparing an object for saving to the 
database should not substantially alter the content of that object.  Yet this 
method does not adhere to that belief:  While the results saved to the database 
are correct, it has altered the in-memory object by discarding the internal 
reference to a parent object that the application level code has carefully set.
In general, we like what _prepare_related_fields_for_save() is doing — we like 
the fact that it’s reconciling the difference between the copy of the 
referenced object’s primary key (child.parent_id) and the in-memory reference 
(child.parent.pk).  We agree that this method is doing the right thing by using 
the primary key value from the referenced object.  We simply do not think that 
the parent instance should be removed from the internal cache as part of this 
change.
Instead of resolving this difference in keys by executing:
    setattr(self, field.attname, obj.pk)
We believe it should instead execute:
    setattr(self, field.name, obj)
This would effectively replace the pseudo-code
    child.parent_id = child.parent.pk
with
    child.parent = child.parent
It replaces the value of child.parent_id with child.parent.pk, but without the 
side effect of discarding the in-memory relation of the child and parent 
instances.
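The proposed behavior can be demonstrated with a plain-Python mimic of the two 
descriptors involved.  The class names and the cache key are illustrative 
stand-ins, not Django’s actual classes:

```python
# Illustrative plain-Python mimic of the two descriptors; not Django's code.
class FKIdAttribute:
    """Mimics the 'parent_id' descriptor: changing the id drops the cache."""
    def __set__(self, instance, value):
        if (instance.__dict__.get("parent_id") != value
                and "_cached_parent" in instance.__dict__):
            del instance.__dict__["_cached_parent"]
        instance.__dict__["parent_id"] = value

class FKAttribute:
    """Mimics the 'parent' descriptor: sets the id, then re-caches the object."""
    def __set__(self, instance, value):
        instance.parent_id = value.pk            # clears any stale cache entry
        instance.__dict__["_cached_parent"] = value

class Parent:
    def __init__(self, pk=None):
        self.pk = pk

class Child:
    parent_id = FKIdAttribute()
    parent = FKAttribute()

parent = Parent(pk=10)
child = Child()
child.__dict__["_cached_parent"] = parent   # as set by Child(parent=parent)
child.parent = parent                       # the proposed setattr(field.name, obj)
# child.parent_id is now 10 AND the parent instance is still cached
```

Setting the relation (field.name) rather than the id (field.attname) ends with 
both values in sync and the in-memory reference intact.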

Finally, there is a small performance improvement that could be made to this 
suggestion.
If we look at what the current
    setattr(self, field.attname, obj.pk)
actually does, we have to look at the __set__() method of the class 
ForeignKeyDeferredAttribute in django/db/models/fields/related_descriptors.py.
The code in this method is:
def __set__(self, instance, value):
    if instance.__dict__.get(self.field.attname) != value and self.field.is_cached(instance):
        self.field.delete_cached_value(instance)
    instance.__dict__[self.field.attname] = value
It does what we expect:  If the “parent_id” value is changing and there is a 
cached in-memory reference, then clear the cached entry before setting the 
integer value.
But note that the value of the “parent_id” attribute is really stored in the 
dictionary entry instance.__dict__[self.field.attname].
Our proposed change back in _prepare_related_fields_for_save() has the goal of 
setting the value of the integer “parent_id” while preserving the cached 
in-memory reference to the parent.
Instead of doing
    setattr(self, field.name, obj)
(which accomplishes the goal by essentially clearing the cache, then restoring 
it), we could just do this:
    self.__dict__[field.attname] = obj.pk
It is a slight denormalization of the logic that manages the value of the 
deferred foreign key attribute, but it offers a tiny performance improvement by 
not removing and then replacing a dictionary entry.
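Continuing with an illustrative mimic (again, a stand-in rather than Django’s 
actual classes), the direct dictionary write sets the id without ever touching 
the cache entry:

```python
# Illustrative mimic; the descriptor and cache key are stand-ins, not Django's.
class FKIdAttribute:
    """Changing the id through the descriptor drops the cached relation."""
    def __set__(self, instance, value):
        if (instance.__dict__.get("parent_id") != value
                and "_cached_parent" in instance.__dict__):
            del instance.__dict__["_cached_parent"]
        instance.__dict__["parent_id"] = value

class Child:
    parent_id = FKIdAttribute()

child = Child()
cached = object()                     # pretend this is the parent instance
child.__dict__["_cached_parent"] = cached

# The proposed shortcut: bypass the descriptor entirely.
child.__dict__["parent_id"] = 10

# The id is set and the cache entry was never removed or replaced
```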

What are WE doing about this problem?
—————————————————————
We’re going to be staying with Django 3.2 until the next long-term support 
release is available, so this change has no immediacy for our particular 
application.  However, we think it is a nice extension that removes a 
non-obvious behavior in Django 3.2 (the .save() method can sometimes discard 
in-memory references to other ORM objects) and seems like it enhances the magic 
of the ORM framework.
In the meantime, we’ll be plugging non-intuitive lines of code that look like
    child.parent = child.parent
into about fifty locations throughout our app, as well as training a team of 20 
server-side developers to watch out for this edge case in any future code they 
write.  It’s all doable, and we’ll do it… but we really like Django because it 
does “automagically” handle so many little database oddities and nuances.  Over 
time, we’d like to get rid of “surprise” code like these lines that an 
inexperienced developer might remove because they look like they do nothing.

IF YOU’VE REACHED THE END
—————————————————
Thank you!  I’m amazed you stuck through it all.
What do you think?
