Hi Barry,
Based on this I'm cautiously in favour. You set out a good case; it's much
easier to understand the "why" here than in the original ticket.
I'd avoid the extra optimisation of accessing __dict__ directly though - if
__set__ gains extra functionality, I'd prefer not to accidentally miss it
here.
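
To illustrate the concern, here is a minimal sketch in plain Python. It is not Django's actual descriptor; Tracked and Child are made-up names, and the "extra functionality" is just a log of assignments. It shows that a write straight into __dict__ never reaches __set__:

```python
class Tracked:
    """Hypothetical data descriptor that records every value set through __set__."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        # Stand-in for any extra work __set__ might gain in the future.
        instance.__dict__.setdefault("_log", []).append(value)
        instance.__dict__[self.name] = value


class Child:
    parent_id = Tracked()


via_descriptor = Child()
via_descriptor.parent_id = 1        # normal assignment: goes through Tracked.__set__

via_dict = Child()
via_dict.__dict__["parent_id"] = 1  # direct write: Tracked.__set__ never runs

print(via_descriptor.__dict__.get("_log"))  # [1]
print(via_dict.__dict__.get("_log"))        # None
```

Both instances read back parent_id == 1, so the missed __set__ call would be easy to overlook.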
Ian
On Fri, 15 Oct 2021 at 07:49, Barry Johnson wrote:
> Trac ticket #33191 was recently closed as a “wontfix”, but Carlton
> encouraged bringing the matter to this group. The actual issue is a small
> one, but it seems that Django could do better than it is doing today with
> one small change, and avoid the need for a counter-intuitive "x = x"
> statement added to application code.
>
> The claim is that Django is unnecessarily clearing the cached entry for an
> in-memory related object at a time when that cache does not need to be
> cleared, and that this cache clearing can result in unwanted lazy reads
> down the line. A one-line fix for the problem is suggested.
>
> Many apologies in advance for the extreme length of this post. The
> requested change is small and subtle (although quite beneficial in some
> cases); we must dive deep into code to understand WHY it is important.
>
>
>
> Background
> ----------
>
> My team is developing a major retail point of sale application; it’s
> currently live in nearly 1,000 stores and we’re negotiating contracts that
> could bring it to another 10,000 stores across the US over the next three
> to five years before going world-wide. The backend is Django and
> PostgreSQL.
>
> One of the typical problems we face has to do with importing data from
> existing systems. This data usually comes to us in large sets of flat
> ASCII files, perhaps a gigabyte or few at a time. We have to parse this
> incoming data and fill our tables as quickly as possible. It’s not unusual
> for one such data import to create tens of millions of rows across more
> than a hundred tables. Our processes to do this are working reasonably
> well.
>
> In many cases, the incoming stream of data generates related records in
> several tables at a time. For example, our product import populates
> related data across a dozen tables organized into parent/child hierarchies
> that are up to four levels deep. The incoming data is grouped by product
> (not by functional data type); the rows and columns of the records for a
> single product will create ORM objects scattered across those dozen ORM
> models. The top-level model, Product, has four child models with
> references to Product; and each of those child models may have other child
> tables referring back to THOSE models, etc.
>
> If the database schema sounds surprisingly complex, it is. Big box retail
> is more complex than most people realize. Each store typically carries
> 30,000 to 70,000 individual products; some carry even more. Some of the
> retail chains to which we market our systems have more than $1 billion USD
> per year in sales.
>
> To be efficient, we process this incoming data in chunks: We may
> instantiate ORM objects representing, say, 5,000 products and all of their
> related child objects. For example, we’ll create a new instance of a
> Product model, then instantiate the various children that reference that
> product. So a very typical pattern is something like
>
> product = Product(**values)
> pv = ProductVariant(product=product, **more_values)
> upc = UPC(product_variant=pv, **upc_values)
>
> It’s not that simple, of course; we’re reading in sequential data that
> generates multiple instances of the children in loops, etc., but
> essentially building a list of products and the multi-level hierarchy of
> child objects that depend upon each of those products. We then use
> bulk_create() to create all of the top-level products in one operation,
> then bulk_create() the various lists of first-level children, then
> bulk_create() the children of that second level, etc. LOTS of bulk creates.
>
>
>
> Prior to Django 3.2, we had to explicitly set the associated “_id” field
> for each child’s reference to a parent after the list of parents was
> bulk_created. Using our examples above, we would see things like:
>
> Product.objects.bulk_create(list_of_products)
>
> # Copy each parent's new primary key into the child's "_id" field.
> for variant in list_of_product_variants:
>     variant.product_id = variant.product.id
> ProductVariant.objects.bulk_create(list_of_product_variants)
>
> for upc in list_of_upc_entries:
>     upc.product_variant_id = upc.product_variant.id
> UPC.objects.bulk_create(list_of_upc_entries)
>
> […]
>
> Again, this is somewhat simplifying the code, but the key takeaway is that
> older versions of Django required us to manually pick up the primary key
> value from recently created instances and set the associated “_id” value in
> each instance pointing at the recently created objects.
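
For readers without a Django project to hand, the same pre-3.2 pattern can be sketched with plain-Python stand-ins. Everything here is an assumption made for illustration: the dataclasses only mimic the shape of the models, fake_bulk_create is a hypothetical helper that does nothing but hand out primary keys, and the field names other than product and product_id are invented.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_next_pk = count(1)  # hypothetical source of primary keys


@dataclass
class Product:
    name: str
    id: Optional[int] = None


@dataclass
class ProductVariant:
    product: Product
    sku: str
    product_id: Optional[int] = None  # mirrors the "_id" column of a foreign key
    id: Optional[int] = None


def fake_bulk_create(objs):
    """Stand-in for a bulk insert: assign a primary key to each object."""
    for obj in objs:
        obj.id = next(_next_pk)
    return objs


# Build parents and children in memory first; no primary keys exist yet.
products = [Product("widget"), Product("gadget")]
variants = [
    ProductVariant(product=p, sku=f"{p.name}-{size}")
    for p in products
    for size in ("S", "L")
]

fake_bulk_create(products)

# The manual step: copy each parent's new key into the child's "_id" field.
for variant in variants:
    variant.product_id = variant.product.id

fake_bulk_create(variants)
```

Until the copy loop runs, every variant.product_id is still None even though variant.product already has its id, which is why the manual step was needed.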
>
> *As expected, setting the “_id” value of a foreign key field clears the
> internal cache entry containing the reference to the parent object.* We
> would fully expect that if we said