Proposed change in ORM model save logic
Trac ticket #33191 was recently closed as a "wontfix", but Carlton encouraged bringing the matter to this group. The actual issue is a small one, but it seems that Django could do better than it does today with one small change, avoiding the need for a counter-intuitive "x = x" statement in application code. The claim is that Django is unnecessarily clearing the cached entry for an in-memory related object at a time when that cache does not need to be cleared, and that this cache clearing can result in unwanted lazy reads down the line. A one-line fix for the problem is suggested.

Many apologies in advance for the extreme length of this post. The requested change is small and subtle (although quite beneficial in some cases); we must dive deep into code to understand WHY it is important.

Background
———

My team is developing a major retail point of sale application; it's currently live in nearly 1,000 stores, and we're negotiating contracts that could bring it to another 10,000 stores across the US over the next three to five years before going worldwide. The backend is Django and PostgreSQL.

One of the typical problems we face has to do with importing data from existing systems. This data usually comes to us in large sets of flat ASCII files, perhaps a gigabyte or a few at a time. We have to parse this incoming data and fill our tables as quickly as possible. It's not unusual for one such data import to create tens of millions of rows across more than a hundred tables. Our processes to do this are working reasonably well.

In many cases, the incoming stream of data generates related records in several tables at a time. For example, our product import populates related data across a dozen tables organized into parent/child hierarchies that are up to four levels deep. The incoming data is grouped by product (not by functional data type); the rows and columns of the records for a single product will create ORM objects scattered across those dozen ORM models. The top-level model, Product, has four child models with references to Product; and each of those child models may have other child tables referring back to THOSE models, etc.

If the database schema sounds surprisingly complex, it is. Big box retail is more complex than most people realize. Each store typically carries 30,000 to 70,000 individual products; some are even larger. Some of the retail chains to which we market our systems have more than $1 billion USD per year in sales.

To be efficient, we process this incoming data in chunks: we may instantiate ORM objects representing, say, 5,000 products and all of their related child objects. For example, we'll create a new instance of a Product model, then instantiate the various children that reference that product. So a very typical pattern is something like:

    product = Product(**values)
    pv = ProductVariant(product=product, **more_values)
    upc = UPC(product_variant=pv, **upc_values)

It's not that simple, of course; we're reading in sequential data that generates multiple instances of the children in loops, etc., but we are essentially building a list of products and the multi-level hierarchy of child objects dependent upon each of those products. We then use bulk_create() to create all of the top-level products in one operation, then bulk_create() the various lists of first-level children, then bulk_create() the children of that second level, etc. LOTS of bulk creates.
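In sketch form, one chunked pass looks something like this (parse_chunk and the record attributes are hypothetical stand-ins for our real import code):

    products, variants = [], []
    for rec in parse_chunk(flat_file, size=5000):  # hypothetical parser
        product = Product(**rec.product_fields)
        products.append(product)
        variants.extend(
            ProductVariant(product=product, **vf) for vf in rec.variant_fields
        )

    Product.objects.bulk_create(products)         # parents first; PostgreSQL hands back the new pks
    ProductVariant.objects.bulk_create(variants)  # then each level of children in turn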
Prior to Django 3.2, we had to explicitly set the associated "_id" field for each child's reference to a parent after the list of parents was bulk_created. Using our examples above, we would see things like:

    bulk_create(list_of_products)
    for variant in list_of_product_variants:
        variant.product_id = variant.product.id
    bulk_create(list_of_product_variants)
    for upc in list_of_upc_entries:
        upc.product_variant_id = upc.product_variant.id
    bulk_create(list_of_upc_entries)
    [...]

Again, this is somewhat simplifying the code, but the key takeaway is that older versions of Django required us to manually pick up the primary key value from recently created instances and set the associated "_id" value in each instance pointing at the recently created objects.

As expected, setting the "_id" value of a foreign key field clears the internal cache entry containing the reference to the parent object. We would fully expect that if we said "variant.product_id = 10", then "variant.product" would be considered "not yet loaded", and a reference to variant.product would generate a lazy read. We don't disagree with that interpretation or behavior, and that is the existing behavior that Carlton implicitly referenced when he closed the ticket in question as "wontfix". But…

Relatively Recent Django Changes
——

In the Django 3.2 timeframe, Ticket #29497 ("Saving parent object after setting on child leads to an unexpected data loss…")
Re: Proposed change in ORM model save logic
On Friday, October 15, 2021 at 12:39:52 PM UTC-5 pyt...@ian.feete.org wrote:

> You set out a good case, it's much easier to understand the "why" here than in the original ticket.

Yes. It was hard to compress the matter down small enough to reasonably fit in a Trac entry, so I skipped the vast majority of the rationale. Sorry that it took that much writing to make the case, but I had to set the groundwork first. On later consideration, the two most compelling points seem to be:

* Internal machinations of Django shouldn't change object relationships that the application-level code has set
* save() shouldn't have different effects (-sometimes- uncaching the reference) depending on whether the parent object is inserted versus updated

> I'd avoid the extra optimisation of accessing __dict__ directly though - if __set__ gains extra functionality, I'd prefer not to accidentally miss it here.

Agreed! Not being sure just how widely spread accessing the underlying __dict__ is throughout the code, I wasn't sure which way to go. Although it may be a -measurable- improvement in performance, it's minuscule.

baj
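For readers following along, the two variants being discussed look roughly like this (a sketch only, assuming the assignment happens where the child copies the parent's primary key during save — Model._prepare_related_fields_for_save() in current Django):

    # Go through the field descriptor, so any future __set__ behaviour still runs:
    setattr(child, field.attname, parent.pk)

    # The rejected micro-optimisation: write the instance dict directly,
    # bypassing the descriptor entirely:
    child.__dict__[field.attname] = parent.pk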
Re: Migrations: A bug or a feature needing documentation?
On Wednesday, August 7, 2019 at 2:17:45 PM UTC-5, Adam Johnson wrote:

> Some questions: Have you looked into the migration framework internals? Are there any comments around "replaces" that indicate this behaviour? And is there maybe a way of hooking into the migration planner, or adding the ability, to support your use case without relying on the current bug-feature?

[Looks quick...] Interesting! There are two subtle points in the source that show the author intentionally planned for this case, both in a method named "remove_replaced_nodes()" that, ahem, removes the replaced migration nodes from the dependency graph if the conditions are right for replacement.

The docstring for that method begins with the sentence "Remove each of the 'replaced' nodes (when they exist)." And within the main loop of that method, the code does a "pop" to remove the replaced node, deliberately not failing if that replaced node does not appear in the graph:

    replaced_node = self.node_map.pop(replaced_key, None)
    if replaced_node:

The purpose of this particular method is two-fold: it removes the replaced node, yes, but just as importantly it rewires the dependencies to and from this replaced node. Any children of the replaced node become children of the replacement, and any parents that point at the replaced node become parents of the replacement.

So deliberate use of this approach does require careful understanding of the dependency tree (which really isn't a surprise). Interesting.
Re: Migrations: A bug or a feature needing documentation?
On Wednesday, August 7, 2019 at 9:22:27 PM UTC-5, Markus Holtermann wrote:

> Let's look at the last question first, regarding duplicate entries in the django_migrations table: Yes, this seems to be a bug. At least how it's currently used.

Agreed.

> Let's say you have migration foo.0001_initial and apply it. You then have (foo, 0001_initial) in the django_migrations table. You now create migration foo.0002_something and also add a squashed migration foo.0001_initial_squashed_0002_something. When you now run migrate, Django will apply foo.0002_something and your database will have (foo, 0001_initial), (foo, 0002_something) as well as (foo, 0001_initial_squashed_0002_something). So far so good. That's all as expected.
>
> If you now remove foo.0001_initial and foo.0002_something from your filesystem and remove the replaces section in foo.0001_initial_squashed_0002_something, it is as if Django never knew about foo.0001_initial or foo.0002_something. You can add new migrations; everything works the way it should. However, if you were to add e.g. foo.0002_something again, Django would treat it as already applied, despite it being somewhere later in your migration graph. At this point, I don't think this is the intended behavior. That said, I'm inclined to say that applying a squashed migration should "unrecord" all migrations it replaces. I've not yet thought too much about the "fallout" (backwards compatibility, rollback of migrations, ...). But at least with regards to migrating forwards, this seems to be the right behavior.

To date, the migration system has been remarkably tolerant of spurious entries in the django_migrations table, and the squashing process (to date) does indeed leave breadcrumbs of since-deleted migrations behind.

Agree completely with your point about adding a subsequent migration with a name that exactly matches a previously created migration that has been removed. For all practical purposes, migration names should be unique over time, or the developer should ensure that the second incarnation of that name is a functional replacement. We have, in the past, replaced the contents of the "0001_initial" migration in most of our apps, but it is incumbent on us to make sure that's correct.

> Regarding your second point around "replaces" and merging migrations: I think this will lead to inconsistencies in your migration order, thus potentially causing trouble down the line. [...] I suspect that two data migrations could easily conflict or result in inconsistent data if applied in the wrong order. For example, one data migration adding new records to a table, and another one ensuring that all values in a column are in upper case. If you apply both migrations in that order (insert and then ensure uppercase) you can be certain that all values will be uppercase. If you, however, first ensure uppercase and then insert additional values, you need to make sure that the data in the second migration is properly formatted.

Oh, certainly agree, Markus, about the dangers of replacing migrations. That's true across the board, even in the documented use cases. The replacement steps MUST properly account for elidable data migrations. In the normal use case (the typical migration squashing process), the replacement step truly replaces the original step, and is executed at that original spot within the dependency graph. (It can seem as if that replacement is moved "back in time", as if history were rewritten.)
That means that any migrations subsequent to the original step will remain subsequent to the replacement, which works.

In the use case where the original does not exist, the replacement step remains where it is in the dependency graph. It must, because there's nowhere earlier it could be placed. Could this introduce an inconsistency, especially with data migrations? Yes, if the developer is not careful. Using your example of an insertion of lowercase data followed by a conversion to uppercase data: in one branch the insertion happens first; in a different branch (where the insertion is run as a replacement step) they could be run in a different order. That would indeed appear to be one of the dangers of moving a database from one branch of code to another, and (as things are today) the developer must understand the migration paths.

Hmmm. Keeping all this in mind, I'm beginning to believe there is a procedural method to solve the parallel-development problem: just before creating a long-lived branch (such as a production release that may get patches), create an empty migration step in all apps. Later, if patch migrations become necessary in that branch, the mainline trunk can "replace" the empty step with an equivalent migration. At least, that will give us an order of execution that more closely resembles what the product
Summary Re: Migrations: A bug or a feature needing documentation?
On Monday, August 5, 2019 at 4:55:24 PM UTC-5, Barry Johnson wrote:

> [ TL;DR: A migration may use a "replaces" list pointing to migrations that don't actually exist. This undocumented technique cleanly solves a recurring difficult migration problem. We seek consensus on whether this should become a documented feature or whether it is an unexpected side effect subject to change in the future. ]

Earlier this week, I posted a question about an undocumented behavior in the migration system, wondering if it should be formally documented and supported. The use case relates to migrating a database from one branch of a project to another when different migrations were created in both branches. I believe this is a use case that the migration system was not (and is not) intended to solve, yet it seems to be a real-world problem.

Summary of responses: After discussion, I think that -our- future approach will be more procedural: we'll deliberately create an empty migration *just before* forking a parallel branch that may require migrations and will eventually require migrating tenants from that branch back to our mainline code. We will then use that empty migration as a placeholder, and if necessary create one or more migrations that replace the empty step to handle any parallel schema changes. It might be wise to document this practice as a suggestion for others faced with this use case.

We will still have a case where multiple migrations would each indicate that they replace one of those empty migrations, perhaps caused by multiple schema changes being applied to the "patch" fork. We've seen that behavior create apparent duplicate entries in the django_migrations database table. But the migration system loads that table into a set containing (app, name) tuples, so the duplicate entries aren't hurting anything at present.

Furthermore, there is nothing that deliberately ties the list of previously applied migrations (that set of (app, name) tuples) back to migrations that exist. Entries in the table for migrations that no longer exist are a side effect of the current squashing logic. They can eventually cause the django_migrations table to contain many more rows than necessary; perhaps it would be useful to have a migration tool or operation that clears out those obsolete records. Or, as Markus has suggested, it may be wise to have the squashing process clean up after itself by removing "replaced" entries instead of leaving them behind.

I'll work up a documentation change PR that people can review and accept, modify or throw away. Thank you!

baj
- Barry Johnson
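For anyone wanting to try the placeholder approach, a minimal sketch (the app label, migration names and the later operation are all hypothetical):

    # shop/migrations/0042_branch_placeholder.py — created just before forking
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("shop", "0041_previous")]
        operations = []  # deliberately empty

    # shop/migrations/0043_patch_equivalent.py — added on the mainline later,
    # mirroring whatever schema changes shipped as patches on the branch
    from django.db import migrations, models

    class Migration(migrations.Migration):
        replaces = [("shop", "0042_branch_placeholder")]
        dependencies = [("shop", "0041_previous")]
        operations = [
            migrations.AddField("product", "patched_flag", models.BooleanField(default=False)),
        ]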
Re: Model-level validation
I agree with James in several ways. Our large Django application does rather extensive validation of data -- but I would argue strongly against embedding that validation in the base instance.save() logic. (I would not argue against Django including a "ValidatingModel", derived from Model, that automatically runs defined validations as part of save(). Then developers could choose which base they'd like to subclass when designing their objects. Of course, anyone could simply create their own "ValidatingModel" class and derive everything from that class.)

Reason 1 is that business logic validation often requires access to multiple model instances -- for example, uniqueness across a set of objects about to be updated (e.g., "Only one person can be marked as the primary contact"). Or internal consistency: "If this record is of this type, then it cannot have any children of that type". Or even referential integrity in some cases: "The incoming data has a code that serves as the primary key in some other table. Make sure that primary key exists."

Yes, you can encode all of those cross-instance validations into an instance-level check, but that brings us to the second point: performance. There are a number of types of validations that are best served by operating on sets or lists of instances at a time. Again, consider a referential integrity validation: if I'm about to bulk_create 5,000 instances but need to confirm that the "xyz code" is valid for all of them, then I should run one query selecting from the "xyz table" for all of the codes referenced within the 5,000 items instead of doing 5,000 individual lookups within that table. Yes, one can maintain and access caches of known-valid things, but those are awkward to manage from within the Model layer.

It's particularly difficult to write performant validations within the model when you're using .only() or .defer(). Unless the validation logic is able to detect that certain properties haven't been loaded from the database, it would trigger extra queries retrieving values from the database solely for the purpose of validating that they are still correct (even though you aren't changing them).

Also on the performance front, there are times when removing the extra layer of validation is necessary and appropriate. With well-tested code, once the incoming data has been validated and the transformation/operational logic is considered fully tested and accurate, avoiding a second validation on the outbound data can result in a significant performance improvement. If you're dealing with millions or billions of records at a time (as we do during data conversions), those significant performance improvements are worthwhile.

Finally, Django supports the queryset .update() method. Validations that run within the model instance won't even HAVE instances when using .update() -- the queryset manager would need to figure out how to do the necessary validation (and if it's a multi-field validation, good luck!). There are also cases where the use of raw SQL is appropriate, and one obviously cannot lean on instance-level validation in that case.

Validation is indeed important -- but testing the validity of data belongs in the business logic layer, not in the database model layer. Agreed that some types of validations can easily be encoded into the database model, but then you find yourselves writing two layers of validation ("one simple, the other more sophisticated")... and that makes things even more complex.
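For what it's worth, the opt-in base class mentioned at the top could be as small as this sketch (assuming full_clean() covers the validations one wants to run on save):

    from django.db import models

    class ValidatingModel(models.Model):
        """Opt-in base class: validate automatically as part of save()."""

        class Meta:
            abstract = True

        def save(self, *args, **kwargs):
            self.full_clean()  # raises ValidationError before touching the database
            super().save(*args, **kwargs)

And the set-based referential check described above is the usual "one query instead of 5,000" pattern (XyzCode and xyz_code are hypothetical names):

    codes = {item.xyz_code for item in incoming_items}
    known = set(XyzCode.objects.filter(code__in=codes).values_list("code", flat=True))
    missing = codes - known  # invalid references, found with a single query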
We do indeed use the model-level validations for single-field validations... but we invoke those validations from our business logic at the proper time, not during the time we're saving the data to the database.

baj
Barry Johnson
Epicor

On Thursday, October 6, 2022 at 2:47:19 AM UTC-5 James Bennett wrote:

> I see a lot of people mentioning that other ORMs do validation, but not picking up on a key difference:
>
> Many ORMs are designed as standalone packages. For example, in Python SQLAlchemy is a standalone DB/ORM package, and other languages have similar popular ORMs.
>
> But Django's ORM isn't standalone. It's tightly integrated into Django, and Django is a web framework. And once you focus *specifically* on the web framework use case, suddenly things start going differently.
>
> For example: data on the web is "stringly-typed" (effectively, since HTTP doesn't really have data types) and comes in via HTML's form mechanism or other string-y formats like JSON or XML payloads. So you need not just data *validation*, but data *conversion* which works for the web use case.
>
> And since the web use case…
Re: Inconsistent pagination when sorting by non-unique columns (ticket 34251)
I'd be very much -1 if the framework modified my sorting parameters (by adding 'pk' at the end of the ordering). As mentioned, it can certainly break queries that would otherwise be handled by a covering index... but mostly because 'pk' isn't the field we would prefer to use in this case anyway. For example, when sorting a list of products, the final parameter ought to be the product's stocking number; when looking at items by store, the final parameter needs to be both stocking number and store. There wouldn't seem to be -any- meaningful "always right" answer.

We have something like 20,000 tests running today (and will probably hit 40,000 over time), many of which do pagination and sorting. I'd really, really hate to have to go fix all of those tests -- it would be thousands of them. If there were a way to disable this proposed warning, I could live with that for a while.

baj
- Barry Johnson

On Wednesday, January 11, 2023 at 6:55:25 AM UTC-6 tatari...@gmail.com wrote:

> As described in https://code.djangoproject.com/ticket/34251, some database backends (incl. PostgreSQL) will produce non-deterministic ordering when sorting by columns with non-unique values. This leads to inconsistent pagination, which I argue falls under the same category as paginating by an unordered QuerySet.
>
> I suggest we add a warning similar to the UnorderedObjectListWarning when a totally deterministic ordering is not specified (PK or unique/unique_together not-null fields).
>
> The consequence of this change is that *a lot* of projects will receive a warning. This issue is easy to overlook, since non-unique columns like "name" can have mostly unique values, and you need duplicates at the page borders to get into trouble. While seemingly hard to reproduce, I believe it impacts a lot of production systems out there.
>
> Correctly handling it requires some thought, especially when exposing sorting options to the end user. Should 3rd-party tools like django-sortable-listview <https://github.com/aptivate/django-sortable-listview> or DRF <https://github.com/encode/django-rest-framework> (OrderingFilter) automatically enforce the correct ordering by adding "pk"? This brings problems with indexing, as in this ticket <https://code.djangoproject.com/ticket/29943>. Another option for these tools is adding *pk* or another unique field to the query string when specifying the desired sorting, which looks just wrong to me, pushing responsibility to the front end.
>
> What do you folks think? I've been bitten by this issue several times, in a production environment after running smoothly for a while. The first time was a complete surprise to me, and I believe it may surprise quite a lot of developers as well, so adding a warning will at least bring attention to it.
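To illustrate the kind of domain-specific tiebreaker argued for above (model and field names are hypothetical), the deterministic orderings the application actually wants might be:

    # Deterministic, covering-index-friendly orderings chosen by the application,
    # rather than a framework-appended "pk":
    Product.objects.order_by("name", "stocking_number")
    StoreItem.objects.order_by("name", "stocking_number", "store_id")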
Async QuerySet with Prefetch
For those working on async code: I just opened a question on the Django Internals Async forum regarding querysets with prefetches, which aren't currently supported in the alpha build of 4.2. A solution seemed so surprisingly easy that I'm wondering what the real experts in the area would think.

https://forum.djangoproject.com/t/release-4-2a-async-querysets-with-prefetch-related/18767

baj