Proposed change in ORM model save logic

2021-10-14 Thread Barry Johnson
Trac ticket #33191 was recently closed as a “wontfix”, but Carlton encouraged 
bringing the matter to this group.  The actual issue is a small one, but it 
seems that, with one small change, Django could do better than it does today 
and avoid the need for a counter-intuitive "x = x" statement added to 
application code.
The claim is that Django is unnecessarily clearing the cached entry for an 
in-memory related object at a time when the cache does not need to be cleared, 
and that this cache clearing can result in unwanted lazy reads down the line.  A 
one-line fix for the problem is suggested.
Many apologies in advance for the extreme length of this post.  The requested 
change is small and subtle (although quite beneficial in some cases); we must 
dive deep into code to understand WHY it is important.

Background
———
My team is developing a major retail point of sale application; it’s currently 
live in nearly 1,000 stores and we’re negotiating contracts that could bring it 
to another 10,000 stores across the US over the next three to five years before 
going world-wide.  The backend is Django and PostgreSQL.
One of the typical problems we face has to do with importing data from existing 
systems.  This data usually comes to us in large sets of flat ASCII files, 
perhaps a few gigabytes at a time.  We have to parse this incoming data and 
fill our tables as quickly as possible.  It's not unusual for one such data 
import to create tens of millions of rows across more than a hundred tables.  
Our processes to do this are working reasonably well.
In many cases, the incoming stream of data generates related records in several 
tables at a time.  For example, our product import populates related data 
across a dozen tables organized into parent/child hierarchies that are up to 
four levels deep.  The incoming data is grouped by product (not by functional 
data type); the rows and columns of the records for a single product will 
create ORM objects scattered across those dozen ORM models.  The top-level 
model, Product, has four child models with references to Product; and each of 
those child models may have other child tables referring back to THOSE models, 
etc.
If the database schema sounds surprisingly complex, it is.  Big box retail is 
more complex than most people realize.  Each store typically carries 30,000 to 
70,000 individual products; some are even larger.  Some of the retail chains to 
which we market our systems have more than $1 billion USD per year in sales.
To be efficient, we process this incoming data in chunks:  We may instantiate 
ORM objects representing, say, 5,000 products and all of their related child 
objects.  For example, we’ll create a new instance of a Product model, then 
instantiate the various children that reference that product.  So a very 
typical pattern is something like
    product = Product(**values)
    pv = ProductVariant(product=product, **more_values)
    upc = UPC(product_variant=pv, **upc_values)
It’s not that simple, of course; we’re reading in sequential data that 
generates multiple instances of the children in loops, etc., but essentially 
building a list of products and the multi-level hierarchy of child objects that 
depend upon each of those products.  We then use bulk_create() to create all 
of the top-level products in one operation, then bulk_create() the various 
lists of first-level children, then the second-level children, etc.  LOTS of 
bulk creates.
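In rough outline, the pattern looks something like this (model names are 
illustrative, and parse_chunk() is a hypothetical helper, not real code):

    products, variants = [], []
    for rec in parse_chunk(flat_file, size=5000):   # parse_chunk() is hypothetical
        product = Product(**rec.product_fields)
        products.append(product)
        variants.extend(
            ProductVariant(product=product, **fields) for fields in rec.variants
        )
    Product.objects.bulk_create(products)
    ProductVariant.objects.bulk_create(variants)
    # ...then each deeper level of children, with its own bulk_create()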

Prior to Django 3.2, we had to explicitly set the associated "_id" field for 
each child's reference to a parent after the list of parents was bulk_created. 
Using our examples above, we would see things like:
    Product.objects.bulk_create(list_of_products)
    for variant in list_of_product_variants:
        variant.product_id = variant.product.id
    ProductVariant.objects.bulk_create(list_of_product_variants)
    for upc in list_of_upc_entries:
        upc.product_variant_id = upc.product_variant.id
    UPC.objects.bulk_create(list_of_upc_entries)
    […]
Again, this is somewhat simplifying the code, but the key takeaway is that 
older versions of Django required us to manually pick up the primary key value 
from recently created instances and set the associated “_id” value in each 
instance pointing at the recently created objects.
As expected, setting the “_id” value of a foreign key field clears the internal 
cache entry containing the reference to the parent object.  We would fully 
expect that if we said “variant.product_id = 10”, then “variant.product” would 
be considered “not yet loaded”, and a reference to variant.product would 
generate a lazy read.  We don’t disagree with that interpretation or behavior, 
and that is the existing behavior that Carlton implicitly referenced when he 
closed the ticket in question as “wontfix”.
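In miniature:

    variant.product = product   # caches the related object on the instance
    variant.product_id = 10     # assigning the raw FK value clears that cache
    variant.product             # ...so this now triggers a lazy database read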
But…

Relatively Recent Django Changes
——
In the Django 3.2 timeframe, Ticket #29497 (“Saving parent object after setting 
on child leads to an unexpected data loss”) […]

Re: Proposed change in ORM model save logic

2021-10-16 Thread Barry Johnson
On Friday, October 15, 2021 at 12:39:52 PM UTC-5 pyt...@ian.feete.org wrote:

> You set out a good case, it's much easier to understand the "why" here 
than in the original ticket.

Yes.  It was hard to compress the matter down small enough to reasonably 
fit in a Trac entry, so I skipped the vast majority of the rationale.  Sorry 
that it took this much writing to make the case, but I had to lay the 
groundwork first.

On later consideration, the two most compelling points seem to be:
* Internal machinations of Django shouldn't change object relationships 
that the application-level code has set
* save() shouldn't have different effects (-sometimes- uncaching the 
reference) if the parent object is inserted versus updated

> I'd avoid the extra optimisation of accessing __dict__ directly though - 
if __set__ gains extra functionality, I'd prefer not to accidentally miss 
it here.

Agreed!  Not being sure how widespread direct access to the underlying 
__dict__ is throughout the code, I wasn't sure which way to go.  Although 
it may be a -measurable- improvement in performance, it's minuscule.
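For anyone skimming later, the distinction is roughly this (hypothetical 
names; a sketch of the trade-off, not the proposed patch itself):

    # Normal assignment goes through the field descriptor, so whatever
    # logic __set__ has now -- or gains later -- runs:
    setattr(child, "parent_id", parent.pk)

    # The micro-optimisation being discouraged: writing straight into
    # __dict__ bypasses the descriptor protocol entirely:
    child.__dict__["parent_id"] = parent.pk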

baj



Re: Migrations: A bug or a feature needing documentation?

2019-08-08 Thread Barry Johnson


On Wednesday, August 7, 2019 at 2:17:45 PM UTC-5, Adam Johnson wrote:
>
> Some questions: Have you looked into the migration framework internals? 
> Are there any comments around "replaces" that indicate this behaviour? And 
> is there maybe a way of hooking into the migration planner, or adding the 
> ability, to support your use case without relying on the current 
> bug-feature?
>

[Looks quick...]  Interesting!  
There are two subtle points in the source that show the author 
intentionally planned for this case, both in a method named 
"remove_replaced_nodes()" that, ahem, removes the replaced migration nodes 
from the dependency graph if the conditions are right for replacement.  The 
docstring for that method begins with the sentence "Remove each of the 
'replaced' nodes (when they exist)."  And within the main loop of that 
method, the code does a "pop" to remove the replaced node, deliberately not 
failing if that replaced node does not appear in the graph:
    replaced_node = self.node_map.pop(replaced_key, None)
    if replaced_node:
        […]

The purpose of this particular method is two-fold:  It removes the replaced 
node, yes, but just as importantly it rewires the dependencies to and from 
this replaced node.  Any children of the replaced node now become children 
of the replacement, and any parents that point at the replaced node now 
become parents of the replacement.

So deliberate use of this approach does require careful understanding of 
the dependency tree (which really isn't a surprise).

Interesting.


 



Re: Migrations: A bug or a feature needing documentation?

2019-08-08 Thread Barry Johnson


On Wednesday, August 7, 2019 at 9:22:27 PM UTC-5, Markus Holtermann wrote:
>
> Let's look at the last question first, regarding duplicate entries in the 
> django_migrations table: Yes, this seems to be a bug. At least how it's 
> currently used.
>

Agreed.
 

> Let's say you have migration foo.0001_initial and apply it. You then have 
> (foo, 
> 0001_initial) in the django_migrations table. You now create migration 
> foo.0002_something and also add a squashed migration 
> foo.0001_initial_squashed_0002_something. When you now run migrate, 
> Django will apply foo.0002_something and your database will have (foo, 
> 0001_initial), (foo, 0002_something) as well as (foo, 
> 0001_initial_squashed_0002_something).
> So far so good. That's all as expected. If you now remove foo.0001_initial 
> and foo.0002_something from your filesystem and remove the replaces 
> section in foo.0001_initial_squashed_0002_something, it is as if Django 
> never knew about foo.0001_initial or foo.0002_something. You can add new 
> migrations, everything works the way it should. However, if you were to add 
> e.g. foo.0002_something again, Django would treat it as already applied, 
> despite it being somewhere later in your migration graph.
> At this point, I don't think this is the intended behavior. That said, I'm 
> inclined to say that applying a squashed migration should "unrecord" all 
> migrations it replaces. I've not yet thought too much about the "fallout" 
> (backwards compatibility, rollback of migrations, ...). But at least with 
> regards to migrating forwards, this seems to be the right behavior.
>

To date, the migration system has been remarkably tolerant of spurious 
entries in the django_migrations table, and the squashing process does 
indeed leave breadcrumbs of since-deleted migrations behind.  I agree 
completely with your point about adding a subsequent migration with a name 
that exactly matches a previously created migration that has been removed. 
 For all practical purposes, the migration names should be unique over 
time, or the developer should ensure that the second incarnation of that 
name is a functional replacement.

We have, in the past, replaced the contents of the "0001_initial" migration 
in most of our apps, but it is incumbent on us to make sure that's correct.

> Regarding your second point around "replaces" and merging migrations: I 
> think this will lead to inconsistencies in your migration order, thus 
> potentially causing trouble down the line. [...]  I suspect that two data 
> migrations could easily conflict or result in inconsistent data if applied 
> in the wrong order. For example, one data migration adding new records to a 
> table, and another one ensuring that all values in a column are in upper 
> case. If you apply both migrations in that order (insert and then ensure 
> uppercase) you can be certain that all values will be uppercase. If you, 
> however, first ensure uppercase and then insert additional values, you need 
> to make sure that the data in the second migration is properly formatted.
>

Oh, certainly agree, Markus, about the dangers of replacing migrations. 
 That's true across the board, even in the documented use cases.  The 
replacement steps MUST properly account for elidable data migrations.  

In the normal use case (the typical migration squashing process), the 
replacement step truly replaces the original step, and is executed at that 
original spot within the dependency graph.  (It can seem as if that 
replacement is moved "back in time", as if history were rewritten.)  That 
means that any migrations subsequent to the original step will remain 
subsequent to the replacement, which works.

In the use case where the original does not exist, the replacement step 
remains where it is in the dependency graph.  It must, because there's 
nowhere earlier that it could be placed.  Could this introduce an 
inconsistency, especially with data migrations?  Yes, if the developer is 
not careful.  Using your example of an insertion of lowercase data followed 
by a conversion to uppercase data:  in one branch the insertion happens 
first; in a different branch (where the insertion is run as a replacement 
step) they could be run in a different order.  That would indeed appear to 
be one of the dangers of moving a database from one branch of code to 
another, and it means that (as things are today) the developer must 
understand the migration paths.

Hmmm.

Keeping all this in mind, I'm beginning to believe there is a procedural 
method to solve the parallel-development problem.  Just before creating a 
long-lived branch (such as a production release that may get patches), 
create an empty migration step in all apps.  Later, if patch migrations 
become necessary in that branch, the mainline trunk can supply an 
equivalent migration that "replaces" the empty step.  At least, that will 
give us an order of execution that more closely resembles what the 
product […]

Summary Re: Migrations: A bug or a feature needing documentation?

2019-08-08 Thread Barry Johnson


On Monday, August 5, 2019 at 4:55:24 PM UTC-5, Barry Johnson wrote:
>
> [ TL;DR: A migration may use a “replaces” list pointing to migrations that 
> don’t actually exist.  This undocumented technique cleanly solves a 
> recurring difficult migration problem.  We seek consensus on whether this 
> should become a documented feature or that it is an unexpected side effect 
> subject to change in the future. ]
>


Earlier this week, I posted a question about an undocumented behavior in 
the migration system, wondering if it should be formally documented and 
supported.  The use case relates to migration of a database from one branch 
of a project to another when there were different migrations created in 
both branches.  I believe this is a use case that the migration system was 
not (and is not) intended to solve, yet it seems to be a real-world problem.

Summary of responses: 

After discussion, I think that -our- future approach will be more 
procedural:  We'll deliberately create an empty migration *just before* 
forking a parallel branch that may require migrations, and will eventually 
require migrating tenants from that branch back to our mainline code.  We 
will then use that empty migration as a placeholder, and if necessary 
create one or more migrations that will replace the empty step to handle 
any parallel schema changes.  It might be wise to document this practice as 
a suggestion for others faced with this use case.
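In sketch form (app label, migration numbers, and operations are all 
hypothetical):

    # foo/migrations/0007_prefork_placeholder.py -- created just before
    # the long-lived branch is forked:
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("foo", "0006_previous")]
        operations = []

    # Later, in its own file on the mainline: a migration mirroring the
    # patch branch's schema change, declared as replacing the placeholder:
    class Migration(migrations.Migration):
        replaces = [("foo", "0007_prefork_placeholder")]
        dependencies = [("foo", "0006_previous")]
        operations = [
            # the operations ported from the patch branch
        ]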

We will still have a case where multiple migrations would each indicate 
that they replace one of those empty migrations, perhaps caused by multiple 
schema changes being applied to the "patch" fork.  We've seen that behavior 
create apparent duplicate entries in the django_migrations database table. 
 But the migration system loads that table into a set containing (app, 
name) tuples, so the duplicate entries aren't hurting anything at present.

Furthermore, there is nothing that deliberately ties the list of previously 
applied migrations (that set of (app, name) tuples) back to migrations that 
exist.  Entries in the table for migrations that no longer exist are a side 
effect of the current squashing logic.  They can eventually cause the 
django_migrations table to contain many more rows than are necessary; 
perhaps it may be useful to have a migration tool or operation that clears 
out those obsolete records.  Or, as Markus has suggested, it may be wise to 
have the squashing process clean up after itself by removing "replaced" 
entries instead of leaving them behind.
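If a cleanup tool ever materializes, I'd imagine it doing roughly the 
following (a sketch only; no such tool exists today, and the app/name 
values are hypothetical):

    from django.db.migrations.recorder import MigrationRecorder

    # Drop recorder rows for squashed-away migrations whose files are
    # gone for good:
    MigrationRecorder.Migration.objects.filter(
        app="foo", name__in=["0001_initial", "0002_something"],
    ).delete()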

I'll work up a documentation change PR that people can review and accept, 
modify or throw away.

Thank you!

baj
-
Barry Johnson





Re: Model-level validation

2022-10-07 Thread 'Barry Johnson' via Django developers (Contributions to Django itself)
I agree with James in several ways.   Our large Django application does 
rather extensive validation of data -- but I would argue strongly against 
embedding that validation in the base instance.save() logic.

(I would not argue against Django including a "ValidatingModel", derived 
from Model, that automatically runs defined validations as part of save(). 
 Then developers could choose which base they'd like to subclass when 
designing their objects.  Of course, anyone could simply create their own 
"ValidatingModel" class and derive everything from that class.)

Reason 1 is that business logic validation often requires access to 
multiple model instances -- for example, uniqueness across a set of objects 
about to be updated.  (e.g., "Only one person can be marked as the primary 
contact").  Or internal consistency:   "If this record is of this type, 
then it cannot have any children of that type".  Or even referential 
integrity in some cases:  "The incoming data has a code that serves as the 
primary key in some other table.  Make sure that primary key exists."

Yes, you can encode all of those cross-instance validations into an 
instance-level check, but then that brings us to the second point: 
 Performance.  There are a number of types of validations that are best 
served by operating on sets or lists of instances at a time.   Again, 
consider a referential integrity validation:  If I'm about to bulk_create 
5000 instances, but need to confirm that the "xyz code" is valid for all of 
them, then I should run a query that selects the "xyz table" for all of the 
codes that are referenced within the 5000 items instead of doing 5000 
individual lookups within that table.   Yes, one can maintain and access 
caches of known-valid things, but those are awkward to manage from within 
the Model layer.  
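A sketch of the set-based check I have in mind (XyzCode and xyz_code are 
illustrative names):

    from django.core.exceptions import ValidationError

    def validate_xyz_codes(items):
        # One set-based query for the whole batch instead of 5,000
        # individual point lookups:
        codes = {item.xyz_code for item in items}
        known = set(
            XyzCode.objects.filter(code__in=codes).values_list("code", flat=True)
        )
        missing = codes - known
        if missing:
            raise ValidationError(f"Unknown xyz codes: {sorted(missing)}")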

It's particularly difficult to write performant validations within the 
model when you're using .only() or .defer().  Unless the validation logic 
is able to detect that certain properties haven't been loaded from the 
database, it will trigger extra queries retrieving values from the 
database solely for the purpose of validating that they are still correct 
(even though you aren't changing them).
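(Model.get_deferred_fields() makes that detection possible, but every 
validator has to remember to use it; the field and helper names below are 
illustrative:)

    deferred = instance.get_deferred_fields()   # names skipped by .only()/.defer()
    if "xyz_code" not in deferred:
        validate_xyz_code(instance.xyz_code)    # safe: no lazy reload is triggered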

Also on the performance front, there are times when removing the extra 
layer of validation is necessary and appropriate.  With well-tested code, 
once the incoming data has been validated and the 
transformation/operational logic is considered fully tested and accurate, 
then avoiding a second validation on the outbound data can result in a 
significant performance improvement.  If you're dealing with millions or 
billions of records at a time (as we do during data conversions), then 
those significant performance improvements are worthwhile.

Finally, Django supports the queryset .update() method.  Again, validations 
that run within the model instance won't even HAVE instances when using 
.update() -- the queryset manager would need to figure out how to do the 
necessary validation (and if it's a multi-field validation, good luck!)   
There are also cases where the use of raw SQL is appropriate, and one 
obviously cannot lean on instance-level validation in that case.
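(For example, with illustrative model and field names:

    # No instances are materialized, so no instance-level validation runs:
    Product.objects.filter(discontinued=True).update(status="inactive")

)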

Validation is indeed important -- but testing the validity of data belongs 
in the business logic layer, not in the database model layer.  Agreed that 
some types of validations can easily be encoded into the database model, 
but then you find yourselves writing two layers of validation ("one simple, 
the other more sophisticated")... and that makes things even more 
complex.  We do indeed use model-level validations for single-field 
validations... but we invoke those validations from our business logic at 
the proper time, not while we're saving the data to the database.

baj

Barry Johnson
Epicor


On Thursday, October 6, 2022 at 2:47:19 AM UTC-5 James Bennett wrote:

> I see a lot of people mentioning that other ORMs do validation, but not 
> picking up on a key difference:
>
> Many ORMs are designed as standalone packages. For example, in Python 
> SQLAlchemy is a standalone DB/ORM package, and other languages have similar 
> popular ORMs.
>
> But Django's ORM isn't standalone. It's tightly integrated into Django, 
> and Django is a web framework. And once you focus *specifically* on the web 
> framework use case, suddenly things start going differently.
>
> For example: data on the web is "stringly-typed" (effectively, since HTTP 
> doesn't really have data types) and comes in via HTML's form mechanism or 
> other string-y formats like JSON or XML payloads. So you need not just data 
> *validation*, but data *conversion* which works for the web use case.
>
> And since the web use case […]

Re: Inconsistent pagination when sorting by non-unique columns (ticket 34251)

2023-01-11 Thread 'Barry Johnson' via Django developers (Contributions to Django itself)

I'd be very much -1 if the framework modified my sorting parameters (by 
adding 'pk' at the end of the ordering).  As mentioned, it can certainly 
break queries that can be handled by a covering index... but mostly, 
'pk' isn't the field we would prefer to use in this case anyway. 
 For example, when sorting a list of products, the final parameter ought to 
be the product's stocking number; when looking at items by store, the final 
parameter needs to be both stocking number and store.  There wouldn't seem 
to be -any- meaningful "always right" answer.
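(To illustrate with hypothetical models, the tiebreakers we'd actually 
want look like:

    # Domain-meaningful total orderings, rather than a framework-appended pk:
    Product.objects.order_by("description", "stocking_number")
    StoreItem.objects.order_by("description", "stocking_number", "store_id")

)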

We have something like 20,000 tests running today (and will probably hit 
40,000 over time), many of which do pagination and sorting.   I'd really, 
really hate to have to go fix all of those tests -- it would be thousands 
of them.  If there were a way to disable this proposed warning, I could live 
with that for a while.

baj
-
Barry Johnson

On Wednesday, January 11, 2023 at 6:55:25 AM UTC-6 tatari...@gmail.com 
wrote:

> As described in https://code.djangoproject.com/ticket/34251, some 
> database backends (incl. PostgreSQL) will produce non-deterministic 
> ordering when sorting by columns with non-unique values.
> This leads to inconsistent pagination, which I argue falls under the same 
> category as paginating by unordered QuerySet. 
>
> I suggest we add a warning similar to the UnorderedObjectListWarning when 
> total deterministic ordering is not specified (PK or unique/unique_together 
> not null fields).
>
> The consequence of this change is that *a lot* of projects will receive a 
> warning. This issue is easy to overlook since non-unique columns like 
> "name" can have mostly unique values and you need to have duplicates at the 
> page borders to get into trouble. While seemingly hard to reproduce, I 
> believe it impacts a lot of production systems out there.
>
> Correctly handling it requires some thought, especially when exposing 
> sorting options to the end user. Should 3rd-party tools like 
> django-sortable-listview 
> <https://github.com/aptivate/django-sortable-listview> or DRF 
> <https://github.com/encode/django-rest-framework> (OrderingFilter) 
> automatically enforce the correct ordering by adding "pk"? This brings 
> problems with indexing like in this ticket 
> <https://code.djangoproject.com/ticket/29943>. Another option with these 
> tools is adding *pk* or another unique field to the query string when 
> specifying desired sorting, which looks just wrong to me, pushing 
> responsibility to the front end.
>
> What do you folks think? I've been bitten by this issue several times, in 
> a production environment after smoothly running for a while. The first time 
> was a complete surprise to me, and I believe it may surprise quite a lot 
> of developers as well, so adding a warning will at least bring some 
> attention to it. 
>



Async QuerySet with Prefetch

2023-02-10 Thread 'Barry Johnson' via Django developers (Contributions to Django itself)
For those working on async code, I just opened a question on the Django 
Internals Async forum regarding query sets with prefetches, which aren't 
currently supported in the alpha build of 4.2.  A solution seemed so 
surprisingly easy that I'm wondering what the real experts in the area 
would think.

https://forum.djangoproject.com/t/release-4-2a-async-querysets-with-prefetch-related/18767

baj
