#32940: Consider removing unused/unnecessary functionality in tree.Node.add
-------------------------------------+-------------------------------------
     Reporter:  Keryn Knight         |                    Owner:  Keryn
         Type:                       |  Knight
  Cleanup/optimization               |                   Status:  assigned
    Component:  Utilities            |                  Version:  dev
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:  Accepted
    Has patch:  1                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Changes (by Nick Pope):

 * stage:  Unreviewed => Accepted


Old description:

> I'm proposing for discussion the removal of 2 things within the
> `Node.add` method, or at least further investigation on why it's ''not''
> possible to do so; those 2 bits are outlined immediately below.
>
> If we instrument the `Node.add` method, used by `query_utils.Q` and
> `where.WhereNode` extensively, like so:
> {{{
> ...
> if data in self.children:
>     print('v- data existed')
>     traceback.print_stack()
> if self.connector == conn_type and data in self.children:
>     print('^- data existed and connector matched')
>     return data
> if not squash:
>     print('<- squash is not Truthy')
>     self.children.append(data)
>     return data
> ...
> }}}
>
> and run the test suite (or as much as I'm able easily):
>
> {{{
> ...
> Ran 14834 tests in 453.388s
> OK (skipped=1188, expected failures=4)
> }}}
>
> We will find that the `data in self.children` barely registers as
> tested/used, and `squash` is entirely unused (unless those 1188 tests
> hide the truth).
>
> Having separately instrumented the add method previously to check the
> number of ''calls'' by class type, it appears as below:
>
> - `1493` calls to `Q.add` of which `0` have a `data in children` match.
> - `2` calls to `Node.add` of which `1` is a `data in children` match.
> - `210862` calls to `WhereNode.add` of which `2` are a `data in children
> match.
>
> But of all of those matches, the connector is never the same, so it never
> does that return early expected optimisation.
>
> The tests which exercise those 3 matches are:
> - `tests.queries.tests.DisjunctiveFilterTests.test_ticket8283`
> - `tests.queries.tests.Queries4Tests.test_combine_or_filter_reuse`
> -
> `tests.utils_tests.test_tree.NodeTests.test_add_eq_child_mixed_connector`
>
> The last of those is ''intentionally'' making sure that it ''does'' get
> added due to the connector difference.
> The first two, as they could affect where clause pushdown & merging, or
> cause different query plans due to repeats, we should check to make sure
> that the queries (additionally to the results) remain the same before and
> after by doing the following back of the napkin job for each of them:
> {{{
> self.assertEqual(
>     str(x.query),
>     "???"
> )
> }}}
> and confirm that they remain the same (they do), which are:
> {{{
> 'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info",
> "queries_extrainfo"."note_id", "queries_extrainfo"."value",
> "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM
> "queries_extrainfo" WHERE (("queries_extrainfo"."note_id" = 1 OR
> "queries_extrainfo"."info" = e2) AND "queries_extrainfo"."note_id" = 1)
> ORDER BY "queries_extrainfo"."info" ASC';
>
> 'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info",
> "queries_extrainfo"."note_id", "queries_extrainfo"."value",
> "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM
> "queries_extrainfo" WHERE (("queries_extrainfo"."info" = e2 OR
> "queries_extrainfo"."note_id" = 1) AND "queries_extrainfo"."note_id" = 1)
> ORDER BY "queries_extrainfo"."info" ASC';
>
> 'SELECT "queries_author"."id", "queries_author"."name",
> "queries_author"."num", "queries_author"."extra_id" FROM "queries_author"
> WHERE (("queries_author"."name" = a1 OR "queries_author"."name" = a3) AND
> "queries_author"."name" = a1)';
> }}}
>
> Removing the `data in self.children` check should be a small optimisation
> win because it's an `O(n)` scan across the entire list of direct
> children, and uses the `Node.__eq__` method which does an equality scan
> on it's own children (so grandchildren).
>
> {{{
> In [2]: %%timeit
>    ...: x = Q(('a', 1), ('b', 2), c=3, d=4, b=1)
>    ...: y = Q(aa=1, bb=2, cc=3)
>    ...: x.add(y, Q.OR)
> 4.81 µs ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops
> each)
> }}}
>
> after removing:
> {{{
> 4.66 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops
> each)
> }}}
>
> It's worth noting that the concept of checking `data in self.children` is
> semi-flawed anyway, because `Q(('a', 1), a=2)` or `Q(('a', 1), ('a', 2))`
> very ''simply'' allow duplicate entries to occur immediately and up-
> front.
>
> I have a patch which I'll push up shortly once I've tidied it of all my
> instrumentation and `pdb` usage. The hope is that the entire CI passes,
> if it doesn't this is all moot :)

New description:

 I'm proposing for discussion the removal of 2 things within the `Node.add`
 method, or at least further investigation on why it's ''not'' possible to
 do so; those 2 bits are outlined immediately below.

 If we instrument the `Node.add` method, used by `query_utils.Q` and
 `where.WhereNode` extensively, like so:
 {{{
 ...
 if data in self.children:
     print('v- data existed')
     traceback.print_stack()
 if self.connector == conn_type and data in self.children:
     print('^- data existed and connector matched')
     return data
 if not squash:
     print('<- squash is not Truthy')
     self.children.append(data)
     return data
 ...
 }}}

 and run the test suite (or as much as I'm able easily):

 {{{
 ...
 Ran 14834 tests in 453.388s
 OK (skipped=1188, expected failures=4)
 }}}

 We will find that the `data in self.children` barely registers as
 tested/used, and `squash` is entirely unused (unless those 1188 tests hide
 the truth).

 Having separately instrumented the add method previously to check the
 number of ''calls'' by class type, it appears as below:

 - `1493` calls to `Q.add` of which `0` have a `data in children` match.
 - `2` calls to `Node.add` of which `1` is a `data in children` match.
 - `210862` calls to `WhereNode.add` of which `2` are a `data in children`
 match.

 But of all of those matches, the connector is never the same, so it never
 does that return early expected optimisation.

 The tests which exercise those 3 matches are:
 - `tests.queries.tests.DisjunctiveFilterTests.test_ticket8283`
 - `tests.queries.tests.Queries4Tests.test_combine_or_filter_reuse`
 -
 `tests.utils_tests.test_tree.NodeTests.test_add_eq_child_mixed_connector`

 The last of those is ''intentionally'' making sure that it ''does'' get
 added due to the connector difference.
 The first two, as they could affect where clause pushdown & merging, or
 cause different query plans due to repeats, we should check to make sure
 that the queries (additionally to the results) remain the same before and
 after by doing the following back of the napkin job for each of them:
 {{{
 self.assertEqual(
     str(x.query),
     "???"
 )
 }}}
 and confirm that they remain the same (they do), which are:
 {{{
 'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info",
 "queries_extrainfo"."note_id", "queries_extrainfo"."value",
 "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM
 "queries_extrainfo" WHERE (("queries_extrainfo"."note_id" = 1 OR
 "queries_extrainfo"."info" = e2) AND "queries_extrainfo"."note_id" = 1)
 ORDER BY "queries_extrainfo"."info" ASC';

 'SELECT "queries_extrainfo"."id", "queries_extrainfo"."info",
 "queries_extrainfo"."note_id", "queries_extrainfo"."value",
 "queries_extrainfo"."date_id", "queries_extrainfo"."filterable" FROM
 "queries_extrainfo" WHERE (("queries_extrainfo"."info" = e2 OR
 "queries_extrainfo"."note_id" = 1) AND "queries_extrainfo"."note_id" = 1)
 ORDER BY "queries_extrainfo"."info" ASC';

 'SELECT "queries_author"."id", "queries_author"."name",
 "queries_author"."num", "queries_author"."extra_id" FROM "queries_author"
 WHERE (("queries_author"."name" = a1 OR "queries_author"."name" = a3) AND
 "queries_author"."name" = a1)';
 }}}

 Removing the `data in self.children` check should be a small optimisation
 win because it's an `O(n)` scan across the entire list of direct children,
 and uses the `Node.__eq__` method which does an equality scan on it's own
 children (so grandchildren).

 {{{
 In [2]: %%timeit
    ...: x = Q(('a', 1), ('b', 2), c=3, d=4, b=1)
    ...: y = Q(aa=1, bb=2, cc=3)
    ...: x.add(y, Q.OR)
 4.81 µs ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
 }}}

 after removing:
 {{{
 4.66 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
 }}}

 It's worth noting that the concept of checking `data in self.children` is
 semi-flawed anyway, because `Q(('a', 1), a=2)` or `Q(('a', 1), ('a', 2))`
 very ''simply'' allow duplicate entries to occur immediately and up-front.

 I have a patch which I'll push up shortly once I've tidied it of all my
 instrumentation and `pdb` usage. The hope is that the entire CI passes, if
 it doesn't this is all moot :)

--

Comment:

 The `squash` argument was added in
 d3f00bd5706b35961390d3814dd7e322ead3a9a3. It seems to have been unused
 since it was introduced. (I didn't find `squash=False` anywhere.)

 The `if self.connector == conn_type and data in self.children:` line had
 the first half of the expression added recently in
 b81c7562fc33f50166d5120138d6398dc42b13c3. Interestingly this was actually
 fixing a regression in d3f00bd5706b35961390d3814dd7e322ead3a9a3 as prior
 to that commit the condition was the same!

 It'd probably be a good idea to see if Simon can cast an eye over this.

-- 
Ticket URL: <https://code.djangoproject.com/ticket/32940#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/067.f660e6406a074c38692598a567c2205c%40djangoproject.com.

Reply via email to