Accounting for between table correlation

2021-01-15 Thread Alexander Stoddard
I am having ongoing trouble with a pair of tables, the design of which is
beyond my control.

There is a 'primary' table with hundreds of millions of rows. There is then
a 'subclass' table ~ 10% of the primary which has additional fields. The
tables logically share a primary key field (although that is not annotated
in the schema).

Membership in the subclass table is highly correlated with fields in the
primary table - it is very much not random. Query plans that join the two
tables seem to be 'unstable': over time, very different plans can result
even for unchanged queries, and some plans are exceedingly inefficient.

I think what is going on is that the query planner assumes many fewer rows
will join to the subclass table than actually do (because of the strong
correlation).
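
For illustration only, the shape of the problem is roughly the following
(table and column names are invented, not the real schema, and I have not
yet verified that this minimal script reproduces the actual behavior):

-- Invented schema: subclass membership is driven entirely by a primary-table field.
CREATE TABLE primary_tbl (
    id       bigint PRIMARY KEY,
    category integer NOT NULL
);

CREATE TABLE subclass_tbl (
    id         bigint PRIMARY KEY,  -- logically the same key as primary_tbl.id, not declared
    extra_data text
);

INSERT INTO primary_tbl
SELECT g, g % 10 FROM generate_series(1, 1000000) g;

-- ~10% of primary rows have a subclass row, and they are exactly the category = 1 rows.
INSERT INTO subclass_tbl
SELECT id, 'extra' FROM primary_tbl WHERE category = 1;

ANALYZE primary_tbl;
ANALYZE subclass_tbl;

-- My expectation is that the planner, assuming the filter and the join are
-- independent, estimates something like 10,000 join rows here while the
-- actual result is 100,000.
EXPLAIN (ANALYZE)
SELECT p.id, s.extra_data
FROM primary_tbl p
JOIN subclass_tbl s ON s.id = p.id
WHERE p.category = 1;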

Can anyone offer any advice on dealing with this scenario (or better
diagnosing it)?

Thank you,
Alex


Re: Accounting for between table correlation

2021-01-15 Thread Alexander Stoddard
Unfortunately I'm not free to share the specific schema or the query plans.
They derive from an upstream vendor that is 'protective' of their data
model. To get to a proper example I'll need to recreate the behavior with
generic data in a generic schema.

I apologize for being frustratingly vague. I do feel like an idiot for not
at least saying this is with PG 11.10.
It has been helpful that the replies so far appear to validate my
understanding that no version of Postgres has cross-table correlation
statistics in the planner.
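
In the meantime the only mitigations I can think of are blunt per-query
settings along these lines (things I intend to experiment with, not
something I can claim fixes the problem):

-- Per-transaction knobs to steer the planner away from the index-scan plans
-- that hurt on spinning disks; SET LOCAL reverts automatically at commit.
BEGIN;
SET LOCAL random_page_cost = 6;   -- make random I/O look as expensive as it really is here
SET LOCAL enable_nestloop = off;  -- crude, but rules out nested-loop/index-scan plans
-- run the problematic report query here
COMMIT;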

ANALYZE is run frequently, and I think with a sufficient sample size. The
context is a large data analysis setting and the data is changed via bulk
ETL, not OLTP. The effect of analyzing is just instability - the plan can
flip in either direction (without the underlying data changing) between
giving an answer in minutes and timing out after tens of hours. That could
be indicative of too small a sample, but I think in this case it is more
that the statistics aren't useful, because the problem is cross-table
correlation that can't be accounted for.
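
If sample size were the issue, I assume the fix would look something like
this (table and column names are invented):

-- Raise the per-column statistics target on the relevant columns, then re-analyze.
ALTER TABLE primary_tbl  ALTER COLUMN category SET STATISTICS 1000;
ALTER TABLE primary_tbl  ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE subclass_tbl ALTER COLUMN id SET STATISTICS 1000;
ANALYZE primary_tbl;
ANALYZE subclass_tbl;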

The 'fast plans' use parallel seq scans. The 'slow plans' use index scans.
It appears a good query plan correctly predicts it should be bulk
processing the tables, but bad ones get fooled into thrashing (hard disk,
not SSD) by underestimating how many rows join between the tables.
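
The way I intend to confirm the misestimate is to compare the planner's row
estimate with the actual row count on that join node, along these lines
(again with invented names standing in for the real tables):

-- A large gap between "rows=" (the estimate) and "actual rows=" on the join
-- between the two big tables would be the smoking gun.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT p.id, s.extra_data
FROM primary_tbl p
JOIN subclass_tbl s ON s.id = p.id
WHERE p.category = 1;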

If anyone has similar experiences and is generous enough with their time to
share possible solutions or workarounds, I'm most grateful. If my
description is too vague to be worth answering then I quite understand and
apologize for the time wasted in reading.

Thank you.





On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer  wrote:
>
>> Atul Kumar wrote on 15.01.2021 at 16:29:
>> > As per Ron, you are not supposed to ask your questions here.
>> >
>> > According to him, we should keep on doing research on the internet
>> > rather than asking for support directly even when you have done enough
>> > research, and unless “Ron” is satisfied you have to keep on
>> > researching.
>>
>> Ron's question was perfectly valid.
>>
>> Missing and wrong statistics are one reason for the planner to choose a
>> bad execution plan.
>>
>
> Yeah, at first blush I didn't think analyze really mattered (and it mostly
> doesn't because while you can keep the statistics up-to-date the
> multi-table nature of the problem means they are only marginally helpful
> here), but that just points out the under-specified nature of the original
> posting.  Taken as a simple question of "is there a way to work around the
> lack of multi-table statistics" the analyze, and even the specific queries,
> don't matter all that much.  But it also would be much more useful if the
> OP would choose a single problematic query and show the schema, query, and
> explain results, hopefully both good and bad, and comment on how analyze
> seems to affect the plan choice.  But for the general question about
> overcoming our statistics limitations the analyze point is not relevant.
>
> David J.
>
>


Re: Accounting for between table correlation

2021-01-15 Thread Alexander Stoddard
On Fri, Jan 15, 2021 at 12:27 PM Michael Lewis  wrote:

> On Fri, Jan 15, 2021 at 10:22 AM Alexander Stoddard <
> alexander.stodd...@gmail.com> wrote:
>
>> The 'fast plans' use parallel seq scans. The 'slow plans' use index
>> scans. It appears a good query plan correctly predicts it should be bulk
>> processing the tables, but bad ones get fooled into thrashing (hard disk,
>> not SSD) by underestimating how many rows join between the tables.
>>
>
> How many tables are involved?
>

The queries are complex, with joins to 10-plus tables, although most are to
tiny enum-type lookup tables. I believe it is the join between the two
large tables that I have described that causes the issue, and that seems to
be reflected in the different strategies in the plans. For my own learning
and to clarify the problem I will probably have to try to reproduce the
behavior in a test case.


> Are you sure it is stats getting updated causing the change in behavior?
>
No, I'm not sure - could something else flip a plan after an ANALYZE?
Differing performance of multiple runs of the same query could be due to
caching etc., but that would be a timing difference without a change in
query plan. The output plans I see are radically different and correlate
with large swings in performance.
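
To catch the bad plans as they actually happen, rather than trying to
reconstruct them afterwards, I am thinking of using auto_explain, roughly:

-- Log the full plan (with actual row counts) of anything that runs long, so
-- the slow variant is captured even when it only appears intermittently.
LOAD 'auto_explain';  -- or preload it via session_preload_libraries
SET auto_explain.log_min_duration = '5min';
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;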


> Are you hitting the genetic optimizer?
>

I am doing nothing to specify the optimizer. Do I have configurable options
in that regard? I was unaware of them.

Thank you,
Alex


Re: Accounting for between table correlation

2021-01-19 Thread Alexander Stoddard
On Tue, Jan 19, 2021 at 2:47 PM Michael Lewis  wrote:

> On Fri, Jan 15, 2021 at 3:27 PM Alexander Stoddard <
> alexander.stodd...@gmail.com> wrote:
>
>> I am doing nothing to specify the optimizer. Do I have configurable
>> options in that regard? I was unaware of them.
>>
>
> Read up on join_collapse_limit and geqo.
>
> https://www.postgresql.org/docs/current/runtime-config-query.html
>
> Note that you can set these values within your session to check and even
> just run explain [query] to check if the plan that would be used varies.
>

Thank you for this. I can get different plans by changing those parameters,
but if the output of EXPLAIN VERBOSE indicates anywhere that geqo planning
specifically is being used, I'm not seeing it. Is there a way to tell from
the explain output whether geqo was used?
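
In the absence of an obvious marker, my plan is to check the settings
against the size of the join problem and simply toggle geqo off to see
whether the plan changes:

-- GEQO only kicks in when the number of items to be joined reaches
-- geqo_threshold, and with explicit JOIN syntax join_collapse_limit caps
-- the size of the join problem the planner (and so GEQO) ever sees.
SHOW geqo;                 -- on by default
SHOW geqo_threshold;       -- default 12
SHOW join_collapse_limit;  -- default 8
SHOW from_collapse_limit;  -- default 8

SET geqo = off;
-- then EXPLAIN the problematic query again and compare plan and planning time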


Re: PostgreSQL suitable?

2017-12-19 Thread Alexander Stoddard
On Tue, Dec 19, 2017 at 10:39 AM, Stephen Frost  wrote:

> Greetings,
>
> * James Keener (j...@jimkeener.com) wrote:
> > Would a storage block level incremental like zfs work?
>
> This really depends on what you want out of your backups and just
> exactly how the ZFS filesystem is set up.  Remember that any backup of
> PG that doesn't use PG's start/stop backup must be atomic across all
> tablespaces and even then that really just allows you to bring PG back
> up as of that point of the snapshot.  I wouldn't recommend trying to
> play WAL forward from that kind of a backup.  If you do use
> pg_start/stop_backup with ZFS snapshots, and make sure to track all of
> the WAL that's generated between the start/stop backup and ensure it's
> reliably stored, etc, then they can work, but it's not simple.
>
>
I believe the thread started with a data warehouse use case. That might be
one application where data ingestion and processing can be stopped and
started in a controlled manner, as opposed to a continuously live system
where changes continually accumulate in the WAL.
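
In that kind of controlled setting, my understanding of the procedure
Stephen describes is roughly the following (using the exclusive form of the
low-level backup API, so treat it as a sketch rather than a recipe):

-- 1. Tell the server a base backup is starting ('true' requests an immediate checkpoint).
SELECT pg_start_backup('zfs_snapshot', true);

-- 2. Outside the database, take the ZFS snapshot(s); they must cover the
--    data directory and every tablespace, effectively at the same instant.

-- 3. Finish the backup, and retain every WAL segment generated in between.
SELECT pg_stop_backup();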

Best regards,
Alex


Please help me understand unlogged tables

2018-01-31 Thread Alexander Stoddard
If a table is set to unlogged, is it inherently non-durable? That is, must
any crash or unsafe shutdown result in truncation upon recovery?

I can imagine that a table bulk loaded in a warehousing scenario and then
sitting statically could be safe, but maybe the question becomes how the
system could know it is unchanged if it isn't logged...

Would a correct solution be to bulk load into an unlogged table and then
(fairly cheaply?) set the table back to being logged?

If my mental model is correct, the table would be liable to be lost to a
crash during loading, but once set to logged it is durable. Is that
correct?

Is it the case that setting a large table to logged status after loading it
unlogged is a reasonably cheap operation?
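
For concreteness, the pattern I have in mind is this (table name and file
path are invented):

-- Load without WAL, then convert. My understanding is that SET LOGGED
-- rewrites the table into the WAL, so it costs roughly a pass over the
-- data, but the bulk load itself skips WAL entirely, and only after SET
-- LOGGED does the table become crash-safe (and replicated to standbys).
CREATE UNLOGGED TABLE staging_facts (
    id      bigint,
    payload text
);

COPY staging_facts FROM '/path/to/extract.csv' WITH (FORMAT csv);

ALTER TABLE staging_facts SET LOGGED;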

Thank you,
Alex Stoddard