Re: Execution time from >1s -> 80m+ when extra columns added in SELECT for sub-query
Hi It looks so in slow plan is some strange relations between subselects - the slow plan looks like plan for correlated subquery, and it should be slow. Minimally you miss a index on column jtemp1c37l3b_baseline_windows_after_inclusion.uuid
Re: Execution time from >1s -> 80m+ when extra columns added in SELECT for sub-query
Thanks for responding! On May 18, 2020 at 12:18:37 PM, Pavel Stehule ([email protected](mailto:[email protected])) wrote: > Hi > > It looks so in slow plan is some strange relations between subselects - the > slow plan looks like plan for correlated subquery, and it should be slow. I'm not very saavy about query planning and have very little (if any) idea of what I'm talking about. Our dynamically generated SQL does generate a lot of "strange" subselects. They're hard to avoid for what we're doing. I admit it makes for some abnormal looking queries. That said, I think we've revealed an interesting issue here. Why would the planner switch plans so drastically given that all I'm doing is including a few extra columns in the subselect, particularly when those columns are discarded by the super? parent? subselect I'd expect the query planner to know that only the uuid column of the subselect is used and not bother to actually project any additional columns. Also, my impression about the EXISTS operator is that it would not really need to pull any column values from disk except where those values would be used to determine the conditions for the EXISTS. Maybe I'm misunderstanding how this query should be executed, but I wanted to bring it to your attention because it seems like, though the SQL is a bit crazy, it reveals some very inconsistent planning on PostgreSQL's part and might be something to look into. > Minimally you miss a index on column > jtemp1c37l3b_baseline_windows_after_inclusion.uuid Thanks for pointing this out! This is an intermediate table we generate as part of a much larger process and the table is only used once. I'm under the impression that there's a trade-off between taking the time to first build an index then run the query rather than just running the query that one time. It's an interesting idea to build the indexes just to avoid poor query plans and it's something I'll keep in mind if we run into other queries that trigger poor performance and we're unable to work around them some other way. If you'd like, I can slap on index on those tables and re-run a few queries, but again, even if performance improves in this one test case, I'm not sure it'd convince us to start adding indexes to some or all our tables as part of our process just to avoid this one bad plan in this one query. We've run 15-20 of our larger processes and only hit this situation in one query in one process. I imagine adding indexes across the board might be a heavy-handed solution to work around this issue.
Re: Execution time from >1s -> 80m+ when extra columns added in SELECT for sub-query
A Guy Named Ryan writes: > Why would the planner switch plans so drastically given that all I'm > doing is including a few extra columns in the subselect, particularly > when those columns are discarded by the super? parent? subselect The problem is that the columns you're adding *don't belong to that table*. Per your schema dump, jtemp1c37l3b_baseline_windows_with_collections only contains the columns person_id and uuid. So when you write SELECT "uuid" AS "uuid", CAST(NULL AS float) AS "drug_amount", CAST(NULL AS text) AS "drug_amount_units", CAST(NULL AS bigint) AS "drug_days_supply", CAST(NULL AS text) AS "drug_name", CAST(NULL AS float) AS "drug_quantity", CAST(NULL AS integer) AS "window_id", "person_id" AS "person_id", "criterion_id" AS "criterion_id", "criterion_table" AS "criterion_table", "criterion_domain" AS "criterion_domain", "start_date" AS "start_date", "end_date" AS "end_date" FROM "jigsaw_temp"."jtemp1c37l3b_baseline_windows_with_collections" those are the only two columns that are "legitimately" part of that bottom sub-select, and the others are outer references to jtemp1c37l3b_baseline_windows_after_inclusion. That's legal per SQL, but it makes the EXISTS into a correlated sub-select, which is something we can't turn into a semijoin. Indeed, the unreferenced columns do get thrown away later, but that doesn't happen until well past the point where the join restructuring decisions are made (and there are good reasons for that ordering of operations). Basically I'd write this off as "broken SQL code generator". If it doesn't understand the difference between a local reference and an outer reference, you shouldn't be letting it near your database. That sort of fundamental misunderstanding often leads to incorrect query results, never mind whether the query is fast or not. regards, tom lane
