Postgres using nested loops despite setting enable_nestloop to false
Hi list,

We have an application that generates SQL statements that are then executed on a PostgreSQL database. The statements are always "bulk" type statements: they always return a relatively large amount of data and have only a few, not very selective, filter expressions. They do contain a terrible number of joins, though.

The database has a "datavault" structure consisting of satellite, hub and link tables. Tables can easily contain a large number of rows (10..100 million). The individual tables have primary key constraints but no referential constraints, and the only indexes present are those for the primary key constraints. There are no indexes on any other column. The reason for this is that everything is done in this database to get the highest possible performance for both loading data and querying it for our specific purpose, and indexes do not help with that at all (they are never used by the planner because the conditions are never selective enough).

One problem we have with these queries is that PostgreSQL's planner often badly underestimates the number of rows returned by query steps. It then uses nested loops for merging parts because it estimated it needs to loop only a few times, but in reality it needs to loop 10 million times, and that tends to not finish in any reasonable time ;) Considering the type of query we run, we can safely say that using a nested loop is always a bad choice, so we always run these statements after setting enable_nestloop to false. This has greatly helped the stability of these queries.

But lately, while migrating to Postgres 13 (from 9.6), we found that Postgres does not (always) obey the enable_nestloop = false setting anymore: some queries make a plan that contains a nested loop, and consequently they do not finish anymore.
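A per-session setup along these lines keeps the setting from leaking to other connections (a minimal sketch, not the application's actual code; the bulk statement is a placeholder):

```sql
-- Disable nested loops for this transaction only; SET LOCAL reverts
-- automatically at COMMIT/ROLLBACK, so other sessions are unaffected.
BEGIN;
SET LOCAL enable_nestloop = off;
-- ... generated bulk query runs here ...
COMMIT;
```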
Whether a nested loop is being generated still seems to depend on the database's actual statistics; on some databases it uses the nested loop, while on others (that use the exact same schema but have different data in them) it uses only hash and merge joins, as it should. What can I do to prevent these nested loops from occurring?

FYI, an example query on a datavault database:

select coalesce(adres_pe.id_s, -1) as adres_id
     , coalesce(tijd.tijdkey, 'Unknown') as calender_id
     , coalesce(di01905cluster_pe.id_s, -1) as di01905cluster_id
     , coalesce(di02697relatie_pe.id_s, -1) as di02697relatie_id
     , coalesce(di04238cluster_pe.id_s, -1) as di04238cluster_id
     , coalesce(di04306natuurlijkpersoon_pe.id_s, -1) as di04306natuurlijkpersoon_id
     , coalesce(eenheid_pe.id_s, -1) as eenheid_id
     , coalesce(huurovereenkomst_pe.id_s, -1) as huurovereenkomst_id
     , cast(count(huurovereenkomst_pe.identificatie) as bigint) as kg00770
from datavault.tijd tijd
cross join lateral (
    select * from datavault.s_h_huurovereenkomst_ssm
    where dv_start_dts <= tijd.einddatum and dv_end_dts > tijd.einddatum
) huurovereenkomst_pe
inner join datavault.l_huurovk_ovk_ssm l_huurovk_ovk_ssm_pe
    on huurovereenkomst_pe.id_h_huurovereenkomst = l_huurovk_ovk_ssm_pe.id_h_huurovereenkomst
    and l_huurovk_ovk_ssm_pe.dv_start_dts <= tijd.einddatum
    and l_huurovk_ovk_ssm_pe.dv_end_dts > tijd.einddatum
inner join datavault.s_h_overeenkomst_ssm overeenkomst_pe
    on l_huurovk_ovk_ssm_pe.id_h_overeenkomst = overeenkomst_pe.id_h_overeenkomst
    and overeenkomst_pe.dv_start_dts <= tijd.einddatum
    and overeenkomst_pe.dv_end_dts > tijd.einddatum
left join datavault.l_huurovk_eenheid_ssm l_huurovk_eenheid_ssm_pe
    on huurovereenkomst_pe.id_h_huurovereenkomst = l_huurovk_eenheid_ssm_pe.id_h_huurovereenkomst
    and l_huurovk_eenheid_ssm_pe.dv_start_dts <= tijd.einddatum
    and l_huurovk_eenheid_ssm_pe.dv_end_dts > tijd.einddatum
left join datavault.s_h_eenheid_ssm eenheid_pe
    on l_huurovk_eenheid_ssm_pe.id_h_eenheid = eenheid_pe.id_h_eenheid
    and eenheid_pe.dv_start_dts <= tijd.einddatum
    and eenheid_pe.dv_end_dts > tijd.einddatum
left join datavault.l_adres_eenheid_ssm l_adres_eenheid_ssm_pe
    on l_huurovk_eenheid_ssm_pe.id_h_eenheid = l_adres_eenheid_ssm_pe.id_h_eenheid
    and l_adres_eenheid_ssm_pe.dv_start_dts <= tijd.einddatum
    and l_adres_eenheid_ssm_pe.dv_end_dts > tijd.einddatum
left join datavault.s_h_adres_ssm adres_pe
    on l_adres_eenheid_ssm_pe.id_h_adres = adres_pe.id_h_adres
    and adres_pe.dv_start_dts <= tijd.einddatum
    and adres_pe.dv_end_dts > tijd.einddatum
left join lateral (
    select l_cluster_eenheid_ssm.id_h_eenheid
         , di01905cluster.id_s
    from datavault.l_cluster_eenheid_ssm
    inner join datavault.s_h_cluster_ssm di01905cluster
        on l_cluster_eenheid_ssm.id_h_cluster = di01905cluster.id_h_cluster
        and di01905cluster.dv_start_dts <= tijd.einddatum
        and di01905cluster.dv_end_dts > tijd.einddatum
    where di01905cluster.soort = 'FIN'
) di01905cluster_pe
    on l_huurovk_eenheid_ssm_pe.id_h_eenheid = di01905cluster_pe.id_h_eenheid
left join lateral (select l_ovk_ovkrel_ssm.id_h_ove
Install clustered postgres
Hello to all,

Can you help me understand whether it is feasible to install clustered Postgres 12 on 2 load-balanced servers with VMware? Who has had this experience, and can you share more information with me?

Best Regards,
Nancy
Re: Postgres using nested loops despite setting enable_nestloop to false
On Tue, Nov 17, 2020 at 02:47:55PM +0100, Frits Jalvingh wrote:
> But lately while migrating to Postgres 13 (from 9.6) we found that Postgres
> does not (always) obey the enable_nestloop = false setting anymore: some
>
> The execution plan on Postgres 13.1:

Could you send the plans under pg13 and pg9.6 as attachments?

What is the setting of work_mem?

I see the cost is dominated by 2*disable_cost, but I wonder whether the I/O cost of hash joins now exceeds that. Maybe hash_mem_multiplier helps you?

GroupAggregate (cost=20008853763.07..20008853776.02 rows=370 width=68)
  Group Key: (COALESCE(adres_pe.id_s, '-1'::integer)),

-- 
Justin
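The hash_mem_multiplier experiment suggested here could look roughly like this (a sketch; the query itself is a placeholder, and the multiplier value is illustrative):

```sql
-- PostgreSQL 13+: let hash tables use a multiple of work_mem without
-- raising work_mem itself (which also governs sorts and other nodes).
SET enable_nestloop = off;
SET hash_mem_multiplier = 2.0;  -- hash tables may use up to 2 * work_mem
EXPLAIN
SELECT ...;                     -- re-plan the problem query here
```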
Re: Postgres using nested loops despite setting enable_nestloop to false
Hi Justin, thanks for your help!

I have attached both plans, both made with set enable_nestloop = false, in the attachments.

On the PostgreSQL 13 server work_mem is 64MB. It cannot really be higher there because PostgreSQL does not control its use of memory; setting it higher on this VM will cause the OOM killer to kill PostgreSQL for some queries. On the Postgres 9.6 server we have it way higher, at 5GB (this machine is a monster with about 800GB of RAM).

I indeed saw that the artificial cost for the nested join led to 2x that amount. But that seems to be because there are actually 2 nested loops in there: we use a cross join with a "time" table (which contains just some 28 rows), and that one always seems to need a nested loop (it is always present). So I'm not too certain that that 2x disable_cost is from joins; it seems to be from 2x the nested loop. And I actually wondered whether that would be a cause of the issue, because as far as costs are concerned that second nested loop only _increases_ the cost by 2 times...

Regards,

Frits

On Tue, Nov 17, 2020 at 3:20 PM Justin Pryzby wrote:
> Could you send the plans under pg13 and pg9.6 as attachments?
>
> What is the setting of work_mem?
>
> I see the cost is dominated by 2*disable_cost, but I wonder whether the I/O
> cost of hash joins now exceeds that. Maybe hash_mem_multiplier helps you?
> GroupAggregate (cost=20008853763.07..20008853776.02 rows=370 width=68)
>   Group Key: (COALESCE(adres_pe.id_s, '-1'::integer)),
>
> --
> Justin

[Attachment: execution plan on Postgres 13]

GroupAggregate  (cost=20008853763.12..20008853776.07 rows=370 width=68)
  Group Key: (COALESCE(adres_pe.id_s, '-1'::integer)), (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), (COALESCE(di01905cluster.id_s, '-1'::integer)), (COALESCE(di02697relatie_pe.id_s, '-1'::integer)), (COALESCE(di04238cluster.id_s, '-1'::integer)), (COALESCE(di04306natuurlijkpersoon.id_s, '-1'::integer)), (COALESCE(eenheid_pe.id_s, '-1'::integer)), (COALESCE(s_h_huurovereenkomst_ssm.id_s, '-1'::integer))
  ->  Sort  (cost=20008853763.12..20008853764.05 rows=370 width=81)
        Sort Key: (COALESCE(adres_pe.id_s, '-1'::integer)), (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), (COALESCE(di01905cluster.id_s, '-1'::integer)), (COALESCE(di02697relatie_pe.id_s, '-1'::integer)), (COALESCE(di04238cluster.id_s, '-1'::integer)), (COALESCE(di04306natuurlijkpersoon.id_s, '-1'::integer)), (COALESCE(eenheid_pe.id_s, '-1'::integer)), (COALESCE(s_h_huurovereenkomst_ssm.id_s, '-1'::integer))
        ->  Nested Loop Left Join  (cost=2106618.91..20008853747.34 rows=370 width=81)
              Join Filter: (l_huurovk_ovk_ssm_pe.id_h_overeenkomst = l_ovk_ovkrel_ssm_1.id_h_overeenkomst)
              ->  Merge Left Join  (cost=1096634.59..1097034.11 rows=370 width=60)
                    Merge Cond: (l_huurovk_eenheid_ssm_pe.id_h_eenheid = l_cluster_eenheid_ssm_1.id_h_eenheid)
                    Join Filter: ((di04238cluster.dv_start_dts <= tijd.einddatum) AND (di04238cluster.dv_end_dts > tijd.einddatum))
                    ->  Merge Left Join  (cost=1091816.96..1092215.31 rows=370 width=60)
                          Merge Cond: (l_huurovk_eenheid_ssm_pe.id_h_eenheid = eenheid_pe.id_h_eenheid)
                          Join Filter: ((eenheid_pe.dv_start_dts <= tijd.einddatum) AND (eenheid_pe.dv_end_dts > tijd.einddatum))
                          ->  Merge Left Join  (cost=1087694.33..1087954.25 rows=370 width=56)
                                Merge Cond: (l_huurovk_eenheid_ssm_pe.id_h_eenheid = l_cluster_eenheid_ssm.id_h_eenheid)
                                Join Filter: ((di01905cluster.dv_start_dts <= tijd.einddatum) AND (di01905cluster.dv_end_dts > tijd.einddatum))
                                ->  Sort  (cost=1078369.96..1078370.89 rows=370 width=52)
                                      Sort Key: l_huurovk_eenheid_ssm_pe.id_h_eenheid
                                      ->  Merge Join  (cost=1077852.39..1078354.18 rows=370 width=52)
                                            Merge Cond: (l_huurovk_ovk_ssm_pe.id_h_overeenkomst = overeenkomst_pe.id_h_overeenkomst)
                                            Join Filter: ((overeenkomst_pe.dv_start_dts <= tijd.einddatum) AND (overeenkomst_pe.dv_end_dts > tijd.einddatum) AND (overeenkomst_pe.begindatum <= tijd.einddatum) AND ((overeenkomst_pe.einddatum >= tijd.einddatum) OR (overeenkomst_pe.einddatum IS NULL)))
                                            ->  Sort  (cost=1073751.26..1073783.05 rows=12715 width=52)
                                                  Sort Key: l_huurovk_ovk_ssm_pe.id_h_overeenkomst
Re: Postgres using nested loops despite setting enable_nestloop to false
Ah, sorry, I forgot. I set "hash_mem_multiplier = 2", and after that to 20. It had no effect on the nested loops.

On Tue, Nov 17, 2020 at 4:58 PM Frits Jalvingh wrote:
> Hi Justin, thanks for your help!
> I have attached both plans, both made with set enable_nestloop = false in
> the attachments.
> [...]
Re: Postgres using nested loops despite setting enable_nestloop to false
On Tue, Nov 17, 2020 at 04:58:45PM +0100, Frits Jalvingh wrote:
> Hi Justin, thanks for your help!
> I have attached both plans, both made with set enable_nestloop = false in
> the attachments.
> On the Postgresql 13 server work_mem is 64MB. It cannot really be higher
> there because Postgresql does not control its use of memory, setting it
> higher on this VM will cause the OOM killer to kill Postgresql for some
> queries.

Can you try to get an explain just for this query with either increased work_mem or hash_mem_multiplier?

Or possibly by messing with the cost parameters, including seq_page_cost. Making all cost_* params 1000x smaller might allow the disable cost to be effective.

-- 
Justin
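This cost-scaling experiment, tried at the session level, could look like the sketch below (values illustrative only, each roughly 1000x smaller than the default; the query is a placeholder):

```sql
-- Shrink the ordinary cost parameters so that disable_cost (1.0e10,
-- added to each disabled nestloop path) dominates plan comparison again.
SET seq_page_cost        = 0.001;      -- default 1.0
SET random_page_cost     = 0.004;      -- default 4.0
SET cpu_tuple_cost       = 0.00001;    -- default 0.01
SET cpu_index_tuple_cost = 0.000005;   -- default 0.005
SET cpu_operator_cost    = 0.0000025;  -- default 0.0025
SET enable_nestloop      = off;
EXPLAIN SELECT ...;  -- check whether the topmost nested loop is gone
```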
Re: Postgres using nested loops despite setting enable_nestloop to false
Ok, I set all those cost parameters:

# - Planner Cost Constants -
seq_page_cost = 0.0001          # measured on an arbitrary scale
random_page_cost = 0.0002
cpu_tuple_cost = 0.1            # same scale as above
cpu_index_tuple_cost = 0.05     # same scale as above
cpu_operator_cost = 0.025       # same scale as above
parallel_tuple_cost = 0.0001    # same scale as above
parallel_setup_cost = 1.0       # same scale as above
#min_parallel_table_scan_size = 8MB
#min_parallel_index_scan_size = 512kB
effective_cache_size = 2GB

It still has the nested loop on top, but the total cost is now:

GroupAggregate (cost=2005652.88..2005652.90 rows=370 width=68)

On Tue, Nov 17, 2020 at 5:08 PM Frits Jalvingh wrote:
> Ah, sorry, I forgot. I set "hash_mem_multiplier = 2", and after that to
> 20. It had no effect on the nested loops.
> [...]
Re: Postgres using nested loops despite setting enable_nestloop to false
Frits Jalvingh writes:
> I have attached both plans, both made with set enable_nestloop = false in
> the attachments.

The reason why you're getting a nested loop is that the planner has no other choice. The "tijd" table has no join conditions that would be amenable to hash- or merge-joining it to something else, because both of those join methods require a plain equality join condition. AFAICS in a quick look, all of tijd's join conditions look more like

    Join Filter: ((di04238cluster.dv_start_dts <= tijd.einddatum) AND (di04238cluster.dv_end_dts > tijd.einddatum))

which is not amenable to anything except brute force cross-join-and-test-the-condition.

Given that, it's likely that "enable_nestloop = false" is making things worse not better, by artificially distorting the plan shape.

Seeing the large number of joins involved, I wonder what your geqo_threshold, join_collapse_limit, and from_collapse_limit settings are, and whether you can get a better plan by increasing them.

The planner doesn't seem to think that any of these joins involve a very large number of rows, so I doubt that your work_mem setting is very relevant. However, are these rowcount estimates accurate? You claimed upthread that you were dealing with hundreds of millions of rows, but it's impossible to credit that cost estimates like

    ->  Seq Scan on s_h_cluster_ssm di01905cluster  (cost=0.00..155.05 rows=1072 width=24)
          Filter: (soort = 'FIN'::text)

correspond to scanning large tables.

In the end, I fear that finding a way to get rid of those inequality join conditions may be your only real answer.

regards, tom lane
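One possible direction for eliminating the inequality join conditions (hedged, since whether it pays off depends on data volumes): materialize the range join against the small "tijd" table once, so that later joins can use plain equality on the period key. Table and column names below follow the thread's schema, but the temp table itself is hypothetical:

```sql
-- Expand each satellite row's [dv_start_dts, dv_end_dts) validity range
-- into one row per period of the small "tijd" table (about 28 rows).
CREATE TEMP TABLE huurovereenkomst_per_period AS
SELECT t.tijdkey, s.*
FROM   datavault.s_h_huurovereenkomst_ssm s
JOIN   datavault.tijd t
  ON   s.dv_start_dts <= t.einddatum
 AND   s.dv_end_dts   >  t.einddatum;

-- Downstream joins can then combine the hub key with equality on tijdkey,
-- which hash and merge joins handle, at the cost of storing the expansion.
```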
Re: Postgres using nested loops despite setting enable_nestloop to false
Hello Tom, thanks for your help!

I understand that the "time" table cross join needs a nested loop. Indeed that nested loop is present in all plans generated. But it is the _second_ (topmost) nested loop that is the issue. Once the time table has been joined, it should be possible to do something else for that second nested loop. This is proven by that query on 9.6 (which has only one nested loop for that exact same query, on almost the same database content as the PostgreSQL 13 one). Even on PostgreSQL 13 a correct plan is made in another database (exact same structure, different data); I have attached the plan that is made there too. All databases that make a plan without the second nested loop also finish the query within a reasonable time period (16 seconds on the 9.6 server). On the 13 server with the nested-loops plan, the process times out after 2 hours.

As far as the row counts go: yes, this database is not by far the biggest one, so the row counts are less. It also depends on what query we actually run (we can have hundreds of them on different tables, and not all tables are that big). I did not disable nested loops just for fun; I disabled them because otherwise many of the queries effectively hang, as their plan estimate expects only a few rows while in reality there are millions. Without disabling nested loops, lots of the generated queries fail, even on smaller datasets.

I have no idea how to get rid of those inequality queries, except by not using SQL and doing them by hand in code. That would prove to be disastrous for performance as I'd have to read all those datasets completely... Do you have an idea on how to do that better?

Regards,

Frits

On Tue, Nov 17, 2020 at 5:21 PM Tom Lane wrote:
> The reason why you're getting a nested loop is that the planner has no
> other choice.
> [...]
> In the end, I fear that finding a way to get rid of those
> inequality join conditions may be your only real answer.
>
> regards, tom lane

[Attachment: plan from the database where the second nested loop is absent]

GroupAggregate  (cost=1164791.63..1164797.65 rows=172 width=68)
  Group Key: (COALESCE(adres_pe.id_s, '-1'::integer)), (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), (COALESCE(di01905cluster.id_s, '-1'::integer)), (COALESCE(di02697relatie_pe.id_s, '-1'::integer)), (COALESCE(di04238cluster.id_s, '-1'::integer)), (COALESCE(di04306natuurlijkpersoon_pe.id_s, '-1'::integer)), (COALESCE(eenheid_pe.id_s, '-1'::integer)), (COALESCE(s_h_huurovereenkomst_ssm.id_s, '-1'::integer))
  ->  Sort  (cost=1164791.63..1164792.06 rows=172 width=91)
        Sort Key: (COALESCE(adres_pe.id_s, '-1'::integer)), (COALESCE(tijd.tijdkey, 'Unknown'::character varying)), (COALESCE(di01905cluster.id_s, '-1'::integer)), (COALESCE(di02697relatie_pe.id_s, '-1'::integer)), (COALESCE(di04238cluster.id_s, '-1'::integer)), (COALESCE(di04306natuurlijkpersoon_pe.id_s, '-1'::integer)), (COALESCE(eenheid_pe.id_s, '-1'::integer)), (COALESCE(s_h_huurovereenkomst_ssm.id_s, '-1'::integer))
        ->  Hash Right Join  (cost=1157476.98..1164785.24 rows=172 width=91)
              Hash Cond: (adres_pe.id_h_adres = l_adres_eenheid_ssm_pe.id_h_adres)
              Join Filter: ((adres_pe.dv_start_dts <= tijd.einddatum) AND (adres_pe.dv_end_dts > tijd.einddatum))
              ->  Seq Scan on s_h_adres_ssm adres_pe  (cost=0.00..6628.04 rows=180704 width=24)
              ->  Hash  (cost=1157474.83..1157474.83 rows=172 width=70)
                    ->  Hash Right Join  (cost=1139732.16..
