Re: Purging few months old data and vacuuming in production
On 2023-01-07 05:33:33, Ranjith Paliyath wrote:
> > > This is because the deletion step is executed record by record in
> > > the main table, with its connected record(s) delete executions in
> > > the rest of the tables?
> >
> > I don't know if you have ON DELETE CASCADE. Even if you do, you'll
> > have to manually delete the tables not linked by FK. I'd write a
> > PL/pgSQL procedure: pass in a PK and then delete records from the 9
> > tables in the proper order so as to not throw FK constraint errors.
>
> Ok, in the case of our specific 9 tables it would mean finding and
> deleting linked records in 8 tables based on the record chosen in the
> main table. That is, going and deleting records one by one.

If I understood correctly, you have to delete about 3 million records
(worst case) from the main table each day. Including the other 8 tables
those are 27 million DELETE queries, each of which deletes only a few
records. That's about 300 queries per second. I'd be worried about
impacting performance on other queries at this rate.

I'd go for a middle ground: Instead of expiring once per day, use a
shorter interval, maybe once per hour or once per minute. That will
(probably) make each expire job really quick but still create much less
load overall.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
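(For illustration, the procedure suggested above might look something
like this: a minimal sketch with hypothetical names (main_table with PK
id, child tables child_a ... child_h referencing it via a main_id
column); the real schema has 9 tables and only some of them are linked
by FK, so the delete order must follow the actual FK graph:

    CREATE OR REPLACE PROCEDURE purge_record(p_id bigint)
    LANGUAGE plpgsql
    AS $$
    BEGIN
        -- children first, so no FK constraint is ever violated
        DELETE FROM child_a WHERE main_id = p_id;  -- FK-linked child
        DELETE FROM child_b WHERE main_id = p_id;  -- linked by convention only
        -- ... same for the remaining child tables ...
        DELETE FROM main_table WHERE id = p_id;    -- parent last
    END;
    $$;

It would be called once per expired main-table PK, e.g.
CALL purge_record(42);)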
Re: Purging few months old data and vacuuming in production
On 1/7/23 05:29, Peter J. Holzer wrote:
[snip]
> If I understood correctly, you have to delete about 3 million records
> (worst case) from the main table each day. Including the other 8 tables
> those are 27 million DELETE queries, each of which deletes only a few
> records. That's about 300 queries per second. I'd be worried about
> impacting performance on other queries at this rate.

300 *records*/second. Fewer DELETE statements if there are one-many
relationships with the child tables.

> I'd go for a middle ground: Instead of expiring once per day, use a
> shorter interval, maybe once per hour or once per minute. That will
> (probably) make each expire job really quick but still create much less
> load overall.

--
Born in Arizona, moved to Babylonia.
Re: Purging few months old data and vacuuming in production
On 2023-01-07 07:40:01 -0600, Ron wrote:
> On 1/7/23 05:29, Peter J. Holzer wrote:
> If I understood correctly, you have to delete about 3 million records
> (worst case) from the main table each day. Including the other 8 tables
> those are 27 million DELETE queries, each of which deletes only a few
> records. That's about 300 queries per second. I'd be worried about
> impacting performance on other queries at this rate.
>
> 300 records/second. Fewer DELETE statements if there are one-many
> relationships with the child tables.

Nope:

| Each of these tables' daily record increment is on an average 2 to 3
| million

I am assuming that the main table is typical, so there will be 2 to 3
million DELETEs from the main table and also from each of the other 8
tables (which may delete 0, 1, or more records). Also, it was mentioned
that only some of these tables have a direct FK relationship, so the
DELETE queries against the other tables may be (much) more expensive
than a simple `delete from my_table where main_id = :1`.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
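(A side note on keeping those per-row deletes cheap: PostgreSQL does
not automatically index the referencing column of a foreign key, so
each FK-linked child table needs an index along these lines, again with
hypothetical names:

    CREATE INDEX ON child_a (main_id);  -- referencing column of the FK

Without such an index, every per-parent delete scans the whole child
table.)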
Re: Purging few months old data and vacuuming in production
On Saturday, 7 January 2023 at 20:23 +03:00, Peter J. Holzer wrote:
> [snip]

So one may consider deleting from the child tables first, and only
after that deleting from the main table, avoiding foreign key
enforcement work during the delete. Also consider doing the deletes in
relatively small chunks, in a loop.
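(A sketch of such a chunked delete loop, assuming PostgreSQL 11 or
later (for COMMIT inside a DO block) and a hypothetical expiry column
created_at on the main table; the child tables would be purged the same
way before the parent:

    DO $$
    DECLARE
        n bigint;
    BEGIN
        LOOP
            -- delete one small batch of expired parent rows
            DELETE FROM main_table
            WHERE id IN (SELECT id FROM main_table
                         WHERE created_at < now() - interval '3 months'
                         LIMIT 10000);
            GET DIAGNOSTICS n = ROW_COUNT;
            EXIT WHEN n = 0;
            COMMIT;  -- keep each batch's transaction short
        END LOOP;
    END;
    $$;

Short transactions keep locks brief and let autovacuum reclaim the dead
tuples between batches.)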
impact join syntax ?? and gist index ??
Hi,

postgres 12, postgis 3.0

I have a small table A, 11 rows, with a varchar column x and a geometry
column y, and a gist index on the geometry column. The geometry column
contains multipolygons (regions on a map).
I have a second table B, same structure, around 420 000 rows, with no
index. Its geometry column contains points.
All geometries are in srid 4326.

If I ask to count the points in each multipolygon:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;

it takes 11 seconds (everything in shared buffers).

If I do the very same thing as:

select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;

same result, but 85 seconds (everything in shared buffers, again).
If I rerun both with explain (analyze, buffers), the plans are very
different.

If I create a gist index on the geometry column of the big table,
both syntaxes take 21 seconds.

I get the feeling I am missing something (at least 2 things...).
Can someone shed some light?

thanks

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com
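(For reference, the gist index being discussed would be created along
these lines, using the table and column names from the message above.)

    CREATE INDEX ON B USING gist (y);  -- gist index on the points column
    ANALYZE B;                         -- refresh table statistics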
Re: impact join syntax ?? and gist index ??
Yes, I know the 2 syntaxes provide different results: one returns the 6
meaningful rows, the left join adds 5 rows with a count of 0...

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 8:46 PM Marc Millas wrote:
> [snip]
Re: impact join syntax ?? and gist index ??
On postgres 15 and postgis 3.3, with the very same dataset, without the
gist index on the 420k-row table, the left join syntax takes 25 seconds
and the plain join takes 770 ms. So getting the 5 empty rows is 30
times slower.

If I add the gist index, both syntaxes take 770 ms...

At least this closes the discussion about which versions my project
will use :-)

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 8:46 PM Marc Millas wrote:
> [snip]
Re: impact join syntax ?? and gist index ??
> On 07/01/2023 20:46 CET Marc Millas wrote:
> [snip]

Please provide the execution plans for both queries with and without
the index on B.y.

--
Erik
Re: impact join syntax ?? and gist index ??
Here they are (I replaced the column and table names).
I also posted 2 more remarks, one on the left join, and one on the test
I did on postgres 15 / postgis 3.3...

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B on st_within(B.geom, A.geom) group by A.x;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 GroupAggregate  (cost=212638398.98..212701792.16 rows=20 width=16) (actual time=86717.857..86757.820 rows=11 loops=1)
   Group Key: A.x
   Buffers: shared hit=4243867
   ->  Sort  (cost=212638398.98..212659529.97 rows=8452398 width=16) (actual time=86717.851..86727.334 rows=421307 loops=1)
         Sort Key: A.x
         Sort Method: quicksort  Memory: 37963kB
         Buffers: shared hit=4243867
         ->  Nested Loop Left Join  (cost=0.00..211521459.31 rows=8452398 width=16) (actual time=17.473..86642.332 rows=421307 loops=1)
               Join Filter: st_within(B.geom, A.geom)
               Rows Removed by Join Filter: 4229377
               Buffers: shared hit=4243867
               ->  Seq Scan on A  (cost=0.00..9.20 rows=20 width=17752) (actual time=0.009..0.043 rows=11 loops=1)
                     Buffers: shared hit=9
               ->  Materialize  (cost=0.00..22309.83 rows=422789 width=40) (actual time=0.001..23.392 rows=422789 loops=11)
                     Buffers: shared hit=15968
                     ->  Seq Scan on B  (cost=0.00..20195.89 rows=422789 width=40) (actual time=0.006..57.651 rows=422789 loops=1)
                           Buffers: shared hit=15968
 Planning Time: 0.693 ms
 Execution Time: 86763.087 ms
(19 rows)

2023=# explain (analyze, buffers) select A.x, count(B.x) from A, B where st_within(B.geom, A.geom) group by A.x;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6301606.00..6301608.60 rows=20 width=16) (actual time=11857.363..11863.212 rows=6 loops=1)
   Group Key: A.x
   Buffers: shared hit=2128836
   ->  Gather Merge  (cost=6301606.00..6301608.30 rows=20 width=16) (actual time=11857.359..11863.207 rows=12 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         Buffers: shared hit=2128836
         ->  Sort  (cost=6300605.99..6300606.04 rows=20 width=16) (actual time=11840.355..11840.356 rows=6 loops=2)
               Sort Key: A.x
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=2128836
               ->  Partial HashAggregate  (cost=6300605.36..6300605.56 rows=20 width=16) (actual time=11840.331..11840.332 rows=6 loops=2)
                     Group Key: A.x
                     Buffers: shared hit=2128825
                     ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
                           Buffers: shared hit=2128825
                           ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
                                 Buffers: shared hit=15968
                           ->  Index Scan using A_geom_idx on A  (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
                                 Index Cond: (geom ~ B.geom)
                                 Filter: st_within(B.geom, geom)
                                 Rows Removed by Filter: 0
                                 Buffers: shared hit=2112857
 Planning Time: 0.252 ms
 Execution Time: 11863.357 ms
(26 rows)

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 9:40 PM Erik Wienhold wrote:
> [snip]
Re: impact join syntax ?? and gist index ??
Marc Millas writes:
> 2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B
> on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

> ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
>       ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
>       ->  Index Scan using A_geom_idx on A  (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
>             Index Cond: (geom ~ B.geom)
>             Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.)  There's no
chance for merge or hash join because those require simple equality
join conditions.  The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does.  But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

			regards, tom lane
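(Not something from Tom's reply, but given the above, one way to keep
the left join's zero-count rows while still letting the small table
drive the loop is a LATERAL subquery; a sketch using the column names
from the plans, which needs a gist index on B.geom to be fast:

    select A.x, s.cnt
    from A
    cross join lateral (
        select count(*) as cnt           -- 0 when the region holds no points
        from B
        where st_within(B.geom, A.geom)  -- can probe a gist index on B.geom
    ) s;

The subquery runs once per row of A, so the planner gets the
nestloop-with-inner-indexscan shape with A on the outside, yet every
region still shows up in the output.)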
Re: impact join syntax ?? and gist index ??
I read your answer, Tom, but I cannot connect it to my measurements:
why did adding the index make the query twice as slow?

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Sat, Jan 7, 2023 at 10:33 PM Tom Lane wrote:
> [snip]
Re: impact join syntax ?? and gist index ??
Marc Millas writes:
> I read your answer, Tom, but I cannot connect it to my measurements:
> why did adding the index make the query twice as slow?

Are you referring to

>>> If I create a gist index on the geometry column of the big table,
>>> both syntaxes take 21 seconds.

?  That result is pretty much what I'd expect.  If the planner has to
form a nestloop-with-inner-indexscan between a small table and a big
one, it's pretty much always going to prefer to put the small table on
the outside of the loop if it can.  The cost of such a loop is going to
be more or less the number of outer rows times the log of the number of
inner rows (assuming, at great risk of oversimplification, that an
index probe into a table of size N requires about O(log N) work), and
it's not hard to see that S * log(B) is less than B * log(S).  You seem
to have lucked into a case where the other way comes out faster despite
that, which perhaps can be explained by buffering effects, but it's not
something to bet on across the board.

(If you see this holding consistently, maybe it'd be advisable to
reduce the planner's effective_cache_size parameter to something closer
to shared_buffers, as it seems to indicate that fetches from kernel
space are pretty expensive on your platform.)

			regards, tom lane
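(Plugging in the sizes from this thread, S = 11 rows in A and
B = 422789 rows, and taking log base 2:

    S * log2(B) = 11 * log2(422789)     ≈ 11 * 18.7      ≈ 206
    B * log2(S) = 422789 * log2(11)     ≈ 422789 * 3.46  ≈ 1.46 million

which is why, once B has an index, the planner strongly prefers to put
the small table on the outside of the loop.)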