Re: Purging few months old data and vacuuming in production

2023-01-07 Thread Peter J. Holzer
On 2023-01-07 05:33:33 +, Ranjith Paliyath wrote:
> 
>   > This is because the deletion step is executed record
>   > by record in main table, with its connected record(s)
>   > delete executions in rest of tables? 
> 
>   > I don't know if you have ON DELETE CASCADE.  Even if you do,
>   > you'll have to manually delete from the tables not linked by FK. 
>   > I'd write a PL/pgSQL procedure: pass in a PK and then delete
>   > records from the 9 tables in the proper order so as to not
>   > throw FK constraint errors. 
> 
> Ok, in the case of our specific 9 tables it would mean finding and
> deleting linked records in 8 tables based on the record chosen in the
> main table. That is, going through and deleting records one by one.

If I understood correctly, you have to delete about 3 million records
(worst case) from the main table each day. Including the other 8 tables
those are 27 million DELETE queries each of which deletes only a few
records. That's about 300 queries per second. I'd be worried about
impacting performance on other queries at this rate.

I'd go for a middle ground: Instead of expiring once per day, use a
shorter interval, maybe once per hour or once per minute. That will
(probably) make each expire job really quick but still create much less
load overall.
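
A minimal sketch of such a set-based expiry pass (table and column
names here are hypothetical: a main_table with an id and a created_at
timestamp, and child tables referencing main_id), run e.g. hourly from
cron or pg_cron:

    BEGIN;
    -- collect the PKs to expire once, so all tables agree on the set
    CREATE TEMP TABLE doomed ON COMMIT DROP AS
        SELECT id FROM main_table
        WHERE created_at < now() - interval '3 months';
    DELETE FROM child_table_1 WHERE main_id IN (SELECT id FROM doomed);
    -- ... repeat for the other child tables, in FK-safe order ...
    DELETE FROM main_table WHERE id IN (SELECT id FROM doomed);
    COMMIT;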

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"




Re: Purging few months old data and vacuuming in production

2023-01-07 Thread Ron

On 1/7/23 05:29, Peter J. Holzer wrote:
[snip]

> If I understood correctly, you have to delete about 3 million records
> (worst case) from the main table each day. Including the other 8 tables
> those are 27 million DELETE queries each of which deletes only a few
> records. That's about 300 queries per second. I'd be worried about
> impacting performance on other queries at this rate.


300 *records*/second.  Fewer DELETE statements if there are one-many 
relationships with the child tables.



> I'd go for a middle ground: Instead of expiring once per day, use a
> shorter interval, maybe once per hour or once per minute. That will
> (probably) make each expire job really quick but still create much less
> load overall.


--
Born in Arizona, moved to Babylonia.

Re: Purging few months old data and vacuuming in production

2023-01-07 Thread Peter J. Holzer
On 2023-01-07 07:40:01 -0600, Ron wrote:
> On 1/7/23 05:29, Peter J. Holzer wrote:
> > If I understood correctly, you have to delete about 3 million records
> > (worst case) from the main table each day. Including the other 8 tables
> > those are 27 million DELETE queries each of which deletes only a few
> > records. That's about 300 queries per second. I'd be worried about
> > impacting performance on other queries at this rate.
> 
> 
> 300 records/second.  Fewer DELETE statements if there are one-many
> relationships with the child tables.

Nope:

| Each of these tables' daily record increment is on an average 2 to 3
| million

I am assuming that the main table is typical, so there will be 2 to 3
million DELETEs from the main table and also from each of the other 8
tables (which may delete 0, 1, or more records). Also, it was mentioned
that only some of these tables have a direct FK relationship, so the
DELETE queries against the other tables may be (much) more expensive
than a simple `delete from my_table where main_id = :1`.
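
For a table without a direct FK to the main table, the delete may have
to go through an intermediate table. A made-up illustration (none of
these names are from the thread):

    -- grandchild has no main_id of its own, only a reference to child,
    -- which carries the main_id
    DELETE FROM grandchild g
    USING child c
    WHERE g.child_id = c.id
      AND c.main_id = :1;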

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"




Re: Purging few months old data and vacuuming in production

2023-01-07 Thread Николай Кобзарев



Saturday, 7 January 2023, 20:23 +03:00 from Peter J. Holzer:
>[snip]


So one may consider deleting from the child tables first, and only
after that deleting from the main table, avoiding foreign-key
enforcement work during the delete. Also consider doing the deletes in
relatively small chunks, in a loop, as sketched below.
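
A sketch of such a chunked loop (table and column names are made up;
COMMIT inside a DO block needs PostgreSQL 11 or later, and the block
must not run inside an outer transaction):

    DO $$
    DECLARE
        doomed bigint[];
    BEGIN
        LOOP
            SELECT array_agg(id) INTO doomed
            FROM (SELECT id
                  FROM main_table
                  WHERE created_at < now() - interval '3 months'
                  LIMIT 10000) s;
            EXIT WHEN doomed IS NULL;
            -- children first, so FK checks never see dangling references
            DELETE FROM child_table_1 WHERE main_id = ANY (doomed);
            -- ... repeat for the remaining child tables ...
            DELETE FROM main_table WHERE id = ANY (doomed);
            COMMIT;  -- each chunk in its own transaction
        END LOOP;
    END $$;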

impact join syntax ?? and gist index ??

2023-01-07 Thread Marc Millas
Hi,

postgres 12, postgis 3.0

I have a small table A: 11 rows, with a varchar column x and a geometry
column y, and a gist index on the geometry column. The geometries are
multipolygons (regions on a map).
I have a second table B, same structure, around 420 000 rows, but no
index; its geometries are points.
All geometries use SRID 4326.
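
For concreteness, a minimal reconstruction of that setup might be
(column types assumed from the description):

    create table A (x varchar, y geometry(MultiPolygon, 4326));
    create index on A using gist (y);
    create table B (x varchar, y geometry(Point, 4326));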

If I ask to count the points in each multipolygon:

select A.x, count(B.x) from A, B where st_within(B.y, A.y) group by A.x;

it takes 11 seconds (everything in shared buffers).
If I do the very same thing as:

select A.x, count(B.x) from A left join B on st_within(B.y, A.y) group by A.x;

same result, but 85 seconds (everything in shared buffers, again).
If I rerun both with explain (analyze, buffers), the plans are very
different.

If I create a gist index on the geometry column of the big table,
both syntaxes take 21 seconds.

I get the feeling I am missing something.. (at least 2 things...)
Can someone shed some light ??

thanks


Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com


Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Marc Millas
Yes, I know the two syntaxes produce different results: one returns the
6 meaningful rows, and the left join adds 5 rows with a count of 0...

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Sat, Jan 7, 2023 at 8:46 PM Marc Millas  wrote:

> [snip]

Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Marc Millas
On postgres 15 and postgis 3.3, with the very same dataset,
without a gist index on the 420k-row table, the left join syntax takes
25 seconds, and the inner join syntax 770 ms.
So getting the 5 empty rows costs a factor of 30.
If I add the gist index, both syntaxes take 770 ms...

At least this closes the discussion about which versions my project
will use :-)
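
For reference, the index in question would be along these lines (column
name as in the original queries; the analyze afterwards refreshes the
statistics):

    create index on B using gist (y);
    analyze B;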


Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Sat, Jan 7, 2023 at 8:46 PM Marc Millas  wrote:

> [snip]

Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Erik Wienhold
> On 07/01/2023 20:46 CET Marc Millas  wrote:
>
> [snip]

Please provide the execution plans for both queries, with and without
the index on B.y.

--
Erik




Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Marc Millas
Here they are (I replaced the column and table names). I also add two
more remarks: one on the left join, and one on the test I did on
postgres 15 / postgis 3.3...

2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B on st_within(B.geom, A.geom) group by A.x;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=212638398.98..212701792.16 rows=20 width=16) (actual time=86717.857..86757.820 rows=11 loops=1)
   Group Key: A.x
   Buffers: shared hit=4243867
   ->  Sort  (cost=212638398.98..212659529.97 rows=8452398 width=16) (actual time=86717.851..86727.334 rows=421307 loops=1)
         Sort Key: A.x
         Sort Method: quicksort  Memory: 37963kB
         Buffers: shared hit=4243867
         ->  Nested Loop Left Join  (cost=0.00..211521459.31 rows=8452398 width=16) (actual time=17.473..86642.332 rows=421307 loops=1)
               Join Filter: st_within(B.geom, A.geom)
               Rows Removed by Join Filter: 4229377
               Buffers: shared hit=4243867
               ->  Seq Scan on A  (cost=0.00..9.20 rows=20 width=17752) (actual time=0.009..0.043 rows=11 loops=1)
                     Buffers: shared hit=9
               ->  Materialize  (cost=0.00..22309.83 rows=422789 width=40) (actual time=0.001..23.392 rows=422789 loops=11)
                     Buffers: shared hit=15968
                     ->  Seq Scan on B  (cost=0.00..20195.89 rows=422789 width=40) (actual time=0.006..57.651 rows=422789 loops=1)
                           Buffers: shared hit=15968
 Planning Time: 0.693 ms
 Execution Time: 86763.087 ms
(19 rows)


2023=# explain (analyze, buffers) select A.x, count(B.x) from A, B where st_within(B.geom, A.geom) group by A.x;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=6301606.00..6301608.60 rows=20 width=16) (actual time=11857.363..11863.212 rows=6 loops=1)
   Group Key: A.x
   Buffers: shared hit=2128836
   ->  Gather Merge  (cost=6301606.00..6301608.30 rows=20 width=16) (actual time=11857.359..11863.207 rows=12 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         Buffers: shared hit=2128836
         ->  Sort  (cost=6300605.99..6300606.04 rows=20 width=16) (actual time=11840.355..11840.356 rows=6 loops=2)
               Sort Key: A.x
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=2128836
               ->  Partial HashAggregate  (cost=6300605.36..6300605.56 rows=20 width=16) (actual time=11840.331..11840.332 rows=6 loops=2)
                     Group Key: A.x
                     Buffers: shared hit=2128825
                     ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
                           Buffers: shared hit=2128825
                           ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
                                 Buffers: shared hit=15968
                           ->  Index Scan using A_geom_idx on A  (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
                                 Index Cond: (geom ~ B.geom)
                                 Filter: st_within(B.geom, geom)
                                 Rows Removed by Filter: 0
                                 Buffers: shared hit=2112857
 Planning Time: 0.252 ms
 Execution Time: 11863.357 ms
(26 rows)




Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Sat, Jan 7, 2023 at 9:40 PM Erik Wienhold  wrote:

> [snip]
Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Tom Lane
Marc Millas  writes:
> 2023=# explain (analyze, buffers) select A.x, count(B.x) from A left join B on st_within(B.geom, A.geom) group by A.x;

So the problem with this is that the only decently-performant way to
do the join is like

>    ->  Nested Loop  (cost=0.13..6275745.36 rows=4971999 width=16) (actual time=0.505..11781.817 rows=210651 loops=2)
>          ->  Parallel Seq Scan on B  (cost=0.00..18454.99 rows=248699 width=40) (actual time=0.005..22.859 rows=211395 loops=2)
>          ->  Index Scan using A_geom_idx on A  (cost=0.13..25.15 rows=1 width=17752) (actual time=0.054..0.055 rows=1 loops=422789)
>                Index Cond: (geom ~ B.geom)
>                Filter: st_within(B.geom, geom)

(Ignore the parallelism, it's not very relevant here.)  There's no
chance for merge or hash join because those require simple equality
join conditions.  The only way to avoid a stupid
compare-every-row-of-A-to-every-row-of-B nestloop is to use a
parameterized inner indexscan, as this plan does.  But that only works
if the join is inner or has the indexed table on the nullable side.
We have no support for nestloop right join, which is what would be
needed to make things run fast with no index on B.

regards, tom lane
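
One way around that limitation (a sketch, not from the thread) is to
aggregate over the fast inner join and then left-join the small table
back to recover the zero-count rows:

    select A.x, coalesce(s.cnt, 0) as cnt
    from A
    left join (
        select A2.x, count(*) as cnt
        from A A2
        join B on st_within(B.y, A2.y)  -- inner join: the planner can
                                        -- use the parameterized index
                                        -- scan on A
        group by A2.x
    ) s on s.x = A.x;  -- assumes x is unique in A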




Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Marc Millas
I read your answer, Tom, but I cannot connect it to my measurements: why
did adding the index make the query twice as slow??

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Sat, Jan 7, 2023 at 10:33 PM Tom Lane  wrote:

> [snip]


Re: impact join syntax ?? and gist index ??

2023-01-07 Thread Tom Lane
Marc Millas  writes:
> I read your answer, Tom, but I cannot connect it to my measurements: why
> did adding the index make the query twice as slow??

Are you referring to

>>> If I create a gist index on the geometry column of the big table,
>>> both syntaxes take 21 seconds.

?  That result is pretty much what I'd expect.  If the planner has
to form a nestloop-with-inner-indexscan between a small table and
a big one, it's pretty much always going to prefer to put the small
table on the outside of the loop if it can.  The cost of such a
loop is going to be more or less number of outer rows times the
log of the number of inner rows (assuming at great risk of
oversimplification that an index probe into a table of size N
requires about O(log N) work), and it's not hard to see that
S * log(B) is less than B * log(S).  You seem to have lucked into a
case where the other way comes out faster despite that, which perhaps
can be explained by buffering effects, but it's not something to bet
on across-the-board.
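
(To put rough numbers on it, using the row counts from the posted
plans: with S = 11 and B = 422789, S * log2(B) is about 11 * 18.7, or
roughly 200 probes' worth of work, while B * log2(S) is about
422789 * 3.5, or roughly 1.5 million. On paper the plan the planner
prefers is cheaper by almost four orders of magnitude, which is why
the measured outcome here counts as luck.)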

(If you see this holding consistently, maybe it'd be advisable to
reduce the planner's effective_cache_size parameter to something
closer to shared_buffers, as it seems to indicate that fetches
from kernel space are pretty expensive on your platform.)

regards, tom lane