Scenarios that Replication Slot will be Marked as Active = false

2024-05-10 Thread Avi Weinberg
Hi Experts,

I would like to write a monitoring script that checks the logical replication 
running on our setup.  For that I need to know when to alert upon seeing a 
replication slot marked as active = false.  In particular, I need to 
distinguish the cases where this is a temporary state I should wait out from 
the cases where I should alert immediately.

If the replication slot falls behind the primary server, will it be marked as 
Active = false until the replica catches up?  If so, can a lag threshold be 
configured so that the slot is marked inactive only once the lag crosses it?

Will failure in the replica affect the active status of its replication slot?

What other scenarios can cause it to be inactive?
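
For reference, the kind of check I have in mind is roughly the query below (a
minimal sketch against the standard pg_replication_slots view; the 1 GB lag
threshold is just a placeholder):

SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
   OR pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024^3;  -- ~1 GB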

Thanks!




Re: Unexpected data when subscribing to logical replication slot

2024-05-10 Thread Daniel McKenzie
Thank you all for your input.

We have solved the problem by -

   1. Configuring wal2json to include xids.
   2. Updating our enrichment queries to return the xmin.
   3. Retrying our enrichment queries after a brief sleep if xmin < xid.

This seems to be working very well so far.
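
For anyone curious, a rough sketch of what steps 1 and 2 look like on the SQL
side (slot and table names below are placeholders; the retry loop from step 3
lives in our .NET consumer):

-- 1. Ask wal2json to emit transaction ids alongside each change
--    (peeking at the slot here purely for illustration):
SELECT data
FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL,
                                  'include-xids', '1');

-- 2. The enrichment query also returns the row's xmin, so the consumer can
--    compare it with the xid reported by wal2json and retry if xmin < xid:
SELECT xmin::text::bigint AS row_xmin, t.*
FROM some_table t
WHERE t.id = $1;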

Daniel McKenzie
Software Developer

Office: +1 403.910.5927 x 251
Mobile: +44 7712 159045
Website: www.curvedental.com



On Thu, May 9, 2024 at 5:45 PM Adrian Klaver 
wrote:

> On 5/9/24 00:32, Daniel McKenzie wrote:
>
> > We've had this running in live now for years without a hiccup so we are
> > surprised to learn that we have this massive race condition and it just
> > so happens that the hardware is fast enough to process the transaction
> > before the .NET application can react to replication slot changes.
>
> On a broad scale, I'm going to say that over those years there has been an
> increase in load on the Postgres server as well as the I/O system of the
> machine it is running on. What you are seeing now is the canary in the
> coal mine giving you the heads-up that more trouble lies ahead as the
> hardware and software reach their load limits.
>
> On a finer scale, my guess is that the following is happening when
> synchronous_commit = on:
>
> 1) Postgres session #1 does data change.
>
> 2) This is replicated out and picked up by wal2json, which sees the new
> data.
>
> 3) The Postgres server waits for the confirmation that the WAL record
> has been written out to disk. Upon confirmation it commits on the
> server. This is the part that I am not sure of in relation to wal2json.
>
> 4) Postgres session #2 queries the database for the record. In the case
> where 3) has not completed, it sees the old values, as the data change in
> session #1 has not committed and therefore the new values are not visible
> to other sessions.
>
> >
> > Daniel McKenzie
> > Software Developer
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>


Re: Postgresql active-active nodes in cluster

2024-05-10 Thread Vidyashree H S
Thanks all for your inputs.


  * Currently I'm working on active-passive (primary and standby) scenarios.
  * I'm preparing a study report on active-active node implementation, which
    covers the background of active-active nodes, a theoretical explanation,
    and an implementation part that includes some trials. For that, I'm
    looking for resources.



Thanks
Vidyashree H S

From: Bruce Momjian 
Sent: Friday, May 10, 2024 4:53 AM
To: Greg Sabino Mullane 
Cc: Kashif Zeeshan ; Vidyashree H S 
; pgsql-general@lists.postgresql.org 

Subject: Re: Postgresql active-active nodes in cluster

On Thu, May  9, 2024 at 06:33:39PM -0400, Greg Sabino Mullane wrote:
>
> https://wiki.postgresql.org/wiki/Multimaster
>
>
> That is a pretty old page. There are other solutions, such as pgedge, but the
> three most important questions when it comes to active-active replication are:
>
> * Do you really, really need it? (hint: the answer is no)
> * What problem are you trying to solve?
> * Are you really willing to suffer the tradeoffs?

Agreed, see this:

https://momjian.us/main/blogs/pgblog/2018.html#December_24_2018

--
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.


SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:


\d test_runs_raw

              Partitioned table "public.test_runs_raw"
      Column       |   Type   | Collation | Nullable |             Default
-------------------+----------+-----------+----------+----------------------------------
 run_n             | bigint   |           | not null | generated by default as identity
 test_case_n       | smallint |           | not null |
 workitem_n        | integer  |           | not null |
 test_resulttype_n | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k = kilo), up to roughly 20M across the 1000 partitions.
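
For illustration, the partitions were created along these lines (a
reconstructed sketch, not the exact DDL; names and bounds are assumed from
the naming scheme above):

CREATE TABLE test_runs_raw__part_max20k PARTITION OF test_runs_raw
    FOR VALUES FROM (0) TO (20000);
CREATE TABLE test_runs_raw__part_max40k PARTITION OF test_runs_raw
    FOR VALUES FROM (20000) TO (40000);
-- ... and so on, one partition per 20k range of workitem_n.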


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that for 998 of the 1000 partition scans it involves,
the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
         ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 rows=64121612 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 rows=61612190 width=4)
 [ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 width=4)
 [ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned one after another, *in proper order*,
so that plan execution stops after reading the first
10 rows of the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n values are well spread across the range
(0,13M) and the table has been gradually filled within one year, so I'm
assuming the autovacuum worker has run long enough to build sane statistics
(not sure how to verify that).


REMARKS

1. I tried reproducing the problem on an artificial table with a few
   partitions and a few values, but I couldn't. Both queries execute fast,
   and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting
   max_parallel_workers_per_gather to 0 (from the original value of 4)
   fixes the issue! On the original huge table, disabling parallelism
   makes the query immensely faster: it returns within 1s! Is
   this a bug in the planner?


Thank you,
Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Dimitrios Apostolou wrote:


I noticed that the following query is very very slow (too long to wait for it
to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;


Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN
SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
         ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
               Sort Key: test_runs_raw.workitem_n
               ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 rows=32060806 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 rows=30806095 width=4)


It also takes ages to return, so I have to interrupt it.

I believe it should exit early, as soon as it finds 10 distinct values
(which should be rather easy even with parallel seqscans, given the
pattern followed when inserting the data).


Thanks,
Dimitris




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou




On Fri, 10 May 2024, Dimitrios Apostolou wrote:


On Fri, 10 May 2024, Dimitrios Apostolou wrote:

Update: even the simplest SELECT DISTINCT query shows similar behaviour:


Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
         ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
               ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
[...]
               ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 rows=106088160 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 rows=96214080 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 rows=193248960 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 rows=218440928 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 rows=99431760 width=4)
[...]


The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).

This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Tom Lane
Dimitrios Apostolou  writes:
> Further digging into this simple query, if I force the non-parallel plan
> by setting max_parallel_workers_per_gather TO 0, I see that the query
> planner comes up with a cost much higher:

>   Limit  (cost=363.84..1134528847.47 rows=10 width=4)
> ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
> ...

> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
> higher upper limit than the total cost when
> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
> This explains the planner's choice. But I wonder why the cost estimation
> is so far away from reality.

I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.

Possibly an explicit ANALYZE on the partitioned table would help.
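
Concretely, something along these lines (sketch; the n_distinct_inherited
override is only a fallback if ANALYZE alone doesn't improve the estimate,
and the value below is a placeholder):

ANALYZE test_runs_raw;

-- If the inherited ndistinct estimate is still far off, it can be pinned
-- manually and ANALYZE re-run:
ALTER TABLE test_runs_raw
    ALTER COLUMN workitem_n SET (n_distinct_inherited = 1000000);
ANALYZE test_runs_raw;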

regards, tom lane




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:



  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
  ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
...



The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.


Indeed that's an awful estimate; the table has more than 1M unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuumed or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?



Possibly an explicit ANALYZE on the partitioned table would help.


Thanks, I'll save the ANALYZE as the last step; I feel it's a good
opportunity to figure out more details about how postgres works. Plus I
expect ANALYZE to last a couple of days, so I should first find quiet time
for that. :-)

Dimitris




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Tom Lane
Dimitrios Apostolou  writes:
> On Fri, 10 May 2024, Tom Lane wrote:
>> I'd say the blame lies with that (probably-default) estimate of
>> just 200 distinct rows.  That means the planner expects to have
>> to read about 5% (10/200) of the tables to get the result, and
>> that's making fast-start plans look bad.

> In any case, even after the planner decides to execute the terrible plan
> with the parallel seqscans, why doesn't it finish right when it finds 10
> distinct values?

That plan can't emit anything at all till it finishes the Sort.

I do kind of wonder why it's producing both a hashagg and a Unique
step --- seems like it should do one or the other.

> Thanks, I'll save the ANALYZE as the last step; I feel it's a good
> opportunity to figure out more details about how postgres works. Plus I
> expect ANALYZE to last a couple of days, so I should first find quiet time
> for that. :-)

It really should not take too long --- it reads a sample, not the
whole table.

regards, tom lane




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread David Rowley
On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou  wrote:
> Indeed that's an awful estimate; the table has more than 1M unique
> values in that column. Looking into pg_stat_user_tables, I can't see the
> partitions having been vacuumed or analyzed at all. I think they should
> have been auto-analyzed, since they get a ton of INSERTs
> (no deletes/updates though) and I have the default autovacuum settings.
> Could it be that autovacuum starts, but never
> finishes? I can't find anything in the logs.

It's not the partitions getting analyzed you need to worry about for
an ndistinct estimate on the partitioned table. It's auto-analyze or
ANALYZE on the partitioned table itself that you should care about.

If you look at [1], it says "Tuples changed in partitions and
inheritance children do not trigger analyze on the parent table."

> In any case, even after the planner decides to execute the terrible plan
> with the parallel seqscans, why doesn't it finish right when it finds 10
> distinct values?

It will. It's just that Sorting requires fetching everything from its subnode.
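
One well-known workaround for this class of query (described on the
PostgreSQL wiki as a "loose indexscan") is a recursive CTE that probes the
index once per distinct value, so the LIMIT can stop after 10 probes. A
rough sketch using the names from upthread:

WITH RECURSIVE distinct_workitems AS (
    SELECT max(workitem_n) AS workitem_n FROM test_runs_raw
    UNION ALL
    SELECT (SELECT max(workitem_n)
            FROM test_runs_raw
            WHERE workitem_n < d.workitem_n)
    FROM distinct_workitems d
    WHERE d.workitem_n IS NOT NULL
)
SELECT workitem_n
FROM distinct_workitems
WHERE workitem_n IS NOT NULL
LIMIT 10;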

David

[1] 
https://www.postgresql.org/docs/16/routine-vacuuming.html#VACUUM-FOR-STATISTICS




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread David Rowley
On Sat, 11 May 2024 at 13:33, Tom Lane  wrote:
> I do kind of wonder why it's producing both a hashagg and a Unique
> step --- seems like it should do one or the other.

It still needs to make the duplicate groups from parallel workers unique.

David




Unnecessary buffer usage with multicolumn index, row comparison, and equality constraint

2024-05-10 Thread WU Yan
Hi everyone, first time here. Please kindly let me know if this is not the
right place to ask.

I notice a simple query can read a lot of buffer blocks in a meaningless
way, when
1. there is an index scan on a multicolumn index
2. there is row constructor comparison in the Index Cond
3. there is also an equality constraint on the leftmost column of the
multicolumn index


## How to reproduce

I initially noticed it on AWS Aurora RDS, but it can be reproduced in
docker container as well.
```bash
docker run --name test-postgres -e POSTGRES_PASSWORD=mysecretpassword -d -p
5432:5432 postgres:16.3
```

Create a table with a multicolumn index. Populate 12 million rows with
random integers.
```sql
CREATE TABLE t(a int, b int);
CREATE INDEX my_idx ON t USING BTREE (a, b);

INSERT INTO t(a, b)
SELECT
(random() * 123456)::int AS a,
(random() * 123456)::int AS b
FROM
generate_series(1, 12345678);

ANALYZE t;
```

Simple query that uses the multicolumn index.
```
postgres=# explain (analyze, buffers) select * from t where row(a, b) >
row(123450, 123450) and a = 0 order by a, b;
QUERY PLAN
---
 Index Only Scan using my_idx on t  (cost=0.43..8.46 rows=1 width=8)
(actual time=284.312..284.314 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a = 0))
   Heap Fetches: 0
   Buffers: shared hit=3777 read=37304 written=11713
 Planning:
   Buffers: shared hit=22 read=4
 Planning Time: 0.270 ms
 Execution Time: 284.341 ms
(8 rows)
```

## Expected output

The number of buffer blocks used is high. I expect it to be no more than
when there’s only one constraint.

```
postgres=# explain (analyze, buffers) select * from t where row(a, b) >
row(123450, 123450) order by a, b;
 QUERY PLAN

 Index Only Scan using my_idx on t  (cost=0.43..23.67 rows=642 width=8)
(actual time=0.030..0.158 rows=542 loops=1)
   Index Cond: (ROW(a, b) > ROW(123450, 123450))
   Heap Fetches: 0
   Buffers: shared hit=254 read=3
 Planning:
   Buffers: shared read=4
 Planning Time: 0.232 ms
 Execution Time: 0.206 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where a = 0 order by
a, b;
QUERY PLAN
--
 Index Only Scan using my_idx on t  (cost=0.43..6.20 rows=101 width=8)
(actual time=0.099..0.113 rows=57 loops=1)
   Index Cond: (a = 0)
   Heap Fetches: 0
   Buffers: shared hit=27 read=2
 Planning Time: 0.081 ms
 Execution Time: 0.131 ms
(6 rows)
```

## Postgres version

16.3

## Platform information

I can reproduce it on the latest postgres docker image, which is based on
Debian Linux. Originally found the issue on AWS Aurora.



The following are my own observations and thoughts. Please disregard if this
is a distraction.

For a general form of
```sql
select * from t where (a, b) > (x, y) and a = z order by a, b;
```

1. The number of buffer blocks is proportional to the gap between x and z.
Strictly, it's max(0, min(x, max(a)) - max(z, min(a))).

```
postgres=# explain (analyze, buffers) select * from t where row(a, b) >
row(123450, 123450) and a = -3 order by a, b;
QUERY PLAN
---
 Index Only Scan using my_idx on t  (cost=0.43..4.45 rows=1 width=8)
(actual time=243.173..243.175 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a =
'-3'::integer))
   Heap Fetches: 0
   Buffers: shared hit=1 read=41080
 Planning:
   Buffers: shared hit=2 read=2
 Planning Time: 0.174 ms
 Execution Time: 243.199 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where row(a, b) >
row(123450, 123450) and a = 0 order by a, b;
QUERY PLAN
---
 Index Only Scan using my_idx on t  (cost=0.43..4.45 rows=1 width=8)
(actual time=230.425..230.426 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a = 0))
   Heap Fetches: 0
   Buffers: shared hit=1 read=41080
 Planning:
   Buffers: shared read=4
 Planning Time: 0.296 ms
 Execution Time: 230.460 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where row(a, b) >
row(123450, 123450) and a = 3 order by a, b;
QUERY PLAN
---
 Index Only Scan using

Re: Unnecessary buffer usage with multicolumn index, row comparison, and equality constraint

2024-05-10 Thread Ron Johnson
On Fri, May 10, 2024 at 11:28 PM WU Yan <4wu...@gmail.com> wrote:

> Hi everyone, first time here. Please kindly let me know if this is not the
> right place to ask.
>
> I notice a simple query can read a lot of buffer blocks in a meaningless
> way, when
> 1. there is an index scan on a multicolumn index
> 2. there is row constructor comparison in the Index Cond
> 3. there is also an equality constraint on the leftmost column of the
> multicolumn index
>
>
> ## How to reproduce
>
> I initially noticed it on AWS Aurora RDS, but it can be reproduced in
> docker container as well.
> ```bash
> docker run --name test-postgres -e POSTGRES_PASSWORD=mysecretpassword -d
> -p 5432:5432 postgres:16.3
> ```
>
> Create a table with a multicolumn index. Populate 12 million rows with
> random integers.
> ```sql
> CREATE TABLE t(a int, b int);
> CREATE INDEX my_idx ON t USING BTREE (a, b);
>
> INSERT INTO t(a, b)
> SELECT
> (random() * 123456)::int AS a,
> (random() * 123456)::int AS b
> FROM
> generate_series(1, 12345678);
>
> ANALYZE t;
> ```
>
> Simple query that uses the multicolumn index.
> ```
> postgres=# explain (analyze, buffers) select * from t where row(a, b) >
> row(123450, 123450) and a = 0 order by a, b;
>

Out of curiosity, why "where row(a, b) > row(123450, 123450)" instead of "where
a > 123450 and b > 123450"?


Re: Unnecessary buffer usage with multicolumn index, row comparison, and equality constraint

2024-05-10 Thread Tom Lane
Ron Johnson  writes:
> On Fri, May 10, 2024 at 11:28 PM WU Yan <4wu...@gmail.com> wrote:
>> Simple query that uses the multicolumn index.
>> postgres=# explain (analyze, buffers) select * from t where row(a, b) >
>> row(123450, 123450) and a = 0 order by a, b;

> Out of curiosity, why "where row(a, b) > row(123450, 123450)" instead of 
> "where
> a > 123450 and b > 123450"?

That row() condition actually means "a > 123450 OR
(a = 123450 AND b > 123450)", which is not the same.

(It'd be a little clearer with two different values in
the row constant, perhaps.)
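
Spelled out, the original WHERE clause is therefore equivalent to this
rewrite (illustrative only):

select * from t
where (a > 123450 or (a = 123450 and b > 123450))
  and a = 0
order by a, b;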

It does seem like there's an optimization failure here.
I don't expect btree to analyze row comparisons exactly,
but it's sad that it seems to be stupider than for the
simplified case

explain (analyze, buffers) select * from t
where a >= 123450 and a = 0
order by a, b;
  QUERY PLAN
   
---
 Index Only Scan using my_idx on t  (cost=0.43..4.45 rows=1 width=8) (actual 
time=0.001..0.002 rows=0 loops=1)
   Index Cond: ((a >= 123450) AND (a = 0))
   Heap Fetches: 0
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.081 ms
 Execution Time: 0.013 ms
(7 rows)

For that, it's able to see that the index conditions are
contradictory, so it fetches no index pages whatever.

regards, tom lane