Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Michal Charemza
Hi,

We're running PostgreSQL as essentially a data warehouse, and we have a few
thousand roles, which are used to grant permissions on a table-by-table
basis to a few thousand users, so a user would typically have say between 1
and 2 thousand roles. There is also quite a lot of "churn" in terms of
tables being created/removed, and permissions changed.

The issue is that we're hitting a strange performance problem on
connection. Sometimes it can take ~25 to 40 seconds just to connect,
although it's often way quicker. There seems to be no middle ground - never
have I seen a connection take between 0.5 and 25 seconds for example. We
suspect it's related to the number of roles the connecting user has
(including via other roles), because if we remove all roles but one from
the connecting user (the one that grants connection permissions),
connecting is always virtually instantaneous.

The closest issue that I can find that's similar is
https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
which reports that GRANT role is slow with a high number of roles - but in
our case, it's connecting that's the problem, before (as far as we can
tell) even one query is run. The database is busy, say up to 60-80% on a 16
VCPU machine - even if it's a "good amount" below 100%, the issue occurs.

Is there anything we can do to investigate (or hopefully fix!) the issue?

Thanks,

Michal

--

A description of what you are trying to achieve and what results you
expect.:
We would like to connect to the database - expect it to connect in less
than 1 second, but sometimes 25 - 40s.

PostgreSQL version number you are running:
PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by
aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit

How you installed PostgreSQL:
Via AWS/Amazon Aurora

Changes made to the settings in the postgresql.conf file
In attached CSV file

Operating system and version:
Unknown

What program you're using to connect to PostgreSQL:
Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which
client they use under the hood, but it surfaces connection timeout errors,
which we suspect is due to the issue described above)

Is there anything relevant or unusual in the PostgreSQL server logs?:
No

For questions about any kind of error:
N/A
name,current_setting,source
apg_write_forward.connect_timeout,30,configuration file
apg_write_forward.consistency_mode,session,configuration file
apg_write_forward.idle_in_transaction_session_timeout,8640,configuration file
apg_write_forward.idle_session_timeout,30,configuration file
apg_write_forward.max_forwarding_connections_percent,25,configuration file
archive_command,(disabled),configuration file
archive_mode,off,configuration file
archive_timeout,5min,configuration file
aurora_compute_plan_id,on,configuration file
aurora_stat_plans.calls_until_recapture,0,configuration file
aurora_stat_plans.minutes_until_recapture,0,configuration file
aurora_stat_plans.with_analyze,off,configuration file
aurora_stat_plans.with_buffers,off,configuration file
aurora_stat_plans.with_costs,on,configuration file
aurora_stat_plans.with_timing,on,configuration file
aurora_stat_plans.with_triggers,off,configuration file
aurora_stat_plans.with_wal,off,configuration file
autovacuum_analyze_scale_factor,0.05,configuration file
autovacuum_max_workers,3,configuration file
autovacuum_naptime,5s,configuration file
autovacuum_vacuum_cost_delay,5ms,configuration file
autovacuum_vacuum_cost_limit,1800,configuration file
autovacuum_vacuum_insert_scale_factor,0.2,configuration file
autovacuum_vacuum_scale_factor,0.1,configuration file
autovacuum_work_mem,4074385kB,configuration file
buffer_cache_mode,dynamic,configuration file
buffer_table_mode,fixed,configuration file
checkpoint_timeout,1min,configuration file
client_encoding,UTF8,configuration file
compute_query_id,auto,configuration file
database_instance_type,provisioned,configuration file
effective_cache_size,88325320kB,configuration file
hot_standby,off,configuration file
hot_standby_feedback,on,configuration file
huge_pages,on,configuration file
idle_in_transaction_session_timeout,1h,user
krb_caseins_users,off,configuration file
listen_addresses,*,command line
lo_compat_privileges,off,configuration file
log_autovacuum_min_duration,10s,configuration file
log_destination,"stderr,csvlog",configuration file
log_file_mode,0644,configuration file
log_hostname,off,configuration file
log_line_prefix,%t:%r:%u@%d:[%p]:,configuration file
log_rotation_age,1h,configuration file
log_rotation_size,10kB,configuration file
log_timezone,UTC,configuration file
log_truncate_on_rotation,off,configuration file
logging_collector,on,configuration file
maintenance_io_concurrency,1,configuration file
maintenance_work_mem,2087MB,configuration file
max_connections,5000,configuration file
max_locks_per_transaction,64,configuration file
max_parallel_workers,8,configuration file
max_prepared_transact

Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Tomas Vondra
On 4/20/24 13:55, Michal Charemza wrote:
> Hi,
> 
> We're running PostgreSQL as essentially a data warehouse, and we have a few
> thousand roles, which are used to grant permissions on a table-by-table
> basis to a few thousand users, so a user would typically have say between 1
> and 2 thousand roles. There is also quite a lot of "churn" in terms of
> tables being created/removed, and permissions changed.
> 
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker. There seems to be no middle ground - never
> have I seen a connection take between 0.5 and 25 seconds for example. We
> suspect it's related to the number of roles the connecting user has
> (including via other roles), because if we remove all roles but one from
> the connecting user (the one that grants connection permissions),
> connecting is always virtually instantaneous.
> 

I tried a couple simple setups with many roles (user with many roles
granted directly and with many roles granted through other roles), but
I've been unable to reproduce this.

> The closest issue that I can find that's similar is
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
> which reports that GRANT role is slow with a high number of roles - but in
> our case, it's connecting that's the problem, before (as far as we can
> tell) even one query is run. The database is busy, say up to 60-80% on a 16
> VCPU machine - even if it's a "good amount" below 100%, the issue occurs.
> 
> Is there anything we can do to investigate (or hopefully fix!) the issue?
> 

A reproducer would be great - a script that creates user/roles, and
triggers the long login time would allow us to investigate that.

Another option would be to get a perf profile from the process busy with
logging the user in - assuming it's CPU-intensive, and not (e.g.) some
sort of locking issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Tom Lane
Michal Charemza  writes:
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker. There seems to be no middle ground - never
> have I seen a connection take between 0.5 and 25 seconds for example. We
> suspect it's related to the number of roles the connecting user has
> (including via other roles), because if we remove all roles but one from
> the connecting user (the one that grants connection permissions),
> connecting is always virtually instantaneous.

It's not very clear what you mean by "sometimes".  Is the slowness
reproducible for a particular user and role configuration, or does
it seem to come and go by itself?

As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.

> The closest issue that I can find that's similar is
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
> which reports that GRANT role is slow with a high number of roles - but in
> our case, it's connecting that's the problem, before (as far as we can
> tell) even one query is run.

That specific problem is (we think) new in v16, but the root cause
is an inefficient lookup mechanism that has been there a long time.
Maybe you have found a usage pattern that exposes its weakness in
older branches.  If so, we could consider back-patching 14e991db8
further than v16 ... but I don't plan to take any risk there without
concrete evidence that it'd improve things.

regards, tom lane




Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Vijaykumar Jain
On Sat, Apr 20, 2024, 5:25 PM Michal Charemza  wrote:

> Hi,
>
> We're running PostgreSQL as essentially a data warehouse, and we have a
> few thousand roles, which are used to grant permissions on a table-by-table
> basis to a few thousand users, so a user would typically have say between 1
> and 2 thousand roles. There is also quite a lot of "churn" in terms of
> tables being created/removed, and permissions changed.
>
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker
>

can you rule out system catalog bloat ?


Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Michal Charemza
Tom Lane  writes:
> It's not very clear what you mean by "sometimes".  Is the slowness
reproducible for a particular user and role configuration, or does
it seem to come and go by itself?

Ah it's more come and go by itself - as in one connection takes 30 seconds,
then the next say 0.06s. It's happened for every user we've tried. Even
more anecdotally, I would say it happens more when the database is busy in
terms of tables being dropped/created and permissions changing.

Also: realise we did have one user that had directly was a member of
several thousand roles, but indirectly several million. It would sometimes
take 10 minutes for that user to connect. We've since changed that to one
role, and that user connects fine now.

> As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.

Have tried... but alas it seems fine in anything but the production
environment. My closest attempt is attached to at least it show in more
detail how our system is setup, but it always works fine for me locally.

I am wondering - what happens on connection? What catalogue tables does
PostgreSQL check and how? What's allowed to happen concurrently and what
isn't? If I knew, maybe I could come up with a reproduction script that
does reproduce the issue?
#!/bin/bash

set -e

export PGPASSWORD=password
export PGUSER=postgres
export PGDATABASE=postgres
export PGHOST=127.0.0.1

echo "Starting PostgreSQL..."
trap "exit" INT TERM
trap "echo 'Stopping PostgreSQL'; docker stop postgres_perf" EXIT
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=$PGPASSWORD --name postgres_perf postgres:14.10

while ! pg_isready --host 127.0.0.1 -U postgres
do
echo "Waiting for PostgreSQL to start..."
sleep 1
done

echo "PostgreSQL started"

rm -f setup.sql

echo "Creating user"
echo "CREATE USER connecting_user WITH password 'password';" >> setup.sql

echo "Creating database connect role..."
echo "CREATE ROLE database_connect;" >> setup.sql
echo "GRANT CONNECT ON DATABASE postgres TO database_connect;" >> setup.sql
echo "GRANT database_connect TO connecting_user;" >> setup.sql

echo "Creating intermediate role"
echo "CREATE ROLE intermediate;" >> setup.sql

echo "Creating script to create and roles with SELECT perms"

for idx in $(seq -w 1 1); do
  echo "CREATE TABLE table_${idx}(id int);"
  echo "CREATE ROLE table_select_${idx};"
  echo "GRANT SELECT ON table_${idx} TO table_select_${idx};"
  echo "GRANT table_select_${idx} TO intermediate;"
done >> setup.sql

echo "Running script..."
psql -q -f setup.sql

export PGUSER=connecting_user
time psql -q -c 'SELECT 1;'

echo "Done"


Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Michal Charemza
Vijaykumar Jain  writes:
> can you rule out system catalog bloat ?

I don't know! I've now run the query from
https://wiki.postgresql.org/wiki/Show_database_bloat just just on
pg_catalog, results attached

On Sat, Apr 20, 2024 at 3:52 PM Vijaykumar Jain <
[email protected]> wrote:

>
>
> On Sat, Apr 20, 2024, 5:25 PM Michal Charemza 
> wrote:
>
>> Hi,
>>
>> We're running PostgreSQL as essentially a data warehouse, and we have a
>> few thousand roles, which are used to grant permissions on a table-by-table
>> basis to a few thousand users, so a user would typically have say between 1
>> and 2 thousand roles. There is also quite a lot of "churn" in terms of
>> tables being created/removed, and permissions changed.
>>
>> The issue is that we're hitting a strange performance problem on
>> connection. Sometimes it can take ~25 to 40 seconds just to connect,
>> although it's often way quicker
>>
>
> can you rule out system catalog bloat ?
>
>
 current_database  | schemaname | tablename  | tbloat | wastedbytes |   
  iname | ibloat | wastedibytes
---++++-+---++--
 public_datasets_1 | pg_catalog | pg_attribute   |1.6 |   381648896 | 
pg_attribute_relid_attnum_index   |0.7 |0
 public_datasets_1 | pg_catalog | pg_attribute   |1.6 |   381648896 | 
pg_attribute_relid_attnam_index   |1.3 |143884288
 public_datasets_1 | pg_catalog | pg_shdepend|5.5 |   183640064 | 
pg_shdepend_depender_index|   56.5 |258285568
 public_datasets_1 | pg_catalog | pg_shdepend|5.5 |   183640064 | 
pg_shdepend_reference_index   |   27.3 |122470400
 public_datasets_1 | pg_catalog | pg_depend  |1.8 |45187072 | 
pg_depend_reference_index |2.8 | 69427200
 public_datasets_1 | pg_catalog | pg_depend  |1.8 |45187072 | 
pg_depend_depender_index  |3.1 | 79560704
 public_datasets_1 | pg_catalog | pg_namespace   |   52.5 |42598400 | 
pg_namespace_nspname_index|   30.7 |  8511488
 public_datasets_1 | pg_catalog | pg_namespace   |   52.5 |42598400 | 
pg_namespace_oid_index|   19.5 |  5316608
 public_datasets_1 | pg_catalog | pg_index   |1.5 |29949952 | 
pg_index_indexrelid_index |0.6 |0
 public_datasets_1 | pg_catalog | pg_index   |1.5 |29949952 | 
pg_index_indrelid_index   |0.5 |0
 public_datasets_1 | pg_catalog | pg_constraint  |1.6 |25714688 | 
pg_constraint_contypid_index  |0.1 |0
 public_datasets_1 | pg_catalog | pg_constraint  |1.6 |25714688 | 
pg_constraint_oid_index   |0.1 |0
 public_datasets_1 | pg_catalog | pg_constraint  |1.6 |25714688 | 
pg_constraint_conname_nsp_index   |0.2 |0
 public_datasets_1 | pg_catalog | pg_constraint  |1.6 |25714688 | 
pg_constraint_conrelid_contypid_conname_index |0.2 |0
 public_datasets_1 | pg_catalog | pg_constraint  |1.6 |25714688 | 
pg_constraint_conparentid_index   |0.1 |0
 public_datasets_1 | pg_catalog | pg_class   |1.1 |13434880 | 
pg_class_tblspc_relfilenode_index |0.1 |0
 public_datasets_1 | pg_catalog | pg_class   |1.1 |13434880 | 
pg_class_oid_index|0.1 |0
 public_datasets_1 | pg_catalog | pg_class   |1.1 |13434880 | 
pg_class_relname_nsp_index|0.4 |0
 public_datasets_1 | pg_catalog | pg_auth_members|1.4 |11870208 | 
pg_auth_members_role_member_index |3.0 | 35561472
 public_datasets_1 | pg_catalog | pg_auth_members|1.4 |11870208 | 
pg_auth_members_member_role_index |3.2 | 39198720
 public_datasets_1 | pg_catalog | pg_db_role_setting |3.6 | 9125888 | 
pg_db_role_setting_databaseid_rol_index   |0.8 |0
 public_datasets_1 | pg_catalog | pg_type|1.2 | 4972544 | 
pg_type_oid_index |0.2 |0
 public_datasets_1 | pg_catalog | pg_type|1.2 | 4972544 | 
pg_type_typname_nsp_index |0.6 |0
 public_datasets_1 | pg_catalog | pg_description |1.7 | 4136960 | 
pg_description_o_c_o_index|0.7 |0
 public_datasets_1 | pg_catalog | pg_inherits|1.6 | 1212416 | 
pg_inherits_parent_index  |1.8 |   835584
 public_dat

Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Frits Hoogland
Michael, can you validate if this is consistently happening for the first 
connection after database cluster startup?

Frits

> Op 20 apr 2024 om 04:55 heeft Michal Charemza  het 
> volgende geschreven:
> 
> 
> Hi,
> 
> We're running PostgreSQL as essentially a data warehouse, and we have a few 
> thousand roles, which are used to grant permissions on a table-by-table basis 
> to a few thousand users, so a user would typically have say between 1 and 2 
> thousand roles. There is also quite a lot of "churn" in terms of tables being 
> created/removed, and permissions changed.
> 
> The issue is that we're hitting a strange performance problem on connection. 
> Sometimes it can take ~25 to 40 seconds just to connect, although it's often 
> way quicker. There seems to be no middle ground - never have I seen a 
> connection take between 0.5 and 25 seconds for example. We suspect it's 
> related to the number of roles the connecting user has (including via other 
> roles), because if we remove all roles but one from the connecting user (the 
> one that grants connection permissions), connecting is always virtually 
> instantaneous.
> 
> The closest issue that I can find that's similar is 
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
>  which reports that GRANT role is slow with a high number of roles - but in 
> our case, it's connecting that's the problem, before (as far as we can tell) 
> even one query is run. The database is busy, say up to 60-80% on a 16 VCPU 
> machine - even if it's a "good amount" below 100%, the issue occurs.
> 
> Is there anything we can do to investigate (or hopefully fix!) the issue?
> 
> Thanks,
> 
> Michal
> 
> --
> 
> A description of what you are trying to achieve and what results you expect.:
> We would like to connect to the database - expect it to connect in less than 
> 1 second, but sometimes 25 - 40s.
> 
> PostgreSQL version number you are running:
> PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by 
> aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit
> 
> How you installed PostgreSQL:
> Via AWS/Amazon Aurora
> 
> Changes made to the settings in the postgresql.conf file
> In attached CSV file
> 
> Operating system and version:
> Unknown
> 
> What program you're using to connect to PostgreSQL:
> Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client 
> they use under the hood, but it surfaces connection timeout errors, which we 
> suspect is due to the issue described above)
>  
> Is there anything relevant or unusual in the PostgreSQL server logs?:
> No
>  
> For questions about any kind of error:
> N/A
> 


Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Michal Charemza
Frits Hoogland  writes:
> Michael, can you validate if this is consistently happening for the first
connection after database cluster startup?

Hmmm... it'll be tricky and need some planning. It might even be impossible
since this is on AWS Aurora, and I think AWS connects in regularly as part
of a heartbeat check(?)

But: do you mean any connection, or just a user that hasn't connected yet
after the restart? A user that hasn't connected yet would be much easier.


Re: Extremely slow to establish connection when user has a high number of roles

2024-04-20 Thread Tom Lane
Michal Charemza  writes:
> Tom Lane  writes:
>> It's not very clear what you mean by "sometimes".  Is the slowness
> reproducible for a particular user and role configuration, or does
> it seem to come and go by itself?

> Ah it's more come and go by itself - as in one connection takes 30 seconds,
> then the next say 0.06s. It's happened for every user we've tried. Even
> more anecdotally, I would say it happens more when the database is busy in
> terms of tables being dropped/created and permissions changing.

OK, that pretty much eliminates the idea that it's a new manifestation
of the catcache-inefficiency problem.  Vijaykumar may well have the
right idea, that it's a form of catalog bloat.  Do you do bulk
permissions updates that might affect thousands of role memberships at
once?

> Also: realise we did have one user that had directly was a member of
> several thousand roles, but indirectly several million. It would sometimes
> take 10 minutes for that user to connect. We've since changed that to one
> role, and that user connects fine now.

Interesting --- but even for that user, it was sometimes fast to
connect?

> I am wondering - what happens on connection? What catalogue tables does
> PostgreSQL check and how? What's allowed to happen concurrently and what
> isn't? If I knew, maybe I could come up with a reproduction script that
> does reproduce the issue?

Well, it's going to be looking to see that the user has CONNECT
privileges on the target database.  If that db doesn't have public
connect privileges, but only grants CONNECT to certain roles, then
we'll have to test whether the connecting user is a member of those
roles --- which involves looking into pg_auth_members and possibly
even doing recursive searches there.  For the sort of setup you're
describing with thousands of role grants (pg_auth_members entries)
it's not hard to imagine that search being rather expensive.  What
remains to be explained is how come it's only expensive sometimes.

The catalog-bloat idea comes from thinking about how Postgres handles
row updates.  There will be multiple physical copies (row versions)
of any recently-updated row, and this is much more expensive to scan
than a static situation with only one live row version.  First just
because we have to look at more than one copy, and second because
testing whether that copy is the live version is noticeably more
expensive if it's recent than once it's older than the xmin horizon,
and third because if we are the first process to scan it since it
became dead-to-everybody then it's our responsibility to mark it as
dead-to-everybody, so that we have to incur additional I/O to do that.
A plausible idea for particular connection attempts being slow is that
they came in just as a whole lot of pg_auth_members entries became
dead-to-everybody, and hence they were unlucky enough to get saddled
with a whole lot of that hint-bit-updating work.  (This also nicely
explains why the next attempt isn't slow: the work's been done.)

But this is only plausible if you regularly do actions that cause a
lot of pg_auth_members entries to be updated at the same time.
So we still don't have good insight into that, and your test script
isn't shedding any light.

A couple of other thoughts:

* I don't think your test script would show a connection-slowness
problem even if there was one to be shown, because you forgot to
revoke the PUBLIC connect privilege on the postgres database.
I'm fairly sure that if that exists it's always noticed first,
bypassing the need for any role membership tests.  So please
confirm whether your production database has revoked PUBLIC
connect privilege.

* It could be that the problem is not associated with the
database's connect privilege, but with role membership lookups
triggered by pg_hba.conf entries.  Do you have any entries there
that require testing membership (i.e. the role column is not
"all")?

regards, tom lane