Extremely slow to establish connection when user has a high number of roles
Hi, We're running PostgreSQL as essentially a data warehouse, and we have a few thousand roles, which are used to grant permissions on a table-by-table basis to a few thousand users, so a user would typically have say between 1 and 2 thousand roles. There is also quite a lot of "churn" in terms of tables being created/removed, and permissions changed. The issue is that we're hitting a strange performance problem on connection. Sometimes it can take ~25 to 40 seconds just to connect, although it's often way quicker. There seems to be no middle ground - never have I seen a connection take between 0.5 and 25 seconds for example. We suspect it's related to the number of roles the connecting user has (including via other roles), because if we remove all roles but one from the connecting user (the one that grants connection permissions), connecting is always virtually instantaneous. The closest issue that I can find that's similar is https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, which reports that GRANT role is slow with a high number of roles - but in our case, it's connecting that's the problem, before (as far as we can tell) even one query is run. The database is busy, say up to 60-80% on a 16 VCPU machine - even if it's a "good amount" below 100%, the issue occurs. Is there anything we can do to investigate (or hopefully fix!) the issue? Thanks, Michal -- A description of what you are trying to achieve and what results you expect.: We would like to connect to the database - expect it to connect in less than 1 second, but sometimes 25 - 40s. PostgreSQL version number you are running: PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit How you installed PostgreSQL: Via AWS/Amazon Aurora Changes made to the settings in the postgresql.conf file In attached CSV file Operating system and version: Unknown What program you're using to connect to PostgreSQL: Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client they use under the hood, but it surfaces connection timeout errors, which we suspect is due to the issue described above) Is there anything relevant or unusual in the PostgreSQL server logs?: No For questions about any kind of error: N/A name,current_setting,source apg_write_forward.connect_timeout,30,configuration file apg_write_forward.consistency_mode,session,configuration file apg_write_forward.idle_in_transaction_session_timeout,8640,configuration file apg_write_forward.idle_session_timeout,30,configuration file apg_write_forward.max_forwarding_connections_percent,25,configuration file archive_command,(disabled),configuration file archive_mode,off,configuration file archive_timeout,5min,configuration file aurora_compute_plan_id,on,configuration file aurora_stat_plans.calls_until_recapture,0,configuration file aurora_stat_plans.minutes_until_recapture,0,configuration file aurora_stat_plans.with_analyze,off,configuration file aurora_stat_plans.with_buffers,off,configuration file aurora_stat_plans.with_costs,on,configuration file aurora_stat_plans.with_timing,on,configuration file aurora_stat_plans.with_triggers,off,configuration file aurora_stat_plans.with_wal,off,configuration file autovacuum_analyze_scale_factor,0.05,configuration file autovacuum_max_workers,3,configuration file autovacuum_naptime,5s,configuration file autovacuum_vacuum_cost_delay,5ms,configuration file autovacuum_vacuum_cost_limit,1800,configuration file autovacuum_vacuum_insert_scale_factor,0.2,configuration file autovacuum_vacuum_scale_factor,0.1,configuration file autovacuum_work_mem,4074385kB,configuration file buffer_cache_mode,dynamic,configuration file buffer_table_mode,fixed,configuration file checkpoint_timeout,1min,configuration file client_encoding,UTF8,configuration file compute_query_id,auto,configuration file database_instance_type,provisioned,configuration file effective_cache_size,88325320kB,configuration file hot_standby,off,configuration file hot_standby_feedback,on,configuration file huge_pages,on,configuration file idle_in_transaction_session_timeout,1h,user krb_caseins_users,off,configuration file listen_addresses,*,command line lo_compat_privileges,off,configuration file log_autovacuum_min_duration,10s,configuration file log_destination,"stderr,csvlog",configuration file log_file_mode,0644,configuration file log_hostname,off,configuration file log_line_prefix,%t:%r:%u@%d:[%p]:,configuration file log_rotation_age,1h,configuration file log_rotation_size,10kB,configuration file log_timezone,UTC,configuration file log_truncate_on_rotation,off,configuration file logging_collector,on,configuration file maintenance_io_concurrency,1,configuration file maintenance_work_mem,2087MB,configuration file max_connections,5000,configuration file max_locks_per_transaction,64,configuration file max_parallel_workers,8,configuration file max_prepared_transact
Re: Extremely slow to establish connection when user has a high number of roles
On 4/20/24 13:55, Michal Charemza wrote: > Hi, > > We're running PostgreSQL as essentially a data warehouse, and we have a few > thousand roles, which are used to grant permissions on a table-by-table > basis to a few thousand users, so a user would typically have say between 1 > and 2 thousand roles. There is also quite a lot of "churn" in terms of > tables being created/removed, and permissions changed. > > The issue is that we're hitting a strange performance problem on > connection. Sometimes it can take ~25 to 40 seconds just to connect, > although it's often way quicker. There seems to be no middle ground - never > have I seen a connection take between 0.5 and 25 seconds for example. We > suspect it's related to the number of roles the connecting user has > (including via other roles), because if we remove all roles but one from > the connecting user (the one that grants connection permissions), > connecting is always virtually instantaneous. > I tried a couple simple setups with many roles (user with many roles granted directly and with many roles granted through other roles), but I've been unable to reproduce this. > The closest issue that I can find that's similar is > https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, > which reports that GRANT role is slow with a high number of roles - but in > our case, it's connecting that's the problem, before (as far as we can > tell) even one query is run. The database is busy, say up to 60-80% on a 16 > VCPU machine - even if it's a "good amount" below 100%, the issue occurs. > > Is there anything we can do to investigate (or hopefully fix!) the issue? > A reproducer would be great - a script that creates user/roles, and triggers the long login time would allow us to investigate that. Another option would be to get a perf profile from the process busy with logging the user in - assuming it's CPU-intensive, and not (e.g.) some sort of locking issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Extremely slow to establish connection when user has a high number of roles
Michal Charemza writes: > The issue is that we're hitting a strange performance problem on > connection. Sometimes it can take ~25 to 40 seconds just to connect, > although it's often way quicker. There seems to be no middle ground - never > have I seen a connection take between 0.5 and 25 seconds for example. We > suspect it's related to the number of roles the connecting user has > (including via other roles), because if we remove all roles but one from > the connecting user (the one that grants connection permissions), > connecting is always virtually instantaneous. It's not very clear what you mean by "sometimes". Is the slowness reproducible for a particular user and role configuration, or does it seem to come and go by itself? As Tomas said, a self-contained reproduction script would be very helpful for looking into this. > The closest issue that I can find that's similar is > https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, > which reports that GRANT role is slow with a high number of roles - but in > our case, it's connecting that's the problem, before (as far as we can > tell) even one query is run. That specific problem is (we think) new in v16, but the root cause is an inefficient lookup mechanism that has been there a long time. Maybe you have found a usage pattern that exposes its weakness in older branches. If so, we could consider back-patching 14e991db8 further than v16 ... but I don't plan to take any risk there without concrete evidence that it'd improve things. regards, tom lane
Re: Extremely slow to establish connection when user has a high number of roles
On Sat, Apr 20, 2024, 5:25 PM Michal Charemza wrote: > Hi, > > We're running PostgreSQL as essentially a data warehouse, and we have a > few thousand roles, which are used to grant permissions on a table-by-table > basis to a few thousand users, so a user would typically have say between 1 > and 2 thousand roles. There is also quite a lot of "churn" in terms of > tables being created/removed, and permissions changed. > > The issue is that we're hitting a strange performance problem on > connection. Sometimes it can take ~25 to 40 seconds just to connect, > although it's often way quicker > can you rule out system catalog bloat ?
Re: Extremely slow to establish connection when user has a high number of roles
Tom Lane writes:
> It's not very clear what you mean by "sometimes". Is the slowness
reproducible for a particular user and role configuration, or does
it seem to come and go by itself?
Ah it's more come and go by itself - as in one connection takes 30 seconds,
then the next say 0.06s. It's happened for every user we've tried. Even
more anecdotally, I would say it happens more when the database is busy in
terms of tables being dropped/created and permissions changing.
Also: realise we did have one user that had directly was a member of
several thousand roles, but indirectly several million. It would sometimes
take 10 minutes for that user to connect. We've since changed that to one
role, and that user connects fine now.
> As Tomas said, a self-contained reproduction script would be very
helpful for looking into this.
Have tried... but alas it seems fine in anything but the production
environment. My closest attempt is attached to at least it show in more
detail how our system is setup, but it always works fine for me locally.
I am wondering - what happens on connection? What catalogue tables does
PostgreSQL check and how? What's allowed to happen concurrently and what
isn't? If I knew, maybe I could come up with a reproduction script that
does reproduce the issue?
#!/bin/bash
set -e
export PGPASSWORD=password
export PGUSER=postgres
export PGDATABASE=postgres
export PGHOST=127.0.0.1
echo "Starting PostgreSQL..."
trap "exit" INT TERM
trap "echo 'Stopping PostgreSQL'; docker stop postgres_perf" EXIT
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=$PGPASSWORD --name postgres_perf postgres:14.10
while ! pg_isready --host 127.0.0.1 -U postgres
do
echo "Waiting for PostgreSQL to start..."
sleep 1
done
echo "PostgreSQL started"
rm -f setup.sql
echo "Creating user"
echo "CREATE USER connecting_user WITH password 'password';" >> setup.sql
echo "Creating database connect role..."
echo "CREATE ROLE database_connect;" >> setup.sql
echo "GRANT CONNECT ON DATABASE postgres TO database_connect;" >> setup.sql
echo "GRANT database_connect TO connecting_user;" >> setup.sql
echo "Creating intermediate role"
echo "CREATE ROLE intermediate;" >> setup.sql
echo "Creating script to create and roles with SELECT perms"
for idx in $(seq -w 1 1); do
echo "CREATE TABLE table_${idx}(id int);"
echo "CREATE ROLE table_select_${idx};"
echo "GRANT SELECT ON table_${idx} TO table_select_${idx};"
echo "GRANT table_select_${idx} TO intermediate;"
done >> setup.sql
echo "Running script..."
psql -q -f setup.sql
export PGUSER=connecting_user
time psql -q -c 'SELECT 1;'
echo "Done"
Re: Extremely slow to establish connection when user has a high number of roles
Vijaykumar Jain writes: > can you rule out system catalog bloat ? I don't know! I've now run the query from https://wiki.postgresql.org/wiki/Show_database_bloat just just on pg_catalog, results attached On Sat, Apr 20, 2024 at 3:52 PM Vijaykumar Jain < [email protected]> wrote: > > > On Sat, Apr 20, 2024, 5:25 PM Michal Charemza > wrote: > >> Hi, >> >> We're running PostgreSQL as essentially a data warehouse, and we have a >> few thousand roles, which are used to grant permissions on a table-by-table >> basis to a few thousand users, so a user would typically have say between 1 >> and 2 thousand roles. There is also quite a lot of "churn" in terms of >> tables being created/removed, and permissions changed. >> >> The issue is that we're hitting a strange performance problem on >> connection. Sometimes it can take ~25 to 40 seconds just to connect, >> although it's often way quicker >> > > can you rule out system catalog bloat ? > > current_database | schemaname | tablename | tbloat | wastedbytes | iname | ibloat | wastedibytes ---++++-+---++-- public_datasets_1 | pg_catalog | pg_attribute |1.6 | 381648896 | pg_attribute_relid_attnum_index |0.7 |0 public_datasets_1 | pg_catalog | pg_attribute |1.6 | 381648896 | pg_attribute_relid_attnam_index |1.3 |143884288 public_datasets_1 | pg_catalog | pg_shdepend|5.5 | 183640064 | pg_shdepend_depender_index| 56.5 |258285568 public_datasets_1 | pg_catalog | pg_shdepend|5.5 | 183640064 | pg_shdepend_reference_index | 27.3 |122470400 public_datasets_1 | pg_catalog | pg_depend |1.8 |45187072 | pg_depend_reference_index |2.8 | 69427200 public_datasets_1 | pg_catalog | pg_depend |1.8 |45187072 | pg_depend_depender_index |3.1 | 79560704 public_datasets_1 | pg_catalog | pg_namespace | 52.5 |42598400 | pg_namespace_nspname_index| 30.7 | 8511488 public_datasets_1 | pg_catalog | pg_namespace | 52.5 |42598400 | pg_namespace_oid_index| 19.5 | 5316608 public_datasets_1 | pg_catalog | pg_index |1.5 |29949952 | pg_index_indexrelid_index |0.6 |0 public_datasets_1 | pg_catalog | pg_index |1.5 |29949952 | pg_index_indrelid_index |0.5 |0 public_datasets_1 | pg_catalog | pg_constraint |1.6 |25714688 | pg_constraint_contypid_index |0.1 |0 public_datasets_1 | pg_catalog | pg_constraint |1.6 |25714688 | pg_constraint_oid_index |0.1 |0 public_datasets_1 | pg_catalog | pg_constraint |1.6 |25714688 | pg_constraint_conname_nsp_index |0.2 |0 public_datasets_1 | pg_catalog | pg_constraint |1.6 |25714688 | pg_constraint_conrelid_contypid_conname_index |0.2 |0 public_datasets_1 | pg_catalog | pg_constraint |1.6 |25714688 | pg_constraint_conparentid_index |0.1 |0 public_datasets_1 | pg_catalog | pg_class |1.1 |13434880 | pg_class_tblspc_relfilenode_index |0.1 |0 public_datasets_1 | pg_catalog | pg_class |1.1 |13434880 | pg_class_oid_index|0.1 |0 public_datasets_1 | pg_catalog | pg_class |1.1 |13434880 | pg_class_relname_nsp_index|0.4 |0 public_datasets_1 | pg_catalog | pg_auth_members|1.4 |11870208 | pg_auth_members_role_member_index |3.0 | 35561472 public_datasets_1 | pg_catalog | pg_auth_members|1.4 |11870208 | pg_auth_members_member_role_index |3.2 | 39198720 public_datasets_1 | pg_catalog | pg_db_role_setting |3.6 | 9125888 | pg_db_role_setting_databaseid_rol_index |0.8 |0 public_datasets_1 | pg_catalog | pg_type|1.2 | 4972544 | pg_type_oid_index |0.2 |0 public_datasets_1 | pg_catalog | pg_type|1.2 | 4972544 | pg_type_typname_nsp_index |0.6 |0 public_datasets_1 | pg_catalog | pg_description |1.7 | 4136960 | pg_description_o_c_o_index|0.7 |0 public_datasets_1 | pg_catalog | pg_inherits|1.6 | 1212416 | pg_inherits_parent_index |1.8 | 835584 public_dat
Re: Extremely slow to establish connection when user has a high number of roles
Michael, can you validate if this is consistently happening for the first connection after database cluster startup? Frits > Op 20 apr 2024 om 04:55 heeft Michal Charemza het > volgende geschreven: > > > Hi, > > We're running PostgreSQL as essentially a data warehouse, and we have a few > thousand roles, which are used to grant permissions on a table-by-table basis > to a few thousand users, so a user would typically have say between 1 and 2 > thousand roles. There is also quite a lot of "churn" in terms of tables being > created/removed, and permissions changed. > > The issue is that we're hitting a strange performance problem on connection. > Sometimes it can take ~25 to 40 seconds just to connect, although it's often > way quicker. There seems to be no middle ground - never have I seen a > connection take between 0.5 and 25 seconds for example. We suspect it's > related to the number of roles the connecting user has (including via other > roles), because if we remove all roles but one from the connecting user (the > one that grants connection permissions), connecting is always virtually > instantaneous. > > The closest issue that I can find that's similar is > https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com, > which reports that GRANT role is slow with a high number of roles - but in > our case, it's connecting that's the problem, before (as far as we can tell) > even one query is run. The database is busy, say up to 60-80% on a 16 VCPU > machine - even if it's a "good amount" below 100%, the issue occurs. > > Is there anything we can do to investigate (or hopefully fix!) the issue? > > Thanks, > > Michal > > -- > > A description of what you are trying to achieve and what results you expect.: > We would like to connect to the database - expect it to connect in less than > 1 second, but sometimes 25 - 40s. > > PostgreSQL version number you are running: > PostgreSQL 14.10 on aarch64-unknown-linux-gnu, compiled by > aarch64-unknown-linux-gnu-gcc (GCC) 9.5.0, 64-bit > > How you installed PostgreSQL: > Via AWS/Amazon Aurora > > Changes made to the settings in the postgresql.conf file > In attached CSV file > > Operating system and version: > Unknown > > What program you're using to connect to PostgreSQL: > Python + SQLAlchemy, psql, or also via Amazon Quicksight (Unsure which client > they use under the hood, but it surfaces connection timeout errors, which we > suspect is due to the issue described above) > > Is there anything relevant or unusual in the PostgreSQL server logs?: > No > > For questions about any kind of error: > N/A >
Re: Extremely slow to establish connection when user has a high number of roles
Frits Hoogland writes: > Michael, can you validate if this is consistently happening for the first connection after database cluster startup? Hmmm... it'll be tricky and need some planning. It might even be impossible since this is on AWS Aurora, and I think AWS connects in regularly as part of a heartbeat check(?) But: do you mean any connection, or just a user that hasn't connected yet after the restart? A user that hasn't connected yet would be much easier.
Re: Extremely slow to establish connection when user has a high number of roles
Michal Charemza writes: > Tom Lane writes: >> It's not very clear what you mean by "sometimes". Is the slowness > reproducible for a particular user and role configuration, or does > it seem to come and go by itself? > Ah it's more come and go by itself - as in one connection takes 30 seconds, > then the next say 0.06s. It's happened for every user we've tried. Even > more anecdotally, I would say it happens more when the database is busy in > terms of tables being dropped/created and permissions changing. OK, that pretty much eliminates the idea that it's a new manifestation of the catcache-inefficiency problem. Vijaykumar may well have the right idea, that it's a form of catalog bloat. Do you do bulk permissions updates that might affect thousands of role memberships at once? > Also: realise we did have one user that had directly was a member of > several thousand roles, but indirectly several million. It would sometimes > take 10 minutes for that user to connect. We've since changed that to one > role, and that user connects fine now. Interesting --- but even for that user, it was sometimes fast to connect? > I am wondering - what happens on connection? What catalogue tables does > PostgreSQL check and how? What's allowed to happen concurrently and what > isn't? If I knew, maybe I could come up with a reproduction script that > does reproduce the issue? Well, it's going to be looking to see that the user has CONNECT privileges on the target database. If that db doesn't have public connect privileges, but only grants CONNECT to certain roles, then we'll have to test whether the connecting user is a member of those roles --- which involves looking into pg_auth_members and possibly even doing recursive searches there. For the sort of setup you're describing with thousands of role grants (pg_auth_members entries) it's not hard to imagine that search being rather expensive. What remains to be explained is how come it's only expensive sometimes. The catalog-bloat idea comes from thinking about how Postgres handles row updates. There will be multiple physical copies (row versions) of any recently-updated row, and this is much more expensive to scan than a static situation with only one live row version. First just because we have to look at more than one copy, and second because testing whether that copy is the live version is noticeably more expensive if it's recent than once it's older than the xmin horizon, and third because if we are the first process to scan it since it became dead-to-everybody then it's our responsibility to mark it as dead-to-everybody, so that we have to incur additional I/O to do that. A plausible idea for particular connection attempts being slow is that they came in just as a whole lot of pg_auth_members entries became dead-to-everybody, and hence they were unlucky enough to get saddled with a whole lot of that hint-bit-updating work. (This also nicely explains why the next attempt isn't slow: the work's been done.) But this is only plausible if you regularly do actions that cause a lot of pg_auth_members entries to be updated at the same time. So we still don't have good insight into that, and your test script isn't shedding any light. A couple of other thoughts: * I don't think your test script would show a connection-slowness problem even if there was one to be shown, because you forgot to revoke the PUBLIC connect privilege on the postgres database. I'm fairly sure that if that exists it's always noticed first, bypassing the need for any role membership tests. So please confirm whether your production database has revoked PUBLIC connect privilege. * It could be that the problem is not associated with the database's connect privilege, but with role membership lookups triggered by pg_hba.conf entries. Do you have any entries there that require testing membership (i.e. the role column is not "all")? regards, tom lane
