Slow query, possibly not using index
I have this table:

CREATE TABLE media.block (
    id uuid NOT NULL,
    "size" int8 NOT NULL,
    nrefs int8 NOT NULL DEFAULT 0,
    block bytea NOT NULL,
    hs256 bytea NOT NULL,
    CONSTRAINT block_pkey PRIMARY KEY (id),
    CONSTRAINT chk_nrefs CHECK ((nrefs >= 0))
) WITH (
    toast_tuple_target=8160
)
TABLESPACE data_slow;

alter table media.block alter column block set storage main;
alter table media.block alter column hs256 set storage main;

CREATE INDEX idx_block_unused ON media.block USING btree (id) WHERE (nrefs = 0);
CREATE UNIQUE INDEX uidx_block_hs256 ON media.block USING btree (hs256);

Number of rows in this table is about 40M, and most of the rows occupy a full 8K block (in most cases, the "block" field contains 7500 bytes). The idx_block_unused index should be used to find blocks that are unused, so they can be deleted at some point.

The idx_block_unused index is less than 400MB:

SELECT i.relname "Table Name",
       indexrelname "Index Name",
       pg_size_pretty(pg_total_relation_size(relid)) AS "Total Size",
       pg_size_pretty(pg_indexes_size(relid)) AS "Total Size of all Indexes",
       pg_size_pretty(pg_relation_size(relid)) AS "Table Size",
       pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
       reltuples::bigint "Estimated table row count"
FROM pg_stat_all_indexes i
JOIN pg_class c ON i.relid = c.oid
WHERE i.relid = 'media.block'::regclass;

Table Name|Index Name      |Total Size|Total Size of all Indexes|Table Size|Index Size|Estimated table row count|
----------+----------------+----------+-------------------------+----------+----------+-------------------------+
block     |block_pkey      |352 GB    |5584 MB                  |347 GB    |1986 MB   |                 38958848|
block     |uidx_block_hs256|352 GB    |5584 MB                  |347 GB    |3226 MB   |                 38958848|
block     |idx_block_unused|352 GB    |5584 MB                  |347 GB    |372 MB    |                 38958848|

If I try to select a single unused block this way:

explain analyze select id from media.block b where nrefs = 0 limit 1

then it runs for more than 10 minutes (I'm not sure how long, I cancelled the query after 10 minutes).

If I run this without analyze:

explain select id from media.block b where nrefs = 0 limit 1

QUERY PLAN                                                                                      |
------------------------------------------------------------------------------------------------+
Limit  (cost=0.38..0.76 rows=1 width=16)                                                        |
  ->  Index Only Scan using idx_block_unused on block b  (cost=0.38..869.83 rows=2274 width=16) |

I believe it is not actually using the index, because reading a single (random?) entry from an index should not run for >10 minutes.

What am I doing wrong?

Thank you,

Laszlo
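P.S. In case it is useful, this is how I plan to repeat the probe with buffer counts and a timeout, so it cannot run for another ten minutes. This is just a sketch; I have not run it yet.

-- Sketch only (not yet run): cap the probe at 60 seconds and show buffer usage,
-- so a runaway scan is cancelled automatically instead of by hand after 10 minutes.
-- If it hits the timeout, the statement errors out and no plan output is shown.
SET statement_timeout = '60s';
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM media.block b WHERE nrefs = 0 LIMIT 1;
RESET statement_timeout;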
Re: Slow query, possibly not using index
Les writes:
> If I try to select a single unused block this way:
> explain analyze select id from media.block b where nrefs =0 limit 1
> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> the query after 10 minutes).

Are you sure it isn't blocked on a lock?

Another theory is that the index contains many thousands of references to now-dead rows, and the query is vainly searching for a live entry. Given that EXPLAIN thinks there are only about 2300 live entries, and yet you say the index is 400MB, this seems pretty plausible. Have you disabled autovacuum, or something like that? (REINDEX could help here, at least till the index gets bloated again.)

You might think that even so, it shouldn't take that long ... but indexes on UUID columns are a well known performance antipattern. The index entry order is likely to have precisely zip to do with the table's physical order, resulting in exceedingly-random access to the table, which'll be horribly expensive when the table is so much bigger than RAM. Can you replace the UUID column with a simple serial (identity) column?

> I believe it is not actually using the index, because reading a single
> (random?) entry from an index should not run for >10 minutes.

You should believe what EXPLAIN tells you about the plan shape. (Its rowcount estimates are only estimates, though.)

regards, tom lane
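(For reference, one way to check for a blocker while the slow SELECT is running in another session is a sketch like this against the standard statistics views; adapt the filtering as needed:)

-- Rough check for a blocking lock: list active backends, what they are waiting on,
-- and which PIDs (if any) are blocking them.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE state <> 'idle';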
Re: Slow query, possibly not using index
> > If I try to select a single unused block this way:
> > explain analyze select id from media.block b where nrefs =0 limit 1
> > then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> > the query after 10 minutes).
>
> Are you sure it isn't blocked on a lock?

Yes, I'm sure. I have created a single database instance from a zfs snapshot and tried the query on that database. It was the only client.

> Another theory is that the index contains many thousands of references
> to now-dead rows, and the query is vainly searching for a live entry.
> Given that EXPLAIN thinks there are only about 2300 live entries,
> and yet you say the index is 400MB, this seems pretty plausible.

Nobody ever deleted anything from this table. Since it was created, this has been a write-only table.

> Have you disabled autovacuum, or something like that? (REINDEX
> could help here, at least till the index gets bloated again.)

I did not disable autovacuum.

> You might think that even so, it shouldn't take that long ... but
> indexes on UUID columns are a well known performance antipattern.
> The index entry order is likely to have precisely zip to do with
> the table's physical order, resulting in exceedingly-random access
> to the table, which'll be horribly expensive when the table is so
> much bigger than RAM. Can you replace the UUID column with a simple
> serial (identity) column?

I'm aware of the problems with random UUID values. I have been using this function to create ULIDs from the beginning:

CREATE OR REPLACE FUNCTION public.gen_ulid()
 RETURNS uuid
 LANGUAGE sql
AS $function$
  SELECT (lpad(to_hex(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint), 12, '0')
          || encode(gen_random_bytes(10), 'hex'))::uuid;
$function$;

If I order some rows by id values, I can see that their creation times are strictly ascending. I did not write this function; it was taken from this website:

https://blog.daveallie.com/ulid-primary-keys

They have a benchmark section where they show that these ULID values are slower to generate (at least with this implementation) but much faster to insert. I might be able to replace these with int8 values, I need to check.

> > I believe it is not actually using the index, because reading a single
> > (random?) entry from an index should not run for >10 minutes.
>
> You should believe what EXPLAIN tells you about the plan shape.
> (Its rowcount estimates are only estimates, though.)

All of the 40M rows in this table are live. I'm 100% sure about this, because nobody ever deleted rows from this table. I can try to do a VACUUM on this table, but I'm limited on resources; I think it will take days. Maybe I can try to dump the whole database and restore it on another machine. Would that eliminate dead rows? (Is there a way to check the number of dead rows?)

Regards,

Laszlo
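P.S. For completeness, this is roughly how I checked that id order follows creation order. It is only a sketch; it assumes the 48-bit millisecond prefix that gen_ulid() puts in the first 12 hex digits of the uuid:

-- Sketch: decode the millisecond timestamp embedded in the first 12 hex digits
-- of a gen_ulid() value, to confirm that ordering by id matches creation order.
SELECT id,
       to_timestamp(
           ('x' || replace(left(id::text, 13), '-', ''))::bit(48)::bigint / 1000.0
       ) AS created_at
FROM media.block
ORDER BY id
LIMIT 10;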
Re: Slow query, possibly not using index
> Nobody ever deleted anything from this table. Since it was created, this
> has been a write-only table.

Does write-only include updates? That would create the dead rows Tom is
referring to.

> I believe it is not actually using the index, because reading a single
> (random?) entry from an index should not run for >10 minutes.

It is using the index. You can disable it and see how long your query
takes ("set enable_indexonlyscan = off", optionally enable_indexscan too);
chances are it will take even longer. If somehow it does not, by luck, then
you know no vacuum has run for a while, for some reason (perhaps cancelled
due to frequent table activity?). That too you can check.
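For instance, a rough version of that experiment might look like the sketch below; the SET/RESET calls affect only the current session, and the last query uses the standard cumulative-statistics view:

-- Steer the planner away from the partial index for this session only,
-- then compare the timing against the original plan.
SET enable_indexonlyscan = off;
SET enable_indexscan = off;
EXPLAIN ANALYZE
SELECT id FROM media.block b WHERE nrefs = 0 LIMIT 1;
RESET enable_indexonlyscan;
RESET enable_indexscan;

-- Check whether (auto)vacuum has run on the table recently, and how many
-- dead tuples the statistics system thinks it currently has.
SELECT relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'media' AND relname = 'block';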
Re: Slow query, possibly not using index
Les writes:
>>> If I try to select a single unused block this way:
>>> explain analyze select id from media.block b where nrefs =0 limit 1
>>> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
>>> the query after 10 minutes).

>> You might think that even so, it shouldn't take that long ... but
>> indexes on UUID columns are a well known performance antipattern.

> I'm aware of the problems with random UUID values. I was using this
> function to create ulids from the beginning:

Oh, well that would have been useful information to provide at the outset. Now that we know the index order is correlated with creation time, I wonder if it is also correlated with nrefs, in such a way that scanning in index order is disastrous because all the desired rows are at the end of the index.

Also, you deny deleting any rows, but that wasn't the point. Do you ever update nrefs from zero to nonzero? That'd also leave dead entries behind in this index. If that is a routine event that is correlated with creation time, it gets easier to believe that your index could have lots of dead entries at the front. We'd still have to figure out why autovacuum is failing to clean out those entries in a timely fashion, but that seems like a plausible way for the performance problem to exist.

> I can try to do VACUUM on this table, but I'm limited on resources. I think
> it will take days to do this. Maybe I can try to dump the whole database
> and restore it on another machine.

Pretty hard to believe that dump-and-restore would be faster than VACUUM.

> (Is there a way to check the number of dead rows?)

I think contrib/pgstattuple might help.

regards, tom lane
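(For reference, a sketch of how contrib/pgstattuple could be used here; note that both functions scan the whole relation they are pointed at, so they are not free on a 347 GB table:)

-- Sketch: inspect bloat with contrib/pgstattuple.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Btree statistics for the small partial index (deleted pages, leaf density, fragmentation).
SELECT * FROM pgstatindex('media.idx_block_unused');

-- Tuple-level statistics for the table itself (dead_tuple_count, free space, ...).
SELECT * FROM pgstattuple('media.block');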
Re: Slow query, possibly not using index
> > I'm aware of the problems with random UUID values. I was using this
> > function to create ulids from the beginning:
>
> Oh, well that would have been useful information to provide at the outset.

I'm sorry, I left this out.

> Now that we know the index order is correlated with creation
> time, I wonder if it is also correlated with nrefs, in such a way that
> scanning in index order is disastrous because all the desired rows are
> at the end of the index.

Possibly, I have no idea.

> Also, you deny deleting any rows, but that wasn't the point. Do you
> ever update nrefs from zero to nonzero? That'd also leave dead
> entries behind in this index. If that is a routine event that is
> correlated with creation time, it gets easier to believe that your
> index could have lots of dead entries at the front.

I have checked the trigger that is maintaining the nrefs field. Blocks are referenced from a "file_block" table. Every time a block is created, it first has an initial value of nrefs=0, then a file_block row (reference) is inserted, and nrefs is incremented to one. It means that every block has shown up once in the index, and then disappeared. If the index was never vacuumed, then it is very plausible that it is full of dead rows.

CREATE OR REPLACE FUNCTION media.trg_aiud_file_block()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
begin
  if TG_OP='INSERT' then
    update media.block set nrefs = nrefs + 1 where id = new.block_id;
    return new;
  end if;
  if TG_OP='UPDATE' then
    if old.block_id is distinct from new.block_id then
      update media.block set nrefs = nrefs + 1 where id = new.block_id;
      update media.block set nrefs = nrefs - 1 where id = old.block_id;
    end if;
    return new;
  end if;
  if TG_OP='DELETE' then
    update media.block set nrefs = nrefs - 1 where id = old.block_id;
    return old;
  end if;
end;
$function$;

The idea was to create an index that can help in quickly removing unused blocks, to free up disk space. It would be much better to keep the initially inserted (not yet referenced) blocks out of the index, but I don't know how to do this.

> We'd still have to figure out why autovacuum is failing to clean out
> those entries in a timely fashion, but that seems like a plausible
> way for the performance problem to exist.

Yes, that would be very good to know. I could drop and recreate the index now, but after some time I would be facing the same situation again. I double checked, and the "autovacuum launcher" process is running.
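In the meantime, this is roughly what I have in mind as a stopgap. It is only a sketch: it assumes a PostgreSQL version with REINDEX CONCURRENTLY, and the per-table autovacuum thresholds are illustrative values I would still have to test, not recommendations.

-- Sketch: rebuild the bloated partial index without blocking writers,
-- then make autovacuum visit this table more eagerly so dead index
-- entries are cleaned up sooner. Threshold values are illustrative only.
REINDEX INDEX CONCURRENTLY media.idx_block_unused;

ALTER TABLE media.block SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are updated/deleted
    autovacuum_vacuum_cost_delay   = 0      -- do not throttle autovacuum on this table
);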
Here are the current settings:

=# select name, setting, unit, min_val, max_val, boot_val, reset_val, pending_restart
   from pg_settings where name like '%vacuum%';

name                                  | setting | unit | min_val | max_val    | boot_val | reset_val | pending_restart
--------------------------------------+---------+------+---------+------------+----------+-----------+----------------
autovacuum                            | on      |      |         |            | on       | on        | f
autovacuum_analyze_scale_factor       | 0.1     |      | 0       | 100        | 0.1      | 0.1       | f
autovacuum_analyze_threshold          | 50      |      | 0       | 2147483647 | 50       | 50        | f
autovacuum_freeze_max_age             | 2       |      | 10      | 20         | 2        | 2         | f
autovacuum_max_workers                | 3       |      | 1       | 262143     | 3        | 3         | f
autovacuum_multixact_freeze_max_age   | 4       |      | 1       | 20         | 4        | 4         | f
autovacuum_naptime                    | 60      | s    | 1       | 2147483    | 60       | 60        | f
autovacuum_vacuum_cost_delay          | 2       | ms   | -1      | 100        | 2        | 2         | f
autovacuum_vacuum_cost_limit          | -1      |      | -1      | 1          | -1       | -1        | f
autovacuum_vacuum_insert_scale_factor | 0.2     |      | 0       | 100        | 0.2      | 0.2       | f
autovacuum_vacuum_insert_threshold    | 1000    |      | -1      | 2147483647 | 1000     | 1000      | f
autovacuum_vacuum_scale_factor        | 0.2     |      | 0       | 100        | 0.2      | 0.2       | f
autovacuum_vacuum_threshold           | 50      |      | 0       | 2147483647 | 50       | 50        | f
autovacuum_work_mem                   | -1      | kB   | -1      | 2147483647 | -1       | -1        | f
log_autovacuum_min_duration           | 60      | ms   | -1      | 2147483647 | 60       | 60        | f
vacuum_cost_delay                     | 0       | ms   | 0       | 100        | 0        | 0         | f
vacuum_cost_limit                     | 200     |      | 1       | 1          | 200      | 200       | f
vacuum_cost_page_dirty                | 20      |      | 0       | 1          | 20       | 20        | f
vacuum_cost_page_hit                  | 1       |      | 0
