Slow query, possibly not using index

2023-08-27 Thread Les
I have this table:

CREATE TABLE media.block (
    id uuid NOT NULL,
    "size" int8 NOT NULL,
    nrefs int8 NOT NULL DEFAULT 0,
    block bytea NOT NULL,
    hs256 bytea NOT NULL,
    CONSTRAINT block_pkey PRIMARY KEY (id),
    CONSTRAINT chk_nrefs CHECK ((nrefs >= 0))
)
WITH (
    toast_tuple_target=8160
)
TABLESPACE data_slow
;

alter table media.block alter column block set storage main;
alter table media.block alter column hs256 set storage main;

CREATE INDEX idx_block_unused ON media.block USING btree (id) WHERE (nrefs = 0);

CREATE UNIQUE INDEX uidx_block_hs256 ON media.block USING btree (hs256);


The number of rows in this table is about 40M, and most of the rows occupy a
full 8K block (in most cases, the "block" field contains 7500 bytes).

The idx_block_unused index should be used to find blocks that are unused,
so they can be deleted at some point.
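
For example, the cleanup would look something like this (only a sketch; the
batch size is arbitrary):

delete from media.block
where id in (select id from media.block where nrefs = 0 limit 1000);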

The idx_block_unused index is less than 400MB:


SELECT i.relname "Table Name",indexrelname "Index Name",

pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",

pg_size_pretty(pg_indexes_size(relid)) as "Total Size of all Indexes",

pg_size_pretty(pg_relation_size(relid)) as "Table Size",

pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",

reltuples::bigint "Estimated table row count"

FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid

where i.relid ='media.block'::regclass


Table Name|Index Name      |Total Size|Total Size of all Indexes|Table Size|Index Size|Estimated table row count|
----------+----------------+----------+-------------------------+----------+----------+-------------------------+
block     |block_pkey      |352 GB    |5584 MB                  |347 GB    |1986 MB   |                 38958848|
block     |uidx_block_hs256|352 GB    |5584 MB                  |347 GB    |3226 MB   |                 38958848|
block     |idx_block_unused|352 GB    |5584 MB                  |347 GB    |372 MB    |                 38958848|

If I try to select a single unused block this way:

explain analyze select id from media.block b where nrefs =0 limit 1


then it runs for more than 10 minutes (I'm not sure how long, I cancelled
the query after 10 minutes).

If I run this without analyze:

explain select id from media.block b where nrefs =0 limit 1


QUERY PLAN
------------------------------------------------------------------------------------------------
Limit  (cost=0.38..0.76 rows=1 width=16)
  ->  Index Only Scan using idx_block_unused on block b  (cost=0.38..869.83 rows=2274 width=16)

I believe it is not actually using the index, because reading a single
(random?) entry from an index should not run for >10 minutes.

What am I doing wrong?

Thank you,

 Laszlo


Re: Slow query, possibly not using index

2023-08-27 Thread Tom Lane
Les  writes:
> If I try to select a single unused block this way:
> explain analyze select id from media.block b where nrefs =0 limit 1
> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> the query after 10 minutes).

Are you sure it isn't blocked on a lock?
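
A quick way to check, from another session while the query appears stuck
(just a sketch):

select pid, state, wait_event_type, wait_event, query
from pg_stat_activity
where state <> 'idle';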

Another theory is that the index contains many thousands of references
to now-dead rows, and the query is vainly searching for a live entry.
Given that EXPLAIN thinks there are only about 2300 live entries,
and yet you say the index is 400MB, this seems pretty plausible.
Have you disabled autovacuum, or something like that?  (REINDEX
could help here, at least till the index gets bloated again.)
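
For instance (a sketch; REINDEX ... CONCURRENTLY needs v12 or later and avoids
blocking writers):

REINDEX INDEX CONCURRENTLY media.idx_block_unused;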

You might think that even so, it shouldn't take that long ... but
indexes on UUID columns are a well known performance antipattern.
The index entry order is likely to have precisely zip to do with
the table's physical order, resulting in exceedingly-random access
to the table, which'll be horribly expensive when the table is so
much bigger than RAM.  Can you replace the UUID column with a simple
serial (identity) column?
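
For example, such a key would be declared something like this (a sketch of the
column definition only):

id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY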

> I believe it is not actually using the index, because reading a single
> (random?) entry from an index should not run for >10 minutes.

You should believe what EXPLAIN tells you about the plan shape.
(Its rowcount estimates are only estimates, though.)

regards, tom lane




Re: Slow query, possibly not using index

2023-08-27 Thread Les
> > If I try to select a single unused block this way:
> > explain analyze select id from media.block b where nrefs =0 limit 1
> > then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> > the query after 10 minutes).
>
> Are you sure it isn't blocked on a lock?
>

Yes, I'm sure. I have created a single database instance from a zfs
snapshot and tried the query on that database. It was the only client.


> Another theory is that the index contains many thousands of references
> to now-dead rows, and the query is vainly searching for a live entry.
> Given that EXPLAIN thinks there are only about 2300 live entries,
> and yet you say the index is 400MB, this seems pretty plausible.
>

Nobody ever deleted anything from this table. Since it was created, this
has been a write-only table.


> Have you disabled autovacuum, or something like that?  (REINDEX
> could help here, at least till the index gets bloated again.)
>
I did not disable autovacuum.

>
> You might think that even so, it shouldn't take that long ... but
> indexes on UUID columns are a well known performance antipattern.
> The index entry order is likely to have precisely zip to do with
> the table's physical order, resulting in exceedingly-random access
> to the table, which'll be horribly expensive when the table is so
> much bigger than RAM.  Can you replace the UUID column with a simple
> serial (identity) column?
>

I'm aware of the problems with random UUID values. I was using this
function to create ulids from the beginning:

CREATE OR REPLACE FUNCTION public.gen_ulid()
 RETURNS uuid
 LANGUAGE sql
AS $function$
SELECT (lpad(to_hex(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint), 12, '0')
        || encode(gen_random_bytes(10), 'hex'))::uuid;
$function$;


If I order some rows by id values, I can see that their creation times are
strictly ascending. I did not write this function; it was taken from this
website:

https://blog.daveallie.com/ulid-primary-keys

They have a benchmark section where they show that these ULID values are
slower to generate (at least with this implementation) but much faster to
insert.

I might be able to replace these with int8 values, I need to check.


>
> > I believe it is not actually using the index, because reading a single
> > (random?) entry from an index should not run for >10 minutes.
>
> You should believe what EXPLAIN tells you about the plan shape.
> (Its rowcount estimates are only estimates, though.)
>

All of the 40M rows in this table are live. I'm 100% sure about this,
because nobody ever deleted rows from this table.

I can try to do VACUUM on this table, but I'm limited on resources. I think
it will take days to do this. Maybe I can try to dump the whole database
and restore it on another machine. Would that eliminate dead rows? (Is
there a way to check the number of dead rows?)

Regards,

Laszlo


Re: Slow query, possibly not using index

2023-08-27 Thread Wael Khobalatte
> Nobody ever deleted anything from this table. Since it was created, this
has been a write-only table.

does write-only include updates? that would create the dead rows tom is
referring to.

> I believe it is not actually using the index, because reading a single
(random?) entry from an index should not run for >10 minutes.

it is using the index. You can disable it and see how long your query
takes ("set enable_indexonlyscan = off", optionally enable_indexscan too);
chances are it will take even longer. If somehow it does not, by luck, then
you know no vacuum has run for a while, for some reason (perhaps canceled due
to frequent table activity?). That too you can check.
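
For example, both checks could look something like this (a sketch):

set enable_indexonlyscan = off;
set enable_indexscan = off;
explain analyze select id from media.block b where nrefs = 0 limit 1;

select last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
from pg_stat_user_tables
where schemaname = 'media' and relname = 'block';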



Re: Slow query, possibly not using index

2023-08-27 Thread Tom Lane
Les  writes:
>>> If I try to select a single unused block this way:
>>> explain analyze select id from media.block b where nrefs =0 limit 1
>>> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
>>> the query after 10 minutes).

>> You might think that even so, it shouldn't take that long ... but
>> indexes on UUID columns are a well known performance antipattern.

> I'm aware of the problems with random UUID values. I was using this
> function to create ulids from the beginning:

Oh, well that would have been useful information to provide at the
outset.  Now that we know the index order is correlated with creation
time, I wonder if it is also correlated with nrefs, in such a way that
scanning in index order is disastrous because all the desired rows are
at the end of the index.

Also, you deny deleting any rows, but that wasn't the point.  Do you
ever update nrefs from zero to nonzero?  That'd also leave dead
entries behind in this index.  If that is a routine event that is
correlated with creation time, it gets easier to believe that your
index could have lots of dead entries at the front.

We'd still have to figure out why autovacuum is failing to clean out
those entries in a timely fashion, but that seems like a plausible
way for the performance problem to exist.

> I can try to do VACUUM on this table, but I'm limited on resources. I think
> it will take days to do this. Maybe I can try to dump the whole database
> and restore it on another machine.

Pretty hard to believe that dump-and-restore would be faster than
VACUUM.

> (Is there a way to check the number of dead rows?)

I think contrib/pgstattuple might help.
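
For example (a sketch; the extension must be installed first, and an exact scan
of a table this size can be slow, so the approximate variant may be preferable):

create extension if not exists pgstattuple;
select * from pgstattuple_approx('media.block');
select * from pgstatindex('media.idx_block_unused');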

regards, tom lane




Re: Slow query, possibly not using index

2023-08-27 Thread Les
>
>
>
> > I'm aware of the problems with random UUID values. I was using this
> > function to create ulids from the beginning:
>
> Oh, well that would have been useful information to provide at the
> outset.

I'm sorry, I left this out.

> Now that we know the index order is correlated with creation
> time, I wonder if it is also correlated with nrefs, in such a way that
> scanning in index order is disastrous because all the desired rows are
> at the end of the index.
>
Possibly, I have no idea.

>
> Also, you deny deleting any rows, but that wasn't the point.  Do you
> ever update nrefs from zero to nonzero?  That'd also leave dead
> entries behind in this index.  If that is a routine event that is
> correlated with creation time, it gets easier to believe that your
> index could have lots of dead entries at the front.
>

I have checked the trigger that is maintaining the nrefs field. Blocks are
referenced from a "file_block" table. Every time a block is created, it
first has an initial value of nrefs=0, then a file_block row (reference) is
inserted, and nrefs is incremented to one. It means that every block has
shown up once in the index, and then disappeared. If the index was never
vacuumed, then it is very plausible that it is full of dead rows.

CREATE OR REPLACE FUNCTION media.trg_aiud_file_block()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
begin
    if TG_OP = 'INSERT' then
        update media.block set nrefs = nrefs + 1 where id = new.block_id;
        return new;
    end if;
    if TG_OP = 'UPDATE' then
        if old.block_id is distinct from new.block_id then
            update media.block set nrefs = nrefs + 1 where id = new.block_id;
            update media.block set nrefs = nrefs - 1 where id = old.block_id;
        end if;
        return new;
    end if;
    if TG_OP = 'DELETE' then
        update media.block set nrefs = nrefs - 1 where id = old.block_id;
        return old;
    end if;
end;
$function$;


The idea was to create an index that can help in quickly removing unused
blocks, to free up disk space. It would be much better to keep the initially
inserted (not yet referenced) blocks out of the index, but I don't know how
to do this.
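
One possible approach (only a sketch, with made-up column and index names):
mark a block only when it *becomes* unreferenced, so freshly inserted rows
never enter the partial index at all.

alter table media.block add column orphaned_at timestamptz;  -- hypothetical column
create index idx_block_orphaned on media.block (id) where orphaned_at is not null;

-- and in the trigger, when decrementing the counter, something like:
-- update media.block
--    set nrefs = nrefs - 1,
--        orphaned_at = case when nrefs = 1 then now() else null end
--  where id = old.block_id;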


> We'd still have to figure out why autovacuum is failing to clean out
> those entries in a timely fashion, but that seems like a plausible
> way for the performance problem to exist.
>
Yes, that would be very good to know. I could drop and recreate the index
now, but after some time I would be facing the same situation again.

I double checked, and the "autovacuum launcher" process is running.

Here are the current settings:

=# select name, setting, unit, min_val, max_val, boot_val, reset_val, pending_restart
   from pg_settings where name like '%vacuum%';
                 name                  |  setting  | unit | min_val |  max_val   | boot_val  | reset_val | pending_restart
---------------------------------------+-----------+------+---------+------------+-----------+-----------+-----------------
 autovacuum                            | on        |      |         |            | on        | on        | f
 autovacuum_analyze_scale_factor       | 0.1       |      | 0       | 100        | 0.1       | 0.1       | f
 autovacuum_analyze_threshold          | 50        |      | 0       | 2147483647 | 50        | 50        | f
 autovacuum_freeze_max_age             | 200000000 |      | 100000  | 2000000000 | 200000000 | 200000000 | f
 autovacuum_max_workers                | 3         |      | 1       | 262143     | 3         | 3         | f
 autovacuum_multixact_freeze_max_age   | 400000000 |      | 10000   | 2000000000 | 400000000 | 400000000 | f
 autovacuum_naptime                    | 60        | s    | 1       | 2147483    | 60        | 60        | f
 autovacuum_vacuum_cost_delay          | 2         | ms   | -1      | 100        | 2         | 2         | f
 autovacuum_vacuum_cost_limit          | -1        |      | -1      | 10000      | -1        | -1        | f
 autovacuum_vacuum_insert_scale_factor | 0.2       |      | 0       | 100        | 0.2       | 0.2       | f
 autovacuum_vacuum_insert_threshold    | 1000      |      | -1      | 2147483647 | 1000      | 1000      | f
 autovacuum_vacuum_scale_factor        | 0.2       |      | 0       | 100        | 0.2       | 0.2       | f
 autovacuum_vacuum_threshold           | 50        |      | 0       | 2147483647 | 50        | 50        | f
 autovacuum_work_mem                   | -1        | kB   | -1      | 2147483647 | -1        | -1        | f
 log_autovacuum_min_duration           | 600000    | ms   | -1      | 2147483647 | 600000    | 600000    | f
 vacuum_cost_delay                     | 0         | ms   | 0       | 100        | 0         | 0         | f
 vacuum_cost_limit                     | 200       |      | 1       | 10000      | 200       | 200       | f
 vacuum_cost_page_dirty                | 20        |      | 0       | 10000      | 20        | 20        | f
 vacuum_cost_page_hit                  | 1         |      | 0