Index on (fixed size) bytea value
Dear fellow list members,

I'm in the process of implementing a file storage system that is based on PostgreSQL and streaming replication. There will possibly be many similar files stored. I would like to implement block-level deduplication: each file consists of a series of blocks, and each unique block is stored only once (e.g. one block may be reused by multiple files). It will be part of a bigger piece of software, and the same database will be used for other purposes too.

Here is the basic idea for storing individual blocks:

    create table block(
        id uuid not null primary key,
        block bytea not null,
        hs256 bytea not null
    );
    create unique index uidx_block_hs256 on block(hs256);

    create or replace function trg_biu_block() returns trigger
    language plpgsql as
    $function$
    begin
        new.hs256 = digest(new.block, 'sha256');
        return new;
    end;
    $function$;

    create trigger trg_biu_block before insert or update on block
    for each row execute procedure trg_biu_block();

This is just for storing the blocks. I'm going to put this "block" table into a separate tablespace. File operations will be at least 95% read and at most 5% write. (Streaming replication will hopefully allow almost horizontal scaling for read operations.) Most of the files will be several megabytes in size (images), and some of them will be 100MB or more (videos). Total capacity is in the 10TB range. Storage will be SSD (possibly fiber connected, or local RAID; we are not sure yet). I do not want to use PostgreSQL large objects, because they do not support block-level deduplication.

Here are some things that I need help with:

1. What should be the maximum size of a block? I was trying to find out the optimal value. The default BLCKSZ is 8192 bytes, and AFAIK PostgreSQL does not allow a row to occupy multiple blocks. I don't know enough to calculate the optimal "block size" (the maximum number of bytes stored in a single row in the block.block field), but I suspect that it should be 4K or something similar. I think that it should be as large as possible, without hitting the toast. My thinking is this: most files will be at least 1MB in size, so most "block" rows will reach the maximum tuple size, and it would be practical to make one row in the "block" table occupy almost one PostgreSQL block.

2. I'm not sure whether it would be beneficial to increase BLCKSZ. I will be able to test the speed of my (not yet finished) implementation with different BLCKSZ values, but that alone won't help me make the decision, because AFAIK BLCKSZ must have a fixed value for the whole PostgreSQL instance, so it will affect all other tables in the database. It would be hard to tell how changing BLCKSZ would affect the system as a whole.

3. In the above example I used SHA-256 (pgcrypto), because apparently it is very well optimized for 64-bit machines, and it has practically zero chance of a collision. I think that SHA-512 would be overkill. But I'm not sure that this is the best choice; maybe somebody with more experience can suggest a better hash function.

4. The hs256 value will always be a non-null, fixed 32-byte binary value, but the query planner will probably not know anything about that. I was also thinking about bit(256), but I don't see an easy way to convert the bytea digest into bit(256); a simple type cast won't work here. Maybe using bytea here is perfectly fine, and an index on the hs256 bytea field is as effective as possible.

I'm not looking for a definitive answer, just trying to get some hints from more experienced users before I fill up the drives with terabytes of data.

Thank you,

Laszlo
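For context, a minimal sketch of how files could reference the deduplicated blocks. The file and file_block tables here are illustrative, not part of the schema above; it assumes the pgcrypto extension (for digest and gen_random_uuid) and the block table and trigger as defined earlier:

```sql
-- Hypothetical companion tables: a file is an ordered list of block references.
create table file(
    id uuid not null primary key,
    name text not null
);

create table file_block(
    file_id uuid not null references file(id) on delete cascade,
    idx bigint not null,                       -- block position within the file
    block_id uuid not null references block(id),
    primary key (file_id, idx)
);

-- Deduplicating insert: the BEFORE trigger fills hs256, and the unique
-- index on hs256 lets ON CONFLICT skip blocks that are already stored.
insert into block(id, block)
values (gen_random_uuid(), '\x0102030405'::bytea)
on conflict (hs256) do nothing;

-- Fetch the id of the stored block (new or pre-existing) by its digest:
select id from block where hs256 = digest('\x0102030405'::bytea, 'sha256');
```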
Re: Index on (fixed size) bytea value
David G. Johnston wrote (Mon, Jun 19, 2023, 22:30):
> On Mon, Jun 19, 2023 at 1:05 PM Les wrote:
>
>> AFAIK PostgreSQL does not allow a row to occupy multiple blocks.
>>
>
> Your plan is going to heavily involve out-of-band storage. Please read up
> on it here:
>
> https://www.postgresql.org/docs/current/storage-toast.html
>
I'm aware of the TOAST, and how it works. I was referring to it ("I think
that it should be as large as possible, without hitting the toast. ") I
have designed a separate "block" table specifically to avoid storing binary
data in the TOAST. So my plan is not going to involve out-of-band storage.
Just to make this very clear: a record in the block table would store a
block, not the whole file. My question is about finding the optimal block
size (without hitting the toast), and finding the optimal hash algorithm
for block de-duplication.
Unless I totally misunderstood how the TOAST works. (?)
Laszlo
Re: Index on (fixed size) bytea value
> Then you would ALTER the column and SET STORAGE MAIN, so that it does not
> ever use TOAST.
>
> The size limit for a row would then be 8kB minus page header minus row
> header, which should be somewhere in the vicinity of 8140 bytes.
>
> If you want your block size to be a power of two, the limit would be 4kB,
> which would waste almost half your storage space.

Oh I see. So if I want to save some space for future columns, then storing about 7500 bytes in the "block" bytea column would be close to optimal, utilizing more than 90% of the block space. I guess that the fillfactor setting will have no effect on this table, and it does not matter whether I set it or not.
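The ~7500-byte choice can be sanity-checked empirically after loading some test data. A sketch, using the "block" table from upthread; rows per page close to 1.0 means each row fills nearly a whole 8 kB page:

```sql
-- Force inline storage so the bytea values never go to TOAST:
alter table block alter column block set storage main;
alter table block alter column hs256 set storage main;

-- After inserting sample rows, refresh statistics and estimate rows per page.
analyze block;
select reltuples / greatest(relpages, 1) as rows_per_page
from pg_class
where oid = 'block'::regclass;
```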
slow delete
I have created a table called _td with about 43 000 rows. I have tried to
use this as a primary key id list to delete records from my
product.product_file table, but I could not do it. It uses 100% of one CPU
and it takes forever. Then I changed the query to delete 100 records only,
and measure the speed:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
delete from product.product_file where id in (
select pf2_id from _td limit 100
)
It still takes 11 seconds. It means 110 msec / record, and that is
unacceptable.
I'm going to post the whole query plan at the end of this email, but I
would like to highlight the "Triggers" part:
"Triggers": [
{
"Trigger Name": "RI_ConstraintTrigger_a_26535",
"Constraint Name": "fk_pfft_product",
"Relation": "product_file",
"Time": 4.600,
"Calls": 90
},
{
"Trigger Name": "RI_ConstraintTrigger_a_26837",
"Constraint Name": "fk_product_file_src",
"Relation": "product_file",
"Time": 5.795,
"Calls": 90
},
{
"Trigger Name": "RI_ConstraintTrigger_a_75463",
"Constraint Name": "fk_pfq_src_product_file",
"Relation": "product_file",
"Time": 11179.429,
"Calls": 90
},
{
"Trigger Name": "_trg_002_aiu_audit_row",
"Relation": "product_file",
"Time": 49.410,
"Calls": 90
}
]
It seems that two foreign key constraints use 10.395 seconds out of the
total 11.24 seconds. But I don't see why they take that long.
The product.product_file table has 477 000 rows:
CREATE TABLE product.product_file (
id uuid NOT NULL,
c_tim timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
c_uid uuid NULL,
c_sid uuid NULL,
m_tim timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
m_uid uuid NULL,
m_sid uuid NULL,
product_id uuid NOT NULL,
product_file_type_id uuid NOT NULL,
file_id uuid NOT NULL,
product_file_status_id uuid NOT NULL,
dl_url text NULL,
src_product_file_id uuid NULL,
CONSTRAINT product_file_pkey PRIMARY KEY (id),
CONSTRAINT fk_pf_file FOREIGN KEY (file_id) REFERENCES media.file(id),
CONSTRAINT fk_pf_file_type FOREIGN KEY (product_file_type_id) REFERENCES
product.product_file_type(id),
CONSTRAINT fk_pf_product FOREIGN KEY (product_id) REFERENCES
product.product(id) ON DELETE CASCADE,
CONSTRAINT fk_product_file_src FOREIGN KEY (src_product_file_id) REFERENCES
product.product_file(id),
CONSTRAINT fk_product_file_status FOREIGN KEY (product_file_status_id)
REFERENCES product.product_file_status(id)
);
CREATE INDEX idx_product_file_dl_url ON product.product_file USING btree
(dl_url) INCLUDE (product_id) WHERE (dl_url IS NOT NULL);
CREATE INDEX idx_product_file_file_product ON product.product_file USING
btree (file_id, product_id);
CREATE INDEX idx_product_file_product_file ON product.product_file USING
btree (product_id, file_id);
CREATE INDEX idx_product_file_src ON product.product_file USING btree
(src_product_file_id) WHERE (src_product_file_id IS NOT NULL);
The table behind the fk_pfft_product constraint looks like this; it has
about 5000 records:
CREATE TABLE product.product_file_tag (
id uuid NOT NULL,
c_tim timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
c_uid uuid NULL,
c_sid uuid NULL,
m_tim timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
m_uid uuid NULL,
m_sid uuid NULL,
product_file_id uuid NOT NULL,
file_tag_id uuid NOT NULL,
CONSTRAINT product_file_tag_pkey PRIMARY KEY (id),
CONSTRAINT fk_pfft_file_tag FOREIGN KEY (file_tag_id) REFERENCES
product.file_tag(id) ON DELETE CASCADE DEFERRABLE,
CONSTRAINT fk_pfft_product FOREIGN KEY (product_file_id) REFERENCES
product.product_file(id) ON DELETE CASCADE DEFERRABLE
);
CREATE UNIQUE INDEX uidx_product_file_file_tag ON product.product_file_tag
USING btree (product_file_id, file_tag_id);
The other constraint has zero actual references, this returns zero:
select count(*) from product.product_file where src_product_file_id in (
select pf2_id from _td
); -- 0
I was trying to figure out how a foreign key constraint with zero actual
references can cost 100 msec / record, but I failed.
Can somebody please explain what is wrong here?
The plan is also visualized here:
http://tatiyants.com/pev/#/plans/plan_1692129126258
[
{
"Plan": {
"Node Type": "ModifyTable",
"Operation": "Delete",
"Parallel Aware": false,
"Async Capable": false,
"Relation Name": "product_file",
"Schema": "product",
"Alias": "product_file",
"Startup Cost": 4.21,
"Total Cost": 840.79,
"Plan Rows": 0,
"Plan Width": 0,
"Actual Startup Time": 0.567,
"Actual Total Time": 0.568,
"Actual Rows": 0,
"Actual Loops": 1,
"Shared Hit Blocks": 582,
"Shared Read Blocks": 0,
"Shared Dirtied Blocks": 10,
"Shared Written Blocks": 0,
"Local Hit Blocks": 0,
"Local Read Blocks": 0,
"Local Dirtied Blocks": 0,
"Local Written Blocks": 0,
"Temp Read Blocks": 0,
"Temp Written Blocks": 0,
"Plans": [
{
"Node Type": "Nested Loop",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Async Capable": false,
"Join Type": "Inner",
"Startup Cost": 4.21,
"Total Cost": 840.79,
"Plan Rows": 100,
"Plan Width": 46,
"A
Re: slow delete
Tom Lane wrote (Tue, Aug 15, 2023, 22:37):

> Les writes:
> > It seems that two foreign key constraints use 10.395 seconds out of the
> > total 11.24 seconds. But I don't see why it takes that much?
>
> Probably because you don't have an index on the referencing column.
> You can get away with that, if you don't care about the speed of
> deletes from the PK table ...

For the fk_pfft_product constraint this is true, but I always thought that PostgreSQL can use an index "partially". There is already an index:

    CREATE UNIQUE INDEX uidx_product_file_file_tag ON product.product_file_tag
    USING btree (product_file_id, file_tag_id);

It has the same order, only it has one more column. Wouldn't it be possible to use it for the plan?

After I created these two missing indexes:

    CREATE INDEX idx_pft_pf ON product.product_file_tag
    USING btree (product_file_id);

    CREATE INDEX idx_pfq_src_pf ON product.product_file_queue
    USING btree (src_product_file_id);

I could delete all 40 000 records in 10 seconds.

Thank you!

Laszlo
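To hunt for other foreign keys that might lack a supporting index on the referencing side, a catalog query like this can serve as a starting point. It is a sketch: it only lists the constraints referencing a given table, and each referencing column list still has to be checked against existing indexes by hand:

```sql
-- All foreign keys in other tables that reference product.product_file;
-- DELETEs on product_file have to check each of these.
select conrelid::regclass           as referencing_table,
       conname                      as constraint_name,
       pg_get_constraintdef(oid)    as definition
from pg_constraint
where contype = 'f'
  and confrelid = 'product.product_file'::regclass
order by 1, 2;
```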
Slow query, possibly not using index
I have this table:

    CREATE TABLE media.block (
        id uuid NOT NULL,
        "size" int8 NOT NULL,
        nrefs int8 NOT NULL DEFAULT 0,
        block bytea NOT NULL,
        hs256 bytea NOT NULL,
        CONSTRAINT block_pkey PRIMARY KEY (id),
        CONSTRAINT chk_nrefs CHECK ((nrefs >= 0))
    ) WITH ( toast_tuple_target=8160 )
    TABLESPACE data_slow;

    alter table media.block alter column block set storage main;
    alter table media.block alter column hs256 set storage main;

    CREATE INDEX idx_block_unused ON media.block USING btree (id) WHERE (nrefs = 0);
    CREATE UNIQUE INDEX uidx_block_hs256 ON media.block USING btree (hs256);

The number of rows in this table is about 40M, and most of the rows occupy a full 8K block (in most cases, the "block" field contains 7500 bytes). The idx_block_unused index should be used to find blocks that are unused, so they can be deleted at some point. The idx_block_unused index is less than 400MB:

    SELECT i.relname "Table Name", indexrelname "Index Name",
        pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
        pg_size_pretty(pg_indexes_size(relid)) as "Total Size of all Indexes",
        pg_size_pretty(pg_relation_size(relid)) as "Table Size",
        pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
        reltuples::bigint "Estimated table row count"
    FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
    where i.relid = 'media.block'::regclass;

    Table Name|Index Name      |Total Size|Total Size of all Indexes|Table Size|Index Size|Estimated table row count|
    ----------+----------------+----------+-------------------------+----------+----------+-------------------------+
    block     |block_pkey      |352 GB    |5584 MB                  |347 GB    |1986 MB   |                 38958848|
    block     |uidx_block_hs256|352 GB    |5584 MB                  |347 GB    |3226 MB   |                 38958848|
    block     |idx_block_unused|352 GB    |5584 MB                  |347 GB    |372 MB    |                 38958848|

If I try to select a single unused block this way:

    explain analyze select id from media.block b where nrefs = 0 limit 1

then it runs for more than 10 minutes. (I'm not sure how long; I cancelled the query after 10 minutes.)
If I run it without analyze:

    explain select id from media.block b where nrefs = 0 limit 1

    QUERY PLAN
    Limit  (cost=0.38..0.76 rows=1 width=16)
      ->  Index Only Scan using idx_block_unused on block b  (cost=0.38..869.83 rows=2274 width=16)

I believe it is not actually using the index, because reading a single (random?) entry from an index should not run for more than 10 minutes.

What am I doing wrong?

Thank you,

Laszlo
Re: Slow query, possibly not using index
> > If I try to select a single unused block this way:
> >     explain analyze select id from media.block b where nrefs = 0 limit 1
> > then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> > the query after 10 minutes).
>
> Are you sure it isn't blocked on a lock?

Yes, I'm sure. I have created a single database instance from a zfs snapshot and tried the query on that database. It was the only client.

> Another theory is that the index contains many thousands of references
> to now-dead rows, and the query is vainly searching for a live entry.
> Given that EXPLAIN thinks there are only about 2300 live entries,
> and yet you say the index is 400MB, this seems pretty plausible.

Nobody has ever deleted anything from this table. Since it was created, it has been a write-only table.

> Have you disabled autovacuum, or something like that?  (REINDEX
> could help here, at least till the index gets bloated again.)

I did not disable autovacuum.

> You might think that even so, it shouldn't take that long ... but
> indexes on UUID columns are a well known performance antipattern.
> The index entry order is likely to have precisely zip to do with
> the table's physical order, resulting in exceedingly-random access
> to the table, which'll be horribly expensive when the table is so
> much bigger than RAM.  Can you replace the UUID column with a simple
> serial (identity) column?

I'm aware of the problems with random UUID values. I have been using this function to create ULIDs from the beginning:

    CREATE OR REPLACE FUNCTION public.gen_ulid() RETURNS uuid
    LANGUAGE sql AS
    $function$
    SELECT (lpad(to_hex(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint), 12, '0') ||
            encode(gen_random_bytes(10), 'hex'))::uuid;
    $function$;

If I order some rows by id values, I can see that their creation times are strictly ascending. I did not write this function; it was taken from this website:

https://blog.daveallie.com/ulid-primary-keys

They have a benchmark section where they show that these ULID values are slower to generate (at least with this implementation) but much faster to insert. I might be able to replace these with int8 values; I need to check.

> > I believe it is not actually using the index, because reading a single
> > (random?) entry from an index should not run for >10 minutes.
>
> You should believe what EXPLAIN tells you about the plan shape.
> (Its rowcount estimates are only estimates, though.)

All of the 40M rows in this table are live. I'm 100% sure about this, because nobody has ever deleted rows from this table. I can try to do VACUUM on this table, but I'm limited on resources; I think it will take days. Maybe I can try to dump the whole database and restore it on another machine. Would that eliminate dead rows? (Is there a way to check the number of dead rows?)

Regards,

Laszlo
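On the last question, there is both a cheap estimate and an exact count. A sketch; the exact variant needs the pgstattuple extension and scans the whole table:

```sql
-- Estimated dead rows, from the statistics collector (cheap):
select n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relid = 'media.block'::regclass;

-- Exact counts; expensive, reads every page of the table:
create extension if not exists pgstattuple;
select tuple_count, dead_tuple_count, dead_tuple_percent
from pgstattuple('media.block');
```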
Re: Slow query, possibly not using index
> > I'm aware of the problems with random UUID values. I was using this
> > function to create ulids from the beginning:
>
> Oh, well that would have been useful information to provide at the
> outset.

I'm sorry, I left this out.

> Now that we know the index order is correlated with creation
> time, I wonder if it is also correlated with nrefs, in such a way that
> scanning in index order is disastrous because all the desired rows are
> at the end of the index.

Possibly, I have no idea.

> Also, you deny deleting any rows, but that wasn't the point.  Do you
> ever update nrefs from zero to nonzero?  That'd also leave dead
> entries behind in this index.  If that is a routine event that is
> correlated with creation time, it gets easier to believe that your
> index could have lots of dead entries at the front.

I have checked the trigger that maintains the nrefs field. Blocks are referenced from a "file_block" table. Every time a block is created, it first has an initial value of nrefs=0, then a file_block row (reference) is inserted, and nrefs is incremented to one. It means that every block has shown up once in the index and then disappeared. If the index was never vacuumed, then it is very plausible that it is full of dead rows.

    CREATE OR REPLACE FUNCTION media.trg_aiud_file_block() RETURNS trigger
    LANGUAGE plpgsql AS
    $function$
    begin
        if TG_OP = 'INSERT' then
            update media.block set nrefs = nrefs + 1 where id = new.block_id;
            return new;
        end if;
        if TG_OP = 'UPDATE' then
            if old.block_id is distinct from new.block_id then
                update media.block set nrefs = nrefs + 1 where id = new.block_id;
                update media.block set nrefs = nrefs - 1 where id = old.block_id;
            end if;
            return new;
        end if;
        if TG_OP = 'DELETE' then
            update media.block set nrefs = nrefs - 1 where id = old.block_id;
            return old;
        end if;
    end;
    $function$;

The idea was to create an index that can help in quickly removing unused blocks, to free up disk space. It would be much better to keep the initially inserted (not yet referenced) blocks out of the index, but I don't know how to do this.

> We'd still have to figure out why autovacuum is failing to clean out
> those entries in a timely fashion, but that seems like a plausible
> way for the performance problem to exist.

Yes, that would be very good to know. I could drop and recreate the index now, but after some time I would be facing the same situation again. I double checked, and the "autovacuum launcher" process is running. Here are the current settings:

    =# select name, setting, unit, min_val, max_val, boot_val, reset_val, pending_restart
       from pg_settings where name like '%vacuum%';

    name                                 | setting | unit | min_val | max_val    | boot_val | reset_val | pending_restart
    -------------------------------------+---------+------+---------+------------+----------+-----------+----------------
    autovacuum                           | on      |      |         |            | on       | on        | f
    autovacuum_analyze_scale_factor      | 0.1     |      | 0       | 100        | 0.1      | 0.1       | f
    autovacuum_analyze_threshold         | 50      |      | 0       | 2147483647 | 50       | 50        | f
    autovacuum_freeze_max_age            | 2       |      | 10      | 20         | 2        | 2         | f
    autovacuum_max_workers               | 3       |      | 1       | 262143     | 3        | 3         | f
    autovacuum_multixact_freeze_max_age  | 4       |      | 1       | 20         | 4        | 4         | f
    autovacuum_naptime                   | 60      | s    | 1       | 2147483    | 60       | 60        | f
    autovacuum_vacuum_cost_delay         | 2       | ms   | -1      | 100        | 2        | 2         | f
    autovacuum_vacuum_cost_limit         | -1      |      | -1      | 1          | -1       | -1        | f
    autovacuum_vacuum_insert_scale_factor| 0.2     |      | 0       | 100        | 0.2      | 0.2       | f
    autovacuum_vacuum_insert_threshold   | 1000    |      | -1      | 2147483647 | 1000     | 1000      | f
    autovacuum_vacuum_scale_factor       | 0.2     |      | 0       | 100        | 0.2      | 0.2       | f
    autovacuum_vacuum_threshold          | 50      |      | 0       | 2147483647 | 50       | 50        | f
    autovacuum_work_mem                  | -1      | kB   | -1      | 2147483647 | -1       | -1        | f
    log_autovacuum_min_duration          | 60      | ms   | -1      | 2147483647 | 60       | 60        | f
    vacuum_cost_delay                    | 0       | ms   | 0       | 100        | 0        | 0         | f
    vacuum_cost_limit                    | 200     |      | 1       | 1          | 200      | 200       | f
    vacuum_cost_page_dirty               | 20      |      | 0       | 1          | 20       | 20        | f
    vacuum_cost_page_hit                 | 1       |      | 0
Re: Slow query, possibly not using index
>
>>
> All right, I started pgstattuple() and I'll also do pgstatindex(), but it
> takes a while. I'll get back with the results.
>
=# select * from pgstattuple('media.block');

    table_len    | tuple_count | tuple_len    | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space  | free_percent
    -------------+-------------+--------------+---------------+------------------+----------------+--------------------+-------------+-------------
    372521984000 |    39652836 | 299148572428 |          80.3 |          3578319 |    26977942540 |               7.24 | 44638265312 |        11.98
    (1 row)
=# select * from pgstatindex('media.idx_block_unused');

    version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
    --------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+-------------------
          4 |          2 |  389677056 |           546 |            114 |      23069 |           0 |         24384 |            90.03 |                  0
    (1 row)
As far as I understand these numbers, the media.block table itself is in
good shape, but the index is not. Should I vacuum the whole table? Or would
it be better to run REINDEX INDEX CONCURRENTLY media.idx_block_unused?
The more important question is: how can I find out why the index was not
autovacuumed?
Thank you,
Laszlo
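Both routes mentioned above can be sketched like this. Note that REINDEX ... CONCURRENTLY needs PostgreSQL 12 or later and temporarily roughly doubles the index's disk usage while it rebuilds:

```sql
-- Rebuild just the bloated index without blocking writes:
reindex index concurrently media.idx_block_unused;

-- Or remove dead tuples from the table and all of its indexes in place:
vacuum (verbose) media.block;
```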
Re: Slow query, possibly not using index
> =# select * from pgstatindex('media.idx_block_unused');
>
>     version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
>     --------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+-------------------
>           4 |          2 |  389677056 |           546 |            114 |      23069 |           0 |         24384 |            90.03 |                  0
>     (1 row)

After reindex:

=# select * from pgstatindex('media.idx_block_unused');

    version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
    --------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+-------------------
          4 |          0 |       8192 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
    (1 row)
explain analyze select id from media.block b where nrefs = 0 limit 1

    QUERY PLAN
    Limit  (cost=0.14..0.46 rows=1 width=16) (actual time=0.010..0.011 rows=0 loops=1)
      ->  Index Only Scan using idx_block_unused on block b  (cost=0.14..698.91 rows=2231 width=16) (actual time=0.008..0.009 rows=0 loops=1)
            Heap Fetches: 0
    Planning Time: 0.174 ms
    Execution Time: 0.030 ms
It is actually empty.
Now I only need to figure out why autovacuum did not work on the index.
Thank you
Laszlo
Re: Slow query, possibly not using index
> > More important question is, how can I find out why the index was not
> > auto vacuumed.
>
> You should have a look at pg_stat_user_tables.  It'll let you know if
> the table is being autovacuumed and how often.  If you're concerned
> about autovacuum not running properly, then you might want to lower
> log_autovacuum_min_duration.  Generally, anything that takes a
> conflicting lock will cause autovacuum to cancel so that the
> conflicting locker can get through.  Things like ALTER TABLE or even
> an ANALYZE running will cancel most autovacuum runs on tables.
>
> Also, this is a fairly large table and you do have the standard
> autovacuum settings.  Going by pgstattuple, the table has 39652836
> tuples.  Autovacuum will trigger when the statistics indicate that 20%
> of tuples are dead, which is about 8 million tuples.  Perhaps that's
> enough for the index scan to have to skip over a large enough number
> of dead tuples to make it slow.  You might want to consider lowering
> the autovacuum scale factor for this table.
>
> Also, ensure you're not doing anything like calling pg_stat_reset();
>
> It might be worth showing us the output of:
>
> select * from pg_stat_user_tables where relid = 'media.block'::regclass;

Thank you for your suggestion, this is really very helpful.

select * from pg_stat_user_tables where relid = 'media.block'::regclass;

    Name                |Value                        |
    --------------------+-----------------------------+
    relid               |25872                        |
    schemaname          |media                        |
    relname             |block                        |
    seq_scan            |8                            |
    seq_tup_read        |139018370                    |
    idx_scan            |45023556                     |
    idx_tup_fetch       |37461539                     |
    n_tup_ins           |7556051                      |
    n_tup_upd           |7577720                      |
    n_tup_del           |0                            |
    n_tup_hot_upd       |0                            |
    n_live_tup          |39782042                     |
    n_dead_tup          |5938057                      |
    n_mod_since_analyze |1653427                      |
    n_ins_since_vacuum  |5736676                      |
    last_vacuum         |                             |
    last_autovacuum     |2023-08-17 22:39:29.383 +0200|
    last_analyze        |                             |
    last_autoanalyze    |2023-08-22 16:02:56.093 +0200|
    vacuum_count        |0                            |
    autovacuum_count    |1                            |
    analyze_count       |0                            |
    autoanalyze_count   |4                            |

Regards,

Laszlo
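Following the advice above, the scale factor can be lowered for this one table only, via per-table storage parameters. A sketch; the numbers are illustrative, not a recommendation:

```sql
-- Trigger autovacuum after roughly 0.5% of rows change (~200k rows at 40M),
-- instead of the default 20% (~8M rows):
alter table media.block set (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_threshold    = 10000
);
```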
Terribly slow query with very good plan?
Hello,
I have a table that contains folders, and another one that contains files.
Here are the table definitions. I have removed most of the columns because
they are not important for this question. (There are lots of columns.)
CREATE TABLE media.oo_folder (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
relpath text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_folder_chk_no_slash CHECK (("position"(title, '/'::text) =
0)),
CONSTRAINT pk_oo_folder PRIMARY KEY (id),
CONSTRAINT fk_oo_folder_parent_id FOREIGN KEY (parent_id) REFERENCES
media.oo_folder(id) ON DELETE CASCADE DEFERRABLE
);
CREATE INDEX oo_folder_idx_parent ON media.oo_folder USING btree
(parent_id);
CREATE INDEX oo_folder_idx_relpath ON media.oo_folder USING btree (relpath);
CREATE UNIQUE INDEX uidx_oo_folder_active_title ON media.oo_folder USING
btree (parent_id, title) WHERE is_active;
CREATE TABLE media.oo_file (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
ext text NULL,
relpath text NOT NULL,
sha1 text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_file_chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT pk_oo_file PRIMARY KEY (id),
CONSTRAINT fk_oo_file_oo_folder_id FOREIGN KEY (oo_folder_id) REFERENCES
media.oo_folder(id) ON DELETE CASCADE DEFERRABLE
);
CREATE INDEX oo_file_idx_oo_folder_id ON media.oo_file USING btree
(oo_folder_id);
CREATE INDEX oo_file_idx_relpath ON media.oo_file USING btree (relpath);
CREATE INDEX oo_file_idx_sha1 ON media.oo_file USING btree (sha1);
CREATE UNIQUE INDEX uidx_oo_file_active_title ON media.oo_file USING btree
(oo_folder_id, title) WHERE is_active;
The "relpath" field contains the path of the file/folder, for example:
"/folder1/folder2/folder3/filename4.ext5". The relpath field is managed by
triggers. There are about 1M file rows and 600K folder rows in the
database. The files are well distributed between folders, and there are
only 45 root folders (parent_id is null).
This query runs very fast:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select id, title from media.oo_folder f where f.parent_id is null

    QUERY PLAN
    Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..73.70 rows=20 width=25) (actual time=0.030..0.159 rows=45 loops=1)
      Output: id, title
      Index Cond: (f.parent_id IS NULL)
      Buffers: shared hit=40
    Planning Time: 0.123 ms
    Execution Time: 0.187 ms
My task is to write a query that tells if a folder has any active file
inside it - directly or in subfolders. Here is the query for that:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select id, title,
(exists (select f2.id from media.oo_file f2 where f2.relpath like f.relpath
|| '%')) as has_file
from media.oo_folder f where f.parent_id is null
    QUERY PLAN
    Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..488.02 rows=20 width=26) (actual time=713.419..25414.969 rows=45 loops=1)
      Output: f.id, f.title, (SubPlan 1)
      Index Cond: (f.parent_id IS NULL)
      Buffers: shared hit=7014170
      SubPlan 1
        ->  Index Only Scan using oo_file_idx_relpath on media.oo_file f2  (cost=0.55..108499.27 rows=5381 width=0) (actual time=564.756..564.756 rows=0 loops=45)
              Filter: (f2.relpath ~~ (f.relpath || '%'::text))
              Rows Removed by Filter: 792025
              Heap Fetches: 768960
              Buffers: shared hit=7014130
    Planning Time: 0.361 ms
    Execution Time: 25415.088 ms
It also returns 45 rows, but in 25 seconds, which is unacceptable.

If I execute the "has_file" subquery for one specific relpath, then it
speeds up again to under 1 ms:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select exists ( select id from media.oo_file of2 where relpath like 'Felhasználók%')

    QUERY PLAN
    Result  (cost=1.66..1.67 rows=1 width=1) (actual time=0.049..0.050 rows=1 loops=1)
      Output: $0
      Buffers: sh
Re: Terribly slow query with very good plan?
Laurenz Albe wrote (Fri, Feb 4, 2022, 10:18):

> > It also returns 45 rows, but in 25 seconds which is unacceptable.
>
> You should create an index that supports LIKE; for example
>
> CREATE INDEX ON media.oo_file (relpath COLLATE "C");

CREATE INDEX test ON media.oo_file (relpath COLLATE "C");

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select id, title,
  (exists (select f2.id from media.oo_file f2 where f2.relpath like f.relpath || '%' )) as has_file
from media.oo_folder f where f.parent_id is null;

    QUERY PLAN
    Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..459.38 rows=20 width=26) (actual time=772.566..24081.820 rows=45 loops=1)
      Output: f.id, f.title, (SubPlan 1)
      Index Cond: (f.parent_id IS NULL)
      Buffers: shared hit=6672274
      SubPlan 1
        ->  Index Only Scan using test on media.oo_file f2  (cost=0.55..100756.64 rows=5379 width=0) (actual time=535.113..535.113 rows=0 loops=45)
              Filter: (f2.relpath ~~ (f.relpath || '%'::text))
              Rows Removed by Filter: 777428
              Heap Fetches: 736418
              Buffers: shared hit=6672234
    Planning Time: 0.338 ms
    Execution Time: 24082.152 ms

Not helping :-(
Re: Terribly slow query with very good plan?
I really think now that the query plan is wrong (or "could be improved", so to say). As far as I understand, this "index only scan" is essentially a sequential scan over the index data. In this specific case, where the filter is a "begins with" condition on a field that is the first (and only) column of an index, there is a much better way to find out whether a row exists: look up the closest value in the index and see if it begins with the prefix. Looking up the closest value in an index would be much more efficient.

> I don't understand how it is possible in the slow case Rows Removed by
> Filter: 792025 (returns 0 row) and in the second case Rows Removed by
> Filter: 15 (returns 1 row).

Pavel, I think it is because the scan found a suitable row at the beginning of the scan and stopped. If you look at that plan, you will see that it uses a seq scan. It was fast by accident. :-)

The plan of the single-row version was changed to a normal index scan after I added the collation "C" index:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select exists ( select id from media.oo_file of2 where relpath like 'this does not exist%' );

    QUERY PLAN
    Result  (cost=0.63..0.64 rows=1 width=1) (actual time=0.022..0.023 rows=1 loops=1)
      Output: $0
      Buffers: shared hit=4
      InitPlan 1 (returns $0)
        ->  Index Only Scan using test on media.oo_file of2  (cost=0.55..8.57 rows=108 width=0) (actual time=0.018..0.018 rows=0 loops=1)
              Index Cond: ((of2.relpath >= 'this does not exist'::text) AND (of2.relpath < 'this does not exisu'::text))
              Filter: (of2.relpath ~~ 'this does not exist%'::text)
              Heap Fetches: 0
              Buffers: shared hit=4
    Planning Time: 0.530 ms
    Execution Time: 0.055 ms

I would expect the same for the originally slow query with the has_file column, but it does not happen. :-(
Re: Terribly slow query with very good plan?
Nick Cleaton wrote (Fri, Feb 4, 2022, 11:00):

> In the fast case the 'Felhasználók%' part is known at query planning
> time, so it can be a prefix search.
>
> In the slow case, the planner doesn't know what that value will be, it
> could be something that starts with '%' for example.

First of all, it CANNOT start with '%'. This is a fact, and this fact can be determined by analyzing the query. Something that the query planner should do, right?

Second argument: the same query is also slow with the ^@ operator:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select id, title,
  (exists (select f2.id from media.oo_file f2 where f2.relpath ^@ f.relpath )) as has_file
from media.oo_folder f where f.parent_id is null

    QUERY PLAN
    Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..449.38 rows=20 width=26) (actual time=1652.624..61636.232 rows=45 loops=1)
      Output: f.id, f.title, (SubPlan 1)
      Index Cond: (f.parent_id IS NULL)
      Buffers: shared hit=6672274
      SubPlan 1
        ->  Index Only Scan using test on media.oo_file f2  (cost=0.55..98067.11 rows=5379 width=0) (actual time=1369.665..1369.665 rows=0 loops=45)
              Filter: (f2.relpath ^@ f.relpath)
              Rows Removed by Filter: 777428
              Heap Fetches: 736418
              Buffers: shared hit=6672234
    Planning Time: 0.346 ms
    Execution Time: 61636.319 ms

> Also your logic looks a bit unsafe, the query you have would include
> files under all top-level folders with names starting with
> Felhasználók, so you could accidentally merge in files in folders
> called Felhasználókfoo and Felhasználókbar for example.

Forgive me, I typed in these examples for demonstration. The actual code uses relpath || '/%' and it avoids those cases.
Re: Terribly slow query with very good plan?
>> In the fast case the 'Felhasználók%' part is known at query planning
>> time, so it can be a prefix search.
>>
>> In the slow case, the planner doesn't know what that value will be, it
>> could be something that starts with '%' for example.
>
> First of all, it CANNOT start with '%'. This is a fact and this fact can
> be determined by analyzing the query. Something that the query planner
> should do, right?
>
> Second argument: the same query is also slow with the ^@ operator...

Oh I see, the query planner does not know that there will be no % characters in file and folder names. But what is the solution then? It just seems wrong that I can speed up a query 1000 times by replacing it with a nested loop in a PL/pgSQL function :(
Re: Terribly slow query with very good plan?
>> First of all, it CANNOT start with '%'. This is a fact and this fact can
>> be determined by analyzing the query. Something that the query planner
>> should do, right?
>>
>> Second argument: the same query is also slow with the ^@ operator...
>
> Oh I see, the query planner does not know that there will be no %
> characters in file and folder names.

On second thought, that does not explain why it is also slow with the ^@ operator.
Re: Terribly slow query with very good plan?
Nick Cleaton wrote (on 2022 Feb 4, Fri, 11:57):

> With the ^@ operator, my guess is that because the planner knows
> nothing about the folder name value it could be the empty string,
> which would be a prefix of everything.

I think I could narrow down the problem to the simplest query possible. The "title could be empty" objection does not apply here:

CREATE index test ON media.oo_file (relpath COLLATE "C");

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath ^@ 'Természettudomány'
limit 1

Limit  (cost=0.00..2.70 rows=1 width=8) (actual time=14445.559..14445.561 rows=0 loops=1)
  Output: id
  Buffers: shared hit=22288 read=108975
  ->  Seq Scan on media.oo_file fi  (cost=0.00..144710.65 rows=53574 width=8) (actual time=14445.555..14445.556 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND (fi.relpath ^@ 'Természettudomány'::text))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=22288 read=108975
Planning Time: 0.398 ms
Execution Time: 14445.593 ms

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'Természettudomány%'
limit 1

Limit  (cost=0.00..2.70 rows=1 width=8) (actual time=11222.280..11222.282 rows=0 loops=1)
  Output: id
  Buffers: shared hit=22320 read=108943
  ->  Seq Scan on media.oo_file fi  (cost=0.00..144710.65 rows=53574 width=8) (actual time=11222.278..11222.279 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND (fi.relpath ~~ 'Természettudomány%'::text))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=22320 read=108943
Planning Time: 0.488 ms
Execution Time: 11222.307 ms

It uses a seq scan in both cases. This is definitely wrong!
One of my colleagues noticed that the LIKE query uses an index scan for some of the letters:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'A%'
limit 1;

Limit  (cost=0.55..60.85 rows=1 width=8) (actual time=6.508..6.509 rows=0 loops=1)
  Output: id
  Buffers: shared hit=2776
  ->  Index Scan using test on media.oo_file fi  (cost=0.55..4583.29 rows=76 width=8) (actual time=6.506..6.507 rows=0 loops=1)
        Output: id
        Index Cond: ((fi.relpath >= 'A'::text) AND (fi.relpath < 'B'::text))
        Filter: (fi.is_active AND (fi.relpath ~~ 'A%'::text))
        Rows Removed by Filter: 3784
        Buffers: shared hit=2776
Planning Time: 0.543 ms
Execution Time: 6.560 ms

The actual number of files per starting letter is:

select substr(relpath, 0, 2), count(*)
from media.oo_file
group by substr(relpath, 0, 2)
order by count(*) desc

substr | count
-------+-------
O      | 386087
F      | 236752
N      | 167171
Ó      | 111479
T      | 109786
M      |  34348
P      |  19878
L      |   5657
A      |   3784
I      |    869
C      |      1

PostgreSQL uses a seq scan for the O, F, N and T letters, but an index scan for the A, I and C letters (with the "like" query). There might be a problem with the planner here, because I think that an index scan would always be faster than a seq scan in this case. The number of rows matching the prefix should not matter at all, because we are trying to get the first matching row only. For some reason it decides between seq/index scan based on the row counts stored in the statistics; at least it seems that way. If I could tell the planner to use the index, I think my problem would be solved. Is there a way to put optimizer hints into the query? There could also be some improvement made to the ^@ operator, because it always uses a seq scan, no matter what. What do you think?
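As a diagnostic (not a fix): PostgreSQL has no query-level optimizer hints, but the enable_* planner settings can steer the choice for one session, which at least confirms whether the index plan really is faster here. A sketch:

```sql
-- Session-level experiment only: enable_seqscan = off does not forbid
-- seq scans, it just makes them look very expensive to the planner,
-- so the index plan wins if one is available.
set enable_seqscan = off;

explain (analyze, buffers)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'Természettudomány%'
limit 1;

reset enable_seqscan;
```

Comparing the two EXPLAIN ANALYZE timings then shows whether the planner's choice or its cost estimate is at fault.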
Re: Terribly slow query with very good plan?
> > PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan
> > for A, I, C letters (with the "like" query).
>
> That's interesting.
>
> Does it help if you create an additional index on relpath with the
> text_pattern_ops modifier, e.g.
>
> CREATE INDEX ... USING btree (relpath text_pattern_ops);

It does not help. Details below. (PostgreSQL version 12.8)

CREATE index test ON media.oo_file (relpath COLLATE "C");
CREATE INDEX test2 ON media.oo_file USING btree (relpath text_pattern_ops);
CREATE INDEX test3 ON media.oo_file USING btree (relpath collate "C" text_pattern_ops);

-- letter 'A', ^@ operator -> slow seq scan
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath ^@ 'A'
limit 1;

Limit  (cost=0.00..1904.09 rows=1 width=8) (actual time=10779.585..10779.587 rows=0 loops=1)
  Output: id
  Buffers: shared hit=9960 read=121303
  ->  Seq Scan on media.oo_file fi  (cost=0.00..144710.65 rows=76 width=8) (actual time=10779.582..10779.583 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND (fi.relpath ^@ 'A'::text))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=9960 read=121303
Planning Time: 0.428 ms
Execution Time: 10779.613 ms

-- letter 'A', like expression -> fast index scan
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'A%'
limit 1;

Limit  (cost=0.55..60.85 rows=1 width=8) (actual time=7.047..7.048 rows=0 loops=1)
  Output: id
  Buffers: shared hit=2776
  ->  Index Scan using test on media.oo_file fi  (cost=0.55..4583.29 rows=76 width=8) (actual time=7.045..7.045 rows=0 loops=1)
        Output: id
        Index Cond: ((fi.relpath >= 'A'::text) AND (fi.relpath < 'B'::text))
        Filter: (fi.is_active AND (fi.relpath ~~ 'A%'::text))
        Rows Removed by Filter: 3784
        Buffers: shared hit=2776
Planning Time: 0.937 ms
Execution Time: 7.091 ms

-- letter 'T', like expression -> slow seq scan
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'Természettudomány%'
limit 1;

Limit  (cost=0.00..2.70 rows=1 width=8) (actual time=9842.935..9842.938 rows=0 loops=1)
  Output: id
  Buffers: shared hit=10024 read=121239
  ->  Seq Scan on media.oo_file fi  (cost=0.00..144710.65 rows=53574 width=8) (actual time=9842.933..9842.934 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND (fi.relpath ~~ 'Természettudomány%'::text))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=10024 read=121239
Planning Time: 0.975 ms
Execution Time: 9842.962 ms
Re: Terribly slow query with very good plan?
> > It does not help.
>
> What if you try applying the C collation to the values from the table:
>
> where fi.is_active and fi.relpath collate "C" ^@ 'A'

Slow:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath collate "C" ^@ 'A'
limit 1;

Limit  (cost=0.00..1904.09 rows=1 width=8) (actual time=3837.338..3837.340 rows=0 loops=1)
  Output: id
  Buffers: shared hit=9355 read=121908
  ->  Seq Scan on media.oo_file fi  (cost=0.00..144710.65 rows=76 width=8) (actual time=3837.336..3837.336 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND ((fi.relpath)::text ^@ 'A'::text))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=9355 read=121908
Planning Time: 0.391 ms
Execution Time: 3837.364 ms
Re: Terribly slow query with very good plan?
> Slow
>
> What about this:
>
> fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")
>
It uses an index scan:
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active
and fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")
limit 1;

Limit  (cost=0.55..6.12 rows=1 width=8) (actual time=1623.069..1623.070 rows=0 loops=1)
  Output: id
  Buffers: shared hit=2439 read=1994 dirtied=1107
  ->  Index Scan using test on media.oo_file fi  (cost=0.55..5732.47 rows=1029 width=8) (actual time=1623.067..1623.067 rows=0 loops=1)
        Output: id
        Index Cond: ((fi.relpath >= 'A'::text COLLATE "C") AND (fi.relpath <= 'A '::text COLLATE "C"))
        Filter: fi.is_active
        Rows Removed by Filter: 3784
        Buffers: shared hit=2439 read=1994 dirtied=1107
Planning Time: 18.817 ms
Execution Time: 1623.104 ms
Although the same with 'Természettudomány' uses a seq scan:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active
and fi.relpath between
    ('Természettudomány' collate "C")
    and ('Természettudomány'||chr(255) collate "C")
limit 1;

Limit  (cost=0.00..2.13 rows=1 width=8) (actual time=7521.531..7521.532 rows=0 loops=1)
  Output: id
  Buffers: shared hit=17018 read=150574
  ->  Seq Scan on media.oo_file fi  (cost=0.00..188195.39 rows=88290 width=8) (actual time=7521.528..7521.529 rows=0 loops=1)
        Output: id
        Filter: (fi.is_active AND (fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C"))
        Rows Removed by Filter: 1075812
        Buffers: shared hit=17018 read=150574
Planning Time: 8.918 ms
Execution Time: 7521.560 ms
Re: Terribly slow query with very good plan?
> That may be because it's expecting to get 88290 rows from the
> sequential scan, and the "limit 1" means it expects sequential scan to
> be fast because on average it will only need to scan 1/88290 of the
> table before it finds a matching row, then it can stop.

We are looking for a single row. With an index scan, it is always much faster to find a single row. No seq scan can be faster "on average" when you are looking for a single row. Am I wrong?
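The 88290-row estimate the LIMIT cost is scaled by comes from the column statistics, so a wildly wrong estimate is worth inspecting. A sketch using the standard pg_stats view and per-column statistics target (the target value 1000 is an arbitrary example):

```sql
-- Look at the statistics behind the range-selectivity estimate for relpath.
select n_distinct, null_frac
from pg_stats
where schemaname = 'media' and tablename = 'oo_file' and attname = 'relpath';

-- A larger histogram may sharpen the estimate for skewed prefixes:
alter table media.oo_file alter column relpath set statistics 1000;
analyze media.oo_file;
```

If the estimate improves, the cost comparison between the seq scan and the index scan may flip without any query rewrite.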
> Try it without the "limit 1"

Without the limit it uses a bitmap heap scan. Unbelievable!
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active
and fi.relpath between
    ('Természettudomány' collate "C")
    and ('Természettudomány'||chr(255) collate "C");

Bitmap Heap Scan on media.oo_file fi  (cost=10480.10..140065.96 rows=70010 width=8) (actual time=9757.917..9757.920 rows=0 loops=1)
  Output: id
  Recheck Cond: ((fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C"))
  Filter: fi.is_active
  Rows Removed by Filter: 85207
  Heap Blocks: exact=24954
  Buffers: shared hit=197 read=26531
  ->  Bitmap Index Scan on test  (cost=0.00..10462.59 rows=99404 width=0) (actual time=425.571..425.572 rows=85207 loops=1)
        Index Cond: ((fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C"))
        Buffers: shared hit=6 read=1768
Planning Time: 1.145 ms
JIT:
  Functions: 6
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 2.295 ms, Inlining 0.000 ms, Optimization 1.142 ms, Emission 11.632 ms, Total 15.070 ms
Execution Time: 9760.361 ms