Re: Partial index creation always scans the entire table
Laurenz Albe writes:
> On Sat, 2020-02-15 at 19:04 +0800, MingJu Wu wrote:
>> My scenario is that I have a table with 50M rows that are categorized into
>> 10K categories.
>> I need to create a partial index for each category.

> You don't need a partial index per category, you need a single index that
> *contains* the category.

Yeah, that's an anti-pattern.  Essentially, you are trying to replace the
first branching level of an index that includes the category column with a
ton of system catalog entries and planner proof logic to select one of N
indexes that don't include the category.  It is *highly* unlikely that
that's going to be a win.  It's going to be a huge loss if the planner
fails to make the proof you need, and even when it does, it's not really
going to be faster overall --- you've traded off run-time for planning
time, at a rather unfavorable exchange rate.  Updates on the table are
going to be enormously penalized, too, because the index machinery doesn't
have any way to understand that only one of the indexes needs work.

I've seen people try to do this before.  I wonder if the manual page about
partial indexes should explicitly say "don't do that".

			regards, tom lane
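For concreteness, a minimal sketch of the proof problem described above
(all object names here are hypothetical, and actual plans depend on
statistics):

    -- One of the 10K partial indexes:
    CREATE TABLE mytable (category int, data text);
    CREATE TABLE requests (category int);
    CREATE INDEX mytable_cat_42 ON mytable (data) WHERE category = 42;

    -- Usable: the constant lets the planner prove the index predicate.
    EXPLAIN SELECT * FROM mytable WHERE category = 42 AND data = 'x';

    -- Unusable: category is not a plan-time constant here, so each of
    -- the N partial indexes must be tested and rejected one by one.
    EXPLAIN SELECT m.* FROM mytable m
            JOIN requests r USING (category)
            WHERE m.data = 'x';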
Re: Partial index creation always scans the entire table
On Sat, Feb 15, 2020 at 10:15:53PM +0100, Laurenz Albe wrote:
> > My scenario is that I have a table with 50M rows that are categorized into
> > 10K categories.
> > I need to create a partial index for each category. I have created a index
> > on the category
> > column, hoping that postgres can use this information when creating the
> > partial indexes.
> > However, postgres always performs full table scan.
>
> There is your problem.
>
> You don't need a partial index per category, you need a single index that
> *contains* the category.

On Sun, Feb 16, 2020 at 10:30:05AM -0500, Tom Lane wrote:
> Laurenz Albe writes:
> > On Sat, 2020-02-15 at 19:04 +0800, MingJu Wu wrote:
> >> My scenario is that I have a table with 50M rows that are categorized into
> >> 10K categories.
> >> I need to create a partial index for each category.
>
> > You don't need a partial index per category, you need a single index that
> > *contains* the category.
>
> Yeah, that's an anti-pattern.  Essentially, you are trying to replace the

The OP mentioned having an index on "category", which they were hoping the
creation of partial indexes would use:

On Sat, Feb 15, 2020 at 07:04:48PM +0800, MingJu Wu wrote:
> My scenario is that I have a table with 50M rows that are categorized into
> 10K categories. I need to create a partial index for each category. I have
> created a index on the category column, hoping that postgres can use this
> information when creating the partial indexes. However, postgres always
> performs full table scan.

So the question is why they (think they) *also* need a large number of
partial indexes.

I was reminded of reading this, but I think it's a pretty different case.
https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

--
Justin
Re: Partial index creation always scans the entire table
Justin Pryzby writes:
> I was reminded of reading this, but I think it's a pretty different case.
> https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

Yeah, the critical paragraph in that is

    This isn’t as scary as it sounds for a two main reasons. First, we
    shard all of our data by customer. Each table in our database holds
    only one customer’s data, so each table has a only a few thousand
    indexes at most. Second, these events are relatively rare. The most
    common defined events make up only a few percent of a customer’s raw
    events, and most are much more rare. This means that we perform
    relatively little I/O maintaining this schema, because most incoming
    events match no event definitions and therefore don’t need to be
    written to any of the indexes. Similarly, the indexes don’t take up
    much space on disk.

A set of partial indexes that cover a small part of the total data can be
sensible.  If you're trying to cover most/all of the data, you're doing it
wrong --- basically, you're reinventing partitioning using the wrong tools.

			regards, tom lane
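Where most or all of the data does need covering, a sketch of the
partitioning alternative (hypothetical names; PostgreSQL 11 or later for
the cascading index):

    CREATE TABLE events (
        category int NOT NULL,
        data     text
    ) PARTITION BY LIST (category);

    CREATE TABLE events_cat_1 PARTITION OF events FOR VALUES IN (1);
    CREATE TABLE events_cat_2 PARTITION OF events FOR VALUES IN (2);
    -- ...with ~10K categories, PARTITION BY HASH (category) and far
    -- fewer partitions would likely be more practical.

    -- One declaration creates a matching index on every partition, and
    -- the planner prunes partitions instead of testing index predicates:
    CREATE INDEX ON events (data);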
SubtransControlLock and performance problems
Hi
On a server with 32 cores and 250 GB memory, running CentOS 7 with kernel
4.4.214-1.el7.elrepo.x86_64, I try to run 30 parallel threads using dblink
(https://github.com/larsop/postgres_execute_parallel). I have tried to
disconnect and reconnect in the dblink code, and that did not help.
If I reduce the number of threads I get less CPU usage and far fewer
SubtransControlLock waits.
Each thread is inserting many lines into a PostGIS Topology layer. I have a
lot of try/catch blocks in this code to avoid missing lines
(https://github.com/larsop/resolve-overlap-and-gap/tree/add_postgis_topology_using_proc).
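Each PL/pgSQL EXCEPTION clause opens a subtransaction, so code of the
following shape (a contrived sketch, not the actual procedure; the target
table is hypothetical) creates one subtransaction per trapped insert:

    DO $$
    BEGIN
        FOR i IN 1..10000 LOOP
            BEGIN
                -- each entry into this block starts a new subtransaction
                INSERT INTO lines_table VALUES (i);
            EXCEPTION WHEN others THEN
                NULL;  -- swallow the error so no line is lost
            END;
        END LOOP;
    END;
    $$;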
What happens is that after some minutes the CPU can fall to maybe 20% usage
and most of the threads are blocked on SubtransControlLock; when the number
of SubtransControlLock waits goes down, the CPU load increases again. The
jobs usually go through without any errors, but they take too long because
of the SubtransControlLock blocking.
There is no iowait on the server and there is plenty of free memory. There
seem to be no locks on the common tables:
“SELECT relation::regclass, * FROM pg_locks WHERE NOT GRANTED;” is always empty.
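The waiters themselves can be watched directly with something like this
(on Postgres 11/12, SubtransControlLock is reported as an LWLock wait
event in pg_stat_activity):

    -- Who is currently waiting on the subtrans SLRU lock:
    SELECT pid, wait_event_type, wait_event, state, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE wait_event = 'SubtransControlLock';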
I am using a lot of temp tables and unlogged tables.
To reduce the number of locks I have a simple check before I kick off new
jobs, like the one below, but that did not help very much either. Yes, it
does a lot of waiting, but SubtransControlLock kicks in again once all
threads are up and running.
-- Wait until no worker is blocked on SubtransControlLock before
-- kicking off the next batch of jobs.
LOOP
    EXECUTE format(
        'SELECT count(*) FROM pg_stat_activity WHERE wait_event = %L AND query LIKE %L',
        'SubtransControlLock',
        'CALL resolve_overlap_gap_save_single_cells%')
    INTO subtransControlLock;
    EXIT WHEN subtransControlLock = 0;
    subtransControlLock_count := subtransControlLock_count + 1;
    -- Back off in proportion to the number of waiters and of retries.
    PERFORM pg_sleep(subtransControlLock * subtransControlLock_count * 0.1);
END LOOP;
I have tested with Postgres 11, Postgres 12, PostGIS 2.5 and PostGIS 3.0,
and it seems to behave the same.
I have also tried to recompile postgres with the setting below and that did not
solve the problem either.
/* Number of SLRU buffers to use for subtrans */
#define NUM_SUBTRANS_BUFFERS 2048
I have tested different values for memory and other settings; nothing seems
to matter. Here are the settings right now:
maintenance_work_mem = 8GB
max_connections = 600
work_mem = 500MB
temp_buffers = 100MB
shared_buffers = 64GB
effective_cache_size = 124GB
wal_buffers = 640MB
seq_page_cost = 2.0
random_page_cost = 2.0
checkpoint_flush_after = 2MB
checkpoint_completion_target = 0.9
default_statistics_target = 1000
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 1
pg_stat_statements.track = all
effective_io_concurrency = 500 # 1-1000; 0 disables prefetching
# test to avoid SubtransControlLock
#bgwriter_lru_maxpages = 10
#bgwriter_lru_maxpages=0
#bgwriter_delay = 20ms
synchronous_commit = off
Any idea about how to solve this?
Lars
RE: Partial index creation always scans the entire table
>From: Tom Lane
>Sent: Sunday, February 16, 2020 7:30 AM
>I've seen people try to do this before. I wonder if the manual page about
>partial indexes should explicitly say "don't do that".
> regards, tom lane

Yes please (seriously). The utter beauty of Postgres is the flexibility and
power that its evolutionary path has allowed/created. The tragic danger is
that the beauty is fairly easy to misapply/misuse. Caveats in the
documentation would be very beneficial to both seasoned practitioners and
newcomers - it is quite challenging to keep up with everything Postgres and
the documentation is where most of us turn for guidance.

And thank you Tom (and others), for your willingness to share these (and
many, many other) insights - it is so powerful when facts connect with
database reality.

Mike Sofen
Re: Partial index creation always scans the entire table
On Sun, Feb 16, 2020 at 04:43:10PM -0800, Mike Sofen wrote:
> >From: Tom Lane
> >Sent: Sunday, February 16, 2020 7:30 AM
> >I've seen people try to do this before. I wonder if the manual page about
> >partial indexes should explicitly say "don't do that".
> > regards, tom lane
>
> Yes please (seriously). The utter beauty of Postgres is the flexibility and
> power that its evolutionary path has allowed/created. The tragic danger is
> that the beauty is fairly easy to misapply/misuse.

Quite.  Enough rope to shoot yourself in the foot.

Would you care to suggest text to be included here?
https://www.postgresql.org/docs/devel/indexes-partial.html

--
Justin
Re: Partial index creation always scans the entire table
"Mike Sofen" writes: >> From: Tom LaneSent: Sunday, February 16, 2020 7:30 AM >>> I've seen people try to do this before. I wonder if the manual page about >>> partial indexes should explicitly say "don't do that". > Yes please (seriously). The utter beauty of Postgres is the flexibility and > power that its evolutionary path has allowed/created. The tragic danger is > that the beauty is fairly easy to misapply/misuse. Caveats in the > documentation would be very beneficial to both seasoned practitioners and > newcomers - it is quite challenging to keep up with everything Postgres and > the documentation is where most of us turn for guidance. OK, so how about something like this added to section 11.8 (no pretty markup as yet): Example 11.4. Do Not use Partial Indexes as a Substitute for Partitioning You might be tempted to create a large set of non-overlapping partial indexes, for example CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1; CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2; CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3; ... This is a bad idea! Almost certainly, you'll be better off with a single non-partial index, declared like CREATE INDEX mytable_cat_data ON mytable (category, data); (Put the category column first, for the reasons described in section 11.3 Multicolumn Indexes.) While a search in this larger index might have to descend through a couple more tree levels than a search in a smaller index, that's almost certainly going to be cheaper than the planner effort needed to select the appropriate one of the partial indexes. The core of the problem is that the system does not understand the relationship among the partial indexes, and will laboriously test each one to see if it's applicable to the current query. If your table is large enough that a single index really is a bad idea, you should look into using partitioning instead (section whatever-it-is). With that mechanism, the system does understand that the tables and indexes are non-overlapping, so much better performance is possible. regards, tom lane
