Rapid disk usage spikes when updating large tables with GIN indexes

2018-05-14 Thread Jonathan Marks
Hello!

We have a mid-sized database on RDS running 10.1 (32 cores, 240 GB RAM, 5TB 
total disk space, 20k PIOPS) with several large (100GB+, tens of millions of 
rows) tables that use GIN indexes for full-text search. We at times need to 
index very large (hundreds of pages) documents, so our tables contain a mix of 
small (tens of tokens) and very large (hundreds of thousands of tokens, close to 
the tsvector 1MB limit) tsvectors. All our GIN indexes have fastupdate turned off 
— we found that turning fastupdate on led to significant blocking and that we 
get better average performance with it turned off. We’ve put a lot of effort 
into tuning our database over the last several years to the point where we have 
acceptable read and write performance for these tables. 
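
For reference, we disable it per index roughly like this (index name illustrative):

    ALTER INDEX documents_tsv_gin_idx SET (fastupdate = off);
    -- or at creation time:
    -- CREATE INDEX ... USING gin (tsv) WITH (fastupdate = off);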

One recurring, and predictable, issue that we have experienced regularly for 
multiple years is that inserting or updating rows in any table with GIN indexes 
results in extremely large drops in free disk space — e.g., inserting 10k rows 
with a total size of 10GB can cause the temporary loss of several hundred 
gigabytes of free disk space over 2-3 hours (and it could be more — we try to 
keep a 10-15% buffer of free disk space, so that often represents almost all 
available disk space). Once we stop the operation, free disk space rapidly 
recovers, which makes us believe the space is going to logs or some kind of 
temporary table. Our work_mem and maintenance_work_mem settings are pretty 
large (12GB and 62GB, respectively). The database’s size on disk scarcely 
budges during this process.

Unfortunately, we’re on RDS, so we can’t ssh into the instance to see which 
files are so large, and none of the logs we can see (nor the WAL) are large 
enough to account for the missing space. Any suggestions about where to 
look to see the cause of this problem (or about any settings we can tune or 
changes we could make to stop it) would be greatly appreciated.
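
For context, these are the kinds of in-database checks we can run from SQL, since 
we can't inspect the filesystem directly (pg_ls_waldir() is new in 10 and may 
require elevated privileges):

    -- total size of relations (including temp tables/indexes) in this database
    SELECT pg_size_pretty(pg_database_size(current_database()));

    -- size of the files currently in pg_wal
    SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();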

Thank you!


Re: Rapid disk usage spikes when updating large tables with GIN indexes

2018-05-14 Thread Jonathan Marks
We’ll turn on log_temp_files and get back to you to see if that’s the cause. 
Re: the exact queries — these are just normal INSERTs and UPDATEs, issued as part 
of routine operations: e.g., we process roughly 10% of a table and update 
individual rows as we go, or insert new rows amounting to 5-10% of the table’s 
volume. Whenever we bulk load we have to drop the indexes first, because the 
disk space loss just isn’t tenable.
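
The drop/recreate pattern we use around bulk loads, roughly (index/table/column 
names illustrative):

    DROP INDEX CONCURRENTLY documents_tsv_gin_idx;
    -- ... bulk load ...
    CREATE INDEX CONCURRENTLY documents_tsv_gin_idx
        ON documents USING gin (tsv) WITH (fastupdate = off);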

Re: extra disk space consumption not within PG — the AWS folks can’t tell me 
what the problem is because it’s all internal to the PG part of the instance 
they can’t access. Doesn’t mean your last suggestion can’t be the case but 
makes it slightly less likely.

Any chance that GIN indexes are double-logging? I.e. with fastupdate off they 
are still trying to keep track of the changes in the pending list or something?
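
One thing we can check from SQL is whether the pending list really is empty on 
these indexes, via the pgstattuple extension (assuming it's available on our RDS 
instance; index name illustrative):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstatginindex('documents_tsv_gin_idx');
    -- pending_pages / pending_tuples should stay at 0 with fastupdate off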

Our thought has been temp files for a while, but we’re not sure what we should 
do if that turns out to be the case.

> On May 14, 2018, at 3:08 PM, Tom Lane  wrote:
> 
> [ please keep the list cc'd ]
> 
> Jonathan Marks  writes:
>> Thanks for your quick reply. Here’s a bit more information:
>> 1) to measure the “size of the database” we run something like `select 
>> datname, pg_size_pretty(pg_database_size(datname)) from pg_database;` I’m 
>> not sure if this includes WAL size.
>> 2) I’ve tried measuring WAL size with `select sum(size) from 
>> pg_ls_waldir();` — this also doesn’t budge.
>> 3) Our current checkpoint_timeout is 600s with a 
>> checkpoint_completion_target of 0.9 — what does that suggest?
> 
> Hmph.  Your WAL-size query seems on point, and that pretty much destroys
> my idea about a WAL emission spike.
> 
> pg_database_size() should include all regular and temporary tables/indexes
> in the named DB.  It doesn't include WAL (but we've eliminated that), nor
> cluster-wide tables such as pg_database (but those seem pretty unlikely
> to be at issue), nor non-relation temporary files such as sort/hash temp
> space.  At this point I think we have to focus our attention on what might
> be creating large temp files.  I do not see anything in the GIN index code
> that could do that, especially not if you have fastupdate off.  I wonder
> whether there is something about the particular bulk-insertion queries
> you're using that could result in large temp files --- which'd make the
> apparent correlation with GIN index use a mirage, but we're running out
> of other ideas.  You could try enabling log_temp_files to see if there's
> anything to that.
> 
> In the grasping-at-straws department: are you quite sure that the extra
> disk space consumption is PG's to begin with, rather than something
> outside the database entirely?
> 
>   regards, tom lane




Re: Rapid disk usage spikes when updating large tables with GIN indexes

2018-05-16 Thread Jonathan Marks
Hi Tom —

We turned on log_temp_files and since the last stats reset (about a week ago) 
we’re seeing 0 temp files altogether (grabbing that info from pg_stat_database).
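
Roughly what we're checking, for reference:

    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
    FROM   pg_stat_database
    WHERE  datname = current_database();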

So, as far as we know:

1) It’s not WAL
2) It’s not tempfiles
3) It’s not the size of the error logs
4) It’s not the size of the actual rows in the database or the indexes

Another thread we found suggested pg_subtrans — this seems less likely because 
we’ve been able to replicate this across many different types of connections 
etc. but thought it might be a potential source.

Any other system-monitoring queries that we can run that might further 
illuminate the issue?

Thank you!

> On May 14, 2018, at 3:31 PM, Jonathan Marks wrote:
> 
> We’ll turn on log_temp_files and get back to you to see if that’s the cause. 
> Re: the exact queries — these are just normal INSERTs and UPDATEs. This 
> occurs as part of normal database operations — i.e., we are processing 10% of 
> a table and marking changes to a particular row, or happen to be inserting 
> 5-10% of the table volume with new rows. Whenever we bulk load we have to 
> drop the indexes because the disk space loss just isn’t tenable.
> 
> Re: extra disk space consumption not within PG — the AWS folks can’t tell me 
> what the problem is because it’s all internal to the PG part of the instance 
> they can’t access. Doesn’t mean your last suggestion can’t be the case but 
> makes it slightly less likely.
> 
> Any chance that GIN indexes are double-logging? I.e. with fastupdate off they 
> are still trying to keep track of the changes in the pending list or 
> something?
> 
> Our thought has been temp files for a while, but we’re not sure what we 
> should do if that turns out to be the case.
> 
>> On May 14, 2018, at 3:08 PM, Tom Lane  wrote:
>> 
>> [ please keep the list cc'd ]
>> 
>> Jonathan Marks  writes:
>>> Thanks for your quick reply. Here’s a bit more information:
>>> 1) to measure the “size of the database” we run something like `select 
>>> datname, pg_size_pretty(pg_database_size(datname)) from pg_database;` I’m 
>>> not sure if this includes WAL size.
>>> 2) I’ve tried measuring WAL size with `select sum(size) from 
>>> pg_ls_waldir();` — this also doesn’t budge.
>>> 3) Our current checkpoint_timeout is 600s with a 
>>> checkpoint_completion_target of 0.9 — what does that suggest?
>> 
>> Hmph.  Your WAL-size query seems on point, and that pretty much destroys
>> my idea about a WAL emission spike.
>> 
>> pg_database_size() should include all regular and temporary tables/indexes
>> in the named DB.  It doesn't include WAL (but we've eliminated that), nor
>> cluster-wide tables such as pg_database (but those seem pretty unlikely
>> to be at issue), nor non-relation temporary files such as sort/hash temp
>> space.  At this point I think we have to focus our attention on what might
>> be creating large temp files.  I do not see anything in the GIN index code
>> that could do that, especially not if you have fastupdate off.  I wonder
>> whether there is something about the particular bulk-insertion queries
>> you're using that could result in large temp files --- which'd make the
>> apparent correlation with GIN index use a mirage, but we're running out
>> of other ideas.  You could try enabling log_temp_files to see if there's
>> anything to that.
>> 
>> In the grasping-at-straws department: are you quite sure that the extra
>> disk space consumption is PG's to begin with, rather than something
>> outside the database entirely?
>> 
>>  regards, tom lane
> 




Re: Rapid disk usage spikes when updating large tables with GIN indexes

2018-05-21 Thread Jonathan Marks
I believe that we have figured it out. It was indeed a WAL issue — the WAL 
wasn’t being measured because it had already been moved into an archive 
directory (so it never showed up in pg_ls_waldir()).

We resolved this in two main ways:

1. By dramatically increasing max_wal_size to decrease the frequency of checkpoints
2. By turning on wal_compression

We’re going to continue looking into whether we can reduce WAL generation and 
write volume further, e.g. by turning fastupdate back on for our GIN indexes.
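
For anyone who hits the same thing, the changes expressed as parameter settings 
(set via the RDS parameter group; the max_wal_size value below is illustrative):

    max_wal_size = 64GB       # default in 10 is 1GB; checkpoints become less frequent
    wal_compression = on      # compresses full-page images written to WAL

    -- to confirm WAL is flowing through the archiver:
    SELECT archived_count, last_archived_wal, last_archived_time
    FROM pg_stat_archiver;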

> On May 16, 2018, at 4:31 PM, Nikolay Samokhvalov wrote:
> 
> Why not set up a spot EC2 instance with Postgres 10.1, load the database from a 
> dump (you’ll need to create one from RDS, since they don’t provide direct 
> access to dumps/backups; you probably only need the specific tables), and 
> repeat your actions while closely watching the filesystem?
> 
> On Wed, May 16, 2018 at 13:10, Jonathan Marks wrote:
> Hi Tom —
> 
> We turned on log_temp_files and since the last stats reset (about a week ago) 
> we’re seeing 0 temp files altogether (grabbing that info from 
> pg_stat_database).
> 
> So, as far as we know:
> 
> 1) It’s not WAL
> 2) It’s not tempfiles
> 3) It’s not the size of the error logs
> 4) It’s not the size of the actual rows in the database or the indexes
> 
> Another thread we found suggested pg_subtrans — this seems less likely 
> because we’ve been able to replicate this across many different types of 
> connections etc. but thought it might be a potential source.
> 
> Any other system-monitoring queries that we can run that might further 
> illuminate the issue?
> 
> Thank you!
> 
> > On May 14, 2018, at 3:31 PM, Jonathan Marks wrote:
> > We’ll turn on log_temp_files and get back to you to see if that’s the 
> > cause. Re: the exact queries — these are just normal INSERTs and UPDATEs. 
> > This occurs as part of normal database operations — i.e., we are processing 
> > 10% of a table and marking changes to a particular row, or happen to be 
> > inserting 5-10% of the table volume with new rows. Whenever we bulk load we 
> > have to drop the indexes because the disk space loss just isn’t tenable.
> > 
> > Re: extra disk space consumption not within PG — the AWS folks can’t tell 
> > me what the problem is because it’s all internal to the PG part of the 
> > instance they can’t access. Doesn’t mean your last suggestion can’t be the 
> > case but makes it slightly less likely.
> > 
> > Any chance that GIN indexes are double-logging? I.e. with fastupdate off 
> > they are still trying to keep track of the changes in the pending list or 
> > something?
> > 
> > Our thought has been temp files for a while, but we’re not sure what we 
> > should do if that turns out to be the case.
> > 
> >> On May 14, 2018, at 3:08 PM, Tom Lane wrote:
> >> 
> >> [ please keep the list cc'd ]
> >> 
> >> Jonathan Marks writes:
> >>> Thanks for your quick reply. Here’s a bit more information:
> >>> 1) to measure the “size of the database” we run something like `select 
> >>> datname, pg_size_pretty(pg_database_size(datname)) from pg_database;` I’m 
> >>> not sure if this includes WAL size.
> >>> 2) I’ve tried measuring WAL size with `select sum(size) from 
> >>> pg_ls_waldir();` — this also doesn’t budge.
> >>> 3) Our current checkpoint_timeout is 600s with a 
> >>> checkpoint_completion_target of 0.9 — what does that suggest?
> >> 
> >> Hmph.  Your WAL-size query seems on point, and that pretty much destroys
> >> my idea about a WAL emission spike.
> >> 
> >> pg_database_size() should include all regular and temporary tables/indexes
> >> in the named DB.  It doesn't include WAL (but we've eliminated that), nor
> >> cluster-wide tables such as pg_database (but those seem pretty unlikely
> >> to be at issue), nor non-relation temporary files such as sort/hash temp
> >> space.  At this point I think we have to focus our attention on what might
> >> be creating large temp files.  I do not see anything in the GIN index code
> >> that could do that, especially not if you have fastupdate off.  I wonder
> >> whether there is something about the particular bulk-insertion queries
> >> you're using that could result in large temp files --- which'd make the
> >> apparent correlation with GIN index use a mirage, but we're running out
> >> of other ideas.  You could try enabling log_temp_files to see if there's
> >> anything to that.
> >> 
> >> In the grasping-at-straws department: are you quite sure that the extra
> >> disk space consumption is PG's to begin with, rather than something
> >> outside the database entirely?
> >> 
> >>  regards, tom lane
> > 
> 
> 



Segfault leading to crash, recovery mode, and TOAST corruption

2018-06-05 Thread Jonathan Marks
Hello —

We had two issues today (once this morning and once a few minutes ago) with our 
primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB total disk space, 
20k PIOPS) where the database suddenly crashed and went into recovery mode. The 
first time this happened, we restarted the server after about 5 minutes in an 
attempt to get the system live, and the second time we let it stay in recovery 
mode until it recovered (took about 10 minutes). The system was not under high 
load in either case.

Both times that the server crashed, we saw this in the logs:

2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  
canceling statement due to statement timeout
2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for 
PID 12173 (PID 20238) exited with exit code 1
2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was 
terminated by signal 11: Segmentation fault

After the first crash, we then started getting errors like:

2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  
unexpected chunk number 0 (expected 1) for toast value 1592283014 in 
pg_toast_26656

We were able to identify 15 rows that are corrupted and the exact fields that 
are being TOASTed. We’re following Josh Berkus’ post here: 
http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html
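
The corrupted rows live in the parent table of pg_toast_26656, which we found 
with a catalog query along these lines:

    SELECT n.nspname, c.relname
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.reltoastrelid = 'pg_toast.pg_toast_26656'::regclass;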

We have tried to fix the bad fields with UPDATE and to remove the affected rows 
with DELETE, but every time we do we get an error: ERROR: tuple concurrently updated

We’re intending to reindex the TOAST table this evening, then try to delete 
again, and then run pg_repack. However, while that may resolve the TOAST 
corruption, we don’t believe it’s the root cause of this issue. We can in 
theory restore from one of our backups, but that would result in data loss for 
our clients and may not necessarily resolve the issue. We’re worried that this 
is a Postgres bug, perhaps due to parallelization — would appreciate any 
guidance people can give.

Thank you!





Re: Segfault leading to crash, recovery mode, and TOAST corruption

2018-06-05 Thread Jonathan Marks
Thank you so very much, Tom.

Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances 
tonight (the latest RDS offers is 10.3, but that’s a start).
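
For the archives, roughly what we ran, per Tom's suggestion below (table name 
illustrative):

    VACUUM VERBOSE our_documents_table;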


> On Jun 5, 2018, at 8:07 PM, Tom Lane  wrote:
> 
> Jonathan Marks  writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
> 
> I'd suggest updating to 10.4 ... see below.
> 
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR:  
>> canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG:  worker process: parallel worker for 
>> PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG:  server process (PID 12173) was 
>> terminated by signal 11: Segmentation fault
> 
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly.  There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
> 
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR:  
>> unexpected chunk number 0 (expected 1) for toast value 1592283014 in 
>> pg_toast_26656
> 
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4.  AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well.  But an update would be prudent to
> prevent it from happening again.
> 
>   regards, tom lane




tsvector field length limitation

2018-06-20 Thread Jonathan Marks
Hi folks —

We utilize Postgres’ full text search system pretty heavily in our team’s 
operations and often index tens of millions of records with varying lengths of 
text. In most cases, the text we need to index is pretty short (no more than 
hundreds of words), but in rare cases a single record is very long (high 
hundreds of thousands of words or more). With those records, we run into the 
maximum tsvector length limit: "The length of a tsvector (lexemes + positions) 
must be less than 1 megabyte."

I understand the performance implications of having very long tsvectors (our 
GIN index updates are pretty terrible in some cases) but would really 
appreciate it if the max tsvector length were larger (even 5MB would make a 
huge difference) or if that error were a stern warning rather than a hard error.

Is there any way to disable or increase that limit in Postgres 10.3? Perhaps in 
a future version?

Thank you!
Jonathan


Re: tsvector field length limitation

2018-06-20 Thread Jonathan Marks
What if we just didn’t store positions at all — i.e., we populate the tsvector 
with lexemes only?
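
Concretely, something like this (table/column names illustrative — and if I read 
your explanation below correctly, the 20-bit lexeme offsets mean the deduplicated 
lexeme text itself still has to fit under ~1MB, so stripping positions only buys 
some headroom):

    -- a lexeme-only vector (no positions); proximity ranking and phrase search
    -- stop working, and to_tsvector() still builds the positioned vector first,
    -- so the very largest documents may still error out
    UPDATE documents
    SET    tsv = strip(to_tsvector('english', body));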

> On Jun 20, 2018, at 10:49 AM, Tom Lane  wrote:
> 
> Jonathan Marks  writes:
>> ... we run into the max tsvector length requirement "The length of a 
>> tsvector (lexemes + positions) must be less than 1 megabyte”
> 
>> Is there any way to disable or increase that limit in Postgres 10.3?
> 
> No; it's forced by the representation used for tsvector, which stores
> lexeme offsets in 20-bit fields (cf WordEntry in
> src/include/tsearch/ts_type.h).  Perhaps that was short-sighted but
> I don't foresee it changing anytime soon.  You'd more or less need
> a whole new datatype ("bigtsvector"?) to make it happen.
> 
>   regards, tom lane




Forcing index usage

2019-04-03 Thread Jonathan Marks
Hi folks —

We’ve got several tables that are relatively large (~250-500GB in total size) 
that use very large (tens to hundreds of GB) GIN indexes for full-text search. 
We’ve set the column statistics for our tsvector columns as high as they go 
(10,000). We almost always order our search results by a separate date column 
(which has an index) and we almost always use a limit.
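
For reference, the statistics bump looks roughly like this (table/column names 
illustrative; 10,000 is the maximum):

    ALTER TABLE documents ALTER COLUMN tsv SET STATISTICS 10000;
    ANALYZE documents;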

Whenever the query planner chooses to use the indexes, queries on these tables 
are somewhat fast, maxing out at a few hundred milliseconds per query (which is 
terrible, but acceptable to end users).

When the query planner chooses not to use the indexes, queries can take many 
tens of seconds if they ever finish at all. When this happens, the query 
planner usually chooses to use the date index instead of the GIN index, and 
that is almost always a bad idea. We have sometimes been able to trick it into 
a better query plan by also adding the tsvector column in the ORDER BY clause, 
but that has bad performance implications if the result set is large.

Is there a way to tell Postgres “please don’t use index X for queries that 
could use index Y instead”?
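
One other workaround we've seen suggested, but haven't fully evaluated, is to 
fence the full-text predicate into a subquery with OFFSET 0 so the planner 
evaluates the text-search filter first rather than walking the date index 
(names illustrative):

    SELECT *
    FROM  (SELECT *
           FROM   documents
           WHERE  tsv @@ to_tsquery('english', 'some & query')
           OFFSET 0) AS hits   -- OFFSET 0 keeps the subquery from being flattened
    ORDER  BY created_at DESC
    LIMIT  25;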

Thank you!