postgres 9.5 DB corruption

2019-07-24 Thread Thomas Tignor
Hello postgres community,
Writing again to see if there are insights on this issue. We have had 
infrequent but recurring corruption since upgrading from postgres 9.1 to 
postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually 
performs a mixture of DML, primarily inserts and updates on two specific 
tables, with no single op being suspect. In the past, corruption events have 
produced encoding errors on COPY operations (invalid byte sequence for encoding 
"UTF8"). More recently, they have caused segmentation faults. We were able to 
take a cold backup after a recent event. SELECTing the corrupted data on our 
cold backup yields the following stack. Any info on a solution or how to 
proceed towards a solution would be much appreciated.
Thanks in advance.

(gdb) where
#0  pglz_decompress (source=source@entry=0xa617904 "0", slen=8139, 
dest=dest@entry=0x4268e028 "", rawsize=808452096) at pg_lzcompress.c:745
#1  0x080f3079 in toast_decompress_datum (attr=0xa6178fc) at tuptoaster.c:2210
#2  0x080f3716 in heap_tuple_untoast_attr (attr=0xa6178fc) at tuptoaster.c:183
#3  0x08440955 in pg_detoast_datum_packed (datum=) at fmgr.c:2270
#4  0x084145bf in text_to_cstring (t=0x7592fd2a) at varlena.c:176
#5  0x0843e874 in FunctionCall1Coll (flinfo=flinfo@entry=0xa614738, 
collation=collation@entry=0, arg1=arg1@entry=1972567338) at fmgr.c:1297
#6  0x0843fef8 in OutputFunctionCall (flinfo=0xa614738, val=1972567338) at 
fmgr.c:1950
#7  0x080bf84b in printtup (slot=0xa613bf4, self=0xa60d714) at printtup.c:359
#8  0x08220f9a in ExecutePlan (dest=0xa60d714, direction=, 
numberTuples=0, sendTuples=, operation=CMD_SELECT, 
planstate=0xa613974, estate=0xa6138ec) at execMain.c:1574
#9  standard_ExecutorRun (queryDesc=0xa6134e4, direction=ForwardScanDirection, 
count=0) at execMain.c:337
#10 0x08332c1b in PortalRunSelect (portal=portal@entry=0xa6114dc, 
forward=forward@entry=1 '\001', count=0, count@entry=2147483647, 
dest=dest@entry=0xa60d714) at pquery.c:942
#11 0x08333fa7 in PortalRun (portal=portal@entry=0xa6114dc, 
count=count@entry=2147483647, isTopLevel=isTopLevel@entry=1 '\001', 
dest=dest@entry=0xa60d714, altdest=altdest@entry=0xa60d714, 
completionTag=completionTag@entry=0xffd5d71c "")
    at pquery.c:786
#12 0x08330ba8 in exec_simple_query (query_string=0xa5f1754 "select * from 
ams.alert_attribute_bak;") at postgres.c:1096
#13 PostgresMain (argc=1, argv=0xa53dbbc, dbname=0xa53daec "ams", 
username=0xa53dadc "akamai") at postgres.c:4049
#14 0x080b53af in BackendRun (port=0xa584b78) at postmaster.c:4312
#15 BackendStartup (port=0xa584b78) at postmaster.c:3986
#16 ServerLoop () at postmaster.c:1705
#17 0x082d0dd7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xa53d2a8) 
at postmaster.c:1313
#18 0x080b68eb in main (argc=3, argv=0xa53d2a8) at main.c:228
(gdb)
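
For what it's worth, when the failure is a soft error (an "invalid byte
sequence" message) rather than a crash like the one above, a per-row scan with
a subtransaction around each fetch can help narrow down which tuples are
affected. A rough sketch, assuming ams.alert_attribute_bak has the same
name/value columns as ams.alert_attribute (a segfault still kills the backend
and cannot be trapped this way):

-- walk the table row by row, forcing each varlena column to be detoasted,
-- and report the ctids whose values cannot be read back
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM ams.alert_attribute_bak LOOP
        BEGIN
            PERFORM length(name), length(value)
              FROM ams.alert_attribute_bak
             WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'bad tuple at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END;
$$;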

Tom    :-)


Re: postgres 9.5 DB corruption

2019-07-25 Thread Thomas Tignor
Hi Adrian,
Thanks for responding. Below is the schema data for the tables where we always
see corruption. You'll notice they have triggers for a postgres extension
called Slony-I which provides replication service. It's not clear if/how that's
a factor, though.

ams=# \d ams.alert_instance
                    Table "ams.alert_instance"
       Column        |              Type              | Modifiers 
---------------------+--------------------------------+-----------
 alert_instance_id   | integer                        | not null
 alert_definition_id | integer                        | not null
 alert_instance_key  | character varying(500)         | not null
 start_active_date   | timestamp(0) without time zone | not null
 stop_active_date    | timestamp(0) without time zone | 
 active              | smallint                       | not null
 acknowledged        | smallint                       | not null
 ack_clear_time      | timestamp(0) without time zone | 
 user_set_clear_time | smallint                       | 
 category_id         | integer                        | not null
 condition_start     | timestamp(0) without time zone | not null
 unack_reason        | character varying(1)           | 
 viewer_visible      | smallint                       | not null
Indexes:
    "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
    "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
    "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
    "idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
    "ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
    "ck_alert_inst_active" CHECK (active = 0 OR active = 1)
    "ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
    "ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
    "fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
    "fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
    "fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
Referenced by:
    TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#

ams=# \d ams.alert_attribute
               Table "ams.alert_attribute"
      Column       |          Type           | Modifiers 
-------------------+-------------------------+-----------
 alert_instance_id | integer                 | not null
 name              | character varying(200)  | not null
 data_type         | smallint                | not null
 value             | character varying(2000) | 
Indexes:
    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
    "idx_alert_attr_name" btree (name)
Foreign-key constraints:
    "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#


Tom    :-) 

On Wednesday, July 24, 2019, 11:15:04 AM EDT, Adrian Klaver wrote:

postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

2019-03-25 Thread Thomas Tignor
Hoping someone may be able to offer some guidance on this recurring problem. I
am providing this "problem report" to the general list as I understand the bugs
list requires a set of reproduction steps we do not yet have. Please advise if
I have the process wrong. I have tried to provide all known relevant info here.
Thanks in advance for any insights.

--> A description of what you are trying to achieve and what results you
expect.:
We are experiencing intermittent DB corruption in postgres 9.5.14. We are 
trying to identify and eliminate all sources. We are using two independent 
services for data replication, Slony-I v2.2.6 and a custom service developed 
in-house. Both are based on COPY operations. DB corruption is observed when 
COPY operations fail with an error of the form: 'invalid byte sequence for 
encoding "UTF8"'. This occurs with a frequency ranging between a few days and 
several weeks. Each incident is followed by a race to find and repair or remove 
corrupted data, which we are getting good at. With well over a dozen incidents, 
the great majority originally showed corruption in a single VARCHAR(2000) 
column (value) in a single table (alert_attribute). In this time, we read about 
suspected and real problems with TOAST functionality and so made the decision 
to change alert_attribute.value to PLAIN storage. Since that change was made, 
most new incidents show corruption in the alert_attribute.name column instead 
(VARCHAR(200)). Another table (alert_instance) has been impacted as well. See 
below for their schemas.
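
For reference, the PLAIN storage change mentioned above is of this form (note
that SET STORAGE only changes how future values are stored, so existing rows
keep their old TOAST representation until they are rewritten):

ALTER TABLE ams.alert_attribute ALTER COLUMN value SET STORAGE PLAIN;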

We have looked high and low through system logs and device reporting utility 
output for any sign of hardware failures. We haven't turned up anything yet. We 
also tried rebuilding an entire DB from scratch. That did not seem to help.

We have not been performing routine reindexing. This is a problem we are 
working to correct. Normally our master DB serves for an 8-12 week period 
without reindexing before we failover to a peer. Before assuming the master 
role, the peer always begins by truncating the alert_instance and 
alert_attribute tables and loading all data from the current master.
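
The routine reindexing in question would be nothing fancier than, e.g.:

REINDEX TABLE ams.alert_instance;
REINDEX TABLE ams.alert_attribute;

though plain REINDEX blocks writes on the table while it runs, so building
replacement indexes with CREATE INDEX CONCURRENTLY and dropping the old ones
is the less disruptive option on 9.5.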

Hardware specs are listed below. For storage, we have 8 INTEL SSDSA2BW12 
direct-attached disks. We can provide additional info as needed.

ams=# \d ams.alert_attribute
               Table "ams.alert_attribute"
      Column       |          Type           | Modifiers 
-------------------+-------------------------+-----------
 alert_instance_id | integer                 | not null
 name              | character varying(200)  | not null
 data_type         | smallint                | not null
 value             | character varying(2000) | 
Indexes:
    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
    "idx_aa_aval" btree (name, value)
Foreign-key constraints:
    "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#
ams=# \d ams.alert_instance
                    Table "ams.alert_instance"
       Column        |              Type              | Modifiers 
---------------------+--------------------------------+-----------
 alert_instance_id   | integer                        | not null
 alert_definition_id | integer                        | not null
 alert_instance_key  | character varying(500)         | not null
 start_active_date   | timestamp(0) without time zone | not null
 stop_active_date    | timestamp(0) without time zone | 
 active              | smallint                       | not null
 acknowledged        | smallint                       | not null
 ack_clear_time      | timestamp(0) without time zone | 
 user_set_clear_time | smallint                       | 
 category_id         | integer                        | not null
 condition_start     | timestamp(0) without time zone | not null
 unack_reason        | character varying(1)           | 
 viewer_visible      | smallint                       | not null
Indexes:
    "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
    "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
    "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
    "idx_alert_inst_d

Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

2019-03-25 Thread Thomas Tignor
Hi Vijay,
Thanks for writing. To your questions:

i think the experts will chime in soon,

but why do you think this as db corruption and not just a bad input?
https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out

tptignor: Our JVM logs data before the insert and update ops. It all looks fine
there, and is usually straight ASCII.

or it may also be encoding issue.
U137: Invalid byte sequence for encoding · pganalyze

tptignor: There are a limited number of settings we apply to our driver, and
the encoding isn't one of them. I expect it's using the server-configured
SQL_ASCII encoding, which I understand is like having no rules and just takes
data as is.
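
A quick way to check what encoding a given session actually ends up using, for
example:

SHOW client_encoding;
SHOW server_encoding;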

can you do a pg_dump and restore on a parallel instance? does it
result in failure?
tptignor: We can do better than that. When this occurs, we isolate and COPY the 
problem data to file (which always works fine), then COPY it back in. (Without 
the COPY in ops, the corruption might even go unnoticed...) The error on COPY 
in shows us a line number which takes us right to the corrupted data.
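
Roughly (the scratch table and id range here are just illustrative):

-- copying the suspect rows out always succeeds
\copy (SELECT * FROM ams.alert_attribute WHERE alert_instance_id BETWEEN 100 AND 200) TO '/tmp/aa.copy'
-- copying them back into a scratch table with the same definition fails, and
-- the error context reports the offending input line number
\copy ams.alert_attribute_scratch FROM '/tmp/aa.copy'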

we also ask the app to log data (temporarily) inserted  so that we
could figure out directly if there was bad data upstream or have
validations to prevent inserts when there is bad data.

also, in our case the query was stuck at "PARSE"  (if you do ps aux |
grep postgres) and in some cases did result in oom.
but upgrading the client and using session mode pooling in pgbouncer
worked for us.

tptignor: We depend on a JVM which works through the v42.0.0 postgres driver. 
We haven't taken a hard look yet at the driver, but certainly would if there 
was a reason. I would have expected the server to catch bad data on input from 
the driver in the same way it catches on COPY in.
Tom    :-)


Regards,
Vijay

On Tue, Mar 26, 2019 at 12:17 AM Thomas Tignor  wrote:
>
> Hoping someone may be able to offer some guidance on this recurring problem. 
> I am providing this "problem report" to the general list as I understand the 
> bugs list requires a set of reproduction steps we do not yet have. Please 
> advise if I have the process wrong. I have tried to provide all known 
> relevant info here. Thanks in advance for any insights.
>
> --> A description of what you are trying to achieve and what results you 
> expect.:
>
> We are experiencing intermittent DB corruption in postgres 9.5.14. We are 
> trying to identify and eliminate all sources. We are using two independent 
> services for data replication, Slony-I v2.2.6 and a custom service developed 
> in-house. Both are based on COPY operations. DB corruption is observed when 
> COPY operations fail with an error of the form: 'invalid byte sequence for 
> encoding "UTF8"'. This occurs with a frequency ranging between a few days and 
> several weeks. Each incident is followed by a race to find and repair or 
> remove corrupted data, which we are getting good at. With well over a dozen 
> incidents, the great majority originally showed corruption in a single 
> VARCHAR(2000) column (value) in a single table (alert_attribute). In this 
> time, we read about suspected and real problems with TOAST functionality and 
> so made the decision to change alert_attribute.value to PLAIN storage. Since 
> that change was made, most new incidents show corruption in the 
> alert_attribute.name column instead (VARCHAR(200)). Another table 
> (alert_instance) has been impacted as well. See below for their schemas.
>
> We have looked high and low through system logs and device reporting utility 
> output for any sign of hardware failures. We haven't turned up anything yet. 
> We also tried rebuilding an entire DB from scratch. That did not seem to help.
>
> We have not been performing routine reindexing. This is a problem we are 
> working to correct. Normally our master DB serves for an 8-12 week period 
> without reindexing before we failover to a peer. Before assuming the master 
> role, the peer always begins by truncating the alert_instance and 
> alert_attribute tables and loading all data from the current master.
>
> Hardware specs are listed below. For storage, we have 8 INTEL SSDSA2BW12 
> direct-attached disks. We can provide additional info as needed.
>
> ams=# \d ams.alert_attribute
>                Table "ams.alert_attribute"
>      Column      |          Type          | Modifiers
> ---+-+---
>  alert_instance_id | integer                | not null
>  name              | character varying(200)  | not null
>  data_type        | smallint                | not null
>  value            | character varying(2000) |
> Indexes:
>    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), 
>tablespace "tbls5"

Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

2019-03-25 Thread Thomas Tignor
Hi Brad,
Thanks for writing. As I mentioned to Vijay, the "source" is a JVM using the
postgres v42.0.0 JDBC driver. I do not believe we have any explicit encoding
set, and so I expect the client encoding is SQL_ASCII. The DB is most
definitely UTF8. Our log shows no issue with the input data we've discovered
(at the time it's logged). If the data is somehow corrupted before inserting,
won't the server encoding kick in and generate an error? We can certainly test
that.
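
For example, in a UTF8 database this should be rejected immediately (a sketch;
0xc3 on its own is an incomplete UTF8 sequence):

SELECT convert_from('\xc3'::bytea, 'UTF8');
-- expected to fail with an invalid byte sequence error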
Tom    :-) 

On Monday, March 25, 2019, 3:56:04 PM EDT, Brad Nicholson wrote:
 
 
Vijaykumar Jain  wrote on 03/25/2019 03:07:19 PM:


> but why do you think this as db corruption and not just a bad input?
> https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out



This looked interesting to me in the settings below:


>   client_encoding                | SQL_ASCII          | client


Unless you have set this explicitly, it will use the default encoding for the 
database.  If it hasn't been explicitly set, then the source database (assuming 
that that output was from the source) is SQL_ASCII.

Double check the database encoding for the source database and target database.
I'm wondering if you have SQL_ASCII for the source, and UTF8 for the target.
If that is the case, you can take invalid UTF8 characters into the source, and
they will fail to replicate to the target. That's not a Postgres problem, but
an encoding mismatch.


Brad
  

Re: postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

2019-03-26 Thread Thomas Tignor
Hi Laurenz. Thanks for writing. I can tell you that while the error message 
usually identifies just one byte (though sometimes up to three), inspection of 
the data has frequently shown several bytes impacted. It often seems that the 
corruption begins at one point in the row and continues to the end.
At this point, we've taken a few mitigating steps: we have switched all the 
VARCHAR columns in the alert_instance and alert_attribute tables to use "plain" 
storage. Also, we've dropped a (name, value) index we were using for the 
alert_attribute table.
I looked at the checksums feature some time earlier. This is something we could
try, with some effort, but it appears a checksum is applied to a page leaving
shared_buffers and then verified when the page is read back. Our Slony-I
replication service is pulling the latest changes into subscriber DBs every
couple of seconds. Normally, when data corruption occurs, it appears very
quickly as a Slony-I COPY failure. It seems there may not be time to write a
checksum.
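
For what it's worth, whether a cluster was built with checksums at all can be
checked from any session:

SHOW data_checksums;

and turning them on would mean a new cluster initialized with initdb
--data-checksums, followed by a dump and reload, as noted below.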
Tom    :-) 

On Tuesday, March 26, 2019, 4:25:33 AM EDT, Laurenz Albe wrote:
 
 Thomas Tignor wrote:
> We are experiencing intermittent DB corruption in postgres 9.5.14. We are 
> trying to
> identify and eliminate all sources. We are using two independent services for 
> data
> replication, Slony-I v2.2.6 and a custom service developed in-house. Both are 
> based
> on COPY operations. DB corruption is observed when COPY operations fail with 
> an error
> of the form: 'invalid byte sequence for encoding "UTF8"'.
> This occurs with a frequency ranging between a few days and several weeks.

It might be a storage problem, but it could also be a problem with how the data
get into the database in the first place.

I would consider it unlikely that there is a bug in the TOAST code that changes
just one byte.

You could use block checksums to see if the storage is at fault, but 
unfortunately
it requires an "initdb", dump and restore to switch to checksums.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

  

Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"

2019-03-26 Thread Thomas Tignor
Hello again Brad, and hello Tom. Thanks for writing, and Tom, I'll add some
extra thanks for the many google searches over the years which have been
answered by your posts. :)
It seems there's been some confusion on the data flow, so I'll try to explain.
The original "source" I referred to was our JVM generating alert data and
performing DML on the DB via the v42.0.0 driver. I do not know of any encoding
setting for the driver itself.
The driver writes directly to our primary (Slony-I master) DB. Changes are then
replicated by our replica (Slony-I subscriber) DBs. Each of these DBs has a
"UTF8" server encoding. COPY out ops (performed by Slony-I or anything else)
always work fine, and COPY in ops fail if there is data corruption.

a...@ams3.dfw.netmgmt:/a/ams/lib/ext/database$ sum postgresql-42.0.0.jar
12476   691
a...@ams3.dfw.netmgmt:/a/ams/lib/ext/database$ 

a...@ams3.dfw.netmgmt:~$ echo  Slony-I Master 
 Slony-I Master 
a...@ams3.dfw.netmgmt:~$ psql -U akamai -d ams -c "SELECT datname, pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'ams'"
 datname | pg_encoding_to_char 
---------+---------------------
 ams     | UTF8
(1 row)

a...@ams3.dfw.netmgmt:~$ 

r...@ams-repl7.sjc.netmgmt:~# echo  Slony-I Replica 
 Slony-I Replica 
r...@ams-repl7.sjc.netmgmt:~# psql -U akamai -d ams -c "SELECT datname, pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'ams'"
 datname | pg_encoding_to_char 
---------+---------------------
 ams     | UTF8
(1 row)

r...@ams-repl7.sjc.netmgmt:~# 


Tom    :-) 

On Tuesday, March 26, 2019, 10:14:59 AM EDT, Tom Lane wrote:
 
 "Brad Nicholson"  writes:
> Thomas Tignor  wrote on 03/25/2019 08:25:49 PM:
>> Thanks for writing. As I mentioned to Vijay, the "source" is a JVM
>> using the postgres v42.0.0 JDBC driver. I do not believe we have any
>> explicit encoding set, and so I expect the client encoding is
>> SQL_ASCII. The DB is most definitely UTF8.

> These statements are contradictory.

> The value of client_encoding from your select on pg_settings is SQL_ASCII.

> The docs clearly state:
> https://www.postgresql.org/docs/9.5/runtime-config-client.html
> "Sets the client-side encoding (character set). The default is to use the
> database encoding. "

> If you don't have client_encoding explicitly, then it is using the database
> encoding.

Umm ... not necessarily.  That bit in runtime-config-client.html
correctly states what the *server's* default for client_encoding is,
but an awful lot of client-side code will immediately override that.
psql will try to set it based on its LANG/LC_CTYPE environment,
for example.  I don't know what JDBC does; it might be different.

But in any case, yes, it'd be more reliable to check pg_database.encoding
explicitly.

            regards, tom lane