Re: index-only-scan when there is an index on all columns

2018-02-27 Thread Andres Freund
On 2018-02-27 16:58:11 -0500, Tom Lane wrote:
> > Also, I recall discussion that select statements will touch the visibility
> > map (hence causing write I/O even in a read-only query) but [1] indicates
> > that only vacuum will set them, DML will clear them.
> 
> Hm, I don't recall that, but I've not been involved in the last few rounds
> of hacking on that mechanism.

I'm fairly certain that only vacuum and table rewrites like CLUSTER set
all-visible, and that there was never any released code that did so
during SELECT. I think there were a few patches debating whether we
could change that, but they never really got anywhere AFAIR.

Greetings,

Andres Freund



Re: index-only-scan when there is an index on all columns

2018-02-27 Thread Andres Freund
On 2018-02-28 13:15:45 +1300, David Rowley wrote:
> On 28 February 2018 at 11:11, Andres Freund  wrote:
> > I'm fairly certain that only vacuum and table rewrites like CLUSTER set
> > all-visible,
> 
> I don't think the pages are set all visible again after a rewrite.

You're right. We freeze the tuples, but don't set the heap / VM bits.

Greetings,

Andres Freund



Re: Parallel index creation & pg_stat_activity

2018-02-28 Thread Andres Freund
Hi Peter,

On 2018-02-28 16:50:44 +, Phil Florent wrote:
> With an index creation (create index t1_i1 on t1(c1, c2);) I have this kind 
> of output :
> 
> ./t -d 20 -o "pid, backend_type, query, wait_event_type, wait_event"
>  busy_pc | distinct_exe | pid  |  backend_type  |               query                | wait_event_type |  wait_event
> ---------+--------------+------+----------------+------------------------------------+-----------------+--------------
>       68 | 1 / 136      | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | DataFileRead
>       26 | 1 / 53       | 8262 | client backend | create index t1_i1 on t1(c1, c2); |                 |
>        6 | 1 / 11       | 8262 | client backend | create index t1_i1 on t1(c1, c2); | IO              | BufFileWrite
> (3 rows)

> No parallel worker. At least one parallel worker was active though, I could 
> see its work with a direct query on pg_stat_activity or a ps -ef :
> 
> ...
> postgres  8262  8230  7 08:54 ?00:22:46 postgres: 11/main: postgres 
> postgres [local] CREATE INDEX
> ...
> postgres  9833  8230 23 14:17 ?00:00:33 postgres: 11/main: parallel 
> worker for PID 8262
> ...

Looks like we're not doing a pgstat_report_activity() in the workers?
Any argument for not doing so?

Greetings,

Andres Freund



Re: PQConsumeinput stuck on recv

2018-03-02 Thread Andres Freund
Hi,


On 2018-03-01 19:20:04 -0300, Andre Oliveira Freitas wrote:
> I was able to capture the backtrace again, now with libpq debugging symbols.
> 
> Thread 15 (Thread 0x7f8cec068700 (LWP 68)):
> #0  0x7f8d252a1d9b in __libc_recv (fd=150, buf=0x7f8cf0034410, n=16384, flags=623517083, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
> #1  0x7f8d26689783 in recv (__flags=0, __n=<optimized out>, __buf=<optimized out>, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/socket2.h:44
> #2  pqsecure_raw_read (conn=conn@entry=0x7f8cf001e390, ptr=<optimized out>, len=<optimized out>) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:230
> #3  0x7f8d26689863 in pqsecure_read (conn=conn@entry=0x7f8cf001e390, ptr=<optimized out>, len=<optimized out>) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-secure.c:217
> #4  0x7f8d266810ea in pqReadData (conn=conn@entry=0x7f8cf001e390) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-misc.c:683
> #5  0x7f8d2667e6f2 in PQconsumeInput (conn=0x7f8cf001e390) at /build/postgresql-10-rIfDLC/postgresql-10-10.2/build/../src/interfaces/libpq/fe-exec.c:1704
> 
> In this case, I also checked the pg_stat_activity and this particular
> connection on pg server side was idle for 15 minutes. I killed it
> using pg_terminate_backend, and then somehow the application resumed
> execution.

There's something decidedly weird going on. Libpq always keeps the
connection in nonblocking mode internally. Blocking is implemented by
using select on the socket. So this should never block.

Is there any chance some part of your application changes the socket's
block-y-ness?  I see your code is passing the socket around, so perhaps
that's happening somewhere outside of the file?


> As you can see, recv has received a non-zero value in flags

Well, no, not really. recv() was called with flags=0; it's just libc's
internal implementation that appears to be showing up weirdly, AFAICT.


Greetings,

Andres Freund



Re: PQConsumeinput stuck on recv

2018-03-05 Thread Andres Freund
Hi,

On 2018-03-05 17:57:51 -0300, Andre Oliveira Freitas wrote:
> Ok, on one hand glad to hear that is an artifact, on the other hand
> that means the issue is still out there...
> 
> I'm not a maintainer of Freeswitch, I am a user of it. However I am
> trying to supply the maintainers with any relevant information
> regarding this issue.
> 
> As you mentioned, I see the socket being modified and passed around in
> the source code. Do you think the implementation is OK? I do not know
> if that's standard practice when consuming libpq.

It's OK to pass the socket around; it'd not be OK to change whether the
socket blocks or not.  I don't have the time to review the freeswitch
code to see whether it does so.

- Andres



Re: dirty_ratio & dirty_background_ratio settings with huge memory

2018-03-06 Thread Andres Freund
On 2018-03-06 03:23:29 -0700, pinker wrote:
> Hi,
> I've got cutomer with really huge RAM, now it's:
>              total      used      free    shared   buffers    cached
> Mem:       3102111   3052596     49515    208801       992   2961185
> -/+ buffers/cache:     90418   3011693
> Swap:         8191         1      8190
> 
> (free -m)
> 
> and before it was twice more (6TB).
> 
> and trying to figure out how to set dirty_ratio & dirty_background_ratio
> parameters. Even for a normal-sized server those default sizes are too high,
> but now they would be ridiculously huge, respectively 1.24TB and 300GB.
> I'm thinking about 1 percent for dirty_ratio and not using
> dirty_background_ratio because it's not possible to set it below 1% but to
> set dirty_background_bytes instead to about 1-2GB.
> Maybe someone has got other experience with RAM of this size and those
> settings?

With a halfway modern PG I'd rather suggest tuning the postgres settings
that control flushing. That leaves files like temp sorting in memory for
longer, while flushing other sources of writes in a controlled manner.
See the *_flush_after settings.
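
For reference, a minimal sketch of what tuning those settings might look like on a 9.6-or-newer server; the values below are purely illustrative, not recommendations:

  -- the flush-control settings referred to above
  SELECT name, setting, unit FROM pg_settings WHERE name LIKE '%flush_after';
  -- example: flush checkpoint / bgwriter / backend writes in small increments
  ALTER SYSTEM SET checkpoint_flush_after = '256kB';
  ALTER SYSTEM SET bgwriter_flush_after = '512kB';
  ALTER SYSTEM SET backend_flush_after = '2MB';
  SELECT pg_reload_conf();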

Greetings,

Andres Freund



Re: dirty_ratio & dirty_background_ratio settings with huge memory

2018-03-06 Thread Andres Freund
On 2018-03-06 18:59:01 -0700, pinker wrote:
> Andres Freund wrote
> > With a halfway modern PG I'd suggest to rather tune postgres settings
> > that control flushing. That leaves files like temp sorting in memory for
> > longer, while flushing things controlledly for other sources of
> > writes. See *_flush_after settings.
> > 
> > Greetings,
> > 
> > Andres Freund
> 
> Thank you Andres, I missed those parameters appearing. It's not the case for
> this customer, because they have an ancient version 9.0, but it's
> interesting for others :)
> The maximal value - 2MB - is it not too small? Regarding the fact that for
> instance the same buffer will be updated 10 times between checkpoints and
> bgwriter will be flushing it directly to disk 10 times instead of allowing
> the buffer to stay in the cache and do the update faster?

In most cases where the same buffer is updated frequently it should be
stored in postgres' buffer cache.  Note you can tune those.
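
A quick sketch of checking and raising the size of that buffer cache, which is presumably what "those" refers to here (shared_buffers requires a restart; the 8GB figure is only an example):

  SHOW shared_buffers;
  ALTER SYSTEM SET shared_buffers = '8GB';  -- takes effect after a server restart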

Greetings,

Andres Freund



Re: Logical decoding on standby

2018-03-12 Thread Andres Freund
Hi,

On 2018-03-13 01:08:03 +0100, Andreas Kretschmer wrote:
> On 13 March 2018 00:58:27 CET, Andreas Kretschmer  
> wrote:
> >On 12 March 2018 21:18:22 CET, Andreas Joseph Krogh
> > wrote:
> >>Anybody knows if $subject will make it into v11?

I can't see it happening for v11, there's not even a CF entry in the
last CF.


> >Why do you think you needs this? 
> >
> >Regards, Andreas
> 
> Let me explain my question. One of the key aspects of logical replication is
> that you can define what to replicate. That wouldn't work in this way; that's
> why I'm asking.

The subject said logical decoding, not replication. There are a lot of
change data capture type workloads where decoding from the standby is
quite useful. And the design definitely would work for that; we
explicitly took that into consideration.

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-03-19 Thread Andres Freund
Hi Jeremy, Alvaro,

On 2018-03-19 13:00:13 -0500, Jeremy Finzel wrote:
> On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera 
> wrote:
> 
> > Jeremy Finzel wrote:
> > > Getting some concerning errors in one of our databases that is on 9.5.11,
> > > on autovacuum from template0 database pg_authid and pg_auth_members.  I
> > > only saw some notes on the list about this error related to materialized
> > > views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> > > 9.5.  Here is an example:
> > >
> > > 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
> > 12:08:33
> > > CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
> > relfrozenxid
> > > 740087784","automatic vacuum of table
> > > ""template0.pg_catalog.pg_authid"""""
> >
> > Can you please supply output of pg_controldata?

> Latest checkpoint's NextXID:  16/3132524419
> Latest checkpoint's NextMultiXactId:  2142
> Latest checkpoint's NextMultiOffset:  5235
> Latest checkpoint's oldestXID:1829964553
> Latest checkpoint's oldestXID's DB:   12376
> Latest checkpoint's oldestActiveXID:  3131774441
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 16400

Hm, based on these it doesn't look like multixacts were involved (based
on oldestMultiXid it's highly unlikely there's been a multixact
wraparound, and there's not much multixact usage on system tables
anyway).  Which suggests that there might have been actual corruption
here.

Jeremy:
- which version of 9.4 and 9.5 ran on this? Do you know?
- Can you install the pageinspect extension? If so, it might be worth
  running something like:
  CREATE EXTENSION pageinspect;
  CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
  RETURNS SETOF RECORD
  LANGUAGE SQL
  AS $$
  SELECT blockno, lp, t_xmin
  FROM
      generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
      heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
  WHERE
      t_xmin IS NOT NULL -- filter out empty items
      AND t_xmin != 1 -- filter out bootstrap
      AND t_xmin != 2 -- filter out frozen transaction id
      AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid present
      AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
  $$;
  SELECT * FROM check_rel('pg_authid') LIMIT 100;

  and then display all items for one of the affected pages like
  SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));


Alvaro:
- Hm, we talked about code adding context for these kind of errors,
  right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-03-19 Thread Andres Freund
On 2018-03-19 14:37:24 -0500, Jeremy Finzel wrote:
> We upgraded to 9.5.5, and today we are running 9.5.11.  And actually we
> upgraded from 9.3, not 9.4.  We are still trying to figure out which point
> release we were on at 9.3.

Ok.  IIRC there used to be a bug a few years back that sometimes led to
highly contended pages being skipped during vacuum, and we'd still
update relfrozenxid. IIRC it required the table to be extended at the
same time or something?


> 
> > - Can you install the pageinspect extension? If so, it might be a
> >   CREATE EXTENSION pageinspect;
> >   CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT
> > lp int2, OUT xmin xid)
> > RETURNS SETOF RECORD
> > LANGUAGE SQL
> > AS $$
> > SELECT blockno, lp, t_xmin
> > FROM
> > generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno,
> > -- every block in the relation
> > heap_page_items(get_raw_page($1::text, blockno::int4)) -- every
> > item on the page
> > WHERE
> > t_xmin IS NOT NULL -- filter out empty items
> > AND t_xmin != 1 -- filter out bootstrap
> > AND t_xmin != 2 -- filter out frozen transaction id
> > AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' |
> > x'0200')::int) -- filter out frozen rows with xid present
> > AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid
> > = $1)) -- xid cutoff filter
> > $$;
> >   SELECT * FROM check_rel('pg_authid') LIMIT 100;
> >
> 
> Small note - Needs to be this because != is not supported for xid:
> 
> AND NOT t_xmin = 1 -- filter out bootstrap
> AND NOT t_xmin = 2 -- filter out frozen transaction id

Only on older releases ;). But yea, that looks right.



> >   and then display all items for one of the affected pages like
> >   SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
> >
> >
> > Alvaro:
> > - Hm, we talked about code adding context for these kind of errors,
> >   right? Is that just skipped for csvlog?
> > - Alvaro, does the above check_rel() function make sense?
> >
> > Greetings,
> >
> > Andres Freund
> >
> 
> The function does NOT show any issue with either of those tables.

Uh, huh?  Alvaro, do you see a bug in my query?

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-03-19 Thread Andres Freund
Hi,

On 2018-03-19 14:53:58 -0500, Jeremy Finzel wrote:
> FWIW, if I remove the last filter, I get these rows and I believe row 7/57/
> 2906288382 is the one generating error:

Oh, yea, that makes sense. It's wrapped around and looks like it's from
the future.

> SELECT * FROM check_rel('pg_authid') LIMIT 100;
>  blockno | lp |xmin
> -++
>7 |  4 | 2040863716
>7 |  5 | 2040863716
>7 |  8 | 2041172882
>7 |  9 | 2041172882
>7 | 12 | 2041201779
>7 | 13 | 2041201779
>7 | 16 | 2089742733
>7 | 17 | 2090021318
>7 | 18 | 2090021318
>7 | 47 | 2090021898
>7 | 48 | 2090021898
>7 | 49 | 2102749003
>7 | 50 | 2103210571
>7 | 51 | 2103210571
>7 | 54 | 2154640913
>7 | 55 | 2163849781
>7 | 56 | 2295315714
>7 | 57 | 2906288382
>7 | 58 | 2906329443
>7 | 60 | 3131766386
>8 |  1 | 2089844462
>8 |  2 | 2089844462
>8 |  3 | 2089844463
>8 |  6 | 2089844463
>8 |  9 | 2295318868
> (25 rows)

Could you show the contents of those two pages with a query like I had
in an earlier email?
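
For reference, that would be along these lines for the two blocks listed above - the same pageinspect call as in the earlier mail, just with the block numbers filled in:

  SELECT * FROM heap_page_items(get_raw_page('pg_authid', 7));
  SELECT * FROM heap_page_items(get_raw_page('pg_authid', 8));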

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-03-19 Thread Andres Freund
Hi,

On 2018-03-19 15:37:51 -0500, Jeremy Finzel wrote:
> Does the fact that a snapshot does not have this issue suggest it could be
> memory-related corruption and a db restart could clear it up?

Could you show the page from the snapshot? I suspect it might just be a
problem that's temporarily not visible as corrupted.

Greetings,

Andres Freund



Re: ERROR: found multixact from before relminmxid

2018-04-09 Thread Andres Freund


On April 9, 2018 7:51:19 PM PDT, Peter Geoghegan  wrote:
>On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda 
>wrote:
>> (... and all other indexes returns null too)
>>
>> I tried with bt_index_check too. Same results.
>
>That's interesting, because it tells me that you have a table that
>appears to not be corrupt, despite the CLUSTER error. Also, the error
>itself comes from sanity checking added to MultiXact freezing fairly
>recently, in commit 699bf7d0.
>
>You didn't say anything about regular VACUUM being broken. Do you find
>that it works without any apparent issue?
>
>I have a suspicion that this could be a subtle bug in
>CLUSTER/freezing. The only heap_freeze_tuple() caller is code used by
>CLUSTER, so it's not that hard to imagine a MultiXact freezing bug
>that is peculiar to CLUSTER. Though I haven't thought about it in much
>detail.

I've not followed this thread. Possible it's the overeager check for pg 
upgraded tuples from before 9.3 that Alvaro fixed recently?

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: ERROR: found multixact from before relminmxid

2018-04-10 Thread Andres Freund
On 2018-04-10 08:31:20 -0300, Alexandre Arruda wrote:
> 2018-04-09 23:51 GMT-03:00 Peter Geoghegan :
> > On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda  wrote:
> >> (... and all other indexes returns null too)
> >>
> >> I tried with bt_index_check too. Same results.
> >
> > That's interesting, because it tells me that you have a table that
> > appears to not be corrupt, despite the CLUSTER error. Also, the error
> > itself comes from sanity checking added to MultiXact freezing fairly
> > recently, in commit 699bf7d0.
> >
> > You didn't say anything about regular VACUUM being broken. Do you find
> > that it works without any apparent issue?
> >
> > I have a suspicion that this could be a subtle bug in
> > CLUSTER/freezing. The only heap_freeze_tuple() caller is code used by
> > CLUSTER, so it's not that hard to imagine a MultiXact freezing bug
> > that is peculiar to CLUSTER. Though I haven't thought about it in much
> > detail.
> >
> > --
> > Peter Geoghegan
> 
> Hi Peter,
> 
> Actualy, I first notice the problem in logs by autovacuum:
> 
> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT:  automatic vacuum of
> table "production.public.fn06t"
> 2018-04-10 08:22:16.815 -03 [55477] ERROR:  found multixact 68834765
> from before relminmxid 73262006
> 
> production=# vacuum analyze verbose fn06t;
> INFO:  vacuuming "public.fn06t"
> ERROR:  found multixact 76440919 from before relminmxid 122128619

What does the function in
https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyo...@alap3.anarazel.de
say about your table?

Could you post pg_controldata output and
SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
?

Greetings,

Andres Freund



Re: ERROR: found multixact from before relminmxid

2018-04-10 Thread Andres Freund


On April 10, 2018 11:51:40 PM PDT, Pavan Deolasee  
wrote:
>On Wed, Apr 11, 2018 at 8:31 AM, Alexandre Arruda 
>wrote:
>
>> 2018-04-10 19:09 GMT-03:00 Peter Geoghegan :
>> > On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda
>
>> wrote:
>> >> Actualy, I first notice the problem in logs by autovacuum:
>> >>
>> >> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT:  automatic vacuum of
>> >> table "production.public.fn06t"
>> >> 2018-04-10 08:22:16.815 -03 [55477] ERROR:  found multixact
>68834765
>> >> from before relminmxid 73262006
>> >>
>> >> production=# vacuum analyze verbose fn06t;
>> >> INFO:  vacuuming "public.fn06t"
>> >> ERROR:  found multixact 76440919 from before relminmxid 122128619
>> >
>> > Do you think that CLUSTER was run before regular VACUUM/autovacuum
>> > showed this error, though?
>>
>> Yes, because the table is clustered in the old database and
>> reclustered when it was reloaded in the version 10.
>> But the table was not clustered again.
>>
>>
>I wonder if we're staring at some race condition in visibility map
>where a
>previous vacuum inadvertently skips a page because the visibility map
>bit
>is set, thus leaving behind an unfrozen multixid. The page then again
>becomes !all_visible and the next vacuum then complains about the
>unfrozen
>multixid.

Should be possible to write a query using page inspect and freespacemap to make 
sure they at least currently match.
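
A rough sketch of such a check, assuming the pg_visibility extension (available in 9.6+) is used for the visibility-map side rather than pg_freespacemap:

  CREATE EXTENSION pg_visibility;
  -- blocks whose all-visible / all-frozen bits are set in the visibility map
  SELECT * FROM pg_visibility_map('fn06t') WHERE all_visible OR all_frozen;
  -- tuples that are not actually all-visible despite the page's VM bit being set
  SELECT * FROM pg_check_visible('fn06t');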

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Postgres and fsync

2018-04-22 Thread Andres Freund
Hi,

On 2018-04-23 08:30:25 +1000, Tim Cross wrote:
> the recent article in LWN regarding issues with fsync and error
> reporting in the Linux kernel and the potential for lost data has
> prompted me to ask 2 questions.

Note that you need to have *storage* failures for this to
happen. I.e. your disk needs to die, and there's no raid or such to fix
the issue.


> 1. Is this issue low level enough that it affects all potentially
> supported sync methods on Linux? For example, if you were concerned
> about this issue and you had a filesystem which supports open_sync or
> open_datasync etc, is switching to one of these options something which
> should be considered or is this issue low level enough that all sync
> methods are impacted?

No, the issue is largely about datafiles whereas the setting you refer
to is about the WAL.

Greetings,

Andres Freund



Re: Can PostgreSQL create new WAL files instead of reusing old ones?

2018-04-27 Thread Andres Freund
On 2018-04-27 12:28:25 -0400, Bruce Momjian wrote:
> On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
> > On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier  
> > wrote:
> > 
> > 
> > That looks like a rather difficult problem to solve in PostgreSQL
> > itself, as the operator running the cluster is in charge of setting up
> > the FS options which would control the COW behavior, so it seems to me
> > 
> > 
> > You cannot turn off CoW on ZFS. What other behavior would you refer to here?
> > 
> > I suppose one could make a dedicated data set for the WAL and have ZFS make 
> > a
> > reservation for about 2x the total expected WAL size. It would require 
> > careful
> > attention to detail if you increase WAL segments configuration, though, and 
> > if
> > you had any kind of hiccup with streaming replication that caused the 
> > segments
> > to stick around longer than expected (but that's no different from any other
> > file system).
> 
> Uh, at the risk of asking an obvious question, why is the WAL file COW
> if it was renamed?  No one has the old WAL file open, as far as I know.

Because it's a COW filesystem that doesn't overwrite in place. That's
how snapshots etc are implemented.

Greetings,

Andres Freund



Re: statement_timeout issue

2018-05-04 Thread Andres Freund
Hi,

On 2018-05-04 09:52:35 -0700, Ian Harding wrote:
> I know there are a lot of moving parts to this issue but I think I've
> eliminated most of them and the problem comes down to the postgresql server
> generating a statement_timeout error after 10 seconds when the connection
> statement_timeout is actually set to 15 minutes.
> 
> My web server gets a database handle which it keeps for the duration of the
> function that generates the response.  Multiple queries can be sent and
> will use the same handle.
> 
> After getting the handle I issue "show statement_timeout" and log the
> result which shows 15 minutes.
> 
> 84876:May  4 08:31:06 seattle-vm-1-0 nsd: STATEMENT_TIMEOUT IS:
> 84877-May  4 08:31:06 seattle-vm-1-0 nsd: statement_timeout = 15min
> 
> Then I send a query, and it generates a statement_timeout error after 10
> seconds.
> 
> The log has a CONTEXT which indicates a foreign key lookup was going on at
> the time which is fine...
> 
> 2018-05-04 04:05:20 PDT [62028]: [82-1] ERROR:  canceling statement due to
> statement timeout
> 2018-05-04 04:05:20 PDT [62028]: [83-1] CONTEXT:  SQL statement "SELECT 1
> FROM ONLY "public"."clients" x WHERE "c_id" OPERATOR(pg_catalog.=) $1 FOR
> KEY SHARE OF x"
> 2018-05-04 04:05:20 PDT [62028]: [84-1] STATEMENT:  INSERT /*
> 
> Is there anything I'm not thinking of?

Probably some part of your application, or a function you're calling
from there, is setting a different timeout.
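
A few places worth checking for such an override - a sketch, not a diagnosis of this particular application; some_func is just a placeholder name:

  -- where the active value comes from (default, configuration file, database, user, session, ...)
  SELECT name, setting, source FROM pg_settings WHERE name = 'statement_timeout';
  -- per-database / per-role overrides
  SELECT * FROM pg_db_role_setting;
  -- a function can also carry its own setting, e.g.:
  --   ALTER FUNCTION some_func() SET statement_timeout = '10s';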

Greetings,

Andres Freund



Re: Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Andres Freund
Hi,

On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
> 
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
> 
> How you installed PostgreSQL:
> 
> From the pgdg yum repositories.
> 
> Changes made to the settings in the postgresql.conf file:  see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR:  found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT:  automatic vacuum of table "postgres.pg_catalog.pg_authid"
> 
> postgres=# select xmin from pg_authid ;
> xmin
> 
>   1
>   1
>   1
>   1
>   1
> 557
> 7216348
>   110077819
>   110511334
>  3031994631
>  3032044199
>  3032044199
>  3032044199
>  3032070282
> (14 rows)
> 
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --
> 400011439
> (1 row)
> 
> postgres=#
> 
> Is this a symptom of data corruption, or of transaction wraparound due to
> the long-running transaction that we killed weeks ago?

That shouldn't be the sole cause.


> This is the only table in the whole cluster that has this error. We
> are monitoring transaction wraparound with the check_postgres.pl
> script; the check is still running fine and no alert was given at all
> since the cluster has been running.
> since the cluster has been running.
> 
> Thank you in advance for any answer.

Could you report the result of
select ctid, xmin, xmax from pg_authid ;
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));

Thanks.
Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-22 Thread Andres Freund
On 2018-05-15 11:06:38 +0200, Maxim Boguk wrote:
> Hi everyone,
> 
> I just got the same issue on 9.6.8:
> 
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] ERROR:
> found xmin 2808837517 from before relfrozenxid 248712603
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] []
> CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"
> 
> Additional details:
> 
> 1) bt_index_check - no errors on both indexes.
> 
> 2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum
> freeze - doesn't help.

Yea, too late.


> 3)Total 6 rows affected:
> template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where
> xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where
> relname='pg_authid');
> oid|  ctid  |xmin| xmax | cmin | cmax
> ---+++--+--+--
>  183671986 | (0,90) | 3039161773 |0 |0 |0
>  183106699 | (1,48) | 2576823237 |0 |0 |0
>  183921770 | (1,50) | 3265971811 |0 |0 |0
>  183921914 | (1,52) | 3266122344 |0 |0 |0
>  187988173 | (1,58) | 4258893789 |0 |0 |0
>  182424977 | (1,62) | 2808837517 |0 |0 |0
> on total two pages.
> 
> 4) template1=# select relfrozenxid from pg_catalog.pg_class where relname =
> 'pg_authid';
>  relfrozenxid
> --
>2548304492
> 
> 
> 5)Rows itself looks pretty valid and correspond to the actual database
> users.
> 7)No database/server crash happened last few years, no disk errors/problems.
> 
> I feel it could be related to the vacuum-skip-locked-pages patch + freeze +
> shared catalog combination, but cannot prove it yet.
>
> Looking for possible course of action.
> Probably simplest fix - drop and recreate these 6 affected users, but so
> far I willing spent some time research into this issue.

Could you use pageinspect to get the infomasks for the affected tuples?

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-22 Thread Andres Freund
Hi,

On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:
> For example:
> 
> postgres=# vacuum pg_catalog.pg_authid;
> ERROR:  found xmin 2894889518 from before relfrozenxid 248712603
> 
> select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
> xmin::text::bigint=2894889518;
>   ctid  |xmin| xmax | cmin | cmax
> ++--+--+--
>  (1,26) | 2894889518 |0 |0 |0
> 
> 
> postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
> where t_ctid::text='(1,26)';
> -[ RECORD 1
> ]---
> lp  | 26
> lp_off  | 4656
> lp_flags| 1
> lp_len  | 144
> t_xmin  | 2894889518
> t_xmax  | 0
> t_field3| 0
> t_ctid  | (1,26)
> t_infomask2 | 32779
> t_infomask  | 10507
> t_hoff  | 32
> t_bits  | 1100
> t_oid   | 189787727
>

So this row has, if I didn't screw up decoding it, the following infomask
bits set:
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

So it's not been previously frozen, which I was wondering about.
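
For anyone who wants to repeat that decoding, a rough SQL sketch using the bit values from src/include/access/htup_details.h (only the flags relevant to this tuple are listed; 10507 is the t_infomask shown above):

  SELECT flag
  FROM (VALUES (x'0001'::int, 'HEAP_HASNULL'),
               (x'0002'::int, 'HEAP_HASVARWIDTH'),
               (x'0008'::int, 'HEAP_HASOID'),
               (x'0100'::int, 'HEAP_XMIN_COMMITTED'),
               (x'0200'::int, 'HEAP_XMIN_INVALID'),
               (x'0800'::int, 'HEAP_XMAX_INVALID'),
               (x'2000'::int, 'HEAP_UPDATED')) AS f(mask, flag)
  WHERE (10507 & mask) = mask;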


> Which indeed makes it
> Any new role created in DB instantly affected by this issue.

What's txid_current()?


> 
> In the same time:
> 
>  select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --
>2863429136
> 
> So it's interesting where the value of "from before relfrozenxid 248712603"
> comes from.

Hm. That's indeed odd.  Could you get a backtrace of the error with "bt
full"?

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-22 Thread Andres Freund
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund  wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --
> > >2863429136

> select txid_current();
>  txid_current
> --
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
autovacuum_freeze_max_age set to in that cluster?

Can you show pg_controldata output, and relminmxid from that cluster?

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:248712603
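
For reference, a quick way to do that epoch arithmetic and to pull up the related setting - a sketch, not part of the original exchange:

  SELECT txid_current() >> 32 AS epoch,
         txid_current() & ((1::bigint << 32) - 1) AS xid_without_epoch;
  SHOW autovacuum_freeze_max_age;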


> About gdb bt - it's tricky because it is the mission-critical master db of a
> huge project.
> I will try to promote a backup replica and check whether the issue persists
> there, and if yes - we will have our playground for a while, but it will
> require some time to arrange.

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate it if
you set up a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund



Re: Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Andres Freund
Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
> 
> This is the result:
> 
> postgres=# select ctid, xmin, xmax from pg_authid ;

>  (0,16) | 3031994631 |0
>  16 |   6496 |1 |144 | 3031994631 |  0 |0 | (0,16)
> |   32779 |  10507 | 32 | 1100 |  675851 |
> \x6e6167696f7300010100496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --
> 400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid.  Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen.  That suggests there's either a bug, or you had
corruption in your cluster.


Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues?  Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund



Re: Error on vacuum: xmin before relfrozenxid

2018-05-22 Thread Andres Freund
Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
> 
> May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just the same error.  At which rate are you
creating / updating database users / roles?
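
As a side note (not from the original mail), the full set of shared catalogs can be listed like this:

  SELECT relname FROM pg_class WHERE relisshared AND relkind = 'r';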

Greetings,

Andres Freund



Re: Error on vacuum: xmin before relfrozenxid

2018-05-23 Thread Andres Freund
On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> > 
> > May  8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May  8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
> 
> pg_authid (along with a few other tables) is shared between
> databases. So that's just the same error.  At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
  tables)?
- Did you have any failovers?
- Do you use logical replication?

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-24 Thread Andres Freund
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> Hmm .. surely
> 
> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index 5016181fd7..5d7fa1fb45 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>   xid = HeapTupleHeaderGetXmin(tuple);
>   xmin_frozen = ((xid == FrozenTransactionId) ||
>  HeapTupleHeaderXminFrozen(tuple));
> - if (TransactionIdIsNormal(xid))
> + if (!xmin_frozen && TransactionIdIsNormal(xid))
>   {
>   if (TransactionIdPrecedes(xid, relfrozenxid))
>   ereport(ERROR,
> 
> 
> ??

I don't think that's necesary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-24 Thread Andres Freund
On 2018-05-24 13:30:54 -0700, Andres Freund wrote:
> On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > Hmm .. surely
> > 
> > diff --git a/src/backend/access/heap/heapam.c 
> > b/src/backend/access/heap/heapam.c
> > index 5016181fd7..5d7fa1fb45 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> > xid = HeapTupleHeaderGetXmin(tuple);
> > xmin_frozen = ((xid == FrozenTransactionId) ||
> >HeapTupleHeaderXminFrozen(tuple));
> > -   if (TransactionIdIsNormal(xid))
> > +   if (!xmin_frozen && TransactionIdIsNormal(xid))
> > {
> > if (TransactionIdPrecedes(xid, relfrozenxid))
> > ereport(ERROR,
> > 
> > 
> > ??
> 
> I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> FrozenTransactionId if the tuple is frozen (note the
> HeapTupleHeaderXminFrozen() within).

FWIW, even if that weren't the case: a) there'd be a lot more wrong with
this routine imo. b) some of the tuples affected clearly weren't
frozen...

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-24 Thread Andres Freund
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> BTW is it just a coincidence or are all the affected tables pg_authid?
> Maybe the problem is shared relations ..?  Maybe the fact that they have
> separate relfrozenxid (!?) in different databases?

Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.

Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-24 Thread Andres Freund
On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> 
> > On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> > > BTW is it just a coincidence or are all the affected tables pg_authid?
> > > Maybe the problem is shared relations ..?  Maybe the fact that they have
> > > separate relfrozenxid (!?) in different databases?
> > 
> > Yes, that appears to be part of the problem. I've looked at a number of
> > shared relation related codepaths, but so far my theory is that the
> > relcache is wrong. Note that one of the reports in this thread clearly
> > had a different relcache relfrozenxid than in the catalog.
> 
> Hmm ... is that because they read the values on different databases?
> Are you referring to the reports by Maxim Boguk?  I see one value from
> template1, another value from template0.

I was referring to
https://www.postgresql.org/message-id/20180522193007.4bi5oluqb7c72...@alap3.anarazel.de
but you're right, it's possible that that's just caused by time passing
or different databases.

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-05-25 Thread Andres Freund
On 2018-05-24 16:46:24 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> 
> > On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > > Hmm .. surely
> 
> > >   xid = HeapTupleHeaderGetXmin(tuple);
> > >   xmin_frozen = ((xid == FrozenTransactionId) ||
> > >  HeapTupleHeaderXminFrozen(tuple));
> > > - if (TransactionIdIsNormal(xid))
> > > + if (!xmin_frozen && TransactionIdIsNormal(xid))
> 
> > I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> > FrozenTransactionId if the tuple is frozen (note the
> > HeapTupleHeaderXminFrozen() within).
> 
> Ah, yeah ... I probably thought about this when writing it and removed
> it for that reason.
> 
> BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
> confusing, by failing to return true if the xmin is numerically
> FrozenXid (which it'll be if the database was pg_upgraded).  I wonder
> about this one in HeapTupleSatisfiesMVCC:

I suggest raising this on -hackers. I agree that it's unfortunate.

Greetings,

Andres Freund



Re: ERROR: found multixact from before relminmxid

2018-05-28 Thread Andres Freund
Hi,

I think I found the bug, and am about to post a fix for it below
https://postgr.es/m/20180525203736.crkbg36muzxrj...@alap3.anarazel.de.

Greetings,

Andres Freund



Re: binaries for 11 beta compiled with --with-llvm?

2018-05-28 Thread Andres Freund
Hi,

On 2018-05-29 07:54:52 +0200, Thomas Kellerer wrote:
> But neither the Linux binaries nor the Windows binaries were compiled with 
> the --with-llvm option 
> (will JITting be possible with Windows at all?)

Not in 11.

Greetings,

Andres Freund



Re: Postgres 11 beta - no JITing

2018-06-05 Thread Andres Freund
Hi,

On 2018-06-05 13:45:26 +0200, Thomas Kellerer wrote:
> Hello,
> 
> I am trying to assess the benefits of the new JIT feature in Postgres 11. 
> However I can't figure out how to enable it. 
> 
> I have a test server with "CentOS Linux release 7.5.1804"
> 
> I installed the beta using:
> 
>   yum install 
> https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7.4-x86_64/postgresql11-server-11.0-beta1_1PGDG.rhel7.x86_64.rpm
>   yum install postgresql11
>   yum install postgresql11-server

I suspect the issue is that you also need the postgresql11-llvmjit
package. The LLVM support adds dependencies, so it's not included in the
main package.

> Then initdb and everything was up and running. 
> 
> I checked the installation:
> 
>   "show jit_provider" gives me: "llvmjit"
>   "show jit" gives me: "on"

You can check whether support actually could be loaded with SELECT 
pg_jit_available();
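
A quick way to see whether JIT actually kicks in once the llvmjit package is installed - a sketch; the cost setting is forced low here purely so the example triggers JIT:

  SELECT pg_jit_available();
  SET jit = on;
  SET jit_above_cost = 0;  -- demonstration only
  EXPLAIN (ANALYZE) SELECT sum(i) FROM generate_series(1, 1000000) AS i;
  -- the plan output ends with a "JIT:" section when JIT was used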

Greetings,

Andres Freund



Re: Code of Conduct plan

2018-06-05 Thread Andres Freund
Hi,

On 2018-06-06 02:20:45 +0700, Benjamin Scherrey wrote:
> On Wed, Jun 6, 2018 at 2:12 AM, Christophe Pettus  wrote:
> 
> >
> > > On Jun 5, 2018, at 12:06, Benjamin Scherrey 
> > wrote:
> > > Doesn't that 20 years of results pretty clearly demonstrate that this
> > community does not gain an advantage for adopting a CoC?
> >
> > Not at all.  The need for a CoC is not theoretical.  Real people,
> > recently, have left the community due to harassment, and there was no
> > system within the community to report and deal with that harassment.
> >
> 
> I keep hearing this claim. I've followed up and tried to verify them.

What would satisfy you? Dishing out all the details for everyone to see?
That'd both personally affect the victim and the alleged perpetrator,
and have potential legal implications.  At some point you're going to
have to trust that community stewards are working in good faith (which
doesn't imply agreeing on everything) and not trying to just screw with
you for the sake of it.


> Sorry but "trust me" doesn't cut it here any more than "trust me this
> will make Postgres go faster" would on a code change.

You do trust us to run code on your systems without having read every
line.

Greetings,

Andres Freund



Re: ERROR: found multixact from before relminmxid

2018-06-08 Thread Andres Freund
On 2018-06-08 12:38:03 -0500, Jeremy Finzel wrote:
> On Tue, Jun 5, 2018 at 8:42 PM, Alexandre Arruda  wrote:
> 
> > Em seg, 28 de mai de 2018 às 16:44, Andres Freund 
> > escreveu:
> > >
> > > Hi,
> > >
> > > I think I found the bug, and am about to post a fix for it below
> > > https://postgr.es/m/20180525203736.crkbg36muzxrj...@alap3.anarazel.de.
> > >
> > > Greetings,
> > >
> > > Andres Freund
> >
> > Hi Andres,
> >
> > In end of April we did a complete dump/reload in database to version 10.3.
> > Today, the problem returns:
> >
> > production=# vacuum verbose co27t;
> > INFO:  vacuuming "public.co27t"
> > ERROR:  found multixact 81704071 from before relminmxid 107665371
> > production=# vacuum full verbose co27t;
> > INFO:  vacuuming "public.co27t"
> > ERROR:  found multixact 105476076 from before relminmxid 107665371
> > production=# cluster co27t;
> > ERROR:  found multixact 105476076 from before relminmxid 107665371
> >
> > But this time, regular vacuum versus full/cluster are different in
> > multixact number.
> > Your patch is applicable to this issue and is in 10.4 ?
> >
> > Best regards,
> >
> > Alexandre
> >
> >
> We encountered this issue ourselves for the first time on a busy OLTP
> system.  It is at 9.6.8.  We found that patching to 9.6.9 on a snapshot of
> this system did not fix the problem, but I assume that is because the patch
> in 9.6.9 only prevents the problem moving forward.  Is that accurate?

Which patch in 9.6.9 are you referring to?  The patch I linked to above
hasn't yet been merged, much less been released.

Unfortunately, on second thought, I don't quite see how it applies to
the cases here (rather than other reports about pg_authid and such). So
there might be a separate issue.  Is there any chance I could get access
to a copy of the data? It's very hard to debug something like this
without something that can reproduce the issue...

Greetings,

Andres Freund



Re: pg_upgrade and wraparound

2018-06-11 Thread Andres Freund
Hi,

On 2018-06-09 13:46:16 +0300, Alexander Shutyaev wrote:
> Hello!
> 
> I've been trying to upgrade a postgresql cluster from 9.6 to 10. I've
> executed the pg_upgrade with the following options:
> 
>  /usr/lib/postgresql/10/bin/pg_upgrade -b /usr/lib/postgresql/9.6/bin/ -B
> /usr/lib/postgresql/10/bin/ -d /var/lib/postgresql/9.6/main -D
> /var/lib/postgresql/10/main -o ' -c
> config_file=/etc/postgresql/9.6/main/postgresql.conf' -O ' -c
> config_file=/etc/postgresql/10/main/postgresql.conf'
> 
> The upgrade operation failed after several hours with the following error:
> 
> database is not accepting commands to avoid wraparound data loss in
> database with OID 0
> 
> Earlier in the log there are a lot of messages like
> 
> pg_restore: executing BLOB 1740736966
> pg_restore: WARNING:  database with OID 0 must be vacuumed within 1000279
> transactions
> HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that
> database.
> You might also need to commit or roll back old prepared transactions.

Since this happens in the version you're migrating to, I suggest
replacing the warning with a PANIC and then using the debugger to look
at the corefile generated (when using an appropriate ulimit).  It's not
immediately obvious why there'd not be correct knowledge about the
oldest database around. If you do so, please include 'p *ShmemVariableCache'
output.

Greetings,

Andres Freund



Re: pg_upgrade and wraparound

2018-06-11 Thread Andres Freund
On 2018-06-09 15:52:26 -0400, Tom Lane wrote:
> Adrian Klaver  writes:
> > On 06/09/2018 03:46 AM, Alexander Shutyaev wrote:
> >> The upgrade operation failed after several hours with the following error:
> >> database is not accepting commands to avoid wraparound data loss in 
> >> database with OID 0
> 
> > Do you know which database has an OID of 0?
> 
> Well, none do, so the correct question is what is passing an invalid
> database OID to the code that's complaining.  This sure looks like a
> bug, though I'm not sure we have enough info to locate it.

It sure looks like ShmemVariableCache->oldestXidDB isn't initialized.
As far as I remember we just initialize that from a checkpoint at
startup.   I suspect the issue is that pg_resetwal does:
if (set_xid != 0)
{
    ControlFile.checkPointCopy.nextXid = set_xid;

    /*
     * For the moment, just set oldestXid to a value that will force
     * immediate autovacuum-for-wraparound.  It's not clear whether adding
     * user control of this is useful, so let's just do something that's
     * reasonably safe.  The magic constant here corresponds to the
     * maximum allowed value of autovacuum_freeze_max_age.
     */
    ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
    if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
        ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
    ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
}

but we have codepath that doesn't check for oldestXidDB being
InvalidOid.  Not great.

Greetings,

Andres Freund



Re: pg_upgrade and wraparound

2018-06-11 Thread Andres Freund
On 2018-06-11 13:14:12 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I suspect the issue is that pg_resetwal does:
> > if (set_xid != 0)
> > {
> > ControlFile.checkPointCopy.nextXid = set_xid;
> 
> > /*
> >  * For the moment, just set oldestXid to a value that will force
> >  * immediate autovacuum-for-wraparound.  It's not clear whether 
> > adding
> >  * user control of this is useful, so let's just do something 
> > that's
> >  * reasonably safe.  The magic constant here corresponds to the
> >  * maximum allowed value of autovacuum_freeze_max_age.
> >  */
> > ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
> > if (ControlFile.checkPointCopy.oldestXid < 
> > FirstNormalTransactionId)
> > ControlFile.checkPointCopy.oldestXid += 
> > FirstNormalTransactionId;
> > ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
> > }
> 
> > but we have codepath that doesn't check for oldestXidDB being
> > InvalidOid.  Not great.
> 
> Hm, I think I'd define the problem as "pg_resetwal is violating the
> expectation that oldestXidDB be valid".

Well, what could it do otherwise?  ForceTransactionIdLimitUpdate()
currently does a syscache check for database existence. That'll just
return a lookup failure for InvalidOid, so we're reasonably good on that
front.

Using a hardcoded 2000000000 seems worse; it will have funny results if
running with a smaller autovacuum_freeze_max_age...


> However, this just explains the basically-cosmetic issue that the
> complaint message mentions OID 0.  It doesn't really get us to the
> answer to why Alexander is seeing a failure.  It might be useful
> to see pg_controldata output for the old cluster, as well as
> "select datname, datfrozenxid from pg_database" output from the
> old cluster.

pg_upgrade starts the server with autovacuum disabled, I suspect
restoring all the large objects ends up using a lot of transaction
ids. GetNewTransactionId() should start autovacuum, but I'd guess that's
where things are going wrong for some reason.

Alexander, could you hack things up so autovacuum logging is enabled
(log_autovacuum_min_duration=0), and see whether it's triggered?

I'm not entirely clear why pg_restore appears to use a separate
transaction for each large object, surely exacerbating the problem.
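
As a side note, how close the cluster is to a forced anti-wraparound shutdown can be watched from another session while the restore runs - a sketch, assuming one can get a connection to the target cluster during pg_upgrade's restore phase (pg_upgrade restricts connections, so that may take some tweaking):

  SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
  ORDER BY xid_age DESC;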

Greetings,

Andres Freund



Re: Print pg_lsn as a number?

2018-06-12 Thread Andres Freund
Hi,

On 2018-06-12 09:31:49 -0700, Scott Stroupe wrote:
> According to the documentation[1], pg_lsn is a 64-bit integer that's
> printed as two hex numbers separated by a slash, e.g. 68/1225BB70. Is
> there a way to get the 64-bit integer in a common numeric
> representation instead of the peculiar hex-slash-hex representation?

What do you want to do with the LSN?

Greetings,

Andres Freund



Re: ERROR: found multixact from before relminmxid

2018-06-12 Thread Andres Freund
Hi,

On 2018-06-08 13:30:33 -0500, Jeremy Finzel wrote:
>Avoid spuriously marking pages as all-visible (Dan Wood, Pavan Deolasee,
>Álvaro Herrera)
> 
>This could happen if some tuples were locked (but not deleted). While
>queries would still function correctly, vacuum would normally ignore such
>pages, with the long-term effect that the tuples were never frozen. In
>recent releases this would eventually result in errors such as "found
>multixact n from before relminmxid n".

Oh, I already had forgotten about that one... It does sound like a
likely explanation of your issue. Hard to tell without investigating a
*lot* more closely than I realistically can do remotely.  It seems quite
possible to be the cause - I'd strongly suggest to upgrade to prevent
further occurances, or at least exclude it as a cause.

Greetings,

Andres Freund



Re: Print pg_lsn as a number?

2018-06-12 Thread Andres Freund
Hi,

On 2018-06-13 09:18:21 +0900, Michael Paquier wrote:
> We map pg_wal_lsn_diff result to a numeric, so that could make sense to
> use numeric as well here, which is not the greatest choice by the way as
> that's an int64 internally, but that's more portable for any
> (unlikely-to-happen) future changes.

The reason not to use int64 is that it's signed, while LSNs are
unsigned. Therefore you couldn't represent all LSNs without wrapping
into negative values.
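
For what it's worth, the usual trick for getting an LSN as a plain number is to subtract the zero LSN, which yields a numeric and therefore has no sign problem - a sketch:

  SELECT pg_current_wal_lsn() - '0/0'::pg_lsn AS lsn_as_number;  -- PG 10+
  SELECT '68/1225BB70'::pg_lsn - '0/0'::pg_lsn;                  -- works on any pg_lsn value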

Greetings,

Andres Freund



Re: First query on each connection is too slow

2018-06-13 Thread Andres Freund
Hi,

On 2018-06-13 15:55:05 +0530, Pavan Teja wrote:
> Too much of shared buffers allocation also causes problem. Read the
> documentation.

That's not in the docs.

- Andres



Re: First query on each connection is too slow

2018-06-13 Thread Andres Freund
Hi,

On 2018-06-13 12:55:27 +0300, Vadim Nevorotin wrote:
> I have a very strange problem. I'm using PostgreSQL 9.6 with PostGIS 2.3
> (both from Debian Strecth repos) to store DB for OSM server (but actually
> it doesn't matter). And I've noticed, that on each new connection to DB
> first query is much slower (10x) than all others. E.g.:
> 
> $ psql test_gis
> psql (9.6.7)
> Type "help" for help.
> test_gis=# \timing
> Timing is on.
> test_gis=# SELECT srid FROM geometry_columns WHERE
> f_table_name='planet_osm_polygon' AND f_geometry_column='way';
>  srid
> --
>  3857
> (1 row)
> Time: 52.889 ms
> test_gis=# SELECT srid FROM geometry_columns WHERE
> f_table_name='planet_osm_polygon' AND f_geometry_column='way';
>  srid
> --
>  3857
> (1 row)
> Time: 2.070 ms
> test_gis=#
> 
> If I run other instances of psql in parallel while the first is active,
> they have absolutely the same problem. In one instance of psql the query is
> fast (if it's not the first query); in the others the first query is slow,
> but all others are fast.

What you're seeing is likely a mix of
a) Operating system overhead of doing copy-on-write the first time
   memory is touched. This can be reduced to some degree by configuring
   huge pages.
b) Postgres' caches over catalog contents (i.e. how your tables look
   like) having to be filled on the first access.  There's not really
   much you can do about it.

Is the overhead of this prohibitive for you, or are you merely curious?
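
For point a), the huge pages side can be checked roughly like this - a sketch; note the OS also needs huge pages reserved (vm.nr_hugepages) for this to have an effect:

  SHOW huge_pages;                      -- off / try / on
  ALTER SYSTEM SET huge_pages = 'try';  -- takes effect after a restart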

- Andres



Re: First query on each connection is too slow

2018-06-13 Thread Andres Freund
Hi,

On 2018-06-13 10:49:39 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-06-13 12:55:27 +0300, Vadim Nevorotin wrote:
> >> I have a very strange problem. I'm using PostgreSQL 9.6 with PostGIS 2.3
> >> (both from Debian Strecth repos) to store DB for OSM server (but actually
> >> it doesn't matter). And I've noticed, that on each new connection to DB
> >> first query is much slower (10x) than all others. E.g.:
> >> test_gis=# SELECT srid FROM geometry_columns WHERE
> >> f_table_name='planet_osm_polygon' AND f_geometry_column='way';
> 
> > What you're seeing is likely a mix of
> > a) Operating system overhead of doing copy-on-write the first time
> >memory is touched. This can be reduced to some degree by configuring
> >huge pages.
> > b) Postgres' caches over catalog contents (i.e. how your tables look
> >like) having to be filled on the first access.  There's not really
> >much you can do about it.
> 
> Seeing that this query seems to involve PostGIS, I suspect that there
> might be a third cause: time to load the PostGIS shared library.
> If so, you could probably alleviate the issue by adding postgis
> to shared_preload_libraries.

Ah, good point. It recursively depends on quite a number of other
shared libraries, several of them large:

$ ldd /usr/lib/postgresql/10/lib/postgis-2.4.so | grep '=>' | awk '{print $3}' | xargs readlink -f | xargs size
    text     data     bss      dec     hex filename
  523851     9512     864   534227   826d3 /usr/lib/liblwgeom-2.4.so.0.0.0
  191008     4528     296   195832   2fcf8 /usr/lib/x86_64-linux-gnu/libgeos_c.so.1.10.2
  474429    12904     512   487845   771a5 /usr/lib/x86_64-linux-gnu/libproj.so.13.1.0
   37408     1048      24    38480    9650 /lib/x86_64-linux-gnu/libjson-c.so.3.0.1
   30775      768       8    31551    7b3f /usr/lib/x86_64-linux-gnu/libprotobuf-c.so.1.0.0
 1795735    37356    5272  1838363  1c0d1b /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.4
 1769027    20992   17152  1807171  1b9343 /lib/x86_64-linux-gnu/libc-2.27.so
 1643118      956      12  1644086  191636 /lib/x86_64-linux-gnu/libm-2.27.so
 1706242    60760     568  1767570  1af892 /usr/lib/x86_64-linux-gnu/libgeos-3.6.2.so
 1511723    47328   13504  1572555  17fecb /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
   90261      944     688    91893   166f5 /lib/x86_64-linux-gnu/libgcc_s.so.1
  100418     1840   16768   119026   1d0f2 /lib/x86_64-linux-gnu/libpthread-2.27.so
    8106      792     112     9010    2332 /lib/x86_64-linux-gnu/libdl-2.27.so
 2691066    58376    3168  2752610  2a0062 /usr/lib/x86_64-linux-gnu/libicui18n.so.60.2
 1715333    73880    7008  1796221  1b687d /usr/lib/x86_64-linux-gnu/libicuuc.so.60.2
26901016      544       8 26901568 19a7c40 /usr/lib/x86_64-linux-gnu/libicudata.so.60.2
  113334     1272       8   114614   1bfb6 /lib/x86_64-linux-gnu/libz.so.1.2.11
  148548     2040       8   150596   24c44 /lib/x86_64-linux-gnu/liblzma.so.5.2.2


and with sizable writable mappings, too.  So yea,
shared_preload_libraries should help quite a bit.
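
Concretely, that would be something like the following - a sketch; the exact library name depends on the PostGIS build (here taken from the postgis-2.4 module in the ldd output above), and a server restart is required:

  ALTER SYSTEM SET shared_preload_libraries = 'postgis-2.4';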

Greetings,

Andres Freund



Re: What to do when dynamic shared memory control segment is corrupt

2018-06-18 Thread Andres Freund
On 2018-06-18 12:30:13 -0400, Tom Lane wrote:
> Sherrylyn Branchaw  writes:
> > We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.
> > We recently experienced two similar outages on two different prod
> > databases. The error messages from the logs were as follows:
> > LOG:  server process (PID 138529) was terminated by signal 6: Aborted
> 
> Hm ... were these installations built with --enable-cassert?  If not,
> an abort trap seems pretty odd.

PANIC does an abort, so that's not too surprising...

if (elevel >= PANIC)
{
    /*
     * Serious crash time. Postmaster will observe SIGABRT process exit
     * status and kill the other backends too.
     *
     * XXX: what if we are *in* the postmaster?  abort() won't kill our
     * children...
     */
    fflush(stdout);
    fflush(stderr);
    abort();
}

Greetings,

Andres Freund



Re: High WriteLatency RDS Postgres 9.3.20

2018-06-18 Thread Andres Freund
On 2018-06-18 18:43:06 -0300, Juan Manuel Cuello wrote:
> I'm experiencing high WriteLatency levels in a Postgres server 9.3.20
> hosted in Amazon RDS.

A lot of performance improvements have been made since 9.3, and it'll
soon-ish be out of support.

If you can reproduce the issue on postgres proper, rather than a
modified version in an environment that precludes getting detailed data,
we might be able to sensibly help you further.


> So far it's been almost two months of investigation
> and people at AWS technical support don't seem to find the cause. I think
> it could be related to Postgres and the number of schema/tables in the
> database, that's why I post this issue here.

There've been improvements made since 9.3. Upgrade.

Greetings,

Andres Freund



Re: Can PostgreSQL create new WAL files instead of reusing old ones?

2018-06-22 Thread Andres Freund
Hi,

On 2018-06-22 11:41:45 -0400, Adam Brusselback wrote:
> >  Is there some way to make it auto-detect when it should be enabled? If
> not, please document that it should be used on ZFS and any other file
> system with CoW properties on files.

> In addition to this, wondering what type of performance regression this
> would show on something like ext4 (if any).

It's a *massive* regression on ext4 & xfs.  You can very trivially
compare the performance of a new cluster (which doesn't have files to
recycle) against one that's been running for a while.

Greetings,

Andres Freund



Re: Too many range table entries error

2018-06-25 Thread Andres Freund
Hi,

On 2018-06-25 13:02:37 +0530, Akshaya Acharya wrote:
> Hello.
> 
> 
> Please could you help debug the error "too many range table entries”?
> 
> 
> This error occurs when querying a view that is dependent on many other
> views (i.e. a view high up in the pyramid of views that we've constructed).
> 
> 
> I get this error when running select * on the view, or when running an
> explain analyse on the select.
> 
> 
> Views that use a total of more than around 4 table references (in the
> complete tree considering all the dependent views recursively) don't work,
> but it works with 2 table references. What is the maximum number of
> table references possible?

Why are you doing this?  I can't imagine queries with that many table
references ever being something useful? I'm pretty sure there are better
solutions for what you're doing.


> Can I increase this number somehow?

It's not impossible, but it's not entirely trivial either. The relevant
variables currently are 16bit wide, and the limit is close to the max
for that.

Greetings,

Andres Freund



Re: Too many range table entries error

2018-06-25 Thread Andres Freund
On 2018-06-25 13:46:06 +0530, Akshaya Acharya wrote:
> On Mon, 25 Jun 2018 at 13:40, Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2018-06-25 13:02:37 +0530, Akshaya Acharya wrote:
> > > Hello.
> > >
> > >
> > > Please could you help debug the error "too many range table entries”?
> > >
> > >
> > > This error occurs when querying a view that is dependent on many other
> > > views (i.e. a view high up in the pyramid of views that we've
> > constructed).
> > >
> > >
> > > I get this error when running select * on the view, or when running an
> > > explain analyse on the select.
> > >
> > >
> > > Views that use a total of more than around 4 table references (in the
> > > complete tree considering all the dependent views recursively) don't
> > work,
> > > but it works with 2 table references. What is the maximum number of
> > > table references possible?
> >
> > Why are you doing this?  I can't imagine queries with that many table
> > references ever being something useful? I'm pretty sure there's better
> > solutions for what you're doing.
> >
> 
> Our entire application—all our business logic—is built as layers of views
> inside the database. The ref counts sort of multiple at each layer, hence
> the large number.

That still doesn't explain how you realistically get to 40k references,
and how that's a reasonable design. There'd be quite a massive runtime
and memory overhead for an approach like this.  What was the reasoning
leading to this architecture?


> > > Can I increase this number somehow?
> >
> > It's not impossible, it's not entirely trivial either. The relevant
> > variables currently are 16bit wide, and the limit is close to the max
> > for that.
> >
> 
> I understand.
> 
> At slide 25 of this presentation a patch is indicated. Is this relevant to
> our situation?
> https://www.slideshare.net/hansjurgenschonig/postgresql-joining-1-million-tables

Yes, but that change likely isn't sufficient.


> Alternatively we will have to optimize our views or change the architecture
> of our application? Is there any other way to resolve this situation?

Yes, I think you will have to, and no, I don't see any other way.

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-06-27 Thread Andres Freund
Hi,

On 2018-06-27 20:35:16 +0200, Luca Ferrari wrote:
> I've found this strange (to me) behavior when doing nasty things with
> indexes and immutable functions:
> 
> create table t( pk serial, t text );
> insert into t( t ) values( 'hello' ), ('world');
> create or replace function f_fake( i int )
> returns text
> as $body$
> declare
>   v_t text;
> begin
>select t into strict v_t
>from t where pk = i limit 1;
>return v_t;
> exception
>   when no_data_found then return 'a';
> end
> $body$
> language plpgsql immutable;
> 
> Of course, f_fake is not immutable.
> When on 10.4 or 11 beta 1 I try to create an index on this nasty
> crappy function:
> 
> create index idx_fake on t ( f_fake( pk ) );
> 
> ERROR:  could not read block 0 in file "base/16392/16444": read only 0
> of 8192 bytes
> CONTEXT:  SQL statement "select t from t where pk = i limit 1"
> PL/pgSQL function f_fake(integer) line 5 at SQL statement

> that is somehow correct (because the function cannot be used to build
> an index), but then it goes worst:
> 
> select * from t;
> ERROR:  could not open relation with OID 16444

That certainly isn't behaviour I'd expect. Doing nasty stuff inside an
immutable function will have bad consequences, but the permanent failure
shouldn't be there.

But I can't reproduce it either on 10.4, 10-current, or master.  Did
you build from source? Packages? Any extensions? Is there anything
missing from the above instructions to reproduce this?

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-06-28 Thread Andres Freund
ck_rights=true, check_not_in_use=true, 
skip_build=false, quiet=false)
at /home/andres/src/postgresql/src/backend/commands/indexcmds.c:850
#39 0x55c314179d92 in ProcessUtilitySlow (pstate=0x55c314eb9d68, 
pstmt=0x55c314e9a3b8,
queryString=0x55c314e994d8 "create index idx_fake on t ( f_fake( pk ) );", 
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55c314e9a6a8, completionTag=0x7ffe19a80b80 "") at 
/home/andres/src/postgresql/src/backend/tcop/utility.c:1352
#40 0x55c314178f44 in standard_ProcessUtility (pstmt=0x55c314e9a3b8, 
queryString=0x55c314e994d8 "create index idx_fake on t ( f_fake( pk ) );",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x55c314e9a6a8, completionTag=0x7ffe19a80b80 "")
at /home/andres/src/postgresql/src/backend/tcop/utility.c:920
#41 0x55c31417803c in ProcessUtility (pstmt=0x55c314e9a3b8, 
queryString=0x55c314e994d8 "create index idx_fake on t ( f_fake( pk ) );",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x55c314e9a6a8, completionTag=0x7ffe19a80b80 "")
at /home/andres/src/postgresql/src/backend/tcop/utility.c:360
#42 0x55c314176fe8 in PortalRunUtility (portal=0x55c314f01098, 
pstmt=0x55c314e9a3b8, isTopLevel=true, setHoldSnapshot=false, 
dest=0x55c314e9a6a8,
completionTag=0x7ffe19a80b80 "") at 
/home/andres/src/postgresql/src/backend/tcop/pquery.c:1178
#43 0x55c314177208 in PortalRunMulti (portal=0x55c314f01098, 
isTopLevel=true, setHoldSnapshot=false, dest=0x55c314e9a6a8, 
altdest=0x55c314e9a6a8,
completionTag=0x7ffe19a80b80 "") at 
/home/andres/src/postgresql/src/backend/tcop/pquery.c:1324
#44 0x55c314176736 in PortalRun (portal=0x55c314f01098, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x55c314e9a6a8,
altdest=0x55c314e9a6a8, completionTag=0x7ffe19a80b80 "") at 
/home/andres/src/postgresql/src/backend/tcop/pquery.c:799
#45 0x55c314170560 in exec_simple_query (query_string=0x55c314e994d8 
"create index idx_fake on t ( f_fake( pk ) );")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1122
#46 0x55c314174914 in PostgresMain (argc=1, argv=0x55c314ec7720, 
dbname=0x55c314ec7580 "postgres", username=0x55c314e95bc8 "andres")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4153
#47 0x55c3140d0e3c in BackendRun (port=0x55c314ebc5a0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4361
#48 0x55c3140d0579 in BackendStartup (port=0x55c314ebc5a0) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4033
#49 0x55c3140cc985 in ServerLoop () at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1706
#50 0x55c3140cc1d4 in PostmasterMain (argc=49, argv=0x55c314e93330) at 
/home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1379
#51 0x55c313ff523f in main (argc=49, argv=0x55c314e93330) at 
/home/andres/src/postgresql/src/backend/main/main.c:228


What you're seeing here is just a consequence of doing nasty things in
your index function - you're accessing the index inside the function
used to build the index.  That fails because the index isn't actually
built at that point.  It's possible we could find a way to report a
better error, but I don't think we'll try to make this actually work.


> testdb=> select * from t;
> 2018-06-28 10:23:23.642 CEST [892] ERROR:  could not open relation
> with OID 16538
> 2018-06-28 10:23:23.642 CEST [892] STATEMENT:  select * from t;
> ERROR:  could not open relation with OID 16538

This however seems wrong.  Clearly the relation's index list is out of
date.

I believe this happens because there's currently no relcache
invalidation registered for the main relation, until *after* the index
is built. Normally it'd be the CacheInvalidateRelcacheByTuple(tuple) in
index_update_stats(), which is called at the bottom of index_build().
But we never get there, because of the earlier error.  That's bad, because
any relcache entry built *after* the CommandCounterIncrement() in
index_create() will now be outdated.

In the olden days we most of the time didn't build a relcache entry
until after the index was built - but plan_create_index_workers() now
does. I suspect there are other ways to trigger that earlier, too.

Putting in a CacheInvalidateRelcache(heapRelation); before the CCI in
index_create() indeed makes the "borked relcache" problem go away.


I wonder why we don't just generally trigger invalidations to an
index's "owning" relation in CacheInvalidateHeapTuple()?
else if (tupleRelId == IndexRelationId)
{
Form_pg_index indextup = (Form_pg_index) GETSTRUCT(tuple);

/*
 * When a pg_index row is updated, we should send out a 
relcache inval
 * for the index relation.  As above, we don't know the shared 
status
 * of the index, but in practice it doesn't matter since 
indexes of
 * shared catalogs can't have such updates.
 */
relationId = indextup->indexrelid;
databaseId = MyDatabaseId;
}


Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-09 Thread Andres Freund
Hi,

On 2018-07-09 09:59:58 -0700, Peter Geoghegan wrote:
> On Thu, Jun 28, 2018 at 8:02 AM, Andres Freund  wrote:
> > I believe this happens because there's currently no relcache
> > invalidation registered for the main relation, until *after* the index
> > is built. Normally it'd be the CacheInvalidateRelcacheByTuple(tuple) in
> > index_update_stats(), which is called at the bottom of index_build().
> > But we never get there, because the earlier error.  That's bad, because
> > any relcache entry built *after* the CommandCounterIncrement() in
> > CommandCounterIncrement() will now be outdated.
> >
> > In the olden days we most of the time didn't build a relcache entry
> > until after the index was built - but plan_create_index_workers() now
> > does. I'm suspect there's other ways to trigger that earlier, too.
> 
> Note that there is a kludge within plan_create_index_workers() that
> has us treat the heap relation as an inheritance parent, just to get a
> RelOptInfo for the heap relation without running into similar trouble
> with the index in get_relation_info(). Perhaps there's an argument to
> be made for refactoring plan_create_index_workers() as a fix for this.

Maybe I'm missing something, but what has this got to do with the issue
at hand?


> > Putting in a CacheInvalidateRelcache(heapRelation); before the CCI in
> > index_create() indeed makes the "borked relcache" problem go away.
> >
> >
> > I wonder why we don't just generally trigger invalidations to an
> > indexes' "owning" relation in CacheInvalidateHeapTuple()?
> 
> I don't know, but that seems like a good question.

I assume we'll have to backpatch this issue, so I think it'd probably be a
good idea to put a specific CacheInvalidateHeapTuple() in there
explicitly in the back branches, and do the larger fix in 12. ISTM
there are some risks that it'd cause issues.  Will you tackle this?

Greetings,

Andres Freund



Re: Improving pg_dump performance

2018-07-23 Thread Andres Freund
Hi,

On 2018-07-23 02:23:45 -0500, Ron wrote:
> We've got an old (v8.4.17, thus no parallel backups) 2.9TB database that
> needs to be migrated to a new data center and then restored to v9.6.9.

Have you considered using pg_upgrade instead?

Greetings,

Andres Freund



Re: Improving pg_dump performance

2018-07-23 Thread Andres Freund
On 2018-07-23 09:17:41 -0500, Ron wrote:
> On 07/23/2018 09:11 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2018-07-23 02:23:45 -0500, Ron wrote:
> > > We've got an old (v8.4.17, thus no parallel backups) 2.9TB database that
> > > needs to be migrated to a new data center and then restored to v9.6.9.
> > Have you considered using pg_upgrade instead?
> 
> Yes, but:
> 1. can't find 9.6 packages on the RHEL 5.10 server.

If necessary you could just build it yourself.


> 2. The window is only 8 hours, and the data also has to be moved to a new DC
> in that window.

You could just use physical drives ;)

Greetings,

Andres Freund



Re: logical replication snapshots

2018-07-23 Thread Andres Freund
Hi,

On 2018-07-23 15:23:04 -0500, Dimitri Maziuk wrote:
> I'm playing with logical replication in postgres-10 and I got it to hang
> on startup with
> 
> > 2018-07-23 13:15:06.577 CDT [18624] FATAL:  the database system is starting 
> > up
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 
> > before_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 on_shmem_exit 
> > callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(1): 1 callbacks to 
> > make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  exit(1)
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 
> > before_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 on_shmem_exit 
> > callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(-1): 0 callbacks to 
> > make
> ... forever ...

What precedes that "loop"?


Could you describe what exactly you did to get into that situation?


> After poking around:
> 
> # find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c
> 12727975

Ugh, something absolutely clearly isn't right here.


> It looks like ZFS simply can't handle this many files: `ls` etc. hang
> forever.

ls probably works in general, it's just the sorting that makes it
hang. Should work with -f or such.


Greetings,

Andres Freund



Re: logical replication snapshots

2018-07-24 Thread Andres Freund
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
> On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
> > 
> > Are you serious about us trying to diagnose a bug with this description?
> 
> What do you want to know, exactly?

A recipe that we can follow to reproduce the issue.

Greetings,

Andres Freund



Re: Dealing with latency to replication slave; what to do?

2018-07-24 Thread Andres Freund
Hi,

On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
> Looking for any tips here on how to best maintain a replication slave which
> is operating under some latency between networks - around 230ms. On a good
> day/week, replication will keep up for a number of days, but however, when
> the link is under higher than average usage, keeping replication active can
> last merely minutes before falling behind again.
> 
> 2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read only
> connections
> 2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at
> 2B/9300 on timeline 1
> 2018-07-24 18:59:28 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
> 2018-07-24 19:15:37 GMTLOG:  incomplete startup packet
> 
> As you can see above, it lasted about half an hour before falling out of
> sync.

How can we see that from the above? The "incomplete startup packet" messages
are independent of streaming rep. I think you need to show us more logs.


> On the master, I have wal_keep_segments=128. What is happening when I see
> "incomplete startup packet" - is it simply the slave has fallen behind,
> and  cannot 'catch up' using the wal segments quick enough? I assume the
> slave is using the wal segments to replay changes and assuming there are
> enough wal segments to cover the period it cannot stream properly, it will
> eventually recover?

You might want to look into replication slots to ensure the primary
keeps the necessary segments, but not more, around.  You might also want
to look at wal_compression, to reduce the bandwidth usage.
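
For example, a sketch of both suggestions (the slot name is made up; the slot
is created on the primary and then referenced from the standby's recovery
configuration):

    -- on the primary:
    SELECT pg_create_physical_replication_slot('dc2_standby');
    ALTER SYSTEM SET wal_compression = on;
    SELECT pg_reload_conf();

    -- on the standby, in recovery.conf:
    --   primary_slot_name = 'dc2_standby'

A slot makes the primary retain exactly the WAL the standby still needs, so the
wal_keep_segments guesswork goes away; the flip side is that a standby that
stays down will cause WAL to pile up on the primary until the slot is dropped.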

Greetings,

Andres Freund



Re: logical replication snapshots

2018-07-25 Thread Andres Freund
On 2018-07-24 14:02:26 -0500, Dimitri Maziuk wrote:
> On 07/24/2018 01:43 PM, Andres Freund wrote:
> > On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
> >> On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
> >>>
> >>> Are you serious about us trying to diagnose a bug with this description?
> >>
> >> What do you want to know, exactly?
> > 
> > A recipe that we can follow and reproduce the issue.
> 
> The nutshell version is as I said: I pretty much followed the manual to
> create a test publication/subscription setup and let it run for a few
> weeks. I had to increase a few limits but otherwise everything's at the
> default settings as shipped in the rpms.

Are you really expecting us to be able to reproduce the problem based on
the above description? Our test suites do set up plain replication
setups, and the problem doesn't occur there.


> If you are asking for something other than those two extremes, please
> let me know what it is and I'll be happy to try and provide it.

A command by command reproducer would be good.


> I can send you postgres config files from both nodes and the entire
> database dump -- it's all public domain. I can `zfs send` you the
> snapshot of the entire /var/lib/pgsql/10 as is, too.

Without the genesis, that's probably not as helpful, unfortunately.


Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-25 Thread Andres Freund
Hi,

On 2018-07-09 12:06:21 -0700, Peter Geoghegan wrote:
> > I assume we'll have to backpatch this issue, so I think it'd probably a
> > good idea to put a specific CacheInvalidateHeapTuple() in there
> > explicitly in the back branches, and do the larger fix in 12. ISTM
> > there's some risks that it'd cause issues.  Will you tackle this?
> 
> Okay.

Any progress on that?

Peter, given that your patch made this more likely, and that you're a
committer these days, I'm opening an open items entry and assigning it to
you. Does that sound ok?

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-25 Thread Andres Freund
Hi,

On 2018-06-28 08:02:10 -0700, Andres Freund wrote:
> I believe this happens because there's currently no relcache
> invalidation registered for the main relation, until *after* the index
> is built. Normally it'd be the CacheInvalidateRelcacheByTuple(tuple) in
> index_update_stats(), which is called at the bottom of index_build().
> But we never get there, because the earlier error.  That's bad, because
> any relcache entry built *after* the CommandCounterIncrement() in
> CommandCounterIncrement() will now be outdated.
> 
> In the olden days we most of the time didn't build a relcache entry
> until after the index was built - but plan_create_index_workers() now
> does. I'm suspect there's other ways to trigger that earlier, too.
> 
> Putting in a CacheInvalidateRelcache(heapRelation); before the CCI in
> index_create() indeed makes the "borked relcache" problem go away.
> 
> 
> I wonder why we don't just generally trigger invalidations to an
> indexes' "owning" relation in CacheInvalidateHeapTuple()?
>   else if (tupleRelId == IndexRelationId)
>   {
>   Form_pg_index indextup = (Form_pg_index) GETSTRUCT(tuple);
> 
>   /*
>* When a pg_index row is updated, we should send out a 
> relcache inval
>* for the index relation.  As above, we don't know the shared 
> status
>* of the index, but in practice it doesn't matter since 
> indexes of
>* shared catalogs can't have such updates.
>*/
>   relationId = indextup->indexrelid;
>   databaseId = MyDatabaseId;
>   }

Tom, do you have any comments about the above?  The biggest argument
against hardcoding that a pg_index update also invalidates the
corresponding relation is that there are a lot of other things that we
could handle similarly. But I don't think any of those are as important
to relcache entries...

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-25 Thread Andres Freund
On 2018-07-25 19:27:47 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-06-28 08:02:10 -0700, Andres Freund wrote:
> >> I wonder why we don't just generally trigger invalidations to an
> >> indexes' "owning" relation in CacheInvalidateHeapTuple()?
> 
> > Tom, do you have any comments about the above?
> 
> It seems like an ugly and fragile hack, offhand.  I can see the point
> about needing to recompute the owning relation's index list, but there's
> no good reason why an update of a pg_index row ought to force that.
> You're using that as a proxy for creation or deletion of an index, but
> (at least in principle) pg_index rows might get updated for existing
> indexes.

Sure, but in at least some of those cases you'd also need to update the
list, as it could invalidate things like the determination of the oid,
pkey or whatnot index.  But yea, I don't think it's a *great* idea, just
an idea worth considering.


> Also, it's not clear to me why the existing places that force relcache
> inval on the owning table during index create/delete aren't sufficient
> already.  I suppose it's probably a timing problem, but it's not clear
> why hacking CacheInvalidateHeapTuple in this fashion fixes that, or why
> we could expect it to stay fixed.

Hm? As I point out in my email, there's simply nothing forcing an
invalidation in the relevant path? We register an invalidation, but only
*after* successfully building the entire index. So if the index build
fails, but we actually accessed the relcache (as e.g. the new
plan_create_index_workers() pretty much always does), the relcache entry
refers to the failed index build.  I think it's basically a failure to
follow proper invalidation procedures - when modifying something
affecting a relcache entry it's not OK to delay the
CacheInvalidateRelcacheByTuple() to after a point where we can fail
non-FATALly.

Greetings,

Andres Freund



Re: logical replication snapshots

2018-07-25 Thread Andres Freund
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
> On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
> > On 07/25/2018 11:16 AM, Adrian Klaver wrote:
> > 
> > > Where you using pg_export_snapshot() by any chance?:
> > 
> > If it's not in chapter 31 of TFM then I have not touched it. The only
> > notable thing I changed from 31.9 quickstart is that replica identity is
> > full on all tables (the data's a mess).
> 
> So no.
> 
> > 
> > > Where there any relevant error messages in the log before the database
> > > hung?
> > 
> > Do you seriously expect me to know what "relevant" is? ;) -- It's 7
> > days' worth of
> 
> Generally speaking relevancy increases the closer you get to the fall over
> event. So we are most interested in entries close to the point where the
> database stopped/failed.
> 
> > 
> > > 2018-07-23 09:43:12.851 CDT [7975] STATEMENT:  COPY 
> > > macromolecules."Entity_common_name" TO STDOUT
> > > 2018-07-23 09:43:12.925 CDT [7982] LOG:  logical decoding found 
> > > consistent point at 19/E6942408
> > > 2018-07-23 09:43:12.925 CDT [7982] DETAIL:  There are no running 
> > > transactions.
> > > 2018-07-23 09:43:12.935 CDT [7982] ERROR:  permission denied for schema 
> > > macromolecules
> > > 2018-07-23 09:43:12.935 CDT [7982] STATEMENT:  COPY 
> > > macromolecules."Assembly" TO STDOUT
> > > 2018-07-23 09:43:13.034 CDT [7987] LOG:  logical decoding found 
> > > consistent point at 19/E6942440
> > > 2018-07-23 09:43:13.034 CDT [7987] DETAIL:  There are no running 
> > > transactions.
> > > 2018-07-23 09:43:13.044 CDT [7987] ERROR:  permission denied for schema 
> > > macromolecules
> > > 2018-07-23 09:43:13.044 CDT [7987] STATEMENT:  COPY 
> > > macromolecules."Spectral_transition" TO STDOUT
> > > 2018-07-23 09:43:13.135 CDT [7990] LOG:  logical decoding found 
> > > consistent point at 19/E6942478
> > > 2018-07-23 09:43:13.135 CDT [7990] DETAIL:  There are no running 
> > > transactions.
> > > 2018-07-23 09:43:13.145 CDT [7990] ERROR:  permission denied for schema 
> > > macromolecules
> > > 2018-07-23 09:43:13.145 CDT [7990] STATEMENT:  COPY 
> > > macromolecules."Assembly_db_link" TO STDOUT
> > 
> > that ends with
> > 
> > > 2018-07-23 09:45:40.280 CDT [6032] LOG:  database system was interrupted; 
> > > last known up at 2018-07-06 17:04:28 CDT
> > > 2018-07-23 09:45:40.281 CDT [6033] FATAL:  the database system is 
> > > starting up
> > > 2018-07-23 09:45:40.282 CDT [6034] FATAL:  the database system is 
> > > starting up
> > ... ad infinitum ...
> > 
> > Would "permission denied" be relevant?
> 
> Logical decoding is something I am still learning. The "permission denied"
> would to me be relevant only to the extent that it seems to be provoking:

Yes, it looks related. Looks like logical rep was trying to perform the
initial sync of those tables, and couldn't due to permission errors.


> "LOG:  logical decoding found consistent point at 19/E6942440"
> DETAIL:  There are no running transactions."
> 
> Others with more experience in this area would need to fill whether that
> might account for the 13 million files in the ~snapshot/ directory.

That indicates there's some error handling issues to be resolved. Petr?

Greetings,

Andres Freund



Re: logical replication snapshots

2018-07-26 Thread Andres Freund
On 2018-07-26 12:54:19 -0500, Dimitri Maziuk wrote:
> >> "LOG:  logical decoding found consistent point at 19/E6942440"
> >> DETAIL:  There are no running transactions."
> >>
> >> Others with more experience in this area would need to fill whether that
> >> might account for the 13 million files in the ~snapshot/ directory.
> > 
> > That indicates there's some error handling issues to be resolved. Petr?
> 
> It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?

No, it doesn't. The snapshots are internal to logical replication.


> Would it be safe to delete every file in ~snapshot/ older than some
> X-number of minutes? (And what a good X might be?)

You shouldn't manually delete them. But an explicit 'checkpoint;' (and
all automatically triggered ones) should clean up all the ones that are
older. Is there a chance that you have some old leftover replication
slots around? Because that'd prevent the cleanup.

What does SELECT * FROM pg_replication_slots ; return?

Greetings,

Andres Freund



Re: Restore relhaspkey in PostgreSQL Version 11 Beta

2018-07-30 Thread Andres Freund
Hi,

On 2018-07-30 17:21:25 -0400, Melvin Davidson wrote:
> > it has never been the case that relhaspkey meant that the table
> > *currently* has a primary key.

> Hmmm, I guess it's a lot harder to fix "squishy semantics" from "True
> if the table has (or once had) a primary key" to "True if the table has
> a primary key after vacuum" rather than just dropping a column that has
> existed from version 7.2. So now I guess the policy is break code instead
> of fix documentation. That makes sense... NOT!

A large portion of the system catalogs (i.e. objects within
pg_catalog.*) are essentially internal implementation details and we'll
change them if it makes our lives easier. If you want stability use
information_schema which we'll try very hard to not ever break.  Keeping
random atavistic things around, would slow us down, which will be a
price everybody is paying.

Greetings,

Andres Freund



Re: Restore relhaspkey in PostgreSQL Version 11 Beta

2018-07-30 Thread Andres Freund
Hi,

On 2018-07-30 19:11:34 -0400, Melvin Davidson wrote:
> Of course. Would you be so kind as to point out where in the
> information_schema it
> indicates if a table has a primary key or not. Oh wait, now I
> remember... no place.

As Adrian pointed out, that's wrong. It's in information_schema.  You're
pretty damn antagonistic while asking for things.
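
For instance, a sketch of finding out whether a table has a primary key using
only information_schema (the table name is hypothetical):

    SELECT table_schema, table_name
    FROM   information_schema.table_constraints
    WHERE  constraint_type = 'PRIMARY KEY'
           AND table_name = 'mytable';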


> > Keeping random atavistic things around, would slow us down, which will be
> > a price everybody is paying.

> Random atavistic things? I hardly think relhaspkey is random. It's been
> there since version 7.2.
> Exactly how does keeping it around slow you/us down?

Being old doesn't imply it's not superfluous and/or doesn't slow us
down. There've been a number of discussions and bug reports about the
inaccuracy - even though it's documented! - in the last few
years. That alone costs time. Additionally it's code we need to
maintain.

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-31 Thread Andres Freund
On 2018-07-31 18:48:23 -0700, Peter Geoghegan wrote:
> On Mon, Jul 9, 2018 at 11:32 AM, Andres Freund  wrote:
> > I assume we'll have to backpatch this issue, so I think it'd probably a
> > good idea to put a specific CacheInvalidateHeapTuple() in there
> > explicitly in the back branches, and do the larger fix in 12. ISTM
> > there's some risks that it'd cause issues.
> 
> What do you think of the attached?
> 
> The is a new CacheInvalidateRelcache() call, rather than a new call to
> CacheInvalidateRelcacheByTuple(), but those two things are equivalent
> (I assume that you actually meant to say
> CacheInvalidateRelcacheByTuple(), not CacheInvalidateHeapTuple()).

Right.


> From 18ffbcc81c75525c73930ad3b5a63ae0873d2381 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Tue, 31 Jul 2018 18:33:30 -0700
> Subject: [PATCH] Add table relcache invalidation to index builds.
> 
> It's necessary to make sure that owning tables have a relcache
> invalidation prior to advancing the command counter to make
> newly-entered catalog tuples for the index visible.  inval.c must be
> able to roll back the local caches in the event of transaction abort.
> There is only a problem when CREATE INDEX transactions abort, since
> there is a generic invalidation once we reach index_update_stats().
> 
> This bug is of long standing.  Problems were made much more likely by
> the addition of parallel CREATE INDEX (commit 9da0cc35284), but it is
> strongly suspected that similar problems can be triggered without
> involving plan_create_index_workers().

Maybe expand a bit on this by saying that it's more likely "because
plan_create_index_workers() triggers a relcache entry to be (re-)built,
which previously did only happen in edge cases" or such?

> Bug diagnosed by Andres Freund.
> 
> Author: Peter Geoghegan
> Reported-By: Luca Ferrari
> Discussion: 
> https://postgr.es/m/CAKoxK+5fVodiCtMsXKV_1YAKXbzwSfp7DgDqUmcUAzeAhf=h...@mail.gmail.com
> Backpatch: 9.3-
> ---
>  src/backend/catalog/index.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> index 8b276bc430..7036d72bd6 100644
> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -1137,6 +1137,19 @@ index_create(Relation heapRelation,
>   InvokeObjectPostCreateHookArg(RelationRelationId,
> 
> indexRelationId, 0, is_internal);
>  
> + /*
> +  * Invalidate the relcache for the owning table, so that any local
> +  * relcache entry for the new index built after CommandCounterIncrement
> +  * won't become inconsistent in the event of transaction abort.  inval.c
> +  * must be able to roll back the local caches when aborting.  Clearly it
> +  * isn't useful to create an index whose definition results in a 
> relcache
> +  * entry for the index being allocated before commit, but the local 
> caches
> +  * cannot be allowed to become inconsistent under any circumstances.
> +  * (There is only an issue when transactions abort because we'll reach
> +  * index_update_stats when they commit.)
> +  */

Not a fan of this comment. It doesn't really explain that well why it's
needed here, but then goes on to a relatively general explanation of why
cache invalidation is necessary.  Why not just go for something like
"register relcache invalidation on the indexes' heap relation, to
maintain consistency of its index list"?

I wonder if it wouldn't be more appropriately placed closer to the
UpdateIndexRelation(), given that that's essentially what necessitates
the relcache flush?

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-07-31 Thread Andres Freund
Hi,

On 2018-07-31 19:29:37 -0700, Peter Geoghegan wrote:
> On Tue, Jul 31, 2018 at 7:02 PM, Andres Freund  wrote:
> > Not a fan of this comment. It doesn't really explain that well why it's
> > needed here, but then goes on to a relatively general explanation of why
> > cache invalidation is necessary.  Why not just go for something like
> > "register relcache invalidation on the indexes' heap relation, to
> > maintain consistency of its index list"?
> 
> That seems much more generic to me!
> 
> The comment is supposed to convey that the stuff within
> index_update_stats() isn't enough because of xact abort specifically.
> SI invalidation is very much part of the index_update_stats() contract
> already.

I don't think that's particularly relevant. We should always register an
invalidation before the relevant CommandCounterIncrement(), because that
is what makes catalog changes visible, and therefore requires
registering invalidations for coherency.  That we possibly later, in
another CCI cycle, reach yet another invalidation point, doesn't matter
much (except that it can hide a lot of omissions).  By your logic we'd
need an equivalent of the above comment at most of those places.

Greetings,

Andres Freund



Re: Template0 datfrozenxid age is 160million and progressing

2018-08-01 Thread Andres Freund
Hi,

On 2018-08-01 10:24:24 -0400, Tom Lane wrote:
> Rijo Roy  writes:
> > Thanks for the advice. Autovacuum is disabled in the environment and I 
> > cannot take a call on enabling it back. I can only run manual vacuum on the 
> > database as the dev team fears autoanalyze changing the query performance. 
> > Do you still think we don't need to take any actions. 
> 
> IMO, the action you need to take is enabling autovacuum.  We've
> seen many many people go down the path you are taking, and it's
> generally led to no good in the end.  Manual vacuuming tends
> to miss stuff, and it cannot react adequately to activity spikes.

But it shouldn't matter here, autovacuum will start regardless, no?

Greetings,

Andres Freund



Re: Template0 datfrozenxid age is 160million and progressing

2018-08-01 Thread Andres Freund
On 2018-08-01 12:07:16 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-08-01 10:24:24 -0400, Tom Lane wrote:
> >> IMO, the action you need to take is enabling autovacuum.  We've
> >> seen many many people go down the path you are taking, and it's
> >> generally led to no good in the end.  Manual vacuuming tends
> >> to miss stuff, and it cannot react adequately to activity spikes.
> 
> > But it shouldn't matter here, autovacuum will start regardless, no?
> 
> Sure, once it decides that emergency anti-wraparound vacuuming is
> necessary.  I really doubt the OP wants that to happen; it's the
> exact opposite of non-intrusive.

That's solely what would trigger it were autovacuum enabled, too? I've
complained about "emergency anti-wraparound" beeing anything but
emergency (they're largely unavoidable unless you manually script it),
but they're what happen once autovacuum_freeze_max_age is reached, and
that's the only trigger for vacuuming old relations independent of other
activity?

Greetings,

Andres Freund



Re: Template0 datfrozenxid age is 160million and progressing

2018-08-01 Thread Andres Freund
On 2018-08-01 12:20:15 -0400, Alvaro Herrera wrote:
> On 2018-Aug-01, Andres Freund wrote:
> 
> > On 2018-08-01 12:07:16 -0400, Tom Lane wrote:
> > > Andres Freund  writes:
> > > > On 2018-08-01 10:24:24 -0400, Tom Lane wrote:
> > > >> IMO, the action you need to take is enabling autovacuum.  We've
> > > >> seen many many people go down the path you are taking, and it's
> > > >> generally led to no good in the end.  Manual vacuuming tends
> > > >> to miss stuff, and it cannot react adequately to activity spikes.
> > > 
> > > > But it shouldn't matter here, autovacuum will start regardless, no?
> > > 
> > > Sure, once it decides that emergency anti-wraparound vacuuming is
> > > necessary.  I really doubt the OP wants that to happen; it's the
> > > exact opposite of non-intrusive.
> > 
> > That's solely what would trigger it were autovacuum enabled, too? I've
> > complained about "emergency anti-wraparound" beeing anything but
> > emergency (they're largely unavoidable unless you manually script it),
> > but they're what happen once autovacuum_freeze_max_age is reached, and
> > that's the only trigger for vacuuming old relations independent of other
> > activity?
> 
> With a small database like template0, it doesn't matter.  The vacuuming
> is going to be over before OP realizes it has happened anyway.
> Certainly having it happen on a normal-sized table can become
> problematic, but presumably OP has taken steps to avoid it when
> disabling autovacuum (which is why only template0 is getting into
> trouble.)

Right.


> I think emergency vacuum should behave differently (not scan indexes,
> just apply HOT page prune and clear old XIDs/multixacts), which would
> make it much faster, but that's a separate line of thought (and of
> development).

What I'd love is for freeze_max_age triggered vacuums *not* to be
emergency vacuums. They should just be normal ones. There should be a
separate GUC that triggers the emergency bit. It's really annoying to
get a hard-to-kill anti-wraparound autovacuum on an insert-only table,
where it's the only thing that'll trigger the autovacuum.

Greetings,

Andres Freund



Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function

2018-08-02 Thread Andres Freund
Hi,

On 2018-08-02 13:00:16 -0700, Peter Geoghegan wrote:
> On Tue, Jul 31, 2018 at 9:00 PM, Andres Freund  wrote:
> > I don't think that's particularly relevant. We should always register an
> > invalidation before the relevant CommandCounterIncrement(), because that
> > is what makes catalog changes visible, and therefore requires
> > registering invalidations for coherency.
> 
> Fair enough. How about the attached revision?

Looks good to me, minus some optional nitpicks in the commit message
below.


> From 156be401c38f7884ed65641aa202b72b698f3e91 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Tue, 31 Jul 2018 18:33:30 -0700
> Subject: [PATCH v2] Add table relcache invalidation to index builds.
> 
> It's necessary to make sure that owning tables have a relcache
> invalidation prior to advancing the command counter to make
> newly-entered catalog tuples for the index visible.  inval.c must be
> able to roll back the local caches in the event of transaction abort.
> There is only a problem when CREATE INDEX transactions abort, since
> there is a generic invalidation once we reach index_update_stats().

s/roll back/maintain consistency of/.  I'd also perhaps weaken the next
sentence, by adding a 'practically' or such, as a memory allocation
failure or such before the index build starts could cause this as well.

Greetings,

Andres Freund



Re: Pg_rewind cannot load history wal

2018-08-04 Thread Andres Freund
On 2018-08-04 10:54:22 +0100, Simon Riggs wrote:
> On 4 August 2018 at 07:56, Michael Paquier  wrote:
> >> Sounds like we should write a pending timeline change to the control
> >> file and have pg_rewind check that instead.
> >>
> >> I'd call this a timing bug, not a doc issue.
> >
> > Well, having pg_rewind enforce a checkpoint on the promoted standby
> > could cause a performance hit as well if we do it mandatorily as if
> > there is delay between the promotion and the rewind triggerring a
> > checkpoint could have already happen.  So it is for me a documentation
> > bug first regarding the failover workflow, and potentially a patch for a
> > new feature which makes pg_rewind trigger directly a checkpoint.
> 
> pg_rewind doesn't work correctly. Documenting a workaround doesn't change 
> that.

Especially because most people will only understand this after they've
been hit, as test scenarios will often just be quick enough.

Greetings,

Andres Freund



Re: Walsender waiting on SnapbuildSync

2018-08-05 Thread Andres Freund
On 2018-08-04 13:34:04 -0600, Brent Kerby wrote:
> On Postgres 10.3 (on AWS RDS), I am running logical decoding using the
> test_decoding output plugin, and every few minutes I am seeing pauses in
> the stream, unrelated to any large transactions. About once every hour or
> two, the pause is long enough that the database disconnects my client due
> to exceeding wal_sender_timeout (30 seconds -- the RDS default value);
> after reconnecting it is able to make progress again. My client is using
> the streaming replication protocol via pgjdbc (with a status interval of 1
> second). What I'm seeing is that during such a pause, the server is not
> sending any data to the client:
> 
> - pg_stat_replication.sent_lsn stops advancing
> - My client is blocking in a call to PGReplicationStream.read()
> - pg_stat_activity shows that the walsender process has a wait_event of
> 'SnapbuildSync'.
> 
> In this scenario, it makes sense that the client would be timed out: pgjdbc
> only sends feedback to the server at the beginning of a call to
> PGReplicationStream.read(), so if a single call blocks a long time, never
> receiving any data from the server, then the client would stop sending
> feedback to the server, causing timeout.
> 
> My question is why might the server be spending so much time waiting on
> SnapbuildSync? The docs describe this event as follows:
> 
> "IO / SnapbuildSync / Waiting for a serialized historical catalog snapshot
> to reach stable storage."

> I gather that this is related to statistics collection, but I'm not
> understanding why a walsender process would wait on such an event nor why
> it would take such a long time. Any ideas?

It's *not* related to statistics collection. It's fsyncing a few *small*
files to disk. On a first blush this seems to indicate that your storage
system is quite overloaded?

The relevant piece of code is:

/*
 * fsync the file before renaming so that even if we crash after this we
 * have either a fully valid file or nothing.
 *
 * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
 * some noticeable overhead since it's performed synchronously during
 * decoding?
 */
pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC);
if (pg_fsync(fd) != 0)
{
int save_errno = errno;

CloseTransientFile(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not fsync file \"%s\": %m", 
tmppath)));
}
pgstat_report_wait_end();
CloseTransientFile(fd);

Greetings,

Andres Freund



Re: Logical replication from standby

2018-08-16 Thread Andres Freund
On 2018-08-16 13:35:06 +0200, Andreas Joseph Krogh wrote:
> Any plans to support $subject?

There's LOADS of discussion on this on the lists.



Re: Logical replication from standby

2018-08-16 Thread Andres Freund
Hi,

On 2018-08-16 14:32:34 +0200, Andreas Joseph Krogh wrote:
> On Thursday, 16 August 2018 at 14:18:12, Andres Freund <and...@anarazel.de> wrote:
> On 2018-08-16 13:35:06 +0200, Andreas Joseph Krogh wrote:
>  > Any plans to support $subject?
> 
>  There's LOADS of discussion on this on the lists.
>  
> I couldn't find anything specifically regarding "from standby", other than 
> previous posts by me, which didn't really result in anything.

See e.g.
http://archives.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com
and also https://commitfest.postgresql.org/15/788/ etc.

Greetings,

Andres Freund



Re: COPY FROM - to avoid WAL generation

2018-08-21 Thread Andres Freund
Hi,

On 2018-08-21 15:00:03 +, Ravi Krishna wrote:
> In a recent thread of mine I learned something very interesting.  If a table 
> is created and data is loaded via COPY FROM within the same transaction, then 
> PG will be smart enough to not generate WAL logs because all it needs to do 
> is to track the status of the transaction and let the data load go to the new 
> data file created for the table.  If committed, the table is released for 
> other sessions; if rolled back, vacuum will delete the data file later on.
> I tested it as follows for a table with 50 million rows.  No indexes.

Please note this is only the case if wal_level = minimal. If replication
(or PITR) is supported, that mode can't be used, because the data has to
go into the WAL.

Were you using wal_level = minimal?

(FWIW, it's not VACUUM that'd unlink the data in case of failure, but
that doesn't really matter much).
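
To spell out the pattern under discussion, a minimal sketch (table name and
file path are made up; the WAL skip only applies with wal_level = minimal,
because the table is created in the same transaction as the load):

    BEGIN;
    CREATE TABLE bulk_target (id int, payload text);
    COPY bulk_target FROM '/tmp/data.csv' WITH (FORMAT csv);
    COMMIT;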

Greetings,

Andres Freund



Re: "checkpointer process" is consuming more memory. How to control it?

2018-08-23 Thread Andres Freund
On 2018-08-22 11:14:48 -0300, Avi Vallarapu wrote:
> Seams like you are with the default or fairly less checkpoint_* settings.
> 
> You need to look at several patterns to understand the best values for
> these settings.
> 
> For now, i can say 300 seconds for checkpoint_timeout could be fairly very
> less. Try increasing that with a reload.
> 
> See this =>
> https://www.percona.com/blog/2018/06/15/tuning-postgresql-for-sysbench-tpcc/
> 
> Please do note that delaying checkpointing could also increase the time for
> crash recovery/shutdown, etc.

This seems to have not that much to do with the problem the OP has, no?

I suggest just using pmap -d $process_id_of_checkpointer;
that'll likely show that it doesn't actually use much memory, but that
the RSS originates from the shared memory it has touched.

- Andres



Re: Autovacuum degrades all other operations by keeping all buffers dirty?

2018-08-31 Thread Andres Freund
On 2018-08-31 19:31:47 -0300, Alvaro Herrera wrote:
> On 2018-Aug-31, David Pacheco wrote:
> 
> > From reading the 9.6.3 source, it looks like the autovacuum process
> > itself is single-threaded, and it reads pages essentially linearly
> > from the relation (possibly skipping some).  When the autovacuum
> > process needs to modify a page, it doesn't write it directly, but
> > rather marks the buffer dirty.  The page will be written later,
> 
> Unless there's some bug, there is a BufferAccessStrategy that only lets
> a few dozen buffers go unwritten before the autovac worker process
> itself is forced to write some.

I've not re-checked, but I'm not sure that's true if the buffer is
already in s_b, which it'll be for many workloads.

Greetings,

Andres Freund



Re: error in vacuum

2018-09-01 Thread Andres Freund
On 2018-09-01 19:02:18 -0700, Dave Peticolas wrote:
> On Sat, Sep 1, 2018 at 6:56 PM Tom Lane  wrote:
> 
> > Dave Peticolas  writes:
> > >>> I'm seeing this error on a table repeatedly:
> > >>> found update xid 3433096759 from before
> > relfrozenxid 3498422955
> >
> > >> This is a primary, 9.6.8 to be exact.
> >
> > This message is known to be associated with several data-corruption bugs
> > we've fixed recently.  You should update to current (9.6.10) sooner not
> > later.  I don't think that will cure existing corruption for you, but
> > it should prevent things from getting worse.
> >
> 
> Ok, thank you for that information I will do so. The dump/truncate/restore
> I mentioned in a later post seems to fix the corruption in a restored DB.
> Given the nature of the bug do you think that would fix that table?

Yes, it's fairly likely that it does. You might even get around it by
just doing a no-op update of the whole table.  Nevertheless you need to
upgrade, or more of that corruption can occur.
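
The no-op update mentioned above could look something like this (table and
column names are hypothetical; it creates a fresh row version for every row,
at the cost of some bloat until the table is vacuumed):

    UPDATE broken_table SET some_col = some_col;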

Greetings,

Andres Freund



Re: WARNING: could not flush dirty data: Function not implemented

2018-09-02 Thread Andres Freund
Hi,

On 2018-09-02 22:57:55 +, Ravi Krishna wrote:
> Ubuntu 18.04 as Windows bash
> 
> Distributor ID: Ubuntu
> Description:Ubuntu 18.04.1 LTS
> Release:18.04
> Codename:   bionic
> 
> 
> PG 10.5.1
> 
> postgres@ravi-lenovo:~$ psql -d postgres
> psql (10.5 (Ubuntu 10.5-1.pgdg16.04+1))
> 
> A CREATE DATABASE statement spewed out 
> 
> WARNING:  could not flush dirty data: Function not implemented
> 
> many times, but the db was created.
> 
> What exactly is this?

That means that the linux emulation by microsoft isn't good enough.  You
can work around it by setting checkpoint_flush_after=0 and
wal_writer_flush_after=0.
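
Concretely, something like this should do it; both settings are reloadable, so
no restart is needed:

    ALTER SYSTEM SET checkpoint_flush_after = 0;
    ALTER SYSTEM SET wal_writer_flush_after = 0;
    SELECT pg_reload_conf();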


> A simple search tells that this is an old problem and my Windows has
> necessary WSL and other subsystem for this error to not show up.

Please note that nobody has verified that postgres works correctly via
the emulation stuff MS is doing.  There is a native version of postgres
for windows however, and that is tested (and exercised by a lot of
installations).


Greetings,

Andres Freund



Re: WARNING: could not flush dirty data: Function not implemented

2018-09-02 Thread Andres Freund
Hi,

On 2018-09-02 19:29:49 -0400, Tom Lane wrote:
> Ravi Krishna  writes:
> > A simple search tells that this is an old problem and my Windows has 
> > necessary WSL and other subsystem for this error to not show up.
> 
> If this is on Ubuntu, I don't understand why you're talking
> about Windows.

The OP said "Ubuntu 18.04 as Windows bash" - so I assume this is
postgres compiled as a linux binary is running on MS's new-ish linux
emulation.

Greetings,

Andres Freund



Re: commit timestamps and replication

2018-09-14 Thread Andres Freund
Hi,

On 2018-09-14 21:01:54 +0200, Andreas Brandl wrote:
> we're discussing a system design and it boils down to a question
> regarding commit timestamps (in the sense of [1],
> track_commit_timestamp='on'):
> 
> We have a insert-only (append-only) table. Do commit timestamps on
> this table constitute the same order in which records become visible
> on a secondary (streaming replication)? Is there any reason why this
> might not be the case?
> 
> To put this differently: If a client reads from a secondary and reads
> the "latest" record by commit timestamp (commit time T). Is it safe to
> assume that there won't ever be another record with a lower commit
> timestamp < T, that only shows up on the secondary after that read?

No, that's not safe to assume.  The order of visibility on the standby
is determined by the point the WAL record is inserted into the log.  The
commit timestamp has to be determined *before* that, so it can be included
in the WAL record:

SetCurrentTransactionStopTimestamp();

XactLogCommitRecord(xactStopTimestamp,
nchildren, children, nrels, rels,
nmsgs, invalMessages,
RelcacheInitFileInval, forceSyncCommit,
MyXactFlags,
InvalidTransactionId, NULL /* plain commit */ );

Those two things happen sequentially, *WITHOUT* a lock preventing
concurrent transactions to do the same.


> I'm aware of concerns regarding physical time, time adjustments and so
> on, so the question here assumes those things never happen.

Note that in addition to that you cannot be certain that concurrent
transactions *in the same postgres cluster* have perfectly coherent
timestamps - clock drift between CPU sockets does still sometimes occur,
and used to be extremely common.

Greetings,

Andres Freund



Re: heads up on large text fields.

2018-09-21 Thread Andres Freund
On 2018-09-21 18:28:37 -0600, Rob Sargent wrote:
> "/tmp/regen.file" gets very large, very fast and I have to
> pg_terminate_backend.  Tried this three times, once using "\o test.blob"
> instead.
> 
>h009357:loader$ ls -ltr
>total 2048
>-rwxr-xr-x 1 u0138544 camplab 10002460672 Sep 21 15:49 test.blob

I suspect that laying out such wide columns probably creates a lot of
pain.  I'd try \copy, or doing the query after \a.
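
For example, either of these sidesteps psql's aligned formatting (file path,
table and column names are made up):

    \copy (SELECT big_text_col FROM some_table WHERE id = 42) TO '/tmp/regen.file'

    -- or switch to unaligned output and redirect before running the query:
    \a
    \o /tmp/regen.file
    SELECT big_text_col FROM some_table WHERE id = 42;
    \o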

Greetings,

Andres Freund



Re: backend_xmin in pg_stat_replication

2018-10-01 Thread Andres Freund
Hi,

On 2018-10-01 12:20:26 +0200, Torsten Förtsch wrote:
> if I understand it correctly, backend_xmin in pg_stat_replication is the
> xmin that's reported back by hot_standby_feedback. Given there are no
> long-running transactions on the replica, I presume that value should be
> pretty close to the xmin field of any recent snapshots on the master. This
> is true for all my databases but one:
> 
> select application_name,
>txid_snapshot_xmin(txid_current_snapshot()),
>backend_xmin::TEXT::BIGINT,
> 
>  txid_snapshot_xmin(txid_current_snapshot())-backend_xmin::TEXT::BIGINT
>   from pg_stat_replication;
> 
>  application_name | txid_snapshot_xmin |  backend_xmin |     ?column?
> ------------------+--------------------+---------------+---------------
>  xx               |      6 957 042 833 | 2 662 075 435 | 4 294 967 398

I don't think the calculation you're doing here is correct.
backend_xmin is an xid (max 2^32-1), whereas txid_snapshot_xmin returns
an xid *with* epoch (max 2^64-1). What you're measuring here is simply
the fact that the xid counter has wrapped around.
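
To put numbers on it: 2^32 = 4 294 967 296, so the reported difference of
4 294 967 398 is really only 102 xids once the epoch is stripped off. A sketch
of a comparison that removes the epoch first (assuming the two values don't
straddle a wraparound boundary at the moment you look):

    select application_name,
           txid_snapshot_xmin(txid_current_snapshot()) % 4294967296
             - backend_xmin::text::bigint as xmin_lag
      from pg_stat_replication;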

Greetings,

Andres Freund



Re: COPY threads

2018-10-10 Thread Andres Freund
On 2018-10-10 18:50:02 +0200, Laurenz Albe wrote:
> Rob Sargent wrote:
> > Can anyone here tell me whether or not the CopyManager facility in JDBC 
> > via org.postgresql:postgresql:42.1.4 is internally multithreaded? 
> > Running on CentOS 7 (all participants), java8, postgres 10.5
> 
> It isn't, and there would not be much reason for it to be, as COPY
> in PostgreSQL cannot be parallelized.

s/cannot/is not/.

You obviously can just copy the data into postgres over multiple
connections if you need to speed COPY up. But that requires splitting up
the data on the client side.
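
A sketch of that approach, assuming the input has already been split into
chunk files on the client (table and file names are made up); each \copy runs
in its own connection/psql session, all loading the same table concurrently:

    -- session 1:
    \copy big_table FROM 'chunk_01.csv' WITH (FORMAT csv)
    -- session 2, at the same time:
    \copy big_table FROM 'chunk_02.csv' WITH (FORMAT csv)
    -- ... one session per chunk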

Greetings,

Andres Freund



Re: COPY threads

2018-10-10 Thread Andres Freund
Hi,

On 2018-10-10 17:02:59 -0400, Ravi Krishna wrote:
> I am talking about pg_restore which is COPY FROM, which takes exclusive lock 
> on the table
> while the data is loaded into the table.

pg_restore doesn't take locks on the table for the COPY, it does so
because creating the table takes an exclusive lock.

Greetings,

Andres Freund



Re: COPY threads

2018-10-10 Thread Andres Freund



On October 10, 2018 2:15:19 PM PDT, Ravi Krishna  wrote:
>> 
>> pg_restore doesn't take locks on the table for the COPY, it does so
>> because creating the table takes an exclusive lock.
>
>
>Interesting.  I seem to recollect reading here that I can't have
>concurrent COPY on the same table because of the lock.
>To give an example:
>
>If I have a large file with say 400 million rows, can I first split it
>into 10 files of 40 million rows each and then fire up 10 different
>COPY sessions , each reading from a split file, but copying into the
>same table.  I thought not.  It will be great if we can do this.

Yes, you can.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-15 00:15:53 +0530, bhargav kamineni wrote:
> Hi Team,
> 
> I am getting this ERROR while starting my second slave server
> 
> PANIC:  too many replication slots active before shutdown
> HINT:  Increase max_replication_slots and try again.
> 
> max_replication_slots on my master is 2 and one of them is already active
> for another slave,
> do i need to increase this parameter for the need of working of another
> slave ? if so whats the reason ?

You're getting that error on the standby, not the primary, right?   In
all likelihood the problem is that you copied over replication slots to
your standby server, but that *on the standby* you haven't set
max_replication_slots high enough.

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

Please try to quote properly.

On 2018-10-15 01:57:53 +0530, bhargav kamineni wrote:
> I got this on the standby, could you please explain in detail about
>  --*but that *on the standby* haven't set max_replication_slots high enough*
> .

What is max_replication_slots set to on the new standby?

If you created the new basebackup using rsync, and didn't exclude
pg_replslot, it'll have copied the slots from the primary.  The standby
thus needs a high enough max_replication_slots to work with them.
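
Once the standby is up, a quick sanity-check sketch for what got copied
over:

-- Slots that came along with the rsync; nothing should be using them on
-- the standby, so expect active = false.
SELECT slot_name, slot_type, active
  FROM pg_replication_slots;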

- Andres



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

As I just wrote:
> On Mon, 15 Oct 2018 at 02:40, Andres Freund  wrote:
> > Please try to quote properly.

On 2018-10-15 02:45:51 +0530, bhargav kamineni wrote:
> Yeah i have used rsync , Got it now will increase the max_replication_slots
> to high enough , Thank you Andres Freund :-)

You probably afterwards want to drop those slots from the new standby.
See the following section from the docs:

https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-DATA
"It is often a good idea to also omit from the backup the files within the 
cluster's pg_replslot/ directory, so that replication slots that exist on the 
master do not become part of the backup. Otherwise, the subsequent use of the 
backup to create a standby may result in indefinite retention of WAL files on 
the standby, and possibly bloat on the master if hot standby feedback is 
enabled, because the clients that are using those replication slots will still 
be connecting to and updating the slots on the master, not the standby. Even if 
the backup is only intended for use in creating a new master, copying the 
replication slots isn't expected to be particularly useful, since the contents 
of those slots will likely be badly out of date by the time the new master 
comes on line."

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-15 03:18:35 +0530, bhargav kamineni wrote:
> >  You probably afterwards want to drop those slots from the new standby.
> > See the following section from the docs:
> 
> >
> https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-DATA
> "It is often a good idea to also omit from the backup the files within the
> cluster's pg_replslot/ directory, so that replication slots that exist on
> the master do not become part of the backup. Otherwise, the subsequent use
> of the backup to create a standby may result in indefinite retention of WAL
> files on the standby, and possibly bloat on the master if hot standby
> feedback is enabled, because the clients that are using those replication
> slots will still be connecting to and updating the slots on the master, not
> the standby. Even if the backup is only intended for use in creating a new
> master, copying the replication slots isn't expected to be particularly
> useful, since the contents of those slots will likely be badly out of date
> by the time the new master comes on line."
> 
> Since i already synced pg_replslot to the standby, is it okay if i remove
> the pg_replslot directory before starting the postgresql service?

You can do that, but the nicer way probably is to just remove them via
SQL once started.  Something like
SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots;

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-14 17:52:30 -0400, Ravi Krishna wrote:
> The best part in Db2 HADR is that when the standby is catching up with the 
> master after a long time, it will
> start from the last LSN and fetch it from the primary WAL directory (active 
> logs ). If not found, it will look for it
> in the archived logs and start applying from there until the current
> point.

Uh, it works precisely the same in postgres.


> No rep slot business.

You're not forced to use slots.  Their purpose is to let you force the
primary to keep the necessary resources around, which also allows
getting rid of the archive in some setups.
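
For completeness, a minimal sketch (the slot name is made up): a
physical slot for a standby is created on the primary with

SELECT pg_create_physical_replication_slot('standby_1');

and the standby then references it via primary_slot_name in its recovery
configuration.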

Greetings,

Andres Freund



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
Hi,

On 2018-10-17 11:17:11 -0400, Ravi Krishna wrote:
> It turned out that enabling ODBC trace was causing PG to crash.  Once 
> disabled it started working, but found another issue.
> All object names in DB2 are assumed to be upper case.  odbc_fdw sends queries 
> like this
> 
> 
> select "fld1","fld2" from "schema_name"."table_name".
> 
> So the foreign table in PG has to be created in upper case within quotes.  It is 
> a bit of an annoyance.

Please note that odbc_fdw is not maintained by the postgresql
developers, but a separate project.

Greetings,

Andres Freund



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund



On October 17, 2018 10:57:37 AM PDT, Ravi Krishna  wrote:
>
>> 
>> Please note that odbc_fdw is not maintained by the postgresql
>developers, but a separate project.
>
>
>Translation: You are on your own.  We are hoping this will make our
>migration out of DB2 quicker. Oh well.

Come on. We can't realistically support & debug every random project extending 
postgres, nor do we have control over them. And you're not necessarily on your 
own: you could report the issue to odbc_fdw's authors/github tracker, or pay a 
company for support.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
Hi,

On 2018-10-17 14:12:01 -0400, Ravi Krishna wrote:
> > 
> > Come on. We can't realistically support & debug random postgres extending 
> > projects, nor do we have control over them. And you're not necessarily on 
> > your own, you could report the issue to odbcfdw's authors/github tracker.  
> > Or pay a company for support.
> > 
> 
> On a related note is fdw for Oracle and SQLServer supported by the community ?

They're not postgresql.org projects, if that's what you're asking. IOW,
they probably have their own communities, it's just not this one.

Greetings,

Andres Freund



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
On 2018-10-17 11:02:40 -0700, Adrian Klaver wrote:
> On 10/17/18 10:57 AM, Ravi Krishna wrote:
> > 
> > > 
> > > Please note that odbc_fdw is not maintained by the postgresql developers, 
> > > but a separate project.
> > 
> > 
> > Translation: You are on your own.  We are hoping this will make our 
> > migration out of DB2 quicker. Oh well.
> > 
> 
> No it means you need to take this up with the project maintainer(s) as they
> are the folks that can determine whether it is fixable or not:
> 
> https://github.com/ZhengYang/odbc_fdw

Looks like https://github.com/CartoDB/odbc_fdw is more recently maintained.



Re: found xmin x from before relfrozenxid y

2018-10-21 Thread Andres Freund
Hi,

On 2018-10-21 10:24:16 -0400, Tom Lane wrote:
> Johannes Graën  writes:
> > after upgrading to version 11, I see the error pattern "found xmin x
> > from before relfrozenxid y" in different databases on different hosts.
> > From https://www.postgresql.org/docs/10/static/release-10-3.html, I
> > learned that this was an error caused by pg_upgrade, which apparently
> > had been fixed in 10.3. This page also states that refreshing the
> > affected materialized view non-concurrently would fix the problem.
> > My question is now how to infer the affected materialized view from the
> > error message. Is there a way to tell which one to refresh from the xmin
> > or relfrozenxid value?
> 
> No :-(.  I wonder why in the world we didn't make that error message
> include the relation and block number the tuple was found in.

Because it was a really complicated bugfix already, I don't think the
answer is more complicated than that.


> (Well, I see the short answer: the code layer throwing the error
> doesn't know.  But that could be fixed easily enough.)

I wonder if the better approach wouldn't be to add an errcontext for
vacuum, where we continually update the block number etc. There's plenty
of different sources of corruption that'd potentially cause debug
messages or errors, and that should get most of them.
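
FWIW, as a rough way to narrow down candidates in the meantime (an
untested sketch): the second value in the error message is the
relation's relfrozenxid, so comparing it against the materialized views'
pg_class entries may point at the right one:

-- List materialized views with their relfrozenxid, to compare against
-- the "relfrozenxid y" value from the error message.
SELECT oid::regclass AS matview, relfrozenxid
  FROM pg_class
 WHERE relkind = 'm';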

Greetings,

Andres Freund



Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Andres Freund
Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That's a pretty common problem in the real world, with or without
problems like corrupted indexes. Autovacuum's scheduling is just too
simplistic to avoid that.

Common problems:

- If one table in one database has an xmin older than
  autovacuum_freeze_max_age, the autovacuum launcher continuously throws
  more workers at that database. Even though there's a lot of work to be
  done in other databases.

  That's because do_start_worker() ignores the "adl_next_worker"
  mechanism, and *always* starts more workers for the database with the
  oldest datfrozenxid (same with datminmxid), and because we explicitly
  don't start workers for other databases ("/* ignore not-at-risk DBs
  */").

  That's especially bad if that database has a long-running transaction
  preventing cleanup from happening - the vacuuming will not be able to
  advance the horizon, and all others are starved.


- If there's a table "early in pg_class", and vacuum on it fails, it
  will prevent vacuuming other tables. It's pretty common for vacuum to
  fail often for a table, e.g. because it's sometimes exclusively locked,
  which then causes autovacuum to kill itself.  There's absolutely no
  mechanism for autovacuum workers to skip over that table for a while,
  leading to all other tables in the database not being vacuumed, unless
  there happens to be a second worker in the database while the first
  one's vacuum hasn't failed yet.

  This obviously also applies to the corrupted index case.

  The 'kill itself' behaviour is exacerbated by lazy_truncate_heap()'s
  exclusive lock - which will obviously trigger other backends to send
  cancellation requests. There's unfortunately a number of places where
  that leads us to just throw away all the work done, and not update
  pg_class.relfrozenxid/datfrozenxid.


- Anti-wraparound vacuums are more impactful (don't cancel themselves
  upon lock conflicts, cause more IO, wait for cleanup locks), often
  emit scary messages ("oldest xmin is far in the past").  But we don't
  have *any* mechanism that avoids them for very common scenarios.

  E.g. for insert-mostly workloads, there'll never be enough dead tuples
  to cause a vacuum to happen before autovacuum_freeze_max_age is
  reached. That prevents vacuum_freeze_table_age from pre-empting the
  need to do an anti-wraparound vacuum, by increasing the xid horizon.

  We literally call anti-wraparound autovacuums "emergency" in the code,
  yet they're an almost inevitable part of running postgres.


- There's no meaningful in-database prioritization. Which means that
  autovacuum workers might happily vacuum a table that's just a bit over
  the thresholds, even though there are much worse tables around.

  Especially on busy and large databases that can lead to workers
  started for anti-wraparound purposes effectively never getting to
  vacuum the tables above autovacuum_freeze_max_age, because tables
  earlier in pg_class are modified heavily enough that they have dead
  tuples above the thresholds by the time vacuum finishes.  To get to
  the table needing an anti-wraparound vacuum, a single launcher needs
  to go through all tables preceding it in pg_class that need vacuuming
  (only skipping over ones that are concurrently vacuumed by somebody
  else, but not ones that have *recently* been vacuumed).

  I kinda forgot how bad this one was until looking at the code again.

And there are plenty more.
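
As an aside, and just as a monitoring sketch rather than a fix for any
of the above: the wraparound pressure driving several of these problems
can at least be watched for with something like

-- Per-database xid age; values approaching autovacuum_freeze_max_age
-- (200 million by default) will trigger anti-wraparound vacuums.
SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
 ORDER BY xid_age DESC;

-- Oldest relations within the current database.
SELECT oid::regclass AS relation, age(relfrozenxid) AS xid_age
  FROM pg_class
 WHERE relkind IN ('r', 'm', 't')
 ORDER BY xid_age DESC
 LIMIT 20;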


My impression is that these are really hard to fix unless we develop a
new scheduling approach. And that scheduling approach probably needs to
be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- work in different databases needs to be scheduled in a smarter way, in
  particular anti-wraparound cannot simply cause only the database with
  the oldest datfrozenxid to be vacuumed until the wraparound "danger"
  is over (especially not if there are other databases needing to be
  anti-wrap vacuumed)

- tables within a database need to be prioritized in a smarter way, so
  tables with a lot of bloat get vacuumed before ones with a lot less
  bloat, and similarly tables with the oldest horizon need to be
  vacuumed before ones with newer horizons, even if all of the tables
  are above the autovacuum thresholds.

- tables need to be prioritized across databases, to avoid problems like
  one currently vacuumed table causing unnecessary anti-wraparound
  workers to be launched for a database where they can't perform any
  work.

- there needs to be a separate rung between normal autovacuums and
  anti-wraparound vacuums. The in-between level does 'aggressive'
  vacuuming (so it free

Re: LWLock optimization

2019-07-23 Thread Andres Freund
Hi,

On 2019-07-23 16:28:52 +, Alexander Pyhalov wrote:
> I was asked to bring up this topic on the mailing lists after asking a question on 
> IRC. 
> 
> The issue is the following.
> On one of our servers (10.5), I see a lot of queries with 
> wait_event_type='LWLock', wait_event='lock_manager' 

Could you qualify this a bit more?  What does "a lot" mean, in
comparison to the total number of queries/session?
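
One way to quantify that (a rough sketch, meant to be sampled a few
times) would be:

-- What fraction of active backends is waiting, and on what.
SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY 1, 2
 ORDER BY count(*) DESC;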


> This is a stat gathering/analyzing application with tables which have
> > 100 partitions.

Hm, what kind of partitioning scheme is this? The "new"-ish partitioning
support isn't yet always good about performing only the minimal amount
of metadata lookups (which in turn require locks). Especially in 10.

Are you using prepared statements?


> The queries itself are mostly simple (select by pkey, update by primary key 
> and so on).
> 
>  select count(*) from pg_locks shows about 40K  records (all granted) and 
> 
>  select count(*) from (select distinct 
> locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid
>  from pg_locks) 
> 
> is varying from about 1K to 25K (I haven't stored this value in monitoring 
> system, just observed manually).

I think to know what exactly is going on, we'd need a CPU profile, and
then perhaps a follow-up profile also getting the call sites for lock
waits.  Could you take a perf profile?

https://wiki.postgresql.org/wiki/Profiling_with_perf

> Currently I was advised to increase LOG2_NUM_LOCK_PARTITIONS at
> compile time. If it bothers us enough, we would likely do it, but I
> have a question, if this (or NUM_LOCK_PARTITIONS) value should be
> increased by default or set tunable?

I'm not quite sure what precisely you're asking here. Are you wondering
whether postgres should increase the default value of
LOG2_NUM_LOCK_PARTITIONS, and whether it should be runtime
configurable?  I'm fairly sure we don't want to make it runtime
configurable; that'd add overhead to some pretty central code.
Increasing the default would need a lot of benchmarks, to prove it
doesn't cause regressions in other common scenarios.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB? 
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
> 
> 
> I had this question when I try to measure the speed of reading data. The
> bottleneck was a read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm, that network packets have size 8192 (https://pastebin.com/FD8abbiA)

Well, in most setups you can't have frames that large. The most common
limit is 1500 bytes +- some overheads. Using jumbo frames isn't that
uncommon, but it has enough problems that I don't think it's that widely
used with postgres.


> So, with a well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of the Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes.



But the reason you're seeing 8192-byte packets isn't just that we have
an 8kB buffer, I think it's also that we unconditionally set
TCP_NODELAY:

#ifdef  TCP_NODELAY
on = 1;
if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
   (char *) &on, sizeof(on)) < 0)
{
elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
return STATUS_ERROR;
}
#endif

With an 8kB send size, we'll often unnecessarily send some smaller
packets (both for 1500 and 9000 byte MTUs), because 8kB isn't an even
multiple of the MTU-sized payload. Here are e.g. the IP packet sizes for
a query returning maybe 18kB:

1500
1500
1500
1500
1500
1004
1500
1500
1500
1500
1500
1004
1500
414

The dips are because that's where our 8kB buffer + disabling Nagle
implies a packet boundary.


I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c


It might be better to just use larger send sizes, however. I think most
kernels are going to be better than us at knowing how to chop up the
data to send. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for windows), and I don't recall seeing any
problem reports about that.


OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send() size.


> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund



