Re: Unexpected behavior with transition tables in update statement trigger

2018-02-26 Thread Thomas Munro
On Tue, Feb 27, 2018 at 4:18 AM, Tom Kazimiers  wrote:
> On Mon, Feb 26, 2018 at 11:15:44PM +1300, Thomas Munro wrote:
>> On Sat, Feb 24, 2018 at 4:47 PM, Tom Kazimiers 
>> wrote:
>> Thanks for the reproducer.  Yeah, that seems to be a bug.
>> nodeNamedTuplestorescan.c allocates a new read pointer for each
>> separate scan of the named tuplestore, but it doesn't call
>> tuplestore_select_read_pointer() so that the two scans that appear in
>> your UNION ALL plan are sharing the same read pointer.  At first
>> glance the attached seems to fix the problem, but I'll need to look
>> more carefully tomorrow.
>
> Thanks very much for investigating this. I can confirm that applying your
> patch results in the tuples I expected in both my test trigger and my actual
> trigger function.

Thanks for testing.

> It would be great if this or a similar fix would make it into the next
> official release.

Here's a new version with tuplestore_select_read_pointer() added in
another place where it was lacking, and commit message.  Moving to
-hackers, where patches go.

Here's a shorter repro.  On master it prints:

NOTICE:  count = 1
NOTICE:  count union = 1

With the patch the second number is 2, as it should be.

CREATE TABLE test (i int);
INSERT INTO test VALUES (1);

CREATE OR REPLACE FUNCTION my_trigger_fun() RETURNS trigger
LANGUAGE plpgsql AS
$$
  BEGIN
    RAISE NOTICE 'count = %', (SELECT COUNT(*) FROM new_test);
    RAISE NOTICE 'count union = %',
      (SELECT COUNT(*)
         FROM (SELECT * FROM new_test UNION ALL SELECT * FROM new_test) ss);
    RETURN NULL;
  END;
$$;

CREATE TRIGGER my_trigger AFTER UPDATE ON test
REFERENCING NEW TABLE AS new_test OLD TABLE as old_test
FOR EACH STATEMENT EXECUTE PROCEDURE my_trigger_fun();

UPDATE test SET i = i;

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Fix-tuplestore-read-pointer-confusion-in-nodeNamedtu.patch
Description: Binary data


Re: Unexpected behavior with transition tables in update statement trigger

2018-02-27 Thread Thomas Munro
On Wed, Feb 28, 2018 at 9:58 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> Here's a new version with tuplestore_select_read_pointer() added in
>> another place where it was lacking, and commit message.  Moving to
>> -hackers, where patches go.
>
> Pushed, along with a regression test based on your example.
> Unfortunately, this came in a bit too late for this week's releases :-(

Thanks!

Tom K, if you need a workaround before 10.4 comes out in May[1], you
could try selecting the whole transition table into a CTE up front.
Something like WITH my_copy AS (SELECT * FROM new_table) SELECT * FROM
my_copy UNION ALL SELECT * FROM my_copy should work.

[1] https://www.postgresql.org/developer/roadmap/
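
Applied to the trigger function from the repro above, that workaround
might look something like this (an untested sketch; the CTE forces a
single scan of the transition table, so both arms of the UNION ALL see
every row):

CREATE OR REPLACE FUNCTION my_trigger_fun() RETURNS trigger
LANGUAGE plpgsql AS
$$
  BEGIN
    RAISE NOTICE 'count union = %',
      (WITH my_copy AS (SELECT * FROM new_test)
       SELECT COUNT(*)
         FROM (SELECT * FROM my_copy UNION ALL SELECT * FROM my_copy) ss);
    RETURN NULL;
  END;
$$;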

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Concurrent CTE

2018-04-04 Thread Thomas Munro
On Wed, Apr 4, 2018 at 8:01 AM, Artur Formella  wrote:
> Execution now:
> time-->
> Thread1: aa | bb | cc | dd | ee | ff | gg | hh | primary
>
> And the question: is it possible to achieve more concurrent execution plan
> to reduce the response time? For example:
> Thread1: aa | dd | ff | primary
> Thread2: bb | ee | gg
> Thread3: cc | -- | hh

Parallel query can't be used for CTE queries currently.  Other good
things also don't happen when you use CTEs -- it's an "optimiser
fence" (though there is discussion of changing that eventually).
Maybe try rewriting your query as:

 SELECT ...
   FROM (SELECT ...) AS aa,
        (SELECT ...) AS bb,
        ...
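
For instance, with invented table and column names (not from your
query), the CTE form and the equivalent derived-table form might look
like:

WITH aa AS (SELECT region, sum(amount) AS total FROM sales GROUP BY region)
SELECT * FROM aa WHERE total > 1000;          -- CTE scan: leader only

SELECT *
  FROM (SELECT region, sum(amount) AS total FROM sales GROUP BY region) AS aa
 WHERE total > 1000;                          -- can be planned in parallel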

Note that in the form of parallelism supported in PostgreSQL 10, every
process (we use processes instead of threads) runs the same execution
plan at the same time, but each worker is given only a part of the
problem at disk block granularity, so it looks more like this:

Process1: fragments of aa | fragments of bb | ...
Process2: fragments of aa | fragments of bb | ...

PostgreSQL 11 (not yet released) will introduce an exception that
looks more like what you showed: the Parallel Append operator (for
unions and scans of partitions) can give each worker a different part
of the plan approximately as you showed, but IIUC that's used as a
fallback strategy when it can't use block granularity (because of
technical restrictions).  The problem with sub-plan granularity is
that the various sub-plans can finish at different times leaving some
CPU cores with nothing to do while others are still working, whereas
block granularity keeps everyone busy until the work is done and
should finish faster.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Concurrent CTE

2018-04-04 Thread Thomas Munro
On Thu, Apr 5, 2018 at 5:16 PM, David G. Johnston
 wrote:
> On Wed, Apr 4, 2018 at 10:12 PM, Thomas Munro
>  wrote:
>>
>> Parallel query can't be used for CTE queries currently.
>
> A pointer to the location in the docs covering this limitation would be
> appreciated.  It isn't covered here:
>
> https://www.postgresql.org/docs/10/static/when-can-parallel-query-be-used.html

https://www.postgresql.org/docs/10/static/parallel-safety.html

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pg_multixact/members growing

2018-05-22 Thread Thomas Munro
On Wed, May 23, 2018 at 7:49 AM, Tom Lane  wrote:
> Tiffany Thang  writes:
>> Our pg_multixact/members directory has been growing to more than 18GB over
>> the last couple of months. According to the documentation, the files in
>> there are used to support row locking by multiple transactions and when all
>> tables in all databases are eventually scanned by VACUUM, the older
>> multixacts are removed. In our case, the files are not removed.
>
> Hmm.  What does pg_controldata tell you about NextMultiXactId,
> NextMultiOffset, oldestMultiXid, oldestMulti's DB?
> Are pg_clog/ or pg_subtrans/ or pg_multixact/offsets/ getting large?
> Is there anything at all in pg_twophase/?  Is this system a replication
> master, and if so are any of its slaves lagging behind?

Some thoughts:

There are MULTIXACT_MEMBERS_PER_PAGE = 1636 members for every 8KB
page.  The reported directory size implies 18GB / 8KB * 1636 =
3,859,808,256 members.  Above MULTIXACT_MEMBER_SAFE_THRESHOLD =
2,147,483,647 we should be triggering emergency autovacuums to try to
reclaim space.  Only ~435 million more members can be created.

Is this system now aggressively running "wraparound prevention" autovacuums?

There are MULTIXACT_OFFSETS_PER_PAGE = 2048 multixacts for every 8KB
page, so the default autovacuum_multixact_freeze_max_age should
soft-cap the size of pg_multixact/offsets at around 1.5GB ~=
400,000,000 / 2048 * 8KB.

Unfortunately autovacuum_multixact_freeze_max_age doesn't impose any
limit on the number of members.  The totals can be quite explosive
with high numbers of backends, because when n backends share-lock a
row we make O(n) multixacts and O(n^2) members.  First we make a
multixact with 2 members, then a new one with 3 members, etc... so
that's n - 1 multixacts and (n * (n + 1)) / 2 - 1 members.
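
To put rough numbers on those formulas (just a back-of-the-envelope
sketch using the constants quoted above):

-- members implied by an 18GB pg_multixact/members directory
SELECT (18::numeric * 1024 * 1024 * 1024 / 8192) * 1636 AS members;
-- => 3859808256

-- multixacts and members created when 100 backends share-lock one row
SELECT 100 - 1 AS multixacts, (100 * 101) / 2 - 1 AS members;
-- => 99 multixacts, 5049 members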

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Can PostgreSQL create new WAL files instead of reusing old ones?

2018-06-21 Thread Thomas Munro
On Fri, Jun 22, 2018 at 11:22 AM, David Pacheco  wrote:
> On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek 
> wrote:
>> I have implemented a small change to allow WAL recycling to be disabled.
>> It is visible at:
>> https://cr.joyent.us/#/c/4263/
>>
>> I'd appreciate getting any feedback on this.

>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>> WAL files appears to significantly degrade query latency on ZFS.

I haven't tested it, but it looks reasonable to me.  It needs documentation
in doc/src/sgml/config.sgml.  It should be listed in
src/backend/utils/misc/postgresql.conf.sample.  We'd want a patch
against our master branch.  Could you please register it in
commitfest.postgresql.org so we don't lose track of it?

Hey, a question about PostgreSQL on ZFS: what do you guys think about
pg_flush_data() in fd.c?  It does mmap(), msync(), munmap() to try to
influence writeback.  I wonder if, at least on some operating systems,
that schlepps a bunch of data out of the ZFS ARC into the OS page cache,
kinda trashing the latter?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: incorrect resource manager data checksum in record

2018-06-28 Thread Thomas Munro
On Fri, Jun 29, 2018 at 5:44 AM, Devin Christensen 
wrote:
> The pattern is the same, regardless of ubuntu or postgresql versions. I'm
> concerned this is somehow a ZFS corruption bug, because the error always
> occurs downstream of the first ZFS node and ZFS is a recent addition. I
> don't know enough about what this error means, and haven't found much
> online. When I restart the nodes effected, replication resumes normally,
> with no known side-effects that I've discovered so far, but I'm no longer
> confident that the data downstream from the primary is valid. Really not
> sure how best to start tackling this issue, and hoping to get some
guidance.
> The error is infrequent. We have 11 total replication chains, and this
error
> has occurred on 5 of those chains in approximately 2 months.

It's possible and sometimes expected to see that error when there has been
a crash, but you didn't mention that.  From your description it sounds like
it's happening in the middle of streaming, right?  My first thought was
that the filesystem change is surely a red herring.  But... I did find this
similar complaint that involves an ext4 primary and a btrfs replica:

https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6

I'm having trouble imagining how the filesystem could be triggering a
problem though (unless ZoL is dramatically less stable than on other
operating systems, "ZFS ate my bytes" seems like a super unlikely theory).
Perhaps by being slower, it triggers a bug elsewhere?  We did have a report
recently of ZFS recycling WAL files very slowly (presumably because when it
moves the old file to become the new file, it finishes up slurping it back
into memory even though we're just going to overwrite it, and it can't see
that because our writes don't line up with the ZFS record size, possibly
unlike ye olde write-in-place 4k block filesystems, but that's just my
guess).  Does your machine have ECC RAM?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: incorrect resource manager data checksum in record

2018-06-28 Thread Thomas Munro
On Fri, Jun 29, 2018 at 1:14 PM, Devin Christensen  wrote:
>> From your description it sounds like it's happening in the middle of
>> streaming, right?
>
> Correct. None of the instances in the chain experience a crash. Most of the
> time I see the "incorrect resource manager data checksum in record" error,
> but I've also seen it manifested as:
>
> invalid magic number 8813 in log segment 0003AEC2009C, offset
> 15335424

I note that that isn't at a segment boundary.  Is that also the case
for the other error?

One theory would be that there is a subtle FS cache coherency problem
between writes and reads of a file from different processes
(causality), on that particular stack.  Maybe not too many programs
pass data through files with IPC to signal progress in this kinda
funky way, but that'd certainly be a violation of POSIX if it didn't
work correctly and I think people would know about that so I feel a
bit silly suggesting it.  To follow that hypothesis to the next step:
I suppose it succeeds after you restart because it requests the whole
segment again and gets a coherent copy all the way down the chain.
Another idea would be that our flush pointer tracking and IPC is
somehow subtly wrong and that's exposed by different timing leading to
incoherent reads, but I feel like we would know about that by now too.

I'm not really a replication expert, so I could be missing something
simple here.  Anyone?

>> I did find this similar complaint that involves an ext4 primary and a
>> btrfs replica:
>
> It is interesting that my issue occurs on the first hop from ZFS to ext4. I
> have not seen any instances of this happening going from the ext4 primary to
> the first ZFS replica.

I happen to have a little office server that uses ZFS so I left it
chugging through a massive pgbench session with a chain of 3 replicas
while I worked on other stuff, and didn't see any problems (no ext4
involved though, this is a FreeBSD box).  I also tried
--wal-segsize=1MB (a feature of 11) to get some more frequent
recycling happening just in case it was relevant.

>> We did have a report recently of ZFS recycling WAL files very slowly
>
> Do you know what version of ZFS that effected? We're currently on 0.6.5.6,
> but could upgrade to 0.7.5 on Ubuntu 18.04

I think that issue is fundamental to all versions, and has something to do
with the record size (if you have 128KB ZFS records and someone writes
8KB, it probably needs to read a whole 128KB record in, whereas with
ext4 et al you have 4KB blocks and the OS can very often skip reading
it in because it can see you're entirely overwriting blocks), and
possibly the COW design too (I dunno).  Here's the recent thread,
which points back to an older one, from some Joyent guys who I gather
are heavy ZFS users:

https://www.postgresql.org/message-id/flat/CACPQ5FpEY9CfUF6XKs5sBBuaOoGEiO8KD4SuX06wa4ATsesaqg%40mail.gmail.com

There was a ZoL bug that made headlines recently but that was in 0.7.7
so not relevant to your case.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Why the sql is not executed in parallel mode

2018-09-18 Thread Thomas Munro
On Wed, Sep 19, 2018 at 1:53 PM jimmy  wrote:
>
> Why the sql is not executed in parallel mode, does the sql has some problem?
> with sql1 as

Hello Jimmy,

WITH is the problem.  From the manual[1]:  "The following operations
are always parallel restricted.  Scans of common table expressions
(CTEs). ...".  That means that these CTEs can only be scanned in the
leader process.

If you rewrite the query using sub selects it might do better.  FWIW
there is a project to make WITH work like subselects automatically in
a future release of PostgreSQL:

https://www.postgresql.org/message-id/flat/87sh48ffhb@news-spur.riddles.org.uk

[1] https://www.postgresql.org/docs/10/static/parallel-safety.html
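
To see the difference for yourself, comparing EXPLAIN output for the
two forms is usually enough.  A sketch with made-up names (plan shapes
will vary):

EXPLAIN (COSTS OFF)
WITH sql1 AS (SELECT * FROM big_table WHERE x > 0)
SELECT count(*) FROM sql1;
-- CTE Scan: runs only in the leader, no Gather below it

EXPLAIN (COSTS OFF)
SELECT count(*)
  FROM (SELECT * FROM big_table WHERE x > 0) AS sql1;
-- may show Gather -> Parallel Seq Scan if the table is large enough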

-- 
Thomas Munro
http://www.enterprisedb.com



Re: how to know whether query data from memory after pg_prewarm

2018-09-18 Thread Thomas Munro
On Wed, Sep 19, 2018 at 1:35 PM jimmy  wrote:
> I use select pg_prewarm('table1','read','main')  to load data of table1 into 
> the memory.
> when I use select count(1) from table1 group by aa to query data.
> I find the speed of query is not fast, I wonder whether it query data from 
> memory.
> And it is slower than Oracle, both of Oracle and Postgresql has same table 
> and count of data.
> when pg_prewarm use 'read' mode,  the data is put into the OS cache, how to 
> examine the table which is pg_prewarmed into the OS cache .
> I know pg_buffercache ,but it just examine the table in the shared buffer of 
> Postgresql, not the table in the OS cache.

This is a quick and dirty hack, but it might do what you want:

https://github.com/macdice/pgdata_mincore

Tested on FreeBSD, not sure how well it'll travel.
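
For the PostgreSQL side of the picture, pg_buffercache (which you
already mentioned) can at least tell you how much of table1 sits in
shared_buffers.  A sketch, assuming the extension is installed:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT count(*) AS buffers_cached
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
 WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE datname = current_database()))
   AND c.relname = 'table1';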

-- 
Thomas Munro
http://www.enterprisedb.com



Re: how to know whether query data from memory after pg_prewarm

2018-09-19 Thread Thomas Munro
On Wed, Sep 19, 2018 at 7:44 PM Cédric Villemain  wrote:
> Le 19/09/2018 à 05:29, Thomas Munro a écrit :
> > On Wed, Sep 19, 2018 at 1:35 PM jimmy  wrote:
> >> I use select pg_prewarm('table1','read','main')  to load data of table1 
> >> into the memory.
> >> when I use select count(1) from table1 group by aa to query data.
> >> I find the speed of query is not fast, I wonder whether it query data from 
> >> memory.
> >> And it is slower than Oracle, both of Oracle and Postgresql has same table 
> >> and count of data.
> >> when pg_prewarm use 'read' mode,  the data is put into the OS cache, how 
> >> to examine the table which is pg_prewarmed into the OS cache .
> >> I know pg_buffercache ,but it just examine the table in the shared buffer 
> >> of Postgresql, not the table in the OS cache.
> >
> > This is a quick and dirty hack, but it might do what you want:
> >
> > https://github.com/macdice/pgdata_mincore
> >
> > Tested on FreeBSD, not sure how well it'll travel.
>
> You can use pgfincore extension for that purpose, and more.
>
> https://github.com/klando/pgfincore/blob/master/README.md

Yes, if you only want to know *how many* pages are in the OS page
cache.  pgdata_mincore shows you which PG blocks are in the page cache
in the same format as pg_buffercache, which is useful for studying
double buffering effects.  Maybe I should turn it into a patch for
pgfincore...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: We are facing "PANIC: could not read from control file:Success error while starting the database.

2018-10-04 Thread Thomas Munro
On Fri, Oct 5, 2018 at 4:29 AM Raghavendra Rao J S V
 wrote:
> PANIC: could not read from control file:Success

That means that the pg_control file is the wrong size.  What size is
it?  What filesystem is this, that allowed an out-of-space condition
to result in a file being truncated?  Normally we only overwrite that
file, so after creation it should stay the same size.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: OOM with many sorts

2019-07-08 Thread Thomas Munro
On Tue, Jul 9, 2019 at 4:44 AM Justin Pryzby  wrote:
>  [...many partial GroupAggregate/Sort repeated 
> here for ~200 child tables...]
>
> This isn't the first time I've had to reduce work_mem on a parallel query to
> avoid OOM, but it seems unreasonable to be concerned with 50MB sorts.

This is a fundamental problem with our model of memory budgets.  To
oversimplify massively, we can use about work_mem * ntables * the
number of parallel workers.  Core counts are going up, and now we have
a convenient way to get large values of ntables.  One of many -hackers
threads to discuss the problem:

https://www.postgresql.org/message-id/flat/CAH2-WzmNwV%3DLfDRXPsmCqgmm91mp%3D2b4FvXNF%3DcCvMrb8YFLfQ%40mail.gmail.com
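
As a back-of-the-envelope sketch for the numbers in your report
(assuming ~50MB sorts, ~200 partitions, and 2 workers plus the leader):

SELECT pg_size_pretty(50::bigint * 1024 * 1024 * 200 * 3) AS worst_case;
-- => 29 GB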

> It looks like the results of each Sort node stay in RAM, during processing of
> additional sort nodes (is that required?)

That's a very interesting question linked to whole-query memory
budgets.  If you don't have a whole-query memory budget, then you have
nothing to lose by keeping hash and sort results in memory, and you
gain the ability to do cheap rescans (if the node happens to be in a
nest loop).  I discussed this in the context of hash joins over here:

https://www.postgresql.org/message-id/CAEepm%3D0N6DODN7nx6Zb93YOW-y%3DRftNNFZJRaLyG6jbJHJVjsA%40mail.gmail.com

The TL;DR is that some other databases throw out eg hash tables as
soon as possible, and consider changing the shape of their join
nesting in order to minimise the total number of hash tables in memory
at once, in order to come in under a certain budget for memory used at
any one point in time.  That is, they don't just divide query_work_mem
up over all the nodes, they understand things about when nodes will
allocate and free memory.

There are some less clever things we can do, though, before we tackle
the big problems involved.  We could probably still opportunistically
give back memory sooner, when we know there is no chance of rescan,
and other things along those lines.

> Thanks in advance for any advice.

I think it's impossible to choose a single value for work_mem if you
have a mixture of types of queries that hit wildly different numbers
of partitions and workers.  I think this is an ongoing topic for
-hackers.

-- 
Thomas Munro
https://enterprisedb.com




Re: PGSQL 11.4: shared_buffers and /dev/shm size

2019-07-09 Thread Thomas Munro
On Tue, Jul 9, 2019 at 10:15 PM Jean Louis  wrote:
> * Konstantin Malanchev  [2019-07-09 12:10]:
> > I have 8 GB RAM and /dev/shm size is 4GB, and there is no significant 
> > memory usage by other system processes. I surprised that Postgres uses more 
> > space in /dev/shm than sharred_buffers parameter allows, probably I don't 
> > understand what this parameter means.
> >
> > I have no opportunity to enlarge total RAM and probably this query requires 
> > too much RAM to execute. Should Postgres just use HDD as temporary storage 
> > in this case?
>
> That I cannot know. I know that /dev/shm could
> grow as much as available free RAM.

Hi,

PostgreSQL creates segments in /dev/shm for parallel queries (via
shm_open()), not for shared buffers.  The amount used is controlled by
work_mem.  Queries can use up to work_mem for each node you see in the
EXPLAIN plan, and for each process, so it can be quite a lot if you
have lots of parallel worker processes and/or lots of
tables/partitions being sorted or hashed in your query.

-- 
Thomas Munro
https://enterprisedb.com




Re: PGSQL 11.4: shared_buffers and /dev/shm size

2019-07-09 Thread Thomas Munro
On Tue, Jul 9, 2019 at 11:11 PM Konstantin Malanchev  wrote:
> Thank you for explanation. work_mem = 512MB and 
> max_parallel_workers_per_gather = 2 and I run only one Postgres instance and 
> only one query. EXPLAIN shows "Workers Planned: 2" for this query. Why it can 
> use more than 1GB of /dev/shm?

For example, if you have one Parallel Hash Join in your plan, it could
allocate up to 512MB * 3 of shared memory (3 = leader process + 2
workers).  It sounds like you'll need to set work_mem smaller.  If you
run EXPLAIN ANALYZE you'll see how much memory is used by individual
operations.  Usually it's regular private anonymous memory, but for
Parallel Hash it's /dev/shm memory.

-- 
Thomas Munro
https://enterprisedb.com




Re: PGSQL 11.4: shared_buffers and /dev/shm size

2019-07-09 Thread Thomas Munro
On Wed, Jul 10, 2019 at 12:27 AM Konstantin Malanchev  wrote:
> I'm executing the query with smaller work_mem, it will take some time. But I 
> still confused why it used all /dev/shm (4GB) and fails with "no space left" 
> error while work_mem = 512MB.

I think it could fail that way for two reasons: the /dev/shm size limit
(a mount option, which I think you are saying you have set to 4GB?), or
your system running out of RAM + swap.  The directory listing in your first
message only shows 1.4GB, not 4GB, so perhaps it's the second problem.

> ->  Parallel Hash Join  (cost=11030236131.39..255829470118.27 rows=18618961759219 width=31)
>       Hash Cond: (another_mat_view.oid = my_table.oid)
>       ->  Parallel Seq Scan on another_mat_view  (cost=0.00..652514.56 rows=31645556 width=8)
>       ->  Parallel Hash  (cost=636676233.38..636676233.38 rows=20353804801 width=31)
>             ->  Parallel Seq Scan on my_table  (cost=0.00..636676233.38 rows=20353804801 width=31)
>                   Filter: (flag = ''::bit(16))

It's strange that it's hashing the ~20B row table instead of the ~30M row table.

-- 
Thomas Munro
https://enterprisedb.com




Re: ERROR: too many dynamic shared memory segments

2019-09-11 Thread Thomas Munro
On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu  wrote:
> If the error persist I will try to revert the work_mem.
> Thanks a lot

Hi Nicola,

It's hard to say exactly what the cause of the problem is in your case
and how to avoid it, without knowing what your query plans look like.
PostgreSQL allows 64 + 2 * max_connections segments to exist at a time,
and it needs a number of them that depends on work_mem (in the case of
Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
on the number of Gather nodes that appear in the plan, which in some
unusual cases can result from partitioning.
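
For example, you can compute the limit for your own settings like this
(a sketch of the formula just described):

SELECT 64 + 2 * current_setting('max_connections')::int AS dsm_slots;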

I've seen people reaching this error by running a lot of parallel
queries concurrently.  If that's the cause, then you can definitely
get some relief by turning work_mem down, or by turning
max_connections up (even though you don't want to allow more
connections -- because it influences the formula for deciding on the
DSM segment limit).  We should probably adjust some of the internal
constants to give us more slots, to avoid that problem, as discussed
here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com

I've also seen people reaching this error by somehow coming up with
plans that had a very large number of Gather nodes in them,
corresponding to partitions; that's probably a bad plan (it'd
presumably be better to terminate parallelism higher up in the plan,
but these plans do seem to exist in the wild; I don't recall exactly
why).  I think we need a bit of a redesign so that if there are
multiple Gather nodes, they share the same main DSM segment, instead
of blowing through this limit.

-- 
Thomas Munro
https://enterprisedb.com




Re: PostgreSQL - unrecognized win32 error code: 38

2019-10-29 Thread Thomas Munro
On Tue, Oct 29, 2019 at 9:23 PM ZhenHua Cai  wrote:
> No, it doesn't call any in-core code.

I wondered if this could be coming from the new code in
src/port/pg_p{read,write}.c.  ERROR_HANDLE_EOF is a documented
GetLastError() return value after ReadFile() fails[1], but only for
asynchronous files.  We are using that interface ("overlapped" IO,
their name for asynchronous IO, but the same interface can also do
synchronous IO and happens to support an offset like Unix's pread()),
but we're not opening file handles with FILE_FLAG_OVERLAPPED so we
have a plain old synchronous handle here.

Searching the web for ERROR_HANDLE_EOF leads to at least one
discussion[2] of "mailslots" (like our syslog pipe?) and a couple of
possible underlying NT errors, and following those leads some other
stuff about SMB filesystems and sockets.  Is your database running on
a local or SMB (or other remote) filesystem?  Are you using the
PostgreSQL syslog process?  I don't know anything about Windows at all
but from these breadcrumbs I feel like there has to be a network or
asynchronous communication involved somewhere here, not a plain old
end-of-file on a regular synchronous file operation, which, as Tom
said, we have code paths for that don't look like this (because that
just shows up as 0 bytes read).

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile
[2] 
https://stackoverflow.com/questions/56510727/what-causes-writefile-to-return-error-38-error-handle-eof




Re: here does postgres take its timezone information from?

2019-11-05 Thread Thomas Munro
On Wed, Nov 6, 2019 at 12:02 PM Adrian Klaver  wrote:
> On 11/5/19 3:00 PM, Chris Withers wrote:
> > Hmm. Is there any option to use the system timezone packages?
>
> https://www.postgresql.org/docs/11/install-procedure.html
>
> --with-system-tzdata=DIRECTORY

By the way, you can see if your installation of PostgreSQL was built
to use system-provided tzdata by running the pg_config program that
was installed alongside it.  That could be useful if it was built by a
package maintainer (Debian etc) and you want to see how they
configured it.  You'll see something like CONFIGURE = '...
--with-system-tzdata=/usr/share/... ' if it's using OS vendor tzdata
files.  I hope that most distributions do that*, because otherwise you
could finish up with lots of out-of-sync copies of the tzdata database
inside your database, your JVM, your libc, etc etc, and you want a
single source of truth for that stuff.
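
On 9.6 and later you can also check this from SQL via the pg_config
view (readable only by superusers by default), something like:

SELECT setting FROM pg_config WHERE name = 'CONFIGURE';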

Once I was involved in rolling out a last minute DST rule change that
happened in Australia due to politicians and an international sporting
event, and we had to go hunting for copies of tzdata hiding on our
servers that had to agree on when the financial markets were
opening... we found many copies, and ever since then I complain
wherever I see packages shipping their own copies of this stuff...

Assuming you are using system tzdata, your other question was what you
need to do after the tzdata files have been updated.  I suspect that
new PostgreSQL database sessions (processes) will see the new rules,
but existing sessions may continue to see the old rules if they had
loaded them already, because we cache them in per-process memory (see
pg_tzset()).  It would probably be safest to restart the PostgreSQL
cluster.

If you're using PostgreSQL's built-in tzdata, then you'll need to
restart your cluster anyway once you install the version that shipped
with the new tzdata rules, and depending on your package manager, that
might happen automatically when you upgrade.

*It looks like FreeBSD's port uses the copy of tzdata from the
PostgreSQL source tree by default and thus that is what you get if you
install PostgreSQL with "pkg".  That's not a great default IMHO and
should be changed.




Re: here does postgres take its timezone information from?

2019-11-05 Thread Thomas Munro
On Wed, Nov 6, 2019 at 2:20 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Wed, Nov 6, 2019 at 12:02 PM Adrian Klaver  
> > wrote:
> >> On 11/5/19 3:00 PM, Chris Withers wrote:
> >>> Hmm. Is there any option to use the system timezone packages?
>
> >> --with-system-tzdata=DIRECTORY
>
> > I hope that most distributions do that*, because otherwise you
> > could finish up with lots of out-of-sync copies of the tzdata database
> > inside your database, your JVM, your libc, etc etc, and you want a
> > single source of truth for that stuff.
>
> Right.  Our recommendation is to use --with-system-tzdata if you're on
> a platform where the vendor updates their copy of tzdata regularly.
> The fact that we supply a copy of tzdata at all is really just a fallback
> for folks on poorly-supported platforms.  (Naming no names here ...)

Incidentally, that's also why I don't want to give up on libc
collations quite as easily as some, despite their limitations.  It
should be possible to get all the software on your system to agree on
the ordering of two strings and the current time!

> > *It looks like FreeBSD's port uses the copy of tzdata from the
> > PostgreSQL source tree by default and thus that is what you get if you
> > install PostgreSQL with "pkg".  That's not a great default IMHO and
> > should be changed.
>
> Ugh.  Who can we poke there?

Maybe Palle?  (CCed).




Re: here does postgres take its timezone information from?

2019-11-12 Thread Thomas Munro
On Wed, Nov 13, 2019 at 3:40 AM Palle Girgensohn  wrote:
> The decision to use postgresql's tzdata is quite old. It was based on the 
> assumption that postgres is updated more frequently than the operating 
> system, and that for that reason it was better to use postgresql's tzdata, 
> since it would be more accurate more often. This is probably not true 
> anymore, so I agree it should probably be changed to default=system-tzdata on 
> FreeBSD.
>
> I will commit an upgrade in Thursday, and unless anybody raise their voice 
> against it, I will change the default setting then.

Thanks!

FreeBSD users already have the choice between zoneinfo from base or
the misc/zoneinfo port if for some reason they want to control tzdata
updates separately.  PostgreSQL and FreeBSD both track tzdata closely,
and both pushed a commit for version 2019c into their stable branches
within a couple of weeks of it being released, so I don't foresee any
problem with this change, and staying in sync with libc seems to
outweigh any other concerns IMHO.

https://www.iana.org/time-zones
https://github.com/postgres/postgres/tree/REL_12_STABLE/src/timezone/data
https://github.com/freebsd/freebsd/tree/stable/12/contrib/tzdata




Re: here does postgres take its timezone information from?

2019-11-15 Thread Thomas Munro
On Sat, Nov 16, 2019 at 8:38 AM Tom Lane  wrote:
> Palle Girgensohn  writes:
> >> 6 nov. 2019 kl. 03:03 skrev Thomas Munro :
> >>> *It looks like FreeBSD's port uses the copy of tzdata from the
> >>> PostgreSQL source tree by default and thus that is what you get if you
> >>> install PostgreSQL with "pkg".  That's not a great default IMHO and
> >>> should be changed.
>
> > The decision to use postgresql's tzdata is quite old. It was based on the 
> > assumption that postgres is updated more frequently than the operating 
> > system, and that for that reason it was better to use postgresql's tzdata, 
> > since it would be more accurate more often. This is probably not true 
> > anymore, so I agree it should probably be changed to default=system-tzdata 
> > on FreeBSD.
> > I will commit an upgrade in Thursday, and unless anybody raise their voice 
> > against it, I will change the default setting then.
>
> So it seems that that change was not entirely without fallout:
>
> https://www.postgresql.org/message-id/flat/16118-ef1e45e342c52416%40postgresql.org
>
> I don't think this is reason to revert the change, exactly,
> but it's a concern.  I wonder why FreeBSD editorializes on
> the set of zone names?

Ugh.  It doesn't have the old backward compatibility names like
US/Pacific installed by default, which is a problem if that's what
initdb picked for your cluster (or you've stored references to any of
those names in other ways).
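
A quick way to check whether the tzdata your server is actually using
still knows a given alias (a sketch):

SELECT name, abbrev, utc_offset
  FROM pg_timezone_names
 WHERE name IN ('US/Pacific', 'America/Los_Angeles');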




Re: here does postgres take its timezone information from?

2019-11-16 Thread Thomas Munro
On Sat, Nov 16, 2019 at 7:13 PM Tom Lane  wrote:
> Palle Girgensohn  writes:
> > 15 nov. 2019 kl. 21:32 skrev Thomas Munro :
> >> Ugh.  It doesn't have the old backward compatibility names like
> >> US/Pacific installed by default, which is a problem if that's what
> >> initdb picked for your cluster (or you've stored references to any of
> >> those names in other ways).
>
> > One quick fix is to revert the change. Tom thinks this is not reason to 
> > revert. Would it be enough to edit the postgresql.conf to use the correct 
> > "modern" name for US/Pacific (PST?)? In rhar case, an update note might be 
> > sufficient?
>
> I think the "official" name of that zone is America/Los_Angeles.
> But initdb might seize on the US/Pacific alias, if available,
> because it's shorter.  We've seen related problems with other
> time zone names, though usually it was just cosmetic and not a
> reason for the postmaster to fail to start.
>
> Yes, changing the zone name in postgresql.conf should be a sufficient
> fix.  In theory, a FreeBSD user ought to know the "official" alias
> for their zone, since the rest of the system would expect that.
> So this is slightly tedious if initdb chose a non-official alias,
> but I don't think it's reason to panic.

Perhaps the best thing would be to revert this for the older
PostgreSQL releases, so that people doing minor version upgrades
aren't inconvenienced by a system that can't start up after "pkg
upgrade", but do it for 12 since not many people will be using that yet?




Re: PostgreSQL - unrecognized win32 error code: 38

2019-11-19 Thread Thomas Munro
On Wed, Oct 30, 2019 at 12:13 AM Thomas Munro  wrote:
> On Tue, Oct 29, 2019 at 9:23 PM ZhenHua Cai  wrote:
> > No, it doesn't call any in-core code.
>
> I wondered if this could be coming from the new code in
> src/port/pg_p{read,write}.c.  ERROR_HANDLE_EOF is a documented
> GetLastError() return value after ReadFile() fails[1], but only for
> asynchronous files.  We are using that interface ("overlapped" IO,
> their name for asynchronous IO, but the same interface can also do
> synchronous IO and happens to support an offset like Unix's pread()),
> but we're not opening file handles with FILE_FLAG_OVERLAPPED so we
> have a plain old synchronous handle here.

> [1] 
> https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfile

Hi,

With a little help we got to the bottom of this:

https://www.postgresql.org/message-id/flat/CAC%2BAXB0_zNFUH1BrRGKUkFxvq3SxsojroJN70iTUA1gUNn_gag%40mail.gmail.com#cb324760c6a142ec4ed6abefa8e68f18

It should be fixed in the next release.  Thanks for the report!




Re: tcp keep alive don't work when the backend is busy

2019-12-10 Thread Thomas Munro
On Wed, Dec 11, 2019 at 4:17 AM Fabio Ugo Venchiarutti
 wrote:
> On 10/12/2019 15:06, Tom Lane wrote:
> > =?utf-8?B?0J7Qu9C10LMg0KHQsNC80L7QudC70L7Qsg==?=  writes:
> >> According to the documentation
> >> https://www.postgresql.org/docs/12/runtime-config-connection.html
> >> A backend must check connection to the client by tcp_keepalive messages. 
> >> (Config option tcp_keepalives_idle).
> >
> >> But this is don't work if the backend is busy.
> >
> > You're reading something into the documentation that isn't there.
> >
> > The TCP keepalive mechanism is something that the OS does, independently
> > of backend processing.  The backend isn't going to notice loss of client
> > connection until it tries to read or write on the connection.
> >
> > If it were free to improve this, we might do so.  But it would be
> > very much not free.
>
> At what points does the backend bite the bullet to test the state of
> that file descriptor?
>
> I'd expect select() and poll() to return immediately when keepalive
> probes timeout, so idling clients are covered (and that's the main use
> case); does any other code path go out if its way to ensure that there's
> still a client without actually needing to read()/write()/send()/recv()?
> (obviously at the cost you mentioned)

It has been proposed that busy backends should (optionally)
periodically try to do a MSG_PEEK so they can learn about a client
that has gone away some time before they eventually try to write:

https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru

More work is needed to move that forward, though.




Re: When are Predicate Locks Freed

2019-12-12 Thread Thomas Munro
On Fri, Dec 13, 2019 at 5:00 AM Dave Halter  wrote:
> I have worked quite a bit with serializable transactions. I'm trying
> to understand when predicate locks are freed. I read the whole
> README-SSI [1], but I'm still not sure when a predicate lock gets
> dropped.
>
> What I learned from that README is that predicate locks are not freed
> at the moment where a transaction passes, but only at a later stage.
> This makes sense, because some concurrent transactions might be
> invalidated by the predicate locks that are not needed anymore for a
> transaction that was just committed.

Right.  Even though T1 has committed, T2 and T3 can create a
"dangerous cycle" by touching data that T1 accessed, meaning that they
can't all be allowed to commit because what they've seen isn't
consistent with any serial ordering.  So the ghost of T1 continues to
affect other transactions, and predicate.c needs to figure out the
correct poltergeist duration.

> What I don't understand is: We have only short transaction (<1s), but a
>
> select count(*) from pg_locks where mode = 'SIReadLock';
>
> would vary by a lot. It typically went up all the way to 300k. At this
> point it would drop to almost zero (always <10k) and rise again. This
> cycle pretty consistently happened every three minutes. (I think it
> looked a bit more like 2 minutes 50 seconds, but I wasn't measuring it
> exactly). Is there a "garbage collection" happening every few minutes?
> Since we scaled up our service by quite a bit, we increased
> `max_pred_locks_per_transaction`. Now we are at 1.0m to 1.3m
> predicate locks with the same ~3 minute cleanup period.

Garbage collection happens in ClearOldPredicateLocks() when a
SERIALIZABLE transaction ends.  This is triggered every time the
oldest serializable transaction finishes (according to the "xmin"):

/*
 * Check whether it's time to clean up old transactions. This can only be
 * done when the last serializable transaction with the oldest xmin among
 * serializable transactions completes.  We then find the "new oldest"
 * xmin and purge any transactions which finished before this transaction
 * was launched.
 */

> Does anybody have any insights in why this might be happening? I would
> also be very interested in good explanations about when predicate
> locks are actually freed. I tried reading the C code, but couldn't
> really figure out where that was happening.

Do you have long running serializable transactions?  They would
prevent cleanup of any number of short serializable transactions that
start later.  Can you tell us a bit about your workload, number of
concurrent queries, range of query durations?  Do you see a lot of
files under pg_serial (this is where finished transactions get spilled
to disk, and with the large numbers you're reporting, perhaps there
would be some of that happening too)?
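
One way to spot such sessions (a sketch; backend_xmin shows who is
holding back cleanup):

SELECT pid, state, xact_start, backend_xmin, left(query, 60) AS query
  FROM pg_stat_activity
 WHERE backend_xmin IS NOT NULL
 ORDER BY xact_start
 LIMIT 10;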




Re: Row locks, SKIP LOCKED, and transactions

2019-12-17 Thread Thomas Munro
On Wed, Dec 18, 2019 at 5:12 AM Steven Winfield
 wrote:
> * I observe this even if I crank up the transaction isolation level to 
> repeatable read and serializable.

Huh.  SERIALIZABLE shouldn't allow two transactions to see no result
row for a given ID and then insert a result row for that ID.  One of
those transactions should have to roll back, because otherwise it'd be
incompatible with both serial orderings of the two transactions.

> I'm wondering if row locks are not obeying the same transactional semantics 
> as row data,

They are indeed a bit weird.  They sometimes check if the condition
still apply (since the row might have changed between the scan and
LockRows node) which leads to some interesting effects, but only if
the row being locked was concurrently updated, and here that isn't the
case.  This is a source of a fair amount of confusion about FOR UPDATE
and joins/subselects.

> Perhaps this is a misuse of the locking system, since I'm locking a row "FOR 
> UPDATE" but not actually updating it, but as row locks are released at the 
> end of a transaction (according to the docs) then my expectation was for the 
> unlocking and the visibility of newly committed rows to be atomic.
> I've tried FOR NO KEY UPDATE too, without luck.
>
> If I'm doing something forbidden (and the docs say so) then I'd be grateful 
> if someone could point that out!

Conceptually, the thing you really need to lock for this to work is
the result row that isn't there yet, so that some overlapping
transaction doesn't try to lock the same absent thing.  Unfortunately,
our system for locking things that aren't there isn't there either.
Some articles on serializability talk about "materialising the
conflict", which means locking some other surrogate thing that
"covers" a gap you are interested in.  You might think the job row
would do the trick, but since we don't recheck the condition (that is,
recheck that there is no corresponding result because you don't update
the job row), no cigar.  You could also use plain old
pg_try_advisory_xact_lock(id), because it just locks integers, and
they're always there.

SERIALIZABLE deals with that type of magic internally (it locks gaps
in key ranges by predicate-locking a physical btree or hash page that
you'd need to write on to insert a row with a matching key, which is
how it discovers a conflict between one transaction that went looking
for key=42 but didn't find it and another that later writes key=42),
but, as mentioned, SERIALIZABLE doesn't really allow concurrency with
this workload, and you specified that you wanted concurrency with SKIP
LOCKED (but I think you'd have the same problem without it; SKIP
LOCKED just gets you the wrong answer faster).

There are various ways you could deal with this, but I'd probably go
for a simple scheme where you only have to consult a single row to
know if you can claim it.  You could still put the results into a
separate table, but use job.state to find work, and set it to DONE
when you insert the result.  It may also be possible to add no new
columns but do a dummy update to the job row to get the join qual
rechecked, but I'm not sure if that'd work.  Another reason to add a
state column to the job table is so that you can put a conditional
index on it so you can find jobs to be done very quickly, if you're
not planning to remove the ones that are done.
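
A minimal sketch of that last scheme, with made-up table and column
names (not your schema):

BEGIN;

-- claim one not-yet-done job; concurrent workers skip rows we hold locked
SELECT id
  FROM job
 WHERE state = 'NEW'
 LIMIT 1
   FOR UPDATE SKIP LOCKED;

-- ... do the work for that id, then in the same transaction:
INSERT INTO result (job_id, payload) VALUES (42, '...');
UPDATE job SET state = 'DONE' WHERE id = 42;

COMMIT;

-- and a partial index keeps finding new jobs cheap:
CREATE INDEX ON job (id) WHERE state = 'NEW';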




Re: Writing Postgres Extensions in C on Windows

2020-01-07 Thread Thomas Munro
On Wed, Jan 8, 2020 at 4:32 AM İlyas Derse  wrote:
> I want to register C code to PostgreSql on Windows. So I think, I have to 
> make a extension for PostgreSql. But I did not find to written extension on 
> windows. Do you have an idea ?

I don't do Windows myself but this blog from Craig Ringer looks like a
good starting point:

https://www.2ndquadrant.com/en/blog/compiling-postgresql-extensions-visual-studio-windows/




Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Thomas Munro
On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu  wrote:
> after a few months, we started having this issue again.
> So we revert the work_mem parameter to 600MB instead of 2GB.
> But the issue is still there. A query went to segmentation fault, the DB went 
> to recovery mode and our app went to read only for a few minutes.

Hi Nicola,

Hmm, a segmentation fault sounds like a different problem.  Can you
please share the exact error messages from PostgreSQL and OS logs?

> I understand we can increase max_connections so we can have many more 
> segments.
>
> My question is : is there a way to understand the number of segments we 
> reached?

If you're on Linux, you can probably see them with "ls /dev/shm".

> Currently we have 220 max_conn so as your formula is 64 + 2* 220 we have 
> about 500 shared segments.
> We would like to increase that number to 300 or 400 but would be great to 
> understand if there is a way to make sure we will solve the issue as it 
> requires a restart of the service.
>
> I know you were also talking about a redesign this part in PostgreSQL. Do you 
> know if anything has changed in any of the newer versions after 11.5?

It's possible that we should increase a couple of constants used in the
formula -- I'll look into that again.  But first I'd like to see if
we're even investigating the right problem here.




Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Thomas Munro
On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu  wrote:
> This is the error on postgres log of the segmentation fault :
>
> 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process 
> (PID 2042) was terminated by signal 11: Segmentation fault
> 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed process 
> was running: select pid from pg_stat_activity where query ilike 'REFRESH 
> MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating any 
> other active server processes

Ok, this is a bug.  Do you happen to have a core file?  I don't recall
where CentOS puts them.

> > If you're on Linux, you can probably see them with "ls /dev/shm".
>
> I see a lot of files there, and doing a cat they are empty. What can I do 
> with them?

Not much, but it tells you approximately how many 'slots' are in use
at a given time (ie because of currently running parallel queries), if
they were created since PostgreSQL started up (if they're older ones
they could have leaked from a crashed server, but we try to avoid that
by trying to clean them up when you restart).

> Those are two different problems I guess, but they are related because right 
> before the Segmentation Fault I see a lot of shared segment errors in the 
> postgres log.

That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
to decide whether to pretend there are no slots free (see below), and
I managed to make it crash in the regression tests when doing a
parallel index build.  It's late here now, but I'll look into that
tomorrow.  It's possible that the parallel index code needs to learn
to cope with that.

#2  0x00a096f6 in SharedFileSetInit (fileset=0x80b2fe14c, seg=0x0) at sharedfileset.c:71
#3  0x00c72440 in tuplesort_initialize_shared (shared=0x80b2fe140, nWorkers=2, seg=0x0) at tuplesort.c:4341
#4  0x005ab405 in _bt_begin_parallel (buildstate=0x7fffc070, isconcurrent=false, request=1) at nbtsort.c:1402
#5  0x005aa7c7 in _bt_spools_heapscan (heap=0x801ddd7e8, index=0x801dddc18, buildstate=0x7fffc070, indexInfo=0x80b2b62d0) at nbtsort.c:396
#6  0x005aa695 in btbuild (heap=0x801ddd7e8, index=0x801dddc18, indexInfo=0x80b2b62d0) at nbtsort.c:328
#7  0x00645b5c in index_build (heapRelation=0x801ddd7e8, indexRelation=0x801dddc18, indexInfo=0x80b2b62d0, isreindex=false, parallel=true) at index.c:2879
#8  0x00643e5c in index_create (heapRelation=0x801ddd7e8, indexRelationName=0x7fffc510 "pg_toast_24587_index", indexRelationId=24603, parentIndexRelid=0,

I don't know if that's the bug that you're hitting, but it definitely
could be: REFRESH MATERIALIZED VIEW could be rebuilding an index.

===

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 90e0d739f8..f0b49d94ee 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -468,6 +468,13 @@ dsm_create(Size size, int flags)
     nitems = dsm_control->nitems;
     for (i = 0; i < nitems; ++i)
     {
+        /* BEGIN HACK */
+        if (random() % 10 > 5)
+        {
+            nitems = dsm_control->maxitems;
+            break;
+        }
+        /* END HACK */
         if (dsm_control->item[i].refcnt == 0)
         {
             dsm_control->item[i].handle = seg->handle;




Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Thomas Munro
On Wed, Jan 29, 2020 at 11:24 PM Julian Backes  wrote:
> we only had the "too many shared too many dynamic shared memory segments" 
> error but no segmentation faults. The error started occurring after upgrading 
> from postgres 10 to postgres 12 (server has 24 cores / 48 threads, i.e. many 
> parallel workers). The error itself was not that much of a problem but 
> /dev/shm started filling up with orphaned files which probably (?) had not 
> been cleaned up by postgres after the parallel workers died. In consequence, 
> after some time, /dev/shm was full and everything crashed.

Oh, thanks for the report.  I think I see what was happening there, and
it's a third independent problem.  The code in dsm_create() does
DSM_OP_DESTROY (ie cleans up) in the DSM_CREATE_NULL_IF_MAXSEGMENTS
case, but in the case where you see "ERROR: too many dynamic shared
memory segments" it completely fails to clean up after itself.  I can
reproduce that here.  That's a terrible bug, and has been sitting in
the tree for 5 years.

> Unfortunately, the only "solution" we found so far was to increase max 
> connections from 100 to 1000. After that (about 2 months ago I think), the 
> error had gone.

I'll take that as a vote for increasing the number of slots.




Re: ERROR: too many dynamic shared memory segments

2020-01-30 Thread Thomas Munro
On Wed, Jan 29, 2020 at 11:53 PM Thomas Munro  wrote:
> On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu  wrote:
> > This is the error on postgres log of the segmentation fault :
> >
> > 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process 
> > (PID 2042) was terminated by signal 11: Segmentation fault
> > 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed 
> > process was running: select pid from pg_stat_activity where query ilike 
> > 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> > 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating any 
> > other active server processes

> That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
> to decide whether to pretend there are no slots free (see below), and
> I managed to make it crash in the regression tests when doing a
> parallel index build.  It's late here now, but I'll look into that
> tomorrow.  It's possible that the parallel index code needs to learn
> to cope with that.

Hi Nicola,

Without more information I can't know if I found the same bug you
experienced, but I think it's likely.  I have committed a fix for
that, which will be available in the next release (mid-February).

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1




Re: ERROR: too many dynamic shared memory segments

2020-01-31 Thread Thomas Munro
On Thu, Jan 30, 2020 at 12:26 AM Thomas Munro  wrote:
> On Wed, Jan 29, 2020 at 11:24 PM Julian Backes  wrote:
> > we only had the "too many shared too many dynamic shared memory segments" 
> > error but no segmentation faults. The error started occurring after 
> > upgrading from postgres 10 to postgres 12 (server has 24 cores / 48 
> > threads, i.e. many parallel workers). The error itself was not that much of 
> > a problem but /dev/shm started filling up with orphaned files which 
> > probably (?) had not been cleaned up by postgres after the parallel workers 
> > died. In consequence, after some time, /dev/shm was full and everything 
> > crashed.
>
> Oh, thanks for the report.  I think I see what was happening there, and
> it's a third independent problem.  The code in dsm_create() does
> DSM_OP_DESTROY (ie cleans up) in the DSM_CREATE_NULL_IF_MAXSEGMENTS
> case, but in the case where you see "ERROR: too many dynamic shared
> memory segments" it completely fails to clean up after itself.  I can
> reproduce that here.  That's a terrible bug, and has been sitting in
> the tree for 5 years.

I committed a fix for that.  It'll be in the new releases that due out
in a couple of weeks.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=93745f1e019543fe7b742d0c5e971aad8d08fd56

> > Unfortunately, the only "solution" we found so far was to increase max 
> > connections from 100 to 1000. After that (about 2 months ago I think), the 
> > error had gone.
>
> I'll take that as a vote for increasing the number of slots.

I committed something to do this for 13 (due out end of year), but I'm
not game to back-patch it to the release branches.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d061ea21fc1cc1c657bb5c742f5c4a1564e82ee2




Re: ERROR: too many dynamic shared memory segments

2020-01-31 Thread Thomas Munro
On Fri, Jan 31, 2020 at 11:05 PM Nicola Contu  wrote:
> Do you still recommend to increase max_conn?

Yes, as a workaround of last resort.  The best thing would be to
figure out why you are hitting the segment limit, and see if there is
something we could tune to fix that. If you EXPLAIN your queries, do
you see plans that have a lot of "Gather" nodes in them, perhaps
involving many partitions?  Or are you running a lot of parallel
queries at the same time?  Or are you running queries that do very,
very large parallel hash joins?  Or something else?




Re: Unable to startup postgres: Could not read from file "pg_clog/00EC"

2020-02-06 Thread Thomas Munro
On Fri, Feb 7, 2020 at 1:47 AM Nick Renders  wrote:
> Thank you for the feedback, Alvaro.
>
> Unfortunately, the database is no longer "dumpable". We were able to do
> a pg_dump yesterday morning (12 hours after the crash + purging the
> pg_clog) but if we try one now, we get the following error:
>
> unexpected chunk number 1 (expected 0) for toast value 8282331 in
> pg_toast_38651
>
> Looking at our data, there seem to be 6 tables that have corrupt
> records. Doing a SELECT * for one of those records, will return a
> similar error:
>
> missing chunk number 0 for toast value 8288522 in pg_toast_5572299
>
>
> What is the best way to go from here? Is tracking down these corrupt
> records and deleting them the best / only solution?
> Is there a way to determine of there are issues with new data (after the
> crash)?
>
> Any help and advice is very much appreciated.

This error indicates that the file did exist already, it was just
shorter than we expected:

2020-02-04 15:20:44 CET DETAIL:  Could not read from file
"pg_clog/00EC" at offset 106496: Undefined error: 0.

What was the length of the file before you overwrote it? Are there
00EB and 00ED files, and if so what size?  When your server rebooted,
did crash recovery run or had it shut down cleanly?  Do you know if
the machine lost power, or the kernel crashed, or if it was a normal
reboot?  What are your settings for "fsync" and "wal_sync_method"?
What is the output of pg_controldata -D pgdata?  I wonder if that part
of the clog file was supposed to be created before the checkpoint (ie
the checkpoint is somehow borked), or was supposed to be created
during recovery after that checkpoint (something else is borked, but I
don't know what), or if the xid is somehow corrupted.

Here's a dirty trick that might help rescue some data.  Assuming you
have a copy of the original file before you zeroed it, you could write
a 256kb file full of 0x55 (that's 01010101 and represents 4 commits,
so if you fill the file up with that it means 'all transactions in
this range committed', which is probably closer to the truth than all
zeroes), and then copy the original shorter file over the top of it,
so that at least the range of transactions represented by the earlier
part of the file that did make it to disk is preserved, and we just
have bogus force-everything-to-look-committed data after that.
But as Alvaro said, this is a pretty bad situation, this is key meta
data used to interpret all other data files, so all bets are off here,
this is restore-from-backups territory.




Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-09 Thread Thomas Munro
On Sun, Feb 9, 2020 at 11:46 AM Tom Lane  wrote:
> "Nick Renders"  writes:
> > When we do the following statement:
> >   SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
> > the Postgres service restarts.
>
> Hm.
>
> > Here is what is logged:
> > 2020-02-08 20:21:19.942 CET [83892] LOG:  server process (PID 85456) was
> > terminated by signal 9: Killed: 9
> > 2020-02-08 20:21:19.942 CET [83892] DETAIL:  Failed process was running:
> > SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
>
> That's mighty interesting, because signal 9 is an external SIGKILL,
> not an internal-to-Postgres software fault.
>
> If you were running on Linux I would hypothesize that your process
> was getting killed by the infamous OOM killer, in which case we could
> guess that for some reason this query is consuming an unreasonable
> amount of memory and thereby attracting the wrath of the OOM killer.
> However, I'm not aware that any such mechanism exists on macOS.

macOS's thing like that appears as "kernel[0]: memorystatus_thread:
idle exiting pid XXX [some program]" in system.log, which seems like a
bit of an understatement to me but that's what they call it.  Some
details here:

http://newosxbook.com/articles/MemoryPressure.html

Nick, did you see that?




Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-10 Thread Thomas Munro
On Mon, Feb 10, 2020 at 4:35 AM Marc  wrote:
> We will keep the 12.1 in place so that we can run additional tests to assist 
> to pin-point the issue.
>
> Feel free to ask but allow us to recover from these hectic days ;-)

Here's how to get a stack so we can see what it was doing, assuming
you have the Apple developer tools installed:

1.  Find the PID of the backend you're connected to with SELECT
pg_backend_pid().
2.  "lldb -p PID" from a shell to attach to the process, then "cont"
to let it continue running.
3.  Run the query in that backend and wait for the SIGKILL.
4.  In the lldb session, type "bt".

It'll only make sense if your PostgreSQL build has debug symbols, but let's see.




Re: LDAP with TLS is taking more time in Postgresql 11.5

2020-02-25 Thread Thomas Munro
On Wed, Feb 26, 2020 at 7:37 AM Adrian Klaver  wrote:
> On 2/25/20 10:23 AM, Mani Sankar wrote:
> > Hi Adrian,
> >
> > Both the machines are in same network and both are pointing towards the
> > same LDAP server
>
> I don't see any errors in the Postgres logs.
>
> You probably should take a look at the LDAP server logs to see if there
> is anything there.
>
> You could also turn up the logging detail in Postgres to see if it
> reveals anything.

A couple more ideas:

If you take PostgreSQL out of the picture and run the equivalent LDAP
queries with the ldapsearch command line tool, do you see the same
difference in response time?  If so, I'd trace that with strace etc
with timings to see where the time is spent -- for example, is it
simply waiting for a response from the LDAP (AD?) server?   If not,
I'd try tracing the PostgreSQL process and looking at the system calls
(strace -tt -T for high res times and elapsed times), perhaps using
PostgreSQL's pre_auth_delay setting to get time to attach strace.

A wild stab in the dark: if it's slow from one computer and not from
another, perhaps the problem has something to do with a variation in
reverse DNS lookup speed on the LDAP server side when it's verifying
the certificate.  Or something like that.




Re: \COPY to accept non UTF-8 chars in CHAR columns

2020-03-27 Thread Thomas Munro
On Sat, Mar 28, 2020 at 4:46 AM Tom Lane  wrote:
> Matthias Apitz  writes:
> > In short, it there a way to let \COPY accept such broken ISO bytes, just
> > complaining about, but not stopping the insert of the row?
>
> No.  We don't particularly believe in the utility of invalid data.
>
> If you don't actually care about what encoding your data is in,
> you could use SQL_ASCII as the database "encoding" and thereby
> disable all UTF8-specific behavior.  Otherwise, maybe this conversion
> is a good time to clean up the mess?

Something like this approach might be useful for fixing the CSV file:

https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

I haven't tested that program but it looks like the right sort of
approach; I remember writing similar logic to untangle the strange
mixtures of Latin 1, Windows 1252, and UTF-8  that late 90s browsers
used to send.  That sort of approach can't fix every theoretical
problem (some valid Latin1 sequences are also valid UTF-8 sequences)
but it's doable with text in European languages.




Re: EINTR while resizing dsm segment.

2020-04-03 Thread Thomas Munro
On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
 wrote:
> I provided the subject, and added -hackers.
>
> > Hello,
> > I am running postgres 11.5 and we were having issues with shared segments.
> > So I increased the max_connection as suggested by you guys and reduced my
> > work_mem to 600M.
> >
> > Right now instead, it is the second time I see this error :
> >
> > ERROR:  could not resize shared memory segment "/PostgreSQL.2137675995" to
> > 33624064 bytes: Interrupted system call
>
> The function posix_fallocate is protected against EINTR.
>
> | do
> | {
> |   rc = posix_fallocate(fd, 0, size);
> | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>
> But not for ftruncate and write. Don't we need to protect them from
> EINTR as the attached?

We don't handle EINTR for write() generally because that's not
supposed to be necessary on local files (local disks are not "slow
devices", and we document that if you're using something like NFS you
should use its "hard" mount option so that it behaves that way too).
As for ftruncate(), you'd think it'd be similar, and I can't think of
a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
but I can't seem to figure that out from reading man pages; maybe I'm
reading the wrong ones.  Perhaps in low memory situations, an I/O wait
path reached by ftruncate() can return EINTR here rather than entering
D state (non-interruptable sleep) or restarting due to our SA_RESTART
flag... anyone know?

Another thought: is there some way for the posix_fallocate() retry
loop to exit because (ProcDiePending || QueryCancelPending), but then
for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
reporting the EINTR?




Re: EINTR while resizing dsm segment.

2020-04-07 Thread Thomas Munro
On Tue, Apr 7, 2020 at 8:58 PM Nicola Contu  wrote:
> So that seems to be a bug, correct?
> Just to confirm, I am not using NFS, it is directly on disk.
>
> Other than that, is there a particular option we can set in the postgres.conf 
> to mitigate the issue?

Hi Nicola,

Yeah, I think it's a bug.  We're not sure exactly where yet.




Re: Transition tables for column-specific UPDATE triggers

2020-05-03 Thread Thomas Munro
On Wed, Oct 9, 2019 at 3:59 PM Guy Burgess  wrote:
> The manual says: 
> https://www.postgresql.org/docs/current/sql-createtrigger.html
>
> A column list cannot be specified when requesting transition relations.
>
> And (I think the same point):
>
> The standard allows transition tables to be used with column-specific UPDATE 
> triggers, but then the set of rows that should be visible in the transition 
> tables depends on the trigger's column list. This is not currently 
> implemented by PostgreSQL.
>
> Are there any plans to allow transition tables to be used with 
> column-specific UPDATE triggers?  Or, is there another way for a trigger 
> function to see the rows changed by a column-specific UPDATE trigger?

Hi Guy,

Answering an ancient message that went unanswered... I'm not aware of
anyone working on that, and there isn't another way: the transition
tables feature simply won't let you create such a trigger.  The last
I've seen anyone say about that was in the following commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0f79440fb0b4c5a9baa9a95570c01828a9093802

Excerpt:

"Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature."

The code preventing it is here:

/*
 * We currently don't allow column-specific triggers with
 * transition tables.  Per spec, that seems to require
 * accumulating separate transition tables for each combination of
 * columns, which is a lot of work for a rather marginal feature.
 */
if (stmt->columns != NIL)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("transition tables cannot be specified
for triggers with column lists")));

In theory you could do some filtering in your trigger procedure, by
comparing rows in OLD TABLE and NEW TABLE to see which row-pairs
represent changes to the columns you care about, but that's slightly
complicated: you can join OLD and NEW using whatever keys you have
defined, but that only works if there's no danger of the keys
themselves changing.  I wondered about inventing something like WITH
ORDINALITY so that you get unique ordered numbers in an extra column
that can be used to join the two transition tables without knowing
anything about the user defined keys, but among other problems I
couldn't figure out how to fit it in syntax-wise.
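
For example, here's a rough sketch of that filtering approach, with
invented names: a table prices(id, price), an audit table
price_audit(id, old_price, new_price), and the assumption that id is
never updated:

CREATE FUNCTION audit_price_changes() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  -- pair old/new rows on the (assumed immutable) key and keep only the
  -- pairs where the column of interest actually changed
  INSERT INTO price_audit (id, old_price, new_price)
  SELECT n.id, o.price, n.price
    FROM old_rows o
    JOIN new_rows n ON n.id = o.id
   WHERE n.price IS DISTINCT FROM o.price;
  RETURN NULL;
END;
$$;

CREATE TRIGGER audit_price_changes AFTER UPDATE ON prices
REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
FOR EACH STATEMENT EXECUTE PROCEDURE audit_price_changes();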

I suppose PostgreSQL could do this internally to make the feature you
want work: whenever you scan either table, in an UPDATE OF 
trigger, it could scan both transition tables in sync and filter out
rows that didn't change your columns of interest.  Or it could do that
filtering up front, before your trigger fires, to create two brand new
tuplestores just for your trigger invocation.  Or there could be a
third spillable data structure, that records which triggers should be
able to see each old/new-pair, or which columns changed, and is
scanned in sync with the others.  Just some first thoughts; I am not
planning to work on this any time soon.




Re: 12.2: Howto check memory-leak in worker?

2020-05-04 Thread Thomas Munro
On Tue, May 5, 2020 at 10:13 AM Peter  wrote:
> BTW, I would greatly appreciate if we would reconsider the need for
> the server to read the postmaster.pid file every few seconds (probably
> needed for something, I don't know).
> That makes it necessary to set atime=off to get a spindown, and I
> usually prefer to have atime=on so I can see what my stuff is
> currently doing.

That'd be this:

/*
 * Once a minute, verify that postmaster.pid hasn't been removed or
 * overwritten.  If it has, we force a shutdown.  This avoids having
 * postmasters and child processes hanging around after their database
 * is gone, and maybe causing problems if a new database cluster is
 * created in the same place.  It also provides some protection
 * against a DBA foolishly removing postmaster.pid and manually
 * starting a new postmaster.  Data corruption is likely to ensue from
 * that anyway, but we can minimize the damage by aborting ASAP.
 */
if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
{
if (!RecheckDataDirLockFile())




Re: Is there a significant difference in Memory settings between 9.5 and 12

2020-05-11 Thread Thomas Munro
On Tue, May 12, 2020 at 2:52 PM Tory M Blue  wrote:
> It took the change but didn't help. So 10GB of shared_buffers in 12 is still 
> a no go. I'm down to 5GB and it works, but this is the same hardware, the 
> same exact 9.5 configuration. So I'm missing something. WE have not had to 
> mess with kernel memory settings since 9.4, so this is an odd one.
>
> I'll keep digging, but i'm hesitant to do my multiple TB db's with half of 
> their shared buffer configs, until I understand what 12 is doing differently 
> than 9.5

Which exact version of 9.5.x are you coming from?  What's the exact
error message on 12 (you showed the shared_memory_type=sysv error, but
with the default  value (mmap) how does it look)?  What's your
huge_pages setting?

Can you reproduce the problem with a freshly created test cluster?  As
a regular user, assuming regular RHEL packaging, something like
/usr/pgsql-12/bin/initdb -D test_pgdata, and then
/usr/pgsql-12/bin/postgres -D test_pgdata -c shared_buffers=10GB (then
^C to stop it).  If that fails to start in the same way, it'd be
interesting to see the output of the second command with strace in
front of it, in the part where it allocates shared memory.  And
perhaps it'd be interesting to see the same output with
/usr/pgsql-9.5/bin/XXX (if you still have the packages).  For example,
on my random dev laptop that looks like:

openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(6, "MemTotal:   16178852 kB\nMemF"..., 1024) = 1024
read(6, ":903168 kB\nShmemHugePages:  "..., 1024) = 311
close(6)= 0
mmap(NULL, 11016339456, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = -1 ENOMEM (Cannot
allocate memory)
mmap(NULL, 11016003584, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS, -1, 0) = 0x7ff74e579000
shmget(0x52e2c1, 56, IPC_CREAT|IPC_EXCL|0600) = 3244038
shmat(3244038, NULL, 0) = 0x7ff9df5ad000

The output is about the same on REL9_5_STABLE and REL_12_STABLE for
me, only slightly different sizes.  If that doesn't fail in the same
way on your system with 12, perhaps there are some more settings from
your real clusters required to make it fail.  You could add them one
by one with -c foo=bar or in the throw away
test_pgdata/postgresql.conf, and perhaps that process might shed some
light?

I was going to ask if it might be a preloaded extension that is asking
for gobs of extra memory in 12, but we can see from your "Failed
system call was shmget(key=5432001, size=11026235392, 03600)" that
it's in the same ballpark as my total above for shared_buffers=10GB.




Re: Ubuntu 20.04: apt.postgresql.org/pub/repos/apt focal-pgdg InRelease' doesn't support architecture 'i386'

2020-05-16 Thread Thomas Munro
On Sun, May 17, 2020 at 10:45 AM Hugh  wrote:
> While this doesn't appear to be a bug that causes problems of any kind, I do 
> have a question about its cause.
>
> The "error" listed in the Subject: line is basically what I'm seeing. The 
> entire message is below, particularly the 'N:' at the end. Is there a repo 
> setting I should change to prevent the request for '386' architecture? Thank 
> you in advance for your assistance.

I'm not sure, but it seems related to this complaint and the answer
might be to tell your sources.list that that source has only amd64:

https://www.postgresql.org/message-id/flat/16402-1f2d77e819f9e1f2%40postgresql.org
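
That is, something like this in your pgdg sources list entry (untested;
adjust it to match your existing line):

deb [arch=amd64] https://apt.postgresql.org/pub/repos/apt focal-pgdg main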




Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-21 Thread Thomas Munro
On Fri, May 22, 2020 at 7:48 AM Mike Klaas  wrote:
> locktype: page
> relation::regclass::text: _pkey
> virtualtransaction: 36/296299968
> granted:t
> pid:2263461

That's an unusually high looking pid.  Is that expected, for example
did you crank Linux's pid_max right up, or is this AIX, or something?




Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-22 Thread Thomas Munro
On Fri, May 22, 2020 at 7:48 AM Mike Klaas  wrote:
> It's my understanding that these locks should be cleared when there are no 
> conflicting transactions.  These locks had existed for > 1 week and we have 
> no transactions that last more than a few seconds (the oldest transaction in 
> pg_stat_activity is always < 1minute old).
> Why would a transaction that is finished continue accumulating locks over 
> time?

Predicate locks are released by ClearOldPredicateLocks(), which
releases SERIALIZABLEXACTs once they are no longer interesting.  It
has a  conservative idea of what is no longer interesting: it waits
until the lowest xmin across active serializable snapshots is >= the
transaction's finishedBefore xid, which was the system's next xid (an
xid that hasn't been used yet*) at the time the SERIALIZABLEXACT
committed.  One implication of this scheme is that SERIALIZABLEXACTs
are cleaned up in commit order.  If you somehow got into a state where
a few of them were being kept around for a long time, but others
committed later were being cleaned up (which I suppose must be the
case or your system would be complaining about running out of
SERIALIZABLEXACTs), that might imply that there is a rare leak
somewhere in this scheme.  In the past I have wondered if there might
be a problem with wraparound in the xid tracking for finished
transactions, but I haven't worked out the details (transaction ID
wraparound is both figuratively and literally the Ground Hog Day of
PostgreSQL bug surfaces).

*Interestingly, it takes an unlocked view of that value, but that
doesn't seem relevant here; it could see a value that's too low, not
too high.
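
If it helps while digging, here's a quick sketch of a query to see
which virtual transactions are hanging onto the most SIReadLocks:

SELECT virtualtransaction, pid, count(*) AS sireadlocks
  FROM pg_locks
 WHERE mode = 'SIReadLock'
 GROUP BY virtualtransaction, pid
 ORDER BY count(*) DESC
 LIMIT 10;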




Re: Can we get SQL Server-like cross database queries

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 4:26 PM Ron  wrote:
> On 6/3/20 2:57 PM, Rob Sargent wrote:
> >> On Jun 3, 2020, at 1:46 PM, Tom Lane  wrote:
> >> Guyren Howe  writes:
> >>> Is it practical to provide the SQL Server-like feature in Postgres?
> >> No.
> > That got me chuckling.
> > I had just decided not to bother posting, but well, here goes.
> >
> > I call bs on the “cross db query” notion of tsql - but I admit I haven’t 
> > used it since it was a Sybase thing.
> >
> > Is db.owner.table (iirc) is really anything more than nuance on 
> > schema.table.  Does a db for automotive parts need to be 
> > up-close-and-personal with a db payroll?
>
> Those aren't the only two databases that exist.  Think of a federated system
> where you've got a "reference" database full of lookup tables, and one
> database for every 10,000 customers. For 45,000 customers you've got five
> databases, and they all need to access the reference database, plus "all
> customers" queries need to access all five databases.

There's no doubt it's useful, and it's also part of the SQL spec,
which says you can do catalog.schema.table.  I would guess that we
might get that as a byproduct of any project to make PostgreSQL
multithreaded.  That mountain moving operation will require us to get
rid of all the global state that currently ties a whole process to one
session and one database, and replace it with heap objects with names
like Session and Database that can be passed around between worker
threads.




Re: Shared memory error

2020-06-04 Thread Thomas Munro
On Thu, Jun 4, 2020 at 6:18 AM Sonam Sharma  wrote:
>>> 1) Postgres version : 11.2

FYI This is missing over a year's worth of bugfixes.  That said, I
don't know of anything fixed that has this symptom.

>>> 4) Is this only with one query and if so what is it doing? : No , few 
>>> queries work, few fails .. mostly all are complex select queries.

It's a long shot, because I expect you're running the usual packages
under a user set up by the package to be a "system" user, but just in
case: there is one known way for POSIX shm to be unlinked while you're
still using it: systemd.  For example, this might happen if you have
cronjobs running as the same user.  Some more info:

https://www.postgresql.org/docs/11/kernel-resources.html#SYSTEMD-REMOVEIPC
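
If that does turn out to be the cause, the setting described there is
RemoveIPC in logind.conf, something like this (untested sketch; restart
systemd-logind afterwards):

# /etc/systemd/logind.conf
[Login]
RemoveIPC=no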

Do you see any other errors around this one, in the PostgreSQL logs?




Re: Shared memory error

2020-06-04 Thread Thomas Munro
>> Do you see any other errors around this one, in the PostgreSQL logs?
> No , only this is the error from db and jdbc end .. and queries are failing

If you need a workaround right now you could always set
max_parallel_workers_per_gather=0 so that it doesn't try to use
parallel query.  That could make some queries run more slowly.
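
For example (a sketch; this changes the setting for the whole cluster
until you undo it):

ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();

-- and later, to go back to the default:
ALTER SYSTEM RESET max_parallel_workers_per_gather;
SELECT pg_reload_conf();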

When you say you extended the /dev/shm filesystem, what are the mount
options, or how did you configure it?  It's interesting that the
problem went away for a while when you did that.




Re: Shared memory error

2020-06-04 Thread Thomas Munro
On Fri, Jun 5, 2020 at 1:00 AM Sonam Sharma  wrote:
> The dynamic_shared_memory_type was set to POSIX . Because of this it was 
> using tmpfs /dev/shm. When the query was running I saw the file system was 
> filling. So I  extended the file system and luckily the query worked for that 
> time

Oh, hmm.  When you say "filling", maybe you hit the leaking shm
problem that was fixed in 11.7, that requires messages somewhere in
your log about lack of DSM slots?  I don't have an explanation for the
exact the error you're seeing though.  It's a bit tricky to speculate
on older releases with so many bug fixes missing though.  Please let
us know if you still see the problem after your next scheduled
upgrade.




Re: troubleshooting postgresql ldap authentication

2020-06-08 Thread Thomas Munro
On Tue, Jun 9, 2020 at 9:05 AM Chris Stephens  wrote:
> hostsslall all 0.0.0.0/0  ldap 
> ldapserver="ldaps://xxx" ldapbasedn="yyy" ldaptls=1

> does anyone know what might be causing "LDAP: Bad parameter to an ldap 
> routine"

You probably want ldapurl="ldaps://xxx" (note: ldapurl, not
ldapserver).  Or you could use ldapscheme="ldaps" and
ldapserver="xxx".




Re: Definition of REPEATABLE READ

2020-06-21 Thread Thomas Munro
On Mon, Jun 22, 2020 at 12:25 AM Peter J. Holzer  wrote:
> I've read http://jepsen.io/analyses/postgresql-12.3 which reports a
> problem in PostgreSQL's SERIALIZABLE isolation level (which has since been
> fixed) and also shows an example of a violation of what they consider to
> be the correct definition of REPEATABLE READ.

In response to that report, we added a paragraph to the manual to
state explicitly that what we offer is snapshot isolation:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1575fbf1ac3c8464b2dade245deff537a3aa2498

https://www.postgresql.org/docs/devel/transaction-iso.html#XACT-REPEATABLE-READ

> Since those papers are now 25 and 20 years old, respectively, and there
> have been several revisions of the SQL standard in the meantime, has the
> SQL committee come around to that view (SQL/2003 seems unchanged, I
> couldn't find a later revision online)? And if it has, did it follow
> those definitions or come up with different ones (it seems to me that
> G2-item is much stronger that warranted by the wording in the standard)?

SQL:2016 is the latest, and I don't have it to hand right now, but
SQL:2011 still defined four degrees of isolation in terms of the three
phenomena (pages 133-135 of part II, "Foundations", at least in the
draft copy you can find on the net).

As for what else PostgreSQL should do about this historical confusion,
I suspect that a patch to pgsql-hackers to accept the name SNAPSHOT
ISOLATION as an alternative would be popular, and we could probably
even spin it as the "true" name of the level (while still accepting
REPEATABLE READ; there is no point in breaking that), and maybe add a
second table to the manual's Transaction Isolation section to cover
the later formalisations in a digested user friendly format (if that
is possible).




Re: Same query taking less time in low configuration machine

2020-07-15 Thread Thomas Munro
On Tue, Jul 14, 2020 at 9:27 PM Vishwa Kalyankar
 wrote:
>  Function Scan on kseb_geometry_trace_with_barrier_partition  
> (cost=0.25..10.25 rows=1000 width=169) (actual time=11626.548..11626.568 
> rows=254 loops=1)

>  Function Scan on kseb_geometry_trace_with_barrier_partition  
> (cost=0.25..10.25 rows=1000 width=169) (actual time=22304.425..22304.448 
> rows=254 loops=1)

I have no idea what that function does, but perhaps it runs more
queries, and you can't see the plans for those here.  If you set up
auto_explain[1], and turn on auto_explain.log_nested_statements, then
you'll be able to see the query plans for the internal stuff happening
in that function.

I'd look at EXPLAIN (BUFFERS) or auto_explain.log_buffers to get more
information on cache hits and misses.  I'd look for settings
differences with EXPLAIN (SETTINGS) to see if there's anything
accidentally set differently (maybe JIT or parallelism or something
like that).  I'd look at pg_stat_activity repeatedly while it runs to
see what the processes are doing, especially the wait_event column.
I'd also look at the CPU and I/O on the systems with operating system
tools like iostat, htop, perf to try to find the difference.

[1] https://www.postgresql.org/docs/current/auto-explain.html
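
A minimal postgresql.conf sketch for that (the shared_preload_libraries
change needs a restart; the auto_explain.* settings can be reloaded):

shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 0          # log the plan of every statement
auto_explain.log_analyze = on              # include actual row counts and timings
auto_explain.log_buffers = on              # include buffer hits/misses
auto_explain.log_nested_statements = on    # include statements run inside functions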




Re: PG 9.5.5 cores on AIX 7.1

2020-07-19 Thread Thomas Munro
On Sun, Jul 19, 2020 at 11:01 PM Abraham, Danny  wrote:
> Segmentation fault in _alloc_initial_pthread at 0x9521474
> 0x9521474 (_alloc_initial_pthread+0x1d4) e803 ld   
> r0,0x0(r3)
> (dbx) where
> _alloc_initial_pthread(??) at 0x9521474
> __pth_init(??) at 0x951f390
> uload(??, ??, ??, ??, ??, ??) at 0x9fff000ab70 load1(??, ??, ??, ??) at 
> 0x9000b74 load(??, ??, ??) at 0x9001ef0 loadAndInit(??, ??, 
> ??) at 0x905b38c dlopen(??, ??) at 0x909bfe0
> internal_load_library(??) at 0x10014c684
> RestoreLibraryState(??) at 0x10014d79c
> ParallelWorkerMain(??) at 0x1000bb2d0
> StartBackgroundWorker() at 0x10026cd94
> maybe_start_bgworkers() at 0x10003834c
> sigusr1_handler(??) at 0x10003902c
> __fd_select(??, ??, ??, ??, ??) at 0x91567fc
> ServerLoop() at 0x1004cec90
> PostmasterMain(??, ??) at 0x10003a4e8
> main(??, ??) at 0x108f8

FWIW there was a report a decade ago that looked at least superficially similar:

https://www.postgresql.org/message-id/flat/09B23E7BF70425478C1330D893A722C602FEC019BD%40MailSVR.invera.com




Re: Check replication lag

2020-08-05 Thread Thomas Munro
On Thu, Aug 6, 2020 at 7:02 AM Sreejith P  wrote:
> IN SYNC mode of replication what would be the impact on Master DB server in 
> terms of over all performance  ?

The pg_stat_replication columns write_lag, flush_lag and replay_lag
are designed to tell you how long to expect commits to take for
synchronous standbys, based on recent history, if synchronous_commit
is set to remote_write, on or remote_apply respectively.  Those times
tell you about commit latency, which limits sequential commit rate for
each session.
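
For example, something like this shows the current estimates per standby:

SELECT application_name, sync_state, write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;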




Re: PostgreSQL on Windows' state

2020-09-23 Thread Thomas Munro
On Wed, Sep 23, 2020 at 10:53 PM Alessandro Dentella
 wrote:
> Thanks Pavel, but psycopg2 (that I always use is just for Python). They
> claim/complain that from c# there's no native solution.

Maybe https://www.npgsql.org/?




Re: Unable to compile postgres 13.1 on Slackware current x64

2020-11-16 Thread Thomas Munro
On Mon, Nov 16, 2020 at 10:10 PM Laurenz Albe  wrote:
> On Mon, 2020-11-16 at 09:15 +0200, Condor wrote:
> > collationcmds.c: In function ‘get_icu_language_tag’:
> > collationcmds.c:467:51: error: ‘TRUE’ undeclared (first use in this
> > function); did you mean ‘IS_TRUE’?
> >467 |  uloc_toLanguageTag(localename, buf, sizeof(buf), TRUE, &status);
> >|   ^~~~
> >|   IS_TRUE
> >

> "UBool" and "TRUE" are defined in "umachine.h", which is a header file for the
> "libicu" library.
>
> PostgreSQL includes "unicode/ucol.h", which will include "umachine.h"
> (via "utypes.h"), so that should be fine.
>
> Are your libicu headers installed under /usr/include/unicode?
> Do you get any messages about missing include files earlier?

It looks like something happened to ICU's boolean macros.  See this
commit in FreeBSD ports:

https://github.com/freebsd/freebsd-ports/commit/81a88b4506ec06d07be10d199170ef4003eb0e30

... which references:

https://github.com/unicode-org/icu/commit/c3fe7e09d844




Re: Unable to compile postgres 13.1 on Slackware current x64

2020-11-17 Thread Thomas Munro
On Tue, Nov 17, 2020 at 8:02 PM Condor  wrote:
> I try to compile postgres again with (cd src/backend/commands; sed
> 's/TRUE/true/' collationcmds.c > collationcmds.c; ) and it's compiled
> but get new error on linking:

Doesn't that produce an empty file collationcmds.c? I think you want:

sed 's/TRUE/true/' collationcmds.c > collationcmds.c.tmp && mv
collationcmds.c.tmp collationcmds.c




Re: Unable To Drop Tablespace

2021-02-04 Thread Thomas Munro
On Fri, Feb 5, 2021 at 12:43 PM Ian Lawrence Barwick  wrote:
> 2021年2月5日(金) 3:52 Pavan Pusuluri :
>> We are trying to drop a table space on RDS Postgres . We have removed the 
>> objects etc, but it still won't drop.
>>
>> I have checked and there's no reference anywhere to this tablespace but it 
>> complains it's not empty.
>>
>> I checked if it is a default for a database, revoked all privileges on the 
>> tablespace.
>>
>> We dropped the database but underpinning tablespace remained but when I 
>> query to see if any reference i get no hits.
>>
>> "Select c.relname,t.spcname from pg_class c JOIN pg_tablespace t ON 
>> c.reltablespace=t.oid where t.spcname='mytablespace'
>>
>> I dont find any objects referencing. Kindly let me know if anything else 
>> needs to be checked?
>
>
> There's a handy function "pg_tablespace_databases()" to check which databases
> might still have objects in a database. There are a couple of useful queries 
> demonstrating
> usage here:
>
>   https://pgpedia.info/p/pg_tablespace_databases.html

It's also possible for there to be stray files in there, in some crash
scenarios where PostgreSQL doesn't currently clean up relation files
that it ideally should.  The one with the widest window AFAIK is where
you crash after creating a table but before committing[1].  You'd need
a directory listing to investigate that.

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG%2BXpssKqmezug%40mail.gmail.com
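
For example, a quick sketch using the function mentioned above (with
'mytablespace' as in your query) to list databases that still have
objects in the tablespace:

SELECT datname
  FROM pg_database
 WHERE oid IN (SELECT pg_tablespace_databases(oid)
                 FROM pg_tablespace
                WHERE spcname = 'mytablespace');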




Re: How to post to this mailing list from a web based interface

2021-02-14 Thread Thomas Munro
On Fri, Jan 29, 2021 at 4:27 AM Alvaro Herrera  wrote:
> On 2021-Jan-28, Ravi Krishna wrote:
> > I recollect there use to be a website from where one can reply from web.
>
> The community does not maintain such a service.
>
> There used to be a Gmane archive of this list that you could use to
> post.  Seems it's still online at postgresql-archive.org.  They have a
> "Reply" button and it says to require your account, but given SPF and
> DMARC and other restrictions on email generation, it seems pretty
> uncertain that emails posted that way would work correctly.  I think we
> would even reject such emails if they reached our mailing list servers.

While catching up with some interesting new threads just now I was
quite confused by the opening sentence of this message (which also
arrived in my mailbox):

https://www.postgresql.org/message-id/1611355191319-0.post%40n3.nabble.com

... until I got to the last line.  I wonder if the "Resend" facility
on our own archives could be better advertised, via a "Want to join
this thread?" link in the Quick Links section that explains how to use
it and what problem it solves, or something...




Re: [LDAPS] Test connection user with ldaps server

2021-02-15 Thread Thomas Munro
On Tue, Feb 16, 2021 at 4:32 AM Laurenz Albe  wrote:
> What I would do is experiment with the "ldapsearch" executable from OpenLDAP 
> and see
> if you can reproduce the problem from the command line.

Also, maybe try doing this as the "postgres" user (or whatever user
PostgreSQL runs as), just in case there are some environmental
differences affecting the behaviour.




Re: SELECT creates millions of temp files in a single directory

2022-04-23 Thread Thomas Munro
On Sun, Apr 24, 2022 at 8:00 AM Peter  wrote:
> More than a million files in a single directory, this is
> inacceptable.

You didn't show EXPLAIN (ANALYZE) but if [Parallel] Hash is making
insane numbers of temporary files then something is not working as
intended... and I can take a guess at what it is.  I tried to recreate
this with randomly distributed file.pathid, same size tables as you,
and I got 32 batches and a nice manageable number of temporary files.
Adding millions of extra file rows with duplicate pathid=42 gets me
something like "Batches: 524288 (originally 32)" in EXPLAIN (ANALYZE)
output.  I guess that's the sort of distribution you have here?

Extensive discussion of the cause of that and potential (hard)
solutions over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, then we
"partition" (spill part of the data to disk) to make twice as many
(hopefully) smaller hash tables that do fit.  Sometimes partitioning
produces one or more hash tables that are still too big because of
uneven key distribution, so we go around again, doubling the number of
partitions (and thus temporary files) every time.  I would say that
once we get past hundreds of partitions, things are really turning to
custard (file descriptors, directory entries, memory overheads, ... it
just doesn't work well anymore), but currently we don't give up until
it becomes very clear that repartitioning is not helping.

This algorithmic problem existed before parallelism was added, but
it's possible that the parallel version of the meltdown hurts a lot
more (it has extra per-process files, and in multi-attempt scenarios
the useless earlier attempts hang around until the end of the query
instead of being cleaned up sooner, which doubles the number of
files).

Hopefully that gives some clues about how one might rewrite the query
to avoid a massive unsplittable set of duplicate keys in hash tables,
assuming I'm somewhere close to the explanation here (maybe some
subselect with UNIQUE or GROUP BY in it, or some way to rewrite the
query to avoid having the problematic duplicates on the "inner" side,
or a completely different plan..., or just crank up work_mem massively
for this query so that you don't need a partition step at all).
Obviously it would be nice for PostgreSQL to be hardened against this
risk, eventually, though.
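
For example, two sketches of that (the table and column names other
than file.pathid are guesses, so adjust to your schema):

SET work_mem = '2GB';   -- session-local; large enough that no partitioning is needed
-- ... run the problem query here ...
RESET work_mem;

-- or collapse the skewed key before joining, if the query allows it:
SELECT p.*, f.n
  FROM path p
  JOIN (SELECT pathid, count(*) AS n
          FROM file
         GROUP BY pathid) f ON f.pathid = p.id;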

As for merge join planning, unfortunately they aren't fully
parallelisable and in the plan you showed, a separate copy of the big
sort runs in every process, which isn't nice (we don't have a Parallel
Sort that could fix that yet).




Re: Pg14 possible index corruption after reindex concurrently

2022-05-24 Thread Thomas Munro
On Wed, May 25, 2022 at 6:17 AM Aleš Zelený  wrote:
> SELECT format('REINDEX SCHEMA CONCURRENTLY %I;', n.nspname)

This may be related to bug #17485, discussed at:

https://www.postgresql.org/message-id/flat/17485-396609c6925b982d%40postgresql.org




Re: Improve configurability for IO related behavoir

2022-05-28 Thread Thomas Munro
On Sun, May 29, 2022 at 4:29 AM 浩辰 何  wrote:
> Furthermore, the results above are also related to IO API supported by OS. 
> MySQL support synchronized IO and Linux libaio. It seems
> that PostgreSQL only supports synchronized IO, so shall we support more IO 
> engines? like io_uring which is very popular in recent years.

Hi Haochen,

There is an active project to bring these things to PostgreSQL.
https://wiki.postgresql.org/wiki/AIO has some information and links.
The short version is that there is a development patch set to add
these GUCs:

io_method=worker,io_uring,posix_aio,iocp,...
io_data_direct=on,off
io_wal_direct=on,off

It also adds a bunch of specialised logic that knows how to initiate
IO in key places (scans, writeback, recovery, ...), because it's not
enough to just turn off kernel I/O buffering, we also have to do all
the work the kernel is doing for us or performance will be terrible.




Re: Order of rows in statement triggers NEW/OLD tables

2022-05-31 Thread Thomas Munro
On Fri, May 6, 2022 at 6:20 PM hubert depesz lubaczewski
 wrote:
> when defining statement triggers on update I can use:
>
> REFERENCING OLD TABLE AS xxx NEW TABLE as YYY
>
> these "pseudo" tables contain rows that were before and after.
>
> Is the order guaranteed?
>
> Can I assume that "first" row returned by select from xxx, will be older
> version of first row returned by select from yyy?

Good question, and one I've wondered about before.  I *think* that is
currently true, due to implementation details, but it could change.
The trigger code fills up both tuplestores (= memory/disk containers
for transition tables) in sync with each other in
AfterTriggerSaveEvent(), and then NamedTuplestoreScan node emits
tuples in insertion order.  We already banned the use of transition
tables when there is "ON UPDATE OR INSERT" etc so you can't get mixed
INSERT/UPDATE/DELETE results which would desynchronise the old and new
tuples, and I also wondered if something tricky like FK ON DELETE
CASCADE/SET NULL in a self-referencing table could mix some
old-with-no-new into UPDATE results, but I can't see how to do that,
and I can't think of any other way off the top of my head.  Of course,
joins etc could destroy the order higher in your query plan.

While reading about proposed incremental materialized view ideas, I
once idly wondered if it'd be useful, as an extension to the standard,
to be able to use WITH ORDINALITY for transition tables (normally only
used for unnest(), and in PG also any SRF) and then you could use that
+ ORDER BY to explicitly state your desired order (and possibly teach
the planner that ORDINALITY attributes are path keys by definition so
it doesn't need to insert sort nodes in simple cases).  That is,
instead of relying on scan order.  In any case, an in-core IMV feature
is allowed to peek deeper into the implementation and doesn't even
need SQL here, so I didn't follow that thought very far...  I am not
sure about this, but I wonder if any user-level
portable-across-SQL-implementation user-level scheme for
replication/materialization built on top of trigger transition tables
would need to require immutable unique keys in the rows in order to be
able match up before/after tuples.




Re: AIX and EAGAIN on open()

2022-07-03 Thread Thomas Munro
On Mon, Jun 20, 2022 at 9:53 PM Christoph Berg
 wrote:
> IBM's reply to the issue back in December 2020 was this:
>
>   The man page / infocenter document is not intended as an exhaustive
>   list of all possible error codes returned and their circumstances.
>   "Resource temporarily unavailable" may also be returned for
>   O_NSHARE, O_RSHARE with O_NONBLOCK.
>
> Afaict, PG does not use these flags either.
>
> We also ruled out that the system is using any anti-virus or similar
> tooling that would intercept IO traffic.
>
> Does anything of that ring a bell for someone? Is that an AIX bug, a
> PG bug, or something else?

No clue here.  Anything unusual about the file system (NFS etc)?  Can
you truss/strace the system calls, to sanity check the flags arriving
into open(), and see if there's any unexpected other activity around
open() calls that might be coming from something you're linked
against?




Re: Strange collation names ("hu_HU.UTF-8")

2022-08-02 Thread Thomas Munro
On Wed, Aug 3, 2022 at 1:43 AM Tom Lane  wrote:
> I believe most if not all variants of Unix are
> permissive about the spelling of the encoding part.

I've only seen glibc doing that downcase-and-strip-hyphens thing to
the codeset part of a locale name when looking for locale definition
files.  Other systems like FreeBSD expect to be able to open
/usr/share/locale/$LC_COLLATE/LC_COLLATE directly without any kind of
munging.  On a Mac it's probably a little fuzzy because the filenames
are case insensitive...




Re: Segmentation Fault PG 14

2022-11-07 Thread Thomas Munro
On Tue, Nov 8, 2022 at 11:45 AM Willian Colognesi
 wrote:
> root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
> ... --with-extra-version= (Ubuntu 14.5-2.pgdg20.04+2)' ...
> ... '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' ...

> There is no llvm installed on ubuntu server, postgresql was installed via apt 
> package `apt install postgresql-14`

We can see from the pg_config output that it's built with LLVM 10.
Also that looks like it's the usual pgdg packages which are certainly
built against LLVM and will install it automatically.




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-14 Thread Thomas Munro
On Tue, Nov 15, 2022 at 10:54 AM Christoph Moench-Tegeder
 wrote:
> ## klaus.mailingli...@pernau.at (klaus.mailingli...@pernau.at):
> > On several servers we see the error message: PANIC:  could not flush
> > dirty data: Cannot allocate memory

> Of these three places, there's an sync_file_range(), an posix_fadvise()
> and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate
> memory" would be ENOMEM, which posix_fadvise() does not return (as per
> it's docs). So this would be sync_file_range(), which could run out
> of memory (as per the manual) or msync() where ENOMEM actually means
> "The indicated memory (or part of it) was not mapped". Both cases are
> somewhat WTF for this setup.

It must be sync_file_range().  The others are fallbacks that wouldn't
apply on a modern Linux.

It has been argued before that we might have been over-zealous
applying the PANIC promotion logic to sync_file_range().  It's used to
start asynchronous writeback to make the later fsync() call fast, so
it's "only a hint", but I have no idea if it could report a writeback
error from the kernel that would then be consumed and not reported to
the later fsync(), so I defaulted to assuming that it could.




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-15 Thread Thomas Munro
On Wed, Nov 16, 2022 at 1:24 AM  wrote:
> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV.
> Kernel is 5.15.0-52-generic.
>
> We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> have noticed it).
>
> I guess upgrading to postgresql 13/14/15 does not help as the problem
> happens in the kernel.
>
> Do you have any advice how to go further? Shall I lookout for certain
> kernel changes? In the kernel itself or in ext4 changelog?

It'd be good to figure out what is up with Linux or tuning.  I'll go
write a patch to reduce that error level for non-EIO errors, to
discuss for the next point release.  In the meantime, you could
experiment with setting checkpoint_flush_after to 0, so the
checkpointer/bgwriter/other backends don't call sync_file_range() all
day long.  That would have performance consequences for checkpoints
which might be unacceptable though.  The checkpointer will fsync
relations one after another, with less I/O concurrency.   Linux is
generally quite lazy at writing back dirty data, and doesn't know
about our checkpointer's plans to fsync files on a certain schedule,
which is why we ask it to get started on multiple files concurrently
using sync_file_range().

https://www.postgresql.org/docs/15/runtime-config-wal.html#RUNTIME-CONFIG-WAL-CHECKPOINTS
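
For example, a sketch of turning the flush-after settings off without a
restart (checkpoint_flush_after is the main one discussed above; the
other two cover the bgwriter and ordinary backends):

ALTER SYSTEM SET checkpoint_flush_after = 0;
ALTER SYSTEM SET bgwriter_flush_after = 0;
ALTER SYSTEM SET backend_flush_after = 0;
SELECT pg_reload_conf();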




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-22 Thread Thomas Munro
On Tue, Nov 22, 2022 at 4:25 AM Kirk Wolak  wrote:
>   In researching this problem, it appears that the decision was made like 
> 17yrs ago, when windows did not have a realistic "terminal" type interface.  
> Assuming we target Windows 8.1 or higher, I believe this goes away.

FWIW PostgreSQL 16 will require Windows 10+.  Not a Windows user
myself, but I read somewhere that Windows 8.1 has already started
showing full screen warnings at startup that security patches etc end
in January, while PostgreSQL 16 (the soonest release that could
include your changes) is expected late in the year.

>   What we are looking for is a simple bullet point list of what would make 
> such a contribution acceptable.  And how far should we get through that list 
> on our own, before getting some help, especially from the build teams?  [Our 
> goal would be an NEW Config type flag: READLINE_FOR_WINDOWS (you guys name 
> it, and control the default setting)]

Some thoughts:

Re configuration flags: don't waste time with the old perl-based build
system.  The configuration should be done only with the new meson
build system (soon to be the only way to build on Windows).

I didn't quite understand if you were saying that readline itself
needs patches for this (I gather from earlier threads about this that
there were some problems with dll symbol export stuff, so maybe that's
it?).  In passing, I noticed that there is also a Windows port of
editline AKA libedit, the BSD-licensed alternative to readline.  It
has a compatible API and PostgreSQL can use that too.  I have no idea
which is easier, more supported, etc on Windows.

It's OK to post a work-in-progress patch to pgsql-hackers, even if it
doesn't work right yet.  With any luck, people will show up to help
with problems.  I am 100% sure that our Windows user community would
love this feature.  It would be good if the tests in
src/bin/psql/t/010_tab_completion.pl pass on Windows, but if that's
hard, don't let that stop you sharing a patch.




Re: 13.x, stream replication and locale(?) issues

2023-02-28 Thread Thomas Munro
On Wed, Mar 1, 2023 at 12:09 AM Eugene M. Zheganin  wrote:
> 3) how do I fix it ? Should I take locale sources for ru_RU.utf8 on Linux and 
> compile it on FreeBSD  - will it help ?

Out of curiosity (I'm not saying it's a good idea!), do you know if
FreeBSD's localedef can compile glibc's collation definitions?  In
theory they are in a format standardised by POSIX...  I suspect there
may be extensions and quirks...

At a wild guess, since the data you showed doesn't even look like it
contains non-ASCII characters (it looks like machine readable
identifiers or something, and perhaps it's the sort order of '-' that
is causing you trouble), it might also be possible to use the
"ucs_basic" locale for that column and then all computers will agree
on the sort order, but of course that doesn't address the more general
problem; presumably you might also have Russian language text in your
system too.
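
For example, a sketch with made-up names t/code (note that the ALTER
rewrites the table and affects any indexes on that column):

ALTER TABLE t ALTER COLUMN code TYPE text COLLATE "ucs_basic";

-- or apply it only where the ordering matters:
SELECT code FROM t ORDER BY code COLLATE "ucs_basic";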

As for ".utf8" vs ".UTF-8", which one is selected by initdb as the
database default seems to be something that varies between Linux
distributions, so I guess maybe the installers use different
techniques for discovering and selecting default locale names.  Unlike
glibc, FreeBSD doesn't do any name mangling at all when mapping
LC_COLLATE to a pathname to find the file, whereas glibc downcases and
removes '-' so you can find both formats of name in the various
places...




Re: Interval in hours but not in days Leap second not taken into account

2023-02-28 Thread Thomas Munro
On Mon, Feb 27, 2023 at 8:26 PM PALAYRET Jacques
 wrote:
> # PostgreSQL does not take into account the additional second (leap second) 
> in some calendar days ; eg. 2016, 31 dec. :
> SELECT to_timestamp('20170102 10:11:12','yyyymmdd hh24:mi:ss') -
> to_timestamp('20161230 00:00:00','yyyymmdd hh24:mi:ss') intervalle ;
>intervalle
> -
>  3 days 10:11:12

Bonjour Jacques,

Just for fun:

postgres=# SELECT utc_to_tai(to_timestamp('20170102 10:11:12','yyyymmdd hh24:mi:ss')) -
                  utc_to_tai(to_timestamp('20161230 00:00:00','yyyymmdd hh24:mi:ss')) intervalle;
   intervalle
-
 3 days 10:11:13
(1 row)

PostgreSQL could, in theory, provide built-in UTC/TAI conversion
functions using a leap second table that would be updated in each
minor release, considering that the leap second table is included in
the tzdata package that PostgreSQL vendors (ie includes a copy of),
but it doesn't do anything like that or know anything about leap
seconds.  Here's a quick and dirty low technology version of the
above:

CREATE TABLE leap_seconds (time timestamptz primary key, off int);

-- refresh leap second table from ietf.org using not-very-secure hairy shell code
BEGIN;
CREATE TEMP TABLE import_leap_seconds (s int8, off int);
COPY import_leap_seconds FROM PROGRAM
  'curl -s https://www.ietf.org/timezones/data/leap-seconds.list | grep -v ''^#'' | cut -f1,2';
TRUNCATE TABLE leap_seconds;
INSERT INTO leap_seconds (time, off)
SELECT '1900-01-01 00:00:00Z'::timestamptz + interval '1 second' * s, off
  FROM import_leap_seconds;
DROP TABLE import_leap_seconds;
COMMIT;

CREATE OR REPLACE FUNCTION leap_seconds_before_utc_time(t timestamptz)
RETURNS int STRICT LANGUAGE SQL AS
$$
  SELECT off FROM leap_seconds WHERE time <= t ORDER BY time DESC
FETCH FIRST ROW ONLY
$$;

CREATE OR REPLACE FUNCTION utc_to_tai(t timestamptz)
RETURNS timestamptz STRICT LANGUAGE SQL AS
$$
  SELECT t + interval '1 second' * coalesce(leap_seconds_before_utc_time(t), 0);
$$;

CREATE OR REPLACE FUNCTION tai_to_utc(t timestamptz)
RETURNS timestamptz STRICT LANGUAGE SQL AS
$$
  SELECT t - interval '1 second' * coalesce(leap_seconds_before_utc_time(t), 0);
$$;




Re: 13.x, stream replication and locale(?) issues

2023-03-03 Thread Thomas Munro
On Wed, Mar 1, 2023 at 10:30 AM Thomas Munro  wrote:
> On Wed, Mar 1, 2023 at 12:09 AM Eugene M. Zheganin  wrote:
> > 3) how do I fix it ? Should I take locale sources for ru_RU.utf8 on Linux 
> > and compile it on FreeBSD  - will it help ?
>
> Out of curiosity (I'm not saying it's a good idea!), do you know if
> FreeBSD's localedef can compile glibc's collation definitions?  In
> theory they are in a format standardised by POSIX...  I suspect there
> may be extensions and quirks...

Another thought: if you upgrade to 15, you could use ICU as the
default collation provider, and then make sure you have the same major
version of ICU on your Linux and FreeBSD systems (which probably means
compiling from source on FreeBSD, as FreeBSD tends to have a newish
ICU in packages, while popular stable Linux distributions tend to have
a fairly old one).  As far as I know, Linux and FreeBSD systems on
amd64 arch should match in every other important detail already (they
both follow the System V ABI, so there shouldn't be any layout
differences in eg structs that are sent over the wire AFAIK).
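
For example, on 15 you could initdb with --locale-provider=icu
--icu-locale=ru-RU, or create just the one database that way (a sketch,
with a guessed ICU locale name):

CREATE DATABASE mydb
  TEMPLATE template0
  LOCALE_PROVIDER icu
  ICU_LOCALE 'ru-RU'
  LOCALE 'ru_RU.UTF-8';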

For what it's worth, for the future we are trying to figure out how to
support multi-version ICU so that you could explicitly set the
provider to ICU v72 to get that sort of thing working across OS
versions and even "close enough" cousins like your case, without
having to compile anything from source, but unfortunately we didn't
manage to get it ready in time for 16.




Re: src/test/examples/testlibpq2.c where the HAVE_SYS_SELECT_H is defined.

2023-03-17 Thread Thomas Munro
On Fri, Mar 17, 2023 at 7:48 PM jian he  wrote:
> Hi,
> playing around with $[0] testlibpq2.c example. I wondered where 
> HAVE_SYS_SELECT_H is defined?
>
> I searched on the internet, founded that people also asked the same question 
> in $[1].
>
> In my machine, I do have <sys/select.h>.
> system version: Ubuntu 22.04.1 LTS
> gcc version: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
> gcc compile command: gcc pg_testlibpq2.c -I/home/jian/postgres/pg16/include \
> -L/home/jian/postgres/pg16/lib -lpq
>
> [0]https://www.postgresql.org/docs/current/libpq-example.html
> [1]https://stackoverflow.com/questions/37876850/in-compilation-time-how-to-find-the-macro-is-defined-in-which-header-file

In 15 and earlier, it is defined in pg_config.h, which is created by
configure.  But in 16, that particular macro was removed by commit
7e50b4e3c.  It looks like you are using PostgreSQL 16 sources, but
looking at PostgreSQL 15 examples?




Re: Postgresql issue: FATAL: dsa_allocate could not find 7 free pages

2018-10-25 Thread Thomas Munro
On Fri, Oct 26, 2018 at 2:21 AM Alexandre Assouad
 wrote:
> FATAL:  dsa_allocate could not find 7 free pages

> Some users have faced this issue which seems to be a bug in postgresql query 
> planner which should be solved :
> https://www.postgresql.org/message-id/CAEepm%3D1k7sYJbxoOSJcS-4ti2MHOnBXBfLf%3D-gtuFLTXPqvTDg%40mail.gmail.com

Hello Alexandre,

Thanks for the report.  Yes, that bug does seem like a plausible
explanation for that error.  The fix was in commit 8ffc3be1, which
will be included in 10.6 (target release date:  November 8th).  It's
also in 11.0, out now.

If you are able to reproduce the problem easily on a copy of your
database, and you have the time/inclination to investigate, is there
any chance you could test the query on a local build of REL_10_STABLE
(the branch that will produce 10.6 soon), instructions below, or the
released v11.0 (if Timescale is available for that, it doesn't look
like it)?  If not, no worries.

> But I’m still facing this issue.
> I’m using postgresql 10.5 on ubuntu 18.04
> With timescaledb extension (which could be involved in this bug but I 
> couldn’t find any related issue on their side)

It's interesting that reports came from users of Citus and Timescale.
There doesn't seem to be any reason to think it's caused by anything
these extensions are doing, other than just having a lot of data, big
queries and the right access pattern to hit that bug.

=== How to set up a throw-away REL_10_STABLE cluster:

On an Ubuntu developer machine, check out, build, install into
temporary directory:
sudo apt-get install git make gcc flex bison libz-dev libreadline-dev
git clone https://github.com/postgres/postgres.git
cd postgres
git checkout REL_10_STABLE
./configure --prefix=$HOME/tmp_install
make -s -j8 && make -s install

Initialise and start a database cluster:
~/tmp_install/bin/initdb -D ~/tmp_pgdata
~/tmp_install/bin/postgres -D ~/tmp_pgdata
... now postgres is running in the foreground, until you hit ^C
... do whatever you need to do to install Timescale extension, schema,
data, reproduce problem

To check that you can reproduce the problem in 10.5 with a server
built that way, stop that and:
git checkout REL_10_5
make -s clean && make -s -j8 && make -s install
~/tmp_install/bin/postgres -D ~/tmp_pgdata

To install Timescale, it's probably best to follow the instructions from
https://github.com/timescale/timescaledb, using ./bootstrap
-DPG_CONFIG=~/tmp_install/bin/pg_config, but I haven't tried that
myself.

(You don't have to run initdb again or reload data when switching
between tags/branches in the 10.x series).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Postgresql issue: FATAL: dsa_allocate could not find 7 free pages

2018-10-26 Thread Thomas Munro
On Fri, Oct 26, 2018 at 9:17 PM Alexandre Assouad
 wrote:
> I’ll try next week to build a test environment and I’ll send you the results.
> Does it make any difference to set up a VM vs a dedicated machine ?

Thanks.  Probably not, the important thing is probably the same data,
settings (work_mem etc) and query.  Maybe start with a build of
REL_10_5 and try to reproduce the problem, and if you can get that to
happen reliably, then switch to a build of REL_10_STABLE to confirm
that the problem goes away?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Different memory allocation strategy in Postgres 11?

2018-10-26 Thread Thomas Munro
On Sat, Oct 27, 2018 at 6:10 AM Thomas Kellerer  wrote:
> Jeff Janes schrieb am 26.10.2018 um 17:42:
> > I typically configure "shared_buffers = 4096MB" on my 16GB system as 
> > sometimes when testing, it pays off to have a bigger cache.
> >
> > With Postgres 10 and earlier, the Postgres process(es) would only 
> > allocate that memory from the operating system when needed.
> > So right after startup, it would only consume several hundred MB, not 
> > the entire 4GB
> >
> > However with Postgres 11 I noticed that it immediately grabs the 
> > complete memory configured for shared_buffers during startup.
> >
> > It's not really a big deal, but I wonder if that is an intentional 
> > change or a result from something else?
> >
> >
> > Do you have pg_prewarm in shared_preload_libraries?
>
> No. The only shared libraries are those for pg_stat_statemens

Does your user have "Lock Pages in Memory" privilege?  One thing that
is new in 11 is huge AKA large page support, and the default is
huge_pages=try.  Not a Windows person myself but I believe that should
succeed if you have that privilege and enough contiguous chunks of
physical memory are available.  If you set huge_pages=off does it
revert to the old behaviour?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Is there a way to speed up WAL replay?

2018-10-31 Thread Thomas Munro
On Thu, Nov 1, 2018 at 4:25 AM Jeff Janes  wrote:
> On Wed, Oct 31, 2018 at 1:38 AM Torsten Förtsch  
> wrote:
>> I am working on restoring a database from a base backup + WAL. With the 
>> default settings the database replays about 3-4 WAL files per second. The 
>> startup process takes about 65% of a CPU and writes data with something 
>> between 50 and 100 MB/sec.
>>
>> Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.
>
>
> WAL replay is single-threaded, so the most you would be able to speed it up 
> is 50%, to where it would be taking 100% CPU.
>
> Is the time spent not on the CPU being spent waiting for WAL files to arrive 
> from the restore_command, or waiting for the blocks it needs to replay into 
> to get read from disk, or waiting for dirty pages to get written to disk so 
> they can be evicted to make way for new ones?
>
> One way I found to speed up restore_command is to have another program run a 
> few WAL files ahead of it, copying the WAL from the real archive into a 
> scratch space which is on the same filesystem as pg_xlog/pg_wal.  Then have 
> restore_command simply move (not copy) the requested files from the scratch 
> space onto %p.  The intra-filesystem move completes much faster than a copy.
>
> If it spends time waiting for blocks that need to be recovered into to get 
> read from disk, and you have enough RAM, you could speed it up by pre-warming 
> the file system cache.  Something like:
>
> tar -cf - $PGDATA | wc -c

For more targeted prewarming of large systems that don't fit in RAM
and to get all the way into PostgreSQL's buffer pool, I suppose you
could write a small Python/whatever script that extracts the
relfilenode + block references from the output of pg_waldump (one file
ahead, or whatever), sorts and uniques them, merges them into block
ranges, converts the relfilenode reference to relation OID, and then
calls pg_prewarm() for each range.
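
For what it's worth, here is a rough, untested Python sketch of that
idea; it prints SELECT pg_prewarm(...) statements that you would feed
to psql in the relevant database.  The exact pg_waldump block-reference
format, and whether the tablespace OID seen in the WAL can be passed
straight to pg_filenode_relation() like this, are assumptions worth
checking against your version:

#!/usr/bin/env python3
# Untested sketch: scan pg_waldump output for block references, merge them
# into contiguous ranges per relation, and print pg_prewarm() calls.  The
# "rel ts/db/rel [fork f] blk n" format and the argument lists of
# pg_filenode_relation()/pg_prewarm() are assumptions to verify.
import re
import subprocess
import sys
from collections import defaultdict

BLKREF = re.compile(r'rel (\d+)/(\d+)/(\d+)(?: fork (\w+))? blk (\d+)')

def block_refs(wal_file):
    """Yield ((tablespace, relfilenode), block) for main-fork references."""
    out = subprocess.run(['pg_waldump', wal_file],
                         capture_output=True, text=True).stdout
    for ts, db, rel, fork, blk in BLKREF.findall(out):
        if fork in ('', 'main'):
            yield (ts, rel), int(blk)

def merge_ranges(blocks):
    """Collapse a sorted list of block numbers into (first, last) ranges."""
    first = last = None
    for b in blocks:
        if first is None:
            first = last = b
        elif b == last + 1:
            last = b
        else:
            yield first, last
            first = last = b
    if first is not None:
        yield first, last

def main(wal_files):
    per_rel = defaultdict(set)
    for path in wal_files:
        for key, blk in block_refs(path):
            per_rel[key].add(blk)
    for (ts, rel), blocks in sorted(per_rel.items()):
        for first, last in merge_ranges(sorted(blocks)):
            # pg_filenode_relation() maps tablespace + filenode back to a
            # relation; run the output with psql in the right database.
            print("SELECT pg_prewarm(pg_filenode_relation({}, {}), "
                  "'buffer', 'main', {}, {});".format(ts, rel, first, last))

if __name__ == '__main__':
    main(sys.argv[1:])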

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Impact on PostgreSQL due to Redhat acquisition by IBM

2018-11-15 Thread Thomas Munro
On Fri, Nov 16, 2018 at 5:07 AM Adrian Klaver  wrote:
> On 11/15/18 7:59 AM, Sachin Kotwal wrote:
> > I feel community has most of linux based instance in thier buildfarm for
> > testing, might be very few Ubuntu based.
> > I might be wrong here.
>
> Easy enough to see:
>
> https://buildfarm.postgresql.org/cgi-bin/show_members.pl

Out of curiosity, here are the current counts for HEAD:

Linux distros:
  Amazon Linux: 1
  Arch Linux: 1
  CentOS: 9
  Debian: 34
  Fedora: 5
  Photon: 1
  Raspbian: 2
  RHEL: 8
  SUSE: 7
  Ubuntu: 7

BSD diaspora:
  DragonflyBSD: 1
  FreeBSD: 6
  NetBSD: 2
  OpenBSD: 1

OpenSolaris diaspora:
  OmniOS: 1
  SmartOS: 1

Other Unixen:
  AIX: 4
  HP-UX: 3
  macOS: 4

Windows:
  Windows: 6
  Cygwin: 1

I wouldn't be too worried about any of these, especially the open
ones.  Closed Solaris, though, is apparently dead to us.  Nobody cares
enough to build HEAD on it anymore, and for example 3a769d82 (reflink
support for pg_upgrade) went in without consideration of Solaris 11.4
reflink().  I've personally moved to the 'acceptance' phase of grief;
all three Unixes that I cut my teeth on in the 90s are now either
formally dead and buried or, in this case, a zombie.

From personal observations, I know that we have developers and
committers doing their primary development work on at least Debian,
Fedora, FreeBSD, macOS, Ubuntu and Windows.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Transition Tables doesn´t have OID

2018-12-01 Thread Thomas Munro
On Sun, Dec 2, 2018 at 7:38 AM Adrian Klaver  wrote:
>
> On 12/1/18 8:51 AM, PegoraroF10 wrote:
> > ok
> > then, help me to find alternatives to it.
> > As I sad, sometimes whe change our PK, so using OID would be a smart way to
> > have a join between old and new transition tables and we would like to use
> > transition tables because each statement is a lot faster than each row for
> > auditing purposes.
> >
> > So, whats the alternative ? One trigger for each row just for changing PK
> > values (will occur few times) and another trigger for each statement to do
> > the logging ?
>
> If you don't want to change the structure of your tables that would be
> seem to be the way to go. It will require some thought to make sure the
> 'for each row' and 'for each statement' don't interfere with each other.

I also contemplated this question when hacking on transition tables.
We know that the new and old tuplestores are ordered in the same way
(as an implementation detail), but there is no way to take advantage
of that knowledge in SQL, where relations are unordered sets.  There
is a syntax WITH ORDINALITY that SQL uses to deal with the fact that
set-returning functions' results might really be ordered, and you
might not want to lose that information.  Suppose we allowed WITH
ORDINALITY for transition tables, so that the 'capture' order of rows
could be exposed, and we promised that old and new ORDINALITY numbers
will line up, and then we made the ORDINALITY column a pathkey of the
scan.  Now you could join old and new tables by the ORDINALITY column,
and get a merge join without any sorting.  That's... pretty weird
though, and the syntax would be outside the SQL spec, and the
semantics might be questionable.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: What is the tuplestore?

2018-12-10 Thread Thomas Munro
On Tue, Dec 11, 2018 at 2:56 AM Ron  wrote:
> 2018-12-07 06:21:55.504 EST 10.140.181.89(35868) CDS CDSLBXW 13748 SELECT
> PostgreSQL JDBC Driver 53100 ERROR:  could not write to tuplestore temporary
> file: No space left on device
>
> I see this in the pg_log file, but #1 can't figure out what "tuplestore" is
> (Google doesn't help), and #2 there's lots of space on all my file systems.
> data/base, where pgsql_tmp lives, has 96GB free.)

Maybe the setting log_temp_files could help you identify the query
that is doing it?

https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-TEMP-FILES

-- 
Thomas Munro
http://www.enterprisedb.com



Re: What is the tuplestore?

2018-12-10 Thread Thomas Munro
On Tue, Dec 11, 2018 at 3:58 PM Ron  wrote:
> On 12/10/2018 06:24 PM, Thomas Munro wrote:
> > On Tue, Dec 11, 2018 at 2:56 AM Ron  wrote:
> >> 2018-12-07 06:21:55.504 EST 10.140.181.89(35868) CDS CDSLBXW 13748 SELECT
> >> PostgreSQL JDBC Driver 53100 ERROR:  could not write to tuplestore 
> >> temporary
> >> file: No space left on device
> >>
> >> I see this in the pg_log file, but #1 can't figure out what "tuplestore" is
> >> (Google doesn't help), and #2 there's lots of space on all my file systems.
> >> data/base, where pgsql_tmp lives, has 96GB free.)
> > Maybe the setting log_temp_files could help you identify the query
> > that is doing it?
> >
> > https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-TEMP-FILES
>
> Great idea; I've made the change.  What do I grep for in the pg_log file?

It should say something like:

LOG: temporary file: path "base/pgsql_tmp/...", size ...
STATEMENT: ...
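
If it helps, here is a rough, untested Python sketch that pulls those
lines out of a stderr-format log and lists the largest temporary files
along with the statement logged after each one.  It assumes the message
wording above; a log_line_prefix in front of it shouldn't matter,
because the patterns are searched rather than anchored:

#!/usr/bin/env python3
# Rough sketch: report the biggest temporary files recorded by
# log_temp_files, together with the STATEMENT line that follows each one.
import re
import sys

TEMP = re.compile(r'LOG:\s+temporary file: path "([^"]+)", size (\d+)')
STMT = re.compile(r'STATEMENT:\s+(.*)')

def scan(lines):
    pending = None
    for line in lines:
        m = TEMP.search(line)
        if m:
            pending = (int(m.group(2)), m.group(1))
            continue
        m = STMT.search(line)
        if m and pending:
            yield pending + (m.group(1).strip(),)
            pending = None

if __name__ == '__main__':
    with open(sys.argv[1]) as log:
        hits = sorted(scan(log), reverse=True)
    for size, path, statement in hits[:20]:
        print('{:>15}  {}  {}'.format(size, path, statement[:80]))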

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Does PostgreSQL use atomic file creation of FS?

2018-12-12 Thread Thomas Munro
On Wed, Dec 12, 2018 at 11:52 PM Dmitry Lazurkin  wrote:
> Thank you. But I have read this. I said about network file system only
> for example. I would like to known how PostgreSQL handles this specific
> case (of course if someone knowns a answer):
>
> fd = open(file, "w");
> write(fd, data);
> // crash and now I have empty file which isn't correct
> fsync(fd);
>
> PS. I think PostgreSQL doesn't have this problem.

It depends on the context, but in general PostgreSQL knows about that
sort of thing.  When the cluster shuts down, it records that it shut
down cleanly, meaning that everything that should be on disk is on
disk.  When you start the cluster up, if it sees that it didn't shut
down cleanly, it enters recovery.  During recovery it tolerates files
being too short while it's replaying the WAL to get back to a
consistent state.  See the comment in mdread() for example:

https://github.com/postgres/postgres/blob/master/src/backend/storage/smgr/md.c#L755

It's called "write-ahead log" because we log our intention before we
write to data files (and make sure it's on disk first), so we'll
always replay the same effects again if we're interrupted.  The WAL is
a magic source of reliability (we can do it again if things go wrong)
and also performance (IO becomes serial, optimised for the storage
hardware).

https://www.postgresql.org/docs/current/wal-intro.html
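
For illustration only, here is a toy Python sketch of that ordering
rule.  It has nothing to do with PostgreSQL's real WAL or data file
formats; it just shows why making the log record durable before
touching the data file means the same change can always be replayed
after a crash:

# Toy illustration of the write-ahead principle, not PostgreSQL code.
import os

def apply_change(data_path, offset, payload):
    mode = 'r+b' if os.path.exists(data_path) else 'w+b'
    with open(data_path, mode) as f:
        f.seek(offset)
        f.write(payload.encode())
        os.fsync(f.fileno())

def log_then_apply(wal_path, data_path, offset, payload):
    # 1. Append the intended change to the log and force it to disk first.
    with open(wal_path, 'ab') as wal:
        wal.write('{}:{}\n'.format(offset, payload).encode())
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Only now touch the data file; if we crash before or during this
    #    step, recover() below redoes exactly the same write.
    apply_change(data_path, offset, payload)

def recover(wal_path, data_path):
    # Crash recovery: replay every logged change.  Redoing work that already
    # reached the data file is harmless because each write is idempotent.
    if not os.path.exists(wal_path):
        return
    with open(wal_path, 'rb') as wal:
        for line in wal:
            offset, payload = line.decode().rstrip('\n').split(':', 1)
            apply_change(data_path, int(offset), payload)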

-- 
Thomas Munro
http://www.enterprisedb.com



Re: shared_buffers on Big RAM systems

2018-12-13 Thread Thomas Munro
On Fri, Dec 14, 2018 at 2:17 AM Ron  wrote:
> https://www.postgresql.org/docs/9.6/runtime-config-resource.html
>
> The docs say, "If you have a dedicated database server with 1GB or more of
> RAM, a reasonable starting value for shared_buffers is 25%".
>
> But that's pretty archaic in 2018.  What if the dedicated database server
> has 128GB RAM?

I agree, we might as well drop the words "with 1GB or more of RAM".
That's the size of the very smallest cloud instances these days,
available for free or for a few bucks a month, and for physical
servers I wonder if you can still get DIMMs that small.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Unused files in the database directory after crashed VACUUM FULL

2019-02-10 Thread Thomas Munro
On Mon, Feb 11, 2019 at 10:21 AM Tom Lane  wrote:
> Ah, so Andrew was correct: we panicked due to lack of WAL space, and
> that explains why the vacuuming process didn't have an opportunity
> to delete the files belonging to the uncommitted new relation.

> It's a pretty well-understood dynamic, I believe.  Perhaps we should
> try harder to recover cleanly, but I don't know of anyone putting
> effort into the case.

FTR I am working on a PG13 patch that records relfilenodes of
uncommitted transactions in undo logs, so it can unlink them reliably,
even if you crash (at the cost of introducing a WAL flush before
creating files).  I haven't specifically studied the VACUUM FULL case
yet, but in principle this is exactly what my project aims to fix.
It's mostly intended as example code to demonstrate the undo log
machinery (defining undo record types, registering undo log action
functions that are invoked during rollback, rollback of aborted but
not yet rolled back transaction at startup, ...) without having to
understand the whole zheap sandwich at once, but it's also a solution
to an age old problem.  More on that soon.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-14 Thread Thomas Munro
On Fri, Feb 15, 2019 at 2:56 PM Bruce Klein  wrote:
> >  In 11.1 did you see the message "WARNING: could not flush dirty data: 
> > Function not implemented"
> Yes

I wonder if this is coming from sync_file_range(), which is not
implemented on WSL according to random intergoogling, but probably
appears as present to our configure script.  I find it harder to
believe they didn't implement fsync().


--
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-14 Thread Thomas Munro
On Fri, Feb 15, 2019 at 3:56 PM Thomas Munro
 wrote:
> On Fri, Feb 15, 2019 at 2:56 PM Bruce Klein  wrote:
> > >  In 11.1 did you see the message "WARNING: could not flush dirty data: 
> > > Function not implemented"
> > Yes
>
> I wonder if this is coming from sync_file_range(), which is not
> implemented on WSL according to random intergoogling, but probably
> appears as present to our configure script.  I find it harder to
> believe they didn't implement fsync().

Here is a place where people go to complain about that:

https://github.com/Microsoft/WSL/issues/645

I suppose we could tolerate ENOSYS.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-14 Thread Thomas Munro
On Fri, Feb 15, 2019 at 5:29 PM Tom Lane  wrote:
> Thomas Munro  writes:
> >> On Fri, Feb 15, 2019 at 2:56 PM Bruce Klein  wrote:
> >>> In 11.1 did you see the message "WARNING: could not flush dirty data: 
> >>> Function not implemented"
> >> Yes
>
> > Here is a place where people go to complain about that:
> > https://github.com/Microsoft/WSL/issues/645
> > I suppose we could tolerate ENOSYS.
>
> What I'm not grasping here is why you considered that sync_file_range
> failure should be treated as a reason to PANIC in the first place?
> Surely it is not fsync(), nor some facsimile thereof.  In fact, if
> any of the branches in pg_flush_data really need the data_sync_elevel
> treatment, somebody's mental model of that operation needs adjustment.
> Maybe it's mine.

My thinking was that sync_file_range() might in its current, future or
alternative (WSL, ...) implementation eat an error that would
otherwise reach fsync(), due to the single-flag error state treatment
we see in several OSes (older Linux, also recent Linux via the 'seen'
flag that we rely on to receive errors that happened before we opened
the fd).  Should we be inspecting the Linux source or asking
assurances from Linux hackers that that can't happen?   Perhaps it
behaves more like fdatasync() with the SYNC_FILE_RANGE_WAIT_* flags (=
can clear seen flag), but more like fadvise() without (can't touch
it)?  I don't know, and I didn't want to base my choice on what it
looks like it currently does in the Linux tree.  Without guarantees
from standards (not relevant here) or man pages (which note only that
EIO is possible), I made what I thought was an appropriately
pessimistic choice.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-15 Thread Thomas Munro
On Sat, Feb 16, 2019 at 6:50 AM Andres Freund  wrote:
> On February 15, 2019 9:44:50 AM PST, Tom Lane  wrote:
> >Andres Freund  writes:
> >> On February 15, 2019 9:13:10 AM PST, Tom Lane 
> >wrote:
> >>> I'm of the opinion that we shouldn't be panicking for
> >sync_file_range
> >>> failure, period.
> >
> >> With some flags it's strictly required, it does "eat" errors depending
> >on the flags. So I'm not sure I understand?
> >
> >Really?  The specification says that it starts I/O, not that it waits
> >around for any to finish.
>
> That depends on the flags you pass in. By memory I don't think it eats an 
> error with our flags in recent kernels, but I'm not sure.

Right, there was some discussion of that, and I didn't (and still
don't) think it'd be wise to rely on undocumented knowledge about
which flags can eat errors based on a drive-by reading of a particular
snapshot of the Linux tree.  The man page says it can return EIO; I
think we should assume that it might actually do that.

BTW I had a report from someone on IRC that PostgreSQL breaks in other
ways (not yet understood) if you build it directly on WSL/Ubuntu.  I
guess the OP is reporting about a .deb that was built on a real Linux
system.  I'm vaguely familiar with these types of problems from other
platforms (Linux syscall emulation on FreeBSD and Sun-ish systems, and
also I'm old enough to remember people doing SCO SysV syscall
emulation on Linux systems back before certain valuable software was
available natively); it's possible that you get ENOSYS on other
emulators too, considering that other kernels don't seem to have a
sync_file_range()-like facility, but probably no one cares, since
there is no reason to run PostgreSQL on a syscall emulator when you
can run it natively.  This is a bit different though: I guess people
want to be able to develop Linux-stack stuff on company-issued Windows
computers for later deployment on Linux servers; someone interested in
this would ideally make it work and set up a build farm animal to tell
us when we break it.  It would probably require only minimal changes,
but considering that no one bothered to complain about PostgreSQL
spewing scary looking warnings on WSL for years, it's not too
surprising that we didn't consider this case before.  A bit like the
nightjar case, the PANIC patch revealed a pre-existing problem that
had gone unreported and needs some work, but it doesn't seem like a
very good reason to roll back that part of the change completely IMHO.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-17 Thread Thomas Munro
On Sun, Feb 17, 2019 at 4:56 AM Tom Lane  wrote:
> Thomas Munro  writes:
> >>> Really?  The specification says that it starts I/O, not that it waits
> >>> around for any to finish.
>
> > Right, there was some discussion of that, and I didn't (and still
> > don't) think it'd be wise to rely on undocumented knowledge about
> > which flags can eat errors based on a drive-by reading of a particular
> > snapshot of the Linux tree.  The man page says it can return EIO; I
> > think we should assume that it might actually do that.
>
> I had a thought about this: maybe we should restrict the scope of this
> behavior to be "panic on EIO", not "panic on anything within hailing
> distance of fsync".
>
> The direction you and Andres seem to want to go in is to add a pile of
> unprincipled exception cases, which seems like a recipe for constant
> pain to me.  I think we might be better off with a whitelist of errnos
> that mean trouble, instead of a blacklist of some that don't.  I'm
> especially troubled by the idea that blacklisting some errnos might
> reduce to ignoring them completely, which would be a step backwards
> from our pre-PANIC behavior.

Hmm.  Well, at least ENOSPC should be treated the same way as EIO.
Here's an experiment that seems to confirm some speculations about NFS
on Linux from the earlier threads:

$ uname -a
Linux debian 4.18.0-3-amd64 #1 SMP Debian 4.18.20-2 (2018-11-23)
x86_64 GNU/Linux
$ dpkg -l nfs-kernel-server | tail -1
ii  nfs-kernel-server 1:1.3.4-2.4  amd64support for NFS kernel server

First, set up a 10MB loop-back filesystem:

$ dd if=/dev/zero of=/tmp/10mb.loopback bs=1024 count=10240
$ sudo losetup /dev/loop0 /tmp/10mb.loopback
$ sudo mkfs -t ext3 -m 1 -v /dev/loop0
...
$ sudo mkdir /mnt/test_loopback
$ sudo mount -t ext3 /dev/loop0 /mnt/test_loopback

Then, export that via NFS:

$ tail -1 /etc/exports
/mnt/test_loopback localhost(rw,sync,no_subtree_check)
$ sudo exportfs -av
exporting localhost:/mnt/test_loopback

Next, mount that over NFS:

$ sudo mkdir /mnt/test_loopback_remote
$ sudo mount localhost:/mnt/test_loopback /mnt/test_loopback_remote

Now, fill up the whole disk with a file full of newlines:

$ sudo mkdir /mnt/test_loopback/dir
$ sudo chown $USER:$USER /mnt/test_loopback/dir
$ tr "\000" "\n" < /dev/zero > /mnt/test_loopback_remote/dir/file
tr: write error: No space left on device
tr: write error
$ df -h /mnt/test_loopback*
FilesystemSize  Used Avail Use% Mounted on
/dev/loop08.5M  8.4M 0 100% /mnt/test_loopback
localhost:/mnt/test_loopback  8.5M  8.4M 0 100% /mnt/test_loopback_remote

Now, run a program that appends a greeting and then calls fsync() twice:

$ cat test.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
int fd, rc;

fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
if (fd < 0)
{
perror("open");
return 1;
}
rc = write(fd, "hello world\n", 12);
if (rc < 0)
perror("write");
else if (rc < 12)
fprintf(stderr, "only managed to write %d bytes\n", rc);
rc = fsync(fd);
if (rc < 0)
perror("fsync 1");
rc = fsync(fd);
if (rc < 0)
perror("fsync 2");
rc = close(fd);
if (rc < 0)
perror("close");

return 0;
}
$ cc test.c
$ ./a.out
fsync 1: No space left on device
$

The write() and the second fsync() reported success.  Great, let's go
and look at our precious data, both through NFS and locally:

$ tail -3 /mnt/test_loopback_remote/dir/file



$ tail -3 /mnt/test_loopback/dir/file



$

It's gone.  If you try it again with a file containing just a few
newlines so there is free space, it works correctly and you see the
appended greeting.  Perhaps the same sort of thing might happen with
remote EDQUOT, but I haven't tried that.  Perhaps there are some
things that could be tuned that would avoid that?

(Some speculation about NFS:  To avoid data-loss from running out of
disk space, I think PostgreSQL requires either a filesystem that
reserves space when we're extending a file, so that we can exclude the
possibility of ENOSPC before we evict data from our own shared
buffers, or a page cache that doesn't drop dirty flags or whole
buffers on failure so we can meaningfully retry once space becomes
available.  As far as I know, the former would be theoretically
possible with NFS, if the client and server are using NFSv4.2+ with
ALLOCATE support and glibc and kernel versions both support true
fallocate() and pass it all the way through, but current versions
either don't

Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-17 Thread Thomas Munro
On Mon, Feb 18, 2019 at 2:19 PM Michael Paquier  wrote:
> On Sun, Feb 17, 2019 at 10:54:54AM -0800, Andres Freund wrote:
> > On 2019-02-17 23:29:09 +1300, Thomas Munro wrote:
> >> Hmm.  Well, at least ENOSPC should be treated the same way as EIO.
> >> Here's an experiment that seems to confirm some speculations about NFS
> >> on Linux from the earlier threads:
> >
> > I wish we'd a good way to have test scripts in the tree for something
> > like that, but using postgres. Unfortunately it's not easy to write
> > portable setup scripts for it.
>
> Yes, it seems to me as well that ENOSPC should be treated as much as
> EIO.  Just looking at the code for data_sync_retry we should really
> have some errno filtering.

I agree with you up to a point:  It would make some amount of sense
for data_sync_elevel() not to promote to PANIC if errno == ENOSYS;
then for sync_file_range() you'd get WARNING and for fsync() you'd get
ERROR (since that's what those call sites pass in) on hypothetical
kernels that lack those syscalls.  As I argued earlier, ENOSYS seems
to be the only errno that we know for sure to be non-destructive to
the page cache since it promises it didn't run any kernel code at all
(or rather didn't make it past the front door), so it's the *only*
errno that belongs on such a whitelist IMHO.  That would get us back
to where we were for WSL users in 11.1.

The question is whether we want to go further than that and provide a
better experience for WSL users, now that we know that it was already
spewing warnings.  One way to do that might be not to bother with
errno filtering at all, but instead (as Andres suggested) do a test of
whether sync_file_range() is implemented on this kernel/emulator at
startup and if not, just disable it somehow.  Then we don't need any
filtering.

Here is a restatement of my rationale for not including errno
filtering in the first place:  Take a look at the documented errnos
for fsync() on Linux.  Which of those tell us that it's sane to retry?
 AFAICS they all either mean that it ran filesystem code that is known
to toast your data on error (data loss has already occurred), or your
file descriptor is bogus (code somewhere is seriously busted and all
bets are off).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-18 Thread Thomas Munro
On Tue, Feb 19, 2019 at 6:01 AM Andres Freund  wrote:
> On 2019-02-18 10:33:50 -0500, Ravi Krishna wrote:
> > Are there any plans to support PG on WSL ?  Just curious.

Hi Ravi,

I definitely want to fix this particular issue for 11.3.

> I think if somebody wanted to start investing efforts to improve testing
> of that setup, and then fix the resulting issues, nobody would seriously
> object. But also most people working on PG are already busy.  Please
> feel free to create a buildfarm test machine with postgres running on
> WSL.

Right, the first step would be for a WSL user to figure out what's
wrong with builds on the WSL and show us how to fix it; I heard
through the grapevine that if you try it, initdb doesn't work (it must
be something pretty subtle in the configure phase or something like
that, since the Ubuntu .deb apparently works, except for the issue
reported in this thread).  Then, confirm that they're happy with
whatever patch we come up with.  Then if someone wants to make sure we
don't accidentally break it in future, yeah, a build farm animal would
help a lot.

Here's a starter patch that shows one of the approaches discussed.  It
gets WSL users to a better place than they were before, by suppressing
further warnings after the first one.

-- 
Thomas Munro
http://www.enterprisedb.com



0001-Tolerate-ENOSYS-failure-from-sync_file_range.patch
Description: Binary data


Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-18 Thread Thomas Munro
On Tue, Feb 19, 2019 at 5:16 PM James Sewell  wrote:
>> Here's a starter patch that shows one of the approaches discussed.  It
>> gets WSL users to a better place than they were before, by suppressing
>> further warnings after the first one.
>
> This wasn't quite right, updated to check erro for ENOSYS (not rc)
>
> This compiles and stops the panic on WSL (with a single warning).
>
> I haven't tested if a version compiled on Linux will behave the same way - 
> but based on the error messages in the top post it looks like the behavior is 
> the same.

Great.  Thanks for testing, and for the fix!  Well that all sounds
like good news: it corrects the behaviour from 11.2, and also improves
on the previous behaviour which I'd have accepted as a bug if anyone
had reported it.  So the next problem is that we don't have a
consensus on whether this is the right approach, so I don't feel like
I can commit it yet.  Does anyone want to make another concrete proposal?

-- 
Thomas Munro
https://enterprisedb.com



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-22 Thread Thomas Munro
On Tue, Feb 19, 2019 at 5:31 PM Thomas Munro  wrote:
> On Tue, Feb 19, 2019 at 5:16 PM James Sewell  
> wrote:
> >> Here's a starter patch that shows one of the approaches discussed.  It
> >> gets WSL users to a better place than they were before, by suppressing
> >> further warnings after the first one.
> >
> > This wasn't quite right, updated to check erro for ENOSYS (not rc)
> >
> > This compiles and stops the panic on WSL (with a single warning).
> >
> > I haven't tested if a version compiled on Linux will behave the same way - 
> > but based on the error messages in the top post it looks like the behavior 
> > is the same.
>
> Great.  Thanks for testing, and for the fix!  Well that all sounds
> like good news: it corrects the behaviour from 11.2, and also improves
> on the previous behaviour which I'd have accepted as a bug if anyone
> had reported it.  So the next problem is that we don't have a
> consensus on whether this is the right approach, so I don't feel like
> I can commit it yet.  Does anyone want to make another concrete proposal?

Ok, here's the version I'm planning to push soon if there are no objections.
Re-adding Bruce to the thread, as I just noticed the CC list got
pruned at some point in this thread.

-- 
Thomas Munro
https://enterprisedb.com


0001-Tolerate-ENOSYS-failure-from-sync_file_range-v3.patch
Description: Binary data


Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-24 Thread Thomas Munro
>> > Great.  Thanks for testing, and for the fix!  Well that all sounds
>> > like good news: it corrects the behaviour from 11.2, and also improves
>> > on the previous behaviour which I'd have accepted as a bug if anyone
>> > had reported it.  So the next problem is that we don't have a
>> > consensus on whether this is the right approach, so I don't feel like
>> I can commit it yet.  Does anyone want to make another concrete proposal?
>>
>> Ok, here's the version I'm planning to push soon if there are no objections.
>> Re-adding Bruce to the thread, as I just noticed the CC list got
>> pruned at some point in this thread.

Pushed.

I also noticed that the call to sync_file_range() in file_utils.c used
by fsync_pgdata() ignores the return code, and in the 9.4 and 9.5
branches, the call in pg_flush_data() ignores the return code too.
This inconsistency should be fixed; I'll think about which direction
it should be fixed in (either we are convinced that
sync_file_range(SYNC_FILE_RANGE_WRITE) is non-destructive of error
state or we aren't, and should handle it everywhere), and maybe start
a new -hackers thread.

-- 
Thomas Munro
https://enterprisedb.com



Re: LDAP authenticated session terminated by signal 11: Segmentation fault, PostgresSQL server terminates other active server processes

2019-02-25 Thread Thomas Munro
On Thu, Feb 21, 2019 at 2:42 PM Mike Yeap  wrote:
> openldap-clients.x86_64        2.4.44-21.el7_6    @updates
> openldap-devel.i686            2.4.44-21.el7_6    updates
> openldap-devel.x86_64          2.4.44-21.el7_6    updates
> openldap.i686                  2.4.44-21.el7_6    updates
> openldap-servers-sql.x86_64    2.4.44-21.el7_6    updates
> openldap-servers.x86_64        2.4.44-21.el7_6    updates
> openldap.x86_64                2.4.44-21.el7_6    @updates

> On Wed, Feb 20, 2019 at 10:17 PM Tom Lane  wrote:
>> With OpenLDAP versions 2.4.24 through 2.4.31, inclusive, PostgreSQL
>> backends can crash at exit.  Raise a warning during "configure" based on
>> the compile-time OpenLDAP version number, and test the crash scenario in
>> the dblink test suite.  Back-patch to 9.0 (all supported versions).

Clearly 2.4.44 is not in the range 2.4.24 through 2.4.31.  Perhaps the
dangerous range is out of date?  Hmm, so Noah's analysis[1] says this
is a clash between libldap_r.so (used by libpq) and libldap.so (used
by the server), specifically in destructor/exit code.  Curiously, in a
thread about Curl's struggles with this problem, I found a claim[2]
that Debian decided to abandon the non-"_r" variant and just use _r
always.  Sure enough, on my Debian buster VM I see a symlink
libldap-2.4.so.2 -> libldap_r-2.4.so.2.  So essentially Debian and
friends have already forced Noah's first option on users:

> 1. Link the backend with libldap_r, so we never face the mismatch. On some
> platforms, this means also linking in threading libraries.

FreeBSD and CentOS systems near me have separate libraries still.

[1] 
https://www.postgresql.org/message-id/flat/20140612210219.GA705509%40tornado.leadboat.com
[2] https://www.openldap.org/lists/openldap-technical/201608/msg00094.html

-- 
Thomas Munro
https://enterprisedb.com



Re: LDAP authenticated session terminated by signal 11: Segmentation fault, PostgresSQL server terminates other active server processes

2019-02-26 Thread Thomas Munro
On Tue, Feb 26, 2019 at 8:17 PM Mike Yeap  wrote:
> Hi Thomas, does that mean the bug is still there?

Hi Mike,

I haven't tried to repro this myself, but it certainly sounds like it.
It also sounds like it would probably go away if you switched to a
Debian-derived distro, instead of a Red Hat-derived distro, but I
doubt that's the kind of advice you were looking for.  We need to
figure out a proper solution here, though I'm not sure what.  Question
for the list: other stuff in the server needs libpthread (SSL, LLVM,
...), so why are we insisting on using non-MT LDAP?

-- 
Thomas Munro
https://enterprisedb.com



Re: LDAP authenticated session terminated by signal 11: Segmentation fault, PostgresSQL server terminates other active server processes

2019-02-26 Thread Thomas Munro
On Tue, Feb 26, 2019 at 9:11 PM Thomas Munro  wrote:
> On Tue, Feb 26, 2019 at 8:17 PM Mike Yeap  wrote:
> > Hi Thomas, does that mean the bug is still there?

> I haven't tried to repro this myself, but it certainly sounds like it.
> It also sounds like it would probably go away if you switched to a
> Debian-derived distro, instead of a Red Hat-derived distro, but I
> doubt that's the kind of advice you were looking for.  We need to
> figure out a proper solution here, though I'm not sure what.  Question
> for the list: other stuff in the server needs libpthread (SSL, LLVM,
> ...), so why are we insisting on using non-MT LDAP?

Concretely, why don't we just kill the LDAP_LIBS_FE/LDAP_LIBS_BE
distinction and use a single LDAP_LIBS?  Then it'll always match.  It
can still be the non-MT variant if you build with
--disable-thread-safety (who does that?), but then it'll be the same
in the server too so that postgres_fdw + ldap works that way too.
Sketch patch attached.


--
Thomas Munro
https://enterprisedb.com
diff --git a/configure b/configure
index 481bd3b66d..e2fc438297 100755
--- a/configure
+++ b/configure
@@ -652,8 +652,7 @@ CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
 UUID_LIBS
-LDAP_LIBS_BE
-LDAP_LIBS_FE
+LDAP_LIBS
 PTHREAD_CFLAGS
 PTHREAD_LIBS
 PTHREAD_CC
@@ -12375,7 +12374,6 @@ else
   as_fn_error $? "library 'ldap' is required for LDAP" "$LINENO" 5
 fi
 
-LDAP_LIBS_BE="-lldap $EXTRA_LDAP_LIBS"
 if test "$enable_thread_safety" = yes; then
   # on some platforms ldap_r fails to link without PTHREAD_LIBS
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ldap_simple_bind in -lldap_r" >&5
@@ -12425,9 +12423,9 @@ else
   as_fn_error $? "library 'ldap_r' is required for LDAP" "$LINENO" 5
 fi
 
-  LDAP_LIBS_FE="-lldap_r $EXTRA_LDAP_LIBS"
+  LDAP_LIBS="-lldap_r $EXTRA_LDAP_LIBS"
 else
-  LDAP_LIBS_FE="-lldap $EXTRA_LDAP_LIBS"
+  LDAP_LIBS="-lldap $EXTRA_LDAP_LIBS"
 fi
 for ac_func in ldap_initialize
 do :
@@ -12488,14 +12486,12 @@ else
   as_fn_error $? "library 'wldap32' is required for LDAP" "$LINENO" 5
 fi
 
-LDAP_LIBS_FE="-lwldap32"
-LDAP_LIBS_BE="-lwldap32"
+LDAP_LIBS="-lwldap32"
   fi
   LIBS="$_LIBS"
 fi
 
 
-
 # for contrib/sepgsql
 if test "$with_selinux" = yes; then
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking for security_compute_create_name in -lselinux" >&5
diff --git a/configure.in b/configure.in
index 9c7a9738bc..bf21728268 100644
--- a/configure.in
+++ b/configure.in
@@ -1238,26 +1238,23 @@ if test "$with_ldap" = yes ; then
 AC_CHECK_LIB(ldap, ldap_bind, [],
 		 [AC_MSG_ERROR([library 'ldap' is required for LDAP])],
 		 [$EXTRA_LDAP_LIBS])
-LDAP_LIBS_BE="-lldap $EXTRA_LDAP_LIBS"
 if test "$enable_thread_safety" = yes; then
   # on some platforms ldap_r fails to link without PTHREAD_LIBS
   AC_CHECK_LIB(ldap_r, ldap_simple_bind, [],
 		   [AC_MSG_ERROR([library 'ldap_r' is required for LDAP])],
 		   [$PTHREAD_CFLAGS $PTHREAD_LIBS $EXTRA_LDAP_LIBS])
-  LDAP_LIBS_FE="-lldap_r $EXTRA_LDAP_LIBS"
+  LDAP_LIBS="-lldap_r $EXTRA_LDAP_LIBS"
 else
-  LDAP_LIBS_FE="-lldap $EXTRA_LDAP_LIBS"
+  LDAP_LIBS="-lldap $EXTRA_LDAP_LIBS"
 fi
 AC_CHECK_FUNCS([ldap_initialize])
   else
 AC_CHECK_LIB(wldap32, ldap_bind, [], [AC_MSG_ERROR([library 'wldap32' is required for LDAP])])
-LDAP_LIBS_FE="-lwldap32"
-LDAP_LIBS_BE="-lwldap32"
+LDAP_LIBS="-lwldap32"
   fi
   LIBS="$_LIBS"
 fi
-AC_SUBST(LDAP_LIBS_FE)
-AC_SUBST(LDAP_LIBS_BE)
+AC_SUBST(LDAP_LIBS)
 
 # for contrib/sepgsql
 if test "$with_selinux" = yes; then
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index c118f64040..6fe5afa890 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -284,8 +284,7 @@ AR = @AR@
 DLLTOOL = @DLLTOOL@
 DLLWRAP = @DLLWRAP@
 LIBS = @LIBS@
-LDAP_LIBS_FE = @LDAP_LIBS_FE@
-LDAP_LIBS_BE = @LDAP_LIBS_BE@
+LDAP_LIBS = @LDAP_LIBS@
 UUID_LIBS = @UUID_LIBS@
 UUID_EXTRA_OBJS = @UUID_EXTRA_OBJS@
 LLVM_LIBS=@LLVM_LIBS@
@@ -570,7 +569,7 @@ endif
 
 # Cygwin seems to need ldap libraries to be mentioned here, too
 ifeq ($(PORTNAME),cygwin)
-libpq_pgport += $(LDAP_LIBS_FE)
+libpq_pgport += $(LDAP_LIBS)
 endif
 
 
diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 373d73caef..8765818920 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -359,7 +359,7 @@ ifeq ($(PORTNAME), cygwin)
 # Cygwin case
 
 $(shlib): $(OBJS) | $(SHLIB_PREREQS)
-	$(CC) $(CFLAGS)  -shared -o $@ -Wl,--out-implib=$(stlib) $(OBJS) $(LDFLAGS) $(LDFLAGS_SL) $(SHLIB_LINK) $(
