Unable to startup postgres: Could not read from file "pg_clog/00EC"

2020-02-05 Thread Nick Renders

Hello,

Yesterday, we experienced some issues with our Postgres installation 
(v9.6 running on macOS 10.12).
It seems that the machine was automatically rebooted for an as yet unknown 
reason, and afterwards we were unable to start the Postgres service.


The postgres log shows the following:

2020-02-04 15:20:41 CET LOG:  database system was interrupted; last known up at 2020-02-04 15:18:34 CET
2020-02-04 15:20:43 CET LOG:  database system was not properly shut down; automatic recovery in progress
2020-02-04 15:20:44 CET LOG:  invalid record length at 14A/9E426DF8: wanted 24, got 0
2020-02-04 15:20:44 CET LOG:  redo is not required
2020-02-04 15:20:44 CET FATAL:  could not access status of transaction 247890764
2020-02-04 15:20:44 CET DETAIL:  Could not read from file "pg_clog/00EC" at offset 106496: Undefined error: 0.
2020-02-04 15:20:44 CET LOG:  startup process (PID 403) exited with exit code 1
2020-02-04 15:20:44 CET LOG:  aborting startup due to startup process failure
2020-02-04 15:20:44 CET LOG:  database system is shut down


After some searching, I found someone who had had a similar issue and 
was able to resolve it by overwriting the file in pg_clog.

So I tried the following command:

dd if=/dev/zero of=[dbpath]/pg_clog/00EC bs=256k count=1

and now the service is running again.
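
(If anyone ends up in the same situation: it is probably worth copying the 
existing file aside before overwriting it, so the original status data is not 
lost for good. Something along these lines, with [dbpath] being the data 
directory as above:

cp [dbpath]/pg_clog/00EC /some/safe/location/00EC.orig
)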


But I am worried that there might still be some issues that we haven't 
noticed yet. I also have no idea what caused this error in the first 
place. It might have been the reboot, but maybe the reboot was a result 
of a Postgres issue.


Is there anything specific I should check in our postgres installation / 
database to make sure it is running ok now? Any way to see what the 
consequences were of purging that one pg_clog file?


Best regards,

Nick Renders




Re: Unable to startup postgres: Could not read from file "pg_clog/00EC"

2020-02-06 Thread Nick Renders

Thank you for the feedback, Alvaro.

Unfortunately, the database is no longer "dumpable". We were able to do 
a pg_dump yesterday morning (12 hours after the crash + purging the 
pg_clog) but if we try one now, we get the following error:


	unexpected chunk number 1 (expected 0) for toast value 8282331 in 
pg_toast_38651


Looking at our data, there seem to be 6 tables that have corrupt 
records. Doing a SELECT * for one of those records will return a 
similar error:


missing chunk number 0 for toast value 8288522 in pg_toast_5572299


What is the best way to go from here? Is tracking down these corrupt 
records and deleting them the best / only solution?
Is there a way to determine if there are issues with new data (after the 
crash)?
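
(For reference, a reported pg_toast relation can be mapped back to the table 
that owns it with a catalog query; "mydb" stands in for the database name:

# find the table whose TOAST table is pg_toast_5572299
psql -d mydb -c "SELECT relname FROM pg_class WHERE reltoastrelid = 'pg_toast.pg_toast_5572299'::regclass;"

That at least narrows the search down to the affected tables.)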


Any help and advice is very much appreciated.

Thanks,


Nick Renders


On 5 Feb 2020, at 12:51, Alvaro Herrera wrote:


On 2020-Feb-05, Nick Renders wrote:

Is there anything specific I should check in our postgres installation /
database to make sure it is running ok now? Any way to see what the
consequences were of purging that one pg_clog file?


Losing pg_clog files is pretty bad, and should not happen; then again,
this might have been something else (ie. the file was maybe not lost).
That said, wrongly overwriting files is even worse.

By zeroing an existing pg_clog file, you marked a bunch of transactions
as aborted.  Your data is now probably inconsistent, if not downright
corrupt.  I would be looking for my most recent backup ...

If you're very lucky, your database might be pg_dumpable.  I would try
that, followed by restoring it in a separate clean instance and seeing
what happens.
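
(A minimal way to run that check, assuming a database named "mydb": dumping to 
/dev/null forces every row, including every TOAST value, to be read without 
keeping the output.

# any damaged rows will show up as pg_dump errors
pg_dump -U postgres mydb > /dev/null

Restoring the dump into a separate, clean instance, as suggested, remains the 
more thorough test.)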

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-08 Thread Nick Renders

Hi,

We have just upgraded our Postgres 9.6 database to 12.1 (pg_dumpall -> 
pg_restore on a clean installation) and now we are having some issues 
with one of our tables.


When we do the following statement:

SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'

the Postgres service restarts.

It seems that using UPPER() in the WHERE clause is causing this. The 
same statement without UPPER() works just fine.


I have tried to emulate the issue with other tables, but 
f_gsxws_schedule seems to be the only one.
The table also has another character field that is indexed, and the same 
problem occurs there. Whenever we use UPPER() or LOWER() to do a 
case-insensitive search, the service restarts.


Looking at the table's definition, I don't see anything different with 
the other tables.
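
(A low-risk way to compare the two cases, offered only as a suggestion: EXPLAIN 
without ANALYZE plans the query but does not execute it, so it should be safe 
to run even if execution crashes the backend. "mydb" is a placeholder.

psql -d mydb -c "EXPLAIN SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST';"
psql -d mydb -c "EXPLAIN SELECT * FROM f_gsxws_schedule WHERE gwsc_dossier = 'TEST';"

Comparing the two plans may show whether the UPPER() form takes a different 
index or execution path.)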


Here is what is logged:

2020-02-08 20:21:19.942 CET [83892] LOG:  server process (PID 85456) was terminated by signal 9: Killed: 9
2020-02-08 20:21:19.942 CET [83892] DETAIL:  Failed process was running: SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
2020-02-08 20:21:19.942 CET [83892] LOG:  terminating any other active server processes
2020-02-08 20:21:19.943 CET [85364] WARNING:  terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85364] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85364] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.943 CET [85360] WARNING:  terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85360] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85360] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.943 CET [85269] WARNING:  terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85269] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85269] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.946 CET [83892] LOG:  all server processes terminated; reinitializing
2020-02-08 20:21:19.988 CET [85686] LOG:  database system was interrupted; last known up at 2020-02-08 20:20:48 CET
2020-02-08 20:21:20.658 CET [85686] LOG:  database system was not properly shut down; automatic recovery in progress
2020-02-08 20:21:20.662 CET [85686] LOG:  redo starts at C/B99B45A0
2020-02-08 20:21:20.662 CET [85686] LOG:  invalid record length at C/B99B4688: wanted 24, got 0
2020-02-08 20:21:20.662 CET [85686] LOG:  redo done at C/B99B4650
2020-02-08 20:21:20.675 CET [83892] LOG:  database system is ready to accept connections



Has anyone noticed anything like this before? Any idea how to fix this?


Best regards,

Nick Renders




Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-11 Thread Nick Renders

Hi Thomas,

We are setting up a new test environment with 12.1.
Once it is running, I'll try out those commands and get back with the 
results.


Thanks,

Nick Renders


On 11 Feb 2020, at 2:51, Thomas Munro wrote:


On Mon, Feb 10, 2020 at 4:35 AM Marc  wrote:
We will keep the 12.1 in place so that we can run additional tests to 
assist to pin-point the issue.


Feel free to ask but allow us to recover from these hectic days ;-)


Here's how to get a stack so we can see what it was doing, assuming
you have the Apple developer tools installed:

1.  Find the PID of the backend you're connected to with SELECT
pg_backend_pid().
2.  "lldb -p PID" from a shell to attach to the process, then "cont"
to let it continue running.
3.  Run the query in that backend and wait for the SIGKILL.
4.  In the lldb session, type "bt".

It'll only make sense if your PostgreSQL build has debug symbols, but 
let's see.
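
For reference, spelled out as a terminal session the steps above look roughly 
like this (the PID is just an example, and attaching to the postgres process 
may require sudo):

$ psql mydb                       # open an interactive session and keep it open
mydb=# SELECT pg_backend_pid();   -- note the PID it returns, e.g. 12345

$ sudo lldb -p 12345              # in a second terminal
(lldb) cont
# now run the failing query in the psql session and wait for the SIGKILL
(lldb) bt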





PL/pgSQL question about EXCEPTION clause & corrupt records

2020-02-14 Thread Nick Renders



Hello,

We recently suffered a database crash which resulted in some corrupt 
records.


I thought I would write a little PL script that would loop through all 
the data and report any inconsistencies. However, I can't get it to work 
properly.


For instance, if I run the following statement in pgAdmin:

SELECT * FROM f_gsxws_transaction WHERE gwta_number = 762513

I get the following message:

	ERROR:  missing chunk number 0 for toast value 8289525 in 
pg_toast_5572299



So, as a test, I created a function that would just retrieve that one 
record:


DECLARE
	rcontent f_gsxws_transaction%ROWTYPE;
BEGIN
	SELECT * INTO rcontent FROM f_gsxws_transaction WHERE gwta_number = 762513;
	RETURN rcontent;
EXCEPTION WHEN OTHERS THEN
	RAISE NOTICE 'Record 762513 is corrupt';
END;

Now, when I run this function, I have noticed two things:

1) The function has no problem executing the SELECT statement. It is 
only when "rcontents" is returned, that the function fails. This is a 
problem, because the ultimate goal is to loop through all records and 
only return/alert something in case of an error.


2) The function never enters the EXCEPTION clause. Instead, when it hits 
the RETURN command, it breaks and shows the same message as in pgAdmin: 
missing chunk number 0 for toast value 8289525 in pg_toast_5572299.



Is it possible to check for this kind of error with a PL script? Or is 
there perhaps a better way to check for corrupt records in a database?


Best regards,

Nick Renders




Re: PL/pgSQL question about EXCEPTION clause & corrupt records

2020-02-17 Thread Nick Renders
The problem is that I don't know which column is corrupt. But I found a 
solution: by simply copying the record into another variable, the values 
are parsed and the TOAST errors are thrown.


In case anyone's interested, here's my code, based on an example from 
http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html



DO $f$
DECLARE
	rContent1 record;
	rContent2 record;
	iCounter integer DEFAULT 1;
	iValue integer;
	pTableName varchar := 'f_gsxws_transaction';
	pFieldName varchar := 'gwta_number';
BEGIN
	FOR iValue IN EXECUTE 'SELECT ' || pFieldName || ' FROM ' || pTableName::regclass || ' ORDER BY ' || pFieldName LOOP
		BEGIN
			EXECUTE 'SELECT * FROM ' || pTableName::regclass || ' WHERE ' || pFieldName || ' = $1'
				INTO rContent1
				USING iValue;
			rContent2 := rContent1;
		EXCEPTION WHEN OTHERS THEN
			RAISE NOTICE 'data for %.% % is corrupt', pTableName, pFieldName, iValue;
		END;
		IF iCounter % 10 = 0 THEN
			RAISE NOTICE '% % records checked', iCounter, pTableName;
		END IF;
		iCounter := iCounter + 1;
	END LOOP;
END;
$f$;


Cheers,

Nick


On 14 Feb 2020, at 16:14, Tom Lane wrote:


"Nick Renders"  writes:
I thought I would write a little PL script that would loop through all
the data and report any inconsistencies. However, I can't get it to work
properly.
...
1) The function has no problem executing the SELECT statement. It is
only when "rcontents" is returned, that the function fails. This is a
problem, because the ultimate goal is to loop through all records and
only return/alert something in case of an error.
2) The function never enters the EXCEPTION clause. Instead, when it hits
the RETURN command, it breaks and shows the same message as in pgAdmin:
missing chunk number 0 for toast value 8289525 in pg_toast_5572299.

I think what's happening there is that the function doesn't try to
dereference the value's TOAST pointer during SELECT INTO.  It just stores
that pointer into a variable, and only sometime later when the actual
content of the value is demanded, do you see the error raised.

The solution to that is to do something that uses the contents of the
busted column right away while still inside the EXCEPTION block, perhaps
along the lines of "select md5(mycolumn) into local_variable from..."

A close reading of

https://www.postgresql.org/docs/current/storage-toast.html

would probably help you understand what's happening here.

regards, tom lane





Re: PL/pgSQL question about EXCEPTION clause & corrupt records

2020-02-17 Thread Nick Renders

Hi Jeremy,

This happened on PostgreSQL v9.6, which crashed 2 weeks ago.
Since then we have upgraded and restored our server, but my example is 
from the older, corrupt database.


Nick


On 15 Feb 2020, at 5:30, Jeremy Schneider wrote:


On Feb 14, 2020, at 04:39, Nick Renders  wrote:

I get the following message:

  ERROR:  missing chunk number 0 for toast value 8289525 in 
pg_toast_5572299


What version of PostgreSQL are you running? I’ve seen this a number 
of times the past couple years; curious if the lurking bug is still 
observed in latest versions.


-Jeremy

Sent from my TI-83





Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-24 Thread Nick Renders
We have set up a new test environment running PostgreSQL v12.2 on macOS 
10.14 and the issue is still there.
One thing I noticed is that the returned columns do not affect the 
behaviour:

	SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'

and

	SELECT gwsc_sequence FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'


both kill the postgres service.


I will try to free some time next week to install the Apple developer 
tools and further analyse the problem.


Best regards,

Nick


On 11 Feb 2020, at 12:32, Nick Renders wrote:


Hi Thomas,

We are setting up a new test environment with 12.1.
Once it is running, I'll try out those commands and get back with the 
results.


Thanks,

Nick Renders


On 11 Feb 2020, at 2:51, Thomas Munro wrote:


On Mon, Feb 10, 2020 at 4:35 AM Marc  wrote:
We will keep the 12.1 in place so that we can run additional tests 
to assist to pin-point the issue.


Feel free to ask but allow us to recover from these hectic days ;-)


Here's how to get a stack so we can see what it was doing, assuming
you have the Apple developer tools installed:

1.  Find the PID of the backend you're connected to with SELECT
pg_backend_pid().
2.  "lldb -p PID" from a shell to attach to the process, then "cont"
to let it continue running.
3.  Run the query in that backend and wait for the SIGKILL.
4.  In the lldb session, type "bt".

It'll only make sense if your PostgreSQL build has debug symbols, but 
let's see.





Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-25 Thread Nick Renders

Hi Tom,

1. We used the EDB installer.

2. Turning JIT off did make the problem go away. So I guess this was 
causing the Postgres process to crash all along.
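
For anyone who runs into the same crash: JIT can be switched off per session 
for testing, or cluster-wide. A sketch, with "mydb" as a placeholder for the 
database name:

# per-session test of the offending statement
psql -d mydb -c "SET jit = off; SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST';"

# cluster-wide and persistent (needs a superuser; a reload is enough, no restart)
psql -d mydb -c "ALTER SYSTEM SET jit = off;"
psql -d mydb -c "SELECT pg_reload_conf();"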


Thanks for the help,

Nick


On 24 Feb 2020, at 16:24, Tom Lane wrote:


"Nick Renders"  writes:
We have set up a new test environment running PostgreSQL v12.2 on macOS
10.14 and the issue is still there.


Some nearby threads prompt these two questions:

1. Are you using your own build, or is this from EDB's installer?

2. If the latter, does turning JIT off ("set jit = off") make the
problem go away?

There is as yet no "native" support for --with-llvm on macOS,
ie Apple themselves don't provide sufficient support for that.
EDB seem to have hacked up something that sort of works, but
only sort of.

regards, tom lane





Postgres on macOS 10

2020-03-03 Thread Nick Renders

Hi,

We just noticed something strange with our Postgres server.

We have Postgres 11 and 12 running on macOS 10.14 (Mojave), installed 
with the EDB installer. Whenever the machine is restarted, the Postgres 
service cannot be launched until a macOS user logs in. We have 
"automatic login" turned off in the system preferences, meaning that 
when the machine reboots, you get the macOS login screen.


In the system.log we see the following 2 lines over and over again, 
until a user has logged in:


	Mar  3 09:37:19 postgrestest com.apple.xpc.launchd[1] (com.edb.launchd.postgresql-12[319]): Service exited with abnormal code: 2
	Mar  3 09:37:19 postgrestest com.apple.xpc.launchd[1] (com.edb.launchd.postgresql-12): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.



It doesn't matter which macOS user logs in; it doesn't have to be an 
Administrator. But once a user has logged in, the Postgres service is 
finally launched properly. Afterwards, the macOS user can log out again, 
and the service will continue running.


It doesn't seem to be limited to Mojave, either. I did a quick test on 
our older, decommissioned Postgres server (Postgres 9 on macOS 10.12) 
and the same issue seems to occur there.


Has anyone noticed something similar with macOS? Or is it just our 
setup?
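
One thing that might help narrow it down, offered as an untested suggestion: 
launchd keeps some state about the job that can be inspected from a shell, 
using the label that appears in the log above:

# last exit status of the job, as launchd sees it
sudo launchctl list | grep -i edb

# full details of the job definition and its recent exit reasons
sudo launchctl print system/com.edb.launchd.postgresql-12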


Best regards,

Nick Renders




archive_command debugging

2023-08-23 Thread Nick Renders
Hello,

I was wondering if anyone had any good tips for "debugging" the archive_command 
in the postgresql.conf.

I want to see what certain settings are when the command is run, like $PATH and 
the output of id, but I don't want the command to succeed either, so that the WAL 
files remain untouched.

Any tips?
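
One approach that should do exactly that, sketched under the assumption that a 
wrapper script at /usr/local/bin/archive_debug.sh is acceptable: point 
archive_command at a script that records its environment and then exits 
non-zero. A non-zero exit status makes PostgreSQL keep the WAL file and retry 
it later, so nothing gets archived or removed.

# in postgresql.conf:
#   archive_command = '/usr/local/bin/archive_debug.sh "%p" "%f"'

#!/bin/sh
# /usr/local/bin/archive_debug.sh: log the environment, then fail on purpose
{
  date
  id                        # which OS user the command runs as
  echo "PATH=$PATH"
  echo "p=$1 f=$2"          # %p = path of the WAL file, %f = its file name
  env
} >> /tmp/archive_debug.log 2>&1
exit 1                      # non-zero: the WAL file is kept and retried later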

Thanks,

Nick Renders




could not open file "global/pg_filenode.map": Operation not permitted

2024-02-26 Thread Nick Renders
Hello,

We have a Postgres server that intermittently logs the following:

2024-02-26 10:29:41.580 CET [63962] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:11.147 CET [90610] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:30:11.149 CET [63975] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:35.941 CET [63986] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:41.546 CET [63991] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:44.398 CET [63994] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:31:11.149 CET [90610] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:31:11.151 CET [64008] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:31:41.546 CET [64023] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:32:11.150 CET [90610] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:32:11.153 CET [64035] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:32:41.547 CET [64050] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:33:11.151 CET [90610] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:33:11.153 CET [64062] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:33:41.548 CET [64087] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted


This has happened 3 times over the last 2 weeks now, without any indication of 
what caused it.
The privileges of those 2 files are all in order.
When this happens, the server is no longer accessible, and we need to restart 
the service (pg_ctl restart).
Once restarted, Postgres runs fine again for a couple of days.

We are running PostgreSQL 16.2 on macOS 14.3.1.

Any idea what might be causing this issue, or how to resolve it?


Best regards,

Nick Renders




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-03-11 Thread Nick Renders
Thank you for your reply Laurenz.
I don't think it is related to any third party security software. We have 
several other machines with a similar setup, but this is the only server that 
has this issue.

The one thing different about this machine, however, is that it runs 2 instances 
of Postgres:
- cluster A on port 165
- cluster B on port 164
Cluster A is actually a backup from another Postgres server that is restored on 
a daily basis via Barman. This means that we log in remotely from the Barman 
server over SSH, stop cluster A's service (port 165), clear the Data folder, 
restore the latest backup into the Data folder, and start up the service again.
Cluster B's Data and service (port 164) remain untouched during all this time. 
This is the cluster that experiences the intermittent "operation not permitted" 
issue.

Over the past 2 weeks, I have suspended our restore script and the issue did 
not occur.
I have just performed another restore on cluster A and now cluster B is 
throwing errors in the log again.

Any idea why this is happening? It does not occur with every restore, but it 
seems to be related anyway.

Thanks,

Nick Renders


On 26 Feb 2024, at 16:29, Laurenz Albe wrote:

> On Mon, 2024-02-26 at 15:14 +0100, Nick Renders wrote:
>> We have a Postgres server that intermittently logs the following:
>>
>> 2024-02-26 10:29:41.580 CET [63962] FATAL:  could not open file 
>> "global/pg_filenode.map": Operation not permitted
>> 2024-02-26 10:30:11.147 CET [90610] LOG:  could not open file 
>> "postmaster.pid": Operation not permitted; continuing anyway
>>
>> This has happened 3 times over the last 2 weeks now, without any indication 
>> what caused it.
>> The privileges of those 2 files are all in order.
>> When this happens, the server is no longer accessible, and we need to 
>> restart the service (pg_ctl restart).
>> Once restarted, Popstgres runs fine again for a couple of days.
>>
>> We are running PostgreSQL 16.2 on macOS 14.3.1.
>
> Perhaps that is some kind of virus checker or something else that locks files.
>
> Yours,
> Laurenz Albe




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-03-12 Thread Nick Renders
On 11 Mar 2024, at 16:04, Adrian Klaver wrote:

> On 3/11/24 03:11, Nick Renders wrote:
>> Thank you for your reply Laurenz.
>> I don't think it is related to any third party security software. We have 
>> several other machines with a similar setup, but this is the only server 
>> that has this issue.
>>
>> The one thing different about this machine however, is that it runs 2 
>> instances of Postgres:
>> - cluster A on port 165
>> - cluster B on port 164
>> Cluster A is actually a backup from another Postgres server that is restored 
>> on a daily basis via Barman. This means that we login remotely from the 
>> Barman server over SSH, stop cluster A's service (port 165), clear the Data 
>> folder, restore the latest back into the Data folder, and start up the 
>> service again.
>> Cluster B's Data and service (port 164) remain untouched during all this 
>> time. This is the cluster that experiences the intermittent "operation not 
>> permitted" issue.
>>
>> Over the past 2 weeks, I have suspended our restore script and the issue did 
>> not occur.
>> I have just performed another restore on cluster A and now cluster B is 
>> throwing errors in the log again.
>
> Since it seems to be the trigger, what are the contents of the restore script?
>
>>
>> Any idea why this is happening? It does not occur with every restore, but it 
>> seems to be related anyway.
>>
>> Thanks,
>>
>> Nick Renders
>>
>
>
> -- 
> Adrian Klaver
> adrian.kla...@aklaver.com



> ...how are A and B connected?

The 2 clusters are not connected. They run on the same macOS 14 machine with a 
single Postgres installation ( /Library/PostgreSQL/16/ ) and their respective 
Data folders are located on the same volume ( 
/Volumes/Postgres_Data/PostgreSQL/16/data and 
/Volumes/Postgres_Data/PostgreSQL/16-DML/data ). Besides that, they run 
independently on 2 different ports, specified in the postgresql.conf.


> ...run them under different users on the system.

Are you referring to the "postgres" user / role? Does that also mean setting up 
2 postgres installation directories?


> ...what are the contents of the restore script?

## stop cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data stop'

## save config files (ARC_postgresql_16.conf is included in postgresql.conf and contains cluster-specific information like the port number)
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp ARC_postgresql_16.conf ../ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp pg_hba.conf ../pg_hba.conf'

## clear data directory
ssh postgres@10.0.0.1 'rm -r /Volumes/Postgres_Data/PostgreSQL/16/data/*'

## transfer recovery (this will copy the backup "20240312T040106" and any lingering WAL files into the Data folder)
barman recover --remote-ssh-command 'ssh postgres@10.0.0.1' pg 20240312T040106 /Volumes/Postgres_Data/PostgreSQL/16/data

## restore config files
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cd .. && mv ARC_postgresql_16.conf /Volumes/Postgres_Data/PostgreSQL/16/data/ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cd .. && mv pg_hba.conf /Volumes/Postgres_Data/PostgreSQL/16/data/pg_hba.conf'

## start cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data start > /dev/null'


This script runs on a daily basis at 4:30 AM. It did so this morning and there 
was no issue with cluster B. So even though the issue is most likely related to 
the script, it does not cause it every time.


Best regards,

Nick Renders





Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-03-22 Thread Nick Renders
On 13 Mar 2024, at 12:35, Stephen Frost wrote:

> Greetings,
>
> * Nick Renders (postg...@arcict.com) wrote:
>>> ...run them under different users on the system.
>>
>> Are you referring to the "postgres" user / role? Does that also mean setting 
>> up 2 postgres installation directories?
>
> Yes, two separate MacOS user accounts is what I was suggesting.  You
> could use the same postgres binaries though, no need to have two
> installations of them.  You'd need separate data directories, of course,
> as you have currently.
>
>> This script runs on a daily basis at 4:30 AM. It did so this morning and 
>> there was no issue with cluster B. So even though the issue is most likely 
>> related to the script, it does not cause it every time.
>
> Seems likely that it's some sort of race condition.
>
> Thanks,
>
> Stephen

We now have a second machine with this issue: it is an Intel Mac mini running 
macOS Sonoma (14.4) and PostgreSQL 16.2.
This one only has a single Data directory, so there are no multiple instances 
running.

I installed Postgres yesterday and restored a copy from our live database in 
the Data directory. The Postgres process started up without problems, but after 
40 minutes it started throwing the same errors in the log:

2024-03-21 11:49:27.410 CET [1655] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:49:46.955 CET [1760] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:50:07.398 CET [965] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway

I stopped and started the process, and it continued working again until around 
21:20, when the issue popped up again. I wasn't doing anything on the machine 
at that time, so I have no idea what might have triggered it.

Is there perhaps some feature that I can enable that logs which processes use 
these 2 files?
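
Two tools that ship with macOS can show this, as a starting point (the data 
directory path below is just an example):

# live trace of file system calls touching the two files (Ctrl-C to stop)
sudo fs_usage -w | grep -E 'pg_filenode\.map|postmaster\.pid'

# which processes currently have the file open
sudo lsof /path/to/data/global/pg_filenode.map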

Thanks,

Nick Renders




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-03-28 Thread Nick Renders

On 22 Mar 2024, at 17:00, Alban Hertroys wrote:

On Fri, 22 Mar 2024 at 15:01, Nick Renders  wrote:

We now have a second machine with this issue: it is an Intel Mac mini
running macOS Sonoma (14.4) and PostgreSQL 16.2.
This one only has a single Data directory, so there are no multiple
instances running.

I don't think that having a single Data directory prevents multiple
instances from running. That's more of a matter of how often pg_ctl was
called with the start command for that particular data directory.

I installed Postgres yesterday and restored a copy from our live database
in the Data directory.

How did you restore that copy? Was that a file-based copy perhaps? Your
files may have incorrect owners or permissions in that case.

The Postgres process started up without problems, but after 40 minutes it
started throwing the same errors in the log:

2024-03-21 11:49:27.410 CET [1655] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:49:46.955 CET [1760] FATAL:  could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:50:07.398 CET [965] LOG:  could not open file "postmaster.pid": Operation not permitted; continuing anyway

It's possible that some other process put a lock on these files. Spotlight
perhaps? Or TimeMachine?

I stopped and started the process, and it continued working again until
around 21:20, when the issue popped up again. I wasn't doing anything on
the machine at that time, so I have no idea what might have triggered it.

Is there perhaps some feature that I can enable that logs which processes
use these 2 files?

IIRC, MacOS comes shipped with the lsof command, which will tell you which
processes have a given file open. See man lsof.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.



I have tried the lsof command, but it returns no info about the 
postmaster.pid and global/pg_filenode.map files, so I take it they are not 
open at that moment.


Spotlight indexing has been disabled, and TimeMachine takes no snapshots 
of the volume where the data resides.


Looking at the 2 machines that are having this issue (and the others 
that don't), I think it is somehow related to the following setup:

- macOS Sonoma (14.4 and 14.4.1)
- data directory on an external drive

That external drive (a Promise RAID system in one case, a simple SSD in 
the other) has the option "ignore ownership" on by default. I have tried 
disabling that, and updating the data directory to have owner + 
read/write access for the postgres user. It seemed to work at first, but 
just now the issue re-appeared again.


Any other suggestions?

Thanks,

Nick Renders

Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-04-03 Thread Nick Renders
On 29 Mar 2024, at 4:25, Thomas Munro wrote:
>
> I don't have any specific ideas and I have no idea what "ignore
> ownership" means ... what kind of filesystem is running on it?  For
> the simple SSD, is it directly connected, running a normal Apple APFS
> filesystem, or something more complicated?
>
> I wonder if this could be related to the change in 16 which started to
> rename that file:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d8cd0c6c95c0120168df93aae095df4e0682a08a
>
> Did you ever run 15 or earlier on that system?


In the macOS Finder, when you show the Info (command+i) for an external drive 
(or any partition that is not the boot drive), there is a checkbox "Ignore 
ownership on this volume" in the Permissions section. I think it is by default 
"on" for external drives.

The external SSD is an Orico drive that is connected with USB-C. It is 
initialised as a GUID Partition Map with a single APFS partition.

We have run PostgreSQL 15 and earlier, before upgrading to 16 when it came out 
last year. We didn't have any problems with 16 until recently, after upgrading 
to Sonoma.


Nick




pg_upgrade - fe_sendauth: no password supplied

2021-09-06 Thread Nick Renders

Hello,

I have been trying to use the pg_upgrade command to update a PostgreSQL 
11 environment to 13 on macOS 11.


I have followed the steps in the documentation, but the command always 
fails when trying to connect to the original database. This is the 
command that is sent:


	/Library/PostgreSQL/13/bin/pg_upgrade -b /Library/PostgreSQL/11/bin -B /Library/PostgreSQL/13/bin -d /Volumes/Postgres_Data/PostgreSQL/11/data -D /Volumes/Postgres_Data/PostgreSQL/13/data -p 49156 -P 49155 -U postgres -j 24 -v


And this is what is logged:

connection to database failed: fe_sendauth: no password supplied
could not connect to source postmaster started with the command:
	"/Library/PostgreSQL/11/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Volumes/Postgres_Data/PostgreSQL/11/data" -o "-p 49156 -b  -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Volumes/Free/Upgrade'" start
	"/Library/PostgreSQL/11/bin/pg_ctl" -w -D "/Volumes/Postgres_Data/PostgreSQL/11/data" -o "" -m fast stop >> "pg_upgrade_server.log" 2>&1



According to the documentation, the connection should be established 
with the data in the .pgpass file. Its contents look like this (the 
password has been changed) :


localhost:49155:*:postgres:password1234
localhost:49156:*:postgres:password1234

The .pgpass file works without problems with the pg_dump and pg_restore 
commands, so I'm fairly certain its contents and privileges are set 
correctly.



The PostgreSQL documentation also mentions that you can update the 
pg_hba.conf file to use authentication method "peer". This has no effect 
either, however when I set it to "trust", the command goes through just 
fine.


So I have been able to do the upgrade, but I am still wondering why I 
got the error in the first place. Any idea why the .pgpass file isn't 
working with the pg_upgrade command?
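
One detail that may be worth ruling out (an unverified guess): pg_upgrade 
starts the old postmaster with a non-default unix_socket_directories, as shown 
in the pg_ctl command above, and a .pgpass host field of "localhost" only 
matches Unix-socket connections that go through the default socket directory. 
A wildcard host entry avoids that distinction, for example:

# append to the .pgpass of the OS user that runs pg_upgrade; '*' is a wildcard,
# and the file must not be group- or world-readable
cat >> ~/.pgpass <<'EOF'
*:49155:*:postgres:password1234
*:49156:*:postgres:password1234
EOF
chmod 600 ~/.pgpass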


Best regards,

Nick Renders




PostgreSQL 14: pg_dump / pg_restore error: could not write to the communication channel: Broken pipe

2021-10-15 Thread Nick Renders

Hello,

I have been trying to import a Postgres 11 database into Postgres 14, 
but the pg_restore command exits with the following message:


	pg_restore: error: could not write to the communication channel: Broken 
pipe


The command I sent looks like this:

	/Library/PostgreSQL/14/bin/pg_restore -h localhost -p 48100 -U postgres -w -C -d template1 -j 24 -Fd /Volumes/Migration/dbname --verbose



It seems that the multiple jobs parameter is the cause. If I specify "-j 
1", the command works without problems. If I specify "-j 2" or higher, I 
get the above error after a few seconds.


Postgres is running on a Mac Pro 12-core machine, so it has plenty of 
resources at its disposal. The config file is a copy of the Postgres 11 
configuration, which has no problem with multiple jobs.



Furthermore, the pg_dump command seems to have the same issue as well. 
The following command:


	/Library/PostgreSQL/14/bin/pg_dump -h localhost -p 48100 -U postgres -w ServicePGR_UTF8 -j 24 -Fd -f /Volumes/Migration/dbname --verbose


will stop prematurely with the following error:

	pg_dump: error: could not write to the communication channel: Broken 
pipe



Does this sound familiar to anyone? Is it an issue with the new Postgres 
14 release, or is there something else that might be causing this?


Best regards,

Nick Renders




Re: PostgreSQL 14: pg_dump / pg_restore error: could not write to the communication channel: Broken pipe

2021-10-18 Thread Nick Renders

Thank you for all the feedback and suggestions.

It seems that the "-h localhost" parameter is triggering the issue. If I 
leave it out, pg_restore works without problems with multiple jobs. I 
have also tried specifying the IP number instead of "localhost", but 
that results in the same error.


I see now that our original pg_restore script does not include the -h 
parameter. Somehow, it has snuck in my commands when testing Postgres 
14. That might mean that the same issue exists in previous versions as 
well. I will investigate further.
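
For the record, the variant that works here is simply the same command without 
-h, so that libpq connects over the local Unix socket instead of TCP:

/Library/PostgreSQL/14/bin/pg_restore -p 48100 -U postgres -w -C -d template1 -j 24 -Fd /Volumes/Migration/dbname --verbose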


Nick


On 15 Oct 2021, at 19:08, Tom Lane wrote:


Alvaro Herrera  writes:
Yeah, pg_dump in parallel mode uses a pipe to communicate between leader
and workers; the error you see is what happens when a write to the pipe
fails.  It sounds to me like something in the operating system is
preventing the pipes from working properly.


BTW, I think a more likely explanation is "one of the pg_dump or
pg_restore worker processes crashed".  Why that should be is still
a mystery though.

regards, tom lane