Unable to start up postgres: Could not read from file "pg_clog/00EC"
Hello,

Yesterday we experienced some issues with our Postgres installation (v9.6 running on macOS 10.12). It seems that the machine was automatically rebooted for an as-yet-unknown reason, and afterwards we were unable to start the Postgres service. The postgres log shows the following:

2020-02-04 15:20:41 CET LOG: database system was interrupted; last known up at 2020-02-04 15:18:34 CET
2020-02-04 15:20:43 CET LOG: database system was not properly shut down; automatic recovery in progress
2020-02-04 15:20:44 CET LOG: invalid record length at 14A/9E426DF8: wanted 24, got 0
2020-02-04 15:20:44 CET LOG: redo is not required
2020-02-04 15:20:44 CET FATAL: could not access status of transaction 247890764
2020-02-04 15:20:44 CET DETAIL: Could not read from file "pg_clog/00EC" at offset 106496: Undefined error: 0.
2020-02-04 15:20:44 CET LOG: startup process (PID 403) exited with exit code 1
2020-02-04 15:20:44 CET LOG: aborting startup due to startup process failure
2020-02-04 15:20:44 CET LOG: database system is shut down

After some searching, I found someone who had had a similar issue and was able to resolve it by overwriting the file in pg_clog. So I tried the following command:

dd if=/dev/zero of=[dbpath]/pg_clog/00EC bs=256k count=1

and now the service is running again. But I am worried that there might still be some issues that we haven't noticed yet. I also have no idea what caused this error in the first place. It might have been the reboot, but maybe the reboot was a result of a Postgres issue.

Is there anything specific I should check in our postgres installation / database to make sure it is running OK now? Any way to see what the consequences were of purging that one pg_clog file?

Best regards,

Nick Renders
Re: Unable to start up postgres: Could not read from file "pg_clog/00EC"
Thank you for the feedback, Alvaro.

Unfortunately, the database is no longer "dumpable". We were able to do a pg_dump yesterday morning (12 hours after the crash + purging the pg_clog) but if we try one now, we get the following error:

unexpected chunk number 1 (expected 0) for toast value 8282331 in pg_toast_38651

Looking at our data, there seem to be 6 tables that have corrupt records. Doing a SELECT * for one of those records will return a similar error:

missing chunk number 0 for toast value 8288522 in pg_toast_5572299

What is the best way to go from here? Is tracking down these corrupt records and deleting them the best / only solution? Is there a way to determine if there are issues with new data (after the crash)?

Any help and advice is very much appreciated.

Thanks,

Nick Renders

On 5 Feb 2020, at 12:51, Alvaro Herrera wrote:

> On 2020-Feb-05, Nick Renders wrote:
>> Is there anything specific I should check in our postgres installation / database to make sure it is running OK now? Any way to see what the consequences were of purging that one pg_clog file?
>
> Losing pg_clog files is pretty bad, and should not happen; then again, this might have been something else (i.e. the file was maybe not lost). That said, wrongly overwriting files is even worse. By zeroing an existing pg_clog file, you marked a bunch of transactions as aborted. Your data is now probably inconsistent, if not downright corrupt. I would be looking for my most recent backup ...
>
> If you're very lucky, your database might be pg_dumpable. I would try that, followed by restoring it in a separate clean instance and seeing what happens.
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
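A minimal sketch of the dump-and-restore test Alvaro describes: dump the damaged cluster and try to load it into a throw-away scratch instance. The ports, paths and the use of pg_dumpall here are placeholders, not values taken from this thread.

    # dump the whole (possibly damaged) 9.6 cluster
    pg_dumpall -p 5432 -U postgres -f /tmp/full_dump.sql

    # create and start a throw-away scratch cluster on another port
    initdb -U postgres -D /tmp/scratch_data
    pg_ctl -D /tmp/scratch_data -o "-p 5433" -w start

    # try to load the dump there and watch for errors
    psql -p 5433 -U postgres -d postgres -f /tmp/full_dump.sql

If either the dump or the restore fails, the error messages point at the tables (or TOAST tables) affected by zeroing the pg_clog segment.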
Postgres 12.1 : UPPER() in WHERE clause restarts server
Hi,

We have just upgraded our Postgres 9.6 database to 12.1 (pg_dumpall -> pg_restore on a clean installation) and now we are having some issues with one of our tables. When we run the following statement:

SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'

the Postgres service restarts.

It seems that using UPPER() in the WHERE clause is causing this. The same statement without UPPER() works just fine. I have tried to reproduce the issue with other tables, but f_gsxws_schedule seems to be the only one. The table also has another character field that is indexed, and the same problem occurs there: whenever we use UPPER() or LOWER() to do a case-insensitive search, the service restarts. Looking at the table's definition, I don't see anything different from the other tables.

Here is what is logged:

2020-02-08 20:21:19.942 CET [83892] LOG: server process (PID 85456) was terminated by signal 9: Killed: 9
2020-02-08 20:21:19.942 CET [83892] DETAIL: Failed process was running: SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
2020-02-08 20:21:19.942 CET [83892] LOG: terminating any other active server processes
2020-02-08 20:21:19.943 CET [85364] WARNING: terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85364] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85364] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.943 CET [85360] WARNING: terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85360] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85360] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.943 CET [85269] WARNING: terminating connection because of crash of another server process
2020-02-08 20:21:19.943 CET [85269] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-02-08 20:21:19.943 CET [85269] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-02-08 20:21:19.946 CET [83892] LOG: all server processes terminated; reinitializing
2020-02-08 20:21:19.988 CET [85686] LOG: database system was interrupted; last known up at 2020-02-08 20:20:48 CET
2020-02-08 20:21:20.658 CET [85686] LOG: database system was not properly shut down; automatic recovery in progress
2020-02-08 20:21:20.662 CET [85686] LOG: redo starts at C/B99B45A0
2020-02-08 20:21:20.662 CET [85686] LOG: invalid record length at C/B99B4688: wanted 24, got 0
2020-02-08 20:21:20.662 CET [85686] LOG: redo done at C/B99B4650
2020-02-08 20:21:20.675 CET [83892] LOG: database system is ready to accept connections

Has anyone noticed anything like this before? Any idea how to fix this?

Best regards,

Nick Renders
Re: Postgres 12.1 : UPPER() in WHERE clause restarts server
Hi Thomas,

We are setting up a new test environment with 12.1. Once it is running, I'll try out those commands and get back with the results.

Thanks,

Nick Renders

On 11 Feb 2020, at 2:51, Thomas Munro wrote:

> On Mon, Feb 10, 2020 at 4:35 AM Marc wrote:
>> We will keep the 12.1 in place so that we can run additional tests to assist to pin-point the issue. Feel free to ask but allow us to recover from these hectic days ;-)
>
> Here's how to get a stack so we can see what it was doing, assuming you have the Apple developer tools installed:
>
> 1. Find the PID of the backend you're connected to with SELECT pg_backend_pid().
> 2. "lldb -p PID" from a shell to attach to the process, then "cont" to let it continue running.
> 3. Run the query in that backend and wait for the SIGKILL.
> 4. In the lldb session, type "bt".
>
> It'll only make sense if your PostgreSQL build has debug symbols, but let's see.
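For reference, Thomas's recipe in compact terminal form; a sketch only, with 12345 standing in for whatever PID pg_backend_pid() actually returns:

    -- in the psql session that will run the failing query
    SELECT pg_backend_pid();   -- say it returns 12345

    # in a separate shell, attach lldb and let the backend continue
    lldb -p 12345
    (lldb) cont

    -- back in psql, run the crashing query and wait for the SIGKILL
    SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST';

    # then, in the lldb session, print the backtrace
    (lldb) bt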
PL/pgSQL question about EXCEPTION clause & corrupt records
Hello,

We recently suffered a database crash which resulted in some corrupt records. I thought I would write a little PL/pgSQL script that would loop through all the data and report any inconsistencies. However, I can't get it to work properly.

For instance, if I run the following statement in pgAdmin:

SELECT * FROM f_gsxws_transaction WHERE gwta_number = 762513

I get the following message:

ERROR: missing chunk number 0 for toast value 8289525 in pg_toast_5572299

So, as a test, I created a function that would just retrieve that one record:

DECLARE
    rcontent f_gsxws_transaction%ROWTYPE;
BEGIN
    SELECT * INTO rcontent FROM f_gsxws_transaction WHERE gwta_number = 762513;
    RETURN rcontent;
EXCEPTION
    WHEN OTHERS THEN
        RAISE NOTICE 'Record 762513 is corrupt';
END;

Now, when I run this function, I have noticed two things:

1) The function has no problem executing the SELECT statement. It is only when "rcontent" is returned that the function fails. This is a problem, because the ultimate goal is to loop through all records and only return/alert something in case of an error.

2) The function never enters the EXCEPTION clause. Instead, when it hits the RETURN command, it breaks and shows the same message as in pgAdmin: missing chunk number 0 for toast value 8289525 in pg_toast_5572299.

Is it possible to check for these kinds of errors with a PL/pgSQL script? Or is there perhaps a better way to check for corrupt records in a database?

Best regards,

Nick Renders
Re: PL/pgSQL question about EXCEPTION clause & corrupt records
The problem is that I don't know which column is corrupt. But I found a solution: by simply copying the record into another variable, the values are parsed and the TOAST errors are thrown.

In case anyone's interested, here's my code, based on an example from http://www.databasesoup.com/2013/10/de-corrupting-toast-tables.html

DO $f$
DECLARE
    rContent1 record;
    rContent2 record;
    iCounter integer DEFAULT 1;
    iValue integer;
    pTableName varchar := 'f_gsxws_transaction';
    pFieldName varchar := 'gwta_number';
BEGIN
    FOR iValue IN EXECUTE 'SELECT ' || pFieldName || ' FROM ' || pTableName::regclass || ' ORDER BY ' || pFieldName LOOP
        BEGIN
            EXECUTE 'SELECT * FROM ' || pTableName::regclass || ' WHERE ' || pFieldName || ' = $1' INTO rContent1 USING iValue;
            rContent2 := rContent1;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'data for %.% % is corrupt', pTableName, pFieldName, iValue;
        END;
        IF iCounter % 10 = 0 THEN
            RAISE NOTICE '% % records checked', iCounter, pTableName;
        END IF;
        iCounter := iCounter + 1;
    END LOOP;
END;
$f$;

Cheers,

Nick

On 14 Feb 2020, at 16:14, Tom Lane wrote:

> "Nick Renders" writes:
>> I thought I would write a little PL script that would loop through all the data and report any inconsistencies. However, I can't get it to work properly. ...
>> 1) The function has no problem executing the SELECT statement. It is only when "rcontent" is returned that the function fails. This is a problem, because the ultimate goal is to loop through all records and only return/alert something in case of an error.
>> 2) The function never enters the EXCEPTION clause. Instead, when it hits the RETURN command, it breaks and shows the same message as in pgAdmin: missing chunk number 0 for toast value 8289525 in pg_toast_5572299.
>
> I think what's happening there is that the function doesn't try to dereference the value's TOAST pointer during SELECT INTO. It just stores that pointer into a variable, and only sometime later, when the actual content of the value is demanded, do you see the error raised.
>
> The solution to that is to do something that uses the contents of the busted column right away while still inside the EXCEPTION block, perhaps along the lines of "select md5(mycolumn) into local_variable from ..."
>
> A close reading of https://www.postgresql.org/docs/current/storage-toast.html would probably help you understand what's happening here.
>
> regards, tom lane
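If you do want to know which column is broken, Tom's md5() idea can be turned into a per-column loop. This is a rough sketch: only the table name, column name and key value come from the thread; the per-column loop itself is an illustration, not code posted there.

    DO $g$
    DECLARE
        cColumn text;
        sDummy  text;
    BEGIN
        FOR cColumn IN
            SELECT column_name FROM information_schema.columns
            WHERE table_name = 'f_gsxws_transaction'
        LOOP
            BEGIN
                -- md5() forces the value to be detoasted right here,
                -- so a broken TOAST pointer raises its error inside this block
                EXECUTE 'SELECT md5(' || quote_ident(cColumn) || '::text) FROM f_gsxws_transaction WHERE gwta_number = $1'
                    INTO sDummy USING 762513;
            EXCEPTION WHEN OTHERS THEN
                RAISE NOTICE 'column % of record 762513 is corrupt', cColumn;
            END;
        END LOOP;
    END;
    $g$;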
Re: PL/pgSQL question about EXCEPTION clause & corrupt records
Hi Jeremy,

This happened on PostgreSQL v9.6, which crashed 2 weeks ago. Since then we have upgraded and restored our server, but my example is from the older, corrupt database.

Nick

On 15 Feb 2020, at 5:30, Jeremy Schneider wrote:

> On Feb 14, 2020, at 04:39, Nick Renders wrote:
>> I get the following message:
>> ERROR: missing chunk number 0 for toast value 8289525 in pg_toast_5572299
>
> What version of PostgreSQL are you running? I've seen this a number of times the past couple years; curious if the lurking bug is still observed in latest versions.
>
> -Jeremy
>
> Sent from my TI-83
Re: Postgres 12.1 : UPPER() in WHERE clause restarts server
We have set up a new test environment running PostgreSQL v12.2 on macOS 10.14 and the issue is still there.

One thing I noticed is that the returned columns do not affect the behaviour:

SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'

and

SELECT gwsc_sequence FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'

both kill the postgres service.

I will try to free some time next week to install the Apple developer tools and further analyse the problem.

Best regards,

Nick

On 11 Feb 2020, at 12:32, Nick Renders wrote:

> Hi Thomas,
>
> We are setting up a new test environment with 12.1. Once it is running, I'll try out those commands and get back with the results.
>
> Thanks,
>
> Nick Renders
>
> On 11 Feb 2020, at 2:51, Thomas Munro wrote:
>
>> On Mon, Feb 10, 2020 at 4:35 AM Marc wrote:
>>> We will keep the 12.1 in place so that we can run additional tests to assist to pin-point the issue. Feel free to ask but allow us to recover from these hectic days ;-)
>>
>> Here's how to get a stack so we can see what it was doing, assuming you have the Apple developer tools installed:
>>
>> 1. Find the PID of the backend you're connected to with SELECT pg_backend_pid().
>> 2. "lldb -p PID" from a shell to attach to the process, then "cont" to let it continue running.
>> 3. Run the query in that backend and wait for the SIGKILL.
>> 4. In the lldb session, type "bt".
>>
>> It'll only make sense if your PostgreSQL build has debug symbols, but let's see.
Re: Postgres 12.1 : UPPER() in WHERE clause restarts server
Hi Tom,

1. We used the EDB installer.
2. Turning JIT off did make the problem go away. So I guess this was causing the Postgres process to crash all along.

Thanks for the help,

Nick

On 24 Feb 2020, at 16:24, Tom Lane wrote:

> "Nick Renders" writes:
>> We have set up a new test environment running PostgreSQL v12.2 on macOS 10.14 and the issue is still there.
>
> Some nearby threads prompt these two questions:
>
> 1. Are you using your own build, or is this from EDB's installer?
>
> 2. If the latter, does turning JIT off ("set jit = off") make the problem go away?
>
> There is as yet no "native" support for --with-llvm on macOS, i.e. Apple themselves don't provide sufficient support for that. EDB seem to have hacked up something that sort of works, but only sort of.
>
> regards, tom lane
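For anyone else who lands here: jit can be switched off per session or for the whole cluster; a small sketch using standard settings, nothing specific to this thread:

    -- per session, for a quick test
    SET jit = off;

    -- cluster-wide, without editing postgresql.conf by hand
    ALTER SYSTEM SET jit = off;
    SELECT pg_reload_conf();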
Postgres on macOS 10
Hi,

We just noticed something strange with our Postgres server. We have Postgres 11 and 12 running on macOS 10.14 (Mojave), installed with the EDB installer. Whenever the machine is restarted, the Postgres service cannot be launched until a macOS user logs in.

We have "automatic login" turned off in the system preferences, meaning that when the machine reboots, you get the macOS login screen. In the system.log we see the following 2 lines over and over again, until a user has logged in:

Mar 3 09:37:19 postgrestest com.apple.xpc.launchd[1] (com.edb.launchd.postgresql-12[319]): Service exited with abnormal code: 2
Mar 3 09:37:19 postgrestest com.apple.xpc.launchd[1] (com.edb.launchd.postgresql-12): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.

It doesn't matter which macOS user logs in; it doesn't have to be an Administrator. But once a user has logged in, the Postgres service is finally launched properly. Afterwards, the macOS user can log out again, and the service will continue running.

It doesn't seem to be limited to Mojave, either. I did a quick test on our older, decommissioned Postgres server (Postgres 9 on macOS 10.12) and the same issue seems to occur there.

Has anyone noticed something similar with macOS? Or is it just our setup?

Best regards,

Nick Renders
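No resolution was posted for this message, but two things worth inspecting in this situation (a sketch; the launchd label is taken from the log above, everything else is a general guess rather than advice from the thread):

    # ask launchd for its view of the job, including its last exit status
    sudo launchctl print system/com.edb.launchd.postgresql-12

    # if the data directory lives on a separate volume, check whether that
    # volume is already mounted before anyone has logged in
    mount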
archive_command debugging
Hello,

I was wondering if anyone had any good tips for "debugging" the archive_command in postgresql.conf. I want to see what certain settings are when the command is run, like $PATH and id, but I don't want it to be successful either, so that the WAL files remain untouched.

Any tips?

Thanks,

Nick Renders
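One low-tech approach (a sketch, not something from the thread): point archive_command at a command that records its environment and then deliberately fails, so PostgreSQL keeps the WAL segment and simply retries later. The log path is just an example, and archive_mode must be on for the command to run at all.

    # postgresql.conf
    archive_command = 'env > /tmp/archive_debug.log 2>&1; id >> /tmp/archive_debug.log; echo "p=%p f=%f" >> /tmp/archive_debug.log; false'

    -- then, from psql: reload and force a segment switch to trigger it
    SELECT pg_reload_conf();
    SELECT pg_switch_wal();   -- pg_switch_xlog() on pre-10 servers

Because the command exits non-zero, the segment is never marked as archived and nothing is removed.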
could not open file "global/pg_filenode.map": Operation not permitted
Hello,

We have a Postgres server that intermittently logs the following:

2024-02-26 10:29:41.580 CET [63962] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:11.147 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:30:11.149 CET [63975] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:35.941 CET [63986] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:41.546 CET [63991] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:44.398 CET [63994] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:31:11.149 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:31:11.151 CET [64008] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:31:41.546 CET [64023] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:32:11.150 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:32:11.153 CET [64035] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:32:41.547 CET [64050] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:33:11.151 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
2024-02-26 10:33:11.153 CET [64062] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:33:41.548 CET [64087] FATAL: could not open file "global/pg_filenode.map": Operation not permitted

This has happened 3 times over the last 2 weeks now, without any indication of what caused it. The privileges of those 2 files are all in order. When this happens, the server is no longer accessible, and we need to restart the service (pg_ctl restart). Once restarted, Postgres runs fine again for a couple of days.

We are running PostgreSQL 16.2 on macOS 14.3.1.

Any idea what might be causing this issue, or how to resolve it?

Best regards,

Nick Renders
Re: could not open file "global/pg_filenode.map": Operation not permitted
Thank you for your reply, Laurenz.

I don't think it is related to any third-party security software. We have several other machines with a similar setup, but this is the only server that has this issue.

The one thing different about this machine, however, is that it runs 2 instances of Postgres:
- cluster A on port 165
- cluster B on port 164

Cluster A is actually a backup from another Postgres server that is restored on a daily basis via Barman. This means that we log in remotely from the Barman server over SSH, stop cluster A's service (port 165), clear the Data folder, restore the latest backup into the Data folder, and start up the service again. Cluster B's Data and service (port 164) remain untouched during all this time. This is the cluster that experiences the intermittent "operation not permitted" issue.

Over the past 2 weeks, I have suspended our restore script and the issue did not occur. I have just performed another restore on cluster A and now cluster B is throwing errors in the log again.

Any idea why this is happening? It does not occur with every restore, but it seems to be related anyway.

Thanks,

Nick Renders

On 26 Feb 2024, at 16:29, Laurenz Albe wrote:

> On Mon, 2024-02-26 at 15:14 +0100, Nick Renders wrote:
>> We have a Postgres server that intermittently logs the following:
>>
>> 2024-02-26 10:29:41.580 CET [63962] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
>> 2024-02-26 10:30:11.147 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
>>
>> This has happened 3 times over the last 2 weeks now, without any indication of what caused it.
>> The privileges of those 2 files are all in order.
>> When this happens, the server is no longer accessible, and we need to restart the service (pg_ctl restart).
>> Once restarted, Postgres runs fine again for a couple of days.
>>
>> We are running PostgreSQL 16.2 on macOS 14.3.1.
>
> Perhaps that is some kind of virus checker or something else that locks files.
>
> Yours,
> Laurenz Albe
Re: could not open file "global/pg_filenode.map": Operation not permitted
On 11 Mar 2024, at 16:04, Adrian Klaver wrote:

> On 3/11/24 03:11, Nick Renders wrote:
>> Thank you for your reply, Laurenz.
>> I don't think it is related to any third-party security software. We have several other machines with a similar setup, but this is the only server that has this issue.
>>
>> The one thing different about this machine, however, is that it runs 2 instances of Postgres:
>> - cluster A on port 165
>> - cluster B on port 164
>> Cluster A is actually a backup from another Postgres server that is restored on a daily basis via Barman. This means that we log in remotely from the Barman server over SSH, stop cluster A's service (port 165), clear the Data folder, restore the latest backup into the Data folder, and start up the service again.
>> Cluster B's Data and service (port 164) remain untouched during all this time. This is the cluster that experiences the intermittent "operation not permitted" issue.
>>
>> Over the past 2 weeks, I have suspended our restore script and the issue did not occur.
>> I have just performed another restore on cluster A and now cluster B is throwing errors in the log again.
>
> Since it seems to be the trigger, what are the contents of the restore script?
>
>> Any idea why this is happening? It does not occur with every restore, but it seems to be related anyway.
>>
>> Thanks,
>>
>> Nick Renders
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com

> ...how are A and B connected?

The 2 clusters are not connected. They run on the same macOS 14 machine with a single Postgres installation ( /Library/PostgreSQL/16/ ) and their respective Data folders are located on the same volume ( /Volumes/Postgres_Data/PostgreSQL/16/data and /Volumes/Postgres_Data/PostgreSQL/16-DML/data ). Besides that, they run independently on 2 different ports, specified in postgresql.conf.

> ...run them under different users on the system.

Are you referring to the "postgres" user / role? Does that also mean setting up 2 postgres installation directories?

> ...what are the contents of the restore script?

## stop cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data stop'

## save config files (ARC_postgresql_16.conf is included in postgresql.conf and contains cluster-specific information like the port number)
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp ARC_postgresql_16.conf ../ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp pg_hba.conf ../pg_hba.conf'

## clear data directory
ssh postgres@10.0.0.1 'rm -r /Volumes/Postgres_Data/PostgreSQL/16/data/*'

## transfer recovery (this will copy the backup "20240312T040106" and any lingering WAL files into the Data folder)
barman recover --remote-ssh-command 'ssh postgres@10.0.0.1' pg 20240312T040106 /Volumes/Postgres_Data/PostgreSQL/16/data

## restore config files
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cd .. && mv ARC_postgresql_16.conf /Volumes/Postgres_Data/PostgreSQL/16/data/ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cd .. && mv pg_hba.conf /Volumes/Postgres_Data/PostgreSQL/16/data/pg_hba.conf'

## start cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data start > /dev/null'

This script runs on a daily basis at 4:30 AM. It did so this morning and there was no issue with cluster B. So even though the issue is most likely related to the script, it does not cause it every time.

Best regards,

Nick Renders
Re: could not open file "global/pg_filenode.map": Operation not permitted
On 13 Mar 2024, at 12:35, Stephen Frost wrote:

> Greetings,
>
> * Nick Renders (postg...@arcict.com) wrote:
>>> ...run them under different users on the system.
>>
>> Are you referring to the "postgres" user / role? Does that also mean setting up 2 postgres installation directories?
>
> Yes, two separate macOS user accounts is what I was suggesting. You could use the same postgres binaries though, no need to have two installations of them. You'd need separate data directories, of course, as you have currently.
>
>> This script runs on a daily basis at 4:30 AM. It did so this morning and there was no issue with cluster B. So even though the issue is most likely related to the script, it does not cause it every time.
>
> Seems likely that it's some sort of race condition.
>
> Thanks,
>
> Stephen

We now have a second machine with this issue: it is an Intel Mac mini running macOS Sonoma (14.4) and PostgreSQL 16.2. This one only has a single Data directory, so there are no multiple instances running.

I installed Postgres yesterday and restored a copy from our live database in the Data directory. The Postgres process started up without problems, but after 40 minutes it started throwing the same errors in the log:

2024-03-21 11:49:27.410 CET [1655] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:49:46.955 CET [1760] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-03-21 11:50:07.398 CET [965] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway

I stopped and started the process, and it continued working again until around 21:20, when the issue popped up again. I wasn't doing anything on the machine at that time, so I have no idea what might have triggered it.

Is there perhaps some feature that I can enable that logs which processes use these 2 files?

Thanks,

Nick Renders
Re: could not open file "global/pg_filenode.map": Operation not permitted
On 22 Mar 2024, at 17:00, Alban Hertroys wrote:

> On Fri, 22 Mar 2024 at 15:01, Nick Renders wrote:
>> We now have a second machine with this issue: it is an Intel Mac mini running macOS Sonoma (14.4) and PostgreSQL 16.2. This one only has a single Data directory, so there are no multiple instances running.
>
> I don't think that having a single Data directory prevents multiple instances from running. That's more of a matter of how often pg_ctl was called with the start command for that particular data directory.
>
>> I installed Postgres yesterday and restored a copy from our live database in the Data directory.
>
> How did you restore that copy? Was that a file-based copy perhaps? Your files may have incorrect owners or permissions in that case.
>
>> The Postgres process started up without problems, but after 40 minutes it started throwing the same errors in the log:
>>
>> 2024-03-21 11:49:27.410 CET [1655] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
>> 2024-03-21 11:49:46.955 CET [1760] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
>> 2024-03-21 11:50:07.398 CET [965] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
>
> It's possible that some other process put a lock on these files. Spotlight perhaps? Or Time Machine?
>
>> I stopped and started the process, and it continued working again until around 21:20, when the issue popped up again. I wasn't doing anything on the machine at that time, so I have no idea what might have triggered it.
>>
>> Is there perhaps some feature that I can enable that logs which processes use these 2 files?
>
> IIRC, macOS comes shipped with the lsof command, which will tell you which processes have a given file open. See man lsof.
>
> --
> If you can't see the forest for the trees,
> Cut the trees and you'll see there is no forest.

I have tried the lsof command, but it returns no info about the postmaster.pid and global/pg_filenode.map files, so I take it they are not open at that moment.

Spotlight indexing has been disabled, and Time Machine takes no snapshots of the volume where the data resides.

Looking at the 2 machines that are having this issue (and the others that don't), I think it is somehow related to the following setup:
- macOS Sonoma (14.4 and 14.4.1)
- data directory on an external drive

That external drive (a Promise RAID system in one case, a simple SSD in the other) has the option "ignore ownership" on by default. I have tried disabling that, and updating the data directory to have owner + read/write access for the postgres user. It seemed to work at first, but just now the issue re-appeared again.

Any other suggestions?

Thanks,

Nick Renders
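Since lsof only shows files that are open at the moment you run it, a live trace has a better chance of catching the culprit. A sketch using the stock macOS fs_usage tool (run as root; the grep pattern simply matches the two file names from the log):

    sudo fs_usage -w -f filesys | grep -E 'pg_filenode\.map|postmaster\.pid'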
Re: could not open file "global/pg_filenode.map": Operation not permitted
On 29 Mar 2024, at 4:25, Thomas Munro wrote:

> I don't have any specific ideas and I have no idea what "ignore ownership" means ... what kind of filesystem is running on it? For the simple SSD, is it directly connected, running a normal Apple APFS filesystem, or something more complicated?
>
> I wonder if this could be related to the change in 16 which started to rename that file:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d8cd0c6c95c0120168df93aae095df4e0682a08a
>
> Did you ever run 15 or earlier on that system?

In the macOS Finder, when you show the Info (command+i) for an external drive (or any partition that is not the boot drive), there is a checkbox "Ignore ownership on this volume" in the Permissions section. I think it is on by default for external drives.

The external SSD is an Orico drive that is connected with USB-C. It is initialised as a GUID Partition Map with a single APFS partition.

We have run PostgreSQL 15 and earlier, before upgrading to 16 when it came out last year. We didn't have any problems with 16 until recently, after upgrading to Sonoma.

Nick
pg_upgrade - fe_sendauth: no password supplied
Hello,

I have been trying to use the pg_upgrade command to upgrade a PostgreSQL 11 environment to 13 on macOS 11. I have followed the steps in the documentation, but the command always fails when trying to connect to the original database.

This is the command that is sent:

/Library/PostgreSQL/13/bin/pg_upgrade -b /Library/PostgreSQL/11/bin -B /Library/PostgreSQL/13/bin -d /Volumes/Postgres_Data/PostgreSQL/11/data -D /Volumes/Postgres_Data/PostgreSQL/13/data -p 49156 -P 49155 -U postgres -j 24 -v

And this is what is logged:

connection to database failed: fe_sendauth: no password supplied

could not connect to source postmaster started with the command:
"/Library/PostgreSQL/11/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/Volumes/Postgres_Data/PostgreSQL/11/data" -o "-p 49156 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/Volumes/Free/Upgrade'" start

"/Library/PostgreSQL/11/bin/pg_ctl" -w -D "/Volumes/Postgres_Data/PostgreSQL/11/data" -o "" -m fast stop >> "pg_upgrade_server.log" 2>&1

According to the documentation, the connection should be established with the data in the .pgpass file. Its contents look like this (the password has been changed):

localhost:49155:*:postgres:password1234
localhost:49156:*:postgres:password1234

The .pgpass file works without problems with the pg_dump and pg_restore commands, so I'm fairly certain its contents and privileges are set correctly.

The PostgreSQL documentation also mentions that you can update the pg_hba.conf file to use authentication method "peer". This has no effect either; however, when I set it to "trust", the command goes through just fine.

So I have been able to do the upgrade, but I am still wondering why I got the error in the first place. Any idea why the .pgpass file isn't working with the pg_upgrade command?

Best regards,

Nick Renders
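For reference, the pg_hba.conf entries being discussed would look roughly like the lines below. This is a sketch only: whether to scope them to all users or just postgres, and in which of the two clusters, depends on your setup, and "trust" should only be a temporary measure for the duration of the upgrade.

    # local (Unix-socket) connections, as used by pg_upgrade's temporary servers
    local   all   postgres   peer
    # or, as a blunt temporary fallback:
    local   all   postgres   trust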
PostgreSQL 14: pg_dump / pg_restore error: could not write to the communication channel: Broken pipe
Hello,

I have been trying to import a Postgres 11 database into Postgres 14, but the pg_restore command exits with the following message:

pg_restore: error: could not write to the communication channel: Broken pipe

The command I sent looks like this:

/Library/PostgreSQL/14/bin/pg_restore -h localhost -p 48100 -U postgres -w -C -d template1 -j 24 -Fd /Volumes/Migration/dbname --verbose

It seems that the multiple jobs parameter is the cause. If I specify "-j 1", the command works without problems. If I specify "-j 2" or higher, I get the above error after a few seconds.

Postgres is running on a Mac Pro 12-core machine, so it has plenty of resources at its disposal. The config file is a copy of the Postgres 11 configuration, which has no problem with multiple jobs.

Furthermore, the pg_dump command seems to have the same issue as well. The following command:

/Library/PostgreSQL/14/bin/pg_dump -h localhost -p 48100 -U postgres -w ServicePGR_UTF8 -j 24 -Fd -f /Volumes/Migration/dbname --verbose

will stop prematurely with the following error:

pg_dump: error: could not write to the communication channel: Broken pipe

Does this sound familiar to anyone? Is it an issue with the new Postgres 14 release, or is there something else that might be causing this?

Best regards,

Nick Renders
Re: PostgreSQL 14: pg_dump / pg_restore error: could not write to the communication channel: Broken pipe
Thank you for all the feedback and suggestions.

It seems that the "-h localhost" parameter is triggering the issue. If I leave it out, pg_restore works without problems with multiple jobs. I have also tried specifying the IP number instead of "localhost", but that results in the same error.

I see now that our original pg_restore script does not include the -h parameter. Somehow, it has snuck into my commands when testing Postgres 14. That might mean that the same issue exists in previous versions as well. I will investigate further.

Nick

On 15 Oct 2021, at 19:08, Tom Lane wrote:

> Alvaro Herrera writes:
>> Yeah, pg_dump in parallel mode uses a pipe to communicate between leader and workers; the error you see is what happens when a write to the pipe fails. It sounds to me like something in the operating system is preventing the pipes from working properly.
>
> BTW, I think a more likely explanation is "one of the pg_dump or pg_restore worker processes crashed". Why that should be is still a mystery though.
>
> regards, tom lane
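For anyone hitting the same error, the working variant simply drops "-h localhost" so libpq connects over the local Unix socket instead of TCP. The restore command below is the one the thread confirms works; the dump is the presumed equivalent, not explicitly re-tested in the thread:

    /Library/PostgreSQL/14/bin/pg_restore -p 48100 -U postgres -w -C -d template1 -j 24 -Fd /Volumes/Migration/dbname --verbose

    /Library/PostgreSQL/14/bin/pg_dump -p 48100 -U postgres -w ServicePGR_UTF8 -j 24 -Fd -f /Volumes/Migration/dbname --verbose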