Re: postgres on physical replica crashes

2018-04-20 Thread Hannes Erven

Hi Greig,


just last week I experienced the same situation as you on a 10.3 
physical replica (it even has checksums activated), and a few months ago 
on 9.6.
We used the same resolution as you did, and so far we haven't noticed 
any problems with data integrity on the replicas.




The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL:  terminating walreceiver due to timeout
2018-04-13 06:31:16.948 CEST [15213] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG:  startup process (PID 15213) exited with exit code 1
2018-04-13 06:31:16.959 CEST [15212] LOG:  terminating any other active server processes
2018-04-13 06:31:16.959 CEST [19838] user@db WARNING:  terminating connection because of crash of another server process
2018-04-13 06:31:16.959 CEST [19838] user@db DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-04-13 06:31:16.959 CEST [19838] user@db HINT:  In a moment you should be able to reconnect to the database and repeat your command.



This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG:  entering standby mode
2018-04-13 09:25:15.947 CEST [1957] LOG:  redo starts at 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG:  consistent recovery state reached at 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG:  database system is ready to accept read only connections
2018-04-13 09:25:15.953 CEST [1957] FATAL:  invalid memory alloc request size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG:  startup process (PID 1957) exited with exit code 1



... until the WAL files from the hot standby's pg_wal were manually 
removed and re-downloaded from the primary.


Unfortunately I did not collect hard evidence, but I think I saw that the 
primary's replication slot's restart point (restart_lsn) was set to a 
position /after/ the standby's actual restart location. This time, the 
error was noticed
immediately and the required WAL was still present on the master.
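
For future reference, the comparison I mean can be made like this on 
10.x (a sketch; 9.6 uses the older pg_xlog-based function names):

-- on the primary: where the slot says the standby has to restart from
SELECT slot_name, restart_lsn FROM pg_replication_slots;

-- on the standby: what has actually been replayed
SELECT pg_last_wal_replay_lsn();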



A few months ago I experienced the same situation on a 9.6 cluster, but 
that was not noticed for a long time, and - despite using a replication 
slot! - the primary had already removed required segments. Fortunately I 
could get them from a tape backup...




Best regards,

-hannes




On 2018-04-18 18:16, greigwise wrote:

Hello.  I've had several instances where postgres on my physical replica
under version 9.6.6 is crashing with messages like the following in the
logs:

2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING:  terminating
connection because of crash of another server process
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL:  The postmaster
has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT:  In a moment you
should be able to reconnect to the database and repeat your command.
2018-04-18 05:43:27 UTC  5acf5e12.6819  LOG:  database system is shut down

When this happens, what I've found is that I can go into the pg_xlog
directory on the replica and remove all the log files; postgres will then
restart and things seem to come back up normally.

So, the question is what's going on here... is the log maybe getting corrupted
in transmission somehow?  Should I be concerned about the viability of my
replica after having restarted it in the described fashion?

Thanks,
Greig Wise









recovery_target_time and WAL fetch with streaming replication

2018-05-12 Thread Hannes Erven

Hi,


what is PostgreSQL's strategy for deciding when to fetch WAL from the 
master while in streaming replication, and can it be tweaked?


I'm using a physical streaming replication slave to keep a database 
lagging about one month behind the primary, by setting 
"recovery_target_time" to the desired point in time.
This setting is periodically advanced by a cronjob to let the replica 
roll forward. It's a 10.3-1 install on Debian.
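
For reference, the replica's recovery.conf looks roughly like this (a 
sketch; conninfo, slot name and timestamp are placeholders):

standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
primary_slot_name = 'lagging_replica'
recovery_target_time = '2018-04-12 00:00:00'   # advanced by the cronjob
recovery_target_action = 'pause'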



It seems that as soon as the slave needs new WAL data to reach the 
requested target time, it will connect to the primary and fetch /all/ 
new WAL the primary has available for the slave's slot - up to "now", 
ignoring the recovery_target_time.


So in my setup, the slave will connect to the primary once per month, 
and download the whole next month's WAL data at once.


I do not care on which instance WAL is kept until needed, but I'd really 
prefer if the transfer (and the disk space requirement!) were more 
evenly distributed.



One option of course would be to use some transfer mechanism external to 
PostgreSQL... but so far I'm thinking there must be an easier way?




Thanks & best regards,

-hannes



Re: recovery_target_time and WAL fetch with streaming replication

2018-05-13 Thread Hannes Erven

Michael,


On 2018-05-13 08:23, Michael Paquier wrote:

On Sun, May 13, 2018 at 01:39:48AM +0200, Hannes Erven wrote:

what is Postgresql's strategy when to fetch WAL from the master while in
streaming replication, and could it be tweaked?


Fetching WAL from a primary (or another standby) cannot be controlled 
directly per se.  By definition, any WAL present will be automatically 
fetched and synced on the standby.


I also thought so, but this is not the case (and that's my whole point).

When the standby finds sufficient local WAL with respect to its 
recovery_target_time, it seems it does not even attempt to connect to 
the primary.


But when new WAL is needed, the standby will fetch /all/ WAL present on 
the master.
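
This can be observed on the standby: while no new WAL is needed, 
pg_stat_wal_receiver has no rows at all. For example (on 10.x):

-- on the standby: walreceiver status vs. replay position
SELECT status, received_lsn FROM pg_stat_wal_receiver;  -- empty while not connected
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();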



I'd say, the standby should either:
- always connect to the primary and fetch any WAL present
- stop fetching/streaming WAL when it is not needed for the current 
recovery_target


So ideally there would be an option to ask for the desired behaviour?




For example, what people usually want to be protected from is an
unfortunate DROP TABLE on the primary run by an administrator to be
immediately replayed on the standby, losing the data.  Hence delaying
when WAL is replayed can offer some protection, and this can be achieved
by setting recovery_min_apply_delay in recovery.conf.  This will cause
WAL records replaying transaction commits to wait for the amount of
time specified by this parameter, giving you the time to recover from
any failures with a standby which has a controlled synced delta.


recovery_min_apply_delay is an integer number of milliseconds, so the 
maximum value (2^31-1 ms = 2147483647 ms) is approx. 24.8 days - which 
is not enough for my requirements.


Also, transaction/MVCC visibility will not cover all cases; most 
importantly, it cannot protect against TRUNCATE 
(https://wiki.postgresql.org/wiki/MVCC_violations).





One option of course would be to use some transfer mechanism external to
PostgreSQL... but so far I'm thinking there must be an easier way?


Another option I can think of here is to use a specific restore_command
instead of streaming replication.  Simply archive a WAL segment on the
primary with some meta-data like the time it was archived, and then
allow the standby to recover the segment only after a delta has passed.


Yes, but that's far less simple than just setting recovery_target_time.


Thanks again and best regards,

-hannes




Re: postgres large database backup

2022-11-30 Thread Hannes Erven

On 2022-11-30 20:01, Mladen Gogala wrote:

On 11/30/22 10:40, Atul Kumar wrote:

Hi,

I have a 10TB database on Postgres 11, running on CentOS 7 "on 
premises". I need to schedule the backup of this database in a faster 
way.


The scheduled backup will be used for PITR purposes.

Well, that depends on your configuration details. Is your DB located 
on a SAN device or JBOD?




You could also use a filesystem that can do atomic snapshots - like ZFS.

However, I'm wondering why none of the previous respondents mentioned 
it. Sure, ZFS may have its own performance implications... but my 
experience is that it makes "safety snapshots" and "on-demand throw-away 
testing environments" very convenient to use.



Best regards,

    -hannes






Force Reconnect of streaming replication

2018-11-28 Thread Hannes Erven
Hi,


consider a PG10 master server, streaming via internet to another site. The 
receiving site has a fast primary uplink and a much slower backup link.

When the primary link goes down, all traffic is routed through the backup 
connection. The streaming replication connection drops and is automatically 
re-established over the backup link.

But when the primary link comes up again, the replication will still use the 
already established connection via the slow backup link.


If I kill the wal_sender process on the master, the slave will immediately 
reconnect over the "best" available link.
But is this really the best way to force PG to re-connect the replication?

I guess I could just as well kill the WAL-receiving process on the slave, but 
still I'm hesitant to put something like "killall pg_walsender" in a cronjob...
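
At least the kill can be done from SQL rather than from a shell (a 
sketch; run on the master):

-- terminates all walsenders; the standby reconnects immediately
SELECT pg_terminate_backend(pid) FROM pg_stat_replication;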


Is there a better way?


Thanks for your opinions,
best regards

-hannes



Re: Does creating readOnly connections, when possible, free up resources in Postgres?

2019-01-27 Thread Hannes Erven

Hi David,


> I saw that when you create a JDBC connection, you can specify
> readOnly=true. Would doing so somehow help Postgres manage its other
> connections?


if you know that a certain connection will be read-only, you could use 
a more aggressive pooling strategy.


Usually, a connection pool will return a server connection to the pool 
when the client closes its connection.
pgbouncer for example offers a "statement" pooling strategy, meaning 
that a server connection is returned to the pool after every completed 
SQL statement.
That way, a single PostgreSQL server process can easily serve many 
client connections.
If a specific use case fits into a single statement, you can even use 
such a pool for writes!


Of course, you can't use server-side prepared statements, temporary 
tables, ... or anything else that depends on subsequent statements 
hitting the same server process.
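
For reference, the relevant pgbouncer settings look roughly like this 
(a sketch; the numbers are illustrative):

[pgbouncer]
pool_mode = statement       ; return server connection after each statement
max_client_conn = 200       ; many client connections ...
default_pool_size = 5       ; ... multiplexed onto few server connections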



For an application I'm running, we have dozens of GUI clients keeping a 
JDBC connection open all day checking for updates, and with statement 
pooling this is handled well by 1-3 server processes (auto-scaled by 
pgbouncer).



Best regards,

-hannes




Unused files in the database directory after crashed VACUUM FULL

2019-02-10 Thread Hannes Erven

Hi,


I've just had a "VACUUM FULL <table>" crash due to 100% disk usage.
Clearly my fault, I was expecting the new table to be small enough.

After freeing up space, restarting the cluster and issuing another 
VACUUM FULL, I noticed that the cluster was way bigger than it should be.
In the base/<database-oid>/ folder, there was a large number of files 
with one certain number that pg_filenode_relation() could not turn into 
a relation. As that number was just a bit smaller than the 
pg_relation_filepath() of the table I was working on, I guess these were 
the remains of the failed VACUUM FULL operation?
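
(The check I mean is along these lines; the table name and filenode are 
illustrative:)

-- map a relation to its on-disk file ...
SELECT pg_relation_filepath('some_table');   -- e.g. base/16400/612345
-- ... and a file's numeric name back to a relation; NULL if nothing owns it
SELECT pg_filenode_relation(0, 612345);      -- 0 = the default tablespace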

I removed those files, and a VACUUM ANALYZE of the whole database went fine.

So... is this the expected behaviour? Is there a safe procedure for 
checking/cleaning up "unnecessary" files in the cluster directory?



Version:
('PostgreSQL 10.3 (Debian 10.3-1.pgdg90+1) on x86_64-pc-linux-gnu, 
compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit')


Table definition:
(
id serial PRIMARY KEY,
data bytea,
parent_id integer NOT NULL
)
The "data" column values are large enough to be TOASTed.


Thanks & best regards,

-hannes




Re: Unused files in the database directory after crashed VACUUM FULL

2019-02-10 Thread Hannes Erven

Hi again,


On 2019-02-10 16:41, Tom Lane wrote:

Hannes Erven  writes:

I've just had a "VACUUM FULL " crash due to 100% disk usage.
Clearly my fault, I was expecting the new table to be small enough.


What do you mean by "crash" exactly?  A normal transactional failure
should've cleaned up orphaned files.  I suppose if the kernel decided to
kill -9 the vacuum process, that wouldn't happen --- but that's not
the usual response to out-of-disk-space.



Here's the exact log (the pgadmin process was running the VACUUM FULL):

2019-02-09 23:44:53.516 CET [20341] @/ <> PANIC:  could not write to file "pg_wal/xlogtemp.20341": No space left on device
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  WAL writer process (PID 20341) was terminated by signal 6: Aborted
2019-02-09 23:44:53.516 CET [19181] @/ <> LOG:  terminating any other active server processes
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db WARNING:  terminating connection because of crash of another server process
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-02-09 23:44:53.516 CET [20816] postgres@IP/db HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-02-09 23:44:53.597 CET [19181] @/ <> LOG:  all server processes terminated; reinitializing
2019-02-09 23:44:53.679 CET [11686] @/ <> LOG:  database system was interrupted; last known up at 2019-02-09 23:44:40 CET
2019-02-09 23:44:58.761 CET [11686] @/ <> LOG:  database system was not properly shut down; automatic recovery in progress
2019-02-09 23:44:58.807 CET [11686] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-09 23:44:58.811 CET [11686] @/ <> FATAL:  could not extend file "base/16400/612716_fsm": No space left on device
2019-02-09 23:44:58.811 CET [11686] @/ <> HINT:  Check free disk space.
2019-02-09 23:44:58.811 CET [11686] @/ <> CONTEXT:  WAL redo at 1DA/B3406310 for Heap/INSERT: off 4



After freeing space and restarting:

2019-02-10 06:28:25.622 CET [14344] @/ <> LOG:  database system was 
interrupted while in recovery at 2019-02-09 23:44:58 CET
2019-02-10 06:28:25.622 CET [14344] @/ <> HINT:  This probably means 
that some data is corrupted and you will have to use the last backup for 
recovery.
2019-02-10 06:28:26.762 CET [14344] @/ <> LOG:  database system was not 
properly shut down; automatic recovery in progress

2019-02-10 06:28:26.831 CET [14344] @/ <> LOG:  redo starts at 1DA/B3400798
2019-02-10 06:28:27.624 CET [14344] @/ <> LOG:  redo done at 1DA/D5FFF9D8



Is there a safe procedure for
checking/cleaning up "unnecessary" files in the cluster directory?


You just described it --- verify with pg_filenode_relation that the
file doesn't correspond to any pg_class entry, then manually delete.



OK, I see. I came up with this:

with d0 as (
select oid from pg_database where datname=current_database()
), d1 as (
select pg_ls_dir as fn,
  regexp_match(pg_ls_dir, '^([0-9]+)(.*)$') as match
from d0, pg_ls_dir('base/'||d0.oid)
order by 1
),
d2 as (
select fn, match[1] as base, match[2] as ext
from d1
where (fn NOT ILIKE 'pg_%')
),
d3 as (
select d.*,  pg_filenode_relation(0, base::oid) as relation
from d2 d
)

select fn, pg_size_pretty((pg_stat_file('base/'||d0.oid||'/'||fn)).size)
from d0, d3
where relation is null;



     fn     | pg_size_pretty
------------+----------------
 612713     | 25 MB
 612718     | 1024 MB
 612718.1   | 1024 MB
 612718.2   | 180 MB
 t11_621351 | 96 kB
 t11_621354 | 0 bytes
 t11_621356 | 8192 bytes
 t11_621357 | 160 kB
 t11_621360 | 0 bytes
 t11_621362 | 8192 bytes

So there's still 612713 and 612718* left to clean up; judging from the 
numbers and timestamps, they probably belong to other VACUUM FULLs I 
manually canceled in the pgadmin UI during my experiments.


Is this interesting enough to investigate further? If so, I'll be happy 
to work on a test case.
In any case, I've now learned (how) to check for orphaned files whenever 
a database cluster's on-disk size is way off from the sum of 
pg_total_relation_size() + WAL...
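
(The rough check I mean, as a sketch:)

-- compare the database's on-disk size with what its relations account for;
-- a large gap hints at orphaned files
SELECT pg_size_pretty(pg_database_size(current_database())) AS db_on_disk,
       pg_size_pretty(sum(pg_relation_size(oid))::bigint)   AS relations_total
FROM pg_class;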



Thanks again & best regards,

-hannes





Re: Where to store Blobs?

2019-03-13 Thread Hannes Erven

Hi,


On 2019-03-13 15:28, Thomas Güttler wrote:


Where do you store Blobs?



Within PostgreSQL, of course. The system I have in mind stores ZIP and 
PDF files, usually a few MBs each; we're currently at a total of about 
100 GB and there are no evident problems.


For this application, it is extremely important that saving/modifying 
the binaries is part of a larger transaction that completes/fails 
atomically.
Replication is almost real-time, and access to the binaries goes through 
the same libraries (JDBC/Hibernate) as everything else.
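
(A minimal sketch of the pattern; table and values are made up:)

CREATE TABLE document (
    id      serial PRIMARY KEY,
    name    text  NOT NULL,
    content bytea NOT NULL   -- large values are TOASTed (and compressed) automatically
);

BEGIN;
INSERT INTO document (name, content)
VALUES ('hello.txt', convert_to('hello world', 'UTF8'));  -- in the real app, a bytea JDBC parameter
COMMIT;  -- the blob and any other changes in the transaction commit or fail together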



Best regards,

-hannes




RowLock and multiple transactions

2024-02-14 Thread Hannes Erven

Hi,


when "SELECT .. WHERE .. FOR NO KEY UPDATE" is used synchronize access, 
and the transaction holding the lock completes, how does PostgreSQL 
decide /which one/ of multiple waiting transactions will the lock be 
granted to next?


In my testing (PostgreSQL 16.1-1.pgdg20.04+1 on Ubuntu 20.04, 64bit) 
with a real-world application (that acquires multiple locks on a number 
of relations), it seems that it is always the transaction that attempted 
to lock _last_ ... ?
I thought that would most probably be random, or if it was not, the 
order would have been explained in the docs?
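
(A reduced sketch of the pattern I'm testing, with a made-up table:)

-- session A
BEGIN;
SELECT * FROM job WHERE id = 1 FOR NO KEY UPDATE;   -- acquires the row lock

-- session B, then session C: both block here
BEGIN;
SELECT * FROM job WHERE id = 1 FOR NO KEY UPDATE;

-- session A
COMMIT;   -- observed: the lock seems to go to the latest waiter, not the first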



Thank you for any insights...
Best regards

-hannes





Big variance in execution times of simple queries

2022-01-24 Thread Hannes Erven

Hi community,


I'm looking at a "SELECT * FROM pg_stat_statements" output and am 
puzzled by the huge differences between min/max_exec_time even for 
simple queries.


The most extreme example is probably the statement used by the 
application's connection health check:


SELECT 1
min=0.001, mean=0.00386, max=36.812


Other statements with huge variances include:

SET application_name=$1
min=0.002, mean=0.005, max=9.177

SELECT * FROM table WHERE id=$1  (where ID is the primary key column; 
table has 0.5M rows and is frequently vacuum analyzed)

min=0.010, mean=0.260, max=12338.665


According to the system's monitoring, there is no pressure on any 
resource (cpu/mem/io). It's 13.5-2.pgdg20.04+1 on Ubuntu 20.04; the VM 
has 12 CPUs/16GB memory and ceph-based SSD storage (latency ~1.5ms), and 
runs with max_connections=100, with usually 25-40 processes connected.


Is this to be expected?
Is there something I can watch out for or monitor?
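
(For reference, this is roughly how I'm looking at it:)

SELECT query, calls,
       min_exec_time, mean_exec_time, max_exec_time, stddev_exec_time
FROM pg_stat_statements
ORDER BY max_exec_time - mean_exec_time DESC
LIMIT 10;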


Thank you for any insights...
Best regards

    -hannes








Re: array_agg and/or =ANY doesn't appear to be functioning as I expect

2018-01-20 Thread Hannes Erven

Hi Rhys,



On 2018-01-21 02:42, Rhys A.D. Stewart wrote:

Greetings All,
I'm having an issue which is very perplexing. The HAVING clause in a
query doesn't appear to be working as I expect it to. Either that or my
understanding of array_agg() is flawed.

>
> [...]


with listing as (
    select start_vid, end_vid,
           array_agg(node order by path_seq) node_array,
           array_agg(edge order by path_seq) edge_array
    from confounded.dataset
    group by start_vid, end_vid
    having true = ALL (array_agg(truth))
)
select count(*) from confounded.dataset
where node in (select distinct unnest(node_array) from listing)
  and truth = false;

I would expect the above query to return 0 rows.



the answer is in your data: "node" is not a UNIQUE field, and there are 
node values with multiple rows.

e.g. node=977 has one row with truth=true and one with truth=false.

So what your second query really does is "count all dataset rows with 
truth = false whose node value also appears in listing".


Presuming that "seq" is a primary key [although not declared], you 
probably meant to restrict your query on that.
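
Something along these lines will list the offending nodes:

-- nodes that occur with both truth = true and truth = false
SELECT node
FROM confounded.dataset
GROUP BY node
HAVING bool_and(truth) <> bool_or(truth);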



Best regards,

-hannes