Re: Cast INTEGER to BIT confusion

2023-08-15 Thread Greg Sabino Mullane
On Tue, Aug 15, 2023 at 4:49 AM [Quipsy] Markus Karg  wrote:

>
> CREATE TABLE T (c BIT);
>
> INSERT INTO T VALUES (1);
>
> -- I MUST NOT change both lines in any way as these are part of a
> third-party application!
>

That is quite the requirement! I hope that the rest of the application is
more flexible, or you might be fighting an uphill battle for a long time.

However, if the application will drive forward even after an error, you
could pre-create the table "T" the way you want (esp. "c INT"), let the
application's "CREATE TABLE" call fail, and then the subsequent insert(s)
will succeed. Other than that, there is not much that can be easily done,
as the error that is thrown happens pretty early in the process.
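
Roughly, the idea looks like this (a sketch only - I'm assuming an integer is
what you actually want for "c"):

CREATE TABLE t (c INT);
-- The application then runs its own two statements:
CREATE TABLE T (c BIT);   -- fails with 'relation "t" already exists'; let it
INSERT INTO T VALUES (1); -- succeeds against the pre-created table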

Cheers,
Greg


Re: vacuum freeze wait_event BufferPin

2024-02-02 Thread Greg Sabino Mullane
On Wed, Jan 31, 2024 at 2:05 PM abrahim abrahao 
wrote:

> There is a three-day vacuum freeze on a partition table with wait_event =
> BufferPin, no transactions (active or idle) older than it that are not
> blocked, but at least one select query is executing at all times related to
> this partition table.
> ...
>
> Is there a way to figure out which session the vacuum freeze is waiting for?
>

The vacuum needs a chance to get in and make changes to the table, but it's
not being blocked at the traditional lock level that shows up in
pg_blocking_pids. You can see what is going on with this:

select pid, mode, query_start, query from pg_locks join pg_stat_activity
using (pid) where relation::regclass::text = 'mytable' order by 3;

That may show your vacuum process with a ShareUpdateExclusiveLock and some
other processes with other locks, probably AccessShareLock. Those other
pids need to all finish or be killed - and not have any overlap between
them. In other words, that vacuum process needs to have exclusive access to
the table for a split second, no matter if the other process locked the
table before or after the vacuum started. One crude solution would be to
cancel any other existing backends interested in that table:

select pg_cancel_backend(pid), now()-query_start, query from pg_locks join
pg_stat_activity using (pid) where relation::regclass::text = 'mytable' and
lower(query) !~ 'vacuum';

Not a good long-term solution, but depending on how often the table is
updated, you might have other options. Perhaps disable autovacuum for this
table and do a manual vacuum (e.g. in a cron script) that kills the other
backends as per above, or runs during a window without constant reads on the
table. Or have something that is able to pause the application. Or if this
is a partitioned table that might get dropped in the future or at least not
queried heavily, do not worry about vacuuming it now.

Cheers,
Greg


Re: vacuum freeze wait_event BufferPin

2024-02-02 Thread Greg Sabino Mullane
On Fri, Feb 2, 2024 at 3:25 PM abrahim abrahao 
wrote:

>  SHARE UPDATE EXCLUSIVE (ShareUpdateExclusiveLock) should not be blocked
> by ACCESS SHARE (AccessShareLock). Am I wrong about it? If I am not wrong
> why it still locking it?
>

Those locks with no other context are somewhat of a red herring. The
important part is not that the AccessShare is somehow blocking
ShareUpdateExclusive, but that the ShareUpdateExclusive process is NOT
blocking new AccessShare processes! In the internals of postgres,
vacuumlazy.c tries to grab a buffer lock (different concept from all the
user-visible locks above). It politely tries to wait[1] until nobody else
is grabbing it (aka pinning it), then proceeds. The problem is that other
processes are allowed to come along and put a pin in it as well - the
vacuum's shareupdateexclusive lock does not prevent that.

So the timeline is:

Process X runs a long select and pins the buffer
Process V runs a vacuum freeze and tries to lock the buffer. It detects
other pins, so it waits. It assumes that whoever is holding the pin will
release it someday.
Process Y runs another long select and also pins the buffer.
Process X ends, and removes its pins.
Process V still cannot move - it just knows there are still pins. Where
they come from does not matter.

As long as there is at least one other process holding a pin, the vacuum
freeze cannot continue[2].

That's my understanding of the code, anyway. This could be argued as a bug.
I am not sure what a solution would be. Cancelling user queries just for a
vacuum would not be cool, but we could maybe pause future pin-creating
actions somehow?

For the time being, forcing a super-quick moment of no table access would
seem to be your best bet, as described earlier.

Cheers,
Greg

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/vacuumlazy.c;hb=HEAD#l975
See backend/storage/buffer/bufmgr.c for LockBufferForCleanup()

[2] Quick duplication script:
drop table if exists foobar;
create table foobar as select 1 as id;
alter table foobar set (autovacuum_enabled = off);
update foobar set id = id;

Process 1:
begin; select *, pg_sleep(11) from foobar;

Process 2:
vacuum(freeze,verbose) foobar; /* blocked */

Process 3:
begin; select *, pg_sleep(33) from foobar;

Run in order. Kill Process 1 and Process 2 is still blocked. Kill Process 3
and Process 2 finishes the vacuum.
Note that a regular vacuum (without a freeze) will not get blocked.

Cheers,
Greg


Re: How to do faster DML

2024-02-03 Thread Greg Sabino Mullane
As a general rule, avoid heavy subselects like that. You don't need to
build a full list of duplicates before starting. Another approach:

create table mytable2 (like mytable1);

alter table mytable2 add primary key (id);

insert into mytable2 select * from mytable1 on conflict do nothing;

Given the size of your table, you probably want to divide that up.
As long as nothing is changing the original table, you could do:

insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 0;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 10_000_000;
insert into mytable2 select * from mytable1 order by ctid limit 10_000_000
offset 20_000_000;
etc.

Cheers,
Greg


Re: How to do faster DML

2024-02-04 Thread Greg Sabino Mullane
Partitioning will depend a lot on how you access the data. Is it always
using that transaction_date column in the where clause? Can you share the
table definition via \d? We've talked about this table quite a bit, but not
actually seen what it looks like. We are flying blind a little bit. You
mentioned your queries are slow, but can you share an explain analyze on
one of these slow queries?

45 billion rows is quite manageable. How many rows are in each day? You may
want to do a larger bucket than 24 hours per partition.

1)Even after partitioning the target table , to speed up the data load on
> this table , Is there an option to disable the primary and foreign keys and
> re-enable them post data load finishes. Will that be a good idea
>

No.

3)As the size of the table or each partition is going to be very large and
> this will be a read intensive application, compressing the historical
> partition will help us save the storage space and will also help the read
> queries performance


I am not sure what this means. If you are not going to need the data
anymore, dump the data to deep storage and drop the partition.

Cheers,
Greg




Re: Unused indexes

2024-02-05 Thread Greg Sabino Mullane
>
> The pg_stat_*_tables tables idx_* columns has accumulated usage since the
> last time you started the postmaster.
>

Actually, those persist across restarts - you can use

 select datname, stats_reset from pg_stat_database;

to see when/if they were reset. You can look for zero/low entries in
pg_stat_user_indexes.idx_scan to find unused indexes. Note that replicas
maintain their own stats, so checking only the primary may cause a false
positive.
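
As a rough sketch of that idx_scan check (sort and filter to taste):

select schemaname, relname, indexrelname, idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
order by idx_scan, pg_relation_size(indexrelid) desc;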


> In sql server we have this option to disable it and need to rebuild it to
>> enable it
>>
>
> Sadly, PG does not have ALTER INDEX ... DISABLE;.
>

Not really sure what the purpose of that is in sql server, but Ron is
correct, we have nothing equivalent. General usage in Postgres is to drop
the index if it is unused. If you need to create it again, easy enough with
CREATE INDEX CONCURRENTLY. Keeping your schema changes in a VCS (e.g. git)
is a good way to document when and why the index was dropped. I suppose in
a pinch you could keep the old index definition around by sticking it in a
table comment.

Cheers,
Greg


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
On Sun, Feb 4, 2024 at 3:52 PM Lok P  wrote:

> This table will always be queried on the transaction_date column as one of
> the filters criteria. But the querying/search criteria can span from a day
> to a month worth of transaction date data.
>

In that case, and based on the numbers you provided, daily partitioning
seems a decent solution.

What I mean was, we will definitely need the data for querying purposes by
> the users, but just to keep the storage space incontrol (and to help the
> read queries), we were thinking of having the historical partitions
> compressed. And for that , if any compression strategy should we follow on
> postgres?
>

There is no compression strategy, per se. There are ways Postgres
internally compresses the data (see "TOAST"), but in general, the table is
either there or not. If space is a concern you should examine if you truly
need 127 columns, make sure they are arranged correctly (search for
'postgres column tetris'), and move unused and older partitions elsewhere -
like a separate cheaper Postgres server, or something offline.

With regards to loading data to the table faster, wondering why you said '
> *NO*' to load the data first and enabling/Creating the Primary key and
> Foreign key constraint later approach.
>

For an initial load, this is fine, if you do things very carefully. I'd
build the child table indexes post load but still feed things into the main
table as an initial tradeoff, but YMMV.

Just looked back and saw this is actually Aurora, not Postgres. Most of the
advice on this thread should still apply, but be aware that things are not
the same as Postgres, especially at the storage layer. For all the money
you are giving them, don't forget to bug them as well.

Cheers,
Greg


Re: How to do faster DML

2024-02-05 Thread Greg Sabino Mullane
> So when you said *"In that case, and based on the numbers you provided,
> daily partitioning seems a decent solution."*
> , does it mean that size of the partition (which will be ~100-150GB per
> partition here) is not an issue for the postgres to serve read latency
> sensitive queries?
>

Yes, that should be fine. At the end of the day, what's going to be more
important is making sure you are using good indexing, and keeping all your
active stuff in cache. Since you have 127 columns, only pull back the
columns you absolutely need for each query.

Cheers,
Greg


Re: How to do faster DML

2024-02-08 Thread Greg Sabino Mullane
On Thu, Feb 8, 2024 at 12:12 AM Lok P  wrote:

> Rearranging the table columns by typlen desc, didn't give much storage
> space reduction.
>

It's not so much "biggest to smallest" as making sure you don't have any
gaps when you move from one data type to another. You may have not had any
"holes" originally, so it's hard to state anything without data. The other
thing you can do is use smaller data types if you can get away with it.
smallint better than int, int better than bigint, etc.
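
If you want to eyeball the current physical layout, something like this quick
sketch shows each column's size and alignment in order (swap in your own
table name):

select attname, typname, typlen, typalign
from pg_attribute a join pg_type t on t.oid = a.atttypid
where a.attrelid = 'mytable'::regclass and a.attnum > 0
order by a.attnum;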

So it seems the compression does not apply for the rows inserted using
> "CTAS" or "INSERT AS SELECT". Does that mean it is only applicable for the
> row by row inserts but not batch inserts(which a bigger system normally
> will have)? I was not expecting this though, so it was disappointing.
>

TOAST compression doesn't care how the data arrived. It does have criteria
though as to how large something is before it gets toasted and/or
compressed. See:

https://www.crunchydata.com/blog/postgres-toast-the-greatest-thing-since-sliced-bread

Cheers,
Greg


Re: Partitioning options

2024-02-08 Thread Greg Sabino Mullane
On Thu, Feb 8, 2024 at 12:42 AM sud  wrote:
...

> The key transaction table is going to have ~450 Million transactions per
> day and the data querying/filtering will always happen based on the
> "transaction date" column.
>
...

> Should we go for simple daily range partitioning on the transaction_date
> column?
>

This one gets my vote. That and some good indexes.

Cheers,
Greg


Re: Partitioning options

2024-02-08 Thread Greg Sabino Mullane
>
> Out of curiosity, As OP mentioned that there will be Joins and also
> filters on column Customer_id column , so why don't you think that
> subpartition by customer_id will be a good option? I understand List
> subpartition may not be an option considering the new customer_ids gets
> added slowly in the future(and default list may not be allowed) and also OP
> mentioned, there is skewed distribution of data for customer_id column.
> However what is the problem if OP will opt for HASH subpartition on
> customer_id in this situation?
>

It doesn't really gain you much, given you would be hashing it, the
customers are unevenly distributed, and OP talked about filtering on the
customer_id column. A hash partition would just be a lot more work and
complexity for us humans and for Postgres. Partitioning for the sake of
partitioning is not a good thing. Yes, smaller tables are better, but they
have to be smaller targeted tables.

sud wrote:

130GB of storage space as we verified using the "pg_relation_size"
> function, for a sample data set.


You might also want to closely examine your schema. At that scale, every
byte saved per row can add up.

Cheers,
Greg


Re: Clarification regarding managing advisory locks in postgresql

2024-02-08 Thread Greg Sabino Mullane
On Thu, Jan 25, 2024 at 5:44 AM Sasmit Utkarsh 
wrote:

> Therefore, locks acquired in one process (or transaction) do not directly
> affect locks in another process (or transaction).
>

Um... maybe if you describe the exact problem you are trying to solve with
advisory locks, we can find the best solution. It may not even involve
advisory locks.

Cheers,
Greg


Re: How should we design our tables and indexes

2024-02-10 Thread Greg Sabino Mullane
There is a lot to unpack here. I'm going to take a quick pass, but you
ought to consider getting some custom expert help.

On Sat, Feb 10, 2024 at 2:39 PM veem v  wrote:

> ... These will be pagination queries. These read queries will be on big
> transaction tables (will have ~500+ attributes approx will have approx.
> rows size of ~1KB) having a continuous stream of inserts
>

Pagination is already a hard problem, and does not even make sense when
combined with "a continuous stream of inserts". What should the user see
when they click on page 2?

Also, 500 attributes?! What data types are those? If boolean, you may want
to look at using bitfields.

1)Is there any way in postgres to influence the optimizer for the
> "first_row" optimization, so that it won't go for evaluating all the rows
> from such UI search queries. As because these will be pagination queries
> and the user will be interested in seeing top 100 transactions in the first
> page asap?
>

Using LIMIT does allow for this, with certain caveats. The best answer for
a lot of these questions is "try it and see" with a simplified version of
your queries against some dummy data.


>   one Index on table1(MID) , one index Table1(CID), one index on
> table2(ACN_NBR)?
>

This. One index for each important column. Especially if they vary between
queries, as you alluded to later.

select  count(*) over() as total_record, * [large query omitted]
>

Queries starting with select count(*) are also a red flag, but it's hard to
tell from this. Probably best to explain what you think the query is doing
using regular words.

3)One of the use cases is that the customer should be able to search on
> certain attributes and should be able to see the transactions in "desc by
> processing_date" i.e. latest transactions on the first page on the UI. And
> in such scenario, if the search attribute is less unique and the customer
> puts a date range of a month i.e. over 30 partitions , it may results in
> scanning and sorting billions of rows to get the top/recent ~100
> transactions and most likely not going to respond back in <1 sec, even goes
> for the index. So how should we handle or design indexes for catering such
> queries?  For e.g. if we have the only filter on column "TABLE1.CID" in the
> above query, which is very less unique then how to handle it?
>

Well, there is only so much a database can do if you are pulling from 500
different attributes. But the whole purpose of an indexed column is to
prevent having to scan *all* the rows. If it's a common value, then
hopefully there is something else in the where clause more specific that is
also indexed. As mentioned above, a LIMIT and ORDER BY, with the
appropriate indexes, can return early.
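
As a tiny illustration (using made-up values and the column names from this
thread), an index that matches the ORDER BY lets the planner stop early:

create index on table1 (cid, processing_date desc);

explain analyze
select * from table1
where cid = 12345
order by processing_date desc
limit 100;
-- with a matching index there should be no Sort node, and the scan
-- stops after roughly 100 rows rather than reading everything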

 For e.g. if we set "max_parallel_workers_per_gather"=4 to speed up the
> queries, then we will be able to serve only 32/4=8 concurrent user requests
> at any point in time. If we are targeting to serve ~100 concurrent users ,
> will it be advisable to change or we should test the system with default
> i.e. not setting this parallel parameter?
>

Again, this is a TIAS (try it and see), but in general it's not the number
of concurrent users, but the number of concurrent queries, which is a
factor of how fast your queries are. Feed Postgres as many parallel workers
as you can.

tl;dr Less than 1 second response time is *probably* reachable given your
parameters, but parts of it are too vague to state for certain.

Cheers,
Greg


Re: How to do faster DML

2024-02-12 Thread Greg Sabino Mullane
On Tue, Feb 6, 2024 at 12:15 AM Lok P  wrote:

> Another thing I noticed the shared_buffer parameters set as 2029684 in
> this instance, which comes to ~21MB and that seems very small for a
> database operating in large scale. And I see we have RAM in the instance
> showing as ~256GB. So thinking of bumping it to something as ~40-50GB.
>

shared_buffers has a unit of 8 kb blocks, so your cluster is set at 15GB,
not 21 MB. Even so, going to 50 would be fine if you have that much RAM.
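
A quick way to sanity-check the real value (assuming the default 8kB block
size):

select setting::bigint as blocks,
  pg_size_pretty(setting::bigint * 8192) as shared_buffers
from pg_settings where name = 'shared_buffers';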

Hope that will help to some extent. Not sure if there is methods to
> manually,  cache some objects(tables/indexes) which were getting used
> frequently by the read queries.
>

That's one of the points of shared_buffers - keep things that are accessed
often in memory. Postgres keeps track of which things are used more often,
so in theory the most frequently used items are removed only when
absolutely necessary.

Cheers,
Greg



Re: How to do faster DML

2024-02-12 Thread Greg Sabino Mullane
On Mon, Feb 12, 2024 at 1:50 AM veem v  wrote:

> So we were thinking, adding many column to a table should be fine in
> postgres (as here we have a use case in which total number of columns may
> go till ~500+). But then,  considering the access of columns towards the
> end of a row is going to add more time as compared to the column which is
> at the start of the row. As you mentioned, accessing 100th column may add 4
> to 5 times more as compared to the access of first column. So , is it
> advisable here to go for similar approach of breaking the table into two ,
> if the total number of column reaches certain number/threshold for a table?
>

I'm not sure of what Peter was testing exactly to get those 4-5x figures,
but I presume that is column access time, which would not mean a direct
effect on your total query time of 4-5x. As far as breaking the table in
two, I would not recommend that as it adds complexity and introduces other
problems. On the other hand, 500 columns is pretty extreme, so maybe things
rarely accessed or searched for could go in another table. Really hard to
say without knowing more about your data access patterns. Hopefully, we
never see a "SELECT *" for a table like that! :)

Cheers,
Greg


Re: How should we design our tables and indexes

2024-02-12 Thread Greg Sabino Mullane
>
> When the user clicks to the second page , it will see the next set of rows
> i.e 100 to 200 and next will see 200 to 300 and so on till the result set
> finishes.
>

As others have pointed out, that still makes no sense. You will either fail
to show certain rows completely, or have a stale view with additional
tracking overhead.


> All those attributes are majorly Varchar and numeric in nature , so not
> sure if any options exist there for these?
>

Nothing built in, but there are application-level tricks to combine fields,
depending on what they are. Whether that would work for your use case or be
an overall performance benefit is impossible to say without hard data.
Normally, I'd say don't worry about that but - not to belabor the point -
500 columns is a lot, to the point where normal advice may not apply.

> If PR_ID is a must in the Join criteria between these tables table1,
> table2 in all the queries, then is it advisable to have a composite index
> like (pr_id, mid), (pr_id,cid) etc rather than having index on individual
> columns?
>

No - individual indexes are better, and Postgres has no problem combining
them when needed.


> Actually this inner query is doing the main work, i.e finding the
>>> search results based on the input search criteria. The outer query is just
>>> fetching the results from the inner query along with count(*), to pass on
>>> to the API , so as to calculate and show the user how many pages there
>>> total with a full result set. basically it will count(*)/N records per
>>> page, and that figure will be displayed in the first page of the UI screen.
>>>
>>
Okay, understood. But back to the pagination discussion, this number is
relatively meaningless on a rapidly changing table.

Sure will try to test and see how it behaves when the number of
> simultaneous queries (here 32/4=8 concurrent queries) exceed the
> max_parallel_workers limit. Though I am expecting the further queries
> exceeding the limit might get serialized.
>

Yes - if there are not enough workers available, it will run with a reduced
number of workers, including possibly zero. You can see that when you run
an explain analyze, it will show you the number of workers it wants and the
number it actually was able to get.

Cheers,
Greg


Re: Safest pgupgrade jump distance

2024-02-12 Thread Greg Sabino Mullane
On Mon, Feb 12, 2024 at 5:08 AM Dávid Suchan 
wrote:

> Hi, I was wondering what is the safest pg_upgrade version upgrade distance
> going from 9.6 version. Do I need to go version by version or I can go from
> 9.6 to 15?


You can go direct. You really should go to 16 though. If nothing else, you
gain lots of performance improvements and another year before your version
goes EOL (end of life).


> We have a very huge database(TBs) with one replication server, so we will
> first run the pgupgrade on the main server and then rsync to a standby
> replica.


I assume you mean the three-way rsync recipe in the docs.

Also, when upgrading a very big database with replication where none of the
> data can be allowed to be lost, is the pgupgrade into rsync approach the
> best one? Thanks.
>

Not sure about "best", it is certainly the fastest. For data loss concerns,
that's where your backup system (e.g. pgBackRest) is key.

Cheers,
Greg


Re: How should we design our tables and indexes

2024-02-12 Thread Greg Sabino Mullane
> Is there any way to track those historical executions and be able to find
the exact root cause of the slow executions confidently?

https://www.postgresql.org/docs/current/auto-explain.html

auto_explain.log_min_duration = '5s' ## or large enough to capture your
quickest one

Do NOT enable auto_explain.log_analyze.

Cheers,
Greg


Re: Using a Conversion Table

2024-02-14 Thread Greg Sabino Mullane
>
> "Fiscal year" double precision,
>

This column is an INTEGER in your other table, so your schema is not even
internally consistent! Try to use TEXT, INT, DATE and TIMESTAMPTZ whenever
possible, with NUMERIC as needed for things with a precision. See:

https://www.postgresql.org/docs/current/datatype-numeric.html

Cheers,
Greg

P.S. +1 for use of an identity column, though.


Re: Postgres pg_cron extension

2024-02-14 Thread Greg Sabino Mullane
Yes, definitely the wrong forum. RDS is not Postgres, and "parameter
groups" is an Amazon thing. A quick web search would seem to indicate that
the default group is replaced by the custom one, rather than enhancing it.
But really, this is a simple "try it and see" sort of task.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/parameter-groups-overview.html

Other than that, reach out to Amazon support.

Cheers,
Greg


Re: How should we design our tables and indexes

2024-02-14 Thread Greg Sabino Mullane
On Tue, Feb 13, 2024 at 2:26 PM veem v  wrote:

> Can the optimizer, only scan the TABLE1  using ACCESS criteria "
> TABLE1.MID in ()" or "TABLE1.CID in ()" which will be catered by
> two different index i.e one index on column "MID" and other on column "CID"?
>

Yes:

greg=# create table t1(pr_id int generated always as identity primary key,
mid int, cid int);
CREATE TABLE
greg=# insert into t1(mid,cid) select random()*12345, random()*12345 from
generate_series(1,123456);
INSERT 0 123456
greg=# create index t1_mid on t1(mid);
CREATE INDEX
greg=# create index t1_cid on t1(cid);
CREATE INDEX
greg=# analyze t1;
ANALYZE
greg=#  explain select * from t1 where mid in (1,2,3,4) and cid IN
(5,6,7,8);
                              QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=50.03..109.55 rows=49 width=12)
   Recheck Cond: ((cid = ANY ('{5,6,7,8}'::integer[])) AND (mid = ANY ('{1,2,3,4}'::integer[])))
   ->  BitmapAnd  (cost=50.03..50.03 rows=49 width=0)
         ->  Bitmap Index Scan on t1_cid  (cost=0.00..24.88 rows=2469 width=0)
               Index Cond: (cid = ANY ('{5,6,7,8}'::integer[]))
         ->  Bitmap Index Scan on t1_mid  (cost=0.00..24.88 rows=2469 width=0)
               Index Cond: (mid = ANY ('{1,2,3,4}'::integer[]))

It can utilize other columns as access criteria those used in join
> conditions like MID, PR_ID, in which case a composite index on  the
> columns(CID,PR_ID) (MID, PR_ID) will provide better selectivity and faster
> access?
>

If you query on the primary key, it's going to use the associated PK index,
not a composite one in which the PK is buried. But try creating the sample
table t1 above yourself and play around with the various indexes and query
combinations.

Cheers,
Greg


Re: PITR

2024-02-14 Thread Greg Sabino Mullane
On Wed, Feb 14, 2024 at 1:39 PM Yongye Serkfem  wrote:

>  I checked the Postgresql.conf file and can't find the
> appropriate parameter to set the target time.
>

It is set inside the postgresql.conf file. Unless you have modified it,
there is a section in there called "Recovery Target" which has a handful of
commented-out variables starting with "recovery_"
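
Uncommented, it might look something like this (the timestamp is just a
placeholder - use your own target):

recovery_target_time = '2024-02-14 12:00:00+00'
recovery_target_action = 'promote'

Note that on modern versions you also need a recovery.signal file in the data
directory for the server to actually enter targeted recovery.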

Cheers,
Greg


Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
>
> So as I also tested the same as you posted, there has been no change in
> "ctid" , when I altered the column data type from 'int' to 'bigint' in the
> table, so that means full table rewriting won't happen in such a scenario.


No, it was definitely rewritten - do not depend on the ctid to verify that.
Take our word for it, or use pg_relation_filenode('int_test') before and
after, as well as pg_relation_size('int_test');
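
For example (a sketch, assuming a column named c1 of type int):

select pg_relation_filenode('int_test');  -- note the value
alter table int_test alter column c1 type bigint;
select pg_relation_filenode('int_test');  -- a new value means a rewrite happened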

Cheers,
Greg


Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
On Thu, Feb 15, 2024 at 11:43 AM Adrian Klaver 
wrote:

> That is a mixed bag:
>

Ha! Good point. Our contrived example table does suffer from that, so
perhaps the test should be:

create table int_test(c1 int, c2 int);

Cheers,
Greg


Re: How to do faster DML

2024-02-15 Thread Greg Sabino Mullane
I really worry you are overthinking this. The only real concern is going
from INT to BIGINT, jumping from 4 to 8 bytes of storage. That really
covers 99% of real world cases, and the canonical advice is to start with
BIGINT if you ever think your rows are going to be numbered in the
billions. Also, a NUMERIC can cause a table rewrite - try changing the
scale, not just the precision. And if your scale is 0, why are you using
numeric? :)

Cheers,
Greg


Re: Encryption Options

2024-02-16 Thread Greg Sabino Mullane
You need to clearly define your threat model. What exactly are you
defending against? What scenario do you want to avoid?

Also, your decision of on-premise or Aurora is extremely relevant to your
range of options.

Cheers,
Greg


Re: Encryption Options

2024-02-16 Thread Greg Sabino Mullane
On Fri, Feb 16, 2024 at 4:04 PM sud  wrote:

>
> Yes these are Account number/PCI data and "data at rest" encryption is
> something management is asking to have irrespective of whether we encrypt
> those before storing in the database or not. And this system needs to
> adhere to PCI 4.0 standards , so it looks like we can't persist the PCI
> data as is in th database even if the 'data at rest' encryption is there,
> it has to be encrypted before storing into the database.
> https://www.varonis.com/blog/pci-dss-requirements
>

Even with PCI rules, not all data needs to be encrypted, only very
sensitive things like actual credit card numbers. If you don't have a
compliance department, you may want to outsource that investigation. To
someplace other than pgsql-general. :)

Agreed. The on-premise vs aurora will take a different approach for
> catering to above needs. We are currently evaluating , what would be the
> possible options in each of these cases? and if this would be a factor in
> choosing the on-premise postgres vs aurora postgres?
>

Also outside the scope of this list, but with Aurora, you pay more, and
must 100% trust Amazon with all of your data. On the plus side, they (and
any other managed Postgres service) remove a lot of the complexity and DBA
housekeeping tasks.


> On Sat, Feb 17, 2024 at 12:40 AM Ron Johnson 
> wrote:
>
>> The problem with encrypting "account number" is that you can't JOIN or
>> WHERE on it.
>>
>
I'm hoping the OP meant "credit card number" not "primary key identifying a
customer" or just generic PII. If just cc#s, no joins are needed (and
certainly no WHERE clause, yikes!)

Will try to verify these options. Considering these system processes 100's
>> of millions of transactions, will these encryption add significant overhead?
>>
>
Yes, encryption will always incur overhead. However, this cost should be
absorbed by your application, not the database. Encrypt and store as a
blob[1] in the database. Stolen database = no keys, no access. As Ron
pointed out, you should also have other levels of encryption: your backups,
your WAL, and ideally OS-level as well.

[1] Generic blob of data, not a BLOB

Cheers,
Greg


Re: "reverse" (?) UPSERT -- how to ?

2024-02-17 Thread Greg Sabino Mullane
On Sat, Feb 17, 2024 at 10:24 AM Karsten Hilbert 
wrote:

> There will be a view giving rows for
> each detail row enriched with master table data
> UNION ALL
> rows for each master row that does not have any detail row with
> detail table columns NULLed
>

A better way to do that is to just use a LEFT JOIN.

Given a pk_detail (and pk_master) having been obtained from the view
> (therefore pk_detail being NULL or an integer value)
> UPDATE that detail row (namely when pk_detail is distinct from NULL) or
> insert a new detail row (when pk_detail IS
> NULL) linking that row to the master row identified by pk_master.


What you want is the MERGE command:

https://www.postgresql.org/docs/current/sql-merge.html

To simplify your example a little, let's create two tables, linked to each
other by foreign keys, in which only 2 of the five rows have matching
details:

drop table if exists t2;
drop table if exists t1;

create table t1(id serial primary key, v text);

insert into t1(v) select 'foo' from generate_series(1,5);

create table t2(id serial primary key,  fk int references t1(id), w text);

insert into t2(fk,w) values (2, 'bar1'), (2, 'bar2'), (3,'bar3');

select * from t2 order by 1;

 id | fk |  w
----+----+------
  1 |  2 | bar1
  2 |  2 | bar2
  3 |  3 | bar3
(3 rows)

Now we can use that left join, plus a merge based on the results, to
conditionally update or insert:

WITH x AS (SELECT t1.id as t1id, t2.id AS t2id, * FROM t1 LEFT JOIN t2 ON (
t1.id=t2.fk) )
MERGE INTO t2
USING x ON (x.t2id = t2.id)
WHEN MATCHED THEN UPDATE SET w='new info for existing row'
WHEN NOT MATCHED THEN INSERT (fk,w) VALUES (x.t1id, 'new info for a new
row');

select * from t2 order by 1;

 id | fk |             w
----+----+---------------------------
  1 |  2 | new info for existing row
  2 |  2 | new info for existing row
  3 |  3 | new info for existing row
  4 |  5 | new info for a new row
  5 |  4 | new info for a new row
  6 |  1 | new info for a new row
(6 rows)

Cheers,
Greg


Re: Partitioning, Identity and Uniqueness (given pg 16 changes)

2024-02-19 Thread Greg Sabino Mullane
On Sun, Feb 18, 2024 at 10:32 PM Darryl Green 
wrote:

> 1) Is my attempt to improve performance of insert AND (as the number of
> sources goes up, as it has) querying which is invariably by id but it is
> easy to include "and src = x" in the majority of usage (the other common
> case is to query across all sources, asking for a set is unusual) a
> reasonable approach?


I am struggling to see the overall benefit of partitioning here. How many
total rows and inserts per second, anyway?

If the app already knows the src (as evidenced by adding in "and src = x"),
and you query by id, why would you partition by src?

It may help us all to take a step back and describe the problem here with
specific timing numbers (for a slow INSERT rate maybe?) rather than trying
to force one particular solution to work.

Cheers,
Greg


Re: User pooler error: unsupported startup parameter: extra_float_digits

2024-02-20 Thread Greg Sabino Mullane
Add this to your pgbouncer config file:

ignore_startup_parameters = extra_float_digits

A web search of your subject line finds this solution quite easily.
Providing more details would also be a good idea for future emails.

Cheers,
Greg


Re: Postgresql assistance needed

2024-02-22 Thread Greg Sabino Mullane
On the surface, it looks as though it *should* work if the only thing
changing is the username. Those other more serious errors should get fixed,
but putting those aside for now... We don't know what your program does.
Write a smaller one that just does a PQexec and calls nextval, then returns
a proper error message if it fails.

Cheers,
Greg


Re: Partitioning, Identity and Uniqueness (given pg 16 changes)

2024-02-22 Thread Greg Sabino Mullane
>
> I have a table that is capturing what is, basically, time series data


Time series data usually is concerned with "recent" data, and has a
subsequent drop off date. This is ideal for partitioning by timestamp - not
only do your queries only need to hit a few of the total tables, but you
can simply drop old tables that contain older data. Optionally saving them
first to deep storage. The DROP TABLE approach is extraordinarily better
than the alternative of DELETE FROM ... WHERE ...
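
A minimal sketch of that pattern, with made-up names:

create table readings (ts timestamptz not null, device_id int, value numeric)
  partition by range (ts);

create table readings_2024_02 partition of readings
  for values from ('2024-02-01') to ('2024-03-01');

-- when that month ages out, the whole thing goes away instantly:
drop table readings_2024_02;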

Cheers,
Greg


Re: PostgreSQL Guard

2024-02-27 Thread Greg Sabino Mullane
You may need to expand on / rephrase question 2.


Re: PostgreSQL Guard

2024-02-28 Thread Greg Sabino Mullane
On Tue, Feb 27, 2024 at 3:48 PM Jason Long  wrote:

> Suppose you want to transfer the database of a website like Amazon.com to
> a new server and delete the old one. Many users are buying and selling on
> this website at the same time and it is not possible to turn off the
> server. What do you do to move a database to another server?
>

If we assume your question is "how can we migrate to a new system while
keeping the old one up and running to minimize downtime?" the answer is
logical replication.

https://www.crunchydata.com/blog/data-to-go-postgres-logical-replication
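
The blog above covers the details, but the bare bones look roughly like this
(names and connection string are placeholders, and the schema must be copied
to the new server first, e.g. with pg_dump --schema-only):

-- on the old server:
create publication move_everything for all tables;

-- on the new server:
create subscription move_everything
  connection 'host=old-server dbname=mydb user=replicator'
  publication move_everything;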

Cheers,
Greg


Re: Orphan table files at data/base/

2024-02-28 Thread Greg Sabino Mullane
No, I would not remove those files without making 100% sure they do not
belong to that database or any other. Are you sure you are inside database
16384 when you ran those commands? Does a 'stat' on those files line up
with the time of the crash? If so, I would stop pg, move the files
someplace else, do a pg_dump > /dev/null for another sanity check, then
remove those files.

Cheers,
Greg


Re: When manual analyze is needed

2024-03-04 Thread Greg Sabino Mullane
On Mon, Mar 4, 2024 at 12:23 AM veem v  wrote:

> Additionally if a query was working fine but suddenly takes a
> suboptimal plan because of missing stats , do we have any hash value column
> on any performance view associated with the queryid which we can refer to
> see past vs current plans difference and identify such issues quickly and
> fix it?
>

You can use auto_explain; nothing else tracks things at that fine a level.
You can use pg_stat_statements to track the average and max time for each
query. Save and reset periodically to make it more useful.

https://www.postgresql.org/docs/current/auto-explain.html

https://www.postgresql.org/docs/current/pgstatstatements.html


> I am not seeing any such column in pg_stat_activity or pg_stat_statements
> to hold hash value of the plan and also the query column is showing
> "<insufficient privilege>" for many of the entries, why so?
>

Ordinary users are not allowed to see what other people are running. You
can add a user to the pg_read_all_stats role to allow this:

GRANT pg_read_all_stats TO alice;

Oftentimes someone needing access to the stats also needs a little more
access, so consider the pg_monitor role as well. Both are documented here:

https://www.postgresql.org/docs/current/predefined-roles.html

Cheers,
Greg


Re: When manual analyze is needed

2024-03-05 Thread Greg Sabino Mullane
>
> We were planning to have the auto_explain extension added and set the
> log_min_duration to ~5 seconds and log_analyze to true. So that all the
> queries going above that time period will be logged and provide detailed
> information on the exact point of bottleneck. Will it be a good idea to set
> it on production DB which is a highly active database? or should we only
> have the extension added but only set the parameters while we debug some
> performance issue and then reset it back after we are done.
>

I would not use log_analyze on a highly active production db. Even on a dev
system, use it carefully as it has some downsides. The log_min_duration of
5s should be fine everywhere, however.

Cheers,
Greg


Re: pgBadger: Cannot find any log entries from systemd-journald

2024-03-06 Thread Greg Sabino Mullane
On Tue, Mar 5, 2024 at 3:14 AM Frank Lanitz  wrote:

> $ pgbadger --journalctl "journalctl -u postgresql.service"
>

You could try adding --verbose to see if it gives more clues.


> Having a look into the journal there is a lot of
>

None of the snippets from journald you showed were actually things pgbadger
cares about, FWIW. You can get an idea of what is actually parsed by
running "journalctl -u postgresql --output=short-iso"

log_error_verbosity = 'verbose'
>

This is probably not needed, and there is a finite chance that the extra
context is confusing pgbadger.

Cheers,
Greg


Re: Feature Request: log_line_prefix %h Counterpart

2024-03-06 Thread Greg Sabino Mullane
On Fri, Jan 26, 2024 at 1:44 AM Jawala  wrote:

> Looking for an escape sequence for log_line_prefix for the IP address that
> traffic was destined to, i.e., the *local* address of the TCP connection,
> natural counterpart to "remote IP address".
>

I made a quick proof of concept of this, will post a patch to -hackers soon
and CC you.

Cheers,
Greg


Re: Feature Request: log_line_prefix %h Counterpart

2024-03-06 Thread Greg Sabino Mullane
Actually, will just post it here in case someone finds this thread and
wants to follow along:

https://www.postgresql.org/message-id/flat/CAKAnmmK-U%2BUicE-qbNU23K--Q5XTLdM6bj%2BgbkZBZkjyjrd3Ow%40mail.gmail.com


Re: update to 16.2

2024-03-08 Thread Greg Sabino Mullane
On Fri, Mar 8, 2024 at 5:01 AM Matthias Apitz  wrote:

> The other option (pg_upgrade) we never used.
>

You really should give this a shot. Much easier, and orders of magnitude
faster with the --link option. It should work fine even with a
custom-compiled postgres (really, as long as pg_dump can still read things
and core changes are not radical). Try it and see.

Cheers,
Greg


Re: Windows service randomly stops with no indication why

2024-03-08 Thread Greg Sabino Mullane
Go to "Services", find Postgres, and try a manual restart, see what
happens. Then check the recovery tab and see what it is supposed to do on
failures - you are probably at the "Do nothing" count limit, hence the no
auto restart.

If you can manually duplicate the failure to restart, try increasing the
logging, removing extensions, simplifying the postgresql.conf, etc, until
it starts working again. I suspect it will restart fine, though. Maybe
check anti-virus software logs?

Cheers,
Greg


Re: Seeing high query planning time on Azure Postgres Single Server version 11.

2024-03-10 Thread Greg Sabino Mullane
On Sat, Mar 9, 2024 at 1:57 PM hassan rafi  wrote:

> Would upgrading to the latest version of Postgres potentially solve the
> issue?
>

Potentially, yes, but the only one who can answer that for sure is you.
Upgrade to 11.22 and re-run the query. Worst case scenario, it runs the
same speed but you gained yourself a bunch of bugfixes and CVE resolutions.
If the problem persists on 11.22, spin up a Postgres 16, load the data, and
test it there.

Cheers,
Greg


Re: alter table xxx alter column yyy set (n_distinct=nnnn );

2024-03-11 Thread Greg Sabino Mullane
Works for me on Postgres 14 as well. Maybe you are looking at the wrong
table? Might help to show the exact steps you typed in.
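
For reference, this is the sort of thing to try (table and column names made
up):

alter table mytable alter column mycol set (n_distinct = 1000);

select attname, attoptions from pg_attribute
where attrelid = 'mytable'::regclass and attoptions is not null;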


Re: operator is only a shell - Error

2024-03-15 Thread Greg Sabino Mullane
On Fri, Mar 15, 2024 at 6:26 AM Rajesh S  wrote:

> I wanted to implement a new "=" (equal) operator with LEFTARG as numeric
> and RIGHTARG as varchar.  But after creating the function and operator,
> psql shows the error "operator is only a shell: character varying =
> numeric
>
Your operator has numeric on the left and varchar on the right. But your
query is doing numeric on the RIGHT. Probably want to make a matching one
to cover both cases.

Cheers,
Greg


Re: select results on pg_class incomplete

2024-03-15 Thread Greg Sabino Mullane
On Fri, Mar 15, 2024 at 6:43 AM Thiemo Kellner 
wrote:

> I am not sure, we are taking about the same problem, but would be
> surprised to be the only one having experienced filling disks.

...

> So, if I have a disk getting filled up, I would like to get easily
> information on the problematic
> structures in one go.


This is a common problem, and one that has been solved before. There are
many monitoring solutions out there that can help you with this. For an
incomplete list, see:

https://wiki.postgresql.org/wiki/Monitoring

If you want to roll your own, the other thread is full of good information
on that.

Cheers,
Greg


Re: Query on Postgres SQL transaction

2024-03-15 Thread Greg Sabino Mullane
That's a very vague question, but you can trace exactly what is happening
by issuing

SET log_statement = 'all';

Ideally at the session level by your application, but can also set it at
the database and user level. If all else fails, set it globally (i.e.
postgresql.conf). Turn it off again as soon as possible, it will make your
logs extremely verbose. But you can track exactly what your application is
doing.

Cheers,
Greg


Re: operator is only a shell - Error

2024-03-18 Thread Greg Sabino Mullane
The order of the arguments matter. You need an operator that expects a
varchar on the left hand side of the operator, and numeric on the right
side. For example:

create database cast_test;
\c cast_test
You are now connected to database "cast_test" as user "greg".

cast_test=# create table deposit_lien(deposit_no varchar);
CREATE TABLE

cast_test=# select * from deposit_lien where
deposit_no='0002114029832'::numeric;
ERROR:  operator does not exist: character varying = numeric

cast_test=# create function varchar_eq_numeric(varchar,numeric)
cast_test-# returns bool language sql immutable as $$ select
$1::numeric=$2::numeric $$;
CREATE FUNCTION

cast_test=# create operator public.= (function = varchar_eq_numeric,leftarg
= varchar,rightarg = numeric);
CREATE OPERATOR

cast_test=# select * from deposit_lien where
deposit_no='0002114029832'::numeric;
 deposit_no
------------
(0 rows)

> But before creating the function and operator it was working fine


Was it? It helps to show us the exact things ran and the exact output,
rather than just "it was working fine" :)

Cheers,
Greg


Re: Postgres database encryption

2024-03-19 Thread Greg Sabino Mullane
What is your threat model? That will determine the best answer.

The easiest thing is to use OS-level encryption.

Cheers,
Greg


Re: Query on Postgres SQL transaction

2024-03-30 Thread Greg Sabino Mullane
That log snippet shows two different PIDs. Check the logs to see the
complete path that process 1876 took after it did the insert at  2024-02-19
15:21:54.850 +08

Make sure you are not using prepared transactions. This should return 0:

select current_setting('max_prepared_transactions');

Cheers,
Greg


Re: Timestamp conversion Error in dynamic sql script

2024-04-02 Thread Greg Sabino Mullane
>
> 1. Declare start_date as DATE when you want to add days with date + int
> 2. Keep TIMESTAMP and use start_date + make_interval(days => i)
>

Also

0. Use TIMESTAMPTZ not TIMESTAMP

Cheers,
Greg


Re: Moving delta data faster

2024-04-04 Thread Greg Sabino Mullane
So you want advice from this open source community about moving things from
a proprietary database to another proprietary database, going through a
proprietary storage system along the way? :)

To answer the original question, use MERGE. That's its job, and your table
size is very tiny, so I doubt the performance impact will be worth worrying
about.

Cheers,
Greg


Re: Storing and comparing columns of cryptographic hashes?

2024-04-08 Thread Greg Sabino Mullane
On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett  wrote:

> - Is there a way to tell postgresql "this column contains cryptographic
> hashes, so you can do hash joins using any subset of the bits,
> without having to hash them again"? If not, should there be?
>

No, and no. (if I understand your question correctly). You could use a
functional index, I suppose, but seems premature optimization.


> - Is `bit(256)` the right type to use to store 32-byte hash values with no
> additional overhead?
>

No, you would want bytea. I would store the value in a TEXT field, unless
you really worried about space savings. The hexadecimal value will be far
easier to debug and work with, and you can use a simple b-tree index.

- What would be the simplest way, given an input array of hashes (which
>   I may have to pass in as an array and use `unnest`), to filter out all
>   the values that already exist, *and* generate a corresponding bitmap
>   in the same order for present/not-present for the entire array (to
>   send back to the user)? Filtering seems easy enough, but generating
>   the bitmap less so.
>

Something like this:

SELECT array_agg(case when t.bhash is null then 1 else 0 end order by a.ord)
from unnest(array['blakehash1', 'blakehash2', etc...]) with ordinality
  as a(x, ord)
left join mytable t on t.bhash = a.x;


> - Does it make more sense to store the values as one row per value, or
>   as one row per group of values?


Hard to answer without knowing more, but I'd lean towards simple and one
row per value.

Your proposal (query db, do external work, update db) also sets of lots of
concurrency red flags, so be mindful of that.

Cheers,
Greg


Re: Tracing libpq client: Only with PQtrace()?

2024-04-09 Thread Greg Sabino Mullane
On Tue, Apr 9, 2024 at 8:51 AM Sebastien Flaesch 
wrote:

> Is the PQtrace() API the only way to enable libpq client tracing?
>
> I thought about some environment variable of client configuration
> setting...
>

That's generally the job of the client, or more specifically, the driver
providing the interface between your program and the libpq API. Unless you
are writing at the low level yourself, in which case, you would need to
have your program detect an env and call PQtrace() as needed.

See for example
https://metacpan.org/pod/DBD::Pg#parse_trace_flag-and-parse_trace_flags

Cheers,
Greg


Re: mystery with postgresql.auto.conf

2024-04-10 Thread Greg Sabino Mullane
On Wed, Apr 10, 2024 at 8:40 AM Matthias Apitz  wrote:

> Maybe later I did it accidently against the 15.1 server from the psql
> history.


Yes, as shown by the timestamps from your first post:

-rw--- 1 postgres postgres 124 Mar 28 11:35
> postgresql151/data/postgresql.auto.conf
>

This is also a good reason to set your log_statement to 'ddl', which will
put the ALTER SYSTEM change into your database logs.

Cheers,
Greg


Re: Forcing INTERVAL days display, even if the interval is less than one day

2024-05-07 Thread Greg Sabino Mullane
Could just use a CASE for that particular case:

CASE WHEN now()-latest_vacuum < '1 day'
   THEN '0 days '
   ELSE '' END
   || DATE_TRUNC('second', now()-latest_vacuum) AS vacuumed_ago

Cheers,
Greg


Re: Postgresql active-active nodes in cluster

2024-05-09 Thread Greg Sabino Mullane
> https://wiki.postgresql.org/wiki/Multimaster
>

That is a pretty old page. There are other solutions, such as pgedge, but
the three most important questions when it comes to active-active
replication are:

* Do you really, really need it? (hint: the answer is no)
* What problem are you trying to solve?
* Are you really willing to suffer the tradeoffs?

Cheers,
Greg

P.S. Please upgrade to v15.7


Re: Postgresql active-active nodes in cluster

2024-05-13 Thread Greg Sabino Mullane
On Mon, May 13, 2024 at 2:53 AM Vidyashree H S <
shreevidya...@exaleapsemi.com> wrote:

>
>- Is that multi master replication is not recommended to do ?
>
>
No, it is not recommended, as it has a very, very narrow use case, and
comes with a lot of drawbacks. Sure, you COULD configure your car to also
be able to ride across the river like a boat, but it's far easier to use a
bridge, or a barge. And now your car is harder to care for, and has a lot
more ways in which it could break.

>
>- If we can do, can I get any resource for implementing that on
>PostgreSQL which is open-source?
>
> You still have not told us your use case. That can help us to guide you to
the right solution. But for totally open source solutions there are not
many out there. Here are some:

* https://github.com/JumpMind/symmetric-ds
* https://github.com/bucardo/bucardo

You can also in theory use logical replication if you are on at least version 16:

* https://www.crunchydata.com/blog/active-active-postgres-16


Cheers,
Greg


Re: Execution history of a single query

2024-05-17 Thread Greg Sabino Mullane
Not directly through pg_stat_statements, no. But you can:

* Use log_min_duration_statement to spy on individual runs
* Periodically copy pg_stat_statements to another table, reset the stats,
and then you have some time-based buckets to play with (rough sketch below).
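
A rough sketch of that second idea:

create table if not exists pgss_history
  (snapped_at timestamptz default now(), like pg_stat_statements);

insert into pgss_history select now(), * from pg_stat_statements;

select pg_stat_statements_reset();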

Cheers,
Greg


Re: Finding "most recent" using daterange

2024-05-22 Thread Greg Sabino Mullane
This is a good candidate for a window function. Also note that nulls
already get sorted correctly by the DESC so no need to get 'infinity'
involved, although you could write 'DESC NULLS FIRST' to be explicit about
it.

with x as (select *,  row_number() over (partition by id order by
upper(dates) desc, lower(dates) desc) from example)
  select id,value,dates from x where row_number = 1;

 id | value |          dates
----+-------+-------------------------
  1 | b     | [2010-01-01,)
  2 | d     | [2010-01-01,2021-01-01)
  3 | g     | [2013-01-01,)
  4 | j     | [2010-01-01,2015-01-01)
(4 rows)


Cheers,
Greg


Re: Finding "most recent" using daterange

2024-05-22 Thread Greg Sabino Mullane
Oh, you are right - NULLS LAST, got my -infinity crossed with my infinity.
:)

Cheers,
Greg


Re: Timeout gets unset on a syntax error.

2024-05-29 Thread Greg Sabino Mullane
On Wed, May 29, 2024 at 6:29 AM ISHAN CHHANGANI . <
f20200...@hyderabad.bits-pilani.ac.in> wrote:

> Is there some code within Postgres that resets/removes the signals in case
> a query hits any error?
>

See the comment and call to disable_all_timeouts() in postgres.c, part of
the cleanup-after-exception efforts:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/tcop/postgres.c#l4356


Cheers,
Greg


Re: Postgresql 16.3 Out Of Memory

2024-06-03 Thread Greg Sabino Mullane
On Mon, Jun 3, 2024 at 6:19 AM Radu Radutiu  wrote:

> Do you have any idea how to further debug the problem?
>

Putting aside the issue of non-reclaimed memory for now, can you show us
the actual query? The explain analyze you provided shows it doing an awful
lot of joins and then returning 14+ million rows to the client. Surely the
client does not need that many rows?

You can probably also prevent OOM by lowering your work_mem and/or your
max_parallel_workers_per_gather.

Cheers,
Greg


Re: Password complexity/history - credcheck?

2024-06-23 Thread Greg Sabino Mullane
On Sun, Jun 23, 2024 at 5:30 AM Martin Goodson 
wrote:

> I believe that our security team is getting most of this from our
> auditors, who seem convinced that minimal complexity, password history
> etc are the way to go despite the fact that, as you say, server-side
> password checks can't really be implemented when the database receives a
> hash rather than a clear text password and password minimal complexity
> etc is not perhaps considered the gold standard it once was.
>
> In fact, I think they see a hashed password as a disadvantage.


Wow, full stop right there. This is a hill to die on.

Push back and get some competent auditors. This should not be a DBA's
problem. Your best bet is to use Kerberos, and throw the password
requirements out of the database realm entirely.

Also, the discussion should be about 2FA, not password history/complexity.

Cheers,
Greg


Planet Postgres and the curse of AI

2024-07-17 Thread Greg Sabino Mullane
I've been noticing a growing trend of blog posts written mostly, if not
entirely, with AI (aka LLMs, ChatGPT, etc.). I'm not sure where to raise
this issue. I considered a blog post, but this mailing list seemed a better
forum to generate a discussion.

The problem is two-fold as I see it.

First, there is the issue of people trying to game the system by churning
out content that is not theirs, but was written by a LLM. I'm not going to
name specific posts, but after a while it gets easy to recognize things
that are written mostly by AI.

These blog posts are usually generic, describing some part of Postgres
in an impersonal, mid-level way. Most of the time the facts are not
wrong, per se, but they lack nuances that a real DBA would bring to the
discussion, and often leave important things out. Code examples are often
wrong in subtle ways. Places where you might expect a deeper discussion are
glossed over.

So this first problem is that it is polluting the Postgres blogs with
overly bland, moderately helpful posts that are not written by a human, and
do not really bring anything interesting to the table. There is a place for
posts that describe basic Postgres features, but the ones written by humans
are much better. (yeah, yeah, "for now" and all hail our AI overlords in
the future).

The second problem is worse, in that LLMs are not merely gathering
information, but have the ability to synthesize new conclusions and facts.
In short, they can lie. Or hallucinate. However you want to call it, it's a
side effect of the way LLMs work. In a technical field like Postgres, this
can be a very bad thing. I don't know how widespread this is, but I was
tipped off about this over a year ago when I came across a blog suggesting
using the "max_toast_size configuration parameter". For those not
familiar, I can assure you that Postgres does not have, nor will likely
ever have, a GUC with that name.

As anyone who has spoken with ChatGPT knows, getting small important
details correct is not its forte. I love ChatGPT and actually use it daily.
It is amazing at doing certain tasks. But writing blog posts should not be
one of them.

Do we need a policy or a guideline for Planet Postgres? I don't know. It
can be a gray line. Obviously spelling and grammar checking is quite
okay, and making up random GUCs is not, but the middle bit is very hazy.
(Human) thoughts welcome.

Cheers,
Greg


Re: Planet Postgres and the curse of AI

2024-07-18 Thread Greg Sabino Mullane
> But to what degree exactly should that be allowed?

Somewhat ironically, here's a distinction ChatGPT and I came up with:

LLM-generated content: Content where the substantial part of the text is
directly created by LLMs without significant human alteration or editing.

Human-edited or reviewed content: Content that has been substantially
revised, corrected, or enhanced by a human after initial generation by
LLMs. This includes using spell and grammar checking, manual edits for
clarity or style, and content that reflects significant human input beyond
the original LLM output.


Re: Planet Postgres and the curse of AI

2024-07-23 Thread Greg Sabino Mullane
On Fri, Jul 19, 2024 at 3:22 AM Laurenz Albe 
wrote:

> I have no problem with that definition, but it is useless as a policy:
> Even in a blog with glaring AI nonsense in it, how can you prove that the
> author did not actually edit and improve other significant parts of the
> text?
>

Well, we can't 100% prove it, but we can have ethical guidelines. We
already have other guidelines that are open to interpretation (and plenty
of planet posts bend the rules quite often, IMO, but that's another post).


> Why not say that authors who repeatedly post grossly counterfactual or
> misleading content can be banned?
>

Banned is a strong word, but certainly they can have the posts removed, and
receive warnings from the planet admins. If the admins can point to a
policy, that helps. Perhaps as you hint at, we need a policy to not just
discourage AI-generated things, but also wrong/misleading things in general
(which was not much of a problem before LLMs arrived, to be honest).

Cheers,
Greg


Re: data checksums

2024-08-06 Thread Greg Sabino Mullane
The penalty is not "considerable", but it is not completely trivial either.
But it's more on the trivial side. Part of the problem is that it is hard
to measure, as it is very workload dependent. As to why it is not the
default, Postgres is very careful and conservative by default, and not
everybody was convinced that enabling checksums is worth the tradeoff,
especially (IIRC) with the lack of real-world examples of people
discovering issues thanks to these checksums. So yes, please enable and
share with us if the checksums catch anything.
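
If you want to check or flip the setting on an existing cluster, here is a
minimal sketch (the data directory path is an assumption, and pg_checksums
requires the cluster to be cleanly shut down first):

-- inside psql: are checksums already on?
SHOW data_checksums;

# from the shell, against a stopped cluster:
pg_checksums --check -D /var/lib/pgsql/16/data
pg_checksums --enable -D /var/lib/pgsql/16/data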

I think the argument is worth re-opening again, because (as far as I know),
all of the companies providing Postgres support, whether completely
cloud-managed, setting up a complex HA cluster, or just providing tuning
advice, have enabled checksums by default for many, many years. So a big +1
from me to finally making it the default. It's a de-facto default anyway at
this point.

Cheers,
Greg


Re: Column type modification in big tables

2024-08-08 Thread Greg Sabino Mullane
On Thu, Aug 8, 2024 at 2:39 PM Lok P  wrote:

> Can anybody suggest any other possible way here.
>

Sure - how about not changing the column type at all?

> one of the columns from varchar(20) to varchar(2)

ALTER TABLE foobar ADD CONSTRAINT twocharplease CHECK (length(mycol) <= 2)
NOT VALID;

> one of the columns from Number(10,2) to Numeric(8,2)

ALTER TABLE foobar ADD CONSTRAINT eightprecision CHECK (abs(mycol) < 10^6) NOT
VALID;

> two of the columns from varchar(20) to numeric(3)

This one is trickier, as we don't know the contents, nor why it is going to
numeric(3) - not a terribly useful data type, but let's roll with it and
assume the stuff in the varchar is a number of some sort, and that we don't
allow nulls:

ALTER TABLE foobar ADD CONSTRAINT onekorless CHECK (mycol::numeric(3) is
not null) NOT VALID;

You probably want to check on the validity of the existing rows: see the
docs on VALIDATE CONSTRAINT here:

https://www.postgresql.org/docs/current/sql-altertable.html
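
Once the existing rows have been cleaned up, each constraint can be validated
without blocking normal reads and writes, e.g. for the first one above:

ALTER TABLE foobar VALIDATE CONSTRAINT twocharplease;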

Cheers,
Greg


Re: Getting specific partition from the partition name

2024-08-08 Thread Greg Sabino Mullane
YYYY_MM_DD is already set up for sorting, so just do:

SELECT table_name FROM information_schema.tables WHERE table_name ~
'table_part_p' ORDER BY 1 DESC;

If you need to grab the numbers:

SELECT substring('table_part_p2022_03_04' from '([\d_]+)$');

Cheers,
Greg


Re: Column type modification in big tables

2024-08-09 Thread Greg Sabino Mullane
On Fri, Aug 9, 2024 at 6:39 AM Lok P  wrote:

> Thank you so much. Will definitely try to evaluate this approach. The Only
> concern I have is , as this data is moving downstream with exactly the same
> data type and length , so will it cause the downstream code to break while
> using this column in the join or filter criteria. Also I believe the
> optimizer won't be able to utilize this information while preparing the
> execution plan.
>

Yes, this is not as ideal as rewriting the table, but you asked for
other approaches :) As to the impact of your downstream stuff, I think
you have to try and see. Not clear what you mean by the optimizer, it's not
going to really care about numeric(10) versus numeric(8) or varchar(20) vs
varchar(2). It's possible the varchar -> numeric could cause issues, but
without real-world queries and data we cannot say.


>  Another thing , correct me if wrong, My understanding is  , if we want to
> run the "validate constraint" command after running this "check constraint
> with not valid" command, this will do a full table scan across all the
> partitions , but it's still beneficial as compared to updating the columns
> values for each rows. Correct me if I'm wrong.
>

Yes, it needs to scan the entire table, but it's a lightweight lock, won't
block concurrent access, will not need to detoast, and makes no table or
index updates. Versus an entire table rewrite which will do heavy locking,
take up tons of I/O, update all the indexes, and generate quite a lot of
WAL.

Cheers,
Greg


Re: Insert works but fails for merge

2024-08-11 Thread Greg Sabino Mullane
So it looks like the OP does not mind updating more than one row. If you
want to keep it simple and not do a lot of casting, consider using a CTE to
do a reverse-upsert and use a prepared statement. Prepare and cast once,
and have your app send the raw uncasted strings many, many times:

prepare foo(text,int,timestamptz) as with x as (update tab1 set mid=$2
where id=$1 returning 1)
  insert into tab1 select $1,$2,$3 where not exists (select 1 from x);

execute foo('5efd4c91-ef93-4477-840c-a723ae212d99', 123,
'2024-08-09T11:33:49.402585600Z');
execute foo('some_other_id', 456, '2024-08-11T21:44:55.8675309Z');
etc.

Your app/driver may or may not already do protocol-level statement
prepare/execute automagically, so test that way first.

> It's version 15.4 postgres.


Keep on the latest revision. Right now, that's 15.8. Upgrading revisions is
quick and painless.

Cheers,
Greg


Re: Insert works but fails for merge

2024-08-13 Thread Greg Sabino Mullane
I just remembered that one of the complaints was not wanting to worry about
looking up the data types. In my previous example, you can also leave out
the types and Postgres will do the right thing. I prefer the explicit data
type version for clarity, but though I would provide this one for
completeness:

prepare foo as with x as (update tab1 set mid=$2 where id=$1 returning 1)
  insert into tab1 select $1,$2,$3 where not exists (select 1 from x);

Cheers,
Greg


Re: Column type modification in big tables

2024-08-13 Thread Greg Sabino Mullane
On Sat, Aug 10, 2024 at 5:06 PM Lok P  wrote:


> Can someone through some light , in case we get 5-6hrs downtime for this
> change , then what method should we choose for this Alter operation?
>

We can't really answer that. Only you know what resources you have, what
risk/reward you are willing to handle, and how long things may take. For
that latter item, your best bet is to try this out on the same/similar
hardware and see how long it takes. Do a smaller table and extrapolate if
you need to. Or promote one of your replicas offline and modify that. I've
given you a low-risk / medium-reward option with check constraints, but for
the ALTER TABLE options you really need to try it and see (on non-prod).

it seems the "USING" clause takes more time as compared to normal ALTER.
> But again I don't see any way to see the progress and estimated completion
> time. Can you share your thoughts on this?


There should be no difference if they are doing the same conversion.

> Will this approach be faster/better as compared to the simple "alter table
> alter column approach" as above


Seems a lot more complicated to me than a simple ALTER. But measurement is
key. Create a new test cluster using pgBackRest or whatever you have. Then
run your ALTER TABLE and see how long it takes (remember that multiple
columns can be changed in a single ALTER TABLE statement).
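
A rough sketch of that last point, with made-up column names (the USING is
only needed for the varchar-to-numeric case); the table gets rewritten once
rather than once per column:

ALTER TABLE foobar
  ALTER COLUMN col1 TYPE varchar(2),
  ALTER COLUMN col2 TYPE numeric(8,2),
  ALTER COLUMN col3 TYPE numeric(3) USING col3::numeric(3);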

Cheers,
Greg


Re: Column type modification in big tables

2024-08-17 Thread Greg Sabino Mullane
On Thu, Aug 15, 2024 at 4:41 PM Lok P  wrote:

> Additionally , if we are okay with the 7.5hrs of down time , is my
> calculation/extrapolation of total time consumption based on a sample
> table,  for direct alter, accurate? Because, in that case , I was thinking
> it's less complex and also less error prone to just do it in a single alter
> command rather than going for multiple steps of detach, alter, attach
> partition.
>

Well, it's meant to get you a ballpark figure, but yes, it seems as though
you will probably be okay. But for something this critical involving
production downtime, I would try out the exact command and see how long it
takes on a test system. Restore a backup (you have backups I hope) or use
pg_basebackup to make a copy of your prod system somewhere.

Cheers,
Greg


Re: Planet Postgres and the curse of AI

2024-08-20 Thread Greg Sabino Mullane
On Tue, Jul 23, 2024 at 12:45 PM Avinash Vallarapu <
avinash.vallar...@gmail.com> wrote:

> However, I do agree with Lawrence that it is impossible to prove whether
> it is written by AI or a human.
> AI can make mistakes and it might mistakenly point out that a blog is
> written by AI (which I know is difficult to implement).
>

Right - I am not interested in "proving" things, but I think a policy to
discourage overuse of AI is warranted.

People may also use AI generated Images in their blogs, and they may be
> meaningful for their article.
> Is it only the content or also the images ?  It might get too complicated
> while implementing some rules.
>

Only the content, the images are perfectly fine. Even expected, these days.


> Ultimately, Humans do make mistakes and we shouldn't discourage people
> assuming it is AI that made that mistake.
>

Humans make mistakes. AI confidently hallucinates.


Re: Planet Postgres and the curse of AI

2024-08-20 Thread Greg Sabino Mullane
On Fri, Jul 19, 2024 at 3:22 AM Laurenz Albe 
wrote:

> Why not say that authors who repeatedly post grossly counterfactual or
> misleading content can be banned?
>

I like this, and feel we are getting closer. How about:

"Posts should be technically and factually correct. Use of AI should be
used for minor editing, not primary generation"

(wordsmithing needed)

Cheers,
Greg


Re: How to validate restore of backup?

2024-08-22 Thread Greg Sabino Mullane
On Thu, Aug 22, 2024 at 8:49 AM o1bigtenor  wrote:

>
>
> On Thu, Aug 22, 2024 at 6:24 AM Ron Johnson 
> wrote:
>
>> That's great on small databases.  Not so practical when they're big.
>>
>> So - - - - what is the recommended procedure for 'large' databases?
>

Use a real backup system like pgBackRest. Stop using pg_dump.
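
As a rough sketch of how to test a restore without touching production
(stanza name, scratch path, and port are assumptions):

pgbackrest --stanza=main info
pgbackrest --stanza=main --pg1-path=/tmp/restore_test --archive-mode=off restore
pg_ctl -D /tmp/restore_test -o "-p 5433" start
psql -p 5433 -c "select count(*) from some_important_table"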

Cheers,
Greg


Re: Using PQsocketPoll() for PIPELINE mode

2024-08-27 Thread Greg Sabino Mullane
On Tue, Aug 27, 2024 at 9:20 AM Dominique Devienne 
wrote:

> Once again, this is late, although my original questions are now 2 weeks
> old.
> After all, PQsocketPoll() has not been released yet officially. Thanks,
> --DD
>

As this is so new, you might have better luck on -hackers than here. I've
not poked at it enough to give an answer yet.

Cheers,
Greg


Re: Pgbackrest specifying the default DB necessary/correct way ?

2024-08-28 Thread Greg Sabino Mullane
On Wed, Aug 28, 2024 at 1:39 AM KK CHN  wrote:

> In this DB server I have other databases than the default "edb"
> database. Specifying the above line as pg1-database=edb   // I am
> not sure this line is necessary or not ?
>

The pgbackrest process needs to connect to the database, which means it
needs a user and database. You need this variable if you do not have the
default database, "postgres". If you have a database named postgres, you
can leave this out. Otherwise, yes, it is necessary.
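
A minimal sketch of the relevant part of pgbackrest.conf (the stanza name and
paths are assumptions):

[demo]
pg1-path=/data/edb/as16/data
pg1-database=edb

[global]
repo1-path=/var/lib/pgbackrest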

> pg1-database=edb // specifying like this, will it block other databases
> on this server to get backed up ?   IF yes how can I overcome this ?
>

pgBackRest works on a cluster level, so *all* the databases are backed up.
Indeed, it is not possible to back up only some of the databases. It's the
whole cluster.

> ( I am just learning and exploring PgbackRest)  found online some
> reference configurations so using like this )


Probably best to stick to the official docs; this section in particular is
worth a read:

https://pgbackrest.org/user-guide-rhel.html

Cheers,
Greg


Re: Ghost data from failed FDW transactions?

2024-08-28 Thread Greg Sabino Mullane
On Tue, Aug 27, 2024 at 9:03 PM Jacob Biesinger 
wrote:

> I'm scratching my head at a few rows in the root DB, where it seems the
> corresponding tenant transaction rolled back, but the root DB transaction
> committed
>
...

> Before I jump into particulars, does this sound like expected behavior?
>

No, it sounds like something is going wrong. Your setup as described should
work to keep both sides in sync.

Through the magic of postgres_fdw, row triggers, and distributed
> transactions,
>

Can you expand on "distributed transactions" here?

Cheers,
Greg


Re: PgBackRest Full backup and N/W reliability

2024-08-29 Thread Greg Sabino Mullane
On Thu, Aug 29, 2024 at 2:21 AM KK CHN  wrote:

> I am doing a full backup  using PgBackRest from a production server to
> Reposerver.
>
...

> If so, does the backup process start  again from  scratch ?   or it
> resumes from  where the backup process is stopped   ?
>

It resumes. You will see a message like this:

WARN: resumable backup 20240829-091727F of same type exists -- invalid
files will be removed then the backup will resume

> Any suggestions much appreciated
>

Boost your process-max as high as you are able to speed up your backup time.
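
That can go in the [global] section of pgbackrest.conf (process-max=8) or on
the command line; the stanza name and the value 8 here are just examples:

pgbackrest --stanza=demo --type=full --process-max=8 backup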

Cheers,
Greg


Re: PgBackRest client_loop: send disconnect: Connection reset

2024-08-29 Thread Greg Sabino Mullane
On Thu, Aug 29, 2024 at 9:31 AM KK CHN  wrote:

> "Unable to acquire lock on file '/tmp/pgbackrest/Repo-backup.lock'
>

When this happens, take a look inside this file. If there is another
pgbackrest process running, the pid will be inside that file. Kill that
process before trying to run another backup.
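
A rough sketch of that check (the pid 12345 is just a placeholder for
whatever is in the lock file):

cat /tmp/pgbackrest/Repo-backup.lock   # contains the pid holding the lock
ps -fp 12345                           # confirm it really is a stuck pgbackrest
kill 12345                             # then re-run the backup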

Cheers,
Greg


Re: PgBackRest full backup first time : Verification

2024-08-30 Thread Greg Sabino Mullane
>
> database size: 146.9GB, database backup size: 146.9GB
> repo1: backup size: 20.6GB


It looks to me as though everything is working as expected. You took a full
backup of your system, which was around 147GB - most of which is in a
tablespace. It got compressed down to 20GB. You then took two incremental
backups, which are by definition much smaller and take a short amount of
time to run.

> I can't  restore back to the DB server right now to test it as it is a
> production server and down time granting is not immediately possible to
> test it ...
>

You do not have to restore to the same server or the same directory. You
can keep your production system running and do a test restore somewhere
else. Just make sure you specify --archive-mode=off (which prevents the WAL
from being shipped from the restored system to your existing production
repo)

> [root@db1 data]# du -h
> returns   537 G
>

This is not relevant, as pgbackrest only cares about the Postgres data
directory (/data/edb/as16/data/)

> 149G    /data/edb/as16/tablespace/ESS


This is where the rest of your backup size is coming from. Postgres and
pgbackrest consider this part of the data directory.

You really should spin up a test Postgres cluster and get very familiar
with how pgbackrest works, rather than continuing to flounder about on a
production system and rely on mailing lists to answer a bunch of questions
for you. While we can answer these questions, you will learn better from
experimenting and trying things out yourself on a non-prod system.

Cheers,
Greg


Re: Using left joins instead of inner joins as an optimization

2024-09-06 Thread Greg Sabino Mullane
On Fri, Sep 6, 2024 at 7:05 AM Xavier Solomon 
wrote:

> > explain select b_id from b natural left join a;
> results in a `Seq Scan on b`. Whereas the query
> > explain select b_id from b natural join a;
> results in a join with sequential scans on both a and b.
>

I think your example is a little too contrived. Try explaining
select * from b natural left join a;
and you should see the plans become equivalent again.

I would expect a query that left joins but only pulls data from one table
to be rare in real life. Yes, in an ideal world the inner join would be
smart enough to skip the scan on a entirely, but it's kind of a moot
point outside of odd select clauses.

- Is it a bad idea to use left joins to optimize this even if semantically
> an inner join would be correct?
>

Not at all - if it works for you, go ahead. But I'm dubious you will gain
much for queries that actually make use of the left join, at least for
relatively simply selects.

Cheers,
Greg


Re: How effectively do the indexing in postgres in such cases

2024-09-09 Thread Greg Sabino Mullane
Your questions are a little too vague to answer well, but let me try a bit.

> 1)In the query below , if the optimizer chooses tab1 as the driving table,
> the index on just col1 should be enough or it should be (col1, tab1_id)?
>

No way to tell without trying it yourself. We need information on how the
tables are joined, the cardinality, general distribution, etc. But as a
rough general rule, yes, indexes on the column of interest should be able
to handle the job well by themselves.

> 2)In scenarios where the customer has a lot of matching transactions (say
> in millions) post all the filters applied , and as the customer has to just
> see the latest 100 rows transaction data, the sorting will be a bottleneck.
> So what can be done to make such types of queries to return the latest
> transactions in quick time on the search screen?
>

Sorting can be quick if you hit an index (b-trees are already sorted):
Postgres can read just the first X rows in index order and does not need to
sort the whole thing. So a well-designed index is the key here.
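
As a rough sketch, ignoring the join for a moment and using the column names
from your sample query below, an index like this lets the planner read the
newest rows directly in index order and stop after 100:

CREATE INDEX CONCURRENTLY tab1_col1_ts_idx
    ON tab1 (col1, create_timestamp DESC);

SELECT * FROM tab1
WHERE col1 = <:input_col1>
ORDER BY create_timestamp DESC
LIMIT 100;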


> 3)As here also the count has to happen in the first step to make the
> customer know the total number of rows(which may be in millions), so what
> additional index will support this requirement?
>

Again, a very vague question, but for things that are in millions, an
estimate is usually sufficient, so you might be able to do something like
SELECT count(*) FROM mytab WHERE mydate BETWEEN x AND y; and use that as
your answer. Compare to the full query to see how close it is. You might
even have cutoffs, where if the results of that first one is < 10,000,
switch to a more accurate version which has more filtering (i.e. the joins
and more where conditions).


> Or if any other optimization strategy we can follow for catering to such
> queries?
>
> select * from tab1, tab2
> where tab1.part_date between '1-jan-2024' and '31-jan-2024'
> and tab1.part_date=tab2.part_date
> and tab1.tab1_id=tab2.tab2_id
> and tab1.col1=<:input_col1>
> and tab2.col2=<:input_col2>
> order by tab1.create_timestamp desc
> limit 100 offset 100;
>

It probably would help to see exact tables and queries. Why are you joining
on part_date? Is tab_id unique to either table? Running EXPLAIN on these
while you try out indexes and change the joins, etc. is a great exercise to
help you learn how Postgres works. As far as asking on lists for help with
specific queries, there is a range between totally abstract and generic
queries that nobody can help you with, and large, complex specific queries
that nobody wants to unravel and help you with. You are definitely on the
former side: try to create some actually runable sample queries that are
small, self-contained, and generate the problem you are trying to solve.

Cheers,
Greg


Re: infinite loop in an update statement

2024-09-11 Thread Greg Sabino Mullane
On Wed, Sep 11, 2024 at 6:14 AM Fabrice Chapuis 
wrote:

> status = active
>> wait event = NULL
>> wait event type = NULL
>>
>
That seems quite unlikely. Perhaps you are seeing the pg_stat_activity
query itself? Try this:

select state, now()-state_change, wait_event_type, wait_event, query
from pg_stat_activity where query ~ 'SELECT MIN' and pg_backend_pid() <>
pid;

Cheers,
Greg


Re: Ghost data from failed FDW transactions?

2024-09-11 Thread Greg Sabino Mullane
Any updates on this?

A few replies from me inline:

On Wed, Aug 28, 2024 at 12:18 PM Jacob Biesinger 
wrote:

> There aren't many details in the docs around failure modes... is there
> anything there that could cause this issue?
>

Nothing that I know of, but it's possible there is some sort of weird race
condition/bug. But frankly, it's far more likely to be some sort of
application bug  / missed behavior. If you could trigger it at will by
developing a self-contained test, that would be ideal.


> The ordering shouldn't matter, but this also means that error handling
> (e.g., when there is a conflict with the `EXCLUDE "serialNumber"`
> constraint) may not stop the second `INSERT` statement from being attempted
> (and rejected by postgres). But I think that's all client-side details that
> shouldn't affect the FDW txn semantics, right?
>

Correct, but it's hard to say for sure without seeing the code or knowing
exactly how the serialization errors are being handled. I'm not clear on
how the inserts are run "in parallel" if your flow is:

> BEGIN;
> INSERT INTO "devices" ("orgId", "patientId", "deviceId", "data") ...
> INSERT INTO "devices" ("orgId", "patientId", "deviceId", "data") ...
>
> SELECT * FROM "rootDb"."assets";
>
> -- execute some logic client-side, nothing touching the DB
>
> UPDATE "rootDb"."assets" WHERE ...;
>
> COMMIT;


Perhaps expand on that, because it almost sounds like you have two
processes trying to talk to the same backend at once?

Cheers,
Greg


Re: Logical replication without direct link between publisher and subscriber?

2024-09-11 Thread Greg Sabino Mullane
>
> Dumping changes periodically, sending them directly or uploading to cloud
> storage and then downloading and applying them on the subscriber side.
> But maybe there's a simpler option someone here knows about?


How about using WAL shipping to populate a replica, and either query that
directly or use that as the publisher to your sequestered node?
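
A minimal sketch of the file-based variant (the drop directory is an
assumption; swap the cp commands for your cloud upload/download tooling).
The replica still needs to start from a base backup of the primary:

# primary, postgresql.conf:
archive_mode = on
archive_command = 'cp %p /mnt/wal_drop/%f'

# sequestered replica, postgresql.conf (plus an empty standby.signal file):
restore_command = 'cp /mnt/wal_drop/%f %p'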

Cheers,
Greg


Re: Database schema for "custom fields"

2024-09-11 Thread Greg Sabino Mullane
I'd go with option 2 (jsonb), as it's likely already well supported by your
applications, while the other approaches will require a good bit of
customization. JSONB can be indexed, so performance should be on par with
"traditional" tables.

Cheers,
Greg


Re: Performance degrade on insert on conflict do nothing

2024-09-11 Thread Greg Sabino Mullane
On Wed, Sep 11, 2024 at 1:02 AM Durgamahesh Manne 
wrote:

> Hi
>  createdat  | timestamp with time zone |   | not null | now()
>| plain| |  |
>  modified   | timestamp with time zone |   | not null | now()
>| plain| |  |
> Triggers:
> i_dictionary_createdat BEFORE INSERT ON dictionary FOR EACH ROW
> EXECUTE FUNCTION update_createdat_col()
> i_dictionary_modified BEFORE INSERT OR UPDATE ON dictionary FOR EACH
> ROW EXECUTE FUNCTION update_modified_col()
>


> How do we improve this query performance without taking more cpu?
>

1. Lose the first trigger. I don't know exactly what those functions do,
but if they are only for updating those columns, just remove the first one
and let postgres handle it via NOT NULL DEFAULT.

2. Change the second trigger to just ON UPDATE

3. Remove that second trigger as well, and have the app populate that
column (assuming that is all it does), e.g. UPDATE dictionary SET lang =
'foo', modified = DEFAULT, modified_by = 'foo' where tid = 12345;

4. Remove any indexes you do not absolutely need
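
For the first two items, a rough sketch using the trigger and function names
quoted above:

DROP TRIGGER i_dictionary_createdat ON dictionary;  -- NOT NULL DEFAULT now() covers inserts
DROP TRIGGER i_dictionary_modified ON dictionary;
CREATE TRIGGER i_dictionary_modified BEFORE UPDATE ON dictionary
    FOR EACH ROW EXECUTE FUNCTION update_modified_col();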

Cheers,
Greg


Re: Effects of REVOKE SELECT ON ALL TABLES IN SCHEMA pg_catalog FROM PUBLIC

2024-09-12 Thread Greg Sabino Mullane
On Thu, Sep 12, 2024 at 12:52 AM Andreas Joseph Krogh 
wrote:

> I know PG is not designed for this, but I have this requirement
> nonetheless…
> I think preventing “most users and tools" from seeing/presenting this
> information is “good enough”.
>

As pointed out, there are very many workarounds. This is security theater.

If read-access (SELECT) on views in public-schema will still works, and
> pg_dump/restore etc. also works, this sounds like a solution to me.
>

pg_dump will absolutely not work without access to the system catalogs.

If you want to prevent information, stop direct access and make the
application call user functions.

(Also note that determining if a database or user exists does not even
require a successful login to the cluster.)

Cheers,
Greg


Re: Effects of REVOKE SELECT ON ALL TABLES IN SCHEMA pg_catalog FROM PUBLIC

2024-09-12 Thread Greg Sabino Mullane
On Thu, Sep 12, 2024 at 9:12 AM Dominique Devienne 
wrote:

> On Thu, Sep 12, 2024 at 3:06 PM Greg Sabino Mullane 
> wrote:
> > (Also note that determining if a database or user exists does not even
> require a successful login to the cluster.)
>
> Hi. How so? I was not aware of such possibilities.
> Can you please give pointers (docs, examples) of this?
>

$ psql -U eve
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  role "eve" does not exist

$ psql -U postgres -d theater
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  database "theater" does not exist

$ psql -U alice -d template1
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed:
FATAL:  password authentication failed for user "alice"

Cheers,
Greg


Re: Effects of REVOKE SELECT ON ALL TABLES IN SCHEMA pg_catalog FROM PUBLIC

2024-09-12 Thread Greg Sabino Mullane
On Thu, Sep 12, 2024 at 9:21 AM Andreas Joseph Krogh 
wrote:

> Yes, it *is* theater, but that doesn't prevent “compliance people” to
> care about it. We have to take measures to prevent “information leaks”.
>

*shrug* Then the compliance people are not good at their jobs, frankly.

But if it works for you, go ahead. As Tom said, it will work 95% of the
time. But it will break things that should work, and it will not prevent
the ability to get the information in other ways. To be clear, we never
recommend messing with the system catalogs, and this falls under the
umbrella of messing with the system catalogs.

Cheers,
Greg


Re: Will hundred of thousands of this type of query cause Parsing issue

2024-09-13 Thread Greg Sabino Mullane
On Fri, Sep 13, 2024 at 11:35 AM Wong, Kam Fook (TR Technology) <
kamfook.w...@thomsonreuters.com> wrote:

> 1) Where does query parsing occur?
>
>
Always on the server side, although your driver may do something as well.

2) Will this cause extra parsing to the posgress DB?
>

Yes


>   Any pg system table to measure parsing?
>

No

You want to send an array of values to the same query, so it can be
prepared once, like so:

SELECT abc, efg
FROM docloc a
JOIN collection b USING (collection_name)
WHERE a.column1 = ANY($1)
AND a.stage_id = (
  select max(stage_id) from collection_pit c
  where c.collection_name = a.collection_name
  and c.pid_id < $2 and c.stage_code = $3
);

Then you can always pass in three arguments, the first being an array of
all the column1 values you want.
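
For example (the prepared statement name, the parameter types, and the values
in the EXECUTE are guesses; adjust them to match your columns):

PREPARE doc_lookup(text[], numeric, text) AS
SELECT abc, efg
FROM docloc a
JOIN collection b USING (collection_name)
WHERE a.column1 = ANY($1)
AND a.stage_id = (
  select max(stage_id) from collection_pit c
  where c.collection_name = a.collection_name
  and c.pid_id < $2 and c.stage_code = $3
);

EXECUTE doc_lookup(ARRAY['val1','val2','val3'], 12345, 'X');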

You might also want to get familiar with plan_cache_mode:
https://www.postgresql.org/docs/current/sql-prepare.html

Cheers,
Greg


Re: IO related waits

2024-09-17 Thread Greg Sabino Mullane
On Mon, Sep 16, 2024 at 11:56 PM veem v  wrote:

> So what can be the caveats in this approach, considering transactions
> meant to be ACID compliant as financial transactions.
>

Financial transactions need to be handled with care. Only you know your
business requirements, but as Christophe pointed out, disabling
synchronous commit means your application may think a particular
transaction has completed when it has not. Usually that's a big red flag
for financial applications.

> we are using row by row transaction processing for inserting data into the
> postgres database and commit is performed for each row.


This is a better place to optimize. Batch many rows per transaction. Remove
unused indexes.
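
A minimal sketch of the difference, with a made-up table:

-- instead of one INSERT plus COMMIT per row:
BEGIN;
INSERT INTO ledger (account_id, amount, created_at)
VALUES (1, 10.00, now()),
       (2, 22.50, now()),
       (3, 31.75, now());   -- hundreds or thousands of rows per transaction
COMMIT;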

> flushing of the WAL to the disk has to happen anyway(just that it will be
> delayed now), so can this method cause contention in the database storage
> side if the speed in which the data gets ingested from the client is not
> getting written to the disk , and if it can someway impact the data
> consistency for the read queries?
>

Not quite clear what you are asking here re data consistency. The data will
always be consistent, even if synchronous_commit is disabled. The only
danger window is on a server crash.

(Keep in mind that RDS is not Postgres, so take tuning recommendations and
advice with a grain of salt.)

Cheers,
Greg


Re: question on plain pg_dump file usage

2024-09-17 Thread Greg Sabino Mullane
On Tue, Sep 17, 2024 at 8:22 AM Zwettler Markus (OIZ) <
markus.zwett...@zuerich.ch> wrote:

> pg_dump -F p -f dump.sql …
>
> sed -i "s/old_name/new_name/g"
>
> psql -f dump.sql …
>

Why not rename afterwards? Just "pg_dump mydb | psql -h newhost -f -" and
rename things via ALTER. Certainly much safer than a global replace via sed.
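
Depending on what old_name/new_name actually refers to, the rename afterwards
is typically a one-liner, e.g.:

ALTER SCHEMA old_name RENAME TO new_name;
-- or
ALTER ROLE old_name RENAME TO new_name;
-- or
ALTER DATABASE old_name RENAME TO new_name;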

Cheers,
Greg


Re: Question about when PostgreSQL 11.0 was released

2021-03-09 Thread Greg Sabino Mullane


Ram Bysani asked:
> I am not finding the release cadence for the PostgreSQL databases.
> Please let me know how I can find the base and dot version release dates for:

You can see all the release dates on one page here:

https://bucardo.org/postgres_all_versions

--
Greg Sabino Mullane g...@turnstep.com
PGP Key: 0x14964AC8
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8


Re: Disk is filling up with large files. How can I clean?

2024-10-09 Thread Greg Sabino Mullane
On Wed, Oct 9, 2024 at 4:10 AM Mikael Petterson 
wrote:

> Hi,
>
> I find our disk is filling up.
>
> sudo find /var/lib -type f -size +100M -exec ls -lh {} \; | awk '{ print
> $9 ": " $5 }'
>
...

Those files only add up to about 30GB. That's pretty small these days:
time for a bigger disk? Or perhaps the space is elsewhere; you probably want
to do something like:

sudo find / -xdev -maxdepth 2 -exec du -sh {} \; | grep -E 'G|M' | sort -h
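
And to see how much of that space Postgres itself thinks it is using, from
within psql:

SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
ORDER BY pg_database_size(datname) DESC;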

Cheers,
Greg


Re: Question on indexes

2024-10-11 Thread Greg Sabino Mullane
(please start a new thread in the future rather than replying to an
existing one)

You cannot query on b and use an index on (a,b) as you observed. However,
you can have two indexes:

index1(a)
index2(b)

Postgres will be able to combine those when needed in the case where your
WHERE clause needs to filter by both columns. So then you no longer need
the two-column index.
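
A quick sketch, with table and column names made up:

CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);

-- a query filtering on both columns can combine the two via a BitmapAnd:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;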

Cheers,
Greg

