Identity/Serial Column In Subscriber's Tables

2021-12-09 Thread Avi Weinberg
Hi Experts,

What is the best practice when dealing with an Identity/Serial column in logical 
replication?
Should the subscriber table also have this column defined as Identity/Serial, or 
should it be defined as a regular integer?
Could leaving it as Identity/Serial on the subscriber tables potentially 
cause a problem?

Thanks!
IMPORTANT - This email and any attachments is intended for the above named 
addressee(s), and may contain information which is confidential or privileged. 
If you are not the intended recipient, please inform the sender immediately and 
delete this email: you should not copy or use this e-mail for any purpose nor 
disclose its contents to any person.


Use tsquery to check out occurrence of the same phrase in a cell

2021-12-09 Thread Shaozhong SHI
Are there good examples for using tsquery to check out occurrence of the
same phrase in a cell?

Regards,

David


Re: performance expectations for table(s) with 2B recs

2021-12-09 Thread Marc Millas
Partitioning?
If you have some idea of how this data is accessed, splitting those
big tables into partitions may help:
- vacuum is done at the partition level,
- indexes are partitioned too, so much easier to manage (reindex is local to a
given partition, so it is quite easy to reindex the whole thing one partition
at a time).

It is also great for performance, IF the partition key is in the WHERE clause.
If not, Postgres will need to aggregate and sort the results gathered from
each partition. It could still be efficient if it allows parallel execution.
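
To make that concrete, a hedged sketch (table and column names are invented
for illustration; works on PostgreSQL 11's declarative partitioning):

CREATE TABLE events (
    event_time  timestamptz NOT NULL,
    payload     text
) PARTITION BY RANGE (event_time);

-- one partition per year; each can be vacuumed/reindexed on its own
CREATE TABLE events_2021 PARTITION OF events
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- partition pruning kicks in when the partition key is in the WHERE clause:
SELECT count(*) FROM events
WHERE event_time >= '2021-06-01' AND event_time < '2021-07-01';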

my 2 cents


Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com



On Wed, Dec 8, 2021 at 8:45 PM David Gauthier 
wrote:

> 11.5 on linux
> server = VM provided by our IT dept (IOW, can be grown if needed)
> DB itself is on NFS
>
> So far, the tables I have in my DB have relatively low numbers of records
> (most are < 10K, all are < 10M).  Things have been running great in terms
> of performance.  But a project is being brainstormed which may require some
> tables to contain a couple billion records.
>
> I'm familiar with the need to properly index columns in these tables which
> will be used for table joining and query performance (candidates are
> columns used in query predicate, etc...).  Let's assume that's done right.
> And let's assume that the biggest table will have 20 varchars (<= 16 chars
> per) and 20 ints.  No blobs or jsonb or anything like that.
>
> What else should I be worried about ?
>
> I suspect that part of why things are running really well so far is that
> the relatively small amounts of data in these tables end up in the DB
> cache and disk I/O is kept to a minimum.  Will that no longer be the case
> once queries start running on these big tables ?
>
> What about DBA stuff... vacuum and fragmentation and index maintenance,
> etc... ?
>
> I don't want to step into this completely blind.  Any warnings/insights
> would be appreciated.
>


Re: Identity/Serial Column In Subscriber's Tables

2021-12-09 Thread Laurenz Albe
On Thu, 2021-12-09 at 09:13 +, Avi Weinberg wrote:
> What is the best practice when dealing with an Identity/Serial column in logical 
> replication?
> Should the subscriber table also have this column defined as Identity/Serial, 
> or should it be defined as a regular integer?
> Could leaving it as Identity/Serial on the subscriber tables potentially 
> cause a problem?

It does not really matter whether you use "serial" or an identity column on
the logical standby.  The sequence on the standby won't be advanced either way.

The best way to prevent conflicts from happening is *not* to modify
the data on the standby.
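
For completeness: if the standby is ever promoted and has to accept writes,
the sequence can be advanced past the replicated values first. A sketch, with
invented table and column names:

SELECT setval(pg_get_serial_sequence('mytable', 'id'),
              (SELECT max(id) FROM mytable));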

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





help

2021-12-09 Thread Willy Skjæveland



Detecting repeated phrase in a string

2021-12-09 Thread Shaozhong SHI
Does anyone know how to detect repeated phrase in a string?

Is there any such function?

Regards,

David


Re: Detecting repeated phrase in a string

2021-12-09 Thread Peter J. Holzer
On 2021-12-09 12:38:15 +, Shaozhong SHI wrote:
> Does anyone know how to detect repeated phrase in a string?

Use regular expressions with backreferences:

bayes=> select regexp_match('foo wikiwiki bar', '(.+)\1');
╔══╗
║ regexp_match ║
╟──╢
║ {o}  ║
╚══╝
(1 row)

"o" is repeated in "foo".

bayes=> select regexp_match('fo wikiwiki bar', '(.+)\1');
╔══╗
║ regexp_match ║
╟──╢
║ {wiki}   ║
╚══╝
(1 row)

"wiki" is repeated in "wikiwiki".

bayes=> select regexp_match('fo wikiwi bar', '(.+)\1');
╔══╗
║ regexp_match ║
╟──╢
║ (∅)  ║
╚══╝
(1 row)

nothing is repeated.

Adjust the expression within the parentheses if you want to match something
more specific than any sequence of one or more characters.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


RE: Identity/Serial Column In Subscriber's Tables

2021-12-09 Thread Avi Weinberg
Hi Laurenz

Thanks for your answer.

If I do not modify data on the subscriber side, is the best practice to not 
define the column as identity, or is it better to keep the subscriber's table 
definition as close as possible to the publisher's and leave it as identity?

Thanks

-Original Message-
From: Laurenz Albe [mailto:laurenz.a...@cybertec.at]
Sent: Thursday, December 9, 2021 1:51 PM
To: Avi Weinberg ; pgsql-general@lists.postgresql.org
Subject: Re: Identity/Serial Column In Subscriber's Tables

On Thu, 2021-12-09 at 09:13 +, Avi Weinberg wrote:
> What is the best practice when dealing with an Identity/Serial column in logical 
> replication?
> Should the subscriber table also have this column defined as Identity/Serial, 
> or should it be defined as a regular integer?
> Could leaving it as Identity/Serial on the subscriber tables potentially 
> cause a problem?

It does not really matter whether you use "serial" or an identity column on the 
logical standby.  The sequence on the standby won't be advanced either way.

The best way to prevent conflicts from happening is *not* to modify the data on 
the standby.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com





Re: Detecting repeated phrase in a string

2021-12-09 Thread Shaozhong SHI
Hi, Peter,

How do I define a word boundary, for example using
^, a space, or $,

so that the following can be done:

"fox fox" is a repeat;

"foxfox" is not a repeat, just one word.

Regards,

David

On Thu, 9 Dec 2021 at 13:35, Peter J. Holzer  wrote:

> On 2021-12-09 12:38:15 +, Shaozhong SHI wrote:
> > Does anyone know how to detect repeated phrase in a string?
>
> Use regular expressions with backreferences:
>
> bayes=> select regexp_match('foo wikiwiki bar', '(.+)\1');
> ╔══╗
> ║ regexp_match ║
> ╟──╢
> ║ {o}  ║
> ╚══╝
> (1 row)
>
> "o" is repeated in "foo".
>
> bayes=> select regexp_match('fo wikiwiki bar', '(.+)\1');
> ╔══╗
> ║ regexp_match ║
> ╟──╢
> ║ {wiki}   ║
> ╚══╝
> (1 row)
>
> "wiki" is repeated in "wikiwiki".
>
> bayes=> select regexp_match('fo wikiwi bar', '(.+)\1');
> ╔══╗
> ║ regexp_match ║
> ╟──╢
> ║ (∅)  ║
> ╚══╝
> (1 row)
>
> nothing is repeated.
>
> Adjust the expression within the parentheses if you want to match something
> more specific than any sequence of one or more characters.
>
> hp
>
> --
>_  | Peter J. Holzer| Story must make more sense than reality.
> |_|_) ||
> | |   | h...@hjp.at |-- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |   challenge!"
>


Re: Detecting repeated phrase in a string

2021-12-09 Thread Andreas Joseph Krogh

On Thursday, 9 December 2021 at 15:46:05, Shaozhong SHI <
shishaozh...@gmail.com> wrote: 


Hi, Peter, 

How to define word boundary as either by using 
^ , space, or $ 

So that the following can be done 

fox fox is a repeat 

foxfox is not a repeat but just one word. 

Do you want a repeated phrase (a list of words) or repeated words? 
For repeated words (including unicode-chars) you can do: 

(\b\p{L}+\b)(?:\s+\1)+ 

I'm not quite sure how to translate this to PG, but in Java it works. 



-- 
Andreas Joseph Krogh 
CTO / Partner - Visena AS 
Mobile: +47 909 56 963 
andr...@visena.com  
www.visena.com  
  


Re: CTE Materialization

2021-12-09 Thread Paul van der Linden
This one quite nicely explains it:
https://stackoverflow.com/questions/14897816/how-can-i-prevent-postgres-from-inlining-a-subquery

On Wed, Dec 8, 2021 at 3:14 AM David G. Johnston 
wrote:

> On Tue, Dec 7, 2021 at 6:40 PM Дмитрий Иванов 
> wrote:
>
>> I beg your pardon.
>> The problem is more or less clear to me, but the solution is not. What
>> does the "hack is to add an "offset 0" to the query" suggest? Thank you.
>>
>>
> A subquery with a LIMIT clause cannot have where clause expressions in
> upper parts of the query tree pushed down it without changing the overall
> query result - something the planner is not allowed to do.  For the hack,
> since adding an actual LIMIT clause doesn't make sense you omit it, but
> still add the related OFFSET clause so the planner still treats the
> subquery as a LIMIT subquery.  And since you don't want to skip any rows
> you specify 0 for the offset.
>
> David J.
>
>
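
The hack described in the quoted reply can be sketched like this (table,
column, and function names are invented for illustration):

-- Without the hack, the planner may inline the subquery and push
-- "sub.y = 42" down into it.  OFFSET 0 makes the planner treat it as a
-- LIMIT-style subquery (an optimization fence) while skipping no rows:
SELECT *
FROM  (SELECT x, expensive_fn(x) AS y
       FROM   big_table
       OFFSET 0) AS sub
WHERE sub.y = 42;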


Re: Detecting repeated phrase in a string

2021-12-09 Thread Peter J. Holzer
On 2021-12-09 16:11:31 +0100, Andreas Joseph Krogh wrote:
> For repeated words (including unicode-chars) you can do:
>  
> (\b\p{L}+\b)(?:\s+\1)+
>  
> I'm not quite sure how to translate this to PG, but in JAVA it works.

See 
https://www.postgresql.org/docs/11/functions-matching.html#POSIX-CONSTRAINT-ESCAPES-TABLE
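
For example, Java's \b roughly corresponds to \y (word boundary) in that
table, so a sketch (untested) of the same idea in PostgreSQL:

select regexp_match('the fox fox jumped', '(\y\w+\y)(?:\s+\1)+');
-- should match, capturing "fox"

select regexp_match('the foxfox jumped', '(\y\w+\y)(?:\s+\1)+');
-- should not match: "foxfox" is a single word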

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Need to know more about pg_test_fsync utility

2021-12-09 Thread PGSQL DBA
Hi Team,

I would like to understand the internals of the pg_test_fsync utility. I know
pg_test_fsync is intended to give a reasonable idea of the fastest
wal_sync_method for our system, but as per the documentation it can also
provide diagnostic information in the event of an I/O problem. So,
I have the questions below:

1) How to interpret the output of pg_test_fsync?
2) What is the meaning of ops/sec & usecs/op?
3) How does this utility work internally?
4) What is the IO pattern of this utility? Serial/sequential IO or multiple
threads with parallel IO?
5) Can we change the testing, like FIO, with multiple threads and parallel IO?
6) How does a commit happen in the background while executing this utility?
7) How can we use this tool to measure an I/O issue?
8) In which area or section of the output do we need to focus while
troubleshooting I/O issues?
9) What is the meaning of "Non-sync'ed 8kB writes"?

I would also like to know whether there is any PostgreSQL best practice
stating what disk latency is required for the WAL and data disks.

Thank You :)


Re: Need to know more about pg_test_fsync utility

2021-12-09 Thread Thomas Munro
On Fri, Dec 10, 2021 at 3:20 PM PGSQL DBA  wrote:
> 1) How to interpret the output of pg_test_fsync?

The main interesting area is probably the top section that compares
the different wal_sync_method settings.  For example, it's useful to
verify the claim that fdatasync() is faster than fsync() (because it
only flushes data, not meta-data like file modified time).  It may
also be useful for measuring the effects of different caching settings
on your OS and storage.  Unfortunately open_datasync is a bit
misleading; we don't actually use O_DIRECT with open_datasync anymore,
unless you set wal_level=minimal, which almost nobody ever does.

> 2) What is the meaning of ops/sec & usecs/op?

Number of times it managed to flush data to disk per second
sequentially, and the same information expressed as microseconds per
flush.
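
The two figures are reciprocals of each other, i.e. usecs/op =
1,000,000 / ops/sec. For example (the 4000 here is just an illustrative
number, not a measurement):

SELECT 1000000.0 / 4000 AS usecs_per_op;  -- 4000 flushes/sec is 250 usecs/op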

> 3) How does this utility work internally?

It just does a loop over some system calls, or to be more precise,

https://github.com/postgres/postgres/blob/master/src/bin/pg_test_fsync/pg_test_fsync.c

> 4) What is the IO pattern of this utility? Serial/sequential IO or multiple 
> threads with parallel IO?

Sequential, no threads.

> 5) Can we change the testing like FIO with multiple threads and parallel IO?

Nope.  This is a simple tool.  Fio is much more general and useful.

> 6) How does a commit happen in the background while executing this utility?

Nothing happens in the background, it uses synchronous system calls
from one thread.

> 7) How can we use this tool to measure the I/O issue?

It's a type of micro-benchmark that gives you an idea of a sort of
baseline you can expect from a single PostgreSQL session committing to
the WAL.

> 8) In which area or section in the output do we need to focus while 
> troubleshooting I/O issues?

If PostgreSQL couldn't commit small sequential transactions about that
fast, I'd be interested in finding out why; and if fdatasync is
performing faster than the published device IOPS suggest should be
possible, then I'd investigate whether data is being cached
unexpectedly, perhaps indicating that committed transactions could be
lost in a system crash event.

> 9) What is the meaning of "Non-sync'ed 8kB writes"?

Calling the pwrite() system call, which writes into your operating
system's page cache but (usually) doesn't wait for any I/O.  Should be
somewhere north of 1 million/sec.




pqxx accumulating a transaction

2021-12-09 Thread Jeff Abrahamson
I am accumulating a transaction using libpqxx and suspect I'm doing
something overly tortuous to get where I'm going.

I don't mean to make this question longer than it needs to be, but I
think context will help.  So I'll start by explaining briefly how I got
to where I am.  And then I'll show where I am, which I'm pretty sure is
abusive and not how libpqxx means for me to do things.

*Intro: the naive approach to doing thousands of inserts*

I have some code that wants to write a bunch of stuff to the database,
typically INSERT ... ON CONFLICT ... DO UPDATE.  Roughly, this code loops
over some container and generates the needed SQL to insert each
container object that needs inserting.

*The naive way to do this* (skipping error/exception handling for the
moment) is thus:

pqxx::work txn(conn);
ostringstream sql_stmt;
Container rows;
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, txn);
        txn.exec(sql_stmt.str());
        sql_stmt.str("");   // clear the statement buffer
        sql_stmt.clear();   // reset stream state
    }
}
txn.commit();

The function RowToSQL() takes a transaction object so that it can quote
strings appropriately using txn.quote().

This is inefficient, however: calling exec() over and over turns out to
be quite slow.

So instead I build up a bunch of statements in the ostringstream, thus:

pqxx::work txn(conn);
ostringstream sql_stmt;
Container rows;
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, txn);
        if (++counter > kExecuteThreshold || sql_stmt.IsGettingRatherBig()) {
            txn.exec(sql_stmt.str());
            // Clear sql_stmt, reset counter here.
        }
        if (count > kCommitThreshold) {
            txn.commit();
            // Reset statement counter here.
        }
    }
}
// Final commit here.
// Final commit here.

I chose the two thresholds based on performance testing in our
environment.  Think 100 and 10,000 for order of magnitude.

This worked until it didn't, because re-using the transaction this way
leads to conflicts.

Attempt to activate transaction which is already closed.

I mostly understand all this.

This leads me to write something that has a weird code smell, and so I
suspect I've misunderstood how PostgreSQL / libpqxx intend to be used.

Adding back error/exception handling, I've now written this based on
lots of doc reading:

*Less naive but weird code smell*

*// Has weird code smell.*

pqxx::nontransaction non_txn(conn);
ostringstream sql_stmt;
Container rows;
vector<string> transactions;
for (const auto& row : rows) {
    if (row.IsDirty()) {
        RowToSQL(sql_stmt, row, non_txn);
        if (++counter > kExecuteThreshold || sql_stmt.IsGettingRatherBig()) {
            transactions.push_back(sql_stmt.str());
            // Clear sql_stmt, reset counter here.
        }
        if (count > kCommitThreshold) {
            pqxx::work txn(conn);
            try {
                for (const string& trans : transactions) {
                    txn.exec(trans);
                }
                txn.commit();
                transactions.clear();
                // Reset statement counter here.
            } catch (const exception& e) {
                txn.abort();    // Pedantic, happens on destruction.
                YellowAlert();  // Something appropriate.
            }
        }
    }
}
// Final commit here.

It seems quite wrong to me that I should build up this vector of things
to execute and maintain these custom-tuned execution/transaction
thresholds rather than using some facility of libpqxx. That is, this
pattern seems common enough to me that the complexity I'm starting to
see strikes me as my own misunderstanding.

Any pointers much appreciated.


-- 
Jeff Abrahamson
+33 6 24 40 01 57
+44 7920 594 255

http://p27.eu/jeff/
http://mobilitains.fr/