Re: PostgreSQL

2023-02-09 Thread Joseph Kennedy
As I wrote, I would like to restrict access to sensitive or restricted information 
for some users (e.g. hide the data of one or more clients from some database users).

My question is: do you know some other solution to do that ? 

Some 3rd party solution eg. similar to pgpool or something else ?

Maybe RLS is the best solution for me.

Thank you for your suggestions. 

> Message written by Stephen Frost on 08.02.2023, at 01:36:
> 
> Greetings,
> 
> * Joseph Kennedy (joseph.kennedy@gmail.com) wrote:
>> Thank you Laurenz, I just wanted to make sure. 
>> 
>> Do you know any alternative solution to RLS ?
> 
> ... to do what?  If you want to limit the set of rows that a given user
> can see, RLS is how to do that.  If you want to limit the schemas or
> tables that a user can see, that isn't possible in PG today (though
> there was some interest a while back in making that work via RLS on the
> catalogs which you might be able to find).
> 
> Thanks,
> 
> Stephen


Re: Question regarding UTF-8 data and "C" collation on definition of field of table

2023-02-09 Thread Jehan-Guillaume de Rorthais
On Sun, 5 Feb 2023 17:14:44 -0800
Peter Geoghegan  wrote:

...
> The OP should see the Postgres ICU docs for hints on how to use these
> facilities to make a custom collation that matches whatever their
> requirements are:
> 
> https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING

Since you are talking about ICU customization to match whatever requirements we
have, we were wondering whether it would really be that easy with ICU to build such
a custom and odd collation (and not just move numbers after Latin letters)? Even
being able to order letter by letter?

For the record, I helped on an issue last week where data had to be sorted in
EBCDIC order. ICU was just a pain, especially because of that annoying bug when
sorting numbers after letters[1], and the fact that it accepts whatever we feed it
without complaining about badly formed or impossible collations. We just gave up.

One of our recommendations was to write a glibc collation file, build/install it
using localedef, and "import" it into PostgreSQL using
pg_import_system_collations(). The customer actually did it and it works like a
charm.
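
For illustration, a rough sketch of the SQL side of that route, assuming the
custom glibc locale (hypothetically named "ebcdic_custom" here) has already been
built and installed on the OS with localedef:

  -- pick up the newly installed OS locale as a collation
  SELECT pg_import_system_collations('pg_catalog');

  -- then use the imported collation explicitly where needed
  -- ("inventory" and "hostname" are made-up names)
  SELECT hostname
  FROM inventory
  ORDER BY hostname COLLATE "ebcdic_custom";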

Regards,

[1] remember ?
https://www.postgresql.org/message-id/flat/20200903105727.064665ce%40firost#0a6e89e58eec7679391c829231a7b3ea




Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Dominique Devienne
Hi. I'm requesting advice on something I'm about to start on.

In a normally 2-tier system, where "desktop" native apps connect directly
to PostgreSQL to access data, some things must be mediated via a
middle-tier service (i.e. 3-tier architecture). That service is HTTP-based,
thus clients (the "desktop" native apps) must know the URL (host+port+path)
of a server providing the requested service.

Since clients must already have access to PostgreSQL to operate, I wanted
to add a table in PostgreSQL for services, and server(s) on startup
(probably using a random port) would register in that table, and deregister
on shutdown. Also, since crashes are a fact of life, the server would
regularly update the DB with a "heartbeat" on some frequency (e.g. 1s?), so
clients would select (or see) only registered services with a "fresh
enough" heartbeat timestamp.

That's basically my plan. Now come the questions:
1) will updating a row every second (for example) create issues?
2) if yes to #1 above, what would be good mitigation tactics? Use different
table for service vs heartbeat? Special kind of table? or configured in a
particular way?
3) if a service crashes, it won't remove its row(s), obviously. What kind
of mechanism exists to "reap" "zombie" services?
4) Related to #3 above, I think built-in "cron"-like services are only
available via extensions, not in PostgreSQL proper. Why? Seems like such an
essential service.
5) Which cron-like extension to use? Especially since we run both on-prem
but also in managed-PostgreSQL on the cloud?

I'd appreciate community input. Thanks, --DD

PS: Note that there could be more than 1 server registered, for the same
service, possibly on the same machine, for redundancy. But I think that's
mostly orthogonal to my questions above.


WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread Dominique Devienne
Hi. We are implementing an API which takes a list of row keys, and must
return info about those rows. To implement that efficiently, in as few
round-trips as possible, we bind a (binary) array of keys (ints, uuids, or
strings) and that works great, but only if the key is a scalar one.

Now we'd like to do the same for composite keys, and I don't know how to do
that.
Is it possible? Could someone please help out or demo such a thing?
We are doing it in C++ using libpq, but a pure SQL or PL/pgSQL demo would
still help (I think).

Thanks, --DD


Re: PostgreSQL

2023-02-09 Thread Laurenz Albe
On Thu, 2023-02-09 at 09:54 +0100, Joseph Kennedy wrote:
> As I wrote, I would like to restrict access to sensitive or restricted 
> information
> for some users (e.g. hide the data of one or more clients from some database users).
> 
> My question is: do you know some other solution to do that ? 

It is easy to hide data from users: use permissions or row level security.
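
For example, a minimal sketch of the RLS route (all names are illustrative):

  CREATE TABLE client_data (client_id integer, payload text);
  CREATE TABLE client_grants (client_id integer, username name);

  ALTER TABLE client_data ENABLE ROW LEVEL SECURITY;

  -- each user only sees the clients they were explicitly granted
  CREATE POLICY per_user_clients ON client_data
      USING (client_id IN (SELECT client_id FROM client_grants
                           WHERE username = current_user));

  GRANT SELECT ON client_data, client_grants TO PUBLIC;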

It is impossible to hide metadata from users, but we don't consider that a 
problem.

If you want that, use a database or a database cluster per user.

Yours,
Laurenz Albe




Concurrency issue with DROP INDEX CONCURRENTLY

2023-02-09 Thread Kiriakos Georgiou
Hello,

I have an interesting concurrency issue with DROP INDEX CONCURRENTLY that I can 
summarize with this test scenario:

/**/

-- suppose we have this table and index
create table test(x int);
create index idx1 on test(x);

-- now suppose with the database “live” and the above table super busy (lots of 
queries on the table using index idx1), I decide to make the index unique
create unique index concurrently idx2 on test(x); -- runs fine
drop index concurrently idx1; -- took 3 hours to finish, since the table is 
super busy

/**/

Taking 3 hours to drop the index is not surprising (lots of queries on the 
table using idx1).  What surprises me is the drop index causes havoc with 
concurrency on the table, causing queries to pile up.
Once the drop index finishes, everything goes back to normal.

I thought by using the CONCURRENTLY option, the drop index is “safe” from 
concurrency issues for the underlying table, but in the above scenario it 
doesn’t appear to be “safe”.

I am trying to formulate a theory to explain this.  Any ideas?

Regards,
Kiriakos



Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread David G. Johnston
On Thu, Feb 9, 2023 at 8:30 AM Dominique Devienne 
wrote:

>
> That's basically my plan. Now come the questions:
> 1) will updating a row every second (for example) create issues?
> 2) if yes to #1 above, what would be good mitigation tactics? Use
> different table for service vs heartbeat? Special kind of table? or
> configured in a particular way?
> 3) if a service crashes, it won't remove its row(s), obviously. What kind
> of mechanism exists to "reap" "zombie" services?
> 4) Related to #3 above, I think built-in "cron"-like services are only
> available via extensions, not in PostgreSQL proper. Why? Seems like such an
> essential service.
> 5) Which cron-like extension to use? Especially since we run both on-prem
> but also in managed-PostgreSQL on the cloud?
>
>
You can probably get good mileage from CREATE UNLOGGED TABLE.
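
For instance, one way that could look (table and column names are illustrative,
not something settled in this thread):

  -- keep the once-a-second heartbeat in a separate UNLOGGED table so those
  -- writes skip WAL; the rows are lost on crash or failover, which is
  -- acceptable for ephemeral liveness data
  CREATE TABLE service (id bigserial PRIMARY KEY, name text, url text);
  CREATE UNLOGGED TABLE service_heartbeat (
      service_id bigint PRIMARY KEY REFERENCES service(id) ON DELETE CASCADE,
      last_seen  timestamptz NOT NULL DEFAULT now()
  );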

I don't have any particular suggestion for PostgreSQL cron extensions but
I'm fully on board with the current division of responsibilities here.
Such a user-space application isn't something that the core developers
should be worried about nor does such a project really want to be tied to
the release cadence that the server is restricted to.

David J.


Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread David G. Johnston
On Thu, Feb 9, 2023 at 8:41 AM Dominique Devienne 
wrote:

> Hi. We are implementing an API which takes a list of row keys, and must
> return info about those rows. To implement that efficiently, in as few
> round-trips as possible, we bind a (binary) array of keys (ints, uuids, or
> strings) and that works great, but only if the key is a scalar one.
>
> Now we'd like to do the same for composite keys, and I don't know how to
> do that.
> Is it possible? Could someone please help out or demo such a thing?
> We are doing it in C++ using libpq, but a pure SQL or PL/pgSQL demo would
> still help (I think).
>
>
It's trivial in pl/pgsql since I don't have to deal with serialization of
the data.

An array-of-composites is simply:

SELECT ARRAY[ ROW(1, 'one'), ROW(2, 'two') ]::composite_type[];

Not sure about the binary part, but there are rules for how to serialize both
composites and arrays to text, and quite probably libpq provides functions for
this already, though I've never used it raw.

David J.


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 4:46 PM David G. Johnston 
wrote:

> On Thu, Feb 9, 2023 at 8:30 AM Dominique Devienne 
> wrote:
>
>> That's basically my plan. Now come the questions:
>> 1) will updating a row every second (for example) create issues?
>> 2) if yes to #1 above, what would be good mitigation tactics? Use
>> different table for service vs heartbeat? Special kind of table? or
>> configured in a particular way?
>> 3) if a service crashes, it won't remove its row(s), obviously. What kind
>> of mechanism exists to "reap" "zombie" services?
>> 4) Related to #3 above, I think built-in "cron"-like services are only
>> available via extensions, not in PostgreSQL proper. Why? Seems like such an
>> essential service.
>> 5) Which cron-like extension to use? Especially since we run both on-prem
>> but also in managed-PostgreSQL on the cloud?
>>
>>
> You can probably get good mileage from CREATE UNLOGGED TABLE.
>

Thanks. Although I guess the fact that it's not replicated to standby servers
could be a problem?
That's not something we test now, and it's also something the on-prem DBA (i.e.
a client of ours)
might want to set up on his/her own, on top of our client/server architecture, I
guess.

I have no experience with standbys (replication) and HA in PostgreSQL.
Would having the main service table be a regular one, and the
service_heartbeat an unlogged one, be replication-friendly?
I.e. if failover to the standby happens, the service table is still there
and populated, the service_heartbeat is empty, but then the services would
start repopulating it "transparently", no? I.e. would using 2 tables instead
of 1 be a better design?


Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 4:51 PM David G. Johnston 
wrote:

> On Thu, Feb 9, 2023 at 8:41 AM Dominique Devienne 
> wrote:
>
>> Now we'd like to do the same for composite keys, and I don't know how to
>> do that.
>>
>
> An array-of-composites is simply:
> SELECT ARRAY[ ROW(1, 'one'), ROW(2, 'two') ]::composite_type[];
>

Thanks. I don't consider that "simple" myself :). But I'm definitely not an
advanced PostgreSQL user!
Would still appreciate a more fleshed out demo, if anyone is kind enough to
provide it. Thanks, --DD


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Adrian Klaver

On 2/9/23 07:30, Dominique Devienne wrote:

Hi. I'm requesting advice on something I'm about to start on.

In a normally 2-tier system, where "desktop" native apps connect 
directly to PostgreSQL to access data, some things must be mediated via 
a middle-tier service (i.e. 3-tier architecture). That service is HTTP 
based thus clients (the "desktop" native apps) must know the URL 
(host+port+path) of a server providing the requested service.


Is there more than one server providing the same service?



Since clients must already have access to PostgreSQL to operate, I 
wanted to add a table in PostgreSQL for services, and server(s) on 
startup (probably using a random port) would register in that table, and 
deregister on shutdown. Also, since crashes are a fact of life, the 
server would regularly update the DB with a "heartbeat" on some 
frequency (e.g. 1s?), so clients would select (or see) only registered 
services with a "fresh enough" heartbeat timestamp.


Would it not be easier to not have random ports and just attempt 
connections to the servers, either:


1) In the client, with a reattempt to a different port on failure.

2) From the Postgres server, updating the table to reflect the currently up servers.



That's basically my plan. Now come the questions:
1) will updating a row every second (for example) create issues?
2) if yes to #1 above, what would be good mitigation tactics? Use 
different table for service vs heartbeat? Special kind of table? or 
configured in a particular way?
3) if a service crashes, it won't remove its row(s), obviously. What 
kind of mechanism exists to "reap" "zombie" services?
4) Related to #3 above, I think built-in "cron"-like services are only 
available via extensions, not in PostgreSQL proper. Why? Seems like such 
an essential service.
5) Which cron-like extension to use? Especially since we run both 
on-prem but also in managed-PostgreSQL on the cloud?


I'd appreciate community input. Thanks, --DD

PS: Note that there could be more than 1 server registered, for the same 
service, possibly on the same machine, for redundancy. But I think 
that's mostly orthogonal to my questions above.


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 5:03 PM Dominique Devienne 
wrote:

> On Thu, Feb 9, 2023 at 4:51 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Thu, Feb 9, 2023 at 8:41 AM Dominique Devienne 
>> wrote:
>>
>>> Now we'd like to do the same for composite keys, and I don't know how to
>>> do that.
>>>
>>
>> An array-of-composites is simply:
>> SELECT ARRAY[ ROW(1, 'one'), ROW(2, 'two') ]::composite_type[];
>>
>
> Thanks. I don't consider that "simple" myself :). But I'm definitely not
> an advanced PostgreSQL user!
> Would still appreciate a more fleshed out demo, if anyone is kind enough
> to provide it. Thanks, --DD
>

Also, I'm still not sure how to write that WHERE clause, with the $1 being
an array of a composite type.
And since this is a binary bind, what kind of OIDs should we use? Your example
seems to generate a type on the fly, for example, David.
Will we need to create custom types just so we have OIDs we can use to
assemble the bytes of the array-of-composite bind?
I believe there's an implicit ROW type per table created. Are there also
implicit types for composite PKs and/or UNIQUE constraints?
Lots of questions...


Re: Concurrency issue with DROP INDEX CONCURRENTLY

2023-02-09 Thread Adrian Klaver

On 2/9/23 07:45, Kiriakos Georgiou wrote:

Hello,

I have an interesting concurrency issue with DROP INDEX CONCURRENTLY that I can 
summarize with this test scenario:

/**/

— suppose we have this table and index
create table test(x int);
create index idx1 on test(x);

— now suppose with the database “live” and the above table super busy (lots of 
queries on the table using index idx1), I decide to make the index unique
create unique index concurrently idx2 on test(x); — runs fine
drop index concurrently idx1; — took 3 hours to finish, since the table is 
super busy

/**/

Taking 3 hours to drop the index is not surprising (lots of queries on the 
table using idx1).  What surprises me is the drop index causes havoc with 
concurrency on the table, causing queries to pile up.
Once the drop index finishes, everything goes back to normal.

I thought by using the CONCURRENTLY option, the drop index is “safe” from 
concurrency issues for the underlying table, but in the above scenario it 
doesn’t appear to be “safe”.

I am trying to formulate a theory to explain this.  Any ideas?


1) From here:

https://www.postgresql.org/docs/current/sql-dropindex.html

"With this option, the command instead waits until conflicting 
transactions have completed."


2) Probably too late for this case, but info from

https://www.postgresql.org/docs/current/view-pg-locks.html

and

https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW

would be useful.
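
For example, something along these lines (just a sketch, trim or add columns as 
needed):

  SELECT pid, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE wait_event_type = 'Lock';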



Regards,
Kiriakos



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 5:05 PM Adrian Klaver 
wrote:

> On 2/9/23 07:30, Dominique Devienne wrote:
> > In a normally 2-tier system, where "desktop" native apps connect
> > directly to PostgreSQL to access data, some things must be mediated via
> > a middle-tier service (i.e. 3-tier architecture). That service is HTTP
> > based thus clients (the "desktop" native apps) must know the URL
> > (host+port+path) of a server providing the requested service.
>
> Is there more then one server providing the same service?
>

Yes. That was my PS, basically. The client can connect to any one, randomly.
We need at least one of course. But there could be more than 1, yes.


> > Since clients must already have access to PostgreSQL to operate, I
> > wanted to add a table in PostgreSQL for services, and server(s) on
> > startup (probably using a random port) would register in that table, and
> > deregister on shutdown. Also, since crashes are a fact of life, the
> > server would regularly update the DB with a "heartbeat" on some
> > frequency (e.g. 1s?), so clients would select (or see) only registered
> > services with a "fresh enough" heartbeat timestamp.
>
> Would it no be easier to not have random ports and just attempt
> connections to the servers either:
> 1) In the client with reattempt to different port on failure.
> 2) From Postgres server and update table to have current up servers.
>

I'm sorry, but I'm not following. Can you perhaps rephrase?

Regarding ports, once you have registration of services, it just seems easier
to me NOT to have a fixed port,
and to let the host assign any port to the HTTP server. Those servers are not
user-facing directly; from the client
side, the client calls an API, and lookup of the service and connection to the HTTP
server are transparent implementation
details, so the port used doesn't matter. In-DB registration of (HTTP)
servers makes the whole URL an implementation detail.


Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread David G. Johnston
On Thu, Feb 9, 2023 at 9:09 AM Dominique Devienne 
wrote:

> On Thu, Feb 9, 2023 at 5:03 PM Dominique Devienne 
> wrote:
>
>> On Thu, Feb 9, 2023 at 4:51 PM David G. Johnston <
>> david.g.johns...@gmail.com> wrote:
>>
>>> On Thu, Feb 9, 2023 at 8:41 AM Dominique Devienne 
>>> wrote:
>>>
 Now we'd like to do the same for composite keys, and I don't know how
 to do that.

>>>
>>> An array-of-composites is simply:
>>> SELECT ARRAY[ ROW(1, 'one'), ROW(2, 'two') ]::composite_type[];
>>>
>>
>> Thanks. I don't consider that "simple" myself :). But I'm definitely not
>> an advanced PostgreSQL user!
>> Would still appreciate a more fleshed out demo, if anyone is kind enough
>> to provide it. Thanks, --DD
>>
>
> Also, I'm still not sure how to write that WHERE clause, with the $1 being
> an array of a composite type.
> And since this is a binary bind, what kind of OIDs to use. Your example
> seems to generate a type on-the-fly for example David.
> Will we need to create custom types just so we have OIDs we can use to
> assemble the bytes of the array-of-composite bind?
> I believe there's an implicit ROW type per table created. Are there also
> implicit types for composite PKs and/or UNIQUE constraints?
> Lots of questions...
>

 postgres=# select (1, 'one'::text) = any(array[(1,
'one'::text)::record]::record[]);
 ?column?
--
 t
(1 row)

Not sure how much that helps but there it is.

If you wanted to use an actual explicit composite type with an OID it would
need to be created.
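
For illustration, a hedged sketch of that (the type, table and column names are
assumptions, not from your schema):

  CREATE TYPE item_key AS (id int, name text);

  PREPARE find_items(item_key[]) AS
      SELECT t.*
      FROM items t
      WHERE (t.id, t.name)::item_key = ANY($1);

  EXECUTE find_items(ARRAY[(1, 'one')::item_key, (2, 'two')::item_key]);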

The WHERE clause is the easy part; it's the code side for setting the
parameter that I don't know.  Ideally the library lets you pass around
language-specific objects and does the conversion for you.

You could consider writing out a JSONB object and writing your condition in
terms of json operators/expressions.
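
And a similarly hedged sketch of the jsonb route, with all composite keys passed
as one jsonb array parameter (again, names are assumptions):

  -- $1 would be e.g. '[{"id": 1, "name": "one"}, {"id": 2, "name": "two"}]'
  SELECT t.*
  FROM items t
  JOIN jsonb_to_recordset($1) AS k(id int, name text)
    ON (t.id, t.name) = (k.id, k.name);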

David J.


Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread Alban Hertroys


> On 9 Feb 2023, at 16:41, Dominique Devienne  wrote:
> 
> Hi. We are implementing an API which takes a list of row keys, and must 
> return info about those rows. To implement that efficiently, in as few 
> round-trips as possible, we bind a (binary) array of keys (ints, uuids, or 
> strings) and that works great, but only if the key is a scalar one.
> 
> Now we'd like to do the same for composite keys, and I don't know how to do 
> that.
> Is it possible? Could someone please help out or demo such a thing?
> We are doing it in C++ using libpq, but a pure SQL or PL/pgSQL demo would 
> still help (I think).

This works:

=> select (1, 'one'::text) in ((1, 'two'::text), (2, 'one'::text), (1, 
'one'::text), (2, 'two'::text));
 ?column? 
--
 t
(1 row)

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.





Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread David G. Johnston
On Thu, Feb 9, 2023 at 9:28 AM Alban Hertroys  wrote:

>
> > On 9 Feb 2023, at 16:41, Dominique Devienne  wrote:
> >
> > Hi. We are implementing an API which takes a list of row keys, and must
> return info about those rows. To implement that efficiently, in as few
> round-trips as possible, we bind a (binary) array of keys (ints, uuids, or
> strings) and that works great, but only if the key is a scalar one.
> >
> > Now we'd like to do the same for composite keys, and I don't know how to
> do that.
> > Is it possible? Could someone please help out or demo such a thing?
> > We are doing it in C++ using libpq, but a pure SQL or PL/pgSQL demo
> would still help (I think).
>
> This works:
>
> => select (1, 'one'::text) in ((1, 'two'::text), (2, 'one'::text), (1,
> 'one'::text), (2, 'two'::text));
>  ?column?
> --
>  t
> (1 row)
>
>
But you cannot write the right-hand side of the IN as a single parameter, which
seems to be the primary constraint you are trying to conform to.

David J.


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Adrian Klaver

On 2/9/23 08:16, Dominique Devienne wrote:
On Thu, Feb 9, 2023 at 5:05 PM Adrian Klaver wrote:





Is there more then one server providing the same service?


Yes. That was my PS: basically. The client can connect to any one, randomly.
We need at least one of course. But there could me more than 1, yes.



Would it no be easier to not have random ports and just attempt
connections to the servers either:
1) In the client with reattempt to different port on failure.
2) From Postgres server and update table to have current up servers.


I'm sorry, but I'm not following. Can you perhaps rephrase?

Regarding ports, once you have registration of services, it just seems 
easier to me NOT to have a fixed port,
and to let the host assign any port to the HTTP server. Those servers are 
not user-facing directly; from the client
side, the client calls an API, and lookup of the service and connection to the 
HTTP server are transparent implementation
details, so the port used doesn't matter. In-DB registration of (HTTP) 
servers makes the whole URL an implementation detail.


The flip side of that is that with known ports it would be easier to 
have a process on the Postgres machine or in the database that checks 
the ports on a regular basis, and as part of that process marks any 
non-responding ports as inactive. That would solve the zombie problem.


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Adrian Klaver

On 2/8/23 23:53, Paul McGarry wrote:
I have three databases, two of which are experiencing the 
issue below.


The first database was created from a dump in Feb 2022 (a few weeks 
after the time period for which I seem to have problematic indexes, maybe).
The second database was then cloned from the first (ie filesystem level 
copy) soon after.


What were the commands for the dump/restore?

Second database, same machine or different one?

Since then all databases have undergone a number of minor version 
upgrades, including to 13.9 and an OS update last week for the two 
problem databases (the other is still on 13.8).


Which OSes, and what was the update to?



Now, a process which does clears some data > 13 months old is getting an 
error when trying to do that update.


My suspicion is that either:
- there was probably an issue with the index 12 months ago and that 
problem was copied when I cloned the database, and is just becoming 
apparent now that a script is accessing 13-month-old data.

- something in our recent upgrade has caused the problem.


A dump/restore would recreate the index at restore time.



The third database is still on 13.8, and with some OS updates pending, 
and is not experiencing the problem.


I'm betting on the OS change as the culprit.




Any suggestions on what else I should look into, in particular anything 
I should check before upgrading the remaining 13.8 DB to 13.9?


Thanks for any help,

Paul


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Peter Geoghegan
On Wed, Feb 8, 2023 at 11:54 PM Paul McGarry  wrote:
> But if it is the problem, why did the update start working after I recreated 
> the other index?

There is no reason why reindexing another index ought to have had that
effect. The likely explanation is that subsequent updates used a
successor version heap TID that didn't overlap with some existing
posting list in whatever way. If you repeat the update again and
again, and get an error each time, the incoming TID will differ each
time. Eventually you won't get an error, because at some point there
won't be a posting-list-TID range overlap for some new successor TID
that leads to the insert/posting list split code detecting a problem.

It's also possible that a concurrent autovacuum "fixed" the issue.

The amcheck error shows a problem in an internal page, which cannot
have posting list tuples -- which suggests broad corruption. An issue
with collation instability due to an OS update does seem likely.

Note that the hardening/defensive checks in this area have increased.
I added an additional defensive check to 13.4, and followed up with
another similar check in 13.5. It looks like the error you've seen
("ERROR:  posting list tuple with 2 items cannot be split at offset
17") comes from the initial 13.4 hardening, since I'd expect the
additional 13.5 hardening to catch the same issue sooner, with a
different error message (something like "table tid from new index
tuple (%u,%u) cannot find insert offset between offsets %u and %u of
block %u in index \"%s\"").

> I think I should now:
> - recreate the widget_name_idx on the problem servers
> - run bt_index_check across all other indexes
>
> Any suggestions on what else I should look into, in particular anything I 
> should check before upgrading the remaining 13.8 DB to 13.9?

I recommend running amcheck on all indexes, or at least all
possibly-affected text indexes.

-- 
Peter Geoghegan




Re: WHERE col = ANY($1) extended to 2 or more columns?

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 5:37 PM David G. Johnston 
wrote:

> On Thu, Feb 9, 2023 at 9:28 AM Alban Hertroys  wrote:
>
>> > On 9 Feb 2023, at 16:41, Dominique Devienne 
>> wrote:
>> > Now we'd like to do the same for composite keys, and I don't know how
>> to do that.
>>
>> This works:
>> => select (1, 'one'::text) in ((1, 'two'::text), (2, 'one'::text), (1,
>> 'one'::text), (2, 'two'::text));
>>
> But you cannot write the right-side of the IN as a single parameter which
> seems to be the primary constraint trying to be conformed to.
>

Right. The goal is to (re)use a prepared statement (i.e. plan once), and
bind the RHS (binary) array
and do a single exec (single round-trip) to get the matching rows. AFAIK,
this is the fastest way.
If there's a better/faster way, I'm interested. --DD


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Dominique Devienne
On Thu, Feb 9, 2023 at 5:51 PM Adrian Klaver 
wrote:

> On 2/9/23 08:16, Dominique Devienne wrote:
> > On Thu, Feb 9, 2023 at 5:05 PM Adrian Klaver wrote:
>
> The flip side of that is that with known ports it would be easier to
> have a process on the Postgres machine or in the database that checks
> the ports on a regular basis, and as part of that process marks any
> non-responding ports as inactive. That would solve the zombie problem.
>

That's one possibility. But the "reaper" process could just as well scan
the service table,
and probe those too. So again, I'm not sure what the fixed-port approach
gains me, besides
perhaps the reaper not having to connect to PostgreSQL itself. I'm OK with
connecting.

Thanks for your input. Always good to have one's arguments challenged
by experts.


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Adrian Klaver

On 2/9/23 09:40, Dominique Devienne wrote:
On Thu, Feb 9, 2023 at 5:51 PM Adrian Klaver wrote:


On 2/9/23 08:16, Dominique Devienne wrote:
 > On Thu, Feb 9, 2023 at 5:05 PM Adrian Klaver wrote:
The flip side of that is that with known ports it would be easier to
have a process on the Postgres machine or in the database that checks
the ports on a regular basis, and as part of that process marks any
non-responding ports as inactive. That would solve the zombie problem.


That's one possibility. But the "reaper" process could just as well scan 
the service table,
and probe those too. So again, I'm not sure what the fixed-port approach 
gains me, beside
perhaps the reaper not having to connect to PostgreSQL itself. I'm OK 
with connecting.


What is the reaper process?



Thanks for the your input. Always good to have one's arguments 
challenged by experts.


--
Adrian Klaver
adrian.kla...@aklaver.com





Re: Sequence vs UUID

2023-02-09 Thread Merlin Moncure
On Wed, Feb 8, 2023 at 5:33 AM Peter J. Holzer  wrote:

> On 2023-02-08 14:48:03 +0530, veem v wrote:
> > So wanted to know from experts here, is there really exists any scenario
> in
> > which UUID really cant be avoided?
>
> Probably not. The question is usually not "is this possible" but "does
> this meet the requirements at acceptable cost".
>
>
> > Sequence Number = n*d+m+offset. Where n is the sequence order number, d
> is the
> > dimensions of the multi-master replication, m ranges from 0 to n-1 is the
> > number assigned to each node in the replication, and offset is the
> number to
> > offset the sequence numbers.
>
> Yes, you can do this. In fact, people (including me) have already done
> this.
>
> But it's relatively easy to mess this up:
>
> Firstly, you have to make sure that d is larger than your number of
> (active) replicas will ever be, but still small enough that you will
> never overflow. Probably not a problem with 64 bit sequences (if you set
> d to 1E6, you can still count to 9E12 on each node), but might be a
> problem if you are for some reason limited to 32 bits.
>
> Secondly (and IMHO more importantly) you have to make sure each node
> gets its own unique offset. So this needs to be ensured during
> deployment, but also during migrations, restores from backups and other
> infrequent events.


??  All you have to do is ensure each node has its own unique id, and that
id is involved in sequence generation.  This has to be done for other
reasons than id generation, and is a zero effort/risk process.

The id would then contain the identifier of the node that *generated* the
id, rather than the node that contains the id.  This is exactly analogous
to strategies that use mac# as part of id prefix for example.  Once
generated, it's known unique and you don't have to consider anything.
 This is exactly what I do, and there is no interaction with backups,
deployments, migrations, etc.  Node expansion does require that each node
gets a unique node id, and that's it.
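
For illustration, a rough sketch of that scheme (the numbers are made up: d =
1000 possible nodes, and this node's id m = 7):

  create sequence order_id_seq;   -- n = 1, 2, 3, ...
  create table orders (
      id bigint primary key default nextval('order_id_seq') * 1000 + 7,  -- n*d + m
      payload text
  );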

merlin


Re: Using PostgreSQL for service discovery and health-check

2023-02-09 Thread Adrian Klaver

On 2/9/23 09:40, Dominique Devienne wrote:
On Thu, Feb 9, 2023 at 5:51 PM Adrian Klaver wrote:


On 2/9/23 08:16, Dominique Devienne wrote:
 > On Thu, Feb 9, 2023 at 5:05 PM Adrian Klaver wrote:
The flip side of that is that with known ports it would be easier to
have a process on the Postgres machine or in the database that checks
the ports on a regular basis, and as part of that process marks any
non-responding ports as inactive. That would solve the zombie problem.


That's one possibility. But the "reaper" process could just as well scan 
the service table,
and probe those too. So again, I'm not sure what the fixed-port approach 
gains me, beside
perhaps the reaper not having to connect to PostgreSQL itself. I'm OK 
with connecting.


As to fixed ports and pulling vs. services pushing, there is a security 
side. Not sure who controls the external services, but there is the 
chance that someone knowing they exist could inject their own version of 
a service/server. With random ports that becomes easier, as you would 
not know what is canonical. With the pull process you have a 
verified (presumably) list of servers and the ports they listen on.




Thanks for the your input. Always good to have one's arguments 
challenged by experts.


--
Adrian Klaver
adrian.kla...@aklaver.com





delta.io foreign data wrapper

2023-02-09 Thread Georg Heiler
Hi,

Postgres already has a foreign data wrapper (FDW) for external tables
stored in parquet format on S3
https://www.postgresql.org/about/news/parquet-s3-fdw-030-released-2474/.

Delta https://delta.io/ is an evolution of parquet to a more
cloud-friendly/ACID table format also allowing for mutation.

Is there any support for using an FDW with delta? So far I could not find
any resources.

Best,
Georg


Re: Concurrency issue with DROP INDEX CONCURRENTLY

2023-02-09 Thread Kiriakos Georgiou
My apologies - there is no issue with DROP INDEX CONCURRENTLY.
It’s just brain fade on my part (I dropped the existing index before creating 
the new UNIQUE index, causing TPS on this table to go to zero *facepalm*).

Regards,
Kiriakos
  

> On Feb 9, 2023, at 10:45 AM, Kiriakos Georgiou  
> wrote:
> 
> Hello,
> 
> I have an interesting concurrency issue with DROP INDEX CONCURRENTLY that I 
> can summarize with this test scenario:
> 
> /**/
> 
> — suppose we have this table and index
> create table test(x int);
> create index idx1 on test(x);
> 
> — now suppose with the database “live” and the above table super busy (lots 
> of queries on the table using index idx1), I decide to make the index unique
> create unique index concurrently idx2 on test(x); — runs fine
> drop index concurrently idx1; — took 3 hours to finish, since the table is 
> super busy
> 
> /**/
> 
> Taking 3 hours to drop the index is not surprising (lots of queries on the 
> table using idx1).  What surprises me is the drop index causes havoc with 
> concurrency on the table, causing queries to pile up.
> Once the drop index finishes, everything goes back to normal.
> 
> I thought by using the CONCURRENTLY option, the drop index is “safe” from 
> concurrency issues for the underlying table, but in the above scenario it 
> doesn’t appear to be “safe”.
> 
> I am trying to formulate a theory to explain this.  Any ideas?
> 
> Regards,
> Kiriakos





Re: Concurrency issue with DROP INDEX CONCURRENTLY

2023-02-09 Thread Laurenz Albe
On Thu, 2023-02-09 at 10:45 -0500, Kiriakos Georgiou wrote:
> I have an interesting concurrency issue with DROP INDEX CONCURRENTLY that I
> can summarize with this test scenario:
> 
> /**/
> 
> — suppose we have this table and index
> create table test(x int);
> create index idx1 on test(x);
> 
> — now suppose with the database “live” and the above table super busy (lots
>   of queries on the table using index idx1), I decide to make the index unique
> create unique index concurrently idx2 on test(x); — runs fine
> drop index concurrently idx1; — took 3 hours to finish, since the table is 
> super busy
> 
> /**/
> 
> Taking 3 hours to drop the index is not surprising (lots of queries on the 
> table
> using idx1).  What surprises me is the drop index causes havoc with 
> concurrency
> on the table, causing queries to pile up.
> Once the drop index finishes, everything goes back to normal.
> 
> I thought by using the CONCURRENTLY option, the drop index is “safe” from 
> concurrency
> issues for the underlying table, but in the above scenario it doesn’t appear 
> to be “safe”.

DROP INDEX CONCURRENTLY also locks the table, but it waits until it finds a 
time when
it can get the lock right away.  Once it has the lock, it finishes the task.
While the table is locked, concurrent statements are blocked.  This should
not take a long time, but perhaps that is enough to cause the havoc you observe.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Quoting issue from ODBC

2023-02-09 Thread Brad White
On Tue, Feb 7, 2023 at 10:20 PM Brad White  wrote:

> On 2/7/2023 6:19 PM, Adrian Klaver wrote:
>
> On 2/7/23 16:10, Brad White wrote:
>
> Front end: Access 365
> Back end: Postgres 9.4
> (I know, we are in the process of upgrading)
>
> I'm getting some cases where the SQL sent from MS-Access is failing.
> Looking at the postgres log shows that the field names and table names are
> not being quoted properly.
> It has been my experience that Access usually does a better job at
> converting the queries than I would have expected, but not in this
> instance.
>
> For example
>
> Access: connection.Execute "UPDATE [" & strTable & "] SET [" & strTable &
> "].[InsertFlag] = Null" _
>  & " WHERE ((([" & strTable & "].[InsertFlag])=" & lngCurrUID & "));",
> , adCmdText Or adExecuteNoRecords
> Note that InsertFlag is bracketed the same way in both instances.
>
> PSQL: UPDATE "public"."Orders" SET InsertFlag=NULL  WHERE ("InsertFlag" =
> 166 )
> Note that InsertFlag is quoted once but not the other time.
> Of course this gives the error: column "insertflag" of relation "Orders"
> does not exist at character 35.
>
> Looks like I have about 16 unique instances of statements not being quoted
> correctly resulting in over 500 errors in the log for today.
>
>
> Were these preexisting queries or were they created today?
>
> These queries are decades old but I don't view this log file very often,
> so I don't know how long.
>
> I'll review when I get back on site Thursday and see if I can find any
> users that are not getting the error or when it started.
>
>
>
> Any suggestions on where to look?
>
> Thanks,
> Brad.
>
> Back in the office today and I note that all of the fields that are
getting the issue are the target field in an UPDATE statement.
All the other tables and field names are quoted correctly.

I suspect an ODBC driver bug.  Is there a better place to report those?

Driver: PostgreSQL Unicode
Filename: PSQLODBC35W.DLL
Version: 13.02.00
ReleaseDate: 9/22/2021

On the other hand, the app updates things all the time. Only about 12 of
the update statements are ending up in the log. Still looking for the
common denominator in how those statements are called.

ERROR:  column "*commitrep*" of relation "Order Items" does not exist at
character 35
STATEMENT:  UPDATE "public"."Order Items" SET *CommitRep*='jdoe'  WHERE
(("OrderFID" = 557837 ) AND ("*CommitRep*" IS NULL ) )

Here is the original query. You can see that  [CommitRep] is written the
same way both times in the query, but in the final quoted correctly once
and incorrectly once.
RunADO "CreditRepWithCommit()", "UPDATE [Order Items] SET *[CommitRep]*
= '" & UID & "'" _
& " WHERE [OrderFID] = " & OrderFID & " AND * [CommitRep]* IS NULL;"

A few other samples from the log. Always just the target field of the
UPDATE that is not quoted.

ERROR:  column "*availableserialcount*" of relation "Serial Pools" does not
exist at character 36
STATEMENT:  UPDATE "public"."Serial Pools" SET *AvailableSerialCount*=143
 WHERE ("ID" = 134 )

ERROR:  column "*serialnum*" of relation "Order Items" does not exist at
character 35
STATEMENT:  UPDATE "public"."Order Items" SET *SerialNum*='205757'
,LastSerCaptureTypeID=2  WHERE ("ID" = 1891128 )

ERROR:  column "*issuedate*" of relation "Order Items" does not exist at
character 35
STATEMENT:  UPDATE "public"."Order Items" SET *IssueDate*='2023-02-09
14:28:09'::timestamp ,*IssueRep*=' jdoe  '  WHERE ("ID" = 1891128 )


Re: Quoting issue from ODBC

2023-02-09 Thread Brad White
>> Were these preexisting queries or were they created today?

> These queries are decades old but I don't view this log file very often,
so I don't know how long.

> I'll review when I get back on site Thursday and see if I can find any
users that are not getting the error or when it started.

Going back to early 2020, I don't have any logs that don't have these
errors, so it is not a recent change.

I don't yet see anything that is user specific.

Brad.

>


Re: Quoting issue from ODBC

2023-02-09 Thread Adrian Klaver

On 2/9/23 14:43, Brad White wrote:
On Tue, Feb 7, 2023 at 10:20 PM Brad White wrote:


On 2/7/2023 6:19 PM, Adrian Klaver wrote:

On 2/7/23 16:10, Brad White wrote:

Front end: Access 365
Back end: Postgres 9.4
(I know, we are in the process of upgrading)

I'm getting some cases where the SQL sent from MS-Access is failing.
Looking at the postgres log shows that the field names and table
names are not being quoted properly.
It has been my experience that Access usually does a better job
at converting the queries than I would have expected, but not in
this instance.

For example

Access: connection.Execute "UPDATE [" & strTable & "] SET [" &
strTable & "].[InsertFlag] = Null" _
     & " WHERE ((([" & strTable & "].[InsertFlag])=" & lngCurrUID
& "));", , adCmdText Or adExecuteNoRecords
Note that InsertFlag is bracketed the same way in both instances.

PSQL: UPDATE "public"."Orders" SET InsertFlag=NULL  WHERE
("InsertFlag" = 166 )
Note that InsertFlag is quoted once but not the other time.
Of course this gives the error: column "insertflag" of relation
"Orders" does not exist at character 35.

Looks like I have about 16 unique instances of statements not
being quoted correctly resulting in over 500 errors in the log
for today.


Were these preexisting queries or were they created today?


These queries are decades old but I don't view this log file very
often, so I don't know how long.

I'll review when I get back on site Thursday and see if I can find
any users that are not getting the error or when it started.





Any suggestions on where to look?

Thanks,
Brad.


Back in the office today and I note that all of the fields that are 
getting the issue are the target field in an UPDATE statement.

All the other tables and field names are quoted correctly.

I suspect an ODBC driver bug.  Is there a better place to report those?

Driver: PostgreSQL Unicode
Filename: PSQLODBC35W.DLL
Version: 13.02.00
ReleaseDate: 9/22/2021


https://www.postgresql.org/list/pgsql-odbc/



On the other hand, the app updates things all the time. Only about 12 of 
the update statements are ending up in the log. Still looking for the 
common denominator in how those statements are called.



So how are the successful UPDATEs called?

Are the successful UPDATEs on the same tables and columns?

From your subsequent post:

"Going back to early 2020, I don't have any logs that don't have these 
errors, so it is not a recent change."


Are these UPDATE's actually necessary?

In other words, has nobody noticed a problem with the data over that time 
frame?




--
Adrian Klaver
adrian.kla...@aklaver.com





Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Paul McGarry
Hi Peter,

Thanks for your input, I'm feeling more comfortable that I correctly
understand the problem now and it's "just" a collation related issue.


> I recommend running amcheck on all indexes, or at least all
> possibly-affected text indexes.
>
>
Will the amcheck reliably identify all issues that may arise from a
collation change?

I don't know enough about the details of B-tree indexes to know whether the
problems are only "obvious" if they happen to interact with some boundary,
or whether it will always be evident with the basic amcheck, ie

bt_index_check(index regclass, false)

and therefore I can just use that to identify bad indexes and recreate
them, or should I recreate all btree indexes involving text fields?

We planned to do a dump/restore upgrade to PG14 in a month or so (already
done in dev).
I am wondering whether it will be less work overall to bring that forward
than rebuild these indexes...

Thanks again for your advice.

Paul


Re: Quoting issue from ODBC

2023-02-09 Thread Brad White
On Thu, Feb 9, 2023 at 5:10 PM Adrian Klaver 
wrote:

> On 2/9/23 14:43, Brad White wrote:
> > On Tue, Feb 7, 2023 at 10:20 PM Brad White wrote:
> >
> > On 2/7/2023 6:19 PM, Adrian Klaver wrote:
> >> On 2/7/23 16:10, Brad White wrote:
> >>> Front end: Access 365
> >>> Back end: Postgres 9.4
> >>> (I know, we are in the process of upgrading)
> >>>
> >>> I'm getting some cases where the SQL sent from MS-Access is
> failing.
> >>> Looking at the postgres log shows that the field names and table
> >>> names are not being quoted properly.
> >>> It has been my experience that Access usually does a better job
> >>> at converting the queries than I would have expected, but not in
> >>> this instance.
> >>>
> >>> For example
> >>>
> >>> Access: connection.Execute "UPDATE [" & strTable & "] SET [" &
> >>> strTable & "].[InsertFlag] = Null" _
> >>>  & " WHERE ((([" & strTable & "].[InsertFlag])=" & lngCurrUID
> >>> & "));", , adCmdText Or adExecuteNoRecords
> >>> Note that InsertFlag is bracketed the same way in both instances.
> >>>
> >>> PSQL: UPDATE "public"."Orders" SET InsertFlag=NULL  WHERE
> >>> ("InsertFlag" = 166 )
> >>> Note that InsertFlag is quoted once but not the other time.
> >>> Of course this gives the error: column "insertflag" of relation
> >>> "Orders" does not exist at character 35.
> >>>
> >>> Looks like I have about 16 unique instances of statements not
> >>> being quoted correctly resulting in over 500 errors in the log
> >>> for today.
> >>
> >> Were these preexisting queries or were they created today?
> >
> > These queries are decades old but I don't view this log file very
> > often, so I don't know how long.
> >
> > I'll review when I get back on site Thursday and see if I can find
> > any users that are not getting the error or when it started.
> >
> >>
> >>>
> >>> Any suggestions on where to look?
> >>>
> >>> Thanks,
> >>> Brad.
> >
> > Back in the office today and I note that all of the fields that are
> > getting the issue are the target field in an UPDATE statement.
> > All the other tables and field names are quoted correctly.
> >
> > I suspect an ODBC driver bug.  Is there a better place to report those?
> >
> > Driver: PostgreSQL Unicode
> > Filename: PSQLODBC35W.DLL
> > Version: 13.02.00
> > ReleaseDate: 9/22/2021
>
> https://www.postgresql.org/list/pgsql-odbc/
>
> >
> > On the other hand, the app updates things all the time. Only about 12 of
> > the update statements are ending up in the log. Still looking for the
> > common denominator in how those statements are called.
>
>
> So how the successful UPDATE's called?
>
I'm still trying to track down all the statements. Because of the
translation between the two database systems, I can't just search on a
simple string.

>
> Are the successful UPDATES's on the same tables and columns?
>
This is the only routine that updates the InsertFlag column. All the order
tables have that flag.

>
> Are these UPDATE's actually necessary?
>
This system is critical to the company, but has a reputation of being
unreliable. I suspect this may be one cause.

>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>


Re: ERROR: posting list tuple with 2 items cannot be split at offset 17

2023-02-09 Thread Peter Geoghegan
On Thu, Feb 9, 2023 at 3:55 PM Paul McGarry  wrote:
> Will the amcheck reliably identify all issues that may arise from a collation 
> change?

Theoretically it might not. In practice I'd be very surprised if it
ever failed to detect such an inconsistency. If you want to be extra
careful, and can afford to block concurrent writes, then I suggest
using bt_index_parent_check instead of bt_index_check.
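
For example, something like this (a sketch; the index name is the one from your
earlier mail):

  CREATE EXTENSION IF NOT EXISTS amcheck;
  SELECT bt_index_check('widget_name_idx');         -- lighter, allows writes
  SELECT bt_index_parent_check('widget_name_idx');  -- stricter, blocks writes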

> and therefore I can just use that to identify bad indexes and recreate them, 
> or should I recreate all btree indexes involving text fields?

It might be easier to just reindex them all. Hard to say.

> We planned to do a dump/restore upgrade to PG14 in a month or so (already 
> done in dev).
> I am wondering whether it will be less work overall to bring that forward 
> than rebuild these indexes...

pg_amcheck is available on 14. It offers a much simpler interface for
running the amcheck routines, so maybe look into that once you upgrade.

-- 
Peter Geoghegan




pg_trgm vs. Solr ngram

2023-02-09 Thread Chris

Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just 
because we use Postgres for everything else.


The application is basically fgrep with a web frontend. However the 
indexed documents are very computer network specific and contain a lot 
of hyphenated hostnames with dot-separated domains, as well as IPv4 and 
IPv6 addresses. In Solr I was using ngrams and customized the 
TokenizerFactories until more or less only whitespace was as separator, 
while [.:-_\d] remains part of the ngrams. This allows to search for 
".12.255/32" or "xzy-eth5.example.org" without any false positives.


It looks like a straight conversion of this method is not possible since 
the tokenization in pg_trgm is not configurable afaict. Is there some 
other good method to search for a random substring including all the 
punctuation using an index? Or a pg_trgm-style module that is more 
flexible like the Solr/Lucene variant?


Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun; do 
I pretty much just need to change the emitted tokens, or will this lead 
to significant complications in the operators, indexes etc.?


thanks for any hints & cheers
Christian




Re: pg_trgm vs. Solr ngram

2023-02-09 Thread Laurenz Albe
On Fri, 2023-02-10 at 03:20 +0100, Chris wrote:
> I'm pondering migrating an FTS application from Solr to Postgres, just 
> because we use Postgres for everything else.
> 
> The application is basically fgrep with a web frontend. However the 
> indexed documents are very computer network specific and contain a lot 
> of hyphenated hostnames with dot-separated domains, as well as IPv4 and 
> IPv6 addresses. In Solr I was using ngrams and customized the 
> TokenizerFactories until more or less only whitespace was as separator, 
> while [.:-_\d] remains part of the ngrams. This allows to search for 
> ".12.255/32" or "xzy-eth5.example.org" without any false positives.
> 
> It looks like a straight conversion of this method is not possible since 
> the tokenization in pg_trgm is not configurable afaict. Is there some 
> other good method to search for a random substring including all the 
> punctuation using an index? Or a pg_trgm-style module that is more 
> flexible like the Solr/Lucene variant?
> 
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun, do 
> I pretty much just need to change the emitted tokens or will this lead 
> to significant complications in the operators, indexes etc.?

Here is a hack that you can try: pre-process your strings and replace
symbols with rare characters:

  SELECT show_trgm(translate('127.0.0.1/32', './', 'qx'));

  show_trgm
  ═
   {"  1"," 12",0q0,0q1,127,1x3,27q,"32 ",7q0,q0q,q1x,x32}
  (1 row)

Then you could search like

  WHERE translate(search_string, './', 'qx') LIKE translate('%127.0.0.1/32%', 
'./', 'qx')
AND search_string LIKE '%127.0.0.1/32%'

The first condition can use a trigram index, and the second filters out
false positives.

Yours,
Laurenz Albe




Re: pg_trgm vs. Solr ngram

2023-02-09 Thread Tom Lane
Chris  writes:
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun, do 
> I pretty much just need to change the emitted tokens or will this lead 
> to significant complications in the operators, indexes etc.?

See KEEPONLYALNUM in pg_trgm/trgm.h ...

Now, using a custom-modified pg_trgm module in production is likely
not something you want to do for long.  It might be interesting to
look into whether the relatively-recently-invented "operator class
parameter" mechanism could be exploited to allow this behavior to be
customized without hacking C code.

regards, tom lane




Re: pg_trgm vs. Solr ngram

2023-02-09 Thread Bertrand Mamasam
Le ven. 10 févr. 2023, 03:20, Chris  a écrit :

> Hello list
>
> I'm pondering migrating an FTS application from Solr to Postgres, just
> because we use Postgres for everything else.
>
> The application is basically fgrep with a web frontend. However the
> indexed documents are very computer network specific and contain a lot
> of hyphenated hostnames with dot-separated domains, as well as IPv4 and
> IPv6 addresses. In Solr I was using ngrams and customized the
> TokenizerFactories until more or less only whitespace was as separator,
> while [.:-_\d] remains part of the ngrams. This allows to search for
> ".12.255/32" or "xzy-eth5.example.org" without any false positives.
>
> It looks like a straight conversion of this method is not possible since
> the tokenization in pg_trgm is not configurable afaict. Is there some
> other good method to search for a random substring including all the
> punctuation using an index? Or a pg_trgm-style module that is more
> flexible like the Solr/Lucene variant?
>
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun, do
> I pretty much just need to change the emitted tokens or will this lead
> to significant complications in the operators, indexes etc.?
>
> thanks for any hints & cheers
> Christian
>

In Solr you used FTS so I suggest that you do the same in Postgres and look
at the full text search functions. You can create a tsvector yourself in
many different ways or use one of the provided functions. So you could add
complete IP adresses to your index and then search for them using something
like phrase search. You can also create text search configurations or just
use the "simple" one if you just need something like fgrep. Of course, the
end result will be more like Solr and less like fgrep.
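
For illustration, a rough sketch of that direction, using the "simple"
configuration (how well IPs and hyphenated hostnames tokenize would need testing
against your data):

  CREATE TABLE docs (id serial PRIMARY KEY, body text);
  CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('simple', body));

  SELECT id
  FROM docs
  WHERE to_tsvector('simple', body)
        @@ phraseto_tsquery('simple', 'xzy-eth5.example.org');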

https://www.postgresql.org/docs/current/textsearch.html