Question: consolidating strpos searches?

2025-01-04 Thread James Addison
Hello and Happy New Year,

I'd like to validate or reject a performance-related PostgreSQL patch
idea, regarding multiple strpos calls that operate on a common text
input.

>From local testing using PGSQL 17.2, and searching the mailing lists,
my understanding is that each function expression (including any that
are duplicates) in a SQL query is evaluated independently.

To check that, I constructed a pessimal demonstration (psql session
snippet attached here for reference) by generating an extremely long
random text blob ending with the phrase "known suffix".

Running queries against that text locally I encountered the runtimes
(annotated in the SQL comments) below:

  WHERE strpos(v, 'known') > 0; -- 2 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0; -- 4 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0; -- 6 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0; -- 8 seconds

In other words: each additional strpos(value, ...) expression
increased the evaluation time by a similar, significant duration of
two seconds.  This seems to confirm the basis that each expression is
currently evaluated separately, by an independent read from the input
text.

I'd like to suggest the introduction of a documented multiple string
matching algorithm[1], to yield results for each of multiple strpos
calls while only reading through their common input text at-most once.

In the context of considering writing a patch: would the complexity of
implementing such a feature for PostgreSQL be worth the potential
performance benefits?  And either way, is there more I should learn
about and consider?  How would I provide convincing supporting
evidence if I do write a patch?

Thank you,
James

[1] Such as those described in "Flexible Pattern Matching in Strings",
authored by G Navarro and M Raffinot, 2002, published by Cambridge
University Press
testing=> CREATE EXTENSION pgcrypto;
CREATE EXTENSION
testing=> CREATE TEMPORARY TABLE test AS SELECT 
string_agg(gen_random_bytes(1024)::text, '') || 'known suffix' AS value FROM 
generate_series(1, 512*512);
SELECT 1
testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0;
 QUERY PLAN 
 
-
 Aggregate  (cost=31.53..31.54 rows=1 width=8) (actual time=2113.529..2113.530 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..30.40 rows=453 width=0) (actual 
time=2086.105..2113.521 rows=1 loops=1)
 Filter: (strpos(value, 'known'::text) > 0)
 Planning Time: 0.205 ms
 Execution Time: 2113.600 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0;
 QUERY PLAN 
 
-
 Aggregate  (cost=37.58..37.59 rows=1 width=8) (actual time=4196.635..4196.636 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..37.20 rows=151 width=0) (actual 
time=4140.200..4196.626 rows=1 loops=1)
 Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0))
 Planning Time: 0.111 ms
 Execution Time: 4196.683 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0;
  QUERY PLAN
  
--
 Aggregate  (cost=44.38..44.39 rows=1 width=8) (actual time=6269.283..6269.284 
rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..44.00 rows=151 width=0) (actual 
time=6179.754..6269.275 rows=1 loops=1)
 Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 
'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0))
 Planning Time: 0.101 ms
 Execution Time: 6269.329 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 
'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0 
AND strpos(value, 'suffix') > 0;

  QUERY PLAN
  
--
 Aggregate  (cost=51.18..51.19 rows=1 width=8) (actual time=8190.171..8190.172 
rows=1 loops=1)

Re: Question: consolidating strpos searches?

2025-01-15 Thread James Addison
On Sat, 4 Jan 2025 at 19:04, Greg Sabino Mullane  wrote:
>
> On Sat, Jan 4, 2025 at 12:16 PM James Addison  wrote:
>>
>> In the context of considering writing a patch: would the complexity of 
>> implementing such a feature for PostgreSQL be worth the potential
>> performance benefits?
>
> Probably not. As Tom said, this sounds like it should be tried as an 
> extension.

Will do; thanks, both of you.

>> And either way, is there more I should learn about and consider?  How would 
>> I provide convincing supporting
>> evidence if I do write a patch?
>
> As this is the performance mailing list, it might help to describe the 
> real-world problem being encountered here. There are other ways to solve this 
> particular issue. Among them would be using OR not AND in your contrived 
> example, using partial indexes, using pg_trgm,  using regular expressions ( 
> i.e. WHERE value ~ '(known|suffix)' ), redesigning your table and/or queries, 
> and outsourcing the searching of large strings to a system more suitable for 
> it.

The example is indeed contrived, and the idea doesn't resolve a
problem I've encountered -- in fact, my interest stems from an open
TODO item to implement Boyer-Moore string search.  I began considering
how to implement multiple string pattern search in that context -- but
LIKE/ILIKE introduce a few non-trivial considerations -- notably
wildcard patterns -- compared to strpos.  Whether to require strict
ordering of search results can also be relevant, depending on the
pattern match approach (and boolean operators, as noted) involved.