CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288
Running PostgreSQL 9.5 on Windows. 

 

The CTE mentioned below completes the query in 4.5 seconds while the regular
query takes 66 seconds. I read from EXPLAIN ANALYSE that the regular query
starts with a full table scan over "Doc" while the CTE joins the two tables
first and applies the filter condition in the 2nd step. 

I believe that some rows in "Doc" which are not referenced by "F" contain a
large amount of data in the field "szText" and this will slow down the ILIKE
operator. 

 

What can I do to improve the performance of the regular query without using
a CTE? 

 

This is a much simplified extract from a larger application: 

 

CREATE TABLE Doc (

  oID UUID NOT NULL PRIMARY KEY,

  uDocID UUID NOT NULL UNIQUE,

  szText TEXT 

  );

 

CREATE TABLE F (

  oID UUID NOT NULL PRIMARY KEY,

  uDocRef UUID,

  CONSTRAINT F_fkey1 FOREIGN KEY (uDocRef) REFERENCES Doc (uDocID)

  );  

 

-- just in case.

ALTER TABLE Doc ALTER uDocID SET STATISTICS 1;

ALTER TABLE Doc ALTER szText SET STATISTICS 1;

VACUUM ANALYSE Doc;

 

SELECT COUNT(*) FROM Doc;

=> 125946 records 

 

ALTER TABLE F ALTER uDocRef SET STATISTICS 1;

VACUUM ANALYSE F;

 

SELECT COUNT(*) FROM F;

=> 32605 records

 

Result with CTE: 

 

EXPLAIN ANALYSE 

  WITH a AS (

  SELECT F.oID, Doc.szText

  FROM F

  JOIN Doc ON F.uDocRef = Doc.udocid

) 

SELECT * 

  FROM a 

  WHERE szText ILIKE '%480GB%';

 
 
"CTE Scan on a  (cost=9463.42..10197.03 rows=52 width=48) (actual time=478.770..4551.613 rows=10 loops=1)"
"  Filter: (sztext ~~* '%480GB%'::text)"
"  Rows Removed by Filter: 32595"
"  CTE a"
"    ->  Hash Join  (cost=973.61..9463.42 rows=32605 width=359) (actual time=36.998..100.337 rows=32605 loops=1)"
"          Hash Cond: (doc.udocid = f.udocref)"
"          ->  Seq Scan on doc  (cost=0.00..7691.46 rows=125946 width=359) (actual time=0.008..18.269 rows=125946 loops=1)"
"          ->  Hash  (cost=566.05..566.05 rows=32605 width=32) (actual time=35.825..35.825 rows=32605 loops=1)"
"                Buckets: 32768  Batches: 1  Memory Usage: 2294kB"
"                ->  Seq Scan on f  (cost=0.00..566.05 rows=32605 width=32) (actual time=0.005..14.677 rows=32605 loops=1)"
"Planning time: 4.689 ms"
"Execution time: 4554.893 ms"

 

Result with regular query: 

 

EXPLAIN ANALYSE 

SELECT F.oID, Doc.szText

FROM F

JOIN Doc ON F.uDocRef = Doc.udocid

WHERE szText ILIKE '%480GB%';

 
 
"Hash Join  (cost=8006.56..8694.93 rows=5 width=359) (actual time=66500.415..66506.978 rows=10 loops=1)"
"  Hash Cond: (f.udocref = doc.udocid)"
"  ->  Seq Scan on f  (cost=0.00..566.05 rows=32605 width=32) (actual time=0.002..3.143 rows=32605 loops=1)"
"  ->  Hash  (cost=8006.32..8006.32 rows=19 width=359) (actual time=66500.023..66500.023 rows=16 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 19kB"
"        ->  Seq Scan on doc  (cost=0.00..8006.32 rows=19 width=359) (actual time=8864.720..66499.991 rows=16 loops=1)"
"              Filter: (sztext ~~* '%480GB%'::text)"
"              Rows Removed by Filter: 125930"
"Planning time: 263.542 ms"
"Execution time: 66507.003 ms"


Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Andreas Kretschmer




On 18.08.2018 at 11:36, kpi6...@gmail.com wrote:
What can I do to improve the performance of the regular query without 
using a CTE? 


try to rewrite it to a subselect:

select ... from ... join (select ... from ... where ...) x on ...


Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com




Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288



> -Original Message-
> From: Andreas Kretschmer 
> Sent: Saturday, 18 August 2018 12:27
 
> On 18.08.2018 at 11:36, kpi6...@gmail.com wrote:
> > What can I do to improve the performance of the regular query without
> > using a CTE?
> 
> try to rewrite it to a subselect:
> 
> select ... from ... join (select ... from ... where ...) x on ...
> 

Do you mean like this?

EXPLAIN ANALYSE 
SELECT F.oID, D.szText
FROM F
JOIN (SELECT Doc.uDocID, Doc.szText FROM Doc WHERE szText ILIKE '%480GB%')
AS D ON D.uDocID = F.uDocRef;

Just as bad as my regular query: 

"Hash Join  (cost=8006.56..8694.93 rows=5 width=359) (actual time=66777.898..66784.630 rows=10 loops=1)"
"  Hash Cond: (f.udocref = doc.udocid)"
"  ->  Seq Scan on f  (cost=0.00..566.05 rows=32605 width=32) (actual time=0.002..3.563 rows=32605 loops=1)"
"  ->  Hash  (cost=8006.32..8006.32 rows=19 width=359) (actual time=66777.471..66777.471 rows=16 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 19kB"
"        ->  Seq Scan on doc  (cost=0.00..8006.32 rows=19 width=359) (actual time=9013.317..66777.438 rows=16 loops=1)"
"              Filter: (sztext ~~* '%480GB%'::text)"
"              Rows Removed by Filter: 125930"
"Planning time: 236.354 ms"
"Execution time: 66784.651 ms"




Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Adrian Klaver

On 08/18/2018 04:08 AM, kpi6...@gmail.com wrote:




-Original Message-
From: Andreas Kretschmer 
Sent: Saturday, 18 August 2018 12:27
  

On 18.08.2018 at 11:36, kpi6...@gmail.com wrote:

What can I do to improve the performance of the regular query without
using a CTE?


try to rewrite it to a subselect:

select ... from ... join (select ... from ... where ...) x on ...



Do you mean like this?

EXPLAIN ANALYSE
SELECT F.oID, D.szText
FROM F
JOIN (SELECT Doc.uDocID, Doc.szText FROM Doc WHERE szText ILIKE '%480GB%')
AS D ON D.uDocID = F.uDocRef;


To try to replicate what the CTE is doing I would try:

SELECT *
FROM Doc
JOIN (SELECT uDocRef, F.oID, Doc.szText
      FROM F
      JOIN Doc ON F.uDocRef = Doc.udocid) AS D
ON D.uDocRef = Doc.udocid
WHERE D.szText ILIKE '%480GB%';





Just as bad as my regular query:

"Hash Join  (cost=8006.56..8694.93 rows=5 width=359) (actual time=66777.898..66784.630 rows=10 loops=1)"
"  Hash Cond: (f.udocref = doc.udocid)"
"  ->  Seq Scan on f  (cost=0.00..566.05 rows=32605 width=32) (actual time=0.002..3.563 rows=32605 loops=1)"
"  ->  Hash  (cost=8006.32..8006.32 rows=19 width=359) (actual time=66777.471..66777.471 rows=16 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 19kB"
"        ->  Seq Scan on doc  (cost=0.00..8006.32 rows=19 width=359) (actual time=9013.317..66777.438 rows=16 loops=1)"
"              Filter: (sztext ~~* '%480GB%'::text)"
"              Rows Removed by Filter: 125930"
"Planning time: 236.354 ms"
"Execution time: 66784.651 ms"






--
Adrian Klaver
adrian.kla...@aklaver.com



Re: regex match and special characters

2018-08-18 Thread Oleksii Kliukin

> On 16. Aug 2018, at 16:57, Tom Lane  wrote:
> 
> Alex Kliukin  writes:
>> Here is a simple SQL statement that gives different results on PostgreSQL 
>> 9.6 and PostgreSQL 10+. The space character at the end of the string is 
>> actually U+2006 SIX-PER-EM SPACE 
>> (http://www.fileformat.info/info/unicode/char/2006/index.htm)
> 
> I think the reason for the discrepancy is that in v10 we fixed the regex
> locale support so that it could properly classify code points above U+7FF,
> cf
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c54159d44ceaba26ceda9fea1804f0de122a8f30
>  
> 

This nails down the cause, thanks a lot for the link! Apparently I missed it 
in the PostgreSQL 10 release notes, where it appears in the “Queries” section, 
although AFAIK it deserved an entry under “Migration to Version 10”, as it may 
potentially make dump/restore from previous versions to version 10 error out if 
there are table constraints that use regex classes over Unicode text fields 
with code points above U+7FF.

> 
> So 10 is giving the right answer (i.e. that \s matches U+2006).
> 9.x is not

Agreed.

Cheers,
Alex

Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288
> -Original Message-
> From: Adrian Klaver 
> Sent: Saturday, 18 August 2018 16:24
> 
> To try to replicate what the CTE is doing I would try:
> SELECT *
> FROM Doc
> JOIN (SELECT uDocRef, F.oID, Doc.szText
>       FROM F
>       JOIN Doc ON F.uDocRef = Doc.udocid) AS D
> ON D.uDocRef = Doc.udocid
> WHERE D.szText ILIKE '%480GB%'

No difference - still starting with the full scan on Doc and lasting 67 
seconds: 

"Nested Loop  (cost=8006.98..8700.40 rows=5 width=750) (actual time=66845.857..66852.705 rows=10 loops=1)"
"  ->  Hash Join  (cost=8006.56..8694.93 rows=5 width=391) (actual time=66845.838..66852.613 rows=10 loops=1)"
"        Hash Cond: (f.udocref = doc_1.udocid)"
"        ->  Seq Scan on f  (cost=0.00..566.05 rows=32605 width=32) (actual time=0.002..3.428 rows=32605 loops=1)"
"        ->  Hash  (cost=8006.32..8006.32 rows=19 width=359) (actual time=66845.431..66845.431 rows=16 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 19kB"
"              ->  Seq Scan on doc doc_1  (cost=0.00..8006.32 rows=19 width=359) (actual time=9042.984..66845.398 rows=16 loops=1)"
"                    Filter: (sztext ~~* '%480GB%'::text)"
"                    Rows Removed by Filter: 125930"
"  ->  Index Scan using doc_udocid_key on doc  (cost=0.42..1.08 rows=1 width=375) (actual time=0.008..0.008 rows=1 loops=10)"
"        Index Cond: (udocid = f.udocref)"
"Planning time: 252.162 ms"
"Execution time: 66852.737 ms"





Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Stephen Frost
Greetings,

* kpi6...@gmail.com (kpi6...@gmail.com) wrote:
> The CTE mentioned below completes the query in 4.5 seconds while the regular
> query takes 66 seconds. I read from EXPLAIN ANALYSE that the regular query
> starts with a full table scan over "Doc" while the CTE joins the two tables
> first and applies the filter condition in the 2nd step. 
> 
> I believe that some rows in "Doc" which are not referenced by "F" contain a
> large amount of data in the field "szText" and this will slow down the ILIKE
> operator. 

Yup, that appears to be what's happening.

> What can I do to improve the performance of the regular query without using
> a CTE? 

You could possibly build a trigram index on the field you're searching,
which could avoid the full table scan.  Of course, that index could be
quite large, so there's downsides to that.  If these are words you're
looking for then you could use PG's full text indexing to build indexes
on the words and then use that instead.  If you are fine working with
words but are concerned about misspellings then you can extract out the
distinct words, build a trigram index on those, find the most similar
words based on the input and then search for those words using the FTI.
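
For concreteness, a minimal sketch of both options (index names invented
here for illustration, and assuming the pg_trgm extension can be installed):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- trigram index; lets ILIKE '%480GB%' use the index instead of
-- re-reading every szText value:
CREATE INDEX doc_sztext_trgm_idx ON Doc USING gin (szText gin_trgm_ops);

-- or, full text search over whole words:
CREATE INDEX doc_sztext_fts_idx ON Doc USING gin (to_tsvector('simple', szText));
-- queried as:
--   SELECT oID FROM Doc
--   WHERE to_tsvector('simple', szText) @@ to_tsquery('simple', '480GB');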

Unfortunately, we don't currently pay attention to things like average
string length when considering the cost of performing an 'ilike', so we
figure that doing the filtering first and then the join will be faster,
but that obviously falls over in some cases, like this one.  Using the
CTE forces PG to (today, at least) do the join first, but that isn't
really good to rely on.
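
Another fence with the same caveat, if a CTE is awkward to generate: an
OFFSET 0 in a subquery also keeps the planner (today, at least) from
flattening the subquery or pushing the filter down into it, e.g.:

SELECT *
FROM (SELECT F.oID, Doc.szText
      FROM F
      JOIN Doc ON F.uDocRef = Doc.uDocID
      OFFSET 0) AS a
WHERE szText ILIKE '%480GB%';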

Thanks!

Stephen




Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288
> -Original Message-
> From: Stephen Frost 
> Sent: Saturday, 18 August 2018 16:39

Hello,

> 
> > What can I do to improve the performance of the regular query without
> > using a CTE?
> 
> You could possibly build a trigram index on the field you're searching,
> which could avoid the full table scan.  Of course, that index could be
> quite large, so there's downsides to that.  If these are words you're
> looking for then you could use PG's full text indexing to build indexes
> on the words and then use that instead.  If you are fine working with
> words but are concerned about misspellings then you can extract out the
> distinct words, build a trigram index on those, find the most similar
> words based on the input and then search for those words using the FTI.
> 
> Unfortunately, we don't currently pay attention to things like average
> string length when considering the cost of performing an 'ilike', so we
> figure that doing the filtering first and then the join will be faster,
> but that obviously falls over in some cases, like this one.  Using the
> CTE forces PG to (today, at least) do the join first, but that isn't
> really good to rely on.

A trigram index would be a possible help in this particular scenario, but its
size and the cost of updating it in other parts of the application would
probably create other issues. I may try it, though.

But thanks for confirming my assumption. I just thought that it should be
obvious to the optimizer to do the join first and filter on this result. But
I read from your post that there is nothing I can do to modify the
behavior of the optimizer. Or is there a way to specify the cost for an
operator (ILIKE in this case) on a specific column?

Thanks 
Klaus




Re: regex match and special characters

2018-08-18 Thread Oleksii Kliukin
Hi Adrian,

> On 16. Aug 2018, at 18:13, Adrian Klaver  wrote:
> 
> test=# select 'abcd'||chr(8198) ~ 'abcd\s';
> ?column?
> --
> t
> (1 row)
> 
> 
> Wonder if the OP has standard_conforming_strings='off' and
> escape_string_warning='off'?
> 

Both are set to ‘on’ for me for all versions (I believe those are default 
settings). I have 12devel indeed on my test system alongside 9.6, but I’ve 
tried it as well on PostgreSQL 10 running on a different distro with different 
locale settings and it produced the same result (check being true). 

I think Tom’s answer solves it, although I am wondering how you got true in 
the statement quoted above on PostgreSQL 9.6; perhaps that result is actually 
from PostgreSQL 10?

Cheers,
Oleksii

Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Tom Lane
Stephen Frost  writes:
> * kpi6...@gmail.com (kpi6...@gmail.com) wrote:
>> The CTE mentioned below completes the query in 4.5 seconds while the regular
>> query takes 66 seconds.

> Unfortunately, we don't currently pay attention to things like average
> string length when considering the cost of performing an 'ilike', so we
> figure that doing the filtering first and then the join will be faster,
> but that obviously falls over in some cases, like this one.  Using the
> CTE forces PG to (today, at least) do the join first, but that isn't
> really good to rely on.

Well, it's simpler than that: filter quals are always evaluated at
the lowest possible plan level.  One of the Berkeley PhD theses that
we ripped out ages ago tried to be smarter about that, but the
cost/benefit/complexity ratio just wasn't very good, mainly because
it's so darn hard to estimate the selectivity of quals on subsets
of relations.

It's not very apparent why the results are so bad in this case,
either.  One of the plans has the ILIKE being applied to circa 32600
rows, and the other one runs it on circa 126000 rows.  That should
produce less than a 4x penalty, not 14x.  Do the rows removed by
the join have significantly-longer-on-average sztext fields?
(If so, the odds that the planner would ever recognize such a
correlation seem pretty small.)
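
(A quick way to check that, sketched against the schema posted upthread:

SELECT EXISTS (SELECT 1 FROM F WHERE F.uDocRef = d.uDocID) AS referenced,
       count(*) AS n,
       avg(length(d.szText)) AS avg_len
FROM Doc d
GROUP BY 1;

would show whether the unreferenced rows carry the long documents.)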

In any case, given that the ILIKE selects so few rows (and the planner
knows it!), finding a way to index that is clearly the right answer.

regards, tom lane



Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288
> -Original Message-
> From: Tom Lane 
> Sent: Saturday, 18 August 2018 17:29
> 
> Well, it's simpler than that: filter quals are always evaluated at the
> lowest possible plan level.  

Thank you. This "always" was not clear to me, but it explains a few similar
cases (with not-so-extreme differences) that I could not understand.

Regards
Klaus




Re: vPgSql

2018-08-18 Thread Dmitri Maziuk
On Sat, 18 Aug 2018 10:52:59 +1000
Tim Cross  wrote:
 
> I think pretty much all *nix systems put core shells like sh, bash, zsh
> etc in /bin (as it is guaranteed to be available immediately at boot,
> while /usr is not - it could be a separate partition which isn't
> available until later in the boot process).

They do, but at least Solaris circa v9 and now Red Hat have replaced /bin with a 
symlink to /usr/bin. Whether you get /bin/sh or /usr/bin/sh out of `which` 
depends purely on their order in your $PATH.
 
> A way to avoid platform differences is to use /usr/bin/env e.g.
> 
> #!/usr/bin/env bash

Except now you can't pass arguments to your bash. 

You can't win. Although in the case of shell scripts, you can get rid of all 
bash'isms and call '/bin/sh'.
-- 
Dmitri Maziuk 



Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Ravi Krishna
> What can I do to improve the performance of the regular query without using a 
> CTE? 

Why do you care?  When I find that I can write a SQL statement 3 different 
ways, I will go for the most efficient one.  So why not accept the CTE version 
of this SQL?  Just curious.




Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread kpi6288
> -Original Message-
> From: Ravi Krishna 
> Sent: Saturday, 18 August 2018 18:25
> 
> > What can I do to improve the performance of the regular query without
> using a CTE?
> 
> Why do you care?  When I find that I can write a SQL statement 3 different ways,
> I will go for the most efficient one.  So why not accept the CTE version of this SQL?
> Just curious.

We're using object mapping / entity frameworks (e.g. XPO, Entity Framework 
Core). These frameworks support regular queries out of the box; CTEs require 
additional effort and are more difficult to maintain. 

Regards
Klaus




Re: regex match and special characters

2018-08-18 Thread Adrian Klaver

On 08/18/2018 08:12 AM, Oleksii Kliukin wrote:

Hi Adrian,

On 16. Aug 2018, at 18:13, Adrian Klaver wrote:


test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
--
t
(1 row)


Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?



Both are set to ‘on’ for me for all versions (I believe those are 
default settings). I have 12devel indeed on my test system alongside 
9.6, but I’ve tried it as well on PostgreSQL 10 running on a different 
distro with different locale settings and it produced the same result 
(check being true).


I think Tom’s answer solves it, although I am wondering how you got 
true in the statement quoted above on PostgreSQL 9.6; perhaps that 
result is actually from PostgreSQL 10?


Yes, it was on Postgres 10; I got reversed in my thinking. A subsequent post:

https://www.postgresql.org/message-id/8b8d9250-1ee3-6aff-b580-592c7c759586%40aklaver.com

shows the behavior you saw on 9.6. That's what happens when I work through my 
thought process online :(




Cheers,
Oleksii



--
Adrian Klaver
adrian.kla...@aklaver.com



Re: CTE with JOIN of two tables is much faster than a regular query

2018-08-18 Thread Tim Cross


kpi6...@gmail.com writes:

>> -Original Message-
>> From: Ravi Krishna 
>> Sent: Saturday, 18 August 2018 18:25
>> 
>> > What can I do to improve the performance of the regular query without
>> > using a CTE?
>> 
>> Why do you care?  When I find that I can write a SQL statement 3 different
>> ways, I will go for the most efficient one.  So why not accept the CTE
>> version of this SQL?  Just curious.
>
> We're using object mapping / entity frameworks (e.g. XPO, Entity Framework 
> Core). These frameworks support regular queries out of the box; CTEs 
> require additional effort and are more difficult to maintain. 
>

Ah, another reason to avoid object mapping/entity frameworks! I guess
really the same reason - loss of flexibility and expressive power.

Sorry, having a similar battle with some developers who are insisting on
using a particular framework because it makes maintenance easier as it
'automates' creation of controllers (MVC). However, they are frustrated
by performance and I'm frustrated as the framework also fails to pass
additional information, such as PGAPPNAME, which would make some
analysis easier. Part of the reason for the performance issues is
because the developers are doing things with result sets within the
client that would be far more efficient performed within the database.

One way I have resolved this in the past is to create database
procedures which present a 'mapped' view back to the framework layer
and hide the SQL from the framework. This works well, the main
downside being that you now have SQL in a different (another) place, which
can make some people uncomfortable and can be a maintenance issue if all
your developers are just front-end devs who treat a database as just a
key/value repository.
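
A sketch of that approach for the query upthread (function name invented
here): wrap the CTE in a set-returning SQL function, which the framework can
then select from like a table:

CREATE FUNCTION find_docs(pattern text)
  RETURNS TABLE (oID uuid, szText text) AS $$
    WITH a AS (
      SELECT F.oID, Doc.szText
      FROM F
      JOIN Doc ON F.uDocRef = Doc.uDocID
    )
    SELECT a.oID, a.szText FROM a WHERE a.szText ILIKE pattern
  $$ LANGUAGE sql STABLE;

-- SELECT * FROM find_docs('%480GB%');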

Tim
-- 
Tim Cross