Slow SELECT

2020-05-26 Thread Frank Millman

Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL compared 
with Sql Server and sqlite3. I am trying to understand why.


I have a table that looks like this (simplified) -

CREATE TABLE my_table (
row_id SERIAL PRIMARY KEY,
deleted_id INT DEFAULT 0,
fld_1 INT REFERENCES table_1(row_id),
fld_2 INT REFERENCES table_2(row_id),
fld_3 INT REFERENCES table_3(row_id),
fld_4 INT REFERENCES table_4(row_id),
tran_date DATE,
tran_total DEC(21,2)
);

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3, 
fld_4, tran_date) WHERE deleted_id = 0;


The table sizes are -
my_table : 167 rows
table_1 : 21 rows
table_2 : 11 rows
table_3 : 3 rows
table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially 
21x11x3x16 = 11088 rows. Most will be null.


I want to select the row_id for the last tran_date for each of those 
potential groups. This is my select -


  SELECT (
      SELECT a.row_id FROM my_table a
      WHERE a.fld_1 = b.row_id
      AND a.fld_2 = c.row_id
      AND a.fld_3 = d.row_id
      AND a.fld_4 = e.row_id
      AND a.deleted_id = 0
      ORDER BY a.tran_date DESC LIMIT 1
  )
  FROM table_1 b, table_2 c, table_3 d, table_4 e;

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on 
sqlite3, and 0.31 sec on PostgreSQL.


I have looked at the EXPLAIN, but I don't really know what to look for. 
I can supply it if that would help.


Thanks for any advice.

Frank Millman




Re: Slow SELECT

2020-05-26 Thread Frank Millman

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman <fr...@chagford.com> wrote:


[...]


SQL Server does a good job at caching data in memory. PostgreSQL does 
too on consecutive calls to the same table. What execution time do you 
get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and check 
in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.





Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN 
shows that it is using a SEQUENTIAL SCAN.
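
For reference, the index added here was presumably something along 
these lines (the index name is an assumption; it is not shown in the 
thread) -

    CREATE INDEX my_table_tran_date_ndx ON my_table (tran_date);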


Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))



Frank




Re: Slow SELECT

2020-05-26 Thread Frank Millman

On 2020-05-26 11:10 AM, Charles Clavadetscher wrote:

Hello

On 2020-05-26 10:38, Frank Millman wrote:

[...]


If I see it correctly, the query runs sequential scans on all tables, 
i.e. table_1 to table_4.

Do you have an index on the referenced keys (row_id) in table_1 to table_4?

It often happens that referenced keys are not indexed, leading to poor 
execution plans.




The referenced keys are all defined as SERIAL PRIMARY KEY in their own 
tables, so I presume that they are all indexed automatically.


On the other hand, there are not many rows in those tables, so the 
planner may decide not to use the index in that case.


Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 12:02 PM, Christian Ramseyer wrote:

Hi

On 26.05.20 09:22, Frank Millman wrote:


I have looked at the EXPLAIN, but I don't really know what to look for.
I can supply it if that would help.



My favorite approach to tuning Postgres queries is:

1. Run EXPLAIN ANALYZE 
2. Copy/Paste the output into the fantastic https://explain.depesz.com/

This will turn the somewhat hard-to-understand explain output into a
nice colored structure. If it's not obvious from the orange-reddish
boxes where the slowness comes from, please post the link here and
somebody will certainly have some advice.
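
For example, applied to the query from this thread (the BUFFERS option 
is not required, but often helpful) -

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT (
        SELECT a.row_id FROM my_table a
        WHERE a.fld_1 = b.row_id
        AND a.fld_2 = c.row_id
        AND a.fld_3 = d.row_id
        AND a.fld_4 = e.row_id
        AND a.deleted_id = 0
        ORDER BY a.tran_date DESC LIMIT 1
    )
    FROM table_1 b, table_2 c, table_3 d, table_4 e;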



Thanks, Christian. I will definitely look into that.

Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 11:27 AM, Charles Clavadetscher wrote:

On 2020-05-26 11:10, Charles Clavadetscher wrote:

[...]


If I see it correctly, the query runs sequential scans on all tables,
i.e. table_1 to table_4.

Do you have an index on the referenced keys (row_id) in table_1 to
table_4?

It often happens that referenced keys are not indexed, leading to poor
execution plans.

Bye
Charles


I noticed later that you have very small tables. This will probably lead 
to a sequential scan although there is an index in place.


I am not sure if it makes a difference, but what about using explicit 
joins?


SELECT a.row_id FROM my_table a
JOIN table_1 b ON (b.row_id = a.fld_1)
JOIN table_2 c ON (c.row_id = a.fld_2)
JOIN table_3 d ON (d.row_id = a.fld_3)
JOIN table_4 e ON (e.row_id = a.fld_4)
WHERE a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1;



Thanks, Charles. I tried that, but unfortunately it produces a different 
result. I need to test for every possible combination of fld1-4, and get 
the highest date for each one. Using joins only tests existing 
combinations, and gets the highest date for all of them combined.


See my reply to David Rowley. I do not fully understand his solution 
yet, but it seems to be what I am looking for.


Thanks again

Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 12:04 PM, David Rowley wrote:

On Tue, 26 May 2020 at 19:23, Frank Millman  wrote:

[...]


Perhaps SQL Server is doing something to rewrite the subquery in the
target list to a LEFT JOIN.  PostgreSQL currently does not do that.

Since "my_table" is small, you'd likely be much better doing a manual
rewrite of the query to join a subquery containing the required
details from "my_table".  It looks like you want the row_id from the
latest tran_date for each fld_N column. So something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
    SELECT fld_1, fld_2, fld_3, fld_4, row_id, tran_date,
        ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
            ORDER BY tran_date DESC) row_num
    FROM my_table
    WHERE deleted_id = 0
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id
   AND a.fld_3 = d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;

Should do it. You could also perhaps do something with DISTINCT ON
instead of using ROW_NUMBER(). That might be a bit more efficient, but
it's unlikely to matter too much since there are only 167 rows in that
table.



Thank you David. I tried that and it produced the correct result in 
53ms, which is what I am looking for.


It will take me some time to understand it fully, so I have some 
homework to do!


Much appreciated.

Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 1:45 PM, David Rowley wrote:

On Tue, 26 May 2020 at 22:31, Frank Millman  wrote:


Thank you David. I tried that and it produced the correct result in
53ms, which is what I am looking for.

It will take me some time to understand it fully, so I have some
homework to do!


The main problem with your previous query was that the subquery was
being executed 11088 times but could only ever find anything 167
times; the remaining executions found nothing.

I just changed the subquery which would be executed once per output
row and altered it so it became a subquery that's joined and only
executed once.  The ROW_NUMBER() is a windowing function, which is
explained in [1].  I used this to get the row_id of the record with
the latest tran_date, just like you were doing with the ORDER BY
tran_date DESC LIMIT 1, but the subquery with the windowing function
gets them all at once, rather than doing it in a way that requires it
to be executed once for each row in the top-level query. In this case,
the functionality that the LIMIT 1 does in your query is achieved with
"AND a.row_num = 1;" in my version. This is pretty fast to execute
once due to there only being 167 rows.

It's also important to know that there may be cases where the method I
proposed is slower.  For example, if my_table was very large and
contained rows that were not in table_1 to table_4. Since the subquery
in my version calculates everything then it could be wasteful to do
that for values that would never be used. For you, you have foreign
keys that ensure my_table does not contain records that are not in the
other tables, but you could still see this issue if you were to add
some restrictive WHERE clause to the outer query.  Perhaps this won't
be a problem for you, but it's likely good to know.

[1] https://www.postgresql.org/docs/current/tutorial-window.html



Thanks very much for the explanation. I will go through it carefully.

For the record, your query works without modification in both Sql Server 
and sqlite3. It is also much faster in all three cases - all around 
0.005 sec instead of 0.05 sec.


Frank




Re: Slow SELECT

2020-05-29 Thread Frank Millman




On 2020-05-26 12:04 PM, David Rowley wrote:


Since "my_table" is small, you'd likely be much better doing a manual
rewrite of the query to join a subquery containing the required
details from "my_table".  It looks like you want the row_id from the
latest tran_date for each fld_N column. So something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
    SELECT fld_1, fld_2, fld_3, fld_4, row_id, tran_date,
        ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
            ORDER BY tran_date DESC) row_num
    FROM my_table
    WHERE deleted_id = 0
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id
   AND a.fld_3 = d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;

Should do it. You could also perhaps do something with DISTINCT ON
instead of using ROW_NUMBER(). That might be a bit more efficient, but
it's unlikely to matter too much since there are only 167 rows in that
table.



I have studied the above SELECT, and I now more or less understand it. I 
would not have come up with that unaided, so many thanks.


I tried DISTINCT ON, and it was very efficient, but unfortunately that 
is not supported by SQL Server or sqlite3.
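
For reference, a DISTINCT ON version of the same idea might look like 
this (a sketch only; the exact query tried is not shown in the thread) -

    SELECT DISTINCT ON (fld_1, fld_2, fld_3, fld_4) row_id
    FROM my_table
    WHERE deleted_id = 0
    ORDER BY fld_1, fld_2, fld_3, fld_4, tran_date DESC;

DISTINCT ON keeps the first row of each (fld_1, fld_2, fld_3, fld_4) 
group in the given ORDER BY order, i.e. the row with the latest tran_date.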


Then I came up with this alternative, which works on all three platforms 
and seems a bit faster than the above -


SELECT a.row_id FROM (
    SELECT row_id,
        ROW_NUMBER() OVER (PARTITION BY fld_1, fld_2, fld_3, fld_4
            ORDER BY tran_date DESC) row_num
    FROM my_table
    WHERE deleted_id = 0
) AS a
WHERE a.row_num = 1

Do you see any problem with this?

Thanks

Frank




Question about locking

2022-08-06 Thread Frank Millman

Hi all

Apologies if this is an FAQ, but I could not find the answer I was 
looking for.


I want to run two SELECTs, one immediately after the other.

I want to ensure that no other process updates the database in between 
the two.


What is the best way to achieve this?

Thanks

Frank Millman






Re: Question about locking

2022-08-06 Thread Frank Millman



On 2022-08-07 7:54 AM, Ron wrote:

On 8/7/22 00:30, Frank Millman wrote:

Hi all

Apologies if this is an FAQ, but I could not find the answer I was 
looking for.


I want to run two SELECTs, one immediately after the other.

I want to ensure that no other process updates the database in 
between the two.


What is the best way to achieve this?


You probably want to put the queries in a Repeatable Read transaction, 
or even a READ ONLY serialized transaction.
https://www.postgresql.org/docs/12/transaction-iso.html#XACT-REPEATABLE-READ 


https://www.postgresql.org/docs/12/transaction-iso.html#XACT-SERIALIZABLE
https://www.postgresql.org/docs/12/sql-set-transaction.html
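
A minimal sketch of that approach, reusing the my_table example from the 
earlier thread as a stand-in for the real queries -

    BEGIN ISOLATION LEVEL REPEATABLE READ READ ONLY;
    SELECT COUNT(*) FROM my_table;          -- first query
    SELECT SUM(tran_total) FROM my_table;   -- second query, sees the same snapshot
    COMMIT;

Both SELECTs then work from one consistent snapshot. Note that this does 
not block concurrent writers; their changes are simply not visible 
inside this transaction.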


Thanks, Ron. That is exactly what I am looking for.

Frank






SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman
    Filter: ((tran_date < '2018-03-01'::date) AND (deleted_id = 0) AND (ledger_row_id = 1))

(24 rows)

Maybe SQL Server has a way of optimising this, and there is nothing more 
I can do. I can live with that. But I just thought I would ask the question.


Thanks for any advice.

Frank Millman






Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman



On 2021-03-19 10:29 AM, Thomas Kellerer wrote:

Frank Millman wrote on 19.03.2021 at 09:19:

This may be a non-issue, and I don't want to waste your time. But perhaps 
someone can have a look to see if there is anything obvious I have missed.

I am writing a cross-platform accounting app, and I test using Sql
Server on Windows 10 and PostgreSql on Fedora 31. Performance is
usually very similar, with a slight edge to PostgreSql. Now I have a
SELECT which runs over twice as fast on Sql Server compared to
PostgreSql.


Can you change the SELECT statement?

Very often "distinct on ()" is faster in Postgres compared to the equivalent 
solution using window functions

Something along the lines (for the first derived table):

SELECT ...
FROM (
    SELECT a.source_code_id, SUM(a.tran_tot) AS cl_tot
    FROM (
        SELECT distinct on (location_row_id, function_row_id, source_code_id)
            source_code_id, tran_tot
        FROM prop.ar_totals
        WHERE deleted_id = 0
          AND tran_date <= '2018-03-31'
          AND ledger_row_id = 1
        ORDER BY location_row_id, function_row_id, source_code_id, tran_date DESC
    ) AS a
    GROUP BY a.source_code_id
) as cl_bal
...

Thanks, Thomas

I tried that, and it ran about 10% faster. Every little helps, but SQL 
Server appears to have some secret sauce!


Frank






Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman



On 2021-03-19 11:04 AM, Thomas Kellerer wrote:

Frank Millman wrote on 19.03.2021 at 09:52:

[...]

Thanks, Thomas

I tried that, and it ran about 10% faster. Every little helps, but SQL Server 
appears to have some secret sauce!

The two derived tables (cl_bal, op_bal) seem to be doing exactly the same thing 
- at least I can't spot a difference.

If that is correct, you can move them into a common table expression - maybe 
detecting that is SQL Server's secret sauce.

 with totals as (
     SELECT a.source_code_id, SUM(a.tran_tot) AS total
     FROM (
         SELECT distinct on (location_row_id, function_row_id, source_code_id)
             source_code_id, tran_tot
         FROM prop.ar_totals
         WHERE deleted_id = 0
           AND tran_date <= '2018-03-31'
           AND ledger_row_id = 1
         ORDER BY location_row_id, function_row_id, source_code_id, tran_date DESC
     ) AS a
     GROUP BY a.source_code_id
 )
 select
     '2018-03-01' AS op_date, '2018-03-31' AS cl_date,
     cl_bal.source_code_id, op_bal.total AS op_tot, cl_bal.total AS cl_tot
 FROM totals as cl_bal
     LEFT JOIN totals as op_bal ON op_bal.source_code_id = cl_bal.source_code_id;


There is a difference.

cl_bal selects WHERE tran_date <= '2018-03-31'.

op_bal selects WHERE tran_date < '2018-03-01'.

The second one could be written as WHERE tran_date <= '2018-02-28', but 
I don't think that would make any difference.


Frank






Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman


On 2021-03-19 10:56 AM, Pavel Stehule wrote:



On Fri, 19 Mar 2021 at 9:53, Frank Millman <fr...@chagford.com> wrote:



On 2021-03-19 10:29 AM, Thomas Kellerer wrote:

[...]

Thanks, Thomas

I tried that, and it ran about 10% faster. Every little helps, but SQL 
Server appears to have some secret sauce!


can you send a result of EXPLAIN ANALYZE?

QUERY PLAN

 Nested Loop Left Join  (cost=5.66..5.74 rows=1 width=132) (actual time=0.213..0.248 rows=5 loops=1)
   Join Filter: (a_1.source_code_id = a.source_code_id)
   Rows Removed by Join Filter: 4
   ->  GroupAggregate  (cost=3.65..3.67 rows=1 width=36) (actual time=0.144..0.157 rows=5 loops=1)
         Group Key: a.source_code_id
         ->  Sort  (cost=3.65..3.65 rows=1 width=10) (actual time=0.131..0.135 rows=29 loops=1)
               Sort Key: a.source_code_id
               Sort Method: quicksort  Memory: 26kB
               ->  Subquery Scan on a  (cost=2.36..3.64 rows=1 width=10) (actual time=0.063..0.116 rows=29 loops=1)
                     Filter: (a.row_num = 1)
                     Rows Removed by Filter: 3
                     ->  WindowAgg  (cost=2.36..3.24 rows=32 width=34) (actual time=0.062..0.107 rows=32 loops=1)
                           ->  Sort  (cost=2.36..2.44 rows=32 width=26) (actual time=0.054..0.059 rows=32 loops=1)
                                 Sort Key: ar_totals.location_row_id, ar_totals.function_row_id, ar_totals.source_code_id, ar_totals.tran_date DESC
                                 Sort Method: quicksort  Memory: 27kB
                                 ->  Seq Scan on ar_totals  (cost=0.00..1.56 rows=32 width=26) (actual time=0.014..0.028 rows=32 loops=1)
                                       Filter: ((tran_date <= '2018-03-31'::date) AND (deleted_id = 0) AND (ledger_row_id = 1))
   ->  GroupAggregate  (cost=2.01..2.03 rows=1 width=36) (actual time=0.017..0.017 rows=1 loops=5)
         Group Key: a_1.source_code_id
         ->  Sort  (cost=2.01..2.02 rows=1 width=10) (actual time=0.012..0.013 rows=8 loops=5)
               Sort Key: a_1.source_code_id
               Sort Method: quicksort  Memory: 25kB
               ->  Subquery Scan on a_1  (cost=1.68..2.00 rows=1 width=10) (actual time=0.032..0.047 rows=8 loops=1)
                     Filter: (a_1.row_num = 1)
                     ->  WindowAgg  (cost=1.68..1.90 rows=8 width=34) (actual time=0.031..0.043 rows=8 loops=1)
                           ->  Sort  (cost=1.68..1.70 rows=8 width=26) (actual time=0.023..0.024 rows=8 loops=1)
                                 Sort Key: ar_totals_1.location_row_id, ar_totals_1.function_row_id, ar_totals_1.source_code_id, ar_totals_1.tran_date DESC
                                 Sort Method: quicksort  Memory: 25kB
                                 ->  Seq Scan on ar_totals ar_totals_1  (cost=0.00..1.56 rows=8 width=26) (actual time=0.006..0.013 rows=8 loops=1)
                                       Filter: ((tran_date < '2018-03-01'::date) AND (deleted_id = 0) AND (ledger_row_id = 1))
                                       Rows Removed by Filter: 24
 Planning Time: 0.479 ms
 Execution Time: 0.344 ms
(33 rows)





Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman


On 2021-03-19 12:00 PM, Pavel Stehule wrote:


In this query the slowest operation is query planning. You are testing 
on almost empty tables, which has no practical value. You should test 
queries on tables with a size similar to production size.



Sorry about that. I hope this one is better. Same query, different data set.

QUERY PLAN

 Merge Left Join  (cost=1401.00..1401.12 rows=1 width=132) (actual time=3.595..3.611 rows=5 loops=1)
   Merge Cond: (a.source_code_id = a_1.source_code_id)
   ->  GroupAggregate  (cost=673.16..673.18 rows=1 width=36) (actual time=1.101..1.108 rows=5 loops=1)
         Group Key: a.source_code_id
         ->  Sort  (cost=673.16..673.16 rows=1 width=12) (actual time=1.092..1.093 rows=5 loops=1)
               Sort Key: a.source_code_id
               Sort Method: quicksort  Memory: 25kB
               ->  Subquery Scan on a  (cost=670.67..673.15 rows=1 width=12) (actual time=1.008..1.086 rows=5 loops=1)
                     Filter: (a.row_num = 1)
                     Rows Removed by Filter: 59
                     ->  WindowAgg  (cost=670.67..672.37 rows=62 width=36) (actual time=1.006..1.076 rows=64 loops=1)
                           ->  Sort  (cost=670.67..670.82 rows=62 width=28) (actual time=0.996..1.004 rows=64 loops=1)
                                 Sort Key: ar_totals.location_row_id, ar_totals.function_row_id, ar_totals.source_code_id, ar_totals.tran_date DESC
                                 Sort Method: quicksort  Memory: 30kB
                                 ->  Seq Scan on ar_totals  (cost=0.00..668.82 rows=62 width=28) (actual time=0.012..0.933 rows=64 loops=1)
                                       Filter: ((tran_date <= '2015-04-30'::date) AND (deleted_id = 0) AND (ledger_row_id = 1))
                                       Rows Removed by Filter: 840
   ->  GroupAggregate  (cost=727.85..727.89 rows=2 width=36) (actual time=2.490..2.495 rows=5 loops=1)
         Group Key: a_1.source_code_id
         ->  Sort  (cost=727.85..727.85 rows=3 width=12) (actual time=2.485..2.485 rows=5 loops=1)
               Sort Key: a_1.source_code_id
               Sort Method: quicksort  Memory: 25kB
               ->  Subquery Scan on a_1  (cost=700.70..727.82 rows=3 width=12) (actual time=1.684..2.479 rows=5 loops=1)
                     Filter: (a_1.row_num = 1)
                     Rows Removed by Filter: 674
                     ->  WindowAgg  (cost=700.70..719.35 rows=678 width=36) (actual time=1.682..2.397 rows=679 loops=1)
                           ->  Sort  (cost=700.70..702.40 rows=678 width=28) (actual time=1.676..1.758 rows=679 loops=1)
                                 Sort Key: ar_totals_1.location_row_id, ar_totals_1.function_row_id, ar_totals_1.source_code_id, ar_totals_1.tran_date DESC
                                 Sort Method: quicksort  Memory: 78kB
                                 ->  Seq Scan on ar_totals ar_totals_1  (cost=0.00..668.82 rows=678 width=28) (actual time=0.007..0.836 rows=679 loops=1)
                                       Filter: ((tran_date < '2015-09-01'::date) AND (deleted_id = 0) AND (ledger_row_id = 1))
                                       Rows Removed by Filter: 225
 Planning Time: 0.496 ms
 Execution Time: 3.695 ms
(34 rows)




Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman



On 2021-03-19 12:58 PM, Frank Millman wrote:

[...]



@Pavel & depesz

Thanks for the replies. I am now focusing on the index. I tried dropping 
the index 'ar_tots_cover', and then adding back the index columns one at 
a time. Adding 'tran_date desc' made a small difference. Adding 
'tran_day' and 'tran_tot' made a big difference. This changed the index 
into a 'covering' index, and this is reflected in the new EXPLAIN 
ANALYSE (see below).
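
The original definition of 'ar_tots_cover' is not quoted in this thread, 
so as a sketch only, a covering index of the kind described might look 
like this (the column order is an assumption) -

    CREATE INDEX ar_tots_cover ON prop.ar_totals
        (ledger_row_id, location_row_id, function_row_id,
         source_code_id, tran_date DESC, tran_day, tran_tot);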


Execution of my main query has improved from 50ms to 33ms. Sql Server 
takes 25ms, but this is much better than it was.


However, the bizarre thing is that I have simply restored the index to 
what it was in the first place. If you look at the table definition in 
my original message you can see that all the columns were included in 
the index. But the query did not use it as a covering index. Now the 
EXPLAIN ANALYSE clearly shows 'Index Only Scan using ar_tots_cover'. I 
have no idea what changed.


Here is the new EXPLAIN ANALYSE -

QUERY PLAN

 Merge Left Join  (cost=161.39..161.51 rows=1 width=132) (actual time=1.566..1.581 rows=5 loops=1)
   Merge Cond: (a.source_code_id = a_1.source_code_id)
   ->  GroupAggregate  (cost=50.27..50.29 rows=1 width=36) (actual time=0.226..0.232 rows=5 loops=1)
         Group Key: a.source_code_id
         ->  Sort  (cost=50.27..50.28 rows=1 width=12) (actual time=0.217..0.218 rows=5 loops=1)
               Sort Key: a.source_code_id
               Sort Method: quicksort  Memory: 25kB
               ->  Subquery Scan on a  (cost=47.78..50.26 rows=1 width=

Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman



On 2021-03-19 4:38 PM, Tom Lane wrote:

Frank Millman  writes:

However, the bizarre thing is that I have simply restored the index to
what it was in the first place. If you look at the table definition in
my original message you can see that all the columns were included in
the index. But the query did not use it as a covering index. Now the
EXPLAIN ANALYSE clearly shows 'Index Only Scan using ar_tots_cover'. I
have no idea what changed.

VACUUM, maybe?  Even if there's a covering index, the planner is not
likely to prefer an index-only scan unless it thinks that most of the
table's pages are known all-visible.  If they're not, most of the
rows will require heap probes anyway to check row visibility, meaning
that the "index-only" scan's performance degrades to about that of a
regular indexscan.

In this example, since you're fetching such a large fraction of the
table (which the planner is accurately estimating), there's not a lot
of daylight between the estimated costs of seqscan and index-only
scan to begin with.  I'm not surprised that it'd prefer the former
if the table isn't recently vacuumed.
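
A quick way to check this is to vacuum and then compare relallvisible to 
relpages (a sketch; a value of relallvisible close to relpages means 
most pages are all-visible) -

    VACUUM prop.ar_totals;

    SELECT relname, relpages, relallvisible
    FROM pg_class
    WHERE relname = 'ar_totals';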


It is possible. I know that I *did* vacuum. But I also ran a program to 
generate a few hundred additional rows, and I cannot remember if I ran 
the vacuum before or after that.


Frank






Re: SELECT is faster on SQL Server

2021-03-19 Thread Frank Millman



On 2021-03-19 5:32 PM, Jehan-Guillaume de Rorthais wrote:

On Fri, 19 Mar 2021 14:28:27 +0200
Frank Millman  wrote:


[...]
Execution of my main query has improved from 50ms to 33ms. Sql Server
takes 25ms, but this is much better than it was.

[...]

Here is the new EXPLAIN ANALYSE -

QUERY PLAN

   Merge Left Join  (...) (actual time=1.566..1.581 rows=5 loops=1)

1.581ms to output the very last row of this plan. This is in contradiction with
the 33ms you are referencing above.

What do I miss here? Maybe your 33ms comes yet from another set of data? Could
you share an explain analyze actually showing this 33ms total execution time?


Sorry, I should have explained.

The query I showed selects data for a single month. The 'real' query 
repeats this 12 times, each with different dates, and combines the 
results using UNION ALL. This was the timing mentioned above.


BTW, I know that I can improve this by setting up the dates in a CTE and 
using JOIN LATERAL. I am avoiding this as it is not supported by SQL 
Server or sqlite3, and I am trying to stick to one code base for all 
databases. But I will look into it further.
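
As a sketch of that idea (the month list and the aggregate are 
simplified assumptions, not the real query) -

    WITH months AS (
        SELECT d::date AS op_date,
               (d + interval '1 month' - interval '1 day')::date AS cl_date
        FROM generate_series('2018-01-01'::date, '2018-12-01'::date,
                             interval '1 month') AS g(d)
    )
    SELECT m.op_date, m.cl_date, t.source_code_id, t.cl_tot
    FROM months m
    JOIN LATERAL (
        SELECT a.source_code_id, SUM(a.tran_tot) AS cl_tot
        FROM prop.ar_totals a
        WHERE a.deleted_id = 0
          AND a.ledger_row_id = 1
          AND a.tran_date <= m.cl_date
        GROUP BY a.source_code_id
    ) t ON true;

Each row of the months CTE drives one execution of the LATERAL 
subquery, replacing the twelve UNION ALL branches.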


Frank






Re: SELECT is faster on SQL Server

2021-03-20 Thread Frank Millman



On 2021-03-19 7:11 PM, Thomas Kellerer wrote:

Frank Millman wrote on 19.03.2021 at 10:16:


cl_bal selects WHERE tran_date <= '2018-03-31'.

op_bal selects WHERE tran_date < '2018-03-01'.

The second one could be written as WHERE tran_date <= '2018-02-28', 
but I don't think that would make any difference.


I knew I overlooked something ;)

But as one is a true subset of the other, I think you can merge that 
into a single SELECT statement:


    select '2018-03-01' AS op_date,
           '2018-03-31' AS cl_date,
           a.source_code_id,
           sum(a.tran_tot) AS cl_tot,
           sum(a.tran_tot) filter (where tran_date < '2018-03-01') AS op_tot
    FROM (
        SELECT distinct on (location_row_id, function_row_id, source_code_id)
            source_code_id, tran_tot, tran_date
        FROM prop.ar_totals
        WHERE deleted_id = 0
          AND tran_date <= '2018-03-31'
          AND ledger_row_id = 1
        ORDER BY location_row_id, function_row_id, source_code_id, tran_date DESC
    ) AS a
    GROUP BY a.source_code_id


Thanks very much Thomas - I did not know about FILTER.

But it does not quite work. If the SELECT does find a row where the max 
tran_date is <= '2018-03-31' it correctly includes it in 'cl_tot'. But 
the filter returns nothing for 'op_tot' because there is no 
corresponding row where tran_date < '2018-03-01'.


But I have learned something new, so thanks for that.

Frank