Why is there a tenfold difference between Postgres's alleged query execution time and packet transmission time?

2021-04-15 Thread Rollo Konig-Brock
Hey all,

I've been pulling my hair out over this for days now, as I'm trying to
build a low-latency application. Databases should be fast, but I cannot
work out why so much latency is added between the actual database process
and the application code. For simple queries that should take less than a
millisecond, this mystery latency is by far the biggest performance hit.

For example, I have a very simple table on a local Postgres database, which
takes 324us to query. This is absolutely fine.

postgres=# EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
 select count("ID") from example t;
-[ RECORD 1 ]--
QUERY PLAN | Aggregate (actual rows=1 loops=1)
-[ RECORD 2 ]--
QUERY PLAN |   ->  Seq Scan on example t (actual rows=4119 loops=1)
-[ RECORD 3 ]--
QUERY PLAN | Planning Time: 0.051 ms
-[ RECORD 4 ]--
QUERY PLAN | Execution Time: 0.324 ms

But then if I want to connect to the database in Python, getting the actual
data takes over 2500us! The loopback interface is not the problem here; I
sanity-checked that (it adds 100us at a stretch).

In [1]: %time cursor.execute("select count(\"ID\") from example;");r = cursor.fetchmany()
CPU times: user 316 µs, sys: 1.12 ms, total: 1.44 ms
Wall time: 2.73 ms
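
The same gap is visible without Python in the loop: EXPLAIN ANALYZE's
"Execution Time" covers only the executor run on the server, while psql's
\timing measures the full client-observed round trip (planning, protocol
handling and result serialisation included). As a minimal cross-check
against the same table, the "Time:" line psql prints should land much
closer to the ~2.7 ms wall time above than to the 0.324 ms execution time:

postgres=# \timing on
Timing is on.
postgres=# select count("ID") from example;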

Investigating further, I opened up WireShark to see how long the packets
themselves take.

[image: Wireshark screenshot] 

From this I can tell that Postgres took 2594us to send the data. This kind
of overhead for a loopback IPC call is way beyond what I would expect. Why
is this? And how can I optimise it away?

This is a problem for the software ecosystem, because people start thinking
databases are too slow to do simple lookups, and they start caching queries
in Redis that take 300us to execute, but 3ms to actually fetch back to the
client.  It adds up to an awful lot of latency.


As an aside, I've been really miffed by how bad database IPC performance
is in general.  Oracle has a problem where each additional ~1500 bytes
makes receiving the result roughly 800us slower (on a gigabit network).


I'd really love to know what's happening here.


Rollo


Strange behavior once statistics are there

2021-04-15 Thread Daniel Westermann (DWE)
Hi,

I am currently seeing strange behavior once statistics have been collected. This is the 
statement (I don't know the application; the statement is as it is):

explain (analyze, buffers) select distinct standardzi4_.code   as col_0_0_,
person1_.personalnummer as col_1_0_,
person1_.funktion   as col_2_0_,
person1_.vorgesetzter   as col_3_0_,
person1_.ga_nr  as col_5_0_,
person1_.gueltig_ab as col_6_0_
from pia_01.pesz_person_standardziel personstan0_
     left outer join pia_01.pes_person_zielvergabe personziel3_ on personstan0_.pes_id = personziel3_.id
     left outer join pia_01.stz_standardziel standardzi4_ on personstan0_.stz_id = standardzi4_.id
     cross join pia_01.per_person person1_
     cross join pia_01.pess_person_stufe personstan2_
     cross join pia_01.zid_zieldefinition zieldefini5_
     cross join pia_01.stuv_stufe_vorgabe stufevorga8_
where personziel3_.zid_id = zieldefini5_.id
  and personstan2_.stuv_id = stufevorga8_.id
  and zieldefini5_.jahr=2021
  and (person1_.id in
   (select persons6_.per_id from pia_01.pesr_zielvergabe_person persons6_ where personziel3_.id = persons6_.pes_id))
  and (personstan0_.pess_id is null)
  and (personstan2_.id in
   (select stufen7_.id from pia_01.pess_person_stufe stufen7_ where personstan0_.id = stufen7_.pesz_id))
  and stufevorga8_.default_prozent_sollwert = 100
  and personziel3_.finale_vis_status = 'ANGENOMMEN';

Without any statistics the query runs quite fine (this is after a fresh import 
of a dump, no statistics yet):



                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1549875.41..1550295.03 rows=23978 width=32) (actual time=14990.816..14990.842 rows=80 loops=1)
   Buffers: shared hit=24422449
   ->  Sort  (cost=1549875.41..1549935.36 rows=23978 width=32) (actual time=14990.815..14990.824 rows=80 loops=1)
         Sort Key: person1_.personalnummer, standardzi4_.code, person1_.funktion, person1_.vorgesetzter, person1_.ga_nr, person1_.gueltig_ab
         Sort Method: quicksort  Memory: 31kB
         Buffers: shared hit=24422449
         ->  Nested Loop  (cost=28.90..1548131.08 rows=23978 width=32) (actual time=13496.302..14990.767 rows=80 loops=1)
               Join Filter: (SubPlan 1)
               Rows Removed by Join Filter: 859840
               Buffers: shared hit=24422449
               ->  Seq Scan on per_person person1_  (cost=0.00..250.49 rows=10749 width=38) (actual time=0.025..1.119 rows=10749 loops=1)
                     Buffers: shared hit=143
               ->  Materialize  (cost=28.90..678088.61 rows=4 width=10) (actual time=0.019..1.252 rows=80 loops=10749)
                     Buffers: shared hit=21842546
                     ->  Nested Loop Left Join  (cost=28.90..678088.59 rows=4 width=10) (actual time=198.423..13419.226 rows=80 loops=1)
                           Buffers: shared hit=21842546
                           ->  Nested Loop  (cost=28.76..678087.75 rows=4 width=16) (actual time=198.414..13418.767 rows=80 loops=1)
                                 Join Filter: (SubPlan 2)
                                 Rows Removed by Join Filter: 5143360
                                 Buffers: shared hit=21842386
                                 ->  Nested Loop  (cost=9.43..5319.71 rows=1 width=24) (actual time=69.778..90.956 rows=80 loops=1)
                                       Join Filter: (personziel3_.id = personstan0_.pes_id)
                                       Rows Removed by Join Filter: 435696
                                       Buffers: shared hit=18053
                                       ->  Index Scan using dwe30 on pesz_person_standardziel personstan0_  (cost=0.29..5300.69 rows=321 width=24) (actual time=0.037..25.543 rows=54472 loops=1)
                                             Filter: (pess_id IS NULL)
                                             Rows Removed by Filter: 9821
                                             Buffers: shared hit=18031
                                       ->  Materialize  (cost=9.14..14.22 rows=1 width=8) (actual time=0.000..0.000 rows=8 loops=54472)
                                             Buffers: shared hit=22
                                             ->  Nested Loop  (cost=9.14..14.21 rows=1 width=8) (actual time=0.534..0.564 rows=8 loops=1)
                                                   Buffers: shared hit=22
                                                   ->  Seq Scan on zid_zieldefinition zieldefini5_  (cost=0.00..1.05 rows=1 width=8) (actual time=0.012..0.01

Is there a way to change current time?

2021-04-15 Thread Warstone

Hi,
 
Is there any way to set the time that CURRENT_TIMESTAMP and/or now() will return 
next? (We need it only for testing purposes, so any hack, cheat, etc. would be 
fine.)
 
 
 
 

Re: Is there a way to change current time?

2021-04-15 Thread Bruce Momjian
On Thu, Apr 15, 2021 at 04:45:44PM +0300, [email protected] wrote:
> Hi,
>  
> Is there any way to set time that CURRENT_TIMESTAMP and/or now() will give 
> next
> time? (We need it only for testing purposes so if there is any hack, cheat,
> etc. It will be fine)

No, it gets the time from the operating system.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Is there a way to change current time?

2021-04-15 Thread Justin Pryzby
On Thu, Apr 15, 2021 at 09:58:23AM -0400, Bruce Momjian wrote:
> On Thu, Apr 15, 2021 at 04:45:44PM +0300, [email protected] wrote:
> > Hi,
> >  
> > Is there any way to set time that CURRENT_TIMESTAMP and/or now() will give 
> > next
> > time? (We need it only for testing purposes so if there is any hack, cheat,
> > etc. It will be fine)
> 
> No, it gets the time from the operating system.

You could overload now():

postgres=# CREATE DATABASE pryzbyj;
postgres=# \c pryzbyj
pryzbyj=# CREATE SCHEMA pryzbyj;
pryzbyj=# CREATE FUNCTION pryzbyj.now() RETURNS timestamp LANGUAGE SQL AS $$ 
SELECT 'today'::timestamp $$;
pryzbyj=# ALTER ROLE pryzbyj SET search_path=pryzbyj,public,pg_catalog;
pryzbyj=# SELECT now();
now | 2021-04-15 00:00:00
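
The trick here is just search_path resolution: with pryzbyj first on the
path, an unqualified now() resolves to pryzbyj.now() instead of
pg_catalog.now(). Anything that schema-qualifies the call still sees the
real clock:

pryzbyj=# SELECT pg_catalog.now();  -- still returns the actual current time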

-- 
Justin




Re: Is there a way to change current time?

2021-04-15 Thread Tom Lane
Bruce Momjian  writes:
> On Thu, Apr 15, 2021 at 04:45:44PM +0300, [email protected] wrote:
>> Is there any way to set time that CURRENT_TIMESTAMP and/or now() will give 
>> next
>> time? (We need it only for testing purposes so if there is any hack, cheat,
>> etc. It will be fine)

> No, it gets the time from the operating system.

I think there are OS-level solutions for this on some operating systems.
If nothing else, you could manually set the system clock; but what you'd
really want is for the phony time to be visible to just some processes.

regards, tom lane




Re: Is there a way to change current time?

2021-04-15 Thread Pavel Stehule
Hi

Thu, Apr 15, 2021 at 15:45, [email protected] wrote:

> Hi,
>
> Is there any way to set time that CURRENT_TIMESTAMP and/or now() will give
> next time? (We need it only for testing purposes so if there is any hack,
> cheat, etc. It will be fine)
>

This is a bad approach - don't use now() directly in queries; use a variable
instead. Later you can pass now() as the argument, or some constant value.
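
One minimal sketch of that idea (the function name app_now() and the
app.fake_now setting are made up here for illustration; they are not
anything built in):

CREATE FUNCTION app_now() RETURNS timestamptz LANGUAGE sql AS $$
  -- use the pinned test time if the session set one, otherwise the real clock
  SELECT coalesce(nullif(current_setting('app.fake_now', true), '')::timestamptz,
                  now())
$$;

-- in a test session:
SET app.fake_now = '2021-04-15 12:00:00+02';
SELECT app_now();      -- returns the pinned timestamp
RESET app.fake_now;    -- back to the real now()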

Regards

Pavel


>
>
>
>
>


Re: Is there a way to change current time?

2021-04-15 Thread Eugene Pazhitnov
select pg_sleep(1);

Thu, Apr 15, 2021 at 16:45, [email protected]:

> Hi,
>
> Is there any way to set time that CURRENT_TIMESTAMP and/or now() will give
> next time? (We need it only for testing purposes so if there is any hack,
> cheat, etc. It will be fine)
>
>
>
>
>


-- 
Evgeny Pazhitnov


Re: Strange behavior once statistics are there

2021-04-15 Thread Tom Lane
"Daniel Westermann (DWE)"  writes:
> I currently have a strange behavior once statistics are collected. This is 
> the statement (I don't know the application, the statement is as it is):

I think your problem is with the subplan conditions, ie

>   and (person1_.id in
>(select persons6_.per_id from pia_01.pesr_zielvergabe_person persons6_ 
> where personziel3_.id = persons6_.pes_id))
> ...
>   and (personstan2_.id in
>(select stufen7_.id from pia_01.pess_person_stufe stufen7_ where 
> personstan0_.id = stufen7_.pesz_id))

These essentially create weird join conditions (between person1_ and
personziel3_ in the first case or personstan2_ and personstan0_ in
the second case) that the planner has no idea how to estimate.
It just throws up its hands and uses a default selectivity of 0.5,
which is nowhere near reality in this case.

You accidentally got an acceptable (not great) plan anyway without
statistics, but not so much with statistics.  Worse yet, the subplans
have to be implemented essentially as nestloop joins.

I'd suggest trying to flatten these to be regular joins, ie
try to bring up persons6_ and stufen7_ into the main JOIN nest.
It looks like persons6_.pes_id might be unique, meaning that you
don't really need the IN behavior in the first case so flattening
it should be straightforward.  The other one is visibly not unique,
but since you're using "select distinct" at the top level anyway,
getting duplicate rows might not be a problem (unless there are
a lot of duplicates?)
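
A rough sketch of that rewrite (illustrative only; it assumes the WHERE
conditions on personziel3_ already force that outer join to behave as an
inner join, and that any duplicates introduced by the flattening are
absorbed by the SELECT DISTINCT, per the caveats above):

select distinct standardzi4_.code, person1_.personalnummer, person1_.funktion,
                person1_.vorgesetzter, person1_.ga_nr, person1_.gueltig_ab
from pia_01.pesz_person_standardziel personstan0_
     join pia_01.pes_person_zielvergabe personziel3_ on personstan0_.pes_id = personziel3_.id
     left outer join pia_01.stz_standardziel standardzi4_ on personstan0_.stz_id = standardzi4_.id
     join pia_01.pesr_zielvergabe_person persons6_ on persons6_.pes_id = personziel3_.id
     join pia_01.per_person person1_ on person1_.id = persons6_.per_id
     join pia_01.pess_person_stufe personstan2_ on personstan2_.pesz_id = personstan0_.id
     join pia_01.zid_zieldefinition zieldefini5_ on personziel3_.zid_id = zieldefini5_.id
     join pia_01.stuv_stufe_vorgabe stufevorga8_ on personstan2_.stuv_id = stufevorga8_.id
where zieldefini5_.jahr = 2021
  and personstan0_.pess_id is null
  and stufevorga8_.default_prozent_sollwert = 100
  and personziel3_.finale_vis_status = 'ANGENOMMEN';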

regards, tom lane




Re: Disabling options lowers the estimated cost of a query

2021-04-15 Thread Arne Roland
The startup cost is pretty expensive. This seems to be a common issue when 
using partition-wise joins.


I attached a simplified reproducer. Thanks for having a look!


Regards

Arne


From: Tom Lane 
Sent: Friday, February 26, 2021 4:00:18 AM
To: Arne Roland
Cc: [email protected]
Subject: Re: Disabling options lowers the estimated cost of a query

Arne Roland  writes:
> I want to examine the exhaustive search and not the geqo here. I'd expect the 
> exhaustive search to give the plan with the lowest cost, but apparently it 
> doesn't. I have found a few dozen different queries where that isn't the case. 
> I attached one straightforward example. For the join of two partitions a row 
> first approach would have been reasonable.

Hmm.  While the search should be exhaustive, there are pretty aggressive
pruning heuristics (mostly in and around add_path()) that can cause us to
drop paths that don't seem to be enough better than other alternatives.
I suspect that the seqscan plan may have beaten out the other one at
some earlier stage that didn't think that the startup-cost advantage
was sufficient reason to keep it.

It's also possible that you've found a bug.  I notice that both
plans are using incremental sort, which has been, um, rather buggy.
Hard to tell without a concrete test case to poke at.

regards, tom lane





optimizer_first_rows.sql
Description: optimizer_first_rows.sql


Re: Disabling options lowers the estimated cost of a query

2021-04-15 Thread Tomas Vondra

On 2/26/21 4:00 AM, Tom Lane wrote:
> Arne Roland  writes:
>> I want to examine the exhaustive search and not the geqo here. I'd
>> expect the exhaustive search to give the plan with the lowest cost,
>> but apparently it doesn't. I have found a few dozen different
>> queries where that isn't the case. I attached one straightforward
>> example. For the join of two partitions a row first approach would
>> have been reasonable.
> 
> Hmm.  While the search should be exhaustive, there are pretty
> aggressive pruning heuristics (mostly in and around add_path()) that
> can cause us to drop paths that don't seem to be enough better than
> other alternatives. I suspect that the seqscan plan may have beaten
> out the other one at some earlier stage that didn't think that the
> startup-cost advantage was sufficient reason to keep it.
> 
> It's also possible that you've found a bug.  I notice that both plans
> are using incremental sort, which has been, um, rather buggy. Hard to
> tell without a concrete test case to poke at.
> 

Well, it's true incremental sort was not exactly bug-free. But I very
much doubt it's causing this issue, for a few reasons:

(a) It's trivial to simplify the reproducer further, so that there are
no incremental sort nodes. See the attached script, which has just a
single partition.

(b) The incremental sort patch does not really tweak the costing or
add_path in ways that would break this.

(c) PostgreSQL 12 has the same issue.


It seems the whole problem is in generate_orderedappend_paths(), which
simply considers two cases - paths with minimal startup cost and paths
with minimal total cost. But with LIMIT that does not work, of course.

With the simplified reproducer, I get these two plans:


   QUERY PLAN
  ---
   Limit  (cost=9748.11..10228.11 rows=1 width=8)
 ->  Merge Left Join (cost=9748.11..14548.11 rows=10 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan Backward using a0_pkey on a0 a
   (cost=0.29..3050.29 rows=10 width=8)
   ->  Sort  (cost=9747.82..9997.82 rows=10 width=8)
 Sort Key: b.id DESC
 ->  Seq Scan on b0 b
 (cost=0.00..1443.00 rows=10 width=8)
  (7 rows)

   QUERY PLAN
  ---
   Limit  (cost=0.58..3793.16 rows=1 width=8)
 ->  Nested Loop Left Join (cost=0.58..37926.29 rows=10 ...)
   ->  Index Only Scan Backward using a0_pkey on a0 a
   (cost=0.29..3050.29 rows=10 width=8)
   ->  Index Only Scan using b0_pkey on b0 b
   (cost=0.29..0.34 rows=1 width=8)
 Index Cond: (id = a.id)
  (5 rows)


The reason is quite simple - we get multiple join paths for each child
(not visible in the plans, because there's just a single partition),
with these costs:

A: nestloop_path   startup 0.585000total 35708.292500
B: nestloop_path   startup 0.292500total 150004297.292500
C: mergejoin_path  startup 9748.112737 total 14102.092737

The one we'd like is the nestloop (A), and with disabled partition-wise
join that's what we pick. But generate_orderedappend_paths calls
get_cheapest_path_for_pathkeys for startup/total cost, and gets the two
other paths. Clearly, nestloop (B) is pretty terrible for LIMIT, because
of the high total cost, and mergejoin (C) is what we end up with.
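
For illustration, the usual rule the planner uses to cost a LIMIT that
fetches only a fraction f of a path's estimated rows is roughly

    startup + f * (total - startup)

so for small f the nestloop (A) stays cheap (about 0.6 + 35708*f), the
nestloop (B) explodes because of its ~150M total cost, and the mergejoin
(C) has already paid the ~9748 sort startup before the first row comes
out. That is why (A) is the path we would want kept, even though it is
neither the cheapest-startup nor the cheapest-total path.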

Not sure how to fix this without making generate_orderedappend_paths way
more complicated ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


optimizer_first_rows_simple.sql
Description: application/sql


Re: Strange behavior once statistics are there

2021-04-15 Thread Daniel Westermann (DWE)
From: Tom Lane 
Sent: Thursday, April 15, 2021 17:00
To: Daniel Westermann (DWE) 
Cc: [email protected] 

Subject: Re: Strange behavior once statistics are there 
 
>I'd suggest trying to flatten these to be regular joins, ie
>try to bring up persons6_ and stufen7_ into the main JOIN nest.
>It looks like persons6_.pes_id might be unique, meaning that you
>don't really need the IN behavior in the first case so flattening
>it should be straightforward.  The other one is visibly not unique,
>but since you're using "select distinct" at the top level anyway,
>getting duplicate rows might not be a problem (unless there are
>a lot of duplicates?)

Thank you, Tom